Poster
SPEX: Scaling Feature Interaction Explanations for LLMs
Justin S. Kang · Landon Butler · Abhineet Agarwal · Yigit Efe Erginbas · Ramtin Pedarsani · Bin Yu · Kannan Ramchandran
West Exhibition Hall B2-B3 #W-1102
Large language models (LLMs) excel by capturing complex interactions between input features (such as words). However, explaining which combinations of features drive an LLM's decision is difficult, especially for long inputs, because existing methods scale poorly to this setting. SPEX is a new algorithm that efficiently identifies these crucial feature interactions even for large inputs (around 1,000 features). It leverages the observation that LLM outputs are often driven by a small, sparse set of interactions. SPEX combines a sparse Fourier transform with a channel-decoding algorithm to pinpoint these key interactions without exhaustive search. Experiments show that SPEX reconstructs LLM outputs up to 20% more faithfully than methods that ignore interactions. It identifies influential features and their combinations, and on the HotpotQA dataset its findings align with human annotations. SPEX can also explain reasoning in advanced models such as GPT-4o mini and in vision-language models.
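The core idea above can be illustrated on a toy scale. The sketch below is not the SPEX algorithm itself: it uses a hypothetical `value` function standing in for an LLM's scalar output on masked inputs, and recovers interaction coefficients by an exhaustive Möbius (interaction) transform over all 2^n feature subsets. SPEX's contribution is precisely avoiding this exhaustive enumeration via sparse Fourier sampling and channel decoding; the toy only shows that when the output is driven by a few sparse interactions, the transform is sparse and identifies them.

```python
from itertools import product

n = 4  # toy input with 4 "features"; SPEX targets ~1000 via sparse sampling


def value(mask):
    """Hypothetical surrogate for an LLM's scalar output on a masked input.

    A few sparse interactions drive the value, matching SPEX's assumption.
    """
    v = 0.5 * mask[0]               # main effect of feature 0
    v += 0.8 * mask[1] * mask[2]    # pairwise interaction (1, 2)
    v -= 0.3 * mask[0] * mask[3]    # pairwise interaction (0, 3)
    return v


def moebius(f, n):
    """Exhaustive Möbius transform: the coefficient of subset S is
    sum over T ⊆ S of (-1)^{|S|-|T|} f(T).

    This enumerates all 2^n masks for illustration only; SPEX avoids
    this cost with sparse Fourier sampling plus channel decoding.
    """
    coeffs = {}
    masks = list(product([0, 1], repeat=n))
    for S in masks:
        total = 0.0
        for T in masks:
            if all(t <= s for t, s in zip(T, S)):
                sign = (-1) ** (sum(S) - sum(T))
                total += sign * f(T)
        if abs(total) > 1e-9:
            coeffs[tuple(i for i, s in enumerate(S) if s)] = total
    return coeffs


# The nonzero coefficients are exactly the three planted interactions.
print(moebius(value, n))
```

Because only three of the 2^4 coefficients are nonzero, a sparse recovery method needs far fewer queries to the value function than exhaustive enumeration, which is what makes the approach feasible at ~1,000 features.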