Poster
SPEX: Scaling Feature Interaction Explanations for LLMs
Justin S. Kang · Landon Butler · Abhineet Agarwal · Yigit Efe Erginbas · Ramtin Pedarsani · Bin Yu · Kannan Ramchandran
West Exhibition Hall B2-B3 #W-1102
Large language models (LLMs) excel by capturing complex interactions between input features (such as words). However, explaining which combinations of features drive an LLM's decision is difficult, especially for long inputs, because existing methods scale poorly to this setting. SPEX is a new algorithm that efficiently identifies these crucial feature interactions even for large inputs (around 1,000 features). It leverages the observation that LLM outputs are often driven by a small, sparse set of interactions. SPEX combines a sparse Fourier transform with a channel-decoding algorithm to pinpoint these key interactions without exhaustive search. Experiments show that SPEX reconstructs LLM outputs up to 20% more faithfully than methods that ignore interactions. It identifies influential features and their combinations, and on the HotpotQA dataset its findings align with human annotations. SPEX can also explain reasoning in advanced models such as GPT-4o mini and in vision-language models.
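The core idea above can be illustrated on a toy scale. The sketch below is not the SPEX algorithm itself: it uses a hypothetical `value` function standing in for an LLM's scalar output on masked inputs, and recovers interaction coefficients by an exhaustive Möbius (interaction) transform over all 2^n feature subsets. SPEX's contribution is precisely avoiding this exhaustive enumeration via sparse Fourier sampling and channel decoding; the toy only shows that when the output is driven by a few sparse interactions, the transform is sparse and identifies them.

```python
from itertools import product

n = 4  # toy input with 4 "features"; SPEX targets ~1000 via sparse sampling


def value(mask):
    """Hypothetical surrogate for an LLM's scalar output on a masked input.

    A few sparse interactions drive the value, matching SPEX's assumption.
    """
    v = 0.5 * mask[0]               # main effect of feature 0
    v += 0.8 * mask[1] * mask[2]    # pairwise interaction (1, 2)
    v -= 0.3 * mask[0] * mask[3]    # pairwise interaction (0, 3)
    return v


def moebius(f, n):
    """Exhaustive Möbius transform: the coefficient of subset S is
    sum over T ⊆ S of (-1)^{|S|-|T|} f(T).

    This enumerates all 2^n masks for illustration only; SPEX avoids
    this cost with sparse Fourier sampling plus channel decoding.
    """
    coeffs = {}
    masks = list(product([0, 1], repeat=n))
    for S in masks:
        total = 0.0
        for T in masks:
            if all(t <= s for t, s in zip(T, S)):
                sign = (-1) ** (sum(S) - sum(T))
                total += sign * f(T)
        if abs(total) > 1e-9:
            coeffs[tuple(i for i, s in enumerate(S) if s)] = total
    return coeffs


# The nonzero coefficients are exactly the three planted interactions.
print(moebius(value, n))
```

Because only three of the 2^4 coefficients are nonzero, a sparse recovery method needs far fewer queries to the value function than exhaustive enumeration, which is what makes the approach feasible at ~1,000 features.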