Poster in Workshop: Actionable Interpretability
Spectral Scaling Laws in Language Models: \emph{How Effectively Do Feed-Forward Networks Use Their Latent Space?}
Nandan Kumar Jha · Brandon Reagen
Abstract:
As Large Language Models (LLMs) continue to scale, understanding how effectively their internal capacity is utilized becomes increasingly important, especially for inference-time efficiency. While existing scaling laws relate model size to loss and compute, they offer little insight into the representational dynamics of individual components. In this work, we focus on the Feed-Forward Network (FFN), a dominant sub-block in decoder-only transformers, and recast FFN width selection as a \emph{spectral utilization} problem. We introduce a lightweight, differentiable diagnostic suite comprising \emph{Hard Rank} (Participation Ratio), \emph{Soft Rank} (spectral entropy), \emph{Spectral Concentration}, and the composite \emph{Spectral Utilization Index} (SUI), designed to quantify how many latent directions are meaningfully activated. Our spectral audit across GPT-2, LLaMA, and nGPT models reveals that spectral utilization grows with model size but not monotonically with FFN width, often peaking at intermediate dimensions (e.g., $D = 2048$). We identify clear instances of \emph{spectral collapse}, in which wider FFNs concentrate variance in a narrow subspace and leave much of the latent space unused.
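To make the diagnostics concrete, the sketch below shows how such spectral-utilization measures can be computed from the singular values of an FFN activation matrix. The Participation Ratio and spectral-entropy (soft rank) formulas are standard; the specific Spectral Concentration and SUI definitions used here (a top-fraction variance share and a normalized geometric mean of the two effective ranks) are illustrative assumptions, not the paper's exact formulas.

```python
# Minimal sketch (with assumptions noted in comments): spectral-utilization
# diagnostics computed from the singular values of an FFN activation matrix
# X of shape (num_tokens, D), where D is the FFN width.
import torch

def spectral_diagnostics(X: torch.Tensor, top_frac: float = 0.1):
    """Return hard rank, soft rank, spectral concentration, and a composite SUI."""
    X = X - X.mean(dim=0, keepdim=True)        # center activations
    s = torch.linalg.svdvals(X)                # singular values (differentiable)
    lam = s ** 2                               # variance along each principal direction
    lam = lam / lam.sum()                      # normalized spectrum p_i

    # Hard Rank: Participation Ratio, (sum p_i)^2 / sum p_i^2 = 1 / sum p_i^2
    hard_rank = 1.0 / (lam ** 2).sum()

    # Soft Rank: exponential of the Shannon entropy of the normalized spectrum
    entropy = -(lam * torch.log(lam + 1e-12)).sum()
    soft_rank = torch.exp(entropy)

    # Spectral Concentration (assumption): variance share captured by the
    # top `top_frac` fraction of latent directions
    k = max(1, int(top_frac * lam.numel()))
    concentration = torch.sort(lam, descending=True).values[:k].sum()

    # Composite SUI (illustrative assumption): geometric mean of the two
    # effective ranks, normalized by the FFN width D so it lies in [0, 1]
    D = lam.numel()
    sui = torch.sqrt((hard_rank / D) * (soft_rank / D))

    return {"hard_rank": hard_rank, "soft_rank": soft_rank,
            "concentration": concentration, "sui": sui}

if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy example: activations that effectively occupy only 64 of 2048 directions
    N, D, r = 4096, 2048, 64
    X = torch.randn(N, r) @ torch.randn(r, D)
    print({k: float(v) for k, v in spectral_diagnostics(X).items()})
```

Because the diagnostics are built from singular values via differentiable operations, they can in principle be monitored, or even regularized, during training rather than only audited post hoc.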