Poster in Workshop: The Impact of Memorization on Trustworthy Foundation Models
Language models’ activations linearly encode training-order recency
Dmitrii Krasheninnikov · Richard E Turner · David Krueger
Language models' activations appear to linearly encode the recency of training-data exposure. Our experimental setup involves sequentially fine-tuning Llama-3.2-1B on two disjoint but otherwise similar datasets of aliased entities, then training linear probes on the fine-tuned model's activations. We find that probes distinguish 'early' from 'late' entities with roughly 90% accuracy, generalizing to entities unseen during the probes' own training. Furthermore, the model can be explicitly fine-tuned to report an unseen entity's training stage (80% accuracy). Similar experiments with sequential fine-tuning on three or six disjoint datasets confirm a linear direction tracking the order of learning. Notably, this robust linear signal does not appear attributable to simple differences in activation magnitudes or output logit statistics. This inherent encoding of the learning sequence illuminates a fundamental mechanism by which models can differentiate information by its acquisition time, with significant implications for how they might form beliefs, manage conflicting data, and respond to targeted knowledge modifications.
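Below is a minimal sketch of the linear-probing setup described in the abstract: extract an activation for each entity from a (sequentially fine-tuned) model, then fit a linear classifier to predict whether the entity came from the 'early' or 'late' fine-tuning stage. The layer index, entity prompts, and probe details here are illustrative assumptions, not the authors' exact configuration, and the base model is loaded in place of their fine-tuned checkpoint.

```python
# Hedged sketch of training a linear "early vs. late" probe on hidden states.
# Assumptions (not from the paper): layer index, prompt wording, tiny toy data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

MODEL_NAME = "meta-llama/Llama-3.2-1B"  # the paper probes a sequentially fine-tuned copy of this model
LAYER = 8                               # which hidden layer to probe (illustrative choice)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def entity_activation(prompt: str) -> torch.Tensor:
    """Return the final-token hidden state at the chosen layer for a prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1]  # shape: (hidden_dim,)

# Hypothetical aliased-entity prompts: label 0 = entity seen in the 'early'
# fine-tuning stage, label 1 = entity seen in the 'late' stage.
early_prompts = ["Zorven Malik works as a", "Qira Thandol lives in"]
late_prompts  = ["Belric Osmann works as a", "Yanthe Corvel lives in"]

X = torch.stack([entity_activation(p) for p in early_prompts + late_prompts]).float().numpy()
y = [0] * len(early_prompts) + [1] * len(late_prompts)

# Fit the probe on some entities and evaluate on held-out ones; the paper
# reports ~90% held-out accuracy with far more entities than this toy split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, stratify=y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out probe accuracy:", probe.score(X_te, y_te))
```

Because the probe is linear, above-chance held-out accuracy is evidence that training-order information is encoded along a linear direction in activation space, which is the claim the abstract makes.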