Poster
in
Workshop: Methods and Opportunities at Small Scale (MOSS)

Decomposed Learning: An Avenue for Mitigating Grokking

Gabryel Mason-Williams · Israel Mason-Williams

Keywords: [ compression ] [ grokking ] [ SVD ] [ linear algebra ] [ optimisation ]


Abstract: Grokking is a delayed transition from memorisation to generalisation in neural networks. It challenges perspectives on efficient learning, particularly in structured tasks and small-data regimes. We explore grokking in modular arithmetic from the perspective of a training pathology. We use Singular Value Decomposition (SVD) to reparameterise each weight matrix of a neural network, replacing the weight matrix $W$ with the product of three matrices, $U$, $\Sigma$ and $V^T$. Through empirical evaluations on the modular addition task, we show that this representation significantly reduces the effect of grokking and, in some cases, eliminates it.
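The factorisation the abstract describes can be sketched as follows. This is a minimal NumPy illustration of representing a weight matrix by its SVD factors, not the authors' implementation; all variable names are assumptions:

```python
import numpy as np

# Decomposed learning, as described in the abstract, replaces a weight
# matrix W with the factors of its SVD, W = U @ diag(s) @ Vt, and treats
# those factors as the trainable parameters instead of W itself.

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))  # toy weight matrix (shapes are illustrative)

# Factor W via SVD; full_matrices=False gives the compact (thin) form.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# The factored representation reproduces the original weights exactly
# (up to floating-point error), so the forward pass is unchanged at init.
W_reconstructed = U @ np.diag(s) @ Vt
print(np.allclose(W, W_reconstructed))
```

In a training loop, gradients would then flow into `U`, `s`, and `Vt` rather than into a single dense `W`.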