

Poster

Function-Space Learning Rates

Edward Milsom · Ben Anson · Laurence Aitchison

East Exhibition Hall A-B #E-2005
Wed 16 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

We consider layerwise function-space learning rates, which measure the magnitude of the change in a neural network's output function in response to an update to a parameter tensor. This contrasts with traditional learning rates, which describe the magnitude of changes in parameter space. We develop efficient methods to measure and set function-space learning rates in arbitrary neural networks, requiring only minimal computational overhead through a few additional backward passes that can be performed at the start of, or periodically during, training. We demonstrate two key applications: (1) analysing the dynamics of standard neural network optimisers in function space, rather than parameter space, and (2) introducing FLeRM (Function-space Learning Rate Matching), a novel approach to hyperparameter transfer across model scales. FLeRM records function-space learning rates while training a small, cheap base model, then automatically adjusts parameter-space layerwise learning rates when training larger models to maintain consistent function-space updates. FLeRM gives hyperparameter transfer across model width, depth, initialisation scale, and LoRA rank in various architectures including MLPs with residual connections and transformers with different layer normalisation schemes.
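To make the core quantity concrete, here is a minimal numpy sketch of the idea behind a layerwise function-space learning rate: apply an update to a single layer's weights and measure the RMS change it induces in the network's outputs, rather than in parameter space. This is an illustrative toy (a two-layer MLP with a random update direction standing in for an optimiser step, and a finite-difference output comparison), not the paper's efficient backward-pass estimator; all names and sizes here are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer MLP: f(X) = relu(X @ W1.T) @ W2.T
d_in, d_hidden, d_out, n = 8, 32, 4, 64
W1 = rng.normal(0.0, d_in ** -0.5, (d_hidden, d_in))
W2 = rng.normal(0.0, d_hidden ** -0.5, (d_out, d_hidden))
X = rng.normal(size=(n, d_in))

def forward(W1, W2, X):
    return np.maximum(X @ W1.T, 0.0) @ W2.T

def function_space_lr(layer, delta, W1, W2, X):
    """RMS change in the network's outputs caused by applying the
    update `delta` to one layer's weights, holding the other fixed.
    (Finite-difference stand-in for the paper's estimator.)"""
    f0 = forward(W1, W2, X)
    if layer == 1:
        f1 = forward(W1 + delta, W2, X)
    else:
        f1 = forward(W1, W2 + delta, X)
    return float(np.sqrt(np.mean((f1 - f0) ** 2)))

# Give both layers an update of the SAME parameter-space size (Frobenius
# norm lr): the induced function-space changes can still differ by layer.
lr = 1e-2
u1 = rng.normal(size=W1.shape)
u2 = rng.normal(size=W2.shape)
d1 = lr * u1 / np.linalg.norm(u1)
d2 = lr * u2 / np.linalg.norm(u2)

print("layer 1 function-space LR:", function_space_lr(1, d1, W1, W2, X))
print("layer 2 function-space LR:", function_space_lr(2, d2, W1, W2, X))
```

The point of the toy is the mismatch it exposes: equal parameter-space step sizes generally produce unequal output-space step sizes across layers, which is the quantity FLeRM records in the small base model and then matches in the larger model by rescaling the parameter-space layerwise learning rates.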

Lay Summary:

We propose a way to estimate how changes to individual parts of an AI system affect its operation as a whole. This will help AI researchers better understand why current AI systems work well and how to improve them in the future. For example, in our paper we use our method to tune very large AI systems by tuning a small, cheap model and then copying the settings to the large model (which would be very computationally expensive to tune by itself). This is not usually possible, because the optimal settings change between small and large AI systems, but with our method we can predict, and therefore correct for, these changes.
