

Spotlight Poster

LoRA-One: One-Step Full Gradient Could Suffice for Fine-Tuning Large Language Models, Provably and Efficiently

Yuanhe Zhang · Fanghui Liu · Yudong Chen

West Exhibition Hall B2-B3 #W-905
Tue 15 Jul 4:30 p.m. PDT — 7 p.m. PDT
 
Oral presentation: Oral 6C Learning Dynamics 2
Thu 17 Jul 3:30 p.m. PDT — 4:30 p.m. PDT

Abstract:

This paper explores how theory can guide and enhance practical algorithms, using Low-Rank Adaptation (LoRA) (Hu et al., 2022) in large language models as a case study. We rigorously prove that, under gradient descent, LoRA adapters align with specific singular subspaces of the one-step full fine-tuning gradient. This result suggests that, by properly initializing the adapters from the one-step full gradient, subspace alignment can be achieved immediately—applicable to both linear and nonlinear models. Building on our theory, we propose a theory-driven algorithm, LoRA-One, for which we establish linear convergence (as well as generalization guarantees) and show that incorporating preconditioners theoretically helps mitigate the effects of ill-conditioning. Moreover, our theory reveals connections between LoRA-One and other gradient-alignment-based methods, helping to clarify misconceptions in the design of such algorithms. LoRA-One achieves significant empirical improvements over LoRA and its variants across benchmarks in natural language understanding, mathematical reasoning, and code generation. Code is available at: https://github.com/YuanheZ/LoRA-One.
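The core idea described above—initializing the LoRA adapters from the singular subspaces of the one-step full fine-tuning gradient—can be sketched as follows. This is a minimal NumPy illustration under assumptions about the factorization (the function name, the even split of the spectrum between the two factors, and the `lr` scaling are illustrative choices, not the paper's exact implementation; see the linked repository for the authors' code):

```python
import numpy as np

def lora_one_init(G, r, lr=1.0):
    """Sketch of spectral LoRA initialization from the one-step full gradient.

    G  : one-step full fine-tuning gradient of a weight matrix W, shape (d_out, d_in)
    r  : adapter rank
    lr : step-size scaling (illustrative)

    Returns adapters B (d_out, r) and A (r, d_in) such that
    W + B @ A = W - lr * G_r, where G_r is the best rank-r approximation of G.
    """
    U, S, Vt = np.linalg.svd(G, full_matrices=False)
    sqrt_S = np.sqrt(lr * S[:r])
    B = -U[:, :r] * sqrt_S        # spans the top-r column (left singular) subspace of G
    A = sqrt_S[:, None] * Vt[:r]  # spans the top-r row (right singular) subspace of G
    return B, A
```

Because `B @ A` reproduces `-lr` times the rank-`r` truncated SVD of `G`, the adapters start already aligned with the dominant update direction of full fine-tuning, rather than having to discover that subspace during training.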

Lay Summary:

Large language models like ChatGPT have become very popular in recent years. For example, they can understand what you say, write like humans, and even solve maths problems! But these models are huge—on the order of hundreds of billions of parameters—and they need a lot of computing power and storage. This work aims to make these big models work well on a new task with little time and memory cost. LoRA-One, introduced in our paper, takes one careful look at how the full model wants to change in a single step, and then sets that change at the very beginning. In cooking terms, imagine you're making soup: LoRA-One's idea is to taste the soup once to know exactly what flavor it needs, then add the right amount of spice all at once, instead of trial-and-error pinches. We develop LoRA-One with mathematical rigor and evaluate it empirically on practical tasks, achieving more than 10x speedup compared to other popular methods.
