ICML Poster Beyond Zero Initialization: Investigating the Impact of Non-Zero Initialization on LoRA Fine-Tuning Dynamics

Poster

Beyond Zero Initialization: Investigating the Impact of Non-Zero Initialization on LoRA Fine-Tuning Dynamics

Shiwei Li · Xiandi Luo · Xing Tang · Haozhao Wang · Hao Chen · Weihong Luo · Yuhua Li · Xiuqiang He · Ruixuan Li

East Exhibition Hall A-B #E-2108


Tue 15 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract: Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method. In standard LoRA layers, one of the matrices, $A$ or $B$, is initialized to zero, ensuring that fine-tuning starts from the pretrained model. However, there is no theoretical support for this practice. In this paper, we investigate the impact of non-zero initialization on LoRA's fine-tuning dynamics from an infinite-width perspective. Our analysis reveals that, compared to zero initialization, simultaneously initializing $A$ and $B$ to non-zero values improves LoRA's robustness to suboptimal learning rates, particularly smaller ones. Further analysis indicates that although the non-zero initialization of $AB$ introduces random noise into the pretrained weight, it generally does not affect fine-tuning performance. In other words, fine-tuning does not need to strictly start from the pretrained model. The validity of our findings is confirmed through extensive experiments across various models and datasets. The code is available at https://github.com/Leopold1423/non_zero_lora-icml25.
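To make the contrast between the two initialization schemes concrete, here is a minimal sketch in plain Python. It is not the authors' implementation (see the repository linked above for that). The function names, the shape convention ($A$ as `d_out x rank`, $B$ as `rank x d_in`, so the update is the product $AB$), the choice to zero $B$ in the standard scheme, and the standard deviation `0.02` are all assumptions made for illustration; the paper's analysis concerns how the initialization scale should be chosen, which this sketch does not attempt to reproduce.

```python
import random


def gaussian_matrix(rows, cols, std, rng):
    """Return a rows x cols matrix with i.i.d. N(0, std^2) entries."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def matmul(X, Y):
    """Plain-Python matrix product of X (m x k) and Y (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def init_lora(d_in, d_out, rank, mode, std=0.02, seed=0):
    """Initialize hypothetical LoRA factors A (d_out x rank) and B (rank x d_in).

    mode="zero":     A is Gaussian and B is all zeros, so A @ B starts at zero.
    mode="non_zero": A and B are both Gaussian, so A @ B starts as small noise.
    The value of std is an illustrative assumption, not the paper's scaling.
    """
    rng = random.Random(seed)
    A = gaussian_matrix(d_out, rank, std, rng)
    if mode == "zero":
        B = [[0.0] * d_in for _ in range(rank)]
    elif mode == "non_zero":
        B = gaussian_matrix(rank, d_in, std, rng)
    else:
        raise ValueError("mode must be 'zero' or 'non_zero'")
    return A, B


def initial_update_norm(A, B):
    """Frobenius norm of the initial low-rank update A @ B."""
    delta = matmul(A, B)
    return sum(v * v for row in delta for v in row) ** 0.5


if __name__ == "__main__":
    for mode in ("zero", "non_zero"):
        A, B = init_lora(d_in=8, d_out=6, rank=2, mode=mode)
        print(mode, initial_update_norm(A, B))
```

With zero initialization the effective weight at step 0 equals the pretrained weight exactly, because the update norm is zero. With non-zero initialization the effective weight is the pretrained weight plus a small random perturbation, which is the "random noise" the abstract refers to.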

Lay Summary:

Fine-tuning large language models is extremely expensive, so researchers often turn to a technique called Low-Rank Adaptation (LoRA), which approximates the update of the pretrained weight matrix using two smaller low-rank matrices. Typically, one of these matrices is initialized to zero to ensure that fine-tuning starts exactly from the pretrained model. However, there is no theoretical reason why zero initialization should be the optimal choice. This raises a simple question: what if we initialize both low-rank matrices with small, non-zero values? Through theoretical analysis and extensive experiments, we find that this non-zero initialization makes LoRA more robust to suboptimal learning rates, especially smaller ones. These findings challenge two long-standing assumptions in LoRA fine-tuning: first, that one of the low-rank matrices must be initialized to zero, and second, that fine-tuning must begin exactly from the pretrained model. Instead, we show that carefully scaled non-zero initialization not only works, but can improve robustness and overall accuracy.
