Poster
Beyond Zero Initialization: Investigating the Impact of Non-Zero Initialization on LoRA Fine-Tuning Dynamics
Shiwei Li · Xiandi Luo · Xing Tang · Haozhao Wang · Hao Chen · Weihong Luo · Yuhua Li · Xiuqiang He · Ruixuan Li
East Exhibition Hall A-B #E-2108
Fine-tuning large language models is extremely expensive, so researchers often turn to a technique called Low-Rank Adaptation (LoRA), which approximates the update of the pretrained weight matrix with the product of two smaller low-rank matrices. Typically, one of these matrices is initialized to zero so that fine-tuning starts exactly from the pretrained model. However, there is no theoretical reason why zero initialization should be the optimal choice. This raises a simple question: what if we initialize both low-rank matrices with small, non-zero values? Through theoretical analysis and extensive experiments, we find that this non-zero initialization makes LoRA more robust to suboptimal learning rates, especially smaller ones. These findings challenge two long-standing assumptions in LoRA fine-tuning: first, that one of the low-rank matrices must be initialized to zero, and second, that fine-tuning must begin exactly from the pretrained model. Instead, we show that carefully scaled non-zero initialization not only works, but can improve robustness and overall accuracy.
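To make the two initialization schemes concrete, below is a minimal PyTorch-style sketch of a LoRA layer that can start either from the standard zero-initialized B matrix or from small non-zero values. The `init_std` knob and the specific standard deviations (e.g., 0.02) are illustrative assumptions for this sketch, not the scaling prescribed by the paper.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a low-rank update: W x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16,
                 init_std: float = 0.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen

        in_f, out_f = base.in_features, base.out_features
        self.A = nn.Parameter(torch.empty(r, in_f))
        self.B = nn.Parameter(torch.empty(out_f, r))
        self.scaling = alpha / r

        nn.init.normal_(self.A, std=0.02)  # A is conventionally random
        if init_std > 0.0:
            # Non-zero initialization: B also starts from small random values,
            # so fine-tuning no longer begins exactly at the pretrained model.
            nn.init.normal_(self.B, std=init_std)
        else:
            nn.init.zeros_(self.B)  # standard LoRA: B = 0, so BA = 0 at step 0

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```

With `init_std=0.0` the layer reproduces the conventional setup where the model's output is unchanged before training; passing a small positive `init_std` gives the non-zero initialization studied here, whose scale must be kept small so that the starting point stays close to the pretrained model.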