Poster
SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity
Samir Khaki · Xiuyu Li · Junxian Guo · Ligeng Zhu · Konstantinos N (Kostas) Plataniotis · Amir Yazdanbakhsh · Kurt Keutzer · Song Han · Zhijian Liu
East Exhibition Hall A-B #E-3004
Abstract:
Fine-tuning LLMs is both computationally and memory-intensive. While parameter-efficient fine-tuning methods, such as QLoRA and DoRA, reduce the number of trainable parameters and lower memory usage, they do not decrease computational cost. In some cases, they may even slow down fine-tuning. In this paper, we introduce SparseLoRA, a method that accelerates LLM fine-tuning through contextual sparsity. We propose a lightweight, training-free SVD sparsity estimator that dynamically selects a sparse subset of weights for loss and gradient computation. We also systematically analyze and address sensitivity across layers, tokens, and training steps. Our experimental results show that SparseLoRA reduces computational cost by up to $2.0\times$ and delivers a measured speedup of up to $1.5\times$ while maintaining accuracy across various downstream tasks, including commonsense and arithmetic reasoning, code generation, and instruction following.
Lay Summary:
Fine-tuning large language models (LLMs) for new tasks typically requires significant compute and memory. Recent techniques, like QLoRA and DoRA, make fine-tuning more memory-efficient by reducing how many model parameters change during training. However, these methods are often less runtime-efficient and can slow down fine-tuning.

In our work, we introduce SparseLoRA, a new approach that makes fine-tuning faster by carefully choosing only a small, important subset of parameters to activate at each training step. We use a lightweight, training-free estimator based on singular value decomposition (SVD) to efficiently predict which parts of the model can be skipped during training, based on the input activations and weight characteristics. We also thoroughly analyze how this method behaves differently across layers, input tokens, and training phases to ensure stability.

Experiments show that SparseLoRA achieves up to $2\times$ computational savings and up to $1.5\times$ faster fine-tuning, all without sacrificing model accuracy on various tasks like reasoning, coding, and instruction following. Our work offers a scalable and computationally efficient solution for fine-tuning modern LLMs.
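To make the idea of an SVD sparsity estimator concrete, here is a minimal PyTorch sketch of one way such a training-free channel selector could work; it is an illustrative assumption, not the authors' released implementation, and the function names, `rank`, and `keep_ratio` values are hypothetical. A low-rank proxy of a dense weight matrix is precomputed once via SVD, then used at each step to cheaply score output channels for the current activations and pick the ones worth computing densely.

```python
import torch


def build_svd_estimator(W: torch.Tensor, rank: int = 8):
    """Precompute a low-rank proxy of W (d_out, d_in) once, before fine-tuning."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    # Low-rank factors: W ≈ (U_r * S_r) @ Vh_r
    A = U[:, :rank] * S[:rank]   # (d_out, rank)
    B = Vh[:rank, :]             # (rank, d_in)
    return A, B


def select_channels(A: torch.Tensor, B: torch.Tensor,
                    x: torch.Tensor, keep_ratio: float = 0.5):
    """Score output channels for the current batch x (N, d_in) and keep the top ones."""
    proxy = (x @ B.T) @ A.T           # ≈ x @ W.T, at low-rank cost
    scores = proxy.abs().mean(dim=0)  # per-channel importance for this input
    k = max(1, int(keep_ratio * scores.numel()))
    return torch.topk(scores, k).indices  # channels to compute densely
```

In a full training loop, the layer would then restrict both the forward pass and the gradient computation to the selected rows of the weight matrix, which is where the FLOP savings reported in the paper come from.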