Poster
CLOVER: Cross-Layer Orthogonal Vectors Pruning
Fanxu Meng · Pingzhi Tang · Fan Jiang · Muhan Zhang
East Exhibition Hall A-B #E-2812
Large language models (LLMs) such as GPT-2 and GPT-3 have revolutionized many fields by generating human-like text, but they face challenges as they grow in size. One of the main bottlenecks is the memory required by their key-value (KV) cache, which can degrade performance, especially for models with long contexts. To tackle this, we introduce CLOVER (Cross-Layer Orthogonal Vectors), a method that makes these models more efficient without compromising their ability to generate accurate text.

CLOVER rethinks how the attention mechanism in LLMs handles its internal data. It reduces unnecessary redundancy in the model's memory usage, particularly in the key-value pairs used for attention. By applying Singular Value Decomposition (SVD), CLOVER identifies and removes unimportant components from the model's attention heads, allowing the model to maintain strong performance while using less memory.

The key benefit of CLOVER is its ability to prune parts of the model without losing accuracy. In fact, we show that CLOVER can prune up to 70% of a model's key-value memory while keeping performance almost identical to pruning just 8% with traditional methods. CLOVER is also highly efficient for fine-tuning, and it improves inference speed, making models run up to 11 times faster on some tasks.

Overall, CLOVER offers a new way to make large AI models more efficient: faster and less resource-hungry, without sacrificing their ability to generate high-quality outputs.
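To make the SVD step concrete, below is a minimal PyTorch sketch of the general idea: jointly factoring one attention head's query and key projections through the SVD of their product, then keeping only the top singular-vector pairs. This is an illustrative reconstruction, not the authors' released code; the function name `prune_qk_pair`, the weight shapes, and the `keep_ratio` parameter are assumptions made for the example.

```python
# Minimal sketch (assumed, not the authors' code) of SVD-based pruning of a
# paired Q-K projection for a single attention head.
import torch

def prune_qk_pair(w_q: torch.Tensor, w_k: torch.Tensor, keep_ratio: float = 0.3):
    """Jointly factor one head's Q and K projections via the SVD of their
    product and keep only the top singular directions.

    w_q, w_k: [d_model, d_head] weight matrices for a single head (assumed shapes).
    keep_ratio: fraction of singular-value pairs to retain (e.g. 0.3 keeps 30%).
    Returns pruned (w_q_new, w_k_new) with a reduced head dimension r, such that
    w_q_new @ w_k_new.T is the best rank-r approximation of w_q @ w_k.T.
    """
    product = w_q @ w_k.T                       # [d_model, d_model], rank <= d_head
    u, s, vh = torch.linalg.svd(product, full_matrices=False)
    r = max(1, int(keep_ratio * w_q.shape[1]))  # new, smaller head dimension
    sqrt_s = s[:r].sqrt()
    w_q_new = u[:, :r] * sqrt_s                 # [d_model, r]
    w_k_new = vh[:r].T * sqrt_s                 # [d_model, r]
    return w_q_new, w_k_new

# Usage: prune one head of a toy model to 30% of its original head dimension.
d_model, d_head = 768, 64
w_q = torch.randn(d_model, d_head)
w_k = torch.randn(d_model, d_head)
w_q_p, w_k_p = prune_qk_pair(w_q, w_k, keep_ratio=0.3)
err = (w_q @ w_k.T - w_q_p @ w_k_p.T).norm() / (w_q @ w_k.T).norm()
print(w_q_p.shape, w_k_p.shape, f"relative reconstruction error: {err:.3f}")
```

Splitting the retained singular values symmetrically (multiplying both factors by the square root) keeps the pruned Q and K projections balanced in scale; since the attention logits depend only on the product Q·Kᵀ, the truncation is the only source of approximation error.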