Poster
CLOVER: Cross-Layer Orthogonal Vectors Pruning
Fanxu Meng · Pingzhi Tang · Fan Jiang · Muhan Zhang
East Exhibition Hall A-B #E-2812
Large language models (LLMs) such as GPT-2 and GPT-3 have revolutionized many fields by generating human-like text, but they face challenges as they grow in size. One of the main bottlenecks is the memory required by their key-value (KV) cache, which can degrade performance, especially for models with long contexts. To tackle this, we introduce CLOVER (Cross-Layer Orthogonal Vectors), a method that makes these models more efficient without compromising their ability to generate accurate text.

CLOVER rethinks how the attention mechanism in LLMs handles its internal data. It reduces unnecessary redundancy in the model's memory usage, particularly in the key-value pairs used for attention. By applying Singular Value Decomposition (SVD), CLOVER identifies and removes unimportant components from the model's attention heads, allowing the model to maintain strong performance while using less memory.

The key benefit of CLOVER is its ability to prune parts of the model without losing accuracy. In fact, we show that CLOVER can prune up to 70% of a model's key-value memory while keeping performance almost identical to pruning just 8% with traditional methods. CLOVER is also highly efficient for fine-tuning, and it improves inference speed, making models run up to 11 times faster on some tasks.

Overall, CLOVER offers a new way to make large AI models more efficient: faster and less resource-hungry, without sacrificing their ability to generate high-quality outputs.
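To make the SVD step concrete, below is a minimal PyTorch sketch of the general idea: jointly factoring one attention head's query and key projections through the SVD of their product, then keeping only the top singular-vector pairs. This is an illustrative reconstruction, not the authors' released code; the function name `prune_qk_pair`, the weight shapes, and the `keep_ratio` parameter are assumptions made for the example.

```python
# Minimal sketch (assumed, not the authors' code) of SVD-based pruning of a
# paired Q-K projection for a single attention head.
import torch

def prune_qk_pair(w_q: torch.Tensor, w_k: torch.Tensor, keep_ratio: float = 0.3):
    """Jointly factor one head's Q and K projections via the SVD of their
    product and keep only the top singular directions.

    w_q, w_k: [d_model, d_head] weight matrices for a single head (assumed shapes).
    keep_ratio: fraction of singular-value pairs to retain (e.g. 0.3 keeps 30%).
    Returns pruned (w_q_new, w_k_new) with a reduced head dimension r, such that
    w_q_new @ w_k_new.T is the best rank-r approximation of w_q @ w_k.T.
    """
    product = w_q @ w_k.T                       # [d_model, d_model], rank <= d_head
    u, s, vh = torch.linalg.svd(product, full_matrices=False)
    r = max(1, int(keep_ratio * w_q.shape[1]))  # new, smaller head dimension
    sqrt_s = s[:r].sqrt()
    w_q_new = u[:, :r] * sqrt_s                 # [d_model, r]
    w_k_new = vh[:r].T * sqrt_s                 # [d_model, r]
    return w_q_new, w_k_new

# Usage: prune one head of a toy model to 30% of its original head dimension.
d_model, d_head = 768, 64
w_q = torch.randn(d_model, d_head)
w_k = torch.randn(d_model, d_head)
w_q_p, w_k_p = prune_qk_pair(w_q, w_k, keep_ratio=0.3)
err = (w_q @ w_k.T - w_q_p @ w_k_p.T).norm() / (w_q @ w_k.T).norm()
print(w_q_p.shape, w_k_p.shape, f"relative reconstruction error: {err:.3f}")
```

Splitting the retained singular values symmetrically (multiplying both factors by the square root) keeps the pruned Q and K projections balanced in scale; since the attention logits depend only on the product Q·Kᵀ, the truncation is the only source of approximation error.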