

Poster in Workshop: ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models

KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction

Jang-Hyun Kim · Jinuk Kim · Sangwoo Kwon · Jae W. Lee · Sangdoo Yun · Hyun Oh Song


Abstract: Transformer-based large language models (LLMs) cache context as key-value (KV) pairs during inference. Longer contexts increase KV cache sizes, leading to significant memory overhead and higher attention computation latency. This paper introduces KVzip, a query-agnostic KV cache eviction method enabling effective reuse of compressed context caches across different queries. KVzip quantifies the importance of each KV pair by using the underlying LLM to reconstruct the original context from the cached KV pairs, then evicts pairs with lower importance. Extensive empirical evaluations demonstrate that KVzip reduces KV cache size by $3$-$4\times$ and decreases FlashAttention latency by approximately $2\times$, without performance degradation on question-answering, retrieval, mathematical reasoning, and code comprehension tasks. Evaluations include state-of-the-art models such as LLaMA3.1-8B, Qwen2.5-14B, and Gemma3-12B, with context lengths reaching up to 170K tokens. KVzip significantly outperforms existing KV eviction methods, which suffer performance losses even at a 90\% cache budget ratio under multi-query scenarios. Code is available at https://github.com/snu-mllab/KVzip.
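To make the eviction idea concrete, here is a minimal, self-contained sketch of reconstruction-based importance scoring and top-k eviction. It is an illustrative assumption of the approach, not the paper's implementation: the shapes, the max-over-queries/heads scoring rule, and the helper names `importance_scores` and `evict` are all hypothetical, and random tensors stand in for attention weights that would actually be collected while the LLM regenerates its context.

```python
# Minimal sketch of query-agnostic KV eviction in the spirit of KVzip.
# All shapes, the scoring rule, and the helper names are illustrative
# assumptions; see https://github.com/snu-mllab/KVzip for the real code.
import torch


def importance_scores(attn: torch.Tensor) -> torch.Tensor:
    """Score each cached KV position by the strongest attention it receives
    during a reconstruction pass (the model repeating its own context).

    attn: [n_heads, n_recon_queries, n_ctx] attention weights.
    Returns a [n_ctx] importance score per cached position.
    """
    # A KV pair is kept if *any* head attends to it strongly for *any*
    # reconstruction query: max over queries, then max over heads.
    return attn.max(dim=1).values.max(dim=0).values


def evict(keys: torch.Tensor, values: torch.Tensor,
          scores: torch.Tensor, budget_ratio: float):
    """Keep only the top `budget_ratio` fraction of KV pairs by importance."""
    n_ctx = keys.shape[-2]
    n_keep = max(1, int(n_ctx * budget_ratio))
    keep = scores.topk(n_keep).indices.sort().values  # preserve token order
    return keys[..., keep, :], values[..., keep, :]


if __name__ == "__main__":
    torch.manual_seed(0)
    n_heads, n_ctx, d_head = 8, 1024, 64
    keys = torch.randn(n_heads, n_ctx, d_head)
    values = torch.randn(n_heads, n_ctx, d_head)
    # Stand-in for attention recorded while the LLM reconstructs the context.
    attn = torch.softmax(torch.randn(n_heads, n_ctx, n_ctx), dim=-1)

    scores = importance_scores(attn)
    k, v = evict(keys, values, scores, budget_ratio=0.3)  # ~3.3x compression
    print(k.shape, v.shape)  # torch.Size([8, 307, 64]) for both
```

Because the scores come from reconstructing the context itself rather than from any particular user query, the compressed cache can, under this framing, be computed once and reused across different downstream queries.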
