

Poster in Workshop: ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models

Efficient and Accurate KV-cache Management for Long-Sequence LLMs

Yuzhen Mao · Qitong Wang · Martin Ester · Ke Li


Abstract:

The Key-Value (KV) cache plays a pivotal role in accelerating inference in large language models (LLMs) by storing the intermediate key and value tensors from attention, thereby avoiding redundant computation during auto-regressive generation. However, the cache's memory footprint scales linearly with sequence length, often creating memory bottlenecks on constrained hardware. While prior work has explored offloading the KV-cache to the CPU and maintaining a reduced subset on the GPU, these approaches frequently suffer from imprecise token prioritization and degraded performance in long-generation tasks such as multi-turn dialogue and chain-of-thought reasoning. In this paper, we propose a novel KV-cache management strategy that integrates semantic token clustering with PagedAttention, a memory-efficient paging mechanism. By clustering semantically related tokens and organizing them into a hierarchical, dynamically updatable structure, our method improves cache hit rates and memory bandwidth utilization during CPU-GPU transfers. Experimental results show that our approach significantly enhances inference efficiency while preserving generation quality, particularly in long-sequence scenarios. Notably, it achieves higher accuracy even when operating with only half the memory budget, effectively addressing key limitations of existing KV-cache optimization methods.
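The abstract does not give implementation details, so the sketch below is only a minimal illustration of the general idea: group cached tokens into semantic clusters, pack each cluster into fixed-size PagedAttention-style pages, keep the clusters most relevant to the current query on the GPU, and offload the rest to CPU memory. The names and choices here (k-means over key vectors, `PAGE_SIZE`, `PagedKVCache`, the hot/cold admission rule) are hypothetical stand-ins, not the authors' actual design.

```python
import numpy as np

# Hypothetical constants for illustration only; the abstract does not state
# the paper's page size, cluster count, or GPU memory budget.
PAGE_SIZE = 16          # tokens per KV page (PagedAttention-style block)
NUM_CLUSTERS = 4        # number of semantic token clusters
GPU_PAGE_BUDGET = 8     # pages kept GPU-resident; the rest are offloaded to CPU


def cluster_token_keys(keys: np.ndarray, num_clusters: int, iters: int = 10) -> np.ndarray:
    """Assign each cached token to a cluster via k-means on its key vector.

    A stand-in for whatever semantic clustering the paper actually uses.
    """
    rng = np.random.default_rng(0)
    centroids = keys[rng.choice(len(keys), size=num_clusters, replace=False)]
    for _ in range(iters):
        # Distance from every token key to every centroid, then hard assignment.
        dists = np.linalg.norm(keys[:, None, :] - centroids[None, :, :], axis=-1)
        assignment = dists.argmin(axis=1)
        for c in range(num_clusters):
            members = keys[assignment == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return assignment


class PagedKVCache:
    """Toy paged KV cache: 'hot' cluster pages stay on the GPU, the rest go to the CPU."""

    def __init__(self, gpu_page_budget: int):
        self.gpu_page_budget = gpu_page_budget
        self.gpu_pages = {}  # page_id -> (keys, values) kept on the accelerator
        self.cpu_pages = {}  # page_id -> (keys, values) offloaded to host memory

    def admit(self, page_id: int, keys: np.ndarray, values: np.ndarray, hot: bool) -> None:
        """Place a page on the GPU if it is hot and the budget allows, otherwise offload it."""
        if hot and len(self.gpu_pages) < self.gpu_page_budget:
            self.gpu_pages[page_id] = (keys, values)
        else:
            self.cpu_pages[page_id] = (keys, values)

    def fetch(self, page_id: int):
        """Return a page and whether it was a GPU hit (a miss stands in for a CPU-to-GPU transfer)."""
        if page_id in self.gpu_pages:
            return self.gpu_pages[page_id], True
        return self.cpu_pages[page_id], False


# Usage sketch: cluster cached tokens, pack each cluster into fixed-size pages,
# and keep clusters that look relevant to the current query GPU-resident.
seq_len, head_dim = 256, 64
keys = np.random.randn(seq_len, head_dim).astype(np.float32)
values = np.random.randn(seq_len, head_dim).astype(np.float32)
query = np.random.randn(head_dim).astype(np.float32)

assignment = cluster_token_keys(keys, NUM_CLUSTERS)
cache = PagedKVCache(GPU_PAGE_BUDGET)

page_id = 0
for c in range(NUM_CLUSTERS):
    idx = np.flatnonzero(assignment == c)
    if len(idx) == 0:
        continue
    # Treat a cluster as "hot" if its mean key has positive similarity with the
    # current query; a crude proxy for the prioritization described in the abstract.
    hot = float(keys[idx].mean(axis=0) @ query) > 0.0
    for start in range(0, len(idx), PAGE_SIZE):
        page_idx = idx[start:start + PAGE_SIZE]
        cache.admit(page_id, keys[page_idx], values[page_idx], hot)
        page_id += 1

page, gpu_hit = cache.fetch(0)
print(f"page 0: {page[0].shape[0]} tokens, GPU hit = {gpu_hit}")
```

The point of the sketch is the data layout, not the clustering: because tokens in a page are semantically related, a single page fetch tends to bring in many tokens the next decoding steps will actually attend to, which is how clustering can raise cache hit rates and make CPU-GPU transfers better amortized.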
