Poster

Dialogue Without Limits: Constant-Sized KV Caches for Extended Response in LLMs

Ravi Ghadia · Avinash Kumar · Gaurav Jain · Prashant J. Nair · Poulami Das

East Exhibition Hall A-B #E-2711
Thu 17 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract: Autoregressive Transformers rely on Key-Value (KV) caching to accelerate inference. However, the linear growth of the KV cache with context length leads to excessive memory consumption and bandwidth constraints. Existing methods drop distant tokens or compress states in a lossy manner, sacrificing accuracy by discarding vital context or introducing bias. We propose $MorphKV$, an inference-time technique that maintains a constant-sized KV cache while preserving accuracy. MorphKV balances long-range dependencies and local coherence during text generation. It eliminates early-token bias while retaining high-fidelity context by adaptively ranking tokens through correlation-aware selection. Unlike heuristic retention or lossy compression, MorphKV iteratively refines the KV cache via lightweight updates guided by the attention patterns of recent tokens. This approach captures inter-token correlation with greater accuracy, which is crucial for tasks like content creation and code generation. Our studies on long-response tasks show 52.9% memory savings and 18.2% higher accuracy on average compared to state-of-the-art prior works, enabling efficient deployment.
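To make the eviction idea concrete, here is a minimal sketch (not the authors' code) of a constant-sized KV cache that drops the cached entry receiving the least attention from a window of recent queries, loosely mirroring the correlation-aware selection described above. The class name, parameters, and scoring rule are all illustrative assumptions.

```python
# Illustrative sketch: constant-size KV cache with attention-guided eviction.
# All names (ConstantSizeKVCache, capacity, recent_window) are hypothetical,
# not MorphKV's actual implementation.
import numpy as np


class ConstantSizeKVCache:
    def __init__(self, capacity, recent_window=4):
        self.capacity = capacity            # fixed KV budget, independent of context length
        self.recent_window = recent_window  # recent tokens whose attention guides eviction
        self.keys, self.values = [], []

    def add(self, key, value, recent_queries):
        """Insert a new (key, value) pair; if the budget is exceeded, evict
        the cached token that receives the least total attention from the
        recent queries."""
        self.keys.append(key)
        self.values.append(value)
        if len(self.keys) > self.capacity:
            K = np.stack(self.keys)           # (n, d) cached keys
            Q = np.stack(recent_queries)      # (w, d) recent query vectors
            scores = Q @ K.T                  # (w, n) attention logits
            # Softmax over cached positions for each recent query,
            # then sum the attention each cached token receives.
            attn = np.exp(scores - scores.max(axis=1, keepdims=True))
            attn /= attn.sum(axis=1, keepdims=True)
            received = attn.sum(axis=0)
            # Protect the most recent tokens from eviction (local coherence).
            received[-self.recent_window:] = np.inf
            evict = int(received.argmin())
            del self.keys[evict]
            del self.values[evict]
```

Because eviction is rescored against the newest queries at every step, tokens that stay correlated with ongoing generation survive regardless of their position, avoiding the early-token bias of fixed sliding-window schemes.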

Lay Summary: The increasing size of KV caches presents a critical bottleneck, particularly for long-response tasks such as content creation and code generation. Compressed KV caches address this challenge by evicting unimportant tokens, but they present a trade-off between accuracy and memory savings: retaining too few KVs reduces accuracy, whereas accurate methods are not memory-efficient. Ideally, we want to compress KV caches without sacrificing accuracy. We introduce $MorphKV$, a dynamic KV cache pruning method that maintains a constant-size KV cache by keeping only a select subset of KVs. MorphKV improves accuracy by dynamically preserving only those KVs that exhibit strong correlation with recently generated tokens. This enables MorphKV to efficiently handle long-context and long-response tasks, even when operating with limited hardware resources.
