Spotlight Poster
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Hanshi Sun · Li-Wen Chang · Wenlei Bao · Size Zheng · Ningxin Zheng · Xin Liu · Harry Dong · Yuejie Chi · Beidi Chen
East Exhibition Hall A-B #E-2805
Large language models that can understand very long texts require significant memory, which makes them slow and expensive to use, especially when many people are using them at once. Our research introduces ShadowKV, a new system designed to significantly speed up these models. ShadowKV works by managing the model's memory more intelligently: it keeps a small, compressed version of the most important data on the GPU and moves the less critical data to the CPU. When the model needs information, ShadowKV quickly finds and retrieves only what is essential, avoiding delays while keeping the model accurate. This allows the system to serve many more requests and much longer texts at once, making powerful long-context AI more efficient and accessible for broader use.
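To make the idea concrete, here is a minimal, illustrative sketch of a hybrid GPU/CPU key-value cache of the kind the abstract describes: a compact per-chunk summary of the keys stays resident, the full data is offloaded, and at query time only the highest-scoring chunks are fetched back for attention. The class and parameter names (HybridKVCache, chunk_size, top_k) are hypothetical and this is not the authors' implementation; it only sketches the selective-retrieval concept under simplified assumptions.

```python
import numpy as np

class HybridKVCache:
    """Toy hybrid KV cache for one attention head (illustrative only)."""

    def __init__(self, keys, values, chunk_size=64):
        # keys, values: (seq_len, head_dim) arrays.
        self.chunk_size = chunk_size
        n_chunks = keys.shape[0] // chunk_size
        keys = keys[: n_chunks * chunk_size]
        values = values[: n_chunks * chunk_size]
        # "GPU-resident" compact summary: one mean key per chunk, standing in
        # for the compressed representation mentioned in the abstract.
        self.chunk_landmarks = keys.reshape(n_chunks, chunk_size, -1).mean(axis=1)
        # "CPU-resident" full data, fetched only on demand.
        self.offloaded_keys = keys.reshape(n_chunks, chunk_size, -1)
        self.offloaded_values = values.reshape(n_chunks, chunk_size, -1)

    def attend(self, query, top_k=4):
        # Score each chunk via its landmark and keep only the best few.
        scores = self.chunk_landmarks @ query
        selected = np.argsort(scores)[-top_k:]
        # Gather just the selected chunks (the sparse "fetch from CPU").
        k = self.offloaded_keys[selected].reshape(-1, query.shape[0])
        v = self.offloaded_values[selected].reshape(-1, query.shape[0])
        # Standard softmax attention over the retrieved subset only.
        logits = k @ query / np.sqrt(query.shape[0])
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()
        return weights @ v


# Tiny usage example with random data.
rng = np.random.default_rng(0)
cache = HybridKVCache(rng.standard_normal((1024, 64)),
                      rng.standard_normal((1024, 64)))
out = cache.attend(rng.standard_normal(64))
print(out.shape)  # (64,)
```

Because only a handful of chunks are ever pulled from the offloaded store per query, the working set on the accelerator stays small even for very long inputs, which is the intuition behind the throughput gains the abstract describes.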