Poster
EPIC: Efficient Position-Independent Caching for Serving Large Language Models
Junhao Hu · Wenrui Huang · Weidong Wang · Haoyi Wang · Tiancheng Hu · Qin Zhang · Hao Feng · Xusheng Chen · Yizhou Shan · Tao Xie
East Exhibition Hall A-B #E-3404
Large Language Models (LLMs) show great capabilities in a wide range of applications, but serving them efficiently becomes increasingly challenging as requests (prompts) become more complex. Context caching improves serving performance by reusing Key-Value (KV) vectors, the intermediate representations of tokens that are repeated across requests. However, existing context caching requires exact prefix matches across requests, limiting reuse in settings such as few-shot learning and retrieval-augmented generation, where immutable content (e.g., documents) remains unchanged across requests but is preceded by varying prefixes. Position-Independent Caching (PIC) addresses this issue by enabling modular reuse of the KV vectors regardless of prefixes. We formalize PIC and advance prior work by introducing EPIC, a serving system incorporating our new LegoLink algorithm, which mitigates the inappropriate “attention sink” effect at the beginning of every document to maintain accuracy with minimal computation. Experiments show that EPIC achieves up to 8× improvements in Time-To-First-Token (TTFT) and 7× throughput gains over existing systems, with negligible or no accuracy loss.
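To make the contrast concrete, the minimal Python sketch below compares prefix-based context caching, which keys cached KV entries on the entire prompt prefix, with position-independent caching, which keys each chunk's KV entries on the chunk's content alone. This is an illustrative sketch, not EPIC's actual API: the class and function names (`PrefixCache`, `PositionIndependentCache`, `_digest`) are hypothetical, and real systems store KV tensors rather than opaque objects.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

def _digest(tokens: List[int]) -> str:
    # Content hash of a token sequence, used as a cache key.
    return hashlib.sha256(str(tokens).encode("utf-8")).hexdigest()

class PrefixCache:
    """Reuse is possible only when the entire prompt prefix matches exactly."""
    def __init__(self) -> None:
        self._store: Dict[str, object] = {}

    def lookup(self, prompt_tokens: List[int]) -> Optional[object]:
        # Any change in the leading tokens (e.g., a reordered set of
        # retrieved documents) changes the key and causes a cache miss.
        return self._store.get(_digest(prompt_tokens))

    def insert(self, prompt_tokens: List[int], kv: object) -> None:
        self._store[_digest(prompt_tokens)] = kv

class PositionIndependentCache:
    """Reuse KV vectors per chunk, regardless of what precedes the chunk."""
    def __init__(self) -> None:
        self._store: Dict[str, object] = {}

    def lookup_chunks(
        self, chunks: List[List[int]]
    ) -> List[Tuple[List[int], Optional[object]]]:
        # Each chunk (e.g., one immutable document) is keyed by its own
        # content, so it hits the cache even after a different prefix.
        return [(c, self._store.get(_digest(c))) for c in chunks]

    def insert_chunk(self, chunk: List[int], kv: object) -> None:
        self._store[_digest(chunk)] = kv
```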
Large Language Models (LLMs) demonstrate strong capabilities across a wide range of applications; however, efficiently serving them becomes increasingly challenging as user requests (prompts) grow in complexity. In this work, we formalize Position-Independent Caching (PIC), an approach designed to substantially accelerate LLM inference by enabling modular reuse of intermediate representations. Building upon prior PIC efforts, we present EPIC, a serving system that incorporates a novel algorithm, LegoLink, which preserves model accuracy while minimizing computational overhead. Empirical results show that EPIC yields substantial performance gains, particularly in scenarios such as few-shot learning and retrieval-augmented generation, achieving significant improvements in both latency and throughput with minimal or no loss in accuracy.
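To give a rough sense of why the extra computation stays small, the sketch below marks only the first k tokens of each non-initial cached chunk for recomputation, so those positions stop acting as spurious attention sinks while every other position reuses its cached KV vectors. The function name and the parameter k are illustrative assumptions about a LegoLink-style selection step, not EPIC's actual interface.

```python
from typing import List

def tokens_to_recompute(chunk_lengths: List[int], k: int = 16) -> List[int]:
    """Hypothetical selection step: for each cached chunk after the first,
    recompute only its first k tokens; all other positions reuse cached KV.

    Returns the flat prompt positions whose KV vectors must be recomputed.
    """
    positions: List[int] = []
    offset = 0
    for i, length in enumerate(chunk_lengths):
        if i > 0:  # the genuine sink at the very start of the prompt is kept
            positions.extend(range(offset, offset + min(k, length)))
        offset += length
    return positions

# Example: three concatenated chunks of 100 tokens each; only 2 * 16 = 32
# positions are recomputed instead of a full 300-token prefill.
print(tokens_to_recompute([100, 100, 100], k=16))
```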