Spotlight Poster
RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding
Guanzheng Chen · Qilong Feng · Jinjie Ni · Xin Li · Michael Shieh
East Exhibition Hall A-B #E-2412
Large language models (LLMs) excel at answering questions about huge documents like books or reports, but processing every word makes them painfully slow. Our solution, Retrieval-Augmented Speculative Decoding (RAPID), acts like a sharp librarian and an expert editor working together. RAPID swiftly pinpoints the key passages relevant to the question, and a drafting LLM then writes candidate answers using only those snippets. The main LLM, like a skilled editor, verifies these drafts in parallel against the full document, quickly correcting or improving them instead of generating the answer token by token. This teamwork makes RAPID over twice as fast while often producing better answers than standard long-context decoding. Its speed and accuracy open new doors for efficiently analyzing complex texts such as legal cases, scientific papers, and lengthy reports, transforming how we use powerful LLMs in real-world tasks.
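To make the division of labor concrete, here is a minimal sketch of the retrieve-draft-verify loop in the style of standard speculative decoding. All names (retrieve, draft_next_tokens, verify_tokens) are hypothetical placeholders, and the greedy longest-matching-prefix acceptance rule is an assumption for illustration; RAPID's actual verification criterion is defined in the paper.

```python
from typing import Callable, List

def rapid_sketch(
    question: str,
    full_document: str,
    retrieve: Callable[[str, str, int], str],                   # hypothetical: (question, document, k) -> relevant snippets
    draft_next_tokens: Callable[[str, str, int], List[str]],    # hypothetical: drafter proposes tokens from the snippets
    verify_tokens: Callable[[str, str, List[str]], List[str]],  # hypothetical: target model's token at each drafted position
    max_new_tokens: int = 256,
    draft_len: int = 8,
) -> str:
    """Control-flow sketch only: retrieval feeds a cheap drafter, and the
    long-context target model checks each draft chunk in a single parallel pass."""
    snippets = retrieve(question, full_document, 5)  # the "librarian" step, done once
    answer: List[str] = []
    while len(answer) < max_new_tokens:
        prefix = " ".join(answer)
        # Drafter continues the answer cheaply, seeing only the retrieved snippets.
        draft = draft_next_tokens(question + "\n" + snippets, prefix, draft_len)
        if not draft:
            break
        # Target model scores the whole draft against the full document in parallel.
        target = verify_tokens(question + "\n" + full_document, prefix, draft)
        # Accept the longest prefix on which drafter and target agree (greedy
        # acceptance, an assumption here), then take one corrected token from
        # the target so the loop always makes progress.
        accepted = 0
        for d, t in zip(draft, target):
            if d != t:
                break
            accepted += 1
        answer.extend(draft[:accepted])
        if accepted < len(target):
            answer.append(target[accepted])
        if answer and answer[-1] == "<eos>":
            break
    return " ".join(answer)
```

The speedup in this picture comes from the target model checking a whole drafted chunk in one parallel pass rather than generating it token by token, while quality is preserved or improved because the verifier, which sees the full document, can override the draft whenever the retrieved snippets mislead it.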