Poster
CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration
Haoyun Jiang · Haolin Li · Jianwei Zhang · Fei Huang · Qiang Hu · Minmin Sun · Shuai Xiao · Yong Li · Junyang Lin · Jiangchao Yao
East Exhibition Hall A-B #E-3012
Modern language models excel at understanding long texts but often require significant memory and time to process them. We discovered that certain parts of these models consistently focus on specific information, which inspired us to develop CateKV, a method that retains only the most important information from these parts while preserving more detail where necessary. This approach reduces memory usage and speeds up processing without sacrificing accuracy. Our experiments show that CateKV can reduce memory consumption by nearly three times and double the speed for single-sample inputs, while boosting throughput for batch inputs by almost four times. This makes handling long documents more efficient and practical.
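The core idea can be sketched as a per-head key-value cache policy: heads whose attention mass consistently concentrates on a few tokens keep only those entries, while the rest keep the full cache. The sketch below is illustrative only; the function name, the concentration criterion, and the threshold are assumptions for this example, not CateKV's actual algorithm.

```python
import numpy as np

def compress_kv_per_head(attn, keys, values, top_k=4, concentration_thresh=0.8):
    """Illustrative per-head KV cache pruning.

    attn:          (num_heads, seq_len) attention scores per head, each row summing to 1.
    keys, values:  (num_heads, seq_len, head_dim) cached key/value tensors.

    A head is treated as "concentrated" when its top_k attention scores
    account for at least concentration_thresh of the total mass; such heads
    keep only their top_k entries, and all other heads keep the full cache.
    """
    compressed = []
    for h in range(attn.shape[0]):
        scores = attn[h]
        order = np.argsort(scores)[::-1]          # token indices, highest score first
        if scores[order[:top_k]].sum() >= concentration_thresh:
            idx = np.sort(order[:top_k])          # prune: keep top_k, preserve order
            compressed.append((keys[h, idx], values[h, idx]))
        else:
            compressed.append((keys[h], values[h]))  # diffuse head: keep everything
    return compressed
```

With one head that puts almost all its mass on a single token and one head with uniform attention, the first head's cache shrinks to `top_k` entries while the second keeps the full sequence, which is where the memory savings come from.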