

Invited Talk in Workshop: Tokenization Workshop (TokShop)

Learning Dynamic Segmentation and Compression of Sequences in Transformer LLMs

Adrian Łańcucki

Fri 18 Jul 1 p.m. PDT — 1:50 p.m. PDT

Abstract:

Transformer-based LLMs excel at language tasks, but their efficiency hinges on input sequence length. Typically, input resolution—imposed by a tokenizer—remains unchanged across all layers. In this talk, we introduce methods that enable end-to-end learning to dynamically pool, compress, or sparsify input or key-value token sequences. By effectively tracking down and removing redundancies, these methods deliver performance gains during training or inference. We arrive at a surprisingly practical method—Dynamic Memory Sparsification—that allows a model to achieve 8x KV cache compression within just a few hundred training steps. The resulting savings can be used not only to improve throughput and latency, but also to boost accuracy, as demonstrated across several reasoning tasks.
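To make the idea of KV cache sparsification concrete, below is a minimal, illustrative sketch of score-based eviction at inference time. It is not the Dynamic Memory Sparsification method from the talk; the function name, tensor shapes, and the use of random stand-in importance scores are all assumptions chosen for the example, and a `keep_ratio` of 0.125 simply mirrors the 8x compression figure mentioned in the abstract.

```python
# Illustrative sketch (not the talk's method): score-based KV cache sparsification.
# Assumes keys/values shaped [batch, heads, seq_len, head_dim] and per-token
# importance scores shaped [batch, heads, seq_len] (e.g., accumulated attention mass).
import torch

def sparsify_kv_cache(keys, values, scores, keep_ratio=0.125):
    """Keep only the highest-scoring fraction of cached tokens (8x compression at 0.125)."""
    seq_len = keys.shape[2]
    keep = max(1, int(seq_len * keep_ratio))
    # Indices of the tokens to retain, per batch element and head.
    _, idx = torch.topk(scores, keep, dim=-1)            # [batch, heads, keep]
    idx = idx.sort(dim=-1).values                        # preserve temporal order
    gather_idx = idx.unsqueeze(-1).expand(-1, -1, -1, keys.shape[-1])
    return keys.gather(2, gather_idx), values.gather(2, gather_idx)

# Toy usage: a random cache compressed 8x.
if __name__ == "__main__":
    b, h, t, d = 1, 4, 256, 64
    k, v = torch.randn(b, h, t, d), torch.randn(b, h, t, d)
    scores = torch.rand(b, h, t)                          # stand-in importance scores
    k_small, v_small = sparsify_kv_cache(k, v, scores)
    print(k_small.shape)                                  # torch.Size([1, 4, 32, 64])
```

The key difference in the approaches discussed in the talk is that the decision of what to pool, compress, or sparsify is learned end-to-end rather than fixed by a hand-crafted scoring rule like the one above.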
