Invited Talk
Tokenization Workshop (TokShop)
Learning Dynamic Segmentation and Compression of Sequences in Transformer LLMs
Adrian Łańcucki
Transformer-based LLMs excel at language tasks, but their efficiency hinges on input sequence length. Typically, the input resolution imposed by a tokenizer remains fixed across all layers. In this talk, we introduce methods that enable end-to-end learning to dynamically pool, compress, or sparsify input or key-value token sequences. By tracking down and removing redundancies, these methods deliver performance gains during training or inference. We arrive at a surprisingly practical method, Dynamic Memory Sparsification, which allows a model to achieve 8x KV cache compression within just a few hundred training steps. The resulting savings can be used not only to improve throughput and latency, but also to boost accuracy, as demonstrated across several reasoning tasks.
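For readers unfamiliar with KV cache compression, the sketch below illustrates the general idea in its simplest form: retaining only a fraction of cached key-value tokens (here, the most-attended ones) to shrink the cache by a fixed ratio. This is a generic, hypothetical illustration, not the Dynamic Memory Sparsification method from the talk; the scoring heuristic, tensor shapes, and function names are assumptions made for the example.

```python
# Minimal sketch of KV-cache sparsification via top-k token retention.
# NOT the Dynamic Memory Sparsification method described in the talk;
# the attention-based scoring heuristic and shapes are illustrative assumptions.

import torch


def sparsify_kv_cache(keys, values, attn_weights, keep_ratio=0.125):
    """Keep only the most-attended cached tokens (1/8 retained = 8x compression).

    keys, values: [batch, seq_len, head_dim] cached key/value projections
    attn_weights: [batch, seq_len] accumulated attention each cached token received
    """
    batch, seq_len, head_dim = keys.shape
    keep = max(1, int(seq_len * keep_ratio))

    # Score cached tokens by accumulated attention, keep the top-`keep` per batch
    # element, and sort the surviving indices to preserve the original token order.
    kept_idx = attn_weights.topk(keep, dim=-1).indices.sort(dim=-1).values
    gather_idx = kept_idx.unsqueeze(-1).expand(-1, -1, head_dim)

    return keys.gather(1, gather_idx), values.gather(1, gather_idx)


if __name__ == "__main__":
    torch.manual_seed(0)
    k = torch.randn(2, 64, 16)   # toy cache: batch=2, 64 cached tokens, head_dim=16
    v = torch.randn(2, 64, 16)
    w = torch.rand(2, 64)        # stand-in for accumulated attention scores
    k_small, v_small = sparsify_kv_cache(k, v, w)
    print(k.shape, "->", k_small.shape)  # torch.Size([2, 64, 16]) -> torch.Size([2, 8, 16])
```

In contrast to a fixed heuristic like this, the methods discussed in the talk learn the compression decisions end-to-end during training.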