Workshop
Tokenization Workshop (TokShop)
Tomasz Limisiewicz · Valentin Hofmann · Sachin Kumar · Farhan Samir · Jindřich Libovický · Jindřich Helcl · Orevaoghene Ahia · Elizabeth Salesky
West Meeting Room 111-112
Fri 18 Jul, 8:45 a.m. PDT
Tokenization defines how data are represented as input and output for many current machine learning systems, including language models. Tokenization has been shown to significantly affect the utility and effectiveness of these models (Mielke et al., 2021). This finding has stirred considerable interest in tokenization as a research direction in machine learning and its subfields, such as natural language processing, but currently, there is no venue specifically dedicated to it. Our initiative—TokShop (Tokenization Workshop)—aims to fill this gap and will focus on tokenization in a broad sense.
Timezone: America/Los_Angeles
Schedule
Fri 8:45 a.m. - 9:00 a.m.
|
Opening
(
Opening
)
>
|
🔗 |
Fri 9:00 a.m. - 9:10 a.m.
|
Coffee Break
|
🔗 |
Fri 9:10 a.m. - 10:00 a.m.
|
Beat them? Join them? Fix them? Tokenization Research in a Downstream World
(
Invited Talk
)
>
|
Yuval Pinter 🔗 |
Fri 10:00 a.m. - 10:50 a.m.
|
Insights from Pixel Language Modeling
(
Invited Talk
)
>
|
Desmond Elliott 🔗 |
Fri 10:50 a.m. - 12:00 p.m.
|
Poster Session: Tokenization of Text
(
Poster Session
)
>
|
🔗 |
|
→ Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations ( Poster ) > link | Brian Zheng · Alisa Liu · Orevaoghene M Ahia · Jonathan Hayase · Yejin Choi · Noah Smith 🔗 |
|
→ Subword Tokenization Strategies for Kurdish Word Embeddings ( Poster ) > link | Ali Salehi · Cassandra Jacobs 🔗 |
|
→ Continuous Chain of Thought Enables Parallel Exploration and Reasoning ( Poster ) > link | Halil Alperen Gozeten · Muhammed Emrullah Ildiz · Xuechen Zhang · Hrayr Harutyunyan · Ankit Singh Rawat · Samet Oymak 🔗 |
|
→ Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8 ( Poster ) > link | Preston Firestone · Shubham Ugare · Gagandeep Singh · Sasa Misailovic 🔗 |
|
→ Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives ( Poster ) > link | Ander Artola Velasco · Stratis Tsirtsis · Nastaran Okati · Manuel Gomez-Rodriguez 🔗 |
|
→ Evaluating Morphological Alignment of Tokenizers in 70 Languages ( Poster ) > link | Catherine Arnett · Marisa Hudspeth · Brendan O'Connor 🔗 |
|
→ Byte Latent Transformer: Patches Scale Better Than Tokens ( Poster ) > link | Artidoro Pagnoni · Ramakanth Pasunuru · Pedro Rodriguez · John Nguyen · Benjamin Muller · Margaret Li · Chunting Zhou · Lili Yu · Jason Weston · Luke Zettlemoyer · Gargi Ghosh · Mike Lewis · Ari Holtzman · Srinivasan Iyer 🔗 |
|
→ Contextual morphologically-guided tokenization for pretrained Latin BERT models ( Poster ) > link | Marisa Hudspeth · Patrick J. Burns · Brendan O'Connor 🔗 |
|
→ SuperBPE: Space Travel for Language Models ( Poster ) > link | Alisa Liu · Jonathan Hayase · Valentin Hofmann · Sewoong Oh · Noah Smith · Yejin Choi 🔗 |
|
→ zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression ( Poster ) > link | Saibo Geng · Nathan Thomas Elian Ranchin · Yunzhen Yao · Maxime Peyrard · Chris Wendler · Michael Gastpar · Robert West 🔗 |
|
→ How Much is Enough? The Diminishing Returns of Tokenization Training Data ( Poster ) > link | Varshini Reddy · Craig Schmidt · Yuval Pinter · Chris Tanner 🔗 |
|
→ FLEXITOKENS: Flexible Tokenization for Evolving Language Models ( Poster ) > link | Abraham Owodunni · Orevaoghene M Ahia · Sachin Kumar 🔗 |
|
→ Sampling from Your Language Model One Byte at a Time ( Poster ) > link | Jonathan Hayase · Alisa Liu · Noah Smith · Sewoong Oh 🔗 |
|
→ BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization ( Poster ) > link | Sander Land · Catherine Arnett 🔗 |
|
→ ByteSpan: Information-Driven Subword Tokenisation ( Poster ) > link | Zebulon Goriely · Suchir Salhan · Pietro Lesci · Julius Cheng · Paula Buttery 🔗 |
Fri 10:50 a.m. - 12:00 p.m.
|
MorphTok: Morphologically Grounded Tokenization for Indic languages ( Poster with Prerecorded Video ) > link | Maharaj Brahma · N J Karthika · Atul Singh · Devaraja Adiga · Smruti Bhate · Ganesh Ramakrishnan · Rohit Saluja · Maunendra Sankar Desarkar 🔗 |
Fri 10:50 a.m. - 12:00 p.m.
|
Causal Estimation of Tokenisation Bias ( Poster with Prerecorded Video ) > link | Pietro Lesci · Clara Meister · Thomas Hofmann · Andreas Vlachos · Tiago Pimentel 🔗 |
Fri 10:50 a.m. - 12:00 p.m.
|
Adversarial Tokenization ( Poster with Prerecorded Video ) > link | Renato Geh · Zilei Shao · Guy Van den Broeck 🔗 |
Fri 10:50 a.m. - 12:00 p.m.
|
InCa and InDia: Inline Casing and Diacritization Preprocessing For Robust-to-Noise Tokenization and Interpretability ( Poster with Prerecorded Video ) > link | Kirill Semenov · Martin Popel 🔗 |
Fri 12:00 p.m. - 1:00 p.m.
|
Lunch Break
|
🔗 |
Fri 1:00 p.m. - 1:50 p.m.
|
Learning Dynamic Segmentation and Compression of Sequences in Transformer LLMs
(
Invited Talk
)
>
|
Adrian Łańcucki 🔗 |
Fri 1:50 p.m. - 3:00 p.m.
|
Poster Session: Tokenization Across Modalities
(
Poster Session
)
>
|
🔗 |
|
→ How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them ( Poster ) > link | Disen Liao · Freda Shi 🔗 |
|
→ Robust Noise Attenuation via Adaptive Pooling of Transformer Outputs ( Poster ) > link | Greyson Brothers 🔗 |
|
→ Canonical Autoregressive Generation ( Poster ) > link | Ivi Chatzi · Nina Corvelo Benz · Stratis Tsirtsis · Manuel Gomez-Rodriguez 🔗 |
|
→ You Only Train Once: Efficient Tokenizer Selection for Arithmetic in Language Models ( Poster ) > link | Mucong Ding · Sean McLeish · Kazem Meidani · Igor Melnyk · Nam Nguyen · C. Bayan Bruss · Furong Huang 🔗 |
|
→ Conditional Unigram Tokenization with Parallel Data ( Poster ) > link | Gianluca Vico · Jindřich Libovický 🔗 |
|
→ One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression ( Poster ) > link | Keita Miwa · Kento Sasaki · Hidehisa Arai · Tsubasa Takahashi · Yu Yamaguchi 🔗 |
|
→ Tokenizing Nonverbal Communication in Salsa Dance ( Poster ) > link | Bermet Burkanova · Payam Jome Yazdian · Chuxuan Zhang · Trinity Evans · Paige Tuttösí · Angelica Lim 🔗 |
|
→ Watermarking Autoregressive Image Generation ( Poster ) > link | Nikola Jovanović · Ismail Labiad · Tomas Soucek · Martin Vechev · Pierre Fernandez 🔗 |
|
→ QuickMerge++: Token Merging with Autoregressive Prior ( Poster ) > link | Dong Liu · Yanxuan Yu 🔗 |
|
→ Overcoming Vocabulary Constraints with Pixel-level Fallback ( Poster ) > link | Jonas F. Lotz · Hendra Setiawan · Stephan Peitz · Yova Kementchedjhieva 🔗 |
|
→ Continuous Autoregressive Generation with Mixture of Gaussians ( Poster ) > link | Alex Quach · Johnson Tsun-Hsuan Wang · Ramin Hasani · Mathias Lechner · Alexander Amini 🔗 |
|
→ Motion-Focused Tokenization for Source-Free Video Domain Adaptation ( Poster ) > link | Tzu Ling Liu · Ian Stavness · Mrigank Rochan 🔗 |
|
→ Discrete JEPA: Learning Discrete Token Representations without Reconstruction ( Poster ) > link | Junyeob Baek · Hosung Lee · Christopher Hoang · Mengye Ren · Sungjin Ahn 🔗 |
|
→ CAT: Content-Adaptive Image Tokenization ( Poster ) > link | Junhong Shen · Kushal Tirumala · Michihiro Yasunaga · Ishan Misra · Luke Zettlemoyer · Lili Yu · Chunting Zhou 🔗 |
|
→ Entropy-Driven Pre-tokenization for Byte Pair Encoding ( Poster ) > link | Yifan Hu · Ningyue Liang · Dachuan Zhao · Jonathan Geuter · Varshini Reddy · Craig Schmidt · Chris Tanner 🔗 |
Fri 1:50 p.m. - 3:00 p.m.
|
Tokenisation is NP-Complete ( Poster with Prerecorded Video ) > link | Philip Whittington · Gregor Bachmann · Tiago Pimentel 🔗 |
Fri 1:50 p.m. - 3:00 p.m.
|
HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling ( Poster with Prerecorded Video ) > link | Rongkun Xue · Yazhe Niu · Shuai Hu · Zixin Yin · Yongqiang Yao · Jing Yang 🔗 |
Fri 1:50 p.m. - 3:00 p.m.
|
GeneticBPE: Motif-Preserving Tokenization for Robust miRNA Modeling ( Poster with Prerecorded Video ) > link | Prabhav Sanga · Jaskaran Singh · Arun Dubey 🔗 |
Fri 1:50 p.m. - 3:00 p.m.
|
Pitfalls, Subtleties, and Techniques in Automata-Based Subword-Level Constrained Generation ( Poster with Prerecorded Video ) > link | Marco Cognetta · David Pohl · Junyoung Lee · Naoaki Okazaki 🔗 |
Fri 3:00 p.m. - 3:30 p.m.
|
Coffee Break
|
🔗 |
Fri 3:30 p.m. - 4:30 p.m.
|
Panel: Future of Tokenization
(
Panel Discussion
)
>
|
🔗 |
Fri 4:30 p.m. - 5:00 p.m.
|
Best Paper Session
|
🔗 |
Fri 5:00 p.m. - 5:30 p.m.
|
Closing Remarks
|
🔗 |