Tokenization Workshop (TokShop)

Tomasz Limisiewicz · Valentin Hofmann · Sachin Kumar · Farhan Samir · Jindřich Libovický · Jindřich Helcl · Orevaoghene Ahia · Elizabeth Salesky

West Meeting Room 111-112

Fri 18 Jul, 8:45 a.m. PDT

Tokenization defines how data are represented as input and output for many current machine learning systems, including language models. Tokenization has been shown to significantly affect the utility and effectiveness of these models (Mielke et al., 2021). This finding has stirred considerable interest in tokenization as a research direction in machine learning and its subfields, such as natural language processing, but currently, there is no venue specifically dedicated to it. Our initiative—TokShop (Tokenization Workshop)—aims to fill this gap and will focus on tokenization in a broad sense.

Timezone: America/Los_Angeles

Schedule

Fri 8:45 a.m. - 9:00 a.m.	Opening
Fri 9:00 a.m. - 9:10 a.m.	Coffee Break
Fri 9:10 a.m. - 10:00 a.m.	Beat them? Join them? Fix them? Tokenization Research in a Downstream World (Invited Talk)	Yuval Pinter
Fri 10:00 a.m. - 10:50 a.m.	Insights from Pixel Language Modeling (Invited Talk)	Desmond Elliott
Fri 10:50 a.m. - 12:00 p.m.	Poster Session: Tokenization of Text (Poster Session)
	→ Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations (Poster)	Brian Zheng · Alisa Liu · Orevaoghene M Ahia · Jonathan Hayase · Yejin Choi · Noah Smith
	→ Subword Tokenization Strategies for Kurdish Word Embeddings (Poster)	Ali Salehi · Cassandra Jacobs
	→ Continuous Chain of Thought Enables Parallel Exploration and Reasoning (Poster)	Halil Alperen Gozeten · Muhammed Emrullah Ildiz · Xuechen Zhang · Hrayr Harutyunyan · Ankit Singh Rawat · Samet Oymak
	→ Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8 (Poster)	Preston Firestone · Shubham Ugare · Gagandeep Singh · Sasa Misailovic
	→ Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives (Poster)	Ander Artola Velasco · Stratis Tsirtsis · Nastaran Okati · Manuel Gomez-Rodriguez
	→ Evaluating Morphological Alignment of Tokenizers in 70 Languages (Poster)	Catherine Arnett · Marisa Hudspeth · Brendan O'Connor
	→ Byte Latent Transformer: Patches Scale Better Than Tokens (Poster)	Artidoro Pagnoni · Ramakanth Pasunuru · Pedro Rodriguez · John Nguyen · Benjamin Muller · Margaret Li · Chunting Zhou · LILI YU · JASON WESTON · Luke Zettlemoyer · Gargi Ghosh · Mike Lewis · Ari Holtzman · Srinivasan Iyer
	→ Contextual morphologically-guided tokenization for pretrained Latin BERT models (Poster)	Marisa Hudspeth · Patrick J. Burns · Brendan O'Connor
	→ SuperBPE: Space Travel for Language Models (Poster)	Alisa Liu · Jonathan Hayase · Valentin Hofmann · Sewoong Oh · Noah Smith · Yejin Choi
	→ zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression (Poster)	Saibo Geng · Nathan Thomas Elian Ranchin · Yunzhen Yao · Maxime Peyrard · Chris Wendler · Michael Gastpar · Robert West
	→ How Much is Enough? The Diminishing Returns of Tokenization Training Data (Poster)	Varshini Reddy · Craig Schmidt · Yuval Pinter · Chris Tanner
	→ FLEXITOKENS: Flexible Tokenization for Evolving Language Models (Poster)	Abraham Owodunni · Orevaoghene M Ahia · Sachin Kumar
	→ Sampling from Your Language Model One Byte at a Time (Poster)	Jonathan Hayase · Alisa Liu · Noah Smith · Sewoong Oh
	→ BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization (Poster)	Sander Land · Catherine Arnett
	→ ByteSpan: Information-Driven Subword Tokenisation (Poster)	Zebulon Goriely · Suchir Salhan · Pietro Lesci · Julius Cheng · Paula Buttery
Fri 10:50 a.m. - 12:00 p.m.	MorphTok: Morphologically Grounded Tokenization for Indic languages (Poster with Prerecorded Video)	Maharaj Brahma · N J Karthika · Atul Singh · Devaraja Adiga · Smruti Bhate · Ganesh Ramakrishnan · Rohit Saluja · Maunendra Sankar Desarkar
Fri 10:50 a.m. - 12:00 p.m.	Causal Estimation of Tokenisation Bias (Poster with Prerecorded Video)	Pietro Lesci · Clara Meister · Thomas Hofmann · Andreas Vlachos · Tiago Pimentel
Fri 10:50 a.m. - 12:00 p.m.	Adversarial Tokenization (Poster with Prerecorded Video)	Renato Geh · Zilei Shao · Guy Van den Broeck
Fri 10:50 a.m. - 12:00 p.m.	InCa and InDia: Inline Casing and Diacritization Preprocessing For Robust-to-Noise Tokenization and Interpretability (Poster with Prerecorded Video)	Kirill Semenov · Martin Popel
Fri 12:00 p.m. - 1:00 p.m.	Lunch Break
Fri 1:00 p.m. - 1:50 p.m.	Learning Dynamic Segmentation and Compression of Sequences in Transformer LLMs (Invited Talk)	Adrian Łańcucki
Fri 1:50 p.m. - 3:00 p.m.	Poster Session: Tokenization Across Modalities (Poster Session)
	→ How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them (Poster)	Disen Liao · Freda Shi
	→ Robust Noise Attenuation via Adaptive Pooling of Transformer Outputs (Poster)	Greyson Brothers
	→ Canonical Autoregressive Generation (Poster)	Ivi Chatzi · Nina Corvelo Benz · Stratis Tsirtsis · Manuel Gomez-Rodriguez
	→ You Only Train Once: Efficient Tokenizer Selection for Arithmetic in Language Models (Poster)	Mucong Ding · Sean McLeish · Kazem Meidani · Igor Melnyk · Nam Nguyen · C. Bayan Bruss · Furong Huang
	→ Conditional Unigram Tokenization with Parallel Data (Poster)	Gianluca Vico · Jindřich Libovický
	→ One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression (Poster)	Keita Miwa · Kento Sasaki · Hidehisa Arai · Tsubasa Takahashi · Yu Yamaguchi
	→ Tokenizing Nonverbal Communication in Salsa Dance (Poster)	Bermet Burkanova · Payam Jome Yazdian · Chuxuan Zhang · Trinity Evans · Paige Tuttösí · Angelica Lim
	→ Watermarking Autoregressive Image Generation (Poster)	Nikola Jovanović · Ismail Labiad · Tomas Soucek · Martin Vechev · Pierre Fernandez
	→ QuickMerge++: Token Merging with Autoregressive Prior (Poster)	Dong Liu · Yanxuan Yu
	→ Overcoming Vocabulary Constraints with Pixel-level Fallback (Poster)	Jonas F. Lotz · Hendra Setiawan · Stephan Peitz · Yova Kementchedjhieva
	→ Continuous Autoregressive Generation with Mixture of Gaussians (Poster)	Alex Quach · Johnson Tsun-Hsuan Wang · Ramin Hasani · Mathias Lechner · Alexander Amini
	→ Motion-Focused Tokenization for Source-Free Video Domain Adaptation (Poster)	Tzu Ling Liu · Ian Stavness · Mrigank Rochan
	→ Discrete JEPA: Learning Discrete Token Representations without Reconstruction (Poster)	Junyeob Baek · Hosung Lee · Christopher Hoang · Mengye Ren · Sungjin Ahn
	→ CAT: Content-Adaptive Image Tokenization (Poster)	Junhong Shen · Kushal Tirumala · Michihiro Yasunaga · Ishan Misra · Luke Zettlemoyer · LILI YU · Chunting Zhou
	→ Entropy-Driven Pre-tokenization for Byte Pair Encoding (Poster)	Yifan Hu · Ningyue Liang · Dachuan Zhao · Jonathan Geuter · Varshini Reddy · Craig Schmidt · Chris Tanner
Fri 1:50 p.m. - 3:00 p.m.	Tokenisation is NP-Complete (Poster with Prerecorded Video)	Philip Whittington · Gregor Bachmann · Tiago Pimentel
Fri 1:50 p.m. - 3:00 p.m.	HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling (Poster with Prerecorded Video)	rongkun xue · Yazhe Niu · Shuai Hu · Zixin Yin · Yongqiang Yao · Jing Yang
Fri 1:50 p.m. - 3:00 p.m.	GeneticBPE: Motif-Preserving Tokenization for Robust miRNA Modeling (Poster with Prerecorded Video)	Prabhav Sanga · Jaskaran Singh · ARUN DUBEY
Fri 1:50 p.m. - 3:00 p.m.	Pitfalls, Subtleties, and Techniques in Automata-Based Subword-Level Constrained Generation (Poster with Prerecorded Video)	Marco Cognetta · David Pohl · Junyoung Lee · Naoaki Okazaki
Fri 3:00 p.m. - 3:30 p.m.	Coffee Break
Fri 3:30 p.m. - 4:30 p.m.	Panel: Future of Tokenization (Panel Discussion)
Fri 4:30 p.m. - 5:00 p.m.	Best Paper Session
Fri 5:00 p.m. - 5:30 p.m.	Closing Remarks