
Poster in Workshop: Tokenization Workshop (TokShop)

Entropy-Driven Pre-tokenization for Byte Pair Encoding

Yifan Hu · Ningyue Liang · Dachuan Zhao · Jonathan Geuter · Varshini Reddy · Craig Schmidt · Chris Tanner

Keywords: [ Tokenization ] [ Information Theory ] [ Entropy-based Segmentation ] [ Byte Pair Encoding ]

Fri 18 Jul 1:50 p.m. PDT — 3 p.m. PDT

Abstract:

Byte Pair Encoding (BPE) has become a widely adopted subword tokenization method in modern pretrained language models due to its simplicity and strong empirical performance across downstream tasks. However, applying BPE to unsegmented languages such as Chinese presents significant challenges, as its frequency-driven merge operations are agnostic to linguistic boundaries. To address this, we propose two entropy-informed pre-tokenization strategies that guide BPE segmentation using unsupervised information-theoretic cues. The first approach uses pointwise mutual information and left/right entropy to identify coherent character spans, while the second leverages predictive entropy derived from a pretrained GPT-2 model to detect boundary uncertainty. We evaluate both methods on a subset of the PKU corpus and demonstrate substantial improvements in segmentation precision, recall, and F1 score compared to standard BPE. Our results suggest that entropy-guided pre-tokenization not only enhances alignment with gold-standard linguistic units but also offers a promising direction for improving tokenization quality in low-resource and multilingual settings.
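
To make the first strategy concrete, below is a minimal sketch of PMI- and branching-entropy-based boundary detection built from corpus counts alone. It is an illustration, not the authors' implementation: the thresholds (`pmi_thr`, `ent_thr`), the base-2 logarithm, and the exact way PMI is combined with left/right entropy are assumptions.

```python
import math
from collections import Counter, defaultdict

def corpus_stats(sentences):
    """Character unigram/bigram counts, plus each character's left and
    right neighbour distributions (used for branching entropy)."""
    uni, bi = Counter(), Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    for s in sentences:
        uni.update(s)
        for a, b in zip(s, s[1:]):
            bi[a + b] += 1
            right[a][b] += 1   # what follows a
            left[b][a] += 1    # what precedes b
    return uni, bi, left, right

def entropy(counter):
    """Shannon entropy (bits) of a neighbour distribution."""
    n = sum(counter.values())
    return -sum(c / n * math.log(c / n, 2) for c in counter.values()) if n else 0.0

def pmi(a, b, uni, bi, total):
    """PMI(a, b) = log p(ab) / (p(a) p(b)); small constants guard zeros."""
    return math.log((bi[a + b] + 1e-9) * total / max(uni[a] * uni[b], 1), 2)

def segment(sent, uni, bi, left, right, total, pmi_thr=0.0, ent_thr=2.5):
    """Cut between adjacent characters when their PMI is low (weak
    cohesion) or branching entropy is high (many possible neighbours,
    a classic unsupervised word-boundary cue)."""
    spans, start = [], 0
    for i in range(len(sent) - 1):
        a, b = sent[i], sent[i + 1]
        weak_link = pmi(a, b, uni, bi, total) < pmi_thr
        open_context = entropy(right[a]) > ent_thr or entropy(left[b]) > ent_thr
        if weak_link or open_context:
            spans.append(sent[start:i + 1])
            start = i + 1
    spans.append(sent[start:])
    return spans

corpus = ["我爱北京天安门", "天安门上太阳升"]
uni, bi, left, right = corpus_stats(corpus)
print(segment("我爱北京天安门", uni, bi, left, right, sum(uni.values())))
```

The resulting spans would then be fed to BPE as pre-tokenized units, so that frequency-driven merges never cross an entropy-detected boundary.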
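The second strategy can be sketched in the same spirit with an off-the-shelf causal language model. The abstract specifies a pretrained GPT-2; everything else here is an assumption: the sketch requires a checkpoint whose tokenizer maps one Chinese character to one token (uer/gpt2-chinese-cluecorpussmall is used purely as a stand-in), and the entropy threshold is illustrative.

```python
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

# Assumed stand-in checkpoint: any causal LM with one token per character works.
MODEL = "uer/gpt2-chinese-cluecorpussmall"
tok = AutoTokenizer.from_pretrained(MODEL)
model = GPT2LMHeadModel.from_pretrained(MODEL).eval()

@torch.no_grad()
def predictive_entropies(sent):
    """Entropy of the next-token distribution after each character.
    A spike means the model is uncertain what comes next, which we
    read as a candidate word boundary."""
    ids = tok(sent, return_tensors="pt", add_special_tokens=False).input_ids
    assert ids.shape[1] == len(sent), "assumes one token per character"
    logp = torch.log_softmax(model(ids).logits[0], dim=-1)  # (seq, vocab)
    return (-(logp.exp() * logp).sum(-1)).tolist()          # nats per position

def segment_by_entropy(sent, threshold=4.0):
    """Cut after character i when predictive entropy at i exceeds the
    (illustrative) threshold; the final position is never an internal cut."""
    ents = predictive_entropies(sent)
    spans, start = [], 0
    for i, h in enumerate(ents[:len(sent) - 1]):
        if h > threshold:
            spans.append(sent[start:i + 1])
            start = i + 1
    spans.append(sent[start:])
    return spans

print(segment_by_entropy("我爱北京天安门"))
```

Unlike the count-based sketch, this variant needs no corpus statistics at segmentation time; the uncertainty signal comes entirely from the pretrained model's next-token distribution.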
