

Poster in Workshop: Tokenization Workshop (TokShop)

Evaluating Morphological Alignment of Tokenizers in 70 Languages

Catherine Arnett · Marisa Hudspeth · Brendan O'Connor

Keywords: [ tokenization ] [ morphology ] [ multilingual NLP ]

[ Project Page ]
Fri 18 Jul 10:50 a.m. PDT — noon PDT

Abstract:

While tokenization is a key step in language modeling, with effects on model training and performance, it remains unclear how to effectively evaluate tokenizer quality. One proposed dimension of tokenizer quality is the extent to which tokenizers preserve linguistically meaningful subwords, aligning token boundaries with morphological boundaries within a word. Here, we expand on previous work and develop datasets for 86 languages, which can be used to study tokenizer quality crosslinguistically. We also develop a new evaluation framework, addressing limitations of previous evaluations and providing flexible evaluation for 71 of those languages. We then correlate our alignment scores with downstream task performance for five pre-trained language models on seven tasks, with at least one task in each of the languages in our sample. We find that morphological alignment does not explain very much variance in model performance, suggesting that morphological alignment alone does not measure dimensions of tokenization quality relevant to model performance.
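
To make the idea of aligning token boundaries with morphological boundaries concrete, here is a minimal sketch of one possible way to score a single word: compare the split points a tokenizer produces against gold morpheme split points and report precision, recall, and F1. This is an illustrative assumption, not the evaluation framework described in the abstract; the function names `boundaries` and `boundary_scores` and the example segmentations are hypothetical.

```python
def boundaries(segments):
    """Return the internal character offsets at which a segmentation splits a word."""
    offsets = set()
    position = 0
    # The end of the last segment is the end of the word, not a split point.
    for segment in segments[:-1]:
        position += len(segment)
        offsets.add(position)
    return offsets


def boundary_scores(token_segments, morpheme_segments):
    """Hypothetical boundary precision/recall/F1 for one word.

    Assumption: both segmentations concatenate to the same surface string.
    """
    if "".join(token_segments) != "".join(morpheme_segments):
        raise ValueError("segmentations must spell the same word")

    predicted = boundaries(token_segments)
    gold = boundaries(morpheme_segments)
    matched = len(predicted & gold)

    # Convention assumed here: an empty boundary set counts as perfect.
    precision = matched / len(predicted) if predicted else 1.0
    recall = matched / len(gold) if gold else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative segmentations of "unbelievable" (not taken from the paper's data).
aligned = boundary_scores(["un", "believ", "able"], ["un", "believ", "able"])
misaligned = boundary_scores(["unb", "eliev", "able"], ["un", "believ", "able"])
```

In this sketch, a tokenizer that keeps a word whole gets perfect precision but zero recall whenever the word has internal morpheme boundaries, which is one reason a single number can be hard to interpret across languages with different morphological complexity.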
