Poster in Workshop: Tokenization Workshop (TokShop)

Contextual morphologically-guided tokenization for pretrained Latin BERT models

Marisa Hudspeth · Patrick J. Burns · Brendan O'Connor

Keywords: [ morphological segmentation ] [ morphologically-rich languages ] [ POS tagging ] [ morphological analysis ] [ subword representations ]

Fri 18 Jul 10:50 a.m. PDT — noon PDT

Abstract:

Tokenization is a critical component of language model pretraining, yet standard tokenization methods often prioritize information-theoretic goals like high compression and low fertility over linguistic goals like morphological alignment. Indeed, they have been shown to be suboptimal for morphologically rich languages, where tokenization quality directly impacts downstream performance. In this work, we investigate morphologically-aware tokenization for Latin, a morphologically rich, medium-resource language. For both the standard WordPiece and Unigram Language Model (ULM) tokenization models, we propose two variants: one seeded with known morphological suffixes in the tokenizer vocabulary, and another using contextual pre-tokenization with a language-specific, lexicon-based morphological analyzer. With each learned tokenizer, we pretrain a Latin BERT model and evaluate it on POS and morphological feature classification. We find that morphologically-guided tokenization improves overall performance (e.g., a 36% relative error reduction in morphological feature accuracy), with particularly large gains for specific, morphologically-signalled features (e.g., a 54% relative error reduction for tense prediction). Our results highlight the utility of morphological linguistic resources for improving language modeling for morphologically complex languages.
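
The sketch below (not the authors' code) illustrates the two proposed variants with the Hugging Face tokenizers library: seeding a WordPiece vocabulary with known Latin suffixes, and pre-segmenting words at morpheme boundaries with a lexicon-based analyzer. The suffix list, the toy analysis table, and the seeding mechanism used here (passing suffixes as guaranteed vocabulary entries to the trainer) are illustrative assumptions, not the paper's actual resources or procedure.

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Variant 1: seed the WordPiece vocabulary with known Latin suffixes so the
# trainer is guaranteed to retain them as continuation pieces.
SEED_SUFFIXES = ["##ibus", "##orum", "##arum", "##isti", "##tur", "##mus"]  # illustrative

def train_seeded_wordpiece(corpus_files, vocab_size=30000):
    tok = Tokenizer(models.WordPiece(unk_token="[UNK]"))
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.WordPieceTrainer(
        vocab_size=vocab_size,
        special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"] + SEED_SUFFIXES,
    )
    tok.train(corpus_files, trainer)
    return tok

# Variant 2: contextual pre-tokenization -- split each word at its morpheme
# boundary before subword training, so learned pieces cannot straddle it.
# TOY_ANALYSES is a hypothetical stand-in for a lexicon-based Latin analyzer.
TOY_ANALYSES = {
    "regibus": ("reg", "ibus"),
    "amavit": ("amav", "it"),
    "rosarum": ("ros", "arum"),
}

def morph_pretokenize(sentence):
    pieces = []
    for word in sentence.lower().split():
        stem, suffix = TOY_ANALYSES.get(word, (word, ""))
        pieces.append(stem)
        if suffix:
            pieces.append("##" + suffix)  # mark suffixes as continuation pieces
    return pieces

if __name__ == "__main__":
    print(morph_pretokenize("Regibus amavit rosarum"))
    # ['reg', '##ibus', 'amav', '##it', 'ros', '##arum']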
