

Poster with Prerecorded Video
in
Workshop: Tokenization Workshop (TokShop)

InCa and InDia: Inline Casing and Diacritization Preprocessing For Robust-to-Noise Tokenization and Interpretability

Kirill Semenov · Martin Popel

Keywords: [ Unigram Language Model tokenization ] [ inline casing ] [ subword segmentation ] [ Neural Machine Translation ]

Fri 18 Jul 10:50 a.m. PDT — noon PDT

Abstract:

We introduce two inline preprocessing approaches for tokenization: InCa for casing and InDia for diacritics. Both rely on an automatically built external dictionary that stores the most frequent casing or diacritization of each word, so that only infrequent spellings need to be marked inline. We show that in a number of noising scenarios our casing algorithm performs best, and in the cases where it performs on par with alternative solutions, the intrinsic parameters of a tokenizer trained on our data are more stable. As for inline diacritization, ours is, to our knowledge, the first solution of this type; we show that it improves robustness to de-diacritized text compared to tokenization without preprocessing. We share our preprocessing systems in a public GitHub repository.
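The casing idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the marker token `<C>` and all function names are hypothetical, and the dictionary here simply maps each lowercased word to its most frequent surface casing, with only non-default casings flagged inline.

```python
from collections import Counter, defaultdict

CASE_MARK = "<C>"  # hypothetical marker for a non-default casing

def build_case_dict(corpus_tokens):
    """Map each lowercased word to its most frequent surface casing."""
    counts = defaultdict(Counter)
    for tok in corpus_tokens:
        counts[tok.lower()][tok] += 1
    return {low: c.most_common(1)[0][0] for low, c in counts.items()}

def encode(tokens, case_dict):
    """Lowercase words with their default casing; mark the rest inline."""
    out = []
    for tok in tokens:
        low = tok.lower()
        if tok == case_dict.get(low, low):
            out.append(low)          # default casing: no marker needed
        else:
            out.extend([CASE_MARK, tok])  # unusual casing kept verbatim
    return out

def decode(tokens, case_dict):
    """Invert encode(): restore default casings from the dictionary."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] == CASE_MARK:
            out.append(tokens[i + 1])
            i += 2
        else:
            out.append(case_dict.get(tokens[i], tokens[i]))
            i += 1
    return out
```

Because frequent casings are normalized away before subword training, the tokenizer sees a smaller, more consistent vocabulary, while the marker keeps the scheme lossless.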
