Poster with Prerecorded Video in Workshop: Tokenization Workshop (TokShop)
InCa and InDia: Inline Casing and Diacritization Preprocessing For Robust-to-Noise Tokenization and Interpretability
Kirill Semenov · Martin Popel
Keywords: [ Unigram Language Model tokenization ] [ inline casing ] [ subword segmentation ] [ Neural Machine Translation ]
We introduce two inline preprocessing approaches for tokenization that handle casing (InCa) and diacritics (InDia). Both rely on an automatically built external dictionary that stores the most frequent casing or diacritization of each word, so that only infrequent spellings need to be marked inline. We show that our casing algorithm achieves the best performance in a number of noising scenarios, and that where it performs on par with alternative solutions, the intrinsic parameters of a tokenizer trained on our data are more stable. For inline diacritization, ours is, to our knowledge, the first solution of this type; we show that it improves robustness to de-diacritized texts compared to tokenization without preprocessing. Our preprocessing systems are available in a public GitHub repository.
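The abstract does not give implementation details, but the core idea (an external dictionary of each word's most frequent casing, with inline markers only for deviating spellings) can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' code: the function names and the marker tokens (`<lo>`, `<ti>`, `<up>`, `<mix>`) are hypothetical.

```python
from collections import Counter, defaultdict

def build_casing_dict(corpus_tokens):
    """Map each lowercased word to its most frequent surface casing
    observed in the corpus (the 'default' spelling)."""
    counts = defaultdict(Counter)
    for tok in corpus_tokens:
        counts[tok.lower()][tok] += 1
    return {low: c.most_common(1)[0][0] for low, c in counts.items()}

def encode(tokens, casing_dict):
    """Lowercase the text; emit a marker only when a token's casing
    deviates from the dictionary default (illustrative markers)."""
    out = []
    for tok in tokens:
        low = tok.lower()
        default = casing_dict.get(low, low)
        if tok == default:
            out.append(low)                  # default casing: no marker
        elif tok == low:
            out.extend(["<lo>", low])        # unexpectedly lowercase
        elif tok == low.capitalize():
            out.extend(["<ti>", low])        # unexpectedly title-case
        elif tok == low.upper():
            out.extend(["<up>", low])        # unexpectedly all-caps
        else:
            out.extend(["<mix>", tok])       # rare mixed casing, kept verbatim
    return out

def decode(tokens, casing_dict):
    """Invert encode(): unmarked tokens get the dictionary default,
    marked tokens get the casing the marker describes."""
    out, pending = [], None
    for tok in tokens:
        if tok in ("<lo>", "<ti>", "<up>", "<mix>"):
            pending = tok
            continue
        if pending == "<ti>":
            out.append(tok.capitalize())
        elif pending == "<up>":
            out.append(tok.upper())
        elif pending in ("<lo>", "<mix>"):
            out.append(tok)
        else:
            out.append(casing_dict.get(tok, tok))
        pending = None
    return out
```

Because frequent spellings carry no marker, the encoded text stays close in length to plain lowercased text while remaining losslessly reversible; the analogous scheme for diacritics would replace the casing transforms with diacritic stripping and restoration.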