

Poster with Prerecorded Video in Workshop: Tokenization Workshop (TokShop)

GeneticBPE: Motif-Preserving Tokenization for Robust miRNA Modeling

Prabhav Sanga · Jaskaran Singh · ARUN DUBEY

Keywords: [ Biological Sequence Modeling ] [ GeneticBPE ] [ Conserved Regions ] [ Motif Preservation ]

Fri 18 Jul 1:50 p.m. PDT — 3 p.m. PDT

Abstract:

Tokenization plays a foundational yet underexplored role in biological sequence modeling. In this work, we present GeneticBPE, a biologically informed tokenization framework that encodes prior structural knowledge, such as seed motifs and conserved regions, into the vocabulary construction process. Unlike standard subword methods that optimize purely for frequency or language-model likelihood, GeneticBPE integrates motif-preservation objectives and generalization-aware constraints into a modified merge scoring scheme. We evaluate our method on binary and multiclass miRNA classification tasks using the MirGeneDB v3.0 dataset and show that GeneticBPE outperforms character-level, k-mer, Unigram, and BPE tokenizations in accuracy, cross-species generalization, and motif fidelity. Theoretical results demonstrate that tokenization directly governs the inductive bias and domain robustness of sequence models. Our findings suggest that tokenization should not be treated as a preprocessing utility, but rather as a design-critical component in biological NLP pipelines. Reproducibility: Code, motif files, and the pretrained tokenizer will be released under the MIT license upon acceptance.
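To make the idea of a motif-aware merge score concrete, here is a minimal illustrative sketch in Python. It is not the authors' implementation, and the abstract does not specify the scoring formula. The sketch assumes a simple rule: each adjacent token pair scores +1 per occurrence, minus a hypothetical `penalty` for every occurrence whose merged token would partially straddle a motif boundary. All function names, the `penalty` parameter, and the toy sequences are assumptions made for illustration.

```python
from collections import Counter


def find_motif_spans(seq, motifs):
    """Return (start, end) spans of every motif occurrence in seq."""
    spans = []
    for motif in motifs:
        start = seq.find(motif)
        while start != -1:
            spans.append((start, start + len(motif)))
            start = seq.find(motif, start + 1)
    return spans


def crosses_boundary(tok_start, tok_end, spans):
    """True if token [tok_start, tok_end) only partially overlaps a motif."""
    for s, e in spans:
        overlaps = tok_start < e and tok_end > s
        inside = s <= tok_start and tok_end <= e
        contains = tok_start <= s and e <= tok_end
        if overlaps and not inside and not contains:
            return True
    return False


def score_pairs(corpus, motifs, penalty):
    """Assumed score: pair frequency minus penalty per motif-breaking merge."""
    scores = Counter()
    for seq, tokens in corpus:
        spans = find_motif_spans(seq, motifs)
        pos = 0
        for left, right in zip(tokens, tokens[1:]):
            scores[(left, right)] += 1.0
            if crosses_boundary(pos, pos + len(left) + len(right), spans):
                scores[(left, right)] -= penalty
            pos += len(left)
    return scores


def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of pair with the merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train(sequences, motifs, num_merges, penalty=2.0):
    """Greedy BPE-style training that uses the motif-penalized score."""
    corpus = [(s, list(s)) for s in sequences]  # start from single characters
    merges = []
    for _ in range(num_merges):
        scores = score_pairs(corpus, motifs, penalty)
        if not scores:
            break
        best, best_score = max(scores.items(), key=lambda kv: (kv[1], kv[0]))
        if best_score <= 0:  # stop when every remaining merge is penalized away
            break
        merges.append(best)
        corpus = [(s, merge_pair(t, best)) for s, t in corpus]
    return merges, [t for _, t in corpus]


if __name__ == "__main__":
    # Toy sequences and a toy motif string, chosen only for illustration.
    toy_sequences = ["UGAGGUAGUAGG", "AGAGGUAGCAGG"]
    toy_merges, toy_tokens = train(toy_sequences, ["GAGGUAG"], num_merges=10)
    print(toy_merges)
    print(toy_tokens)
```

With a small `penalty`, a very frequent pair can still be merged even if some occurrences break a motif; a large `penalty` effectively forbids such merges. How GeneticBPE actually balances these terms, and how it incorporates its generalization-aware constraints, is not stated in the abstract.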
