

Poster in Affinity Workshop: New In ML

The Gemma Sutras: Fine-Tuning Gemma 3 for Sanskrit Sandhi Splitting

Samarth P · Sanjay Mahalingam


Abstract:

Sandhi, the phonological process of merging morphemes in Sanskrit, plays a central role in the language's grammar and expressive power. While the rules of Sandhi formation are well-defined in Pāṇini's Aṣṭādhyāyī, the reverse task of Sandhi splitting is substantially more complex due to inherent ambiguities, optional rule applications, and context-sensitive transformations. Accurate Sandhi splitting is essential for effective tokenization in Sanskrit, which lacks clear word boundaries and often presents densely fused compound forms.

In this work, we propose a data-driven approach to Sandhi splitting by fine-tuning a large language model on a curated dataset of compound Sanskrit words and their morpheme-level decompositions. We construct a labeled dataset of over 48,000 training examples and 2,000 test examples, representing diverse Sandhi patterns across syntactic contexts. Leveraging the Gemma-3 4B model via the Unsloth framework with low-rank adaptation (LoRA) and 4-bit quantization, we fine-tune the model to learn the latent structure of Sandhi transformations and predict accurate splits. Our goal is to build a Sandhi-aware system that serves as a preprocessing module for Sanskrit tokenizers, thereby enhancing the linguistic alignment of classical Sanskrit texts with modern NLP pipelines. Our method demonstrates a scalable and linguistically grounded approach to computational Sanskrit processing and opens new directions for applying large language models to classical languages.
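The abstract does not give the exact training configuration, so the following is a minimal sketch of the kind of setup it describes, assuming Unsloth's FastLanguageModel API. The checkpoint name, sequence length, LoRA rank, target modules, and the instruction/output wording of the training example are illustrative assumptions, not details taken from the poster.

```python
# Sketch of a 4-bit + LoRA fine-tuning setup for Sandhi splitting via Unsloth.
# Hyperparameters and the checkpoint name are assumptions, not the authors' values.
from unsloth import FastLanguageModel

# Load Gemma-3 4B in 4-bit precision to cut memory use during fine-tuning.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-4b-it",  # assumed checkpoint name
    max_seq_length=512,                  # Sandhi pairs are short; assumed value
    load_in_4bit=True,
)

# Attach LoRA adapters so only low-rank update matrices are trained,
# leaving the quantized base weights frozen.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                # LoRA rank; assumed, not stated
    lora_alpha=16,
    lora_dropout=0.0,
    bias="none",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

# One hypothetical supervised example: a fused surface form paired with its
# morpheme-level decomposition (vidyā + ālayaḥ, savarṇa-dīrgha sandhi).
example = {
    "instruction": "Split the following Sanskrit word at its sandhi boundary.",
    "input": "vidyālayaḥ",
    "output": "vidyā + ālayaḥ",
}
```

Such examples would then be formatted into prompts and passed to a standard supervised fine-tuning loop (e.g., TRL's SFTTrainer, which Unsloth interoperates with); the prompt template and trainer settings above are left out because the abstract does not specify them.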
