ICML Lexical Diversity as a Signal for Evaluating Generative Model Understanding: A Contrastive Study of African Languages in Real-World Speech Domains

Poster in Affinity Workshop: New In ML



Abstract:

Understanding how generative models reflect linguistic variation is a key question in assessing their real-world understanding. Yet, in underrepresented languages, particularly across African speech contexts, we lack grounded metrics to benchmark such understanding. This paper presents a contrastive analysis of lexical diversity in Igbo, Yoruba, Hausa, and Nigerian Pidgin, using spoken transcriptions from health, agriculture, and everyday domains. Applying Type-Token Ratio (TTR) and Measure of Textual Lexical Diversity (MTLD), we analyse how language structure and discourse type affect vocabulary use. We propose that these metrics can serve as proxies for evaluating generative model understanding across domains. Our findings show that domain and discourse style significantly shape diversity patterns, highlighting gaps that model-generated outputs would need to bridge in order to approximate real-world speech. These results lay the groundwork for future benchmarking of generative models in low-resource languages using context-sensitive diversity measures.
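
The two metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not describe tokenization or parameter choices, so the whitespace tokenizer, the 0.72 MTLD threshold (a commonly used default), the handling of texts with no repeated words, and all function names here are assumptions made for illustration only.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase and split on whitespace. Real transcripts would
    # likely need language-specific handling of punctuation and diacritics.
    return text.lower().split()


def type_token_ratio(tokens: List[str]) -> float:
    # TTR = number of distinct word types / total number of tokens.
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def _mtld_one_direction(tokens: List[str], threshold: float) -> float:
    # Walk through the tokens, counting a "factor" each time the running
    # TTR of the current segment drops to the threshold, then start a new segment.
    factors = 0.0
    types = set()
    count = 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count <= threshold:
            factors += 1.0
            types = set()
            count = 0
    if count > 0:
        # Partial factor for the leftover segment that never reached the threshold.
        remainder_ttr = len(types) / count
        factors += (1.0 - remainder_ttr) / (1.0 - threshold)
    if factors == 0.0:
        # Assumption: with no repeated token the score is undefined;
        # this sketch reports it as infinity.
        return float("inf")
    return len(tokens) / factors


def mtld(tokens: List[str], threshold: float = 0.72) -> float:
    # MTLD is the mean of a forward pass and a backward pass over the text.
    if not tokens:
        return 0.0
    forward = _mtld_one_direction(tokens, threshold)
    backward = _mtld_one_direction(tokens[::-1], threshold)
    return (forward + backward) / 2.0


if __name__ == "__main__":
    # Toy placeholder tokens, not data from the study.
    sample = tokenize("a b a c a b d a")
    print("TTR:", type_token_ratio(sample))
    print("MTLD:", mtld(sample))
```

TTR falls as a text gets longer, which makes it hard to compare transcripts of different lengths. MTLD is less sensitive to length because it measures how many tokens are needed, on average, before the running TTR drops to the threshold, which is a plausible reason to report both.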

