Poster
Regress, Don't Guess: A Regression-like Loss on Number Tokens for Language Models
Jonas Zausinger · Lars Pennig · Anamarija Kozina · Sean Sdahl · Julian Sikora · Adrian Dendorfer · Timofey Kuznetsov · Mohamad Hagog · Nina Wiedemann · Kacper Chlodny · Vincent Limbach · Anna Ketteler · Thorben Prein · Vishwa Singh · Michael Danziger · Jannis Born
East Exhibition Hall A-B #E-2612
Large language models are great at writing documents and answering questions, but when it comes to math, they often make mistakes. A key reason is that these models have no built-in understanding of how numbers relate to one another. For example, they treat the numbers “2” and “3” as just different words, not as digits that are close together.

To address this, we developed a new way to train language models by giving them additional feedback on numbers. Our method, called Number Token Loss (NTL), explicitly teaches models that “2” and “3” are numerically close, while “2” and “9” are farther apart. It measures how far the model’s predicted probabilities over number tokens are, numerically, from the true value, and penalizes predictions in proportion to that distance.

We tested this on math problems and found that it consistently improves performance. Importantly, our method can be used with any language model, is fast to compute, and is easy to integrate.
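To illustrate the idea, here is a minimal sketch of one regression-style way to realize such a loss: the model's predicted distribution over the digit tokens is collapsed into an expected numeric value, which is then compared to the true digit with a squared error. This is a simplified illustration under stated assumptions, not the authors' exact implementation; the function name number_token_loss and the arguments digit_token_ids and digit_values are hypothetical names introduced for this sketch.

```python
import torch
import torch.nn.functional as F

def number_token_loss(logits, target_ids, digit_token_ids, digit_values):
    """Sketch of a regression-like loss on number tokens (illustrative only).

    logits:          (batch, seq_len, vocab_size) raw model outputs
    target_ids:      (batch, seq_len) ground-truth token ids
    digit_token_ids: (10,) vocabulary ids of the tokens "0".."9"
    digit_values:    (10,) numeric values [0., 1., ..., 9.]
    """
    # Only positions whose ground-truth token is a digit contribute.
    is_digit = torch.isin(target_ids, digit_token_ids)
    if not is_digit.any():
        return logits.new_zeros(())

    # Probabilities restricted to the digit tokens at those positions.
    digit_logits = logits[..., digit_token_ids]               # (B, T, 10)
    digit_probs = F.softmax(digit_logits, dim=-1)[is_digit]   # (N, 10)

    # Expected numeric value under the predicted distribution.
    predicted_value = (digit_probs * digit_values).sum(dim=-1)  # (N,)

    # Numeric value of each ground-truth digit token:
    # map token id -> index within digit_token_ids -> numeric value.
    target_idx = (target_ids[is_digit].unsqueeze(-1) == digit_token_ids).float().argmax(dim=-1)
    true_value = digit_values[target_idx]

    # Regression-style penalty: predicting "3" when the answer is "2"
    # costs less than predicting "9".
    return F.mse_loss(predicted_value, true_value)
```

In practice such a term would be added to the usual cross-entropy objective (e.g. total = cross_entropy + lambda * number_token_loss, with lambda a weighting hyperparameter), so the model still learns ordinary next-token prediction while receiving the extra numeric feedback.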