

Poster
in
Workshop: DataWorld: Unifying data curation frameworks across domains

Data Curation Matters: Model Collapse and Spurious Shift Performance Prediction from Training on Uncurated Text Embeddings

Lucas Mattioli · Youness Ait Hadichou · Sabrina Chaouche · Martin Gonzalez

Keywords: [ Text Embeddings ] [ Distribution Shifts ] [ Data Curation ] [ Model Collapse ] [ Data Quality Metrics ]


Abstract: Training models on uncurated text embeddings (TEs) derived from raw tabular data can lead to a severe failure mode known as $\underline{\text{model collapse}}$, where predictions converge to a single class regardless of input. By comparing models trained with identical hyperparameter configurations on both raw tabular data and their TE-derived counterparts, we find that collapse is a consistent failure mode in the latter setting. We introduce a set of metrics that capture the extent of model collapse, offering a new perspective on TE quality as a proxy for data curation. Our results reveal that TEs alone do not effectively function as a curation layer, and that their quality significantly influences downstream learning. More insidiously, we observe that the presence of model collapse can yield an artificially inflated and $\underline{\text{spurious Accuracy-on-the-Line correlation}}$. These findings highlight the need for more nuanced curation and evaluation of embedding-based representations, particularly in out-of-distribution settings.
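The abstract does not specify the collapse metrics introduced in the paper. As a minimal sketch of the idea, two standard quantities that capture "predictions converge to a single class" are the modal-class fraction and the normalized entropy of the predicted-label histogram; the function below is illustrative, not the authors' implementation.

```python
from collections import Counter
from math import log

def collapse_metrics(preds, n_classes):
    """Quantify how close a classifier's predicted-label distribution
    is to model collapse (all predictions falling in one class).

    Returns (modal_fraction, normalized_entropy):
      modal_fraction     -> 1.0 under full collapse
      normalized_entropy -> 0.0 under full collapse, 1.0 when balanced
    """
    counts = Counter(preds)
    total = len(preds)
    probs = [c / total for c in counts.values()]
    modal_fraction = max(probs)
    # Shannon entropy of the label histogram, normalized by log(n_classes)
    entropy = -sum(p * log(p) for p in probs) / log(n_classes)
    return modal_fraction, entropy
```

A fully collapsed model scores (1.0, 0.0), while a model whose predictions are spread evenly over all classes scores (1/n_classes, 1.0); either number can serve as a simple collapse indicator for the comparison described above.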
