Poster
WildChat-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
Benjamin Feuer · Chinmay Hegde
East Exhibition Hall A-B #E-3301
Large language model (LLM) post-training can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data-generating models and LLM judges. To close this gap, we introduce WildChat-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating Re-Wild, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples.
Lots of people these days are training AI on synthetic data. But what distinguishes useful synthetic data from ... the other kind?

To answer this question, we generated a large amount of synthetic data using many different LLMs (which we call DGMs, for Data Generating Models) and compared the performance of new LLMs trained on that synthetic data, both against each other and against some state-of-the-art open-source datasets. We then used what we learned to curate a new dataset, Re-Wild, which is better, by some reasonable measures, than existing public SFT datasets.

We made all of our data and all of our models public, so that anyone can try training their own.
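The core experimental loop can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the function and model names below are hypothetical, and the stub generators stand in for real open-weight DGMs, which would be served through an inference engine rather than a lambda.

```python
"""Sketch of a regeneration loop: answer a shared prompt set with
several data-generating models (DGMs), yielding one candidate SFT
dataset per DGM for later comparative training and evaluation.
All names here are illustrative placeholders."""
from typing import Callable, Dict, List

# Shared prompts, as would be drawn from the original chat dataset.
PROMPTS: List[str] = [
    "Explain overfitting in one sentence.",
    "Write a haiku about gradient descent.",
]


def regenerate(
    prompts: List[str],
    dgms: Dict[str, Callable[[str], str]],
) -> Dict[str, List[dict]]:
    """For each DGM, answer every shared prompt, producing one
    candidate SFT dataset per DGM, tagged with its source model."""
    datasets: Dict[str, List[dict]] = {}
    for name, generate in dgms.items():
        datasets[name] = [
            {"prompt": p, "response": generate(p), "dgm": name}
            for p in prompts
        ]
    return datasets


if __name__ == "__main__":
    # Toy stubs standing in for real 0.5B-104B open-weight models.
    dgms = {
        "toy-small": lambda p: f"[toy-small] short answer to: {p}",
        "toy-large": lambda p: f"[toy-large] fuller answer to: {p}",
    }
    out = regenerate(PROMPTS, dgms)
    for name, rows in out.items():
        print(name, len(rows))
```

In the actual study, each per-DGM dataset would then be used to fine-tune the same base model, and the resulting checkpoints evaluated side by side, which is what makes the comparison attribute quality differences to the data rather than the recipe.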