Poster in Workshop: DataWorld: Unifying data curation frameworks across domains
How to Recommend a Dataset for a Model Training Team? Rethinking Proxy-Model-Based Techniques
Jiachen (Tianhao) Wang · Tong Wu · Kaifeng Lyu · Dawn Song · Ruoxi Jia · Prateek Mittal
Keywords: [ Proxy Model ] [ Data Selection ]
Selecting a high-quality pretraining corpus for large language models (LLMs) is a crucial yet computationally expensive challenge. Proxy-model-based techniques have emerged as a practical way to evaluate candidate datasets without incurring the cost of full-scale training. In current practice, however, a proxy model is typically trained on each candidate corpus with a single, fixed set of hyperparameters. This approach is often unreliable: each dataset has its own optimal training configuration, and dataset rankings can completely reverse under even minor adjustments to the proxy training hyperparameters. We expose this fragility and formulate a more faithful objective for dataset selection: choose the dataset that attains the best achievable validation loss once its hyperparameters are fully optimized on the target model. To meet this objective, we introduce a simple yet effective patch to current proxy-model-based methods: train the proxy models with a tiny learning rate. We prove that, for random-feature models, sufficiently small learning rates asymptotically preserve the ordering of datasets by their optimal losses. Through extensive experiments, we show that tiny-learning-rate proxies achieve near-perfect Spearman rank correlation with target-scale models. Notably, this transferable signal emerges within just a few hundred training iterations, yielding significant computational savings.
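To make the recipe concrete, below is a minimal, self-contained sketch (not the authors' code) of the tiny-learning-rate proxy in the random-feature setting that the theory covers. Everything here is an illustrative assumption: make_dataset, the noise/shift parameters, the toy teacher theta_star, and the specific learning rate and step count. It trains a linear head on fixed random features with a tiny learning rate for a few hundred steps, and compares the resulting ranking of candidate datasets against the ranking by each dataset's fully optimized validation loss via Spearman correlation.

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
d_in, d_feat = 20, 200
W = rng.normal(size=(d_feat, d_in)) / np.sqrt(d_in)     # fixed random feature map
theta_star = rng.normal(size=d_feat) / np.sqrt(d_feat)  # toy target labeler (assumed)

def features(X):
    return np.tanh(X @ W.T)  # phi(x) = tanh(Wx): a random-feature model

def make_dataset(n, noise, shift):
    # Hypothetical candidate corpus: an input-distribution shift plus label noise.
    X = rng.normal(loc=shift, size=(n, d_in))
    Phi = features(X)
    y = Phi @ theta_star + noise * rng.normal(size=n)
    return Phi, y

# Shared validation set drawn from the target distribution.
Phi_val = features(rng.normal(size=(1000, d_in)))
y_val = Phi_val @ theta_star

def val_loss(theta):
    return float(np.mean((Phi_val @ theta - y_val) ** 2))

def tiny_lr_proxy_loss(Phi, y, lr=1e-3, steps=300):
    # The proposed patch: a short proxy run with a tiny learning rate.
    theta = np.zeros(d_feat)
    for _ in range(steps):
        grad = Phi.T @ (Phi @ theta - y) / len(y)  # full-batch gradient of MSE
        theta -= lr * grad
    return val_loss(theta)

def optimal_loss(Phi, y):
    # The paper's objective: best achievable validation loss after full optimization.
    theta_opt, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return val_loss(theta_opt)

candidates = [make_dataset(n=500, noise=s, shift=m)
              for s, m in [(0.1, 0.0), (0.5, 0.2), (1.0, 0.0), (0.2, 0.5), (0.8, 0.3)]]
proxy_scores = [tiny_lr_proxy_loss(Phi, y) for Phi, y in candidates]
oracle_scores = [optimal_loss(Phi, y) for Phi, y in candidates]
rho, _ = spearmanr(proxy_scores, oracle_scores)
print(f"Spearman(tiny-LR proxy, optimal-loss oracle) = {rho:.3f}")

Per the paper's result for this model class, shrinking the learning rate should drive the proxy ranking toward agreement with the optimal-loss ranking; in contrast, a proxy run at one aggressive, fixed learning rate can favor whichever dataset happens to suit that configuration.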