Poster
LLM Data Selection and Utilization via Dynamic Bi-level Optimization
Yang Yu · Kai Han · Hang Zhou · Yehui Tang · Kaiqi Huang · Yunhe Wang · Dacheng Tao
East Exhibition Hall A-B #E-2305
While large-scale training data is fundamental for developing capable large language models (LLMs), strategically selecting high-quality data has emerged as a critical approach to enhancing training efficiency and reducing computational costs. Current data selection methodologies predominantly rely on static, training-agnostic criteria, failing to account for the dynamic interaction between model training and data. In this paper, we propose a new Data Weighting Model (DWM) that adjusts the weight of selected data within each batch, enabling dynamic data utilization during LLM training. Specifically, to better capture the trained model's evolving data preferences, we implement a bi-level optimization framework to update the weighting model. Our experiments demonstrate that DWM enhances the performance of models trained with randomly selected data, and that the learned weighting model can be transferred to enhance other data selection methods and models of different sizes. Moreover, we analyze how a model's data preferences evolve throughout training, providing new insights into this dynamic.
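The abstract describes weighting the examples within each batch rather than filtering data up front. As a minimal sketch of that idea (not the paper's actual implementation), the snippet below assumes a weighting model that emits one scalar score per example; the scores are normalized into batch weights and used to reweight each example's loss. All function names are illustrative.

```python
import math

def softmax(scores):
    # Normalize per-example scores into batch weights that sum to 1.
    # Subtracting the max keeps exp() numerically stable.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_batch_loss(losses, scores):
    # Reweight each example's loss so that higher-scored examples
    # contribute more to the model update for this batch.
    weights = softmax(scores)
    return sum(w * l for w, l in zip(weights, losses))

# Example: three examples in a batch; the second gets the highest score.
batch_losses = [2.0, 1.0, 3.0]
batch_scores = [0.5, 2.0, -1.0]
loss = weighted_batch_loss(batch_losses, batch_scores)
```

A uniform score vector recovers the standard mean-style batch loss, so this reweighting strictly generalizes equal treatment of batch examples.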
Training powerful language models requires vast amounts of data, but using all available data can be inefficient, costly, and environmentally taxing, especially when much of it is low quality. While recent methods try to select better data before training begins, they often ignore how a model's preferences change as it learns. Our work introduces the Data Weighting Model (DWM), which dynamically adjusts how much influence each piece of data has during training. Instead of treating all data in a batch equally, DWM learns to emphasize the most helpful examples at each training stage. The weighting model is trained with a bi-level optimization approach that tracks the model's evolving preferences. We show that even when starting with randomly chosen data, DWM can improve training efficiency and performance, in some cases outperforming hand-picked datasets. The trained weighting model also transfers across model sizes and can boost other data selection techniques. Our research offers a more adaptive, cost-effective way to train large language models.
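To make the bi-level (two-step) structure concrete, the toy sketch below alternates an inner gradient step on a weighted training loss with an outer update of the weighting parameter, using a finite-difference hypergradient of a held-out validation loss. The 1-D model, the data, and the finite-difference scheme are all invented for illustration and are not the paper's actual method.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Two conflicting training examples (x, y); only the first matches
# the held-out validation data, so the outer loop should learn to
# upweight it. All numbers are made up for illustration.
TRAIN = [(1.0, 2.0), (1.0, 0.0)]
VAL = [(1.0, 2.0)]

def inner_step(theta, phi, lr=0.1):
    # Lower level: one gradient step on the weighted squared error,
    # with weights [sigmoid(phi), 1 - sigmoid(phi)] over TRAIN.
    w = [sigmoid(phi), 1.0 - sigmoid(phi)]
    grad = sum(2 * wi * (theta * x - y) * x
               for wi, (x, y) in zip(w, TRAIN))
    return theta - lr * grad

def val_loss(theta):
    return sum((theta * x - y) ** 2 for x, y in VAL)

theta, phi = 0.0, 0.0
for _ in range(200):
    # Upper level: finite-difference hypergradient of the validation
    # loss through one inner step, used to update the weighting param.
    eps = 1e-4
    g = (val_loss(inner_step(theta, phi + eps)) -
         val_loss(inner_step(theta, phi - eps))) / (2 * eps)
    phi -= 0.5 * g
    theta = inner_step(theta, phi)
```

In this toy setting the outer loop drives `sigmoid(phi)` toward 1, concentrating weight on the example that helps validation performance, which mirrors the abstract's claim that the weighting model tracks what data currently benefits the trained model.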