Poster in Workshop: DataWorld: Unifying data curation frameworks across domains
Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement
Simon Yu · Liangyu Chen · Sara Ahmadian · Marzieh Fadaee
Keywords: [ iterative refinement ] [ training efficiency ] [ diversity ] [ data selection ]
Finetuning large language models on instruction data is an important step in enriching the knowledge learned during pre-training and improving instruction-following capabilities. As the number of instruction datasets continues to grow, selecting the right data to achieve optimal results becomes increasingly important. In this work, we ask a fundamental question: How can we determine the optimal subset of data for effective training? While much of the existing research primarily emphasizes local criteria, such as instance quality, for subset selection, we argue that a global approach focused on data diversity is more critical. Our approach utilizes k-means clustering to ensure that the selected subset effectively represents the full dataset. We propose an iterative refinement method inspired by active learning techniques to resample instances from clusters, with the importance and sampling weight of each cluster being reassessed in every training iteration. This method allows us to reduce the effect of outliers and automatically filter out clusters containing low-quality data. Through extensive evaluation across natural language reasoning, general world knowledge, code and math reasoning tasks, and by fine-tuning models from various families, we observe consistent improvements, achieving a 7% increase over random selection and a 3.8% improvement over state-of-the-art sampling methods. Our work highlights the significance of diversity-first sampling when finetuning LLMs to enhance performance across a broad array of evaluation tasks. Our code is submitted as supplementary materials.
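The abstract's pipeline (cluster the full dataset, sample from clusters, then reweight clusters between training rounds) can be sketched as follows. This is a minimal illustrative toy, not the authors' implementation: the 2-d points, the plain Lloyd's k-means, and the random per-cluster "feedback" scores standing in for held-out quality signals are all assumptions for demonstration.

```python
import random

random.seed(0)

# Toy stand-in for instruction embeddings: 300 points in 2-d, three blobs.
centers = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
data = [(cx + random.gauss(0, 1), cy + random.gauss(0, 1))
        for cx, cy in centers for _ in range(100)]

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    cents = random.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda c: (p[0] - cents[c][0]) ** 2
                                    + (p[1] - cents[c][1]) ** 2)
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep old centroid if a cluster empties out
                cents[c] = (sum(m[0] for m in members) / len(members),
                            sum(m[1] for m in members) / len(members))
    return labels

k = 3
labels = kmeans(data, k)

# Iterative refinement: sample from each cluster in proportion to its
# weight, then reassess the weights after every (simulated) training round.
weights = [1.0 / k] * k   # start from uniform cluster importance
budget = 60               # instances selected per round
for _ in range(3):
    subset = []
    for c in range(k):
        members = [i for i, l in enumerate(labels) if l == c]
        take = min(int(weights[c] * budget), len(members))
        subset.extend(random.sample(members, take))
    # Hypothetical per-cluster feedback (e.g. held-out loss after training
    # on the subset); random here purely for illustration. Low-scoring
    # clusters are automatically down-weighted in later rounds.
    scores = [random.random() for _ in range(k)]
    total = sum(scores)
    weights = [s / total for s in scores]
```

In the paper's setting the points would be model embeddings of instruction instances and the feedback signal would come from the fine-tuning loop; the structure of the loop (cluster once, then resample and reweight per iteration) is what this sketch demonstrates.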