

Poster

Principled Data Selection for Alignment: The Hidden Risks of Difficult Examples

Chengqian Gao · Haonan Li · Liu Liu · Zeke Xie · Peilin Zhao · Zhiqiang Xu

East Exhibition Hall A-B #E-2900
Wed 16 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

The alignment of large language models (LLMs) often assumes that more clean data yields better outcomes, overlooking the match between model capacity and example difficulty. Challenging this, we propose a new principle: preference data vary in difficulty, and overly difficult examples hinder alignment by exceeding the model's capacity. Through systematic experimentation, we validate this principle with three key findings: (1) preference examples vary in difficulty, as evidenced by consistent learning orders across alignment runs; (2) overly difficult examples significantly degrade performance across four LLMs and two datasets; and (3) the capacity of a model dictates its threshold for handling difficult examples, underscoring a critical relationship between data selection and model capacity. Building on this principle, we introduce Selective DPO, which filters out overly difficult examples. This simple adjustment improves alignment performance by 9–16% in win rates on the AlpacaEval 2 benchmark compared to the DPO baseline, surpassing a series of DPO variants with different algorithmic adjustments. These results together illuminate the importance of aligning data difficulty with model capacity, offering a transformative perspective for improving alignment strategies in LLMs. Code is available at https://github.com/glorgao/SelectiveDPO

Lay Summary:

Large language model (LLM) alignment is not a "more data is always better" game. We discover that examples are learned in a consistent order across different runs and training datasets, reflecting an intrinsic difficulty tied to model capacity (quantified by validation loss), and that the hardest slice, those examples lying beyond the model's reach, actually degrades alignment performance. Dropping these hardest cases and training only on the rest is a tiny change but a big win. On the AlpacaEval 2 benchmark, this cut-down curriculum raises the model's win rate by 9–16 points, beating a host of more complicated DPO variants. The lesson is clear: tune a model on what it can realistically learn.
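The selection step described above is conceptually small: score each preference pair by how hard it is for the model (the lay summary points to a validation-loss-style proxy), rank the pairs, and keep only the easier fraction for DPO training. The sketch below is a minimal, hypothetical illustration of that idea; the names (PreferencePair, difficulty, select_for_dpo) and the 50% keep ratio are assumptions for illustration, not the authors' released implementation, which lives in the linked GitHub repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    """One preference example: a prompt with a chosen and a rejected response."""
    prompt: str
    chosen: str
    rejected: str
    difficulty: float  # assumed: a per-example difficulty proxy, e.g. validation loss


def select_for_dpo(pairs: List[PreferencePair], keep_fraction: float = 0.5) -> List[PreferencePair]:
    """Keep only the easiest `keep_fraction` of examples, dropping the hardest slice.

    The retained subset would then be handed to a standard DPO trainer.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(pairs, key=lambda p: p.difficulty)  # easiest first
    cutoff = max(1, int(len(ranked) * keep_fraction))
    return ranked[:cutoff]


if __name__ == "__main__":
    # Toy difficulty scores used purely as placeholders.
    data = [
        PreferencePair("Q1", "good A1", "bad A1", difficulty=0.4),
        PreferencePair("Q2", "good A2", "bad A2", difficulty=2.1),
        PreferencePair("Q3", "good A3", "bad A3", difficulty=0.9),
        PreferencePair("Q4", "good A4", "bad A4", difficulty=3.7),
    ]
    kept = select_for_dpo(data, keep_fraction=0.5)
    print([p.prompt for p in kept])  # the two easiest examples: ['Q1', 'Q3']
```

The design choice mirrored here is that filtering happens entirely at the data level, so any existing DPO training loop can consume the reduced dataset unchanged.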
