Poster
Larger or Smaller Reward Margins to Select Preferences for LLM Alignment?
Kexin Huang · Junkang Wu · Ziqian Chen · Xue Wang · Jinyang Gao · Bolin Ding · Jiancan Wu · Xiangnan He · Xiang Wang
East Exhibition Hall A-B #E-3311
When teaching AI language models to understand human preferences, researchers face a challenge: existing metrics for assessing training data quality often give conflicting assessments, making it difficult to select the most effective data for training. Our research introduces a new measurement approach that bridges this gap by considering both what the AI system currently understands and what we want it to learn, helping identify which training examples will be most valuable for teaching the AI to better align with human values. Experiments show that data selected by our metric consistently outperforms data selected by existing metrics across a variety of training settings, enabling more efficient training of AI systems that better understand and respect human preferences.
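To make the idea of margin-based data selection concrete, below is a minimal sketch, not the authors' actual metric. It assumes a DPO-style implicit reward margin, beta * (log pi(y_w|x) - log pi_ref(y_w|x)) - beta * (log pi(y_l|x) - log pi_ref(y_l|x)), and scores each preference pair by the gap between a desired target margin (what we want the model to learn) and the model's current implicit margin (what it already encodes). Names such as `alignment_gap`, `target_margin`, and the log-probability fields are hypothetical placeholders.

```python
# Illustrative sketch only: assumes a DPO-style implicit reward margin and a
# hypothetical "gap to target" selection score; not the paper's exact method.

def implicit_margin(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Current policy's implicit reward margin on a preference pair."""
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    return reward_chosen - reward_rejected


def alignment_gap(pair, target_margin=1.0, beta=0.1):
    """Hypothetical score: how far the pair's current implicit margin
    falls short of the margin we want the aligned model to exhibit."""
    current = implicit_margin(pair["logp_chosen"], pair["logp_rejected"],
                              pair["ref_logp_chosen"], pair["ref_logp_rejected"],
                              beta=beta)
    return target_margin - current


def select_pairs(pairs, k, target_margin=1.0):
    """Keep the top-k pairs whose current behaviour is furthest from target."""
    scored = sorted(pairs, key=lambda p: alignment_gap(p, target_margin),
                    reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy data: summed log-probabilities under the policy and reference model.
    pairs = [
        {"logp_chosen": -12.0, "logp_rejected": -10.0,
         "ref_logp_chosen": -11.5, "ref_logp_rejected": -10.5},
        {"logp_chosen": -8.0, "logp_rejected": -14.0,
         "ref_logp_chosen": -9.0, "ref_logp_rejected": -13.0},
    ]
    for p in select_pairs(pairs, k=1):
        print(p, alignment_gap(p))
```

In practice the target margin could be tied to annotation-based preference strength rather than a fixed constant; the sketch only illustrates the general principle of weighing what the model currently believes against what the data asks it to learn.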