Oral in Affinity Workshop: 4th MusIML workshop at ICML’25
Large Language Model Value Alignment via Multi-Stage Fine-Tuning and Expert-Annotated Supervision
Yan Sha · Shaokai Yang · Zhao DONG
Ensuring that large language models (LLMs) generate responses aligned with human values is a critical challenge in AI safety and deployment. We present a multi-stage alignment framework that combines expert annotation, structured arbitration, and iterative fine-tuning. In our approach, model responses to diverse user prompts are rated by multiple experts along key alignment dimensions. Cases with conflicting ratings are escalated to senior-expert arbitration, yielding high-confidence consensus labels. This curated supervision drives successive rounds of model fine-tuning, with each iteration further refining alignment. To safeguard conversational quality, we employ Sentence-BERT to quantitatively measure dialogue coherence before and after alignment. Experimental results demonstrate that this process improves alignment outcomes while maintaining or enhancing coherence and relevance. Our framework provides a systematic, scalable solution for aligning LLMs with human values and intent.
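The abstract does not spell out how conflicting ratings are detected or resolved. As a minimal sketch of one plausible escalation rule (assuming 1–5 integer scores and illustrative dimension names, none of which are specified in the paper), disagreement beyond a spread threshold on any dimension could trigger senior-expert arbitration, while agreeing ratings are averaged into a consensus label:

```python
from statistics import mean

# Illustrative rating dimensions and scale; the paper does not name them.
DIMENSIONS = ("harmlessness", "helpfulness", "honesty")
CONFLICT_THRESHOLD = 2  # max allowed spread on a 1-5 scale before escalation


def needs_arbitration(ratings: list[dict]) -> bool:
    """Flag a response whose expert ratings disagree beyond the threshold."""
    for dim in DIMENSIONS:
        scores = [r[dim] for r in ratings]
        if max(scores) - min(scores) >= CONFLICT_THRESHOLD:
            return True
    return False


def consensus_label(ratings: list[dict], arbitrate) -> dict:
    """Return a high-confidence label: averaged scores when experts agree,
    otherwise the result of senior-expert arbitration."""
    if needs_arbitration(ratings):
        return arbitrate(ratings)  # escalate to senior expert(s)
    return {dim: mean(r[dim] for r in ratings) for dim in DIMENSIONS}
```

Under this rule, only the contested minority of cases consumes senior-expert time, which is what makes the curated supervision scalable across fine-tuning rounds.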
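The coherence check can be illustrated more concretely. A common way to score dialogue coherence with Sentence-BERT is the mean cosine similarity between embeddings of adjacent turns; the sketch below assumes this formulation and the `all-MiniLM-L6-v2` checkpoint from the `sentence-transformers` library, since the paper specifies neither:

```python
from sentence_transformers import SentenceTransformer, util

# Any Sentence-BERT checkpoint works here; all-MiniLM-L6-v2 is a common
# lightweight default (the paper does not say which model it uses).
encoder = SentenceTransformer("all-MiniLM-L6-v2")


def dialogue_coherence(turns: list[str]) -> float:
    """Mean cosine similarity between embeddings of adjacent dialogue turns,
    used as a proxy for conversational coherence."""
    if len(turns) < 2:
        raise ValueError("need at least two turns to score coherence")
    emb = encoder.encode(turns, convert_to_tensor=True)
    sims = [util.cos_sim(emb[i], emb[i + 1]).item() for i in range(len(emb) - 1)]
    return sum(sims) / len(sims)


# Comparing the score on the same prompts before and after a fine-tuning
# round indicates whether alignment degraded conversational quality:
# delta = dialogue_coherence(aligned_turns) - dialogue_coherence(base_turns)
```

Running this before and after each alignment iteration gives the quantitative before/after comparison the abstract describes.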