Poster
AlphaPO: Reward Shape Matters for LLM Alignment
Aman Gupta · Shao Tang · Qingquan Song · Sirou Zhu · Jiwoo Hong · Ankan Saha · Viral Gupta · Noah Lee · Eunki Kim · Siyu Zhu · Parag Agrawal · Natesh Pillai · Sathiya Keerthi
East Exhibition Hall A-B #E-2509
Large language models are often fine-tuned to follow human instructions using methods such as Reinforcement Learning from Human Feedback (RLHF). Newer Direct Alignment Algorithms (DAAs) such as DPO and SimPO skip the separate reward-modeling step and optimize directly on human preferences, but this can come at the cost of reducing the model's likelihood of generating the preferred responses, a problem known as likelihood displacement. This paper introduces AlphaPO, a simple yet powerful tweak: it adds a tunable parameter α that reshapes the reward function itself, allowing precise control over how aggressively the model shifts probability mass toward preferred outputs without overshooting or under-optimizing. By varying α, AlphaPO produces training trajectories that better balance margin improvement against maintaining high preferred-response probabilities, mitigating both over-optimization and catastrophic likelihood displacement. In experiments on state-of-the-art 7–8 billion-parameter instruct models, AlphaPO boosts alignment performance by 7–10% relative to SimPO and by 15–50% relative to DPO, without producing longer or more verbose outputs, highlighting that the shape of the reward function is a crucial and previously underexplored knob for aligning LLMs to human values.
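
To make the idea of "reshaping the reward" concrete, below is a minimal sketch of a SimPO-style preference loss in which the length-normalized reward is passed through an α-parameterized transform before the margin is computed. The transform used here, (exp(αr) − 1)/α, is only an illustrative choice that reduces to the plain SimPO reward as α → 0; the exact AlphaPO reward shape, along with the hyperparameters `alpha`, `beta`, and `gamma` and the function name `alpha_shaped_daa_loss`, are assumptions for this example and are given precisely in the paper.

```python
import torch
import torch.nn.functional as F

def alpha_shaped_daa_loss(logp_chosen, logp_rejected, len_chosen, len_rejected,
                          alpha=0.1, beta=2.0, gamma=0.5):
    """Illustrative sketch of a direct-alignment loss with an alpha-shaped reward.

    logp_*: summed log-probabilities of the chosen/rejected responses under the policy
    len_*:  response lengths in tokens, used for length normalization (as in SimPO)
    """
    # Length-normalized average log-probability, the SimPO-style implicit reward.
    r_chosen = logp_chosen / len_chosen
    r_rejected = logp_rejected / len_rejected

    def shape(r):
        # Hypothetical alpha-shaping: an exponential transform that recovers
        # the identity map (i.e., the SimPO reward) in the limit alpha -> 0.
        return (torch.exp(alpha * r) - 1.0) / alpha

    # Bradley-Terry style sigmoid loss on the shaped reward margin,
    # with a target margin gamma as in SimPO.
    margin = beta * (shape(r_chosen) - shape(r_rejected)) - gamma
    return -F.logsigmoid(margin).mean()
```

The point of the sketch is the single extra degree of freedom: with α near zero the transform is nearly linear and the objective behaves like SimPO, while larger |α| bends the reward curve and changes how strongly probability mass is pushed toward (or away from) preferred responses during training.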