

Poster

Design Considerations in Offline Preference-based RL

Alekh Agarwal · Christoph Dann · Teodor Vanislavov Marinov

West Exhibition Hall B2-B3 #W-901
Tue 15 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract:

Offline algorithms for Reinforcement Learning from Human Preferences (RLHF), which use only a fixed dataset of sampled responses given an input, and preference feedback among these responses, have gained increasing prominence in the literature on aligning language models. In this paper, we study how the different design choices made in methods such as DPO, IPO, SLiC and many variants influence the quality of the learned policy, from a theoretical perspective. Our treatment yields insights into the choices of loss function, the policy which is used to normalize log-likelihoods, and also the role of the data sampling policy. Notably, our results do not rely on the standard reparameterization-style arguments used to motivate some of the algorithms in this family, which allows us to give a unified treatment to a broad class of methods. We also conduct a small empirical study to verify some of the theoretical findings on a standard summarization benchmark.
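For reference, the family of objectives analyzed in work of this kind is often written in a single generic form; the notation below ($\ell$ for the pairwise loss, $\pi_{\mathrm{ref}}$ for the policy used to normalize log-likelihoods, $\mu$ for the data sampling policy, $\beta$ for a scaling parameter) is a standard sketch drawn from the literature, not necessarily the paper's exact formulation:

\[
\min_{\pi} \; \mathbb{E}_{x \sim \rho,\; (y_w, y_l) \sim \mu(\cdot \mid x)} \left[ \ell\!\left( \beta \log \frac{\pi(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
\]

Here $\ell(z) = -\log \sigma(z)$ recovers the DPO objective, $\ell(z) = (z - c)^2$ with a fixed target margin $c$ gives an IPO-style squared loss, and $\ell(z) = \max(0, \delta - z)$ gives a SLiC-style hinge loss; varying $\ell$, $\pi_{\mathrm{ref}}$, and $\mu$ traces out the design choices discussed in the abstract.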

Lay Summary:

This work provides a theoretical analysis of offline methods for Reinforcement Learning from Human Preferences (RLHF), a post-training technique used to improve language models. We examine how different design choices in algorithms like DPO, IPO, and SLiC impact the quality of the learned policy by providing a general framework that covers a broad range of offline RLHF techniques. In particular, we demonstrate, both through theory and through an empirical evaluation on a text summarization task, that the choices of loss function and reference policy are critical. Specifically, the squared loss used by IPO outperforms the logistic loss of DPO because it has better curvature properties, leading to more stable training and preventing the catastrophic collapse suffered by DPO, in which the model's performance degrades sharply after an initial improvement. Our findings suggest that careful selection of the loss function, together with training data that sufficiently covers a wide range of possible responses, is crucial for successfully and reliably training preference-aligned models.
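To make the curvature comparison concrete, here is a minimal numerical sketch, assuming the standard published forms of the DPO logistic loss and the IPO squared loss, with an arbitrary illustrative target margin c; this is not the paper's experimental code. It illustrates that the logistic loss's gradient flattens out as the normalized log-likelihood margin between preferred and dispreferred responses grows, while the squared loss always pulls the margin back toward a fixed target:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # z is the scaled log-likelihood margin between the preferred response y_w
    # and the dispreferred response y_l, normalized by the reference policy:
    #   z = beta * (log pi(y_w|x)/pi_ref(y_w|x) - log pi(y_l|x)/pi_ref(y_l|x))
    z = np.linspace(-5.0, 5.0, 11)

    # DPO-style logistic loss and its gradient in z: the gradient decays toward 0
    # as the margin grows, so already-separated pairs exert a vanishing pull.
    dpo_loss = -np.log(sigmoid(z))
    dpo_grad = -(1.0 - sigmoid(z))

    # IPO-style squared loss toward a fixed target margin c (c = 1.0 here is an
    # illustrative value): the gradient is linear in (z - c), giving constant
    # curvature and a restoring force whenever the margin overshoots the target.
    c = 1.0
    ipo_loss = 0.5 * (z - c) ** 2
    ipo_grad = z - c

    for zi, dl, dg, il, ig in zip(z, dpo_loss, dpo_grad, ipo_loss, ipo_grad):
        print(f"z={zi:+.1f}  DPO loss {dl:6.3f} grad {dg:+.3f}  |  "
              f"IPO loss {il:6.3f} grad {ig:+.3f}")

Running the sketch shows the DPO gradient shrinking toward zero for large positive margins while the IPO gradient keeps a nonzero slope, which is one informal way to see why the squared loss can yield more stable training.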
