

Poster in Workshop: 2nd Workshop on Models of Human Feedback for AI Alignment (MoFA)

In-Context Personalized Alignment with Feedback History under Counterfactual Evaluation

Xisen Jin · Zheng Li · Zhenwei DAI · Hui Liu · Xianfeng Tang · Chen Luo · Rahul Goutam · Xiang Ren · Qi He


Abstract:

Accommodating the diverse preferences of users is an emerging challenge in large language model (LLM) alignment. A prevalent solution is to prompt LLMs with past user feedback from earlier conversations, so that they can infer user preferences and adapt their generations accordingly. In this paper, we revisit this in-context LLM personalization paradigm under a synthetic counterfactual evaluation setup, in which each candidate response can be the preferred one depending on the user's preferences. We examine whether model responses can be steered toward diverse preferences by providing distinct feedback histories in-context. Our experiments suggest that off-the-shelf LLMs struggle to infer user preferences from in-context feedback, both for personalized reward modeling and for response generation. We show that fine-tuning is almost necessary for in-context feedback to be leveraged, with fine-tuned 7-8B LLMs improving over off-the-shelf LLMs. Lastly, we improve fine-tuned response generation models via rejection sampling of training data guided by the personalized reward model.
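As a rough illustration of the paradigm described in the abstract, the sketch below shows how a feedback history might be serialized into an in-context prompt and how a personalized reward model could guide rejection sampling of candidate responses. The prompt format and the helpers `generate_candidates` and `personalized_reward` are hypothetical placeholders for illustration only, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FeedbackTurn:
    """One past interaction: the user's prompt, the model's response, and the user's reaction."""
    prompt: str
    response: str
    feedback: str  # e.g. "liked: concise answer" / "disliked: too verbose"

def build_personalized_prompt(history: List[FeedbackTurn], new_prompt: str) -> str:
    """Serialize past feedback into an in-context prefix (hypothetical format)."""
    lines = ["Past interactions with this user:"]
    for turn in history:
        lines.append(f"User asked: {turn.prompt}")
        lines.append(f"Assistant answered: {turn.response}")
        lines.append(f"User feedback: {turn.feedback}")
    lines.append("Considering this user's preferences, answer the new request.")
    lines.append(f"New request: {new_prompt}")
    return "\n".join(lines)

def generate_candidates(prompt: str, n: int = 4) -> List[str]:
    """Placeholder for sampling n candidate responses from a policy model."""
    return [f"candidate response {i} to: {prompt[:40]}..." for i in range(n)]

def personalized_reward(history: List[FeedbackTurn], prompt: str, response: str) -> float:
    """Placeholder for a reward model conditioned on the user's feedback history."""
    # Trivial stand-in logic: prefer shorter responses if past feedback mentions "concise".
    prefers_concise = any("concise" in turn.feedback for turn in history)
    return -float(len(response)) if prefers_concise else float(len(response))

def rejection_sample(history: List[FeedbackTurn], prompt: str, n: int = 4) -> str:
    """Keep the candidate the personalized reward model ranks highest,
    e.g. to build fine-tuning data for the response generation model."""
    candidates = generate_candidates(build_personalized_prompt(history, prompt), n=n)
    return max(candidates, key=lambda r: personalized_reward(history, prompt, r))

if __name__ == "__main__":
    history = [
        FeedbackTurn("Explain transformers", "A long essay...", "disliked: too verbose"),
        FeedbackTurn("Summarize this paper", "Three bullet points.", "liked: concise answer"),
    ]
    print(rejection_sample(history, "How does rejection sampling work?"))
```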
