Poster
Square$\chi$PO: Differentially Private and Robust $\chi^2$-Preference Optimization in Offline Direct Alignment
Xingyu Zhou · Yulian Wu · Wenqian Weng · Francesco Orabona
West Exhibition Hall B2-B3 #W-1012
Training large language models to align with human preferences is a key challenge in AI. But what if the human feedback is noisy, or needs to be kept private? In this work, we introduce a simple new method that lets language models learn from such feedback even when it is imperfect or protected. Our approach replaces the usual log-loss in offline direct alignment with a new square-loss objective, which makes learning more stable and accurate. As a result, our method is the first to offer strong guarantees when both privacy and noise are present, covering cases where the feedback labels, the user prompts, or both must stay private. We also show it works well even when the model has to make complex decisions. One surprising insight from our work is that whether privacy protection is applied before or after the data is corrupted makes a big difference. Overall, our findings not only improve current techniques but also provide new theoretical tools for building more trustworthy and robust AI systems.
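For reference, a minimal sketch of the "usual training loss" in offline direct alignment is the standard DPO-style logistic log-loss shown below; this is the baseline objective that the square-loss modification of Square $\chi$PO departs from, and the exact form of that modification is not reproduced in this summary. Here $\pi$ is the policy being trained, $\pi_{\mathrm{ref}}$ a fixed reference model, $\beta$ a regularization strength, $\sigma$ the sigmoid, and $(x, y^{+}, y^{-})$ a prompt with its preferred and rejected responses drawn from the preference dataset $\mathcal{D}$:

$$\mathcal{L}_{\mathrm{DPO}}(\pi) \;=\; -\,\mathbb{E}_{(x,\,y^{+},\,y^{-})\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi(y^{+}\mid x)}{\pi_{\mathrm{ref}}(y^{+}\mid x)} \;-\; \beta \log \frac{\pi(y^{-}\mid x)}{\pi_{\mathrm{ref}}(y^{-}\mid x)}\right)\right]$$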