Workshop
2nd Workshop on Models of Human Feedback for AI Alignment (MoFA)
Belen Martin Urcelay · Micah Carroll · Maria Teresa Parreira · Thomas Kleine Buening · Andreas Krause · Anca Dragan
West Ballroom A
Fri 18 Jul, 9 a.m. PDT
Our workshop brings together experts in machine learning, cognitive science, behavioral psychology, and economics to explore human-AI alignment by examining human (and AI) feedback mechanisms, their mathematical models, and practical implications. By fostering collaboration between technical and behavioral science communities, we aim to develop more realistic models of human feedback that can better inform the development of aligned AI systems.
Timezone: America/Los_Angeles
Schedule
Fri 9:00 a.m. - 9:05 a.m. | Opening Remarks (Intro)
Fri 9:05 a.m. - 9:40 a.m. | Explainable Decision Support and Justification for Model Alignment in Human-Robot Teams (Invited Talk) | Matthew Luebbers
Fri 9:40 a.m. - 9:55 a.m. | Aligned Textual Scoring Rule (Oral) | Yuxuan Lu · Yifan Wu · Jason Hartline · Michael Curry
Fri 9:55 a.m. - 10:10 a.m. | Copilot Arena: A Platform for Code LLM Evaluation in the Wild (Oral) | Wayne Chi · Valerie Chen · Anastasios Angelopoulos · Wei-Lin Chiang · Aditya Mittal · Naman Jain · Tianjun Zhang · Ion Stoica · Chris Donahue · Ameet Talwalkar
Fri 10:10 a.m. - 10:25 a.m. | Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes (Oral) | Katarzyna Kobalczyk · Claudio Fanconi · Hao Sun · Mihaela van der Schaar
Fri 10:25 a.m. - 11:00 a.m. | Alignment is social: lessons from human alignment for AI (Invited Talk) | Gillian Hadfield
Fri 11:00 a.m. - 11:15 a.m. | Cultivating Pluralism In Algorithmic Monoculture: The Community Alignment Dataset (Oral) | Lily Zhang · Smitha Milli · Karen Jusko · Jonathan Smith · Brandon Amos · Wassim Bouaziz · Jack Kussman · Manon Revel · Lisa Titus · Bhaktipriya Radharapu · Jane Dwivedi-Yu · Vidya Sarma · Kristopher Rose · Maximilian Nickel
Fri 11:15 a.m. - 11:30 a.m. | Deep Context-Dependent Choice Model (Oral) | Shuhan Zhang · Zhi Wang · Rui Gao · Shuang Li
Fri 11:30 a.m. - 12:05 p.m. | The Limits of Preferences: Navigating Human-AI Feedback Tradeoffs in Alignment (Invited Talk) | Valentina Pyatkin
Fri 12:05 p.m. - 1:30 p.m. | Lunch and Poster Session I (Poster)
Fri 1:30 p.m. - 2:05 p.m. | Personalization and pluralistic alignment of LLMs via reinforcement learning fine-tuning (Invited Talk) | Natasha Jaques
Fri 2:05 p.m. - 3:00 p.m. | Panel | Stephen Casper · Yilun Zhou · Goran Radanovic · Natasha Jaques
Fri 3:00 p.m. - 3:30 p.m. | Coffee Break (Break)
Fri 3:30 p.m. - 3:45 p.m. | Doctor Approved: Generating Medically Accurate Skin Disease Images through AI–Expert Feedback (Oral) | Janet Wang · Yunbei Zhang · Zhengming Ding · Jihun Hamm
Fri 3:45 p.m. - 4:00 p.m. | Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings (Oral) | Jenny Huang · Yunyi Shen · Dennis Wei · Tamara Broderick
Fri 4:00 p.m. - 4:55 p.m. | Poster Session II (Poster)
Fri 4:55 p.m. - 5:00 p.m. | Closing Remarks (Close)
Posters
→ Learning interpretable descriptions of human preferences (Poster) | Rajiv Movva · Emma Pierson
→ Vertical Moral Growth: A Novel Developmental Framework for Human Feedback Quality in AI Alignment (Poster) | Taichiro Endo
→ Improvement-Guided Iterative DPO for Diffusion Models (Poster) | Ying Fan · Fei Deng · Yang Zhao · Sahil Singla · Rahul Jain · Tingbo Hou · Kangwook Lee · Feng Yang · Deepak Ramachandran · Qifei Wang
→ Empirical Studies on the Limitations of Direct Preference Optimization, and a Possible Quick Fix (Poster) | Jiarui Yao · Yong LIN · Tong Zhang
→ Inference-Time Reward Hacking in Large Language Models (Poster) | Hadi Khalaf · Claudio Mayrink Verdun · Alex Oesterling · Himabindu Lakkaraju · Flavio Calmon
→ FSPO: Few-Shot Preference Optimization of Synthetic Preference Data Elicits LLM Personalization to Real Users (Poster) | Anikait Singh · Sheryl Hsu · Kyle Hsu · Eric Mitchell · Stefano Ermon · Tatsunori Hashimoto · Archit Sharma · Chelsea Finn
→ BiasLab: Toward Explainable Political Bias Detection with Dual-Axis Human Annotations and Rationale Indicators (Poster) | KMA SOLAIMAN
→ Implicit User Feedback in Human-LLM Dialogues: Informative to Understand Users yet Noisy as a Learning Signal (Poster) | Yuhan Liu · Michael Zhang · Eunsol Choi
→ Configurable Preference Tuning with Rubric-Guided Synthetic Data (Poster) | Victor Gallego
→ Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward (Poster) | Yanming Wan · Jiaxing Wu · Marwa Abdulhai · Lior Shani · Natasha Jaques
→ Tracing Human-like Traits in LLMs: Origins, Real-World Manifestation, and Controllability (Poster) | Pengrui Han · Rafal Kocielnik · Peiyang Song · Ramit Debnath · Dean Mobbs · Anima Anandkumar · R. Michael Alvarez
→ ReDit: Reward Dithering for Improved LLM Policy Optimization (Poster) | Chenxing Wei · Jiarui Yu · Ying He · Hande Dong · Yao Shu · Fei Yu
→ Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization (Poster) | Chengcan Wu · Zhixin Zhang · Zeming Wei · Yihao Zhang · Meng Sun
→ Do Language Models Understand Discrimination? Testing Alignment with Human Legal Reasoning under the ECHR (Poster) | Tatiana Botskina
→ Fine-Tuning Next-Scale Visual Autoregressive Models with Group Relative Policy Optimization (Poster) | Matteo Gallici · Haitz Sáez de Ocáriz Borde
→ Playing the Data: Video Games as a Tool to Annotate and Train Models on Large Datasets (Poster) | Parham Ghasemloo Gheidari · Kai-Hsiang Chang · Roman Sarrazin-Gendron · Renata Mutalova · Alexander Butyaev · Attila Szantner · Jérôme Waldispühl
→ In-Context Personalized Alignment with Feedback History under Counterfactual Evaluation (Poster) | Xisen Jin · Zheng Li · Zhenwei DAI · Hui Liu · Xianfeng Tang · Chen Luo · Rahul Goutam · Xiang Ren · Qi He
→ Towards a Sharp Analysis of Offline Policy Learning for $f$-Divergence-Regularized Contextual Bandits (Poster) | Qingyue Zhao · Kaixuan Ji · Heyang Zhao · Tong Zhang · Quanquan Gu
→ Multi-Task Reward Learning from Human Ratings (Poster) | Mingkang Wu · Devin White · Evelyn Rose · Vernon Lawhern · Nicholas Waytowich · Yongcan Cao
→ Doubly Robust Alignment for Large Language Models (Poster) | Erhan Xu · Kai Ye · Hongyi Zhou · Luhan Zhu · Francesco Quinzan · Chengchun Shi
→ KL-Regularised Q-Learning: A Token-level Action-Value perspective on Online RLHF (Poster) | Lennie Wells · Edward J. Young · Jason Brown · Sergio Bacallado
→ CUDA: Capturing Uncertainty and Diversity in Preference Feedback Augmentation (Oral) | Sehyeok Kang · Jaewook Jeong · Se-Young Yun
→ Entropy Controllable Direct Preference Optimization (Poster) | Motoki Omura · Yasuhiro Fujita · Toshiki Kataoka
→ The Strong, weak and benign Goodhart’s law. An independence-free and paradigm-agnostic formalisation (Poster) | Adrien Majka · El-Mahdi El-Mhamdi
→ Unanchoring the Mind: DAE-Guided Counterfactual Reasoning for Rare Disease Diagnosis (Poster) | Yuting Yan · Yinghao Fu · Wendi Ren · Shuang Li
→ Understanding Likelihood Over-optimisation in Direct Alignment Algorithms (Poster) | Zhengyan Shi · Sander Land · Acyr Locatelli · Matthieu Geist · Max Bartolo
→ Advancing LLM Safe Alignment with Safety Representation Ranking (Poster) | Tianqi Du · Zeming Wei · Quan Chen · Chenheng Zhang · Yisen Wang
→ LoRe: Personalizing LLMs via Low-Rank Reward Modeling (Poster) | Avinandan Bose · Zhihan Xiong · Yuejie Chi · Simon Du · Lin Xiao · Maryam Fazel
→ Online Learning and Equilibrium Computation with Ranking Feedback (Poster) | Mingyang Liu · Yongshan Chen · Zhiyuan Fan · Gabriele Farina · Asuman Ozdaglar · Kaiqing Zhang
→ Efficient Generative Models Personalization via Optimal Experimental Design (Poster) | Guy Schacht · Mojmir Mutny · Riccardo De Santi · Ziyad Sheebaelhamd · Andreas Krause
→ Aligning Neural Style Representations for Style-based Clustering (Poster) | Abhishek Dangeti · Pavan Gajula · Vikram Jamwal · Vivek Srivastava
→ Aggregated Individual Reporting for Post-Deployment Evaluation (Poster) | Jessica Dai · Inioluwa Raji · Benjamin Recht · Irene Y. Chen
→ Full-Stack Alignment: Co-Aligning AI and Institutions with Thicker Models of Value (Poster) | Ryan Lowe · Joe Edelman · Tan Zhi-Xuan · Oliver Klingefjord · Ellie Hain · Vincent Wang-Maścianica · Atrisha Sarkar · Michiel Bakker · Fazl Barez · Matija Franklin · Andreas Haupt · Jobst Heitzig · Wesley H. Holliday · Julian Jara-Ettinger · Atoosa Kasirzadeh · Ryan Kearns · James Kirkpatrick · Andrew Koh · Joel Lehman · Sydney Levine · Manon Revel · Ivan Vendrov
→ Self-Concordant Preference Learning from Noisy Labels (Poster) | Shiv Shankar · Madalina Fiterau
→ Selective Preference Aggregation (Poster) | Shreyas Kadekodi · Hayden McTavish · Berk Ustun
→ Auto-Guideline Alignment: Probing and Simulating Human Ideological Preferences in LLMs via Prompt Engineering (Poster) | Chien-Hua Chen · Chang Chih Meng · Li-Ni Fu · Hen-Hsen Huang · I-Chen Wu
→ Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval (Poster) | Taiye Chen · Zeming Wei · Ang Li · Yisen Wang
→ EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments (Poster) | Sara Fish · Julia Shephard · Minkai Li · Ran Shorrer · Yannai A. Gonczarowski
→ Human Feedback Guided Reinforcement Learning for Unknown Temporal Tasks via Weighted Finite Automata (Poster) | Nathaniel Smith · Yu Wang
→ Alignment as Distribution Learning: Your Preference Model is Explicitly a Language Model (Poster) | Jihun Yun · Juno Kim · Jongho Park · Junhyuck Kim · Jongha (Jon) Ryu · Jaewoong Cho · Kwang-Sung Jun
→ Rewrite-to-Rank: Optimizing Ad Visibility via Retrieval-Aware Text Rewriting (Poster) | Chloe Ho · Ishneet Singh · Diya Sharma · Tanvi Anumandla · Michael Lu · Vasu Sharma · Kevin Zhu
→ Mimicking Human Intuition: Cognitive Belief-Driven Reinforcement Learning (Poster) | Xingrui Gu · Guanren Qiao · Chuyi Jiang
→ Robust Multi-Objective Controlled Decoding of Large Language Models (Poster) | Seongho Son · William Bankes · Sangwoong Yoon · Shyam Sundhar Ramesh · Xiaohang Tang · Ilija Bogunovic
→ CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics (Poster) | Perampalli Shravan Nayak · Mehar Bhatia · Xiaofeng Zhang · Verena Rieser · Lisa Hendricks · Sjoerd van Steenkiste · Yash Goyal · Karolina Stanczak · Aishwarya Agrawal
→ In-Context Alignment at Scale: When More is Less (Poster) | Neelabh Madan · Lakshmi Subramanian
→ Dynamic Guardian Models: Realtime Content Moderation With User-Defined Policies (Poster) | Monte Hoover · Vatsal Baherwani · Neel Jain · Khalid Saifullah · Joseph Vincent · Chirag Jain · Melissa Rad · C. Bayan Bruss · Ashwinee Panda · Tom Goldstein
→ On the strength of Goodhart's law (Poster) | Adrien Majka · Wassim Bouaziz · El-Mahdi El-Mhamdi
→ Reasoning Isn't Enough: Examining Truth-Bias and Sycophancy in LLMs (Poster) | Emilio Barkett · Olivia Long · Madhavendra Thakur
→ Angular Steering: Behavior Control via Rotation in Activation Space (Poster) | Hieu M. Vu · Tan Nguyen
→ A Unified Perspective on Reward Distillation Through Ratio Matching (Poster) | Kenan Hasanaliyev · Schwinn Saereesitthipitak · Rohan Sanda
→ Geometry-Aware Preference Learning for 3D Texture Generation (Poster) | AmirHossein Zamani · Tianhao Xie · Amir Aghdam · Tiberiu Popa · Eugene Belilovsky
→ What Matters when Modeling Human Behavior using Imitation Learning? (Poster) | Aneri Muni · Esther Derman · Vincent Taboga · Pierre-Luc Bacon · Erick Delage
→ Expected Reward Prediction, with Applications to Model Routing (Poster) | Kenan Hasanaliyev · Silas Alberti · Jenny Hamer · Dheeraj Rajagopal · Kevin Robinson · Jasper Snoek · Victor Veitch · Alexander D'Amour
→ Language Model Personalization via Reward Factorization (Poster) | Idan Shenfeld · Felix Faltings · Pulkit Agrawal · Aldo Pacchiano
→ Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning (Poster) | Kai Ye · Hongyi Zhou · Jin Zhu · Francesco Quinzan · Chengchun Shi
→ Theoretical Analysis of KL-regularized RLHF with Multiple Reference Models (Poster) | Gholamali Aminian · Amir R. Asadi · Idan Shenfeld · Youssef Mroueh
→ Alignment of Large Language Models with Constrained Learning (Poster) | Botong Zhang · Shuo Li · Ignacio Hounie · Osbert Bastani · Dongsheng Ding · Alejandro Ribeiro
→ Robust Reward Modeling via Causal Rubrics (Poster) | Pragya Srivastava · Harman Singh · Rahul Madhavan · Gandharv Patil · Sravanti Addepalli · Arun Sai Suggala · Rengarajan Aravamudhan · Soumya Sharma · Anirban Laha · Aravindan Raghuveer · Karthikeyan Shanmugam · Doina Precup
→ ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment (Poster) | Xiaoqiang Lin · Arun Verma · Zhongxiang Dai · Daniela Rus · See-Kiong Ng · Bryan Kian Hsiang Low
→ Mechanism Design for Alignment via Human Feedback (Poster) | Julian Manyika · Michael Wooldridge · Jiarui Gan
→ Composition and Alignment of Diffusion Models using Constrained Learning (Poster) | Shervin Khalafi · Ignacio Hounie · Dongsheng Ding · Alejandro Ribeiro