

Poster in Workshop: 2nd Workshop on Models of Human Feedback for AI Alignment (MoFA)

Auto-Guideline Alignment: Probing and Simulating Human Ideological Preferences in LLMs via Prompt Engineering

Chien-Hua Chen · Chang Chih Meng · Li-Ni Fu · Hen-Hsen Huang · I-Chen Wu


Abstract: Aligning large language models (LLMs) with human values usually requires expensive reinforcement learning from human feedback. We introduce Auto-Guideline Alignment (AGA), a prompt-only framework that uncovers, audits, and steers *hidden* ideological preferences by treating concise, human-readable guidelines as transparent reward proxies, without any parameter updates. To evaluate AGA, we employ **GPT-4.1** to generate a dataset of 600 left/right political dilemmas covering 30 topics (five domains $\times$ six subdomains). Three experiments show that: (1) LLMs exhibit a consistent left-leaning bias; (2) AGA aligns models to all $2^5$ domain-level ideology mixtures, with degraded performance under cross-domain conflict; and (3) intra-domain stance fragmentation leads to unstable alignment and reduced accuracy. Overall, AGA delivers scalable, transparent, and reproducible value alignment, replacing costly human labeling with explicit rules and iterative self-evaluation.
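
The abstract does not reproduce the prompts themselves, so the following is only a minimal sketch, under assumptions, of what a prompt-only guideline-alignment loop of this kind might look like: a human-readable guideline serves as the reward proxy, and the model iteratively self-evaluates its own compliance. The `ask` and `align_with_guideline` helpers, the guideline wording, and the iteration budget are hypothetical illustrations, not the authors' protocol.

```python
# Minimal sketch of a prompt-only guideline-alignment loop in the spirit of AGA.
# Guideline text, dilemma, helper names, and the retry budget are illustrative
# assumptions; the paper's actual prompts and evaluation protocol may differ.
from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4.1"   # the abstract reports GPT-4.1 as the underlying model


def ask(prompt: str) -> str:
    """Single chat-completion call; every AGA step reduces to plain prompting."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()


def align_with_guideline(guideline: str, dilemma: str, max_rounds: int = 3) -> str:
    """Steer the model's stance with a human-readable guideline (a transparent
    reward proxy) and iteratively self-evaluate, with no parameter updates."""
    answer = ask(
        f"Follow this guideline when answering.\nGuideline: {guideline}\n\n"
        f"Dilemma: {dilemma}\n"
        "Answer with 'left' or 'right' and one sentence of reasoning."
    )
    for _ in range(max_rounds):
        verdict = ask(
            f"Guideline: {guideline}\nDilemma: {dilemma}\nAnswer: {answer}\n"
            "Does the answer comply with the guideline? Reply 'yes' or 'no' "
            "followed by a brief critique."
        )
        if verdict.lower().startswith("yes"):
            break
        answer = ask(
            f"Guideline: {guideline}\nDilemma: {dilemma}\n"
            f"Previous answer: {answer}\nCritique: {verdict}\n"
            "Revise the answer so it complies with the guideline."
        )
    return answer


if __name__ == "__main__":
    guideline = "On economic policy, favour market-oriented (right-leaning) positions."
    dilemma = "Should the government raise the minimum wage or leave wages to market forces?"
    print(align_with_guideline(guideline, dilemma))
```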
