ICML Discovering Forbidden Topics in Language Models

Poster
in
Workshop: Actionable Interpretability

Discovering Forbidden Topics in Language Models

Can Rager · Chris Wendler · Rohit Gandikota · David Bau

[ Abstract ] [ Project Page ]

[ OpenReview]

Sat 19 Jul 10:40 a.m. PDT — 11:40 a.m. PDT

Abstract:

Refusal discovery is the task of identifying the full set of topics that a language model refuses to discuss. As most LLMs have undisclosed training processes, it is difficult to discover undesired refusal behavior that may have been purposefully baked in or learned by accident. We introduce this new problem setting and develop a refusal discovery method, LLM-crawler, that uses token prefilling to find forbidden topics. We benchmark the LLM-crawler on Tulu-3-8B, an open-source model with public safety tuning data. Our crawler manages to retrieve 31 out of 36 topics within a budget of 1000 prompts. Next, we scale the crawl to a frontier model using the prefilling option of Claude-Haiku. Finally, we crawl three widely used open-weight models: Llama-3.3-70B and two of its variants finetuned for reasoning: DeepSeek-R1-70B and Perplexity-R1-1776-70B. DeepSeek-R1-70B reveals patterns consistent with censorship tuning: The model exhibits “thought suppression” behavior that indicates memorization of CCP-aligned responses. Although Perplexity-R1-1776-70B is claimed to be decensored, LLM-crawler elicits censored answers of the quantized model. Our findings highlight the critical need for refusal discovery methods to detect biases, boundaries, and alignment failures of AI systems.

Chat is not available.

Poster in Workshop: Actionable Interpretability

Discovering Forbidden Topics in Language Models

Can Rager · Chris Wendler · Rohit Gandikota · David Bau

Poster
in
Workshop: Actionable Interpretability