ICML Probing Evaluation Awareness of Language Models

Poster
in
Workshop: Workshop on Technical AI Governance

Probing Evaluation Awareness of Language Models

Jord Nguyen · Hoang Khiem · Carlo Attubato · Felix Hofstätter

[ Abstract ] [ Project Page ]

[ OpenReview]

Abstract:

Language models can distinguish between testing and deployment phases—a capability known as evaluation awareness. This has significant safety and policy implications, potentially undermining the reliability of evaluations that are central to AI governance frameworks and voluntary industry commitments. In this paper, we study evaluation awareness in Llama-3.3-70B-Instruct. We train linear probes for test-vs-deployment, and these probes generalise across real-world testing and deployment prompts, suggesting that current models internally represent this distinction. We used probes to determine the authenticity of evaluation prompts. We hypothesise that 'inauthentic' prompts, i.e. prompts which do not look like real-world prompts, would be classified as more test-like by our probe. We find that current deception evaluations are indeed classified as test-like by the probes, suggesting they might already appear artificial to models.

Chat is not available.

Poster in Workshop: Workshop on Technical AI Governance

Probing Evaluation Awareness of Language Models

Jord Nguyen · Hoang Khiem · Carlo Attubato · Felix Hofstätter

Poster
in
Workshop: Workshop on Technical AI Governance