Poster
The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination
Yifan Sun · Han Wang · Dongbai Li · Gang Wang · Huan Zhang
East Exhibition Hall A-B #E-2607
Benchmark Data Contamination (BDC)—the inclusion of benchmark testing samples in the training set—has raised increasing concerns in Large Language Model (LLM) evaluation, leading to falsely inflated performance estimates and undermining evaluation reliability. To address this, researchers have proposed various mitigation strategies to update existing benchmarks, including modifying original questions or generating new ones based on them. However, a rigorous examination of the effectiveness of these mitigation strategies remains lacking. In this paper, we design a systematic and controlled pipeline along with two novel metrics—fidelity and contamination resistance—to provide a fine-grained and comprehensive assessment of existing BDC mitigation strategies. Previous assessment methods, such as accuracy drop and accuracy matching, focus solely on aggregate accuracy, often leading to incomplete or misleading conclusions. Our metrics address this limitation by emphasizing question-level evaluation result matching. Extensive experiments with 10 LLMs, 5 benchmarks, 20 BDC mitigation strategies, and 2 contamination scenarios reveal that no existing strategy effectively balances fidelity and contamination resistance. No semantic-preserving strategy yields a significant improvement in resistance over the vanilla case (i.e., no benchmark update) across all benchmarks, while semantic-altering strategies sacrifice fidelity for resistance. These findings underscore the urgent need for designing more effective BDC mitigation strategies. Our code repository is available at https://github.com/ASTRAL-Group/BDCmitigationassessment.
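The exact formulations of fidelity and contamination resistance are given in the paper; as a rough illustration of what question-level evaluation result matching means (in contrast to comparing aggregate accuracies, as in accuracy drop or accuracy matching), the sketch below computes agreement rates over hypothetical per-question correctness vectors. The function name, the toy data, and the specific pairing of evaluation runs are illustrative assumptions, not the paper's definitions; the updated questions are assumed to correspond one-to-one with the original ones so the comparison is question-aligned.

```python
def match_rate(run_a, run_b):
    """Fraction of questions whose correctness label (True/False) agrees
    between two evaluation runs over the same (aligned) question set."""
    if len(run_a) != len(run_b):
        raise ValueError("runs must cover the same questions")
    return sum(a == b for a, b in zip(run_a, run_b)) / len(run_a)

# Hypothetical per-question correctness vectors (True = answered correctly).
clean_on_original = [True, False, True, True, False]   # uncontaminated model, original benchmark
clean_on_updated  = [True, False, True, False, False]  # uncontaminated model, updated benchmark
contam_on_updated = [True, True,  True, False, False]  # contaminated model, updated benchmark

# Fidelity (illustrative): the updated benchmark should reproduce, question by
# question, the outcomes the original benchmark gives for an uncontaminated model.
fidelity = match_rate(clean_on_updated, clean_on_original)

# Contamination resistance (illustrative): contamination should leave the
# question-level outcomes on the updated benchmark unchanged.
resistance = match_rate(contam_on_updated, clean_on_updated)

print(f"fidelity = {fidelity:.2f}, resistance = {resistance:.2f}")
```

Unlike aggregate-accuracy comparisons, such question-level match rates can flag cases where overall accuracy stays roughly the same while the set of questions answered correctly shifts.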
Modern language models often appear to perform impressively on evaluation tests. However, many of these test questions are unintentionally included in the models' training data—a problem known as benchmark data contamination. This leads to overly optimistic scores and unreliable assessments of what the models truly understand. To address this, researchers have proposed ways to revise test questions or create new ones. But are these fixes actually effective? In this paper, we conduct a thorough study of 20 different contamination-mitigation methods, testing them on five evaluation sets and ten language models. We introduce two new ways to measure how well these methods work: whether they preserve the original test's intent, and whether they resist the effects of contamination. Our findings show that most current approaches fail to strike a good balance: some make the test cleaner but change its meaning, while others preserve meaning but do little to reduce contamination. This highlights the need for better solutions to ensure fair and trustworthy evaluation of language models.