Poster
Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing
Han Jiang · Xiaoyuan Yi · Zhihua Wei · Ziang Xiao · Shu Wang · Xing Xie
East Exhibition Hall A-B #E-1910
Warning: Contains harmful model outputs.

Despite significant advancements, the propensity of Large Language Models (LLMs) to generate harmful and unethical content poses critical challenges. Measuring the value alignment of LLMs therefore becomes crucial for their regulation and responsible deployment. Although numerous benchmarks have been constructed to assess social bias, toxicity, and ethical issues in LLMs, such static benchmarks suffer from evaluation chronoeffect: as models rapidly evolve, existing benchmarks may leak into training data or become saturated, overestimating ever-developing LLMs. To tackle this problem, we propose GETA, a novel generative evolving testing approach grounded in adaptive testing methods from measurement theory. Unlike traditional adaptive testing methods that rely on a static pool of test items, GETA probes the underlying moral boundaries of LLMs by dynamically generating test items tailored to each model's capability. GETA co-evolves with LLMs by learning a joint distribution over item difficulty and model value conformity, thus effectively addressing evaluation chronoeffect. We evaluated various popular LLMs with GETA and demonstrated that 1) GETA can dynamically create difficulty-tailored test items and 2) GETA's evaluation results are more consistent with models' performance on unseen out-of-distribution (OOD) and i.i.d. items, laying the groundwork for future evaluation paradigms.
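To make the measurement-theory backbone concrete, the sketch below shows a classic computerized adaptive testing (CAT) loop under a two-parameter logistic (2PL) item response theory model, i.e., the static-pool paradigm that GETA generalizes by generating items instead of drawing them from a fixed pool. This is an illustrative sketch under stated assumptions, not the paper's implementation: `query_model`, the item pool, the ability grid, and all parameter values are hypothetical.

```python
import numpy as np

def p_conform(theta, a, b):
    """2PL IRT: probability that a model with ability theta gives a
    value-conforming answer to an item with discrimination a, difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_theta(items, responses, grid=np.linspace(-4.0, 4.0, 401)):
    """Grid-search maximum-likelihood estimate of ability theta
    from the (item, response) pairs observed so far."""
    log_lik = np.zeros_like(grid)
    for (a, b), y in zip(items, responses):
        p = np.clip(p_conform(grid, a, b), 1e-9, 1.0 - 1e-9)
        log_lik += y * np.log(p) + (1 - y) * np.log(1.0 - p)
    return grid[int(np.argmax(log_lik))]

def adaptive_test(query_model, item_pool, n_steps=20):
    """Static-pool CAT loop: at each step administer the item whose difficulty
    is closest to the current ability estimate (most informative under 2PL
    with comparable discriminations), then re-estimate theta."""
    theta, asked, responses = 0.0, [], []
    remaining = list(item_pool)
    for _ in range(min(n_steps, len(remaining))):
        idx = min(range(len(remaining)),
                  key=lambda i: abs(remaining[i][1] - theta))
        item = remaining.pop(idx)
        y = query_model(item)           # 1 = safe/aligned answer, 0 = violation
        asked.append(item)
        responses.append(y)
        theta = estimate_theta(asked, responses)
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical calibrated pool: (discrimination, difficulty) per item.
    pool = [(1.0, b) for b in np.linspace(-3.0, 3.0, 60)]
    true_theta = 1.2                    # simulated model's latent value conformity
    simulated = lambda item: int(rng.random() < p_conform(true_theta, *item))
    print(adaptive_test(simulated, pool))  # estimate converges toward true_theta
```

GETA's departure from this loop, as the abstract describes, is to replace the fixed `item_pool` with a generator that produces new items at the difficulty the current estimate calls for, so the test co-evolves with the model instead of saturating or leaking into training data.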
Large language models (LLMs), like ChatGPT, have made big strides in recent years, but they still sometimes produce harmful or unethical content. To use these tools responsibly, we need reliable ways to measure how well they align with human values. Unfortunately, most current evaluation methods rely on fixed test sets that can become outdated or even show up in the models' training data, making them less useful over time.

Our research introduces a new approach called GETA (Generative Evolving Testing Approach), which solves this problem by generating fresh, tailored test questions on the fly. Inspired by adaptive testing methods used in education, GETA adjusts the difficulty of each question to the model's abilities and adapts as models evolve.

We tested GETA on several popular LLMs and found it could better track how well these models behave in unfamiliar situations, something static tests struggle with. This work matters because it offers a smarter, more future-proof way to evaluate the ethical behavior of LLMs, helping ensure they're safe and trustworthy as they continue to grow more powerful.