Poster in Workshop: DataWorld: Unifying data curation frameworks across domains
EditBench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits
Wayne Chi · Valerie Chen · Ryan Shar · Aditya Mittal · Jenny Liang · Wei-Lin Chiang · Anastasios Angelopoulos · Ion Stoica · Graham Neubig · Ameet Talwalkar · Chris Donahue
Keywords: [ real-world ] [ edit ] [ code edit ] [ llm ] [ code ]
Instructed code editing, where LLMs directly modify a developer's existing code based on a user instruction, is becoming a widely used interaction mode in AI coding assistants. However, few benchmarks directly evaluate this capability, and current datasets often rely on artificial sources. We introduce EditBench, a benchmark for evaluating LLM code editing capabilities grounded in real-world usage, i.e., user instructions and code contexts collected in the wild. We propose a pipeline that combines automatic test generation via an agentic workflow with human-in-the-loop validation and revision to create test cases for each collected pair of user instruction and code context. EditBench covers a diverse range of real-world use cases, from resolving errors to adding features. We evaluate 17 diverse LLMs and observe that EditBench is challenging: even the best state-of-the-art models score below 60%. Further, we find that model performance varies across categories of user instructions, indicating room for improvement. We design the EditBench data pipeline as a renewable pathway for live dataset construction that mitigates training data contamination.
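For illustration only, a minimal Python sketch of the two-stage pipeline described in the abstract: an agentic step drafts candidate tests for each collected instruction and code context, and a human-in-the-loop step validates or revises them. All names here (EditTask, generate_candidate_tests, human_review, build_benchmark) are hypothetical placeholders under assumed data shapes, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class EditTask:
    """One collected real-world example: a user instruction plus its code context."""
    instruction: str                                  # e.g., "fix the off-by-one error in the loop"
    code_context: str                                 # the developer's original code
    tests: list[str] = field(default_factory=list)    # validated test cases for this task

def generate_candidate_tests(task: EditTask, n: int = 3) -> list[str]:
    """Hypothetical agentic step: an LLM would draft executable tests for the task.
    Placeholder stubs are returned here so the sketch stays runnable."""
    return [f"# candidate test {i} for: {task.instruction}" for i in range(n)]

def human_review(candidates: list[str]) -> list[str]:
    """Hypothetical human-in-the-loop step: annotators validate or revise each
    candidate test; this stand-in simply keeps all candidates unchanged."""
    return list(candidates)

def build_benchmark(raw_tasks: list[EditTask]) -> list[EditTask]:
    """Pipeline sketch: generate candidate tests per task, keep the human-approved
    ones, and drop tasks that end up with no usable tests."""
    for task in raw_tasks:
        task.tests = human_review(generate_candidate_tests(task))
    return [t for t in raw_tasks if t.tests]

if __name__ == "__main__":
    demo = [EditTask("add input validation", "def add(a, b): return a + b")]
    print(build_benchmark(demo)[0].tests)
```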