

Poster in Affinity Workshop: New In ML

Environment Free Coding Benchmarks: Evaluating Language Model Coding Capabilities without a Dedicated Environment

Laurence Liang


Abstract:

The increasing adoption of language models for coding tasks has prompted researchers to develop coding benchmarks to better assess and quantify a language model's coding abilities across a variety of tasks. Existing benchmarks effectively evaluate code generation and code understanding, but they typically require an external environment to verify code, which can slow down and complicate model evaluation. This paper presents the Environment-Free Coding Benchmarks (EFCB) suite - a collection of 5,512 questions drawn from real-world GitHub pull requests - that offers several advantages over existing coding benchmarks: it eliminates the need for an external coding environment, provides a larger and more diverse question bank spanning multiple programming languages and industry use cases, and includes a multi-faceted collection of tasks that evaluate different indicators of a model's coding ability. Evaluating o4-mini and Llama-3.3-70B, two state-of-the-art (SOTA) models, on EFCB, we observe that current SOTA models achieve approximately uniform performance across programming languages and use cases, and we identify areas for improvement, as current EFCB results have not yet reached benchmark saturation.
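The abstract states that EFCB verifies model answers without an external execution environment, but it does not spell out the scoring mechanism. The following is only a minimal sketch of what such an environment-free evaluation loop could look like, assuming each benchmark item pairs a pull-request-derived prompt with a textual reference answer that is checked by text comparison rather than by running code; the file layout, the field names prompt and reference, and the query_model helper are hypothetical placeholders, not part of EFCB.

    # Minimal sketch of an environment-free evaluation loop (assumptions noted above).
    import json

    def normalize(text: str) -> str:
        """Collapse whitespace so formatting differences are not scored as errors."""
        return " ".join(text.split())

    def evaluate(items_path: str, query_model) -> float:
        """Score a model on benchmark items without executing any generated code."""
        with open(items_path) as f:
            items = [json.loads(line) for line in f]  # one JSON item per line (assumed format)

        correct = 0
        for item in items:
            prediction = query_model(item["prompt"])  # model's answer as plain text
            if normalize(prediction) == normalize(item["reference"]):
                correct += 1
        return correct / len(items)

Because no sandbox or interpreter is involved, a loop of this kind can be run anywhere the model can be queried, which is the practical advantage the abstract attributes to removing the external environment.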
