Spotlight Poster
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
Samuel Miserendino · Michele Wang · Tejal Patwardhan · Johannes Heidecke
East Exhibition Hall A-B #E-1707
Thu 17 Jul 3:30 p.m. PDT — 4:30 p.m. PDT
Large language models like ChatGPT are getting better at coding, but can they perform real software engineering work? The SWE-Lancer benchmark aims to measure this. We collected nearly 1,500 actual freelance software engineering jobs from Upwork, worth $1 million USD in total payouts, and measured model performance on these tasks. SWE-Lancer tasks consist of both independent engineering work, like bug fixes and feature implementations, and managerial tasks, where models act as a tech lead and pick the best implementation proposal for a given problem. We evaluated multiple state-of-the-art models; none was able to solve the majority of tasks, indicating that models do not yet surpass human performance on real-world software engineering work. To facilitate future research, we open-sourced our evaluation code and a public evaluation split so anyone can run SWE-Lancer. By mapping model performance to real dollars, we hope SWE-Lancer enables further research into the economic impact of AI model development.
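The dollar-based scoring described above can be sketched as follows. This is an illustrative sketch, not the official SWE-Lancer evaluation harness: the task IDs, payout amounts, and the `score_run` helper are all hypothetical, standing in for the benchmark's idea of crediting a model with each task's real payout when it solves that task.

```python
# Illustrative sketch (not the official SWE-Lancer harness): mapping
# pass/fail results to dollar payouts. Task IDs and amounts are hypothetical.

def score_run(payouts: dict[str, float], solved: set[str]) -> tuple[float, float]:
    """Return (dollars earned, fraction of the total payout earned)."""
    earned = sum(amount for task_id, amount in payouts.items() if task_id in solved)
    total = sum(payouts.values())
    return earned, earned / total

payouts = {"bug-fix-101": 250.0, "feature-202": 1000.0, "manager-303": 500.0}
solved = {"bug-fix-101", "manager-303"}  # tasks the model's submissions passed

earned, fraction = score_run(payouts, solved)
print(f"${earned:.2f} of ${sum(payouts.values()):.2f} earned ({fraction:.0%})")
# → $750.00 of $1750.00 earned (43%)
```

Reporting earnings rather than a raw pass rate weights hard, highly paid tasks more heavily, which is what lets the benchmark speak to economic impact.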