BigCodeArena
Compare two AI models by sending them code and seeing their responses
Inspired by LMArena for LLMs, we've built a platform that allows anyone to compare code generation models side-by-side, but with a crucial difference: you can actually run the code and see what it produces. Just submit a coding task, watch two different models generate solutions, execute both programs, and vote on which model produced better results. The outcomes are organized into a leaderboard that displays the community's highest-rated models.
The field of code generation has long struggled with reliable evaluation methods. Traditional benchmarks like HumanEval test code against predefined test cases, but these represent only a tiny fraction of real-world programming tasks. Human evaluation platforms exist for general chatbots, but they fall short for code: reading raw source code and mentally simulating its execution is cognitively demanding and error-prone, especially for longer programs or complex UI applications.
Consider this scenario:
You ask two AI models to build a responsive photo gallery website. Both generate code that looks syntactically correct. But which one is actually better? Without running the code, it's nearly impossible to tell. One might produce a beautiful, functional grid layout, while the other might have subtle bugs or poor styling that only become apparent when rendered in a browser.
This observation led us to a key insight: execution feedback is essential for humans to judge code quality reliably. That's exactly what BigCodeArena provides.
BigCodeArena extends the Chatbot Arena framework with powerful features specifically designed for code evaluation:
Every code snippet generated by models is automatically executed in isolated sandbox environments. Whether it's a Python script, a React web app, a PyGame game, or a C++ algorithm, you can see the actual output, not just the source code.
We currently support 10 languages (Python, JavaScript, TypeScript, HTML, C, C++, Java, Go, Rust, and Markdown) and 8 execution environments:
Unlike static code comparison, you can actually interact with the generated applications:
Real programming isn't one-and-done. BigCodeArena supports multi-turn interactions, allowing you to refine requirements, ask for features to be added, or request bug fixes -- just like working with a real coding assistant.
Since launching in February 2025, BigCodeArena has collected over 14,000 conversations from more than 500 unique users, with 4,700+ high-quality preference votes comparing 10 frontier LLMs.
Our users have explored remarkably diverse coding scenarios:
Python dominates with over 4,000 conversations, followed by JavaScript/TypeScript (3,359), HTML (1,601), and C++ (642). Among frameworks, direct Python interpreters lead usage (6,000 sessions), with React (2,729), Core Web (1,574), Streamlit (1,254), and PyGame (1,087) also seeing heavy use.
Most interactions are focused and efficient: 76% of conversations consist of just 2 turns (one request, one response), with a mean conversation length of 4.12 messages. However, the platform supports extended multi-turn debugging sessions when needed, with some conversations exceeding 10 turns as users refine complex applications.
From our 14K conversations, we filtered for high-quality pairwise comparisons: conversations with at least two turns and actual code execution. This yielded 4,731 voting samples, with each evaluated model receiving at least 700 votes. We aggregate these votes into Elo ratings using the Bradley-Terry model, which estimates the probability that one model beats another based on head-to-head comparisons.
To ensure robust rankings, we use 100 bootstrap resamples to construct 95% confidence intervals, so we can identify statistically significant performance differences between models.
We evaluate models under three settings to control for different factors:
Rankings remain remarkably consistent across all three settings, revealing clear performance tiers:
Top Tier: o3-mini and o1-mini consistently lead with the highest Elo ratings and tight confidence intervals. These models maintain top performance regardless of environment or language constraints, showing strong robustness across coding scenarios. Claude-3.5-Sonnet follows closely, particularly excelling when language is controlled.
Mid Tier: GPT-4o, o1, and Gemini-2.0-Pro/Flash form a competitive middle tier. GPT-4o shows some sensitivity to language matching, suggesting room for improvement in multilingual consistency.
Open Source Models: Qwen2.5 variants and Llama-3.3-70B lag behind frontier proprietary models, highlighting the performance gap that remains between leading closed and open models.
Figure: Overall win rate heatmaps (percentage of all pairwise comparisons won) of each model in the sessions across languages (left) and execution environments (right). For each category, we only keep models that appear in at least 3 conversation sessions.
Breaking down performance by programming language reveals interesting patterns:
Analyzing win rates by execution environment reveals how models handle different runtime contexts:
Robust Performers: o3-mini maintains consistently strong performance across React, Streamlit, Gradio, Core Web, and PyGame, demonstrating excellent environmental adaptability.
Stable but Selective: Claude-3.5-Sonnet and Gemini-2.0-Flash show generally stable performance but with reduced win rates in complex UI-heavy environments like Vue and Mermaid.
Framework-Specific Weaknesses: Qwen2.5 models, while competitive in some web frameworks (Core Web, React), struggle significantly with interactive and visualization-oriented environments like PyGame, Vue, and Mermaid. These environments often require precise handling of control flow, graphics rendering, and package dependencies.
These results highlight an important insight: aggregate Elo scores don't tell the whole story. Some models remain brittle under specific runtime constraints, and execution environment matters significantly for real-world deployment.
To advance research beyond crowdsourced evaluation, we're releasing two complementary benchmarks:
Building on our 4,700+ preference votes, BigCodeReward tests how well LLMs can judge code quality when acting as reward models. The key finding? Execution results dramatically improve judgment accuracy.
When models can see execution outputs (screenshots of web apps, game visuals, program logs), their alignment with human preferences increases substantially:
This reinforces our core thesis: you can't reliably judge code without running it -- and this applies to both humans and AI judges.
Inspired by Arena-Hard-Auto, AutoCodeArena provides a scalable way to evaluate new models without waiting for thousands of human votes. We carefully selected 600 representative prompts from our crowdsourced data, spanning all programming topics and frameworks.
Using automated LLM judges (Claude-3.7-Sonnet) to evaluate code execution results against a GPT-4.1 baseline, we can rapidly benchmark new models. This approach enables weekly leaderboard updates as new models are released.
Our automated benchmark evaluated 20+ cutting-edge models, including recently released systems:
Top Performers:
Figure: Win rates of recent LLMs on AutoCodeArena against a GPT-4.1 baseline, judged by Claude-3.7-Sonnet. The 50% mark represents parity with GPT-4.1. Models above this line outperform the baseline, while those below underperform. Error bars show 95% confidence intervals. Note: Claude-3.7-Sonnet is excluded from rankings to avoid self-judgment bias, and GPT-4.1 appears only as the reference baseline.
The results show that while proprietary models maintain an edge, open-source models are rapidly closing the gap, with some approaching GPT-4.1-level performance.
BigCodeArena is open to everyone -- no account required! Visit https://huggingface.co/spaces/bigcode/arena to:
Whether you're building a React dashboard, creating a PyGame game, solving algorithmic challenges, or generating creative visualizations, BigCodeArena lets you see which models truly deliver.
Following the BigCode Project's commitment to transparency, we're releasing:
We envision BigCodeArena as a long-term project that evolves with the community:
Evaluating code isn't like evaluating text -- you need to run it, test it, and interact with it. BigCodeArena makes this possible at scale, combining human judgment with real execution feedback to create the most reliable evaluation platform for code generation models.
Join us in building the future of code generation evaluation. Write a prompt, compare the models, and vote for your favorite. Your feedback helps the entire community understand which models truly deliver on the promise of AI-assisted programming.
We'd love to hear your feedback! Connect with us on GitHub, join discussions in the Hugging Face Space community tab, or reach out to the BigCode Project at contact@bigcode-project.org.
We thank Leandro von Werra for his valuable suggestions and feedback on the blog.
@article{zhuo2025bigcodearena,
title={BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution},
author={Terry Yue Zhuo, Xiaolong Jin, Hange Liu, Juyong Jiang, Tianyang Liu, Chen Gong, Bhupesh Bishnoi, Vaisakhi Mishra, Marek Suppa, Noah Ziems, Saiteja Utpala, Ming Xu, Guangyu Song, Kaixin Li, Yuhan Cao, Bo Liu, Zheng Liu, Sabina Abdurakhmanova, Wenhao Yu, Mengzhao Jia, Jihan Yao, Kenneth Hamilton, Kumar Shridhar, Minh Chien Vu, Dingmin Wang, Jiawei Liu, Zijian Wang, Qian Liu, Binyuan Hui, Meg Risdal, Ahsen Khaliq, Atin Sood, Zhenchang Xing, Wasi Uddin Ahmad, John Grundy, David Lo, Banghua Zhu, Xiaoning Du, Torsten Scholak, Leandro von Werra},
year={2025}
}
Try BigCodeArena now: Hugging Face Space
Read the paper: Hugging Face
Run the code: GitHub
Explore the collection: Hugging Face Collection
Compare two AI models by sending them code and seeing their responses