bigcodebench-leaderboard

Extend your code model evaluation with cost + latency + hallucination metrics

#13

by vigneshwar234 - opened 3 days ago

Hi BigCode team and community 👋

BigCodeBench is a reference benchmark for code generation quality. For engineering teams choosing a code LLM for production use, though, accuracy is table stakes; cost and latency often decide the final pick.

I built an open-source LLM Evaluation Framework that wraps any benchmark and reports five metrics together:

→ 💰 Cost per 1K tokens — across all major code model providers
→ ⚡ Latency p50/p95/p99 — tail latency matters for IDE integrations
→ 🔍 Hallucination Rate — detects overconfident wrong code explanations
→ 🎯 Accuracy — 4-strategy scorer compatible with MC-format tasks
→ 🧠 Reasoning Quality — scores CoT depth in code explanations
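
To make the cost and latency metrics above concrete, here is a minimal sketch of how they could be aggregated from raw measurements. It is not the framework's actual code: the function names (`percentile`, `latency_summary`, `cost_per_1k_tokens`), the nearest-rank percentile method, and the sample numbers are all assumptions chosen for illustration.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_summary(latencies_s):
    """Return p50/p95/p99 for a list of per-request latencies in seconds."""
    return {f"p{q}": percentile(latencies_s, q) for q in (50, 95, 99)}


def cost_per_1k_tokens(total_cost_usd, total_tokens):
    """Normalize total spend to a cost per 1,000 tokens."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return total_cost_usd / total_tokens * 1000


# Hypothetical measurements, for illustration only.
latencies = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 2.5, 6.0]
summary = latency_summary(latencies)
unit_cost = cost_per_1k_tokens(total_cost_usd=0.03, total_tokens=12_000)
print(summary, unit_cost)
```

With only ten samples, p95 and p99 both land on the single slowest request, which is why tail percentiles need a reasonably large number of requests before they say much about an IDE integration.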

Works with any LiteLLM-compatible model, including all models on this leaderboard.
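
As a rough illustration of what "wrapping" a model call for latency measurement might look like, the sketch below times an arbitrary callable. The names `timed_call`, `collect_latencies`, and `fake_model` are hypothetical and not taken from the framework; in practice the callable might wrap a LiteLLM request, but that wiring is not shown here.

```python
import time


def timed_call(generate, prompt):
    """Call generate(prompt) and return (output, elapsed_seconds)."""
    start = time.perf_counter()
    output = generate(prompt)
    elapsed = time.perf_counter() - start
    return output, elapsed


def collect_latencies(generate, prompts):
    """Run every prompt through generate and record output plus latency."""
    records = []
    for prompt in prompts:
        output, elapsed = timed_call(generate, prompt)
        records.append({"prompt": prompt, "output": output, "latency_s": elapsed})
    return records


def fake_model(prompt):
    # Hypothetical stand-in for a real model client; returns a canned string.
    return f"# solution for: {prompt}"


records = collect_latencies(fake_model, ["reverse a string", "parse a CSV file"])
for record in records:
    print(record["prompt"], record["latency_s"])
```

Keeping the model behind a plain callable means the same timing code works regardless of provider, at the cost of not capturing provider-specific details such as time to first token.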

Live demo (no API key): https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework

I would love feedback from the BigCode community on extending it with code-specific accuracy metrics!
