Extend your code model evaluation with cost + latency + hallucination metrics

#13
by vigneshwar234 - opened

Hi BigCode team and community 👋

BigCodeBench is the reference benchmark for code generation quality. For engineering teams choosing a code LLM for production use, accuracy is table stakes; cost and latency decide the final pick.

I built an open-source LLM Evaluation Framework that wraps any benchmark and reports five metrics simultaneously:

→ 💰 Cost per 1K tokens: across all major code model providers
→ ⚡ Latency p50/p95/p99: tail latency matters for IDE integrations
→ 🔍 Hallucination Rate: detects overconfident wrong code explanations
→ 🎯 Accuracy: 4-strategy scorer compatible with MC-format tasks
→ 🧠 Reasoning Quality: scores CoT depth in code explanations
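
To make the cost and latency metrics concrete, here is a minimal sketch of how such numbers could be computed from raw per-request measurements. It is not the framework's actual code: the function names (`percentile`, `latency_summary`, `cost_per_1k_tokens`) and the sample data are hypothetical, and the nearest-rank percentile method is an assumption chosen for simplicity.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Return p50/p95/p99 for a list of per-request latencies (ms)."""
    return {f"p{q}": percentile(latencies_ms, q) for q in (50, 95, 99)}


def cost_per_1k_tokens(total_cost_usd, total_tokens):
    """Normalize total spend to a cost per 1,000 tokens."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return total_cost_usd / total_tokens * 1000


# Hypothetical measurements from ten requests to one model.
latencies_ms = [120, 135, 150, 160, 180, 210, 250, 400, 900, 2500]
summary = latency_summary(latencies_ms)
unit_cost = cost_per_1k_tokens(total_cost_usd=0.42, total_tokens=210_000)

print(summary)
print(f"${unit_cost:.4f} per 1K tokens")
```

With only ten samples, p95 and p99 both land on the single slowest request, which is why tail percentiles need a reasonably large number of requests before they say anything useful.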

Works with any LiteLLM-compatible model, including all models on this leaderboard.

Live demo (no API key): https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework

Would love feedback from the BigCode community on extending it with code-specific accuracy metrics!
