Extend your code model evaluation with cost + latency + hallucination metrics
Hi BigCode team and community 👋
BigCodeBench is the reference benchmark for code generation quality. For engineering teams choosing a code LLM for production use, accuracy is table stakes; cost and latency decide the final pick.
I built an open-source LLM Evaluation Framework that wraps any benchmark and reports 5 metrics from the same run:
- 💰 Cost per 1K tokens: across all major code model providers
- ⚡ Latency p50/p95/p99: tail latency matters for IDE integrations
- 🔍 Hallucination Rate: detects overconfident wrong code explanations
- 🎯 Accuracy: 4-strategy scorer compatible with multiple-choice (MC) format tasks
- 🧠 Reasoning Quality: scores chain-of-thought (CoT) depth in code explanations
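
To make the cost and latency metrics concrete, here is a minimal sketch of how they could be computed from per-request records. `CallRecord`, `latency_percentiles`, and `cost_per_1k_tokens` are hypothetical names for illustration, not the framework's actual API, and the framework may compute these differently.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class CallRecord:
    """One model request (hypothetical record shape, for illustration only)."""
    latency_s: float   # wall-clock time of the request, in seconds
    total_tokens: int  # prompt + completion tokens
    cost_usd: float    # billed cost of the request, in USD


def latency_percentiles(records):
    """Return p50/p95/p99 latency in seconds (needs at least two records)."""
    # quantiles(n=100) returns 99 cut points; cut point i-1 is the i-th percentile.
    cuts = quantiles([r.latency_s for r in records], n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def cost_per_1k_tokens(records):
    """Return the average cost in USD per 1,000 tokens across all records."""
    total_tokens = sum(r.total_tokens for r in records)
    if total_tokens == 0:
        raise ValueError("no tokens recorded")
    return 1000 * sum(r.cost_usd for r in records) / total_tokens
```

Reporting p95 and p99 next to the median matters here because a model with a good median can still stall an IDE completion on its slowest requests.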
Works with any LiteLLM-compatible model, including all models on this leaderboard.
Live demo (no API key): https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework
Would love feedback from the BigCode community on extending it with code-specific accuracy metrics!