Use the system prompt when reproducing reasoning benchmarks (IMO-AnswerBench, ..)

#24
by ychenNLP - opened

Reproducibility Tip: Use the System Prompt for Math Benchmarks

We've noticed that some community benchmarks report significantly lower math scores for Nemotron-Cascade-2-30B-A3B than what we observe internally. After investigating, we found that the system prompt matters a lot for this model's math performance.

Recommended Prompt Template

```yaml
system: |-
  You are a helpful and harmless assistant.\n\nYou are not allowed to use any tools.

user: |-
  {problem}\n\nPlease place your final answer inside \boxed{}.
```
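In an OpenAI-compatible chat request (e.g. against a vLLM server), the template above maps onto the messages list roughly as follows. This is a minimal sketch; the helper name is ours, and the `\n\n` sequences in the template are rendered as real newlines:

```python
# Sketch: build chat messages for one benchmark problem using the
# recommended system prompt. `build_messages` is an illustrative helper,
# not part of the benchmark's official harness.

SYSTEM_PROMPT = (
    "You are a helpful and harmless assistant.\n\n"
    "You are not allowed to use any tools."
)

# {{}} escapes to a literal {} so str.format leaves \boxed{} intact.
USER_TEMPLATE = "{problem}\n\nPlease place your final answer inside \\boxed{{}}."


def build_messages(problem: str) -> list[dict]:
    """Return OpenAI-style chat messages for a single math problem."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_TEMPLATE.format(problem=problem)},
    ]


msgs = build_messages("Compute 1 + 1.")
print(msgs[0]["role"])  # → system
```

The same messages list can then be passed unchanged to any OpenAI-compatible `chat.completions` endpoint.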

IMO AnswerBench Results (data link)

| System prompt | Avg. tokens | Accuracy |
|---------------|-------------|----------|
| No            | ~16k        | 64.9%    |
| Yes           | ~63k        | 82.5%    |

Other configs

  • vLLM: `--mamba_ssm_cache_dtype float32`
  • temperature 1.0
  • top-p 1.0
  • max output length 256k
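Put together, a serving command along these lines matches the settings above. The model ID is illustrative (adjust to your checkpoint), the cache-dtype flag is spelled as in this post, and the sampling parameters (temperature, top-p, max output tokens) go in the request rather than the launch command:

```shell
# Serve with the Mamba SSM cache kept in float32, as recommended above.
# 262144 = 256k context; model path is illustrative.
vllm serve nvidia/Nemotron-Cascade-2-30B-A3B \
    --mamba_ssm_cache_dtype float32 \
    --max-model-len 262144
```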

Longer generation budgets allow the model to think more deeply, and performance scales accordingly.

If you're benchmarking this model on math and seeing lower-than-expected numbers, the system prompt is the first thing to check.
