Use the system prompt when reproducing reasoning benchmarks (IMO-AnswerBench, ...)
#24
by ychenNLP - opened
Reproducibility Tip: Use the System Prompt for Math Benchmarks
We've noticed that some community benchmarks report significantly lower math scores for Nemotron-Cascade-2-30B-A3B than what we observe internally. After investigating, we found that the system prompt matters a lot for this model's math performance.
Recommended Prompt Template
```yaml
system: |-
  You are a helpful and harmless assistant.\n\nYou are not allowed to use any tools.
user: |-
  {problem}\n\nPlease place your final answer inside \boxed{}.
```
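As a sketch, the template above can be assembled into chat messages for an OpenAI-compatible endpoint (`build_messages` is a hypothetical helper, not part of any official harness; the request call itself is omitted):

```python
# Sketch: build chat messages matching the recommended template.
# build_messages is a hypothetical helper name, not an official API.

SYSTEM_PROMPT = (
    "You are a helpful and harmless assistant.\n\n"
    "You are not allowed to use any tools."
)

USER_TEMPLATE = "{problem}\n\nPlease place your final answer inside \\boxed{{}}."

def build_messages(problem: str) -> list[dict]:
    """Return chat messages carrying the recommended system prompt."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_TEMPLATE.format(problem=problem)},
    ]

msgs = build_messages("Compute 1 + 1.")
print(msgs[0]["role"])  # system
```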
IMO AnswerBench Results (data link)
| System prompt used | Avg output tokens | Accuracy |
|---|---|---|
| No | ~16k | 64.9% |
| Yes | ~63k | 82.5% |
Other configs
- vLLM: `--mamba_ssm_cache_dtype float32`
- temperature: 1.0
- top-p: 1.0
- max output length: 256k
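Putting the configs together, a serving command might look like the sketch below (the model name and `--max-model-len` value are assumptions; adjust to your checkpoint and hardware):

```shell
# Sketch of a vLLM launch with the settings above.
# Model name and --max-model-len are assumptions, not from the post.
vllm serve nvidia/Nemotron-Cascade-2-30B-A3B \
  --mamba_ssm_cache_dtype float32 \
  --max-model-len 262144

# Sampling settings go on each request, not the server:
#   temperature=1.0, top_p=1.0, max_tokens up to the 256k budget
```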
Longer generation budgets allow the model to think more deeply, and performance scales accordingly.
If you're benchmarking this model on math and seeing lower-than-expected numbers, the system prompt is the first thing to check.