Use the system prompt when reproducing reasoning benchmarks (IMO-AnswerBench, ..)

#24
by ychenNLP - opened

Reproducibility Tip: Use the System Prompt for Math Benchmarks

We've noticed that some community benchmarks report significantly lower math scores for Nemotron-Cascade-2-30B-A3B than what we observe internally. After investigating, we found that the system prompt matters a lot for this model's math performance.

Recommended Prompt Template

```yaml
system: |-
  You are a helpful and harmless assistant.\n\nYou are not allowed to use any tools.

user: |-
  {problem}\n\nPlease place your final answer inside \boxed{}.
```
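In an OpenAI-compatible chat request (e.g. against a vLLM server), the template above maps onto the messages list roughly as follows. This is a minimal sketch; the helper name is ours, and the `\n\n` sequences in the template are rendered as real newlines:

```python
# Sketch: build chat messages for one benchmark problem using the
# recommended system prompt. `build_messages` is an illustrative helper,
# not part of the benchmark's official harness.

SYSTEM_PROMPT = (
    "You are a helpful and harmless assistant.\n\n"
    "You are not allowed to use any tools."
)

# {{}} escapes to a literal {} so str.format leaves \boxed{} intact.
USER_TEMPLATE = "{problem}\n\nPlease place your final answer inside \\boxed{{}}."


def build_messages(problem: str) -> list[dict]:
    """Return OpenAI-style chat messages for a single math problem."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_TEMPLATE.format(problem=problem)},
    ]


msgs = build_messages("Compute 1 + 1.")
print(msgs[0]["role"])  # → system
```

The same messages list can then be passed unchanged to any OpenAI-compatible `chat.completions` endpoint.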

IMO AnswerBench Results (data link)

| System prompt | Avg. tokens | Accuracy |
|---------------|-------------|----------|
| No            | ~16k        | 64.9%    |
| Yes           | ~63k        | 82.5%    |

Other configs

  • vLLM: `--mamba_ssm_cache_dtype float32`
  • temperature 1.0
  • top-p 1.0
  • max output length 256k
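Put together, a serving command along these lines matches the settings above. The model ID is illustrative (adjust to your checkpoint), the cache-dtype flag is spelled as in this post, and the sampling parameters (temperature, top-p, max output tokens) go in the request rather than the launch command:

```shell
# Serve with the Mamba SSM cache kept in float32, as recommended above.
# 262144 = 256k context; model path is illustrative.
vllm serve nvidia/Nemotron-Cascade-2-30B-A3B \
    --mamba_ssm_cache_dtype float32 \
    --max-model-len 262144
```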

Longer generation budgets allow the model to think more deeply, and performance scales accordingly.

If you're benchmarking this model on math and seeing lower-than-expected numbers, the system prompt is the first thing to check.
