# Qwen3.5-35B-A3B-gemini-3.1-opus-4.6-reasoning
Qwen3.5-35B-A3B-gemini-3.1-opus-4.6-reasoning is a reasoning-focused fine-tune of the Qwen3.5-35B-A3B base model, designed to improve zero-shot general reasoning performance.

This model is trained on the voidful/gemini-3.1-opus-4.6-reasoning-merged dataset. The training objective is simple: improve structured reasoning, reading comprehension, commonsense inference, and science-style QA while keeping the model practical for real-world use.
## Why this model
This model extends the reasoning-oriented training pipeline used for the 27B version to a larger base model (35B-A3B).

The central hypothesis:

> If reasoning supervision is effective, scaling the base model should preserve or further improve zero-shot reasoning performance.

We evaluate this claim on a standard zero-shot benchmark suite.
## Key results
### Zero-shot accuracy (acc)
| Task | Ours | Baseline: Qwen3.5-35B-A3B | Jackrong FT | Gain over Baseline | Gain over Jackrong |
|---|---|---|---|---|---|
| arc_challenge | 0.6152 | 0.5947 | 0.5853 | +0.0205 | +0.0299 |
| arc_easy | 0.8510 | 0.8396 | 0.8439 | +0.0114 | +0.0071 |
| boolq | 0.9043 | 0.8765 | 0.8853 | +0.0278 | +0.0190 |
| hellaswag | 0.6522 | 0.6243 | 0.6244 | +0.0279 | +0.0278 |
| openbookqa | 0.3480 | 0.3400 | 0.3360 | +0.0080 | +0.0120 |
| piqa | 0.8232 | 0.8264 | 0.8221 | -0.0032 | +0.0011 |
| winogrande | 0.7616 | 0.7474 | 0.7403 | +0.0142 | +0.0213 |
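The "Gain" columns are simple per-task score differences. As a sanity check, a minimal sketch that recomputes the gains over the baseline from the acc scores copied out of the table above:

```python
# Zero-shot accuracy (acc) per task, copied from the table above.
ours = {
    "arc_challenge": 0.6152, "arc_easy": 0.8510, "boolq": 0.9043,
    "hellaswag": 0.6522, "openbookqa": 0.3480, "piqa": 0.8232,
    "winogrande": 0.7616,
}
baseline = {
    "arc_challenge": 0.5947, "arc_easy": 0.8396, "boolq": 0.8765,
    "hellaswag": 0.6243, "openbookqa": 0.3400, "piqa": 0.8264,
    "winogrande": 0.7474,
}

# "Gain over Baseline" is just the per-task difference, rounded
# to the four decimal places reported in the table.
gains = {task: round(ours[task] - baseline[task], 4) for task in ours}

for task, gain in gains.items():
    print(f"{task:>13}: {gain:+.4f}")
# PIQA is the only task where the fine-tune trails the base model.
```

The same subtraction against the Jackrong FT column reproduces the last column of the table.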
### Normalized accuracy (acc_norm)
| Task | Ours | Baseline | Jackrong FT | Gain over Baseline | Gain over Jackrong |
|---|---|---|---|---|---|
| arc_challenge | 0.6459 | 0.6195 | 0.6203 | +0.0264 | +0.0256 |
| arc_easy | 0.8169 | 0.7963 | 0.8152 | +0.0206 | +0.0017 |
| hellaswag | 0.8433 | 0.8256 | 0.8204 | +0.0177 | +0.0229 |
| openbookqa | 0.4580 | 0.4360 | 0.4480 | +0.0220 | +0.0100 |
| piqa | 0.8237 | 0.8346 | 0.8237 | -0.0109 | 0.0000 |
## Summary
We compute simple averages over the reported metrics.
### Average acc
- Ours: 70.79
- Baseline: 69.27
- Jackrong FT: 69.10

### Average acc_norm
- Ours: 71.76
- Baseline: 70.24
- Jackrong FT: 70.55
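These summary numbers are unweighted means of the per-task scores. A quick recomputation from the tables above (scores copied verbatim; acc averages over all seven tasks, acc_norm over the five tasks that report it):

```python
# Per-task scores copied from the two tables above.
# acc order: arc_challenge, arc_easy, boolq, hellaswag, openbookqa, piqa, winogrande
acc = {
    "Ours":     [0.6152, 0.8510, 0.9043, 0.6522, 0.3480, 0.8232, 0.7616],
    "Baseline": [0.5947, 0.8396, 0.8765, 0.6243, 0.3400, 0.8264, 0.7474],
    "Jackrong": [0.5853, 0.8439, 0.8853, 0.6244, 0.3360, 0.8221, 0.7403],
}
# acc_norm order: arc_challenge, arc_easy, hellaswag, openbookqa, piqa
acc_norm = {
    "Ours":     [0.6459, 0.8169, 0.8433, 0.4580, 0.8237],
    "Baseline": [0.6195, 0.7963, 0.8256, 0.4360, 0.8346],
    "Jackrong": [0.6203, 0.8152, 0.8204, 0.4480, 0.8237],
}

def mean_pct(scores):
    # Unweighted mean, expressed as a percentage.
    return round(100 * sum(scores) / len(scores), 2)

for metric, table in [("acc", acc), ("acc_norm", acc_norm)]:
    for model, scores in table.items():
        print(f"{metric:>8} {model:>8}: {mean_pct(scores)}")
```

Note that a simple mean weights every task equally regardless of its number of examples; a per-example average would shift the numbers slightly.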
## Key observations
### 1. Consistent but smaller gains
The model improves over both baselines on most tasks, but the magnitude of improvement is smaller than for the 27B version. This suggests:

> Scaling alone does not automatically amplify reasoning gains from fine-tuning.
### 2. Strong improvements on reasoning-heavy tasks
The largest gains appear on:
- BoolQ
- ARC-Challenge
- HellaSwag

These tasks require multi-step reasoning or contextual inference, which aligns with the training objective.
### 3. Mixed results on simpler tasks
On PIQA, the model slightly underperforms the base model. This reflects a known trade-off:

> Reasoning-focused fine-tuning may slightly degrade performance on simpler, more pattern-based tasks.
### 4. Stability across benchmarks
Despite some trade-offs, the model shows:
- consistent improvements across most tasks
- no catastrophic regressions

This suggests the training dataset provides stable supervision.
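The stability claim can be checked mechanically. A small sketch, using the "Gain over Baseline" columns copied from the two tables above, that finds the worst regression against the base model:

```python
# "Gain over Baseline" columns copied from the acc and acc_norm tables.
acc_gains = {
    "arc_challenge": 0.0205, "arc_easy": 0.0114, "boolq": 0.0278,
    "hellaswag": 0.0279, "openbookqa": 0.0080, "piqa": -0.0032,
    "winogrande": 0.0142,
}
acc_norm_gains = {
    "arc_challenge": 0.0264, "arc_easy": 0.0206, "hellaswag": 0.0177,
    "openbookqa": 0.0220, "piqa": -0.0109,
}

# Merge both metrics into one dict, keyed by "metric/task".
all_gains = {**{f"acc/{k}": v for k, v in acc_gains.items()},
             **{f"acc_norm/{k}": v for k, v in acc_norm_gains.items()}}

worst_task = min(all_gains, key=all_gains.get)
print(f"worst regression: {worst_task} ({all_gains[worst_task]:+.4f})")
# The largest drop is on PIQA (acc_norm, about -1.1 points):
# a mild trade-off, not a catastrophic regression.
```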
## What this means in practice
This model is a strong choice if you want:
- better zero-shot reasoning
- improved reading comprehension
- more reliable structured inference
Compared to the 27B version, however, the gains are more moderate. This points to diminishing returns when scaling base-model size under the same training recipe.
## Training data
This model is fine-tuned on `voidful/gemini-3.1-opus-4.6-reasoning-merged`.
The dataset focuses on reasoning-oriented supervision, including:
- structured reasoning traces
- QA-style reasoning
- commonsense inference
## Limitations
The current evaluation focuses on general reasoning benchmarks.
We do not yet claim improvements on:
- coding tasks
- multilingual reasoning
- math-heavy benchmarks
Further evaluation is required.
## Takeaway
This model validates an important hypothesis:

> Reasoning-focused fine-tuning improves zero-shot reasoning consistently, even at larger scale.

But it also reveals a deeper insight:

> Scaling the base model does not guarantee proportional reasoning gains.