Qwen3.5-35B-A3B-gemini-3.1-opus-4.6-reasoning

Qwen3.5-35B-A3B-gemini-3.1-opus-4.6-reasoning is a reasoning-focused fine-tune of the Qwen3.5-35B-A3B base model, designed to improve zero-shot general reasoning performance.

This model is trained on the voidful/gemini-3.1-opus-4.6-reasoning-merged dataset. The training objective is straightforward: improve structured reasoning, reading comprehension, commonsense inference, and science-style QA, while keeping the model practical for real-world use.


Why this model

This model extends the same reasoning-oriented training pipeline used in the 27B version to a larger base model (35B-A3B).

The central hypothesis is:

If reasoning supervision is effective, scaling the base model should preserve or further improve zero-shot reasoning performance.

We evaluate this claim using a standard zero-shot benchmark suite.
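The numbers below can be reproduced with a standard harness. A hypothetical invocation, assuming EleutherAI's lm-evaluation-harness (`pip install lm-eval`) and this repository's model id; the dtype and batch-size settings are assumptions to adjust for your hardware:

```shell
lm_eval --model hf \
  --model_args pretrained=voidful/Qwen3.5-35B-A3B-gemini-3.1-opus-4.6-reasoning,dtype=bfloat16 \
  --tasks arc_challenge,arc_easy,boolq,hellaswag,openbookqa,piqa,winogrande \
  --num_fewshot 0 \
  --batch_size auto
```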


Key results

Zero-shot accuracy (acc)

| Task | Ours | Baseline (Qwen3.5-35B-A3B) | Jackrong FT | Gain over Baseline | Gain over Jackrong |
|---|---|---|---|---|---|
| arc_challenge | 0.6152 | 0.5947 | 0.5853 | +0.0205 | +0.0299 |
| arc_easy | 0.8510 | 0.8396 | 0.8439 | +0.0114 | +0.0071 |
| boolq | 0.9043 | 0.8765 | 0.8853 | +0.0278 | +0.0190 |
| hellaswag | 0.6522 | 0.6243 | 0.6244 | +0.0279 | +0.0278 |
| openbookqa | 0.3480 | 0.3400 | 0.3360 | +0.0080 | +0.0120 |
| piqa | 0.8232 | 0.8264 | 0.8221 | -0.0032 | +0.0011 |
| winogrande | 0.7616 | 0.7474 | 0.7403 | +0.0142 | +0.0213 |

Normalized accuracy (acc_norm)

| Task | Ours | Baseline | Jackrong FT | Gain over Baseline | Gain over Jackrong |
|---|---|---|---|---|---|
| arc_challenge | 0.6459 | 0.6195 | 0.6203 | +0.0264 | +0.0256 |
| arc_easy | 0.8169 | 0.7963 | 0.8152 | +0.0206 | +0.0017 |
| hellaswag | 0.8433 | 0.8256 | 0.8204 | +0.0177 | +0.0229 |
| openbookqa | 0.4580 | 0.4360 | 0.4480 | +0.0220 | +0.0100 |
| piqa | 0.8237 | 0.8346 | 0.8237 | -0.0109 | +0.0000 |

Summary

We compute simple, unweighted averages over the metrics reported above.

Average acc (7 tasks)

  • Ours: 70.79
  • Baseline: 69.27
  • Jackrong FT: 69.10

Average acc_norm (5 tasks)

  • Ours: 71.76
  • Baseline: 70.24
  • Jackrong FT: 70.55
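The averages are plain arithmetic means of the per-task scores. A minimal sketch that recomputes them directly from the tables above (no external dependencies):

```python
# Per-task scores copied from the result tables above.
acc = {
    "ours":     [0.6152, 0.8510, 0.9043, 0.6522, 0.3480, 0.8232, 0.7616],
    "baseline": [0.5947, 0.8396, 0.8765, 0.6243, 0.3400, 0.8264, 0.7474],
    "jackrong": [0.5853, 0.8439, 0.8853, 0.6244, 0.3360, 0.8221, 0.7403],
}
acc_norm = {
    "ours":     [0.6459, 0.8169, 0.8433, 0.4580, 0.8237],
    "baseline": [0.6195, 0.7963, 0.8256, 0.4360, 0.8346],
    "jackrong": [0.6203, 0.8152, 0.8204, 0.4480, 0.8237],
}

def mean_pct(scores):
    """Unweighted mean, expressed as a percentage rounded to 2 decimals."""
    return round(100 * sum(scores) / len(scores), 2)

avg_acc = {name: mean_pct(s) for name, s in acc.items()}
avg_acc_norm = {name: mean_pct(s) for name, s in acc_norm.items()}
print(avg_acc)       # "ours" comes out to 70.79
print(avg_acc_norm)  # "ours" comes out to 71.76
```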

Key observations

1. Consistent but smaller gains

The model improves over both baselines on most tasks. However, the magnitude of improvement is smaller than in the 27B version.

This suggests:

Scaling alone does not automatically amplify reasoning gains from fine-tuning.


2. Strong improvements on reasoning-heavy tasks

The largest gains appear on:

  • BoolQ
  • ARC-Challenge
  • HellaSwag

These tasks require multi-step reasoning or contextual inference.

This aligns with the training objective.


3. Mixed results on simpler tasks

On PIQA, the model slightly underperforms the base model.

This indicates a known trade-off:

Reasoning-focused fine-tuning may slightly degrade performance on simpler or more pattern-based tasks.


4. Stability across benchmarks

Despite some trade-offs, the model shows:

  • consistent improvements across most tasks
  • no catastrophic regression

This suggests the training dataset provides stable supervision.


What this means in practice

This model is a strong choice if you want:

  • better zero-shot reasoning
  • improved reading comprehension
  • more reliable structured inference

However, compared to the 27B version, the gains are more moderate and less dramatic.

This indicates diminishing returns when scaling base model size under the same training recipe.


Training data

This model is fine-tuned on:

  • voidful/gemini-3.1-opus-4.6-reasoning-merged

The dataset focuses on reasoning-oriented supervision, including:

  • structured reasoning traces
  • QA-style reasoning
  • commonsense inference

Limitations

The current evaluation focuses on general reasoning benchmarks.

We do not yet claim improvements on:

  • coding tasks
  • multilingual reasoning
  • math-heavy benchmarks

Further evaluation is required.


Takeaway

This model validates an important hypothesis:

Reasoning-focused fine-tuning improves zero-shot reasoning consistently, even at larger scale.

But it also reveals a deeper insight:

Scaling the base model does not guarantee proportional reasoning gains.
