# Qwen3.5-35B-A3B-gemini-3.1-opus-4.6-reasoning
Qwen3.5-35B-A3B-gemini-3.1-opus-4.6-reasoning is a reasoning-focused fine-tune of the Qwen3.5-35B-A3B base model, designed to improve zero-shot general reasoning performance.

This model is trained on the voidful/gemini-3.1-opus-4.6-reasoning-merged dataset. The training objective is simple: improve structured reasoning, reading comprehension, commonsense inference, and science-style QA while keeping the model practical for real-world use.
## Why this model
This model extends the reasoning-oriented training pipeline used for the 27B version to a larger base model (35B-A3B).

The central hypothesis:

> If reasoning supervision is effective, scaling the base model should preserve or further improve zero-shot reasoning performance.

We evaluate this claim on a standard zero-shot benchmark suite.
## Key results
### Zero-shot accuracy (acc)
| Task | Ours | Baseline: Qwen3.5-35B-A3B | Jackrong FT | Gain over Baseline | Gain over Jackrong |
|---|---|---|---|---|---|
| arc_challenge | 0.6152 | 0.5947 | 0.5853 | +0.0205 | +0.0299 |
| arc_easy | 0.8510 | 0.8396 | 0.8439 | +0.0114 | +0.0071 |
| boolq | 0.9043 | 0.8765 | 0.8853 | +0.0278 | +0.0190 |
| hellaswag | 0.6522 | 0.6243 | 0.6244 | +0.0279 | +0.0278 |
| openbookqa | 0.3480 | 0.3400 | 0.3360 | +0.0080 | +0.0120 |
| piqa | 0.8232 | 0.8264 | 0.8221 | -0.0032 | +0.0011 |
| winogrande | 0.7616 | 0.7474 | 0.7403 | +0.0142 | +0.0213 |
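The "Gain" columns are simple per-task score differences. As a sanity check, a minimal sketch that recomputes the gains over the baseline from the acc scores copied out of the table above:

```python
# Zero-shot accuracy (acc) per task, copied from the table above.
ours = {
    "arc_challenge": 0.6152, "arc_easy": 0.8510, "boolq": 0.9043,
    "hellaswag": 0.6522, "openbookqa": 0.3480, "piqa": 0.8232,
    "winogrande": 0.7616,
}
baseline = {
    "arc_challenge": 0.5947, "arc_easy": 0.8396, "boolq": 0.8765,
    "hellaswag": 0.6243, "openbookqa": 0.3400, "piqa": 0.8264,
    "winogrande": 0.7474,
}

# "Gain over Baseline" is just the per-task difference, rounded
# to the four decimal places reported in the table.
gains = {task: round(ours[task] - baseline[task], 4) for task in ours}

for task, gain in gains.items():
    print(f"{task:>13}: {gain:+.4f}")
# PIQA is the only task where the fine-tune trails the base model.
```

The same subtraction against the Jackrong FT column reproduces the last column of the table.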
### Normalized accuracy (acc_norm)
| Task | Ours | Baseline | Jackrong FT | Gain over Baseline | Gain over Jackrong |
|---|---|---|---|---|---|
| arc_challenge | 0.6459 | 0.6195 | 0.6203 | +0.0264 | +0.0256 |
| arc_easy | 0.8169 | 0.7963 | 0.8152 | +0.0206 | +0.0017 |
| hellaswag | 0.8433 | 0.8256 | 0.8204 | +0.0177 | +0.0229 |
| openbookqa | 0.4580 | 0.4360 | 0.4480 | +0.0220 | +0.0100 |
| piqa | 0.8237 | 0.8346 | 0.8237 | -0.0109 | 0.0000 |
## Summary
We compute simple averages over the reported metrics.
### Average acc
- Ours: 70.79
- Baseline: 69.27
- Jackrong FT: 69.10

### Average acc_norm
- Ours: 71.76
- Baseline: 70.24
- Jackrong FT: 70.55
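These summary numbers are unweighted means of the per-task scores. A quick recomputation from the tables above (scores copied verbatim; acc averages over all seven tasks, acc_norm over the five tasks that report it):

```python
# Per-task scores copied from the two tables above.
# acc order: arc_challenge, arc_easy, boolq, hellaswag, openbookqa, piqa, winogrande
acc = {
    "Ours":     [0.6152, 0.8510, 0.9043, 0.6522, 0.3480, 0.8232, 0.7616],
    "Baseline": [0.5947, 0.8396, 0.8765, 0.6243, 0.3400, 0.8264, 0.7474],
    "Jackrong": [0.5853, 0.8439, 0.8853, 0.6244, 0.3360, 0.8221, 0.7403],
}
# acc_norm order: arc_challenge, arc_easy, hellaswag, openbookqa, piqa
acc_norm = {
    "Ours":     [0.6459, 0.8169, 0.8433, 0.4580, 0.8237],
    "Baseline": [0.6195, 0.7963, 0.8256, 0.4360, 0.8346],
    "Jackrong": [0.6203, 0.8152, 0.8204, 0.4480, 0.8237],
}

def mean_pct(scores):
    # Unweighted mean, expressed as a percentage.
    return round(100 * sum(scores) / len(scores), 2)

for metric, table in [("acc", acc), ("acc_norm", acc_norm)]:
    for model, scores in table.items():
        print(f"{metric:>8} {model:>8}: {mean_pct(scores)}")
```

Note that a simple mean weights every task equally regardless of its number of examples; a per-example average would shift the numbers slightly.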
## Key observations
### 1. Consistent but smaller gains
The model improves over both baselines on most tasks, but the magnitude of improvement is smaller than for the 27B version. This suggests:

> Scaling alone does not automatically amplify reasoning gains from fine-tuning.
### 2. Strong improvements on reasoning-heavy tasks
The largest gains appear on:
- BoolQ
- ARC-Challenge
- HellaSwag

These tasks require multi-step reasoning or contextual inference, which aligns with the training objective.
### 3. Mixed results on simpler tasks
On PIQA, the model slightly underperforms the base model. This reflects a known trade-off:

> Reasoning-focused fine-tuning may slightly degrade performance on simpler, more pattern-based tasks.
### 4. Stability across benchmarks
Despite some trade-offs, the model shows:
- consistent improvements across most tasks
- no catastrophic regressions

This suggests the training dataset provides stable supervision.
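The stability claim can be checked mechanically. A small sketch, using the "Gain over Baseline" columns copied from the two tables above, that finds the worst regression against the base model:

```python
# "Gain over Baseline" columns copied from the acc and acc_norm tables.
acc_gains = {
    "arc_challenge": 0.0205, "arc_easy": 0.0114, "boolq": 0.0278,
    "hellaswag": 0.0279, "openbookqa": 0.0080, "piqa": -0.0032,
    "winogrande": 0.0142,
}
acc_norm_gains = {
    "arc_challenge": 0.0264, "arc_easy": 0.0206, "hellaswag": 0.0177,
    "openbookqa": 0.0220, "piqa": -0.0109,
}

# Merge both metrics into one dict, keyed by "metric/task".
all_gains = {**{f"acc/{k}": v for k, v in acc_gains.items()},
             **{f"acc_norm/{k}": v for k, v in acc_norm_gains.items()}}

worst_task = min(all_gains, key=all_gains.get)
print(f"worst regression: {worst_task} ({all_gains[worst_task]:+.4f})")
# The largest drop is on PIQA (acc_norm, about -1.1 points):
# a mild trade-off, not a catastrophic regression.
```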
## What this means in practice
This model is a strong choice if you want:
- better zero-shot reasoning
- improved reading comprehension
- more reliable structured inference
Compared to the 27B version, however, the gains are more moderate. This points to diminishing returns when scaling base-model size under the same training recipe.
## Training data
This model is fine-tuned on `voidful/gemini-3.1-opus-4.6-reasoning-merged`.
The dataset focuses on reasoning-oriented supervision, including:
- structured reasoning traces
- QA-style reasoning
- commonsense inference
## Limitations
The current evaluation focuses on general reasoning benchmarks.
We do not yet claim improvements on:
- coding tasks
- multilingual reasoning
- math-heavy benchmarks
Further evaluation is required.
## Takeaway
This model validates an important hypothesis:

> Reasoning-focused fine-tuning improves zero-shot reasoning consistently, even at larger scale.

But it also reveals a deeper insight:

> Scaling the base model does not guarantee proportional reasoning gains.