Qwen3-1.7B-Thinking-Distil

Extended Reasoning Distillation from Qwen3-30B-A3B-Thinking → 1.7B

Convergent Intelligence LLC: Research Division


What This Is

The most downloaded model in the Convergent Intelligence portfolio. Qwen3-1.7B-Thinking-Distil captures extended deliberation patterns from the Qwen3-30B-A3B Thinking teacher — the variant that generates long-form reasoning chains before committing to an answer — and compresses them into a 1.7B student via supervised fine-tuning on the longwriter-6k dataset.

The Thinking teacher produces the richest signal of the three teacher variants in the DistilQwen family (Instruct, Thinking, Coder). Where Instruct distillation captures clean instruction-following and Coder captures hierarchical decomposition, Thinking distillation captures the extended internal monologue — the model reasoning through uncertainty, backtracking, and re-evaluating before arriving at a conclusion. That deliberative depth is what makes this variant the highest-download model in the collection.

Architecture

| Parameter | Value |
|---|---|
| Architecture | Qwen3ForCausalLM |
| Parameters | ~2.03B (1.7B effective) |
| Hidden Size | 2048 |
| Layers | 28 |
| Attention Heads | 16 (Q) / 8 (KV), GQA |
| Intermediate Size | 6144 |
| Head Dimension | 128 |
| Context Length | 40,960 tokens (max position) |
| Vocabulary | 151,936 |
| Precision | BF16 |
| Activation | SiLU |
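
A quick way to cross-check these numbers against the released checkpoint is to load the configuration alone, without the weights. A minimal sketch using standard transformers config attributes; the expected values in the comments simply echo the table above.

```python
from transformers import AutoConfig

# Load only the config, not the weights, to verify the architecture table.
cfg = AutoConfig.from_pretrained("reaperdoesntknow/Qwen3-1.7B-Thinking-Distil")

print(cfg.hidden_size)              # 2048
print(cfg.num_hidden_layers)        # 28
print(cfg.num_attention_heads)      # 16 query heads
print(cfg.num_key_value_heads)      # 8 KV heads (GQA)
print(cfg.intermediate_size)        # 6144
print(cfg.head_dim)                 # 128
print(cfg.max_position_embeddings)  # 40960
print(cfg.vocab_size)               # 151936
```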

Training

  • Teacher: Qwen3-30B-A3B-Thinking
  • Student: Qwen3-1.7B
  • Dataset: longwriter-6k (long-form generation samples that preserve extended reasoning chains)
  • Method: Supervised Fine-Tuning (SFT) via TRL

| Parameter | Value |
|---|---|
| Max Sequence Length | 4,096 |
| Precision | BF16 |
| Framework | TRL (SFTTrainer) |
| Hardware | NVIDIA H100 |

The training captures the teacher's extended thinking traces through direct SFT rather than logit-level KD. This is a deliberate design choice — the longwriter-6k dataset provides naturally long reasoning samples where the signal is in the structure of the generation (how the teacher approaches, reconsiders, and resolves), not just the final token probabilities.
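
In TRL terms, that direct SFT setup corresponds roughly to the sketch below. Only the pieces stated above (the student checkpoint, longwriter-6k data, 4,096 max sequence length, BF16) come from this card; the dataset's Hub id, batch size, epoch count, and other optimizer settings are illustrative assumptions rather than the exact training script, and SFTConfig argument names can shift between TRL releases.

```python
# Minimal SFT sketch, not the exact script used to train this model.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumed Hub id for longwriter-6k; the sketch also assumes a conversational
# `messages` column that SFTTrainer can render with the chat template.
dataset = load_dataset("THUDM/LongWriter-6k", split="train")

config = SFTConfig(
    output_dir="qwen3-1.7b-thinking-distil",
    max_seq_length=4096,              # `max_length` in newer TRL releases
    bf16=True,
    per_device_train_batch_size=1,    # placeholder
    gradient_accumulation_steps=8,    # placeholder
    num_train_epochs=1,               # placeholder
    logging_steps=10,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B",          # student initialised from the base model
    args=config,
    train_dataset=dataset,
)
trainer.train()
```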

For the full topology-aware distillation pipeline (BV decomposition, jump detection, curriculum ordering), see TopologicalQwen. This model is the SFT-direct variant — simpler, faster to train, and empirically the most downloaded for a reason: the Thinking teacher's extended chains transfer well through pure SFT.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "reaperdoesntknow/Qwen3-1.7B-Thinking-Distil",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(
    "reaperdoesntknow/Qwen3-1.7B-Thinking-Distil"
)

messages = [
    {"role": "user", "content": "Explain why gradient descent can get stuck in saddle points but not local minima in high dimensions."}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
    repetition_penalty=1.15
)

print(tokenizer.decode(output[0], skip_special_tokens=True))
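
Continuing from the snippet above: if the distilled model preserves Qwen3's `<think>...</think>` convention for its deliberation trace (an assumption worth verifying on your own outputs), the reasoning can be separated from the final answer:

```python
# Split the reasoning trace from the answer, assuming Qwen3-style <think> tags
# survive decoding; falls back to treating everything as the answer otherwise.
generated = output[0][inputs["input_ids"].shape[1]:]
decoded = tokenizer.decode(generated, skip_special_tokens=True)

if "</think>" in decoded:
    thinking, answer = decoded.split("</think>", 1)
    thinking = thinking.replace("<think>", "").strip()
    answer = answer.strip()
else:
    thinking, answer = "", decoded.strip()

print("reasoning trace:\n", thinking)
print("final answer:\n", answer)
```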

Generation Tips

  • Temperature 0.6–0.8 works best for reasoning tasks — low enough for coherence, high enough to activate the extended deliberation patterns from the Thinking teacher.
  • Repetition penalty 1.1–1.2 prevents the model from getting caught in reasoning loops during long generations.
  • Max tokens 1024–2048 — the model was trained on 4096 max seq, so it can generate long. Give it room.
  • The model inherits the Thinking teacher's tendency to reason before answering. Let it. (These defaults are bundled into a reusable config in the sketch after this list.)
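
A minimal way to bundle those defaults once so each `generate` call stays short; the values simply restate the tips above.

```python
from transformers import GenerationConfig

# Defaults following the generation tips; adjust per task.
reasoning_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,          # 0.6-0.8 recommended for reasoning
    top_p=0.9,
    repetition_penalty=1.15,  # 1.1-1.2 to avoid reasoning loops
    max_new_tokens=2048,      # give the extended chains room
)

output = model.generate(**inputs, generation_config=reasoning_config)
```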

Distillation Position

Qwen3-30B-A3B-Thinking (teacher)
  ↓ SFT on longwriter-6k (4096 max seq)
Qwen3-1.7B-Thinking-Distil ← you are here

This model is the direct SFT path. The DistilQwen collection also includes models that go through additional refinement stages:

Qwen3-1.7B (base)
  → Qwen3-1.7B-Distilled-30B-A3B (Instruct teacher KD)
    → DiStil (uncensored SFT)
      → Disctil (DISC refinement)
        → TopologicalQwen (full TKD pipeline)

Different paths, different capabilities. This model prioritizes extended reasoning. TopologicalQwen prioritizes structural precision. The Coder variant prioritizes hierarchical decomposition. They're complementary.

DistilQwen Collection

| Model | Downloads | What It Does |
|---|---|---|
| Qwen3-1.7B-Thinking-Distil | 1,188 | ← this model. Thinking teacher SFT. |
| TopologicalQwen | 1,134 | Full TKD pipeline. BV decomposition + DualMind format. |
| DiStil-Qwen3-1.7B-uncensored | 1,030 | DISC-informed uncensored distillation. |
| Qwen3-1.7B-Coder-Distilled-SFT | 966 | Coder teacher. Hierarchical problem solving. |
| DistilQwen3-1.7B-uncensored | 832 | Base uncensored variant. |

Full collection: DistilQwen on HuggingFace

Methodology

Full methodology paper: Structure Over Scale: Proof-Weighted Knowledge Distillation (DOI: 10.57967/hf/8165)

Companion paper: Three Teachers to Dual Cognition (DOI: 10.57967/hf/8184) — covers the DualMind extension and ghost imprinting phenomenon.

License

Apache 2.0 — same as the base Qwen3 model.

Mathematical Foundations: Discrepancy Calculus (DISC)

This model's training pipeline is grounded in Discrepancy Calculus — a measure-theoretic framework that treats singularities as primary structure rather than pathology. Full theory: "On the Formal Analysis of Discrepancy Calculus" (Colca, 2026; Convergent Intelligence LLC: Research Division).

The Core Operator:

$$Df(x) = \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon} \int_x^{x+\varepsilon} \frac{|f(t) - f(x)|}{|t - x|}\, dt$$

For smooth $f$: $Df(x) = |f'(x)|$. For rough $f$: $D$ localizes irregularity to null sets while preserving integral structure.
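
As a quick numerical illustration of the smooth case (a toy sketch, unrelated to the training pipeline), a Riemann-sum approximation of the operator recovers $|f'(x)|$ for a differentiable function:

```python
import numpy as np

def discrepancy(f, x, eps=1e-4, n=1000):
    """Riemann-sum approximation of Df(x) over the window [x, x + eps]."""
    t = np.linspace(x + eps / n, x + eps, n)         # avoid t == x
    integrand = np.abs(f(t) - f(x)) / np.abs(t - x)
    return integrand.mean()                          # == (1/eps) * sum * (eps/n)

x0 = 0.3
print(discrepancy(np.sin, x0))  # ~0.955
print(abs(np.cos(x0)))          # |f'(x0)| = 0.9553...
```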

The Mesh Fundamental Identity — every BV function decomposes as:

$$f(b) - f(a) = \underbrace{\int_a^b f'(x)\,dx}_{\text{smooth (AC)}} + \underbrace{\sum_{x \in J_f} \Delta f(x)}_{\text{jumps}} + \underbrace{D^c f(I)}_{\text{Cantor drift}}$$

Standard knowledge distillation captures only term 1. Topological Knowledge Distillation (TKD) preserves all three by treating the teacher's output distribution as a BV function and computing discrepancy energy, jump sets, and gap energy density before training begins.
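
To make the three terms concrete, the toy sketch below splits a sampled 1-D signal into a smooth increment and a discrete jump. The threshold-based definition of the jump set is an assumption made purely for illustration; this is a reading of the idea, not the TKD pipeline itself.

```python
import numpy as np

def toy_bv_split(values, jump_threshold=0.5):
    """Split successive differences into a 'smooth' part and 'jumps' (illustrative only)."""
    diffs = np.diff(values)
    jump_mask = np.abs(diffs) > jump_threshold
    smooth_part = diffs[~jump_mask].sum()   # analogue of the absolutely continuous term
    jump_part = diffs[jump_mask].sum()      # analogue of the jump-sum term
    return smooth_part, jump_part, np.where(jump_mask)[0]

x = np.linspace(0, 1, 101)
signal = 0.5 * x + (x > 0.5).astype(float)  # gentle ramp plus one abrupt step

smooth, jumps, jump_locs = toy_bv_split(signal)
print(smooth, jumps, jump_locs)             # ~0.5, ~1.0, [50]
print(signal[-1] - signal[0])               # 1.5 == smooth + jumps (no Cantor part here)
```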

Citation

@misc{colca2026distilqwen,
  title={Structure Over Scale: Proof-Weighted Knowledge Distillation from Qwen3-30B to 1.7B},
  author={Colca, Roy},
  year={2026},
  doi={10.57967/hf/8165},
  publisher={Convergent Intelligence LLC: Research Division}
}

Convergent Intelligence LLC: Research Division — 49 models, 22,598 downloads across the portfolio. Full portfolio | DistilQwen Collection | DualMind Collection


Convergent Intelligence Portfolio

Part of the DistilQwen Series by Convergent Intelligence LLC: Research Division


Papers

| Paper | DOI |
|---|---|
| Structure Over Scale | 10.57967/hf/8165 |
| Three Teachers to Dual Cognition | 10.57967/hf/8184 |
| Discrepancy Calculus | 10.57967/hf/8194 |

Last updated: 2026-03-31 by Convergent Intelligence LLC: Research Division
