Qwen3-1.7B-Thinking-Distil

Extended Reasoning Distillation from Qwen3-30B-A3B-Thinking → 1.7B

Convergent Intelligence LLC: Research Division


What This Is

The most downloaded model in the Convergent Intelligence portfolio. Qwen3-1.7B-Thinking-Distil captures extended deliberation patterns from the Qwen3-30B-A3B Thinking teacher — the variant that generates long-form reasoning chains before committing to an answer — and compresses them into a 1.7B student via supervised fine-tuning on the longwriter-6k dataset.

The Thinking teacher produces the richest signal of the three teacher variants in the DistilQwen family (Instruct, Thinking, Coder). Where Instruct distillation captures clean instruction-following and Coder captures hierarchical decomposition, Thinking distillation captures the extended internal monologue — the model reasoning through uncertainty, backtracking, and re-evaluating before arriving at a conclusion. That deliberative depth is what makes this variant the highest-download model in the collection.

Architecture

| Parameter | Value |
|---|---|
| Architecture | Qwen3ForCausalLM |
| Parameters | ~2.03B (1.7B effective) |
| Hidden Size | 2048 |
| Layers | 28 |
| Attention Heads | 16 (Q) / 8 (KV), GQA |
| Intermediate Size | 6144 |
| Head Dimension | 128 |
| Context Length | 40,960 tokens (max position) |
| Vocabulary | 151,936 |
| Precision | BF16 |
| Activation | SiLU |
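
A quick way to cross-check these numbers against the released checkpoint is to load the configuration alone, without the weights. A minimal sketch using standard transformers config attributes; the expected values in the comments simply echo the table above.

```python
from transformers import AutoConfig

# Load only the config, not the weights, to verify the architecture table.
cfg = AutoConfig.from_pretrained("reaperdoesntknow/Qwen3-1.7B-Thinking-Distil")

print(cfg.hidden_size)              # 2048
print(cfg.num_hidden_layers)        # 28
print(cfg.num_attention_heads)      # 16 query heads
print(cfg.num_key_value_heads)      # 8 KV heads (GQA)
print(cfg.intermediate_size)        # 6144
print(cfg.head_dim)                 # 128
print(cfg.max_position_embeddings)  # 40960
print(cfg.vocab_size)               # 151936
```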

Training

  • Teacher: Qwen3-30B-A3B-Thinking
  • Student: Qwen3-1.7B
  • Dataset: longwriter-6k (long-form generation samples that preserve extended reasoning chains)
  • Method: Supervised Fine-Tuning (SFT) via TRL

| Parameter | Value |
|---|---|
| Max Sequence Length | 4,096 |
| Precision | BF16 |
| Framework | TRL (SFTTrainer) |
| Hardware | NVIDIA H100 |

The training captures the teacher's extended thinking traces through direct SFT rather than logit-level KD. This is a deliberate design choice — the longwriter-6k dataset provides naturally long reasoning samples where the signal is in the structure of the generation (how the teacher approaches, reconsiders, and resolves), not just the final token probabilities.
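
In TRL terms, that direct SFT setup corresponds roughly to the sketch below. Only the pieces stated above (the student checkpoint, longwriter-6k data, 4,096 max sequence length, BF16) come from this card; the dataset's Hub id, batch size, epoch count, and other optimizer settings are illustrative assumptions rather than the exact training script, and SFTConfig argument names can shift between TRL releases.

```python
# Minimal SFT sketch, not the exact script used to train this model.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumed Hub id for longwriter-6k; the sketch also assumes a conversational
# `messages` column that SFTTrainer can render with the chat template.
dataset = load_dataset("THUDM/LongWriter-6k", split="train")

config = SFTConfig(
    output_dir="qwen3-1.7b-thinking-distil",
    max_seq_length=4096,              # `max_length` in newer TRL releases
    bf16=True,
    per_device_train_batch_size=1,    # placeholder
    gradient_accumulation_steps=8,    # placeholder
    num_train_epochs=1,               # placeholder
    logging_steps=10,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B",          # student initialised from the base model
    args=config,
    train_dataset=dataset,
)
trainer.train()
```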

For the full topology-aware distillation pipeline (BV decomposition, jump detection, curriculum ordering), see TopologicalQwen. This model is the SFT-direct variant — simpler, faster to train, and empirically the most downloaded for a reason: the Thinking teacher's extended chains transfer well through pure SFT.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "reaperdoesntknow/Qwen3-1.7B-Thinking-Distil",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(
    "reaperdoesntknow/Qwen3-1.7B-Thinking-Distil"
)

messages = [
    {"role": "user", "content": "Explain why gradient descent can get stuck in saddle points but not local minima in high dimensions."}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
    repetition_penalty=1.15
)

print(tokenizer.decode(output[0], skip_special_tokens=True))
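
Continuing from the snippet above: if the distilled model preserves Qwen3's `<think>...</think>` convention for its deliberation trace (an assumption worth verifying on your own outputs), the reasoning can be separated from the final answer:

```python
# Split the reasoning trace from the answer, assuming Qwen3-style <think> tags
# survive decoding; falls back to treating everything as the answer otherwise.
generated = output[0][inputs["input_ids"].shape[1]:]
decoded = tokenizer.decode(generated, skip_special_tokens=True)

if "</think>" in decoded:
    thinking, answer = decoded.split("</think>", 1)
    thinking = thinking.replace("<think>", "").strip()
    answer = answer.strip()
else:
    thinking, answer = "", decoded.strip()

print("reasoning trace:\n", thinking)
print("final answer:\n", answer)
```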

Generation Tips

  • Temperature 0.6–0.8 works best for reasoning tasks — low enough for coherence, high enough to activate the extended deliberation patterns from the Thinking teacher.
  • Repetition penalty 1.1–1.2 prevents the model from getting caught in reasoning loops during long generations.
  • Max tokens 1024–2048 — the model was trained on 4096 max seq, so it can generate long. Give it room.
  • The model inherits the Thinking teacher's tendency to reason before answering. Let it. (These defaults are bundled into a reusable config in the sketch after this list.)
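
A minimal way to bundle those defaults once so each `generate` call stays short; the values simply restate the tips above.

```python
from transformers import GenerationConfig

# Defaults following the generation tips; adjust per task.
reasoning_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,          # 0.6-0.8 recommended for reasoning
    top_p=0.9,
    repetition_penalty=1.15,  # 1.1-1.2 to avoid reasoning loops
    max_new_tokens=2048,      # give the extended chains room
)

output = model.generate(**inputs, generation_config=reasoning_config)
```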

Distillation Position

Qwen3-30B-A3B-Thinking (teacher)
  ↓ SFT on longwriter-6k (4096 max seq)
Qwen3-1.7B-Thinking-Distil ← you are here

This model is the direct SFT path. The DistilQwen collection also includes models that go through additional refinement stages:

Qwen3-1.7B (base)
  → Qwen3-1.7B-Distilled-30B-A3B (Instruct teacher KD)
    → DiStil (uncensored SFT)
      → Disctil (DISC refinement)
        → TopologicalQwen (full TKD pipeline)

Different paths, different capabilities. This model prioritizes extended reasoning. TopologicalQwen prioritizes structural precision. The Coder variant prioritizes hierarchical decomposition. They're complementary.

DistilQwen Collection

| Model | Downloads | What It Does |
|---|---|---|
| Qwen3-1.7B-Thinking-Distil | 1,188 | ← this model. Thinking teacher SFT. |
| TopologicalQwen | 1,134 | Full TKD pipeline. BV decomposition + DualMind format. |
| DiStil-Qwen3-1.7B-uncensored | 1,030 | DISC-informed uncensored distillation. |
| Qwen3-1.7B-Coder-Distilled-SFT | 966 | Coder teacher. Hierarchical problem solving. |
| DistilQwen3-1.7B-uncensored | 832 | Base uncensored variant. |

Full collection: DistilQwen on HuggingFace

Methodology

Full methodology paper: Structure Over Scale: Proof-Weighted Knowledge Distillation (DOI: 10.57967/hf/8165)

Companion paper: Three Teachers to Dual Cognition (DOI: 10.57967/hf/8184) — covers the DualMind extension and ghost imprinting phenomenon.

License

Apache 2.0 — same as the base Qwen3 model.

Mathematical Foundations: Discrepancy Calculus (DISC)

This model's training pipeline is grounded in Discrepancy Calculus — a measure-theoretic framework that treats singularities as primary structure rather than pathology. Full theory: "On the Formal Analysis of Discrepancy Calculus" (Colca, 2026; Convergent Intelligence LLC: Research Division).

The Core Operator:

$$Df(x) = \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon} \int_x^{x+\varepsilon} \frac{|f(t) - f(x)|}{|t - x|}\, dt$$

For smooth $f$: $Df(x) = |f'(x)|$. For rough $f$: $D$ localizes irregularity to null sets while preserving integral structure.
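
As a quick numerical illustration of the smooth case (a toy sketch, unrelated to the training pipeline), a Riemann-sum approximation of the operator recovers $|f'(x)|$ for a differentiable function:

```python
import numpy as np

def discrepancy(f, x, eps=1e-4, n=1000):
    """Riemann-sum approximation of Df(x) over the window [x, x + eps]."""
    t = np.linspace(x + eps / n, x + eps, n)         # avoid t == x
    integrand = np.abs(f(t) - f(x)) / np.abs(t - x)
    return integrand.mean()                          # == (1/eps) * sum * (eps/n)

x0 = 0.3
print(discrepancy(np.sin, x0))  # ~0.955
print(abs(np.cos(x0)))          # |f'(x0)| = 0.9553...
```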

The Mesh Fundamental Identity — every BV function decomposes as:

$$f(b) - f(a) = \underbrace{\int_a^b f'(x)\,dx}_{\text{smooth (AC)}} + \underbrace{\sum_{x \in J_f} \Delta f(x)}_{\text{jumps}} + \underbrace{D^c f(I)}_{\text{Cantor drift}}$$

Standard knowledge distillation captures only term 1. Topological Knowledge Distillation (TKD) preserves all three by treating the teacher's output distribution as a BV function and computing discrepancy energy, jump sets, and gap energy density before training begins.
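
To make the three terms concrete, the toy sketch below splits a sampled 1-D signal into a smooth increment and a discrete jump. The threshold-based definition of the jump set is an assumption made purely for illustration; this is a reading of the idea, not the TKD pipeline itself.

```python
import numpy as np

def toy_bv_split(values, jump_threshold=0.5):
    """Split successive differences into a 'smooth' part and 'jumps' (illustrative only)."""
    diffs = np.diff(values)
    jump_mask = np.abs(diffs) > jump_threshold
    smooth_part = diffs[~jump_mask].sum()   # analogue of the absolutely continuous term
    jump_part = diffs[jump_mask].sum()      # analogue of the jump-sum term
    return smooth_part, jump_part, np.where(jump_mask)[0]

x = np.linspace(0, 1, 101)
signal = 0.5 * x + (x > 0.5).astype(float)  # gentle ramp plus one abrupt step

smooth, jumps, jump_locs = toy_bv_split(signal)
print(smooth, jumps, jump_locs)             # ~0.5, ~1.0, [50]
print(signal[-1] - signal[0])               # 1.5 == smooth + jumps (no Cantor part here)
```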

Citation

@misc{colca2026distilqwen,
  title={Structure Over Scale: Proof-Weighted Knowledge Distillation from Qwen3-30B to 1.7B},
  author={Colca, Roy},
  year={2026},
  doi={10.57967/hf/8165},
  publisher={Convergent Intelligence LLC: Research Division}
}

Convergent Intelligence LLC: Research Division — 49 models, 22,598 downloads across the portfolio. Full portfolio | DistilQwen Collection | DualMind Collection


Convergent Intelligence Portfolio

Part of the DistilQwen Series by Convergent Intelligence LLC: Research Division


Papers

| Paper | DOI |
|---|---|
| Structure Over Scale | 10.57967/hf/8165 |
| Three Teachers to Dual Cognition | 10.57967/hf/8184 |
| Discrepancy Calculus | 10.57967/hf/8194 |

Last updated: 2026-03-31 by Convergent Intelligence LLC: Research Division
