Qwen 3.6 27B — Opus CoT S1 / Hermes S2 SFT

Calling for independent benchmarks. I've only run internal smoke tests against this model — full eval numbers are not published here. If you have spare compute and a harness (AIME, HMMT, MMLU-Pro, SuperGPQA, SWE-bench Verified, LiveCodeBench, BFCL / tool-call evals, anything else), please open a Discussion on this repo. Happy to coordinate, share configs, and credit your numbers in this card.

A two-stage SFT fine-tune of Qwen 3.6 27B focused on (1) Claude-Opus-4.6-style chain-of-thought reasoning and (2) Hermes-format tool calling. Released in FP8 (~28 GB) for single-GPU serving.

Lineage: Qwen3.6-27B → stage 1 LoRA (reasoning) → merge → stage 2 LoRA (tool calling) → merge → FP8_DYNAMIC quantization → this checkpoint.

Related repositories

This release is split into three artifacts so you can pick the one that fits your workflow:

| Artifact | Repo | Use when… |
| --- | --- | --- |
| FP8 merged (this repo) | samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT | You want the production model: single-GPU vLLM/SGLang serving, ~28 GB on disk |
| Stage-1 BF16 merged | samscrack/Qwen3.6-27B-Opus-CoT-Stage1 | You want the reasoning-only base for further fine-tuning, DPO/RLHF, or as the target to apply the stage-2 LoRA to |
| Stage-2 LoRA adapter | samscrack/Qwen3.6-27B-Hermes-S2-LoRA | You want only the Hermes tool-calling delta: small (~340 MB), apply via PEFT to the stage-1 base, or merge yourself at a different precision (BF16 / int8 / int4) |

The split is intentional: re-quantize without re-training, swap the tool-calling stage for a different format, or graft the reasoning stage onto a different base — without redoing the four-hour reasoning run.
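As a rough sketch of the "apply via PEFT" path, the stage-2 adapter can be loaded on top of the stage-1 BF16 base and merged. The repo IDs come from the table above; the function name and output directory are illustrative, and holding a 27B model in BF16 takes roughly 54 GB of memory. The heavy imports sit inside the function so the module itself imports without torch, transformers, or peft installed.

```python
STAGE1_BASE = "samscrack/Qwen3.6-27B-Opus-CoT-Stage1"
STAGE2_ADAPTER = "samscrack/Qwen3.6-27B-Hermes-S2-LoRA"


def merge_stage2_adapter(out_dir: str = "stage2-merged-bf16") -> str:
    """Load the stage-1 base, apply the stage-2 LoRA, merge, and save.

    `out_dir` is a hypothetical local path, not something this repo ships.
    """
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the reasoning-only base in BF16 (on CPU unless you pass a device map).
    base = AutoModelForCausalLM.from_pretrained(STAGE1_BASE, torch_dtype=torch.bfloat16)

    # Attach the tool-calling adapter, then fold its deltas into the base weights.
    model = PeftModel.from_pretrained(base, STAGE2_ADAPTER)
    merged = model.merge_and_unload()

    # Save weights and tokenizer side by side so the directory is servable.
    merged.save_pretrained(out_dir, safe_serialization=True)
    AutoTokenizer.from_pretrained(STAGE1_BASE).save_pretrained(out_dir)
    return out_dir
```

Calling `merge_stage2_adapter()` should leave a BF16 checkpoint that can then be quantized to whatever precision you prefer.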

What's in this repo

| File | Purpose |
| --- | --- |
| model.safetensors (~28 GB) | Stage-1 + stage-2 merged weights, FP8 quantized |
| config.json | Architecture: Qwen3_5ForCausalLM (text-only causal LM head) |
| recipe.yaml | Quantization recipe (FP8_DYNAMIC, all Linear modules, lm_head ignored) |
| chat_template.jinja | Inherited from base; supports tool-call rendering |
| tokenizer.json, tokenizer_config.json | Inherited from base |

Intended use

  • Local serving on a single ≥32 GB GPU via vLLM, SGLang, or transformers.
  • Reasoning-heavy chat with optional tool calls in Hermes format (<tool_call>{...json...}</tool_call>).
  • General-purpose assistant tasks that benefit from chain-of-thought.

Not intended for high-stakes domains (medical, legal, safety-critical) without further evaluation.

Quick start (vLLM)

vllm serve samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT \
  --quantization fp8 \
  --tool-call-parser hermes \
  --enable-auto-tool-choice \
  --max-model-len 32768
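Once the server is up, it exposes vLLM's OpenAI-compatible API. The sketch below builds a chat-completions request with one tool attached, using only the standard library. The `get_weather` tool, the helper names, and the `http://localhost:8000` address (vLLM's default port) are assumptions for illustration.

```python
import json
import urllib.request

MODEL = "samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT"

# Hypothetical tool, described in the OpenAI function-calling schema.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_request(prompt, tools, base_url="http://localhost:8000"):
    """Build (but do not send) a POST to /v1/chat/completions."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
        "tool_choice": "auto",  # pairs with --enable-auto-tool-choice on the server
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

With the server running, `send(build_request("What's the weather in Oslo?", [WEATHER_TOOL]))` should return a response whose first choice carries any parsed calls under `message.tool_calls`.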

Quick start (transformers)

from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype="auto",
    device_map="auto",
)
msgs = [{"role": "user", "content": "Why does ice float on water?"}]
# return_dict=True yields input_ids plus attention_mask, both of which generate() uses
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
# temperature only takes effect when sampling is enabled
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
# decode only the newly generated tokens, not the prompt
prompt_len = inputs["input_ids"].shape[1]
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))

Training pipeline

The two-stage recipe and dataset choices are adapted from Jackrong's Jackrong-llm-finetuning-guide (Colab notebook Qwopus3-5-27b-Colab.ipynb), ported to a local dual-GPU setup with no other changes to the data pipeline.

Stage 1 — Reasoning SFT (chain-of-thought)

| Setting | Value |
| --- | --- |
| Method | Supervised fine-tuning, LoRA via Unsloth + TRL SFTTrainer |
| Base | Qwen/Qwen3.6-27B (text-only causal LM; `*ForConditionalGeneration` rewritten to `*ForCausalLM` for SFT) |
| LoRA | r=64, α=64, dropout=0, targets: q_proj, k_proj, v_proj, o_proj, out_proj, gate_proj, up_proj, down_proj |
| Optimizer / LR | AdamW, 2e-4, cosine warmup, weight decay 0.01 |
| Schedule | 2 epochs, per-GPU batch 4 × grad_accum 9 × 2 GPUs → effective batch 72, ctx 8192 |
| Steps / final loss | 346 / 0.250 |

Datasets (concatenated then shuffled):

| Dataset | Rows | Provenance |
| --- | --- | --- |
| nohurry/Opus-4.6-Reasoning-3000x-filtered | 3,900 | Claude Opus 4.6 CoT distillations |
| khazarai/qwen3.6-plus-high-reasoning-500x | 500 | Qwen 3.6 reasoning samples |
| Roman1111111/claude-opus-4.6-10000x | 9,633 | Claude Opus 4.6 CoT distillations |

Stage-1 LoRA was merged onto the BF16 base (PEFT merge_and_unload + save_pretrained, with a size sanity check), producing the stage-1 base used for stage 2.
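The size sanity check itself isn't shown in this card. One simple form, sketched here under that assumption, compares the bytes written to disk against the 2 bytes per parameter expected for BF16. The function names and the 5% tolerance are illustrative.

```python
from pathlib import Path


def expected_bf16_bytes(n_params: int) -> int:
    """BF16 stores each parameter in 2 bytes."""
    return n_params * 2


def safetensors_bytes(model_dir) -> int:
    """Total size of all .safetensors shards in a checkpoint directory."""
    return sum(p.stat().st_size for p in Path(model_dir).glob("*.safetensors"))


def size_looks_sane(model_dir, n_params: int, tolerance: float = 0.05) -> bool:
    """True if the on-disk size is within `tolerance` of the BF16 expectation.

    A merged checkpoint that is far too small usually means tensors were
    dropped or saved at the wrong precision.
    """
    expected = expected_bf16_bytes(n_params)
    actual = safetensors_bytes(model_dir)
    return abs(actual - expected) <= tolerance * expected
```

For a 27B-parameter model the expectation is about 54 GB, so a check like `size_looks_sane("stage1-merged", 27_000_000_000)` would flag a merge that silently wrote only adapter-sized weights (the directory name here is hypothetical).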

Stage 2 — Hermes-format tool-calling SFT

| Setting | Value |
| --- | --- |
| Method | LoRA SFT on the stage-1-merged base |
| LoRA | r=16, α=16, dropout=0, same targets as stage 1 |
| Optimizer / LR | AdamW, 5e-5, cosine schedule |
| Schedule | 1 epoch, per-GPU batch 2 × grad_accum 8 × 2 GPUs → effective batch 32, ctx 16384 |
| Steps / final loss | 74 / 0.334 |
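The effective batch sizes for both stages follow from per-GPU batch × gradient-accumulation steps × number of GPUs, with two GPUs under DDP (see Hardware & software). A quick check of that arithmetic:

```python
def effective_batch(per_gpu_batch: int, grad_accum: int, world_size: int) -> int:
    """Samples contributing to one optimizer step under data-parallel training."""
    return per_gpu_batch * grad_accum * world_size


stage1 = effective_batch(per_gpu_batch=4, grad_accum=9, world_size=2)
stage2 = effective_batch(per_gpu_batch=2, grad_accum=8, world_size=2)
print(stage1, stage2)  # 72 32
```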

Dataset:

| Dataset | Rows | Notes |
| --- | --- | --- |
| DJLougen/hermes-agent-traces-filtered | 3,679 | Pre-cleaned subset of lambda/hermes-agent-reasoning-traces; tool calls are valid JSON |

Source corpus: lambda/hermes-agent-reasoning-traces (Kimi + GLM-5.1 configs, ~14.7k raw rows; ~13% are dropped by the trainer's JSON validator).

Tool-call rendering uses Hermes-style tags: <tool_call>{...JSON...}</tool_call>, with a strip_tool_response_wrappers helper to avoid the chat template re-wrapping <tool_response> tokens already present in the dataset.
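Neither the JSON validator nor `strip_tool_response_wrappers` is reproduced in this card. The sketch below shows one plausible shape for both, assuming each tool call is a JSON object with a `name` field and that tool outputs in the dataset are wrapped in a single outer `<tool_response>` pair. The actual helpers in the training code may differ.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
TOOL_RESPONSE_RE = re.compile(
    r"^\s*<tool_response>\s*(.*?)\s*</tool_response>\s*$", re.DOTALL
)


def tool_calls_are_valid(text: str) -> bool:
    """True if every <tool_call> body parses as a JSON object with a "name"."""
    for body in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(body)
        except json.JSONDecodeError:
            return False
        if not isinstance(call, dict) or "name" not in call:
            return False
    return True


def strip_tool_response_wrappers(content: str) -> str:
    """Remove one outer <tool_response> pair so the chat template can add its own."""
    match = TOOL_RESPONSE_RE.match(content)
    return match.group(1) if match else content
```

A row would be kept only when `tool_calls_are_valid` holds for each assistant turn, and tool-role messages would pass through `strip_tool_response_wrappers` before the chat template is applied.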

Post-training: FP8 quantization

Quantized with an llmcompressor-style recipe:

default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: FP8_DYNAMIC
      bypass_divisibility_checks: false

All nn.Linear layers except lm_head are quantized to FP8 (E4M3, with dynamic per-token activation scales); lm_head and the embeddings stay in BF16.
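To make "dynamic per-token activation scales" concrete, here is a pure-Python illustration: each token's activation row gets its own scale, chosen at run time so the row's largest magnitude maps to 448, the largest finite E4M3 value. This is a teaching sketch with made-up function names, not the kernel that llmcompressor or vLLM actually runs.

```python
import math

E4M3_MAX = 448.0  # largest finite value in FP8 E4M3


def round_to_e4m3(v: float) -> float:
    """Round to the nearest E4M3 value (3 mantissa bits), saturating at ±448."""
    if v == 0.0:
        return 0.0
    a = min(abs(v), E4M3_MAX)
    _, exp = math.frexp(a)   # a = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)     # -6 is the smallest normal exponent; below it, spacing is fixed
    step = 2.0 ** (e - 3)    # gap between neighbouring values at this exponent
    return math.copysign(min(round(a / step) * step, E4M3_MAX), v)


def quantize_per_token(rows):
    """Quantize then dequantize each row (one token) with its own dynamic scale."""
    out = []
    for row in rows:
        amax = max(abs(x) for x in row)
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        out.append([round_to_e4m3(x / scale) * scale for x in row])
    return out


acts = [[1.0, -2.0, 0.5], [100.0, 0.3, 0.0]]
for original, restored in zip(acts, quantize_per_token(acts)):
    print(original, "->", restored)
```

Because the scale is recomputed per row, a token with small activations keeps its resolution even when another token in the same batch has large ones.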

Hardware & software

  • Hardware: 2 × NVIDIA RTX PRO 6000 Blackwell Workstation (96 GB each), DDP via torchrun --standalone --nproc-per-node=2.
  • Software: PyTorch 2.8.0+cu128, Transformers 5.2.0, TRL 0.22.2, PEFT 0.19.1, Unsloth 2026.4.7, datasets 4.3.0.
  • Wall clock: 4 h stage 1 + ~10 min merge + ~1 h 45 m stage 2 ≈ **6 h end-to-end**, plus FP8 quantization (~10 min).

Benchmark scores

Single-run smoke tests against the FP8 release. Independent runs welcome and will be added here — see the call-out at the top of this card.

| Benchmark | Track / Subset | Score | Metric | Run by | Date |
| --- | --- | --- | --- | --- | --- |
| BFCL V4 | live_simple | 81.40% (210 / 258) | exact-match AST accuracy | self (samscrack) | 2026-04-28 |

Notes on each run

  • BFCL V4 — live_simple. Single-turn function calling on real-world prompts. Served via vLLM 0.19.1 with --tool-call-parser qwen3_xml --reasoning-parser qwen3 --enable-auto-tool-choice, evaluated through the BFCL QwenFCHandler which natively expects the <tool_call>{...JSON...}</tool_call> format this model emits. Temperature 0.001, 8-thread concurrency, 2:01 wall clock. Of the 48 failures, 0 were tool-call format / JSON-validity errors — the breakdown is 26 string-value mismatches (mostly non-English prompts where the model picked a reasonable but non-canonical phrasing, e.g. "Divinópolis, Brazil" vs gold "Divinópolis, MG"), 13 wrong-count (model declined to call), and 9 type / optional-parameter mismatches. Format compliance is essentially perfect; the ceiling here is mostly localization quirks in the dataset.
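The figures in that note are internally consistent, which a few lines of arithmetic confirm: the three failure buckets sum to the 48 misses, and 210 of 258 rounds to the reported score.

```python
passed, total = 210, 258
failures = {
    "string-value mismatch": 26,
    "wrong count (model declined to call)": 13,
    "type / optional-parameter mismatch": 9,
}

failed = total - passed
assert sum(failures.values()) == failed  # 26 + 13 + 9 == 48

accuracy = 100 * passed / total
print(f"accuracy: {accuracy:.2f}%")  # accuracy: 81.40%

# Share of each bucket among the failures.
for reason, count in failures.items():
    print(f"{reason}: {count} ({100 * count / failed:.1f}% of failures)")
```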

Limitations

  • Inherits all limitations of Qwen3.6-27B — refusal patterns, knowledge cutoff, tokenizer biases.
  • Reasoning teacher distillation: Stage-1 data is largely Claude Opus 4.6 CoT, so reasoning style and refusal calibration partly reflect Claude's, not Qwen's.
  • Tool-calling format is opinionated: The model is tuned to emit <tool_call>{...}</tool_call> (Hermes/Qwen3 family). Servers that expect DeepSeek-native tool tokens, OpenAI function-call JSON, or Llama 3 <|python_tag|> form will need an adapter.
  • No RLHF / DPO step — supervised only.
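  • FP8 quality regression: small but non-zero on a handful of edge tasks vs. the unquantized stage-2 BF16 weights. A merged stage-2 BF16 checkpoint is not among the released artifacts, so if quality matters more than VRAM, apply the stage-2 LoRA to the stage-1 BF16 base and merge at BF16 yourself.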
  • Eval harness coverage is limited: internal smoke tests on AIME 2025, HMMT Feb 2026, MMLU-Pro, SuperGPQA, SWE-bench Verified, LiveCodeBench v6 — full numbers not published here.
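For clients that expect OpenAI-style `tool_calls` but only have the raw generated text, the adapter mentioned under the tool-calling limitation can be fairly small. The sketch below converts Hermes-tagged output into that shape. The function name and generated call IDs are illustrative, and it assumes each block holds a JSON object with `name` and `arguments`.

```python
import json
import re
import uuid

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def hermes_to_openai(text: str) -> dict:
    """Split model output into plain content and OpenAI-style tool_calls."""
    tool_calls = []
    for body in TOOL_CALL_RE.findall(text):
        call = json.loads(body)
        tool_calls.append({
            "id": f"call_{uuid.uuid4().hex[:8]}",  # synthetic ID for illustration
            "type": "function",
            "function": {
                "name": call["name"],
                # OpenAI-style APIs carry arguments as a JSON-encoded string.
                "arguments": json.dumps(call.get("arguments", {})),
            },
        })
    # Whatever remains outside the tags is ordinary assistant text.
    content = TOOL_CALL_RE.sub("", text).strip()
    return {"role": "assistant", "content": content or None, "tool_calls": tool_calls}
```

When serving through vLLM with `--tool-call-parser hermes`, the server already performs this conversion; a helper like this is only useful for stacks that hand back unparsed text.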

Acknowledgements

  • Jackrong — original training pipeline and notebook (Jackrong-llm-finetuning-guide). The two-stage recipe and helper code (data normalization, Hermes wrapping, merge utilities) are derived from this repo.
  • Dataset authors:
    • nohurry/Opus-4.6-Reasoning-3000x-filtered
    • khazarai/qwen3.6-plus-high-reasoning-500x
    • Roman1111111/claude-opus-4.6-10000x
    • DJLougen/hermes-agent-traces-filtered
    • lambda — upstream hermes-agent-reasoning-traces
  • Tooling: Unsloth (training acceleration), TRL (SFTTrainer), PEFT (LoRA + merge), the vLLM and SGLang projects (serving), Qwen team (base model).

Citation

If you use this model, please also cite the upstream work:

@misc{qwen36-27b-opuscot-hermes-sft,
  author = {samscrack},
  title  = {Qwen 3.6 27B — Opus CoT S1 / Hermes S2 SFT (FP8)},
  year   = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT}}
}

@misc{jackrong-llm-finetuning,
  author = {Jackrong},
  title  = {Jackrong-llm-finetuning-guide: An Educational LLM Fine-Tuning Pipeline},
  year   = {2026},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/Jackrong/Jackrong-llm-finetuning-guide}}
}

@misc{vonwerra2022trl,
  title  = {TRL: Transformer Reinforcement Learning},
  author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
  year   = {2020},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}

License

Apache 2.0, inherited from the Qwen 3.6 base model. Dataset licenses apply to derived behavior — see each dataset card on the Hub.
