Qwen 3.6 27B — Opus CoT S1 / Hermes S2 SFT
Calling for independent benchmarks. I've only run internal smoke tests against this model — full eval numbers are not published here. If you have spare compute and a harness (AIME, HMMT, MMLU-Pro, SuperGPQA, SWE-bench Verified, LiveCodeBench, BFCL / tool-call evals, anything else), please open a Discussion on this repo. Happy to coordinate, share configs, and credit your numbers in this card.
A two-stage SFT fine-tune of Qwen 3.6 27B focused on (1) Claude-Opus-4.6-style chain-of-thought reasoning and (2) Hermes-format tool calling. Released in FP8 (~28 GB) for single-GPU serving.
Lineage:
Qwen3.6-27B → stage 1 LoRA (reasoning) → merge → stage 2 LoRA (tool calling) → merge → FP8_DYNAMIC quantization → this checkpoint.
Related repositories
This release is split into three artifacts so you can pick the one that fits your workflow:
| Artifact | Repo | Use when… |
|---|---|---|
| FP8 merged (this repo) | samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT | You want the production model: single-GPU vLLM/SGLang serving, ~28 GB on disk |
| Stage-1 BF16 merged | samscrack/Qwen3.6-27B-Opus-CoT-Stage1 | You want the reasoning-only base: for further finetuning, DPO/RLHF, or as the target to apply the stage-2 LoRA to |
| Stage-2 LoRA adapter | samscrack/Qwen3.6-27B-Hermes-S2-LoRA | You want only the Hermes tool-calling delta: small (~340 MB), apply via PEFT to the stage-1 base, or merge yourself at a different precision (BF16 / int8 / int4) |
The split is intentional: re-quantize without re-training, swap the tool-calling stage for a different format, or graft the reasoning stage onto a different base — without redoing the four-hour reasoning run.
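Applying the stage-2 adapter to the stage-1 base and merging it yourself might look roughly like the sketch below. It assumes both repos load with the stock `AutoModelForCausalLM` and `PeftModel` classes; the function name `merge_stage2` and the output directory are hypothetical, and the heavy imports are deferred so that nothing is downloaded until the function is called.

```python
STAGE1_BASE = "samscrack/Qwen3.6-27B-Opus-CoT-Stage1"    # BF16 merged reasoning base
STAGE2_ADAPTER = "samscrack/Qwen3.6-27B-Hermes-S2-LoRA"  # Hermes tool-calling LoRA


def merge_stage2(out_dir: str = "stage2-merged-bf16") -> str:
    """Apply the stage-2 LoRA to the stage-1 base and save merged BF16 weights.

    `out_dir` is a hypothetical local path. Without a device_map the model is
    loaded on CPU, so this needs enough system RAM for a 27B BF16 model.
    """
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(STAGE1_BASE, torch_dtype=torch.bfloat16)
    model = PeftModel.from_pretrained(base, STAGE2_ADAPTER)
    merged = model.merge_and_unload()  # fold the LoRA deltas into the base weights
    merged.save_pretrained(out_dir)
    AutoTokenizer.from_pretrained(STAGE1_BASE).save_pretrained(out_dir)
    return out_dir
```

Calling `merge_stage2()` would leave an unquantized checkpoint that can then be re-quantized at whatever precision is needed.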
What's in this repo
| File | Purpose |
|---|---|
| model.safetensors (~28 GB) | Stage-1 + stage-2 merged weights, FP8 quantized |
| config.json | Architecture: Qwen3_5ForCausalLM (text-only causal LM head) |
| recipe.yaml | Quantization recipe (FP8_DYNAMIC, all Linear modules, lm_head ignored) |
| chat_template.jinja | Inherited from base; supports tool-call rendering |
| tokenizer.json, tokenizer_config.json | Inherited from base |
Intended use
- Local serving on a single ≥32 GB GPU via vLLM, SGLang, or transformers.
- Reasoning-heavy chat with optional tool calls in Hermes format (`<tool_call>{...json...}</tool_call>`).
- General-purpose assistant tasks that benefit from chain-of-thought.
Not intended for high-stakes domains (medical, legal, safety-critical) without further evaluation.
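When the raw completion text is consumed directly (without a server-side tool parser), the Hermes-format calls mentioned above have to be pulled out of the output by the caller. The following is a minimal sketch of one way to do that; the function name, the sample output, and the `get_weather` tool are made up for illustration.

```python
import json
import re

# Non-greedy match so that several calls in one completion are found separately.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list[dict]:
    """Return every well-formed JSON object found inside <tool_call> tags."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed payloads instead of raising
        if isinstance(payload, dict):
            calls.append(payload)
    return calls


# Hypothetical model output; the tool name and arguments are invented.
sample = (
    "I should look up the weather.\n"
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
)
calls = extract_tool_calls(sample)
print(calls)
```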
Quick start (vLLM)
```bash
vllm serve samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT \
  --quantization fp8 \
  --tool-call-parser hermes \
  --enable-auto-tool-choice \
  --max-model-len 32768
```
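Once the server is up, a tool-enabled request can be sent to its OpenAI-compatible chat endpoint. The sketch below uses only the standard library and assumes the server listens on vLLM's default `http://localhost:8000`; the `get_weather` tool and the helper names are hypothetical.

```python
import json
import urllib.request

MODEL = "samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT"

# Hypothetical tool definition in the OpenAI "tools" schema.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_payload(user_text: str) -> dict:
    """Build a chat-completions request body that offers one tool."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_text}],
        "tools": [WEATHER_TOOL],
        "tool_choice": "auto",
    }


def chat(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to a running server and return the decoded JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_payload("What is the weather in Paris?")
# With the server running: reply = chat(payload)
# reply["choices"][0]["message"].get("tool_calls") would then hold any parsed calls.
```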
Quick start (transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT")
model = AutoModelForCausalLM.from_pretrained(
    "samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT",
    torch_dtype="auto",
    device_map="auto",
)

msgs = [{"role": "user", "content": "Why does ice float on water?"}]
# return_dict=True yields input_ids plus attention_mask for generate()
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
# temperature only takes effect when sampling is enabled
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
Training pipeline
The two-stage recipe and dataset choices are adapted from Jackrong's Jackrong-llm-finetuning-guide (Colab notebook Qwopus3-5-27b-Colab.ipynb), ported to a local dual-GPU setup with no other changes to the data pipeline.
Stage 1 — Reasoning SFT (chain-of-thought)
| Setting | Value |
|---|---|
| Method | Supervised fine-tuning, LoRA via Unsloth + TRL SFTTrainer |
| Base | Qwen/Qwen3.6-27B (text-only causal LM; `*ForConditionalGeneration` rewritten to `*ForCausalLM` for SFT) |
| LoRA | r=64, α=64, dropout=0, targets: q_proj, k_proj, v_proj, o_proj, out_proj, gate_proj, up_proj, down_proj |
| Optimizer / LR | AdamW, 2e-4, cosine warmup, weight decay 0.01 |
| Schedule | 2 epochs, per-GPU batch 4 × grad_accum 9 × 2 GPUs → effective batch 72, ctx 8192 |
| Steps / final loss | 346 / 0.250 |
Datasets (concatenated then shuffled):
| Dataset | Rows | Provenance |
|---|---|---|
| nohurry/Opus-4.6-Reasoning-3000x-filtered | 3,900 | Claude Opus 4.6 CoT distillations |
| khazarai/qwen3.6-plus-high-reasoning-500x | 500 | Qwen 3.6 reasoning samples |
| Roman1111111/claude-opus-4.6-10000x | 9,633 | Claude Opus 4.6 CoT distillations |
Stage-1 LoRA was merged onto the BF16 base (PEFT merge_and_unload + save_pretrained, with a size sanity check), producing the stage-1 base used for stage 2.
Stage 2 — Hermes-format tool-calling SFT
| Setting | Value |
|---|---|
| Method | LoRA SFT on the stage-1-merged base |
| LoRA | r=16, α=16, dropout=0, same targets as stage 1 |
| Optimizer / LR | AdamW, 5e-5, cosine schedule |
| Schedule | 1 epoch, per-GPU batch 2 × grad_accum 8 × 2 GPUs → effective batch 32, ctx 16384 |
| Steps / final loss | 74 / 0.334 |
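The effective batch sizes quoted for both stages follow from multiplying the per-device batch, the gradient-accumulation steps, and the number of data-parallel GPUs. A small sketch of that arithmetic, assuming the two-GPU DDP setup listed in the hardware section (the helper name is illustrative):

```python
def effective_batch(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Samples consumed per optimizer step under data-parallel training."""
    return per_device * grad_accum * num_gpus


NUM_GPUS = 2  # assumption: both stages ran on the two-GPU DDP setup

stage1 = effective_batch(per_device=4, grad_accum=9, num_gpus=NUM_GPUS)
stage2 = effective_batch(per_device=2, grad_accum=8, num_gpus=NUM_GPUS)
print(stage1, stage2)  # 72 32
```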
Dataset:
| Dataset | Rows | Notes |
|---|---|---|
| DJLougen/hermes-agent-traces-filtered | 3,679 | Pre-cleaned subset of lambda/hermes-agent-reasoning-traces; tool calls are valid JSON |
Source corpus: lambda/hermes-agent-reasoning-traces (Kimi + GLM-5.1 configs, ~14.7k raw rows; ~13% are dropped by the trainer's JSON validator).
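The trainer's JSON validator itself is not shown in this card. As a rough illustration of the kind of filter that sentence describes, the sketch below keeps a row only if every `<tool_call>` block in its assistant turns parses as a JSON object. The `{"role", "content"}` row layout and the function name are assumptions, not the actual dataset schema or training code.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def row_is_valid(messages: list[dict]) -> bool:
    """True if every <tool_call> block in assistant turns holds a JSON object."""
    for msg in messages:
        if msg.get("role") != "assistant":
            continue
        for raw in TOOL_CALL_RE.findall(msg.get("content", "")):
            try:
                if not isinstance(json.loads(raw), dict):
                    return False
            except json.JSONDecodeError:
                return False
    return True


# Two toy rows in the assumed message format.
rows = [
    [{"role": "assistant", "content": '<tool_call>{"name": "ls", "arguments": {}}</tool_call>'}],
    [{"role": "assistant", "content": "<tool_call>{name: ls}</tool_call>"}],  # not valid JSON
]
kept = [r for r in rows if row_is_valid(r)]
print(f"kept {len(kept)} of {len(rows)} rows")  # kept 1 of 2 rows
```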
Tool-call rendering uses Hermes-style tags: <tool_call>{...JSON...}</tool_call>, with a strip_tool_response_wrappers helper to avoid the chat template re-wrapping <tool_response> tokens already present in the dataset.
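The helper's source is not included here, so the following is only a guess at its behavior: a sketch that removes a single outer `<tool_response>` wrapper from a tool message so that a chat template can add its own. It reuses the name from the text for readability, but the implementation is hypothetical and handles only one wrapper per message.

```python
import re

WRAPPER_RE = re.compile(
    r"^\s*<tool_response>\s*(.*?)\s*</tool_response>\s*$", re.DOTALL
)


def strip_tool_response_wrappers(content: str) -> str:
    """Remove one outer <tool_response>...</tool_response> pair, if present."""
    match = WRAPPER_RE.match(content)
    return match.group(1) if match else content


wrapped = '<tool_response>\n{"status": "ok"}\n</tool_response>'
print(strip_tool_response_wrappers(wrapped))       # {"status": "ok"}
print(strip_tool_response_wrappers("plain text"))  # plain text
```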
Post-training: FP8 quantization
Quantized with an llmcompressor-style recipe:
```yaml
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: FP8_DYNAMIC
      bypass_divisibility_checks: false
```
All nn.Linear layers are FP8 (E4M3, dynamic per-token activation scales); lm_head and embeddings stay BF16.
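To make "dynamic per-token activation scales" concrete: at inference time each token's activation row gets its own scale, chosen so that the row's largest magnitude maps onto the largest finite E4M3 value (448). The sketch below shows only that scale computation in plain Python; the rounding onto the FP8 grid is omitted, and this is an illustration rather than the llmcompressor or vLLM implementation.

```python
E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 (e4m3fn) format


def per_token_scales(activations: list[list[float]]) -> list[float]:
    """One scale per token (row): maps the row's max magnitude onto E4M3_MAX."""
    scales = []
    for row in activations:
        amax = max(abs(v) for v in row)
        scales.append(amax / E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_rows(activations: list[list[float]], scales: list[float]) -> list[list[float]]:
    """Divide each row by its scale; the results then fit the E4M3 range."""
    return [[v / s for v in row] for row, s in zip(activations, scales)]


# Two toy "tokens" with very different magnitudes.
acts = [[0.5, -2.0, 1.0], [100.0, -896.0, 224.0]]
scales = per_token_scales(acts)
scaled = scale_rows(acts, scales)
print(scales[1])  # 2.0
print(scaled[1])  # [50.0, -448.0, 112.0]
```

Because the scale is recomputed per token, a row of small activations is not squeezed into a few FP8 levels just because another token in the batch is large.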
Hardware & software
- Hardware: 2 × NVIDIA RTX PRO 6000 Blackwell Workstation (96 GB each), DDP via `torchrun --standalone --nproc-per-node=2`.
- Software: PyTorch 2.8.0+cu128, Transformers 5.2.0, TRL 0.22.2, PEFT 0.19.1, Unsloth 2026.4.7, datasets 4.3.0.
- Wall clock: 4 h stage 1 + ~10 min merge + ~1 h 45 m stage 2 ≈ **6 h end-to-end**, plus FP8 quantization (~10 min).
Benchmark scores
Single-run smoke tests against the FP8 release. Independent runs welcome and will be added here — see the call-out at the top of this card.
| Benchmark | Track / Subset | Score | Metric | Run by | Date |
|---|---|---|---|---|---|
| BFCL V4 | live_simple | 81.40% (210 / 258) | exact-match AST accuracy | self (samscrack) | 2026-04-28 |
Notes on each run
- BFCL V4, `live_simple`: Single-turn function calling on real-world prompts. Served via vLLM 0.19.1 with `--tool-call-parser qwen3_xml --reasoning-parser qwen3 --enable-auto-tool-choice`, evaluated through the BFCL `QwenFCHandler`, which natively expects the `<tool_call>{...JSON...}</tool_call>` format this model emits. Temperature 0.001, 8-thread concurrency, 2:01 wall clock. Of the 48 failures, 0 were tool-call format / JSON-validity errors. The breakdown is 26 string-value mismatches (mostly non-English prompts where the model picked a reasonable but non-canonical phrasing, e.g. "Divinópolis, Brazil" vs gold "Divinópolis, MG"), 13 wrong-count (model declined to call), and 9 type / optional-parameter mismatches. Format compliance is essentially perfect; the ceiling here is mostly localization quirks in the dataset.
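The numbers in the note above can be cross-checked with a few lines of arithmetic. This only restates the figures already reported; it is not a re-run of the benchmark.

```python
total, passed = 258, 210
failures = {
    "string-value mismatch": 26,
    "wrong count (model declined to call)": 13,
    "type / optional-parameter mismatch": 9,
}

failed = total - passed
accuracy = 100 * passed / total
print(failed, sum(failures.values()))  # 48 48
print(f"{accuracy:.2f}%")              # 81.40%

# Share of each failure category among all failures
for name, count in failures.items():
    print(f"{name}: {100 * count / failed:.1f}% of failures")
```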
Limitations
- Inherits all limitations of Qwen3.6-27B: refusal patterns, knowledge cutoff, tokenizer biases.
- Reasoning teacher distillation: Stage-1 data is largely Claude Opus 4.6 CoT, so reasoning style and refusal calibration partly reflect Claude's, not Qwen's.
- Tool-calling format is opinionated: The model is tuned to emit `<tool_call>{...}</tool_call>` (Hermes/Qwen3 family). Servers that expect DeepSeek-native tool tokens, OpenAI function-call JSON, or Llama 3 `<|python_tag|>` form will need an adapter.
- No RLHF / DPO step: supervised only.
- FP8 quality regression: small but non-zero on a handful of edge tasks vs. the unquantized stage-2 BF16; pick the BF16 variant if quality > VRAM.
- Eval harness coverage is limited: internal smoke tests on AIME 2025, HMMT Feb 2026, MMLU-Pro, SuperGPQA, SWE-bench Verified, LiveCodeBench v6; full numbers are not published here.
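For the tool-calling format limitation, an adapter to OpenAI-style `tool_calls` entries can be fairly small when working from raw completion text. The sketch below assumes each Hermes payload carries `name` and `arguments` keys; the function name and the synthetic call IDs are illustrative. Servers started with a Hermes tool-call parser already perform this conversion, so it is only relevant for raw-text pipelines.

```python
import json
import re
import uuid

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def hermes_to_openai(text: str) -> list[dict]:
    """Convert Hermes <tool_call> blocks into OpenAI-style tool_calls entries."""
    converted = []
    for raw in TOOL_CALL_RE.findall(text):
        call = json.loads(raw)  # raises on malformed JSON; callers may want to catch
        converted.append({
            "id": f"call_{uuid.uuid4().hex[:8]}",  # synthetic ID; the model emits none
            "type": "function",
            "function": {
                "name": call["name"],
                # The OpenAI schema carries arguments as a JSON-encoded string
                "arguments": json.dumps(call.get("arguments", {})),
            },
        })
    return converted


# Hypothetical completion text with an invented tool.
text = '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
tool_calls = hermes_to_openai(text)
print(tool_calls[0]["function"])
```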
Acknowledgements
- Jackrong: original training pipeline and notebook (Jackrong-llm-finetuning-guide). The two-stage recipe and helper code (data normalization, Hermes wrapping, merge utilities) are derived from this repo.
- Dataset authors:
  - nohurry: Opus-4.6-Reasoning-3000x-filtered
  - khazarai: qwen3.6-plus-high-reasoning-500x
  - Roman1111111: claude-opus-4.6-10000x
  - DJLougen: hermes-agent-traces-filtered
  - lambda: upstream hermes-agent-reasoning-traces
- Tooling: Unsloth (training acceleration), TRL (SFTTrainer), PEFT (LoRA + merge), the vLLM and SGLang projects (serving), Qwen team (base model).
Citation
If you use this model, please also cite the upstream work:
```bibtex
@misc{qwen36-27b-opuscot-hermes-sft,
  author       = {samscrack},
  title        = {Qwen 3.6 27B — Opus CoT S1 / Hermes S2 SFT (FP8)},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT}}
}

@misc{jackrong-llm-finetuning,
  author       = {Jackrong},
  title        = {Jackrong-llm-finetuning-guide: An Educational LLM Fine-Tuning Pipeline},
  year         = {2026},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/Jackrong/Jackrong-llm-finetuning-guide}}
}

@misc{vonwerra2022trl,
  title        = {TRL: Transformer Reinforcement Learning},
  author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
  year         = {2020},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}
```
License
Apache 2.0, inherited from the Qwen 3.6 base model. Dataset licenses apply to derived behavior — see each dataset card on the Hub.