Instructions to use samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT")
model = AutoModelForCausalLM.from_pretrained("samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT

SGLang

How to use samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Unsloth Studio

How to use samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT to start chatting

Load model with FastModel

# Install first (shell): pip install unsloth
from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
    model_name="samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT",
    max_seq_length=2048,
)


Qwen 3.6 27B — Opus CoT S1 / Hermes S2 SFT

Calling for independent benchmarks. I've only run internal smoke tests against this model — full eval numbers are not published here. If you have spare compute and a harness (AIME, HMMT, MMLU-Pro, SuperGPQA, SWE-bench Verified, LiveCodeBench, BFCL / tool-call evals, anything else), please open a Discussion on this repo. Happy to coordinate, share configs, and credit your numbers in this card.

A two-stage SFT fine-tune of Qwen 3.6 27B focused on (1) Claude-Opus-4.6-style chain-of-thought reasoning and (2) Hermes-format tool calling. Released in FP8 (~28 GB) for single-GPU serving.

Lineage: Qwen3.6-27B → stage 1 LoRA (reasoning) → merge → stage 2 LoRA (tool calling) → merge → FP8_DYNAMIC quantization → this checkpoint.

Related repositories

This release is split into three artifacts so you can pick the one that fits your workflow:

| Artifact | Repo | Use when… |
|---|---|---|
| FP8 merged (this repo) | `samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT` | You want the production model: single-GPU vLLM/SGLang serving, ~28 GB on disk |
| Stage-1 BF16 merged | `samscrack/Qwen3.6-27B-Opus-CoT-Stage1` | You want the reasoning-only base: for further finetuning, DPO/RLHF, or as the target to apply the stage-2 LoRA to |
| Stage-2 LoRA adapter | `samscrack/Qwen3.6-27B-Hermes-S2-LoRA` | You want only the Hermes tool-calling delta: small (~340 MB), apply via PEFT to the stage-1 base, or merge yourself at a different precision (BF16 / int8 / int4) |

The split is intentional: re-quantize without re-training, swap the tool-calling stage for a different format, or graft the reasoning stage onto a different base — without redoing the four-hour reasoning run.
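
As a rough illustration of the "merge yourself" path, the sketch below applies the stage-2 adapter to the stage-1 base with PEFT and saves BF16 weights. The repo names come from the table above; `merge_stage2`, `checkpoint_size_ok`, the 15% tolerance, and the assumption that the stage-1 repo ships tokenizer files are hypothetical and not taken from the author's scripts.

```python
from pathlib import Path

STAGE1_BASE = "samscrack/Qwen3.6-27B-Opus-CoT-Stage1"
STAGE2_LORA = "samscrack/Qwen3.6-27B-Hermes-S2-LoRA"


def merge_stage2(out_dir: str) -> None:
    """Apply the stage-2 adapter to the stage-1 base and save merged BF16 weights."""
    # Heavy imports stay local so the size check below is usable without them.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(STAGE1_BASE, torch_dtype=torch.bfloat16)
    merged = PeftModel.from_pretrained(base, STAGE2_LORA).merge_and_unload()
    merged.save_pretrained(out_dir, safe_serialization=True)
    # Assumption: the stage-1 repo includes the tokenizer files.
    AutoTokenizer.from_pretrained(STAGE1_BASE).save_pretrained(out_dir)


def checkpoint_size_ok(out_dir: str, n_params: float = 27e9,
                       bytes_per_param: int = 2, tolerance: float = 0.15) -> bool:
    """Hypothetical sanity check: total .safetensors size is close to n_params * bytes_per_param."""
    total = sum(p.stat().st_size for p in Path(out_dir).glob("*.safetensors"))
    expected = n_params * bytes_per_param
    return abs(total - expected) <= tolerance * expected
```

Calling `merge_stage2("merged-bf16")` followed by `checkpoint_size_ok("merged-bf16")` would catch an obviously truncated save before you spend time quantizing or serving it.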

What's in this repo

| File | Purpose |
|---|---|
| `model.safetensors` (~28 GB) | Stage-1 + stage-2 merged weights, FP8 quantized |
| `config.json` | Architecture: `Qwen3_5ForCausalLM` (text-only causal LM head) |
| `recipe.yaml` | Quantization recipe (FP8_DYNAMIC, all `Linear` modules, `lm_head` ignored) |
| `chat_template.jinja` | Inherited from base; supports tool-call rendering |
| `tokenizer.json`, `tokenizer_config.json` | Inherited from base |
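
A back-of-the-envelope check of the ~28 GB figure, assuming the nominal 27B parameter count from the model name and one byte per FP8 weight. The helper name is illustrative; the gap between 27 GB and the size on disk is plausibly the BF16 `lm_head` and embeddings plus quantization scales, though the card does not break that down.

```python
def weight_gigabytes(n_params: float, bytes_per_param: float) -> float:
    """Decimal gigabytes needed to store n_params weights."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 27e9  # nominal count taken from the model name (assumption)

fp8_gb = weight_gigabytes(N_PARAMS, 1)   # FP8: one byte per weight
bf16_gb = weight_gigabytes(N_PARAMS, 2)  # BF16: two bytes per weight
print(f"FP8 ~{fp8_gb:.0f} GB, BF16 ~{bf16_gb:.0f} GB")  # FP8 ~27 GB, BF16 ~54 GB
```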

Intended use

- Local serving on a single ≥32 GB GPU via vLLM, SGLang, or transformers.
- Reasoning-heavy chat with optional tool calls in Hermes format (`<tool_call>{...json...}</tool_call>`).
- General-purpose assistant tasks that benefit from chain-of-thought.

Not intended for high-stakes domains (medical, legal, safety-critical) without further evaluation.
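
If you consume raw completions rather than a server-side tool parser, the Hermes tags can be extracted with a few lines of standard-library Python. This is a minimal sketch: the `name` / `arguments` keys follow the usual Hermes convention and are an assumption here, and `extract_tool_calls` is an illustrative helper, not part of this repo.

```python
import json
import re

# Matches one <tool_call>...</tool_call> block; DOTALL lets the JSON span lines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list[dict]:
    """Return every tool call whose body parses as JSON; skip malformed ones."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue  # leave malformed blocks for the caller to log or retry
    return calls


reply = (
    "Let me check the forecast.\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'
)
print(extract_tool_calls(reply))
```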

Quick start (vLLM)

vllm serve samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT \
  --quantization fp8 \
  --tool-call-parser hermes \
  --enable-auto-tool-choice \
  --max-model-len 32768
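
Once the server is up, a tool-enabled request can be sent to the OpenAI-compatible endpoint. The sketch below uses only the standard library and assumes vLLM's default port 8000; the `get_weather` tool and both helper names are made up for illustration.

```python
import json
import urllib.request

MODEL_ID = "samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT"


def build_request(prompt: str) -> dict:
    """Chat-completions payload with one hypothetical tool attached."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "tool_choice": "auto",
    }


def send(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `send(build_request("What is the weather in Paris?"))["choices"][0]["message"]` should carry either plain content or a `tool_calls` list, depending on what the model decides.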

Quick start (transformers)

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT")
model = AutoModelForCausalLM.from_pretrained(
    "samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT",
    torch_dtype="auto",
    device_map="auto",
)
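msgs = [{"role": "user", "content": "Why does ice float on water?"}]
# return_dict=True returns input_ids plus attention_mask, which generate() takes as keyword arguments
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)
# temperature only takes effect when sampling is enabled
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))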

Training pipeline

The two-stage recipe and dataset choices are adapted from Jackrong's Jackrong-llm-finetuning-guide (Colab notebook Qwopus3-5-27b-Colab.ipynb), ported to a local dual-GPU setup with no other changes to the data pipeline.

Stage 1 — Reasoning SFT (chain-of-thought)


| Setting | Value |
|---|---|
| Method | Supervised fine-tuning, LoRA via Unsloth + TRL `SFTTrainer` |
| Base | `Qwen/Qwen3.6-27B` (text-only causal LM; `ForConditionalGeneration` rewritten to `ForCausalLM` for SFT) |
| LoRA | r=64, α=64, dropout=0, targets: `q_proj, k_proj, v_proj, o_proj, out_proj, gate_proj, up_proj, down_proj` |
| Optimizer / LR | AdamW, 2e-4, cosine warmup, weight decay 0.01 |
| Schedule | 2 epochs, per-GPU batch 4 × grad_accum 9 × 2 GPUs → effective batch 72, ctx 8192 |
| Steps / final loss | 346 / 0.250 |

Datasets (concatenated then shuffled):

| Dataset | Rows | Provenance |
|---|---|---|
| `nohurry/Opus-4.6-Reasoning-3000x-filtered` | 3,900 | Claude Opus 4.6 CoT distillations |
| `khazarai/qwen3.6-plus-high-reasoning-500x` | 500 | Qwen 3.6 reasoning samples |
| `Roman1111111/claude-opus-4.6-10000x` | 9,633 | Claude Opus 4.6 CoT distillations |
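
To make the mixture concrete, the row counts above can be totalled and the Claude-derived share computed. This is plain arithmetic on the listed figures, before any filtering the trainer may apply.

```python
STAGE1_ROWS = {
    "nohurry/Opus-4.6-Reasoning-3000x-filtered": 3_900,
    "khazarai/qwen3.6-plus-high-reasoning-500x": 500,
    "Roman1111111/claude-opus-4.6-10000x": 9_633,
}
# The two datasets listed as Claude Opus 4.6 CoT distillations.
OPUS_SOURCES = {
    "nohurry/Opus-4.6-Reasoning-3000x-filtered",
    "Roman1111111/claude-opus-4.6-10000x",
}

total = sum(STAGE1_ROWS.values())
opus_rows = sum(n for name, n in STAGE1_ROWS.items() if name in OPUS_SOURCES)
opus_share = opus_rows / total

print(total)                # 14033
print(f"{opus_share:.1%}")  # 96.4%
```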

Stage-1 LoRA was merged onto the BF16 base (PEFT merge_and_unload + save_pretrained, with a size sanity check), producing the stage-1 base used for stage 2.

Stage 2 — Hermes-format tool-calling SFT


| Setting | Value |
|---|---|
| Method | LoRA SFT on the stage-1-merged base |
| LoRA | r=16, α=16, dropout=0, same targets as stage 1 |
| Optimizer / LR | AdamW, 5e-5, cosine schedule |
| Schedule | 1 epoch, per-GPU batch 2 × grad_accum 8 × 2 GPUs → effective batch 32, ctx 16384 |
| Steps / final loss | 74 / 0.334 |
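
The effective batch sizes quoted for both stages only add up if the listed batch is per GPU and the two DDP ranks are counted. That reading is an assumption, but it matches the dual-GPU setup described under Hardware & software:

```python
def effective_batch(per_device: int, grad_accum: int, n_gpus: int) -> int:
    """Samples contributing to one optimizer step under data-parallel training."""
    return per_device * grad_accum * n_gpus


N_GPUS = 2  # two-GPU DDP run

stage1 = effective_batch(per_device=4, grad_accum=9, n_gpus=N_GPUS)
stage2 = effective_batch(per_device=2, grad_accum=8, n_gpus=N_GPUS)
print(stage1, stage2)  # 72 32
```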

Dataset:

| Dataset | Rows | Notes |
|---|---|---|
| `DJLougen/hermes-agent-traces-filtered` | 3,679 | Pre-cleaned subset of `lambda/hermes-agent-reasoning-traces`; tool calls are valid JSON |

Source corpus: lambda/hermes-agent-reasoning-traces (Kimi + GLM-5.1 configs, ~14.7k raw rows; ~13% are dropped by the trainer's JSON validator).

Tool-call rendering uses Hermes-style tags: <tool_call>{...JSON...}</tool_call>, with a strip_tool_response_wrappers helper to avoid the chat template re-wrapping <tool_response> tokens already present in the dataset.
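
The card names `strip_tool_response_wrappers` but does not show it. The version below is a hypothetical reconstruction of what such a helper might do, not the author's code: it removes literal `<tool_response>` tags from tool messages so the chat template can add its own. The `tool` role name and the `clean_tool_messages` wrapper are also assumptions.

```python
import re

# Opening or closing tag, together with any whitespace hugging it.
_TAG_RE = re.compile(r"\s*</?tool_response>\s*")


def strip_tool_response_wrappers(content: str) -> str:
    """Drop literal <tool_response> tags, keeping the payload between them."""
    return _TAG_RE.sub("\n", content).strip()


def clean_tool_messages(messages: list[dict]) -> list[dict]:
    """Return a copy of the conversation with tool messages unwrapped."""
    return [
        {**m, "content": strip_tool_response_wrappers(m["content"])}
        if m.get("role") == "tool" else m
        for m in messages
    ]


raw = '<tool_response>\n{"temp_c": 21}\n</tool_response>'
print(strip_tool_response_wrappers(raw))  # {"temp_c": 21}
```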

Post-training: FP8 quantization

Quantized with an llmcompressor-style recipe:

default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: FP8_DYNAMIC
      bypass_divisibility_checks: false

All nn.Linear layers except lm_head are quantized to FP8 (E4M3, dynamic per-token activation scales); lm_head and embeddings stay BF16.
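
"Dynamic per-token" means each activation row gets its own scale, computed at inference time from that row's largest magnitude. The pure-Python sketch below shows only the scaling step, using 448 as the largest finite E4M3 value; it does not round to 8 bits and is not how vLLM or llmcompressor implement it internally.

```python
E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def per_token_scales(activations: list[list[float]]) -> list[float]:
    """One scale per token (row): the row's max |x| mapped onto the E4M3 range."""
    scales = []
    for row in activations:
        amax = max(abs(x) for x in row)
        scales.append(amax / E4M3_MAX if amax > 0 else 1.0)  # avoid divide-by-zero
    return scales


def scale_to_fp8_range(activations: list[list[float]]):
    """Divide each row by its scale so every value fits within ±E4M3_MAX."""
    scales = per_token_scales(activations)
    scaled = [[x / s for x in row] for row, s in zip(activations, scales)]
    return scaled, scales


tokens = [[0.5, -2.0], [896.0, 448.0]]
scaled, scales = scale_to_fp8_range(tokens)
print(scales[1])  # 2.0
```

A real kernel would then cast the scaled values to FP8 and multiply the matmul output by the stored scales to recover the original range.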

Hardware & software

- Hardware: 2 × NVIDIA RTX PRO 6000 Blackwell Workstation (96 GB each), DDP via `torchrun --standalone --nproc-per-node=2`.
- Software: PyTorch 2.8.0+cu128, Transformers 5.2.0, TRL 0.22.2, PEFT 0.19.1, Unsloth 2026.4.7, datasets 4.3.0.
- Wall clock: 4 h stage 1 + ~10 min merge + ~1 h 45 m stage 2 ≈ **6 h end-to-end**, plus FP8 quantization (~10 min).

Benchmark scores

Single-run smoke tests against the FP8 release. Independent runs welcome and will be added here — see the call-out at the top of this card.

| Benchmark | Track / Subset | Score | Metric | Run by | Date |
|---|---|---|---|---|---|
| BFCL V4 | `live_simple` | 81.40% (210 / 258) | exact-match AST accuracy | self (samscrack) | 2026-04-28 |

Notes on each run

BFCL V4 — live_simple. Single-turn function calling on real-world prompts. Served via vLLM 0.19.1 with --tool-call-parser qwen3_xml --reasoning-parser qwen3 --enable-auto-tool-choice, evaluated through the BFCL QwenFCHandler which natively expects the <tool_call>{...JSON...}</tool_call> format this model emits. Temperature 0.001, 8-thread concurrency, 2:01 wall clock. Of the 48 failures, 0 were tool-call format / JSON-validity errors — the breakdown is 26 string-value mismatches (mostly non-English prompts where the model picked a reasonable but non-canonical phrasing, e.g. "Divinópolis, Brazil" vs gold "Divinópolis, MG"), 13 wrong-count (model declined to call), and 9 type / optional-parameter mismatches. Format compliance is essentially perfect; the ceiling here is mostly localization quirks in the dataset.
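
The figures in that note are internally consistent, which a few lines of arithmetic confirm. The numbers are copied from the paragraph above; nothing here is a new measurement.

```python
total, passed = 258, 210
failures = {
    "string-value mismatch": 26,
    "wrong count (model declined to call)": 13,
    "type / optional-parameter mismatch": 9,
}

# The three failure buckets should account for every failed case.
assert sum(failures.values()) == total - passed

print(f"accuracy: {passed / total:.2%}")  # accuracy: 81.40%
for reason, count in failures.items():
    print(f"{reason}: {count} ({count / (total - passed):.1%} of failures)")
```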

Limitations

- Inherits all limitations of Qwen3.6-27B: refusal patterns, knowledge cutoff, tokenizer biases.
- Reasoning teacher distillation: Stage-1 data is largely Claude Opus 4.6 CoT, so reasoning style and refusal calibration partly reflect Claude's, not Qwen's.
- Tool-calling format is opinionated: The model is tuned to emit `<tool_call>{...}</tool_call>` (Hermes/Qwen3 family). Servers that expect DeepSeek-native tool tokens, OpenAI function-call JSON, or Llama 3 `<|python_tag|>` form will need an adapter.
- No RLHF / DPO step; supervised only.
- FP8 quality regression: small but non-zero on a handful of edge tasks vs. the unquantized stage-2 BF16 merge. No BF16 stage-2 merge is published, so if quality matters more than VRAM, apply the stage-2 LoRA to the stage-1 BF16 base and run that instead.
- Eval harness coverage is limited: internal smoke tests on AIME 2025, HMMT Feb 2026, MMLU-Pro, SuperGPQA, SWE-bench Verified, LiveCodeBench v6; full numbers are not published here.
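
For clients that expect OpenAI-style `tool_calls`, an adapter can be quite small. The sketch below assumes the Hermes payloads have already been parsed into dicts with `name` and `arguments` keys; the function name and the `call_` id scheme are illustrative. Servers started with a tool-call parser (as in the vLLM quick start) perform this conversion for you.

```python
import json
import uuid


def to_openai_tool_calls(hermes_calls: list[dict]) -> list[dict]:
    """Convert parsed Hermes tool calls into OpenAI chat-completions tool_calls."""
    return [
        {
            "id": f"call_{uuid.uuid4().hex[:8]}",  # any unique string works here
            "type": "function",
            "function": {
                "name": call["name"],
                # OpenAI clients expect arguments as a JSON-encoded string.
                "arguments": json.dumps(call.get("arguments", {})),
            },
        }
        for call in hermes_calls
    ]


parsed = [{"name": "get_weather", "arguments": {"city": "Paris"}}]
print(to_openai_tool_calls(parsed)[0]["function"])
```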

Acknowledgements

Jackrong — original training pipeline and notebook (Jackrong-llm-finetuning-guide). The two-stage recipe and helper code (data normalization, Hermes wrapping, merge utilities) are derived from this repo.
Dataset authors:
- nohurry — Opus-4.6-Reasoning-3000x-filtered
- khazarai — qwen3.6-plus-high-reasoning-500x
- Roman1111111 — claude-opus-4.6-10000x
- DJLougen — hermes-agent-traces-filtered
- lambda — upstream hermes-agent-reasoning-traces
Tooling: Unsloth (training acceleration), TRL (SFTTrainer), PEFT (LoRA + merge), the vLLM and SGLang projects (serving), Qwen team (base model).

Citation

If you use this model, please also cite the upstream work:

@misc{qwen36-27b-opuscot-hermes-sft,
  author = {samscrack},
  title  = {Qwen 3.6 27B — Opus CoT S1 / Hermes S2 SFT (FP8)},
  year   = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT}}
}

@misc{jackrong-llm-finetuning,
  author = {Jackrong},
  title  = {Jackrong-llm-finetuning-guide: An Educational LLM Fine-Tuning Pipeline},
  year   = {2026},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/Jackrong/Jackrong-llm-finetuning-guide}}
}

@misc{vonwerra2022trl,
  title  = {TRL: Transformer Reinforcement Learning},
  author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
  year   = {2020},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}

License

Apache 2.0, inherited from the Qwen 3.6 base model. Dataset licenses apply to derived behavior — see each dataset card on the Hub.
