Instructions to use bowmanslayer/Qwen3.6-27B-Text-Only-W4A16-g128-MTP-ToolCalling with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use bowmanslayer/Qwen3.6-27B-Text-Only-W4A16-g128-MTP-ToolCalling with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="bowmanslayer/Qwen3.6-27B-Text-Only-W4A16-g128-MTP-ToolCalling")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("bowmanslayer/Qwen3.6-27B-Text-Only-W4A16-g128-MTP-ToolCalling", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use bowmanslayer/Qwen3.6-27B-Text-Only-W4A16-g128-MTP-ToolCalling with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "bowmanslayer/Qwen3.6-27B-Text-Only-W4A16-g128-MTP-ToolCalling" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "bowmanslayer/Qwen3.6-27B-Text-Only-W4A16-g128-MTP-ToolCalling", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/bowmanslayer/Qwen3.6-27B-Text-Only-W4A16-g128-MTP-ToolCalling
- SGLang
How to use bowmanslayer/Qwen3.6-27B-Text-Only-W4A16-g128-MTP-ToolCalling with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "bowmanslayer/Qwen3.6-27B-Text-Only-W4A16-g128-MTP-ToolCalling" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "bowmanslayer/Qwen3.6-27B-Text-Only-W4A16-g128-MTP-ToolCalling", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "bowmanslayer/Qwen3.6-27B-Text-Only-W4A16-g128-MTP-ToolCalling" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "bowmanslayer/Qwen3.6-27B-Text-Only-W4A16-g128-MTP-ToolCalling", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use bowmanslayer/Qwen3.6-27B-Text-Only-W4A16-g128-MTP-ToolCalling with Docker Model Runner:
docker model run hf.co/bowmanslayer/Qwen3.6-27B-Text-Only-W4A16-g128-MTP-ToolCalling
- Qwen3.6-27B · Text-Only · W4A16-g128 · MTP · Tool-Calling
Qwen3.6-27B · Text-Only · W4A16-g128 · MTP · Tool-Calling
Low-latency single-stream variant — ~+85% decode TPS via preserved Multi-Token Prediction head, single-user mode for interactive agents and tool calls.
This is a text-only, 4-bit weight (W4A16, group size 128, AutoRound) quantization of
Qwen/Qwen3.6-27B with the MTP head preserved
in BF16 so vLLM's native speculative decoding works out of the box.
💡 Pick the right variant for your workload:
- Multi-user / long-context server: use the high-concurrency sibling
bowmanslayer/Qwen3.6-27B-Text-Only-W4A16-g128-ToolCalling(FP8 KV path, 56K max context, ~4-8× concurrency, no MTP).- Single-user low-latency (you want decode TPS, you don't need many concurrent users): this variant, with vLLM
--speculative-config method=mtp.Both variants use the same base quantization. The only difference is whether the MTP head is grafted in (this variant) — which adds ~849 MB of weights and trades KV pool for decode speed.
Community quantization. Not an official Qwen release; not endorsed by or affiliated with the Qwen team or Alibaba Cloud. "Qwen3.6" is used only to identify the upstream model (Apache-2.0 §6).
TL;DR — measured on 1× RTX 3090 24 GB, vLLM 0.20.2 (no plugins)
| Config | KV pool | Concurrency @ 16K | Decode TPS (mean / median) | Use case |
|---|---|---|---|---|
--speculative-config num=1 |
20,024 tok | 1.22× | 53.9 / 54.3 | Stable single-user agent |
--speculative-config num=3 |
16,384 tok | 1.00× | 66.1 / 70.0 | Fastest single-user decode |
| (no spec-decode — equivalent to sibling variant) | ~63,000 tok | ~3.8× | ~36 | If you want multi-user, use the sibling instead |
MTP draft acceptance (vLLM-reported, steady state):
| Config | Per-position acceptance | Average | Mean accepted length |
|---|---|---|---|
num=1 |
75-83% (peaks 90.9%) | ~80% | 1.74-1.91 / 2 |
num=3 |
88% / 77% / 64-69% | ~76-78% | 3.29 / 4 |
num=1 numbers are directly comparable to other public MTP quantizations (e.g. Lorbus's
80-90% claim). num=3 trades lower per-position acceptance for more aggregate decode speed.
You can also stack FP8 KV on top of MTP (slightly lower TPS but ~+43-63% KV pool):
--kv-cache-dtype fp8 + MTP |
KV pool | Concurrency @ 16K | Decode TPS |
|---|---|---|---|
+ num=1 |
32,768 tok | 2.00× | 46.9 / 47.1 |
+ num=3 |
23,507 tok | 1.43× | 62.0 / 61.9 |
⚠️ Decode TPS numbers above are pre-Marlin Conch baselines (measured 2026-05-20). After the 2026-05-22 Marlin metadata update, expect approximately +30% across all rows (Marlin accelerates the W4A16 GEMM regardless of whether MTP is enabled). Re-bench is pending; the no-MTP sibling variant has been re-measured at 35.83 → 46.6 mean (+30%) on the same hardware.
What's new (2026-05-22) — Marlin kernel by default
The quantization_config metadata in both mm/ and self/ packagings was relabeled so
vLLM auto-selects the mainline MarlinLinearKernel instead of the ConchLinearKernel
Triton fallback. The W4A16 weights are byte-identical; the 15 BF16 mtp.* tensors are
also unchanged. Only three metadata fields changed:
| Field | Before | After |
|---|---|---|
quantization_config.quant_method |
auto-round |
gptq |
quantization_config.desc_act |
(absent) | false |
quantization_config.checkpoint_format |
(absent) | gptq |
This works because AutoRound was already configured with
packing_format: auto_round:auto_gptq — the on-disk weight layout is GPTQ-compatible, so
vLLM's gptq_marlin loader accepts the weights after relabeling. The extra_config section
of quantization_config (which marks the 8 linear mtp.* weights as bits=16, data_type=fp)
is untouched — the MTP head still loads correctly into vLLM's Qwen3_5MTP loader.
Expected impact for MTP configurations (re-bench pending — numbers below are projections):
| Config | Pre-Marlin Decode TPS (measured) | Post-Marlin Decode TPS (projected) |
|---|---|---|
num=1 (FP16 KV) |
53.9 / 54.3 | ~70 |
num=3 (FP16 KV) |
66.1 / 70.0 | ~86 |
+ fp8 KV + num=1 |
46.9 / 47.1 | ~61 |
+ fp8 KV + num=3 |
62.0 / 61.9 | ~80 |
The decode speedup is from MarlinLinearKernel accelerating the W4A16 GEMM (q/k/v/o_proj +
MLP) in both the main forward pass and the MTP head's verification pass — about half of the
model's compute. The 48 Gated DeltaNet linear-attention layers still go through the Triton/
FLA kernel and are the physical floor on prefill TTFT on Ampere SM86. MTP itself doesn't
accelerate prefill, so cold TTFT is unchanged. The no-MTP sibling has been re-measured at
35.83 → 46.6 mean (+30%).
To revert: each packaging ships config.json.bak-pre-marlin-20260522. Rename it back
over config.json (no other files to touch) and you're back on the Conch path.
Runtime requirement change after this update:
conch-triton-kernelsis no longer required at runtime — Marlin is in vLLM mainline. Install only if you intend to revert.- All other prerequisites (CUDA toolkit + ninja-build for FP8 KV) are unchanged.
Why this variant exists
Plain W4A16 quantization on a 24 GB GPU gives you ~36 tok/s sustained decode TPS, which is fine for most agentic workloads where prefill (the long system prompt + tool schemas) is the real bottleneck.
But for single-user, decode-bound workloads — long-form chat, streamed reasoning,
chained-of-thought tool calls — you want every decoded token as fast as possible. That's what
Multi-Token Prediction (vLLM's --speculative-config method=mtp) is for. It uses
Qwen3.6's built-in MTP head to draft 1-3 candidate tokens per step and verify them, giving
~1.85-1.9× sustained decode TPS when accepted (and our acceptance rate is 75-83% at n=1,
on par with public Lorbus benchmarks).
The MTP head is dropped by default during W4A16 quantization (it's quantized away or excluded
from the text backbone). This release grafts it back from the upstream BF16 base so vLLM's
Qwen3_5MTP loader can find it.
Who it's for: single-user agent UIs, chat clients, interactive REPLs where latency per token matters more than aggregate users. Who it's not for: multi-user servers (use the sibling variant), or workloads with very long prompts and short outputs (MTP doesn't accelerate prefill).
What's grafted into this variant
15 BF16 mtp.* tensors extracted from the upstream BF16 base
(Qwen/Qwen3.6-27B) and merged into both
packagings (self/ and mm/) as a separate shard model-mtp.safetensors (~810 MB):
mtp.fc.weight— the fusion layer (must stay BF16 per vLLMQwen3_5MTPloader)mtp.layers.0.{self_attn, mlp, *_layernorm}.weight— the single MTP decoder layermtp.norm.weight,mtp.pre_fc_norm_embedding.weight,mtp.pre_fc_norm_hidden.weight
These layers are kept BF16 unquantized; extra_config in quantization_config is updated
to mark the 8 linear mtp.* weights as bits=16, data_type=fp (defensive).
The W4A16 main-body weights are identical to the no-MTP sibling release — same AutoRound
0.12.3 output, same per-layer high-precision policy (lm_head / embed_tokens /
linear_attn.in_proj_a / linear_attn.in_proj_b retained BF16). Only the MTP tensors are
added.
Two packagings (identical content between this variant and the no-MTP sibling, except for the added mtp shard):
| Subfolder | architectures |
For |
|---|---|---|
self/ |
Qwen3_5ForCausalLM |
Generic Transformers / custom loaders |
mm/ |
Qwen3_5ForConditionalGeneration + language_model_only: true |
vLLM (recommended) |
Hardware & runtime
- Single NVIDIA 24 GB GPU. Validated on RTX 3090 (Ampere, SM86).
- Ampere kernel: as of the 2026-05-22 update,
vLLM auto-selects mainline
MarlinLinearKernelvia the relabeledgptqmetadata path.conch-triton-kernelsis no longer required at runtime; install only if you intend to revert the metadata to test against Conch. On Hopper+ mainline kernels apply throughout. - Versions: validated under vLLM 0.20.2.
transformers >= 5.8.1required for theqwen3_5architecture. - Served with
--dtype float16.
Usage — vLLM (use the mm/ packaging)
Standard single-user with MTP K=3 (max decode TPS)
vllm serve bowmanslayer/Qwen3.6-27B-Text-Only-W4A16-g128-MTP-ToolCalling \
--tokenizer bowmanslayer/Qwen3.6-27B-Text-Only-W4A16-g128-MTP-ToolCalling \
--served-model-name qwen3.6-27b-mtp \
--language-model-only \
--enable-auto-tool-choice --tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--gpu-memory-utilization 0.93 --max-model-len 16384 --max-num-seqs 4 --dtype float16 \
--speculative-config '{"method":"mtp","num_speculative_tokens":3}'
# point vLLM at the mm/ subfolder
Conservative single-user with MTP K=1 (higher per-token acceptance, more stable TPS)
# ... (same as above) ...
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'
Combined: FP8 KV + MTP (middle-ground; needs CUDA toolkit + ninja)
CUDA_HOME=/usr/local/cuda-13.0 PATH=/usr/local/cuda-13.0/bin:$PATH \
vllm serve ... \
--kv-cache-dtype fp8 \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'
# KV pool ~33K, concurrency ~2x, decode TPS ~47 — middle ground
Qwen3.6 enables a "thinking" mode by default, which consumes extra tokens. For tool dispatch you may want to disable it (
enable_thinking=false) to reduce token usage and latency.
Validation
Functional validation only — not a full benchmark. Under vLLM 0.20.2 (no patches), the
mm/ packaging produced:
- Tool-calling: 24/24 on internal quick-test set
- Lightweight reasoning: 10/10
- Decode TPS: per the tables above (5 prompts × 2 runs, sequential single-stream)
- MTP draft acceptance: per-position 88/77/64-69, average ~76-78%, mean accepted length 3.29 (vLLM-reported)
Not a full lm-eval comparison against the BF16 base.
License & attribution
- Base model:
Qwen/Qwen3.6-27B, © 2026 Alibaba Cloud, Apache License 2.0. - This artifact is a modified derivative (vision removed; W4A16 quantization;
MTP head BF16-preserved by graft from upstream BF16 base), distributed under the same
Apache-2.0 license. Verbatim license copy in
LICENSE(§4(a)); modifications in What's grafted into this variant (§4(b)); upstream copyright/attribution retained (§4(c)); see alsoNOTICE. - Trademark (§6): "Qwen" / "Qwen3.6" are used only to identify the upstream model. This release is not official and implies no endorsement or affiliation.
- Provided "AS IS" (Apache-2.0 §7); quantization may change outputs vs. the base model.
Citation
Upstream model:
@misc{qwen3.6-27b,
title = {{Qwen3.6-27B}: Flagship-Level Coding in a {27B} Dense Model},
author = {{Qwen Team}},
month = {April},
year = {2026},
url = {https://qwen.ai/blog?id=qwen3.6-27b}
}
Acknowledgments · 致谢
This release is dedicated to my wife. When I worried about the business failing, she told me she'd happily support me at home so I could keep working on AI.
妻如此,夫复何求。
Released on 5月20日 · 520 — with love.
Model tree for bowmanslayer/Qwen3.6-27B-Text-Only-W4A16-g128-MTP-ToolCalling
Base model
Qwen/Qwen3.6-27B