Qwen3.6-27B · Text-Only · W4A16-g128 · MTP · Tool-Calling

Low-latency single-stream variant — ~+85% decode TPS via preserved Multi-Token Prediction head, single-user mode for interactive agents and tool calls.

This is a text-only, 4-bit weight (W4A16, group size 128, AutoRound) quantization of Qwen/Qwen3.6-27B with the MTP head preserved in BF16 so vLLM's native speculative decoding works out of the box.

💡 Pick the right variant for your workload:

  • Multi-user / long-context server: use the high-concurrency sibling bowmanslayer/Qwen3.6-27B-Text-Only-W4A16-g128-ToolCalling (FP8 KV path, 56K max context, ~4-8× concurrency, no MTP).
  • Single-user low-latency (you want decode TPS, you don't need many concurrent users): this variant, with vLLM --speculative-config method=mtp.

Both variants use the same base quantization. The only difference is whether the MTP head is grafted in (this variant) — which adds ~849 MB of weights and trades KV pool for decode speed.

Community quantization. Not an official Qwen release; not endorsed by or affiliated with the Qwen team or Alibaba Cloud. "Qwen3.6" is used only to identify the upstream model (Apache-2.0 §6).


TL;DR — measured on 1× RTX 3090 24 GB, vLLM 0.20.2 (no plugins)

Config KV pool Concurrency @ 16K Decode TPS (mean / median) Use case
--speculative-config num=1 20,024 tok 1.22× 53.9 / 54.3 Stable single-user agent
--speculative-config num=3 16,384 tok 1.00× 66.1 / 70.0 Fastest single-user decode
(no spec-decode — equivalent to sibling variant) ~63,000 tok ~3.8× ~36 If you want multi-user, use the sibling instead

MTP draft acceptance (vLLM-reported, steady state):

Config Per-position acceptance Average Mean accepted length
num=1 75-83% (peaks 90.9%) ~80% 1.74-1.91 / 2
num=3 88% / 77% / 64-69% ~76-78% 3.29 / 4

num=1 numbers are directly comparable to other public MTP quantizations (e.g. Lorbus's 80-90% claim). num=3 trades lower per-position acceptance for more aggregate decode speed.

You can also stack FP8 KV on top of MTP (slightly lower TPS but ~+43-63% KV pool):

--kv-cache-dtype fp8 + MTP KV pool Concurrency @ 16K Decode TPS
+ num=1 32,768 tok 2.00× 46.9 / 47.1
+ num=3 23,507 tok 1.43× 62.0 / 61.9

⚠️ Decode TPS numbers above are pre-Marlin Conch baselines (measured 2026-05-20). After the 2026-05-22 Marlin metadata update, expect approximately +30% across all rows (Marlin accelerates the W4A16 GEMM regardless of whether MTP is enabled). Re-bench is pending; the no-MTP sibling variant has been re-measured at 35.83 → 46.6 mean (+30%) on the same hardware.


What's new (2026-05-22) — Marlin kernel by default

The quantization_config metadata in both mm/ and self/ packagings was relabeled so vLLM auto-selects the mainline MarlinLinearKernel instead of the ConchLinearKernel Triton fallback. The W4A16 weights are byte-identical; the 15 BF16 mtp.* tensors are also unchanged. Only three metadata fields changed:

Field Before After
quantization_config.quant_method auto-round gptq
quantization_config.desc_act (absent) false
quantization_config.checkpoint_format (absent) gptq

This works because AutoRound was already configured with packing_format: auto_round:auto_gptq — the on-disk weight layout is GPTQ-compatible, so vLLM's gptq_marlin loader accepts the weights after relabeling. The extra_config section of quantization_config (which marks the 8 linear mtp.* weights as bits=16, data_type=fp) is untouched — the MTP head still loads correctly into vLLM's Qwen3_5MTP loader.

Expected impact for MTP configurations (re-bench pending — numbers below are projections):

Config Pre-Marlin Decode TPS (measured) Post-Marlin Decode TPS (projected)
num=1 (FP16 KV) 53.9 / 54.3 ~70
num=3 (FP16 KV) 66.1 / 70.0 ~86
+ fp8 KV + num=1 46.9 / 47.1 ~61
+ fp8 KV + num=3 62.0 / 61.9 ~80

The decode speedup is from MarlinLinearKernel accelerating the W4A16 GEMM (q/k/v/o_proj + MLP) in both the main forward pass and the MTP head's verification pass — about half of the model's compute. The 48 Gated DeltaNet linear-attention layers still go through the Triton/ FLA kernel and are the physical floor on prefill TTFT on Ampere SM86. MTP itself doesn't accelerate prefill, so cold TTFT is unchanged. The no-MTP sibling has been re-measured at 35.83 → 46.6 mean (+30%).

To revert: each packaging ships config.json.bak-pre-marlin-20260522. Rename it back over config.json (no other files to touch) and you're back on the Conch path.

Runtime requirement change after this update:

  • conch-triton-kernels is no longer required at runtime — Marlin is in vLLM mainline. Install only if you intend to revert.
  • All other prerequisites (CUDA toolkit + ninja-build for FP8 KV) are unchanged.

Why this variant exists

Plain W4A16 quantization on a 24 GB GPU gives you ~36 tok/s sustained decode TPS, which is fine for most agentic workloads where prefill (the long system prompt + tool schemas) is the real bottleneck.

But for single-user, decode-bound workloads — long-form chat, streamed reasoning, chained-of-thought tool calls — you want every decoded token as fast as possible. That's what Multi-Token Prediction (vLLM's --speculative-config method=mtp) is for. It uses Qwen3.6's built-in MTP head to draft 1-3 candidate tokens per step and verify them, giving ~1.85-1.9× sustained decode TPS when accepted (and our acceptance rate is 75-83% at n=1, on par with public Lorbus benchmarks).

The MTP head is dropped by default during W4A16 quantization (it's quantized away or excluded from the text backbone). This release grafts it back from the upstream BF16 base so vLLM's Qwen3_5MTP loader can find it.

Who it's for: single-user agent UIs, chat clients, interactive REPLs where latency per token matters more than aggregate users. Who it's not for: multi-user servers (use the sibling variant), or workloads with very long prompts and short outputs (MTP doesn't accelerate prefill).


What's grafted into this variant

15 BF16 mtp.* tensors extracted from the upstream BF16 base (Qwen/Qwen3.6-27B) and merged into both packagings (self/ and mm/) as a separate shard model-mtp.safetensors (~810 MB):

  • mtp.fc.weight — the fusion layer (must stay BF16 per vLLM Qwen3_5MTP loader)
  • mtp.layers.0.{self_attn, mlp, *_layernorm}.weight — the single MTP decoder layer
  • mtp.norm.weight, mtp.pre_fc_norm_embedding.weight, mtp.pre_fc_norm_hidden.weight

These layers are kept BF16 unquantized; extra_config in quantization_config is updated to mark the 8 linear mtp.* weights as bits=16, data_type=fp (defensive).

The W4A16 main-body weights are identical to the no-MTP sibling release — same AutoRound 0.12.3 output, same per-layer high-precision policy (lm_head / embed_tokens / linear_attn.in_proj_a / linear_attn.in_proj_b retained BF16). Only the MTP tensors are added.

Two packagings (identical content between this variant and the no-MTP sibling, except for the added mtp shard):

Subfolder architectures For
self/ Qwen3_5ForCausalLM Generic Transformers / custom loaders
mm/ Qwen3_5ForConditionalGeneration + language_model_only: true vLLM (recommended)

Hardware & runtime

  • Single NVIDIA 24 GB GPU. Validated on RTX 3090 (Ampere, SM86).
  • Ampere kernel: as of the 2026-05-22 update, vLLM auto-selects mainline MarlinLinearKernel via the relabeled gptq metadata path. conch-triton-kernels is no longer required at runtime; install only if you intend to revert the metadata to test against Conch. On Hopper+ mainline kernels apply throughout.
  • Versions: validated under vLLM 0.20.2. transformers >= 5.8.1 required for the qwen3_5 architecture.
  • Served with --dtype float16.

Usage — vLLM (use the mm/ packaging)

Standard single-user with MTP K=3 (max decode TPS)

vllm serve bowmanslayer/Qwen3.6-27B-Text-Only-W4A16-g128-MTP-ToolCalling \
  --tokenizer bowmanslayer/Qwen3.6-27B-Text-Only-W4A16-g128-MTP-ToolCalling \
  --served-model-name qwen3.6-27b-mtp \
  --language-model-only \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --gpu-memory-utilization 0.93 --max-model-len 16384 --max-num-seqs 4 --dtype float16 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
# point vLLM at the mm/ subfolder

Conservative single-user with MTP K=1 (higher per-token acceptance, more stable TPS)

# ... (same as above) ...
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

Combined: FP8 KV + MTP (middle-ground; needs CUDA toolkit + ninja)

CUDA_HOME=/usr/local/cuda-13.0 PATH=/usr/local/cuda-13.0/bin:$PATH \
vllm serve ... \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
# KV pool ~33K, concurrency ~2x, decode TPS ~47 — middle ground

Qwen3.6 enables a "thinking" mode by default, which consumes extra tokens. For tool dispatch you may want to disable it (enable_thinking=false) to reduce token usage and latency.


Validation

Functional validation only — not a full benchmark. Under vLLM 0.20.2 (no patches), the mm/ packaging produced:

  • Tool-calling: 24/24 on internal quick-test set
  • Lightweight reasoning: 10/10
  • Decode TPS: per the tables above (5 prompts × 2 runs, sequential single-stream)
  • MTP draft acceptance: per-position 88/77/64-69, average ~76-78%, mean accepted length 3.29 (vLLM-reported)

Not a full lm-eval comparison against the BF16 base.


License & attribution

  • Base model: Qwen/Qwen3.6-27B, © 2026 Alibaba Cloud, Apache License 2.0.
  • This artifact is a modified derivative (vision removed; W4A16 quantization; MTP head BF16-preserved by graft from upstream BF16 base), distributed under the same Apache-2.0 license. Verbatim license copy in LICENSE (§4(a)); modifications in What's grafted into this variant (§4(b)); upstream copyright/attribution retained (§4(c)); see also NOTICE.
  • Trademark (§6): "Qwen" / "Qwen3.6" are used only to identify the upstream model. This release is not official and implies no endorsement or affiliation.
  • Provided "AS IS" (Apache-2.0 §7); quantization may change outputs vs. the base model.

Citation

Upstream model:

@misc{qwen3.6-27b,
    title  = {{Qwen3.6-27B}: Flagship-Level Coding in a {27B} Dense Model},
    author = {{Qwen Team}},
    month  = {April},
    year   = {2026},
    url    = {https://qwen.ai/blog?id=qwen3.6-27b}
}

Acknowledgments · 致谢

This release is dedicated to my wife. When I worried about the business failing, she told me she'd happily support me at home so I could keep working on AI.

妻如此,夫复何求。

Released on 5月20日 · 520 — with love.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for bowmanslayer/Qwen3.6-27B-Text-Only-W4A16-g128-MTP-ToolCalling

Base model

Qwen/Qwen3.6-27B
Quantized
(421)
this model