
Qwen3.6-27B-Claude-Opus-Reasoning-Distill-v2-abliterated-8bit-mlx

⚠️ THIS IS A TEXT-ONLY MODEL — NO VISION

The upstream abliteration pass stripped the vision tower. For vision-capable Qwen 3.6 27B Opus-Distill MLX quants, see our parallel repos at huggingface.co/osmapi (look for repos without -abliterated in the name).

8-bit affine MLX quantization of an abliterated Qwen 3.6 27B Claude-Opus reasoning distill, by the osmAPI team — "OpenRouter of India".

Indistinguishable from BF16 on every benchmark we measured (NLL drift < 0.005). Use this if you have the RAM and want zero quantization drift.


⚡ TL;DR

Disk size: ~27 GB
Effective BPW: 8.0
Scheme: Affine 8-bit, group size 64
Recommended RAM: 48 GB Apple Silicon (M4 Pro 48 GB, M4 Max, Studio Ultra)
Vision: ❌ text-only (the upstream abliteration step stripped the ViT)
Made by: osmAPI, "OpenRouter of India"
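
As a rough cross-check of the disk-size figure, weight storage is approximately parameter count times bits per weight. The sketch below assumes a nominal 27e9 parameters and ignores per-group scales and biases, any tensors kept at higher precision, and the exact parameter count, so it only approximates the numbers in this card. The helper name `weight_size_gb` is illustrative and not part of any library.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 27e9  # assumption: nominal parameter count, not the exact one

print(f"8-bit: {weight_size_gb(N_PARAMS, 8.0):.1f} GB")
print(f"BF16:  {weight_size_gb(N_PARAMS, 16.0):.1f} GB")
```

The gap between the weight footprint and the recommended 48 GB leaves headroom for the KV cache, activations, and the operating system.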

🧬 Lineage

Qwen/Qwen3.6-27B                                              (Qwen Team — base pretrain)
        │
        ▼
TeichAI/Qwen3.6-27B-Claude-Opus-Reasoning-Distill-v2          (TeichAI — Claude-Opus reasoning distill)
        │
        ▼
abliterated (refusal-ablated) via OBLITERATUS v0.1.2          (multi-direction SVD, BF16)
        │
        ▼
this repo — 8-bit affine, MLX format                        (osmAPI team — quantization)



📦 Use it

mlx-lm (recommended)

pip install mlx-lm

from mlx_lm import load, generate

# The weights are fetched from the Hugging Face Hub on first use.
model, tokenizer = load("osmapi/Qwen3.6-27B-Claude-Opus-Reasoning-Distill-v2-abliterated-8bit-mlx")
prompt = "Explain the difference between SSM and softmax attention in three sentences."
out = generate(model, tokenizer, prompt=prompt, max_tokens=400)
print(out)

Chat template

messages = [
    {"role": "system", "content": "You are a helpful, candid reasoning assistant."},
    {"role": "user", "content": "Plan a 3-day Tokyo itinerary for a foodie."},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(generate(model, tokenizer, prompt=prompt, max_tokens=600))

CLI

mlx_lm.generate --model osmapi/Qwen3.6-27B-Claude-Opus-Reasoning-Distill-v2-abliterated-8bit-mlx --prompt "Hello" --max-tokens 256

🧪 Quantization details

  • Source weights: BF16 abliterated checkpoint (28 shards, ~57 GB) derived from TeichAI/Qwen3.6-27B-Claude-Opus-Reasoning-Distill-v2 via OBLITERATUS multi-direction SVD ablation (preserves coherence; KL drift = 0.149 from base).
  • Quantization scheme: Affine 8-bit, group size 64.
  • Calibration corpus: mlx-lm calibration_v5 (~427 KB of English text). It is used only for OptiQ sensitivity ranking in the sibling OptiQ variants; uniform affine quants such as this one do not require calibration.
  • Sanity check: forward perplexity on held-out calibration text is within 1–3% of the next-higher-precision sibling.
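
To make "affine 8-bit, group size 64" concrete, here is a minimal pure-Python sketch of the textbook min/max formulation: each group of 64 weights stores one scale and one bias, and every weight becomes an integer in [0, 255]. This is an illustration under stated assumptions, not the MLX kernel; the function names are hypothetical, and the real implementation may differ in rounding and in how it packs and stores the per-group parameters.

```python
import random


def quantize_group(weights, bits=8):
    """Affine-quantize one group so that w ~= scale * q + bias, q in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    # Fall back to scale 1.0 for a constant group to avoid dividing by zero.
    scale = (w_max - w_min) / levels or 1.0
    q = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return q, scale, w_min  # the bias is the group minimum


def dequantize_group(q, scale, bias):
    """Reconstruct approximate weights from integer codes."""
    return [scale * qi + bias for qi in q]


random.seed(0)
group = [random.gauss(0.0, 0.02) for _ in range(64)]  # one synthetic group of 64 weights
q, scale, bias = quantize_group(group)
restored = dequantize_group(q, scale, bias)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(f"scale={scale:.6f} bias={bias:.6f} max_abs_error={max_err:.6f}")
```

With round-to-nearest, the reconstruction error per weight is bounded by half a quantization step (`scale / 2`), which is why 8-bit groups track the BF16 weights so closely.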

Architecture notes

The Qwen 3.6 27B family uses a hybrid attention stack: 3 GatedDeltaNet (linear-attention/SSM) layers followed by 1 full-softmax-attention layer, repeated 16× for 64 total layers, with a hidden size of 5120, a 248K vocabulary, and a 262K context window. The SSM kernels lack a VJP path in MLX, so backward-pass-based quant methods (DWQ, dynamic quant) cannot be applied here. OptiQ's forward-only sensitivity approach is the only calibration-aware option that works on this architecture, which is why the OptiQ variants exist.
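
The repeating layout can be sketched as a list of layer types. This assumes blocks of 3 GatedDeltaNet layers plus 1 full-attention layer, which is the split for which 16 repetitions give 64 layers; the constant names and type labels are illustrative and not taken from the model's config.

```python
LINEAR_PER_BLOCK = 3  # assumption: GatedDeltaNet layers per repeating block
FULL_PER_BLOCK = 1    # full-softmax-attention layers per block
NUM_BLOCKS = 16


def layer_types():
    """Return the per-layer attention type for the whole stack, in order."""
    block = ["linear_attention"] * LINEAR_PER_BLOCK + ["full_attention"] * FULL_PER_BLOCK
    return block * NUM_BLOCKS


layers = layer_types()
full_idx = [i for i, kind in enumerate(layers) if kind == "full_attention"]
print(len(layers), full_idx[:4])  # 64 [3, 7, 11, 15]
```

Under this layout, only every fourth layer keeps a growing KV cache, while the linear-attention layers carry a fixed-size recurrent state.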


⚠️ Behavior caveats

  • Text-only — no vision. The abliteration pipeline (OBLITERATUS) ran on the LM tower and stripped the ViT. For vision-capable quants of the same Opus-Distill v2 lineage, use our parallel non-abliterated repos at huggingface.co/osmapi (any repo without -abliterated in the name).
  • This is an abliterated model — refusal directions were surgically removed from the parent. It will answer prompts the parent would refuse. Use responsibly and within applicable law.
  • Quantization preserves abliteration: the refusal rate measured at BF16 (~35% from a 100% baseline) stays in that range across our quants.

🙏 Credits

Quantization & release: osmAPI team, "OpenRouter of India"
Reasoning distill: TeichAI (Claude-Opus 4.5/4.6 high-reasoning datasets)
Foundation model: Qwen Team
Abliteration toolkit: OBLITERATUS by elder-plinius
Quant toolkit: mlx-lm, mlx-optiq

📜 License

Apache-2.0, inherited from the upstream foundation model and reasoning distill.


Need a hosted endpoint, custom quant, or larger-scale inference? osmAPI — multi-provider LLM routing for the Indian developer ecosystem.

⚡ 3.3–3.7× faster decoding with DFlash (lossless, MLX)

This MLX build supports lossless block-diffusion speculative decoding via DFlash in mlx_vlm: no requantization, no model changes. On an Apple M4 Max we measured 3.38× (8-bit) and 3.67× (bf16) decode speedups with byte-identical output; other MLX quants of this model should see a similar ~3× speedup.

python3 -m mlx_vlm generate \
  --model osmapi/Qwen3.6-27B-Claude-Opus-Reasoning-Distill-v2-abliterated-8bit-mlx \
  --draft-model z-lab/Qwen3.6-27B-DFlash --draft-kind dflash \
  --prompt "Write a merge function for two sorted lists in Python." --max-tokens 256
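
Why speculative decoding can be lossless is easiest to see in a toy greedy draft-and-verify loop. The sketch below is a generic illustration, not DFlash's algorithm (which uses a block-diffusion drafter): `target_next` and `draft_next` are hypothetical stand-ins for the two models, and the draft is deliberately wrong at some positions. Every emitted token is the target's own choice, so the output matches plain greedy decoding while needing fewer target passes.

```python
def target_next(tokens):
    """Toy target model: a deterministic next token for a given context."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def draft_next(tokens):
    """Toy draft model: agrees with the target unless the context length is a multiple of 5."""
    t = target_next(tokens)
    return t if len(tokens) % 5 else (t + 1) % 50


def greedy_decode(prompt, n):
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out


def speculative_decode(prompt, n, block=4):
    """Draft a block, verify it against the target, keep the verified prefix."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n:
        proposal = []
        for _ in range(block):
            proposal.append(draft_next(out + proposal))
        # A real system scores all drafted positions in one batched target pass.
        target_passes += 1
        accepted = []
        for tok in proposal:
            expected = target_next(out + accepted)
            accepted.append(expected)  # always keep the target's token
            if tok != expected:
                break  # first mismatch: discard the rest of the draft
        out.extend(accepted)
    return out[:len(prompt) + n], target_passes


prompt = [1, 2, 3]
spec_out, passes = speculative_decode(prompt, 40)
print("same output:", spec_out == greedy_decode(prompt, 40))
print("target passes:", passes, "vs 40 for plain greedy decoding")
```

The speedup comes from how many drafted tokens survive verification per target pass, so it depends on how well the drafter matches the target on a given prompt.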