🧊 PolarQuant Q5 -- Qwen3.5-27B

The full 27B Qwen3.5 model compressed with PolarQuant Q5 -- PPL 5.37 (only +0.13 over the BF16 baseline of 5.24), with VRAM use cut from ~56 GB (FP16) to ~18-20 GB when combined with torchao INT4.

PolarQuant brings near-lossless 5-bit quantization to the larger Qwen3.5-27B model, enabling it to run on hardware that cannot fit the full FP16 model (56 GB).


🎯 Key Results

| Metric | Value |
|---|---|
| Method | PolarQuant Q5 + torchao INT4 |
| Perplexity (WikiText-2) | 5.37 |
| BF16 Baseline PPL | 5.24 |
| Delta from BF16 | +0.13 (near-lossless) |
| Original Size | ~56 GB (FP16) |
| Quantized Size | Significantly reduced |
| Quantization | PolarQuant Q5 (5-bit, block_size=128) |

📊 Benchmark Comparison

[Figure: PPL Comparison]

[Figure: Speed vs VRAM]

| Method | PPL | Delta | Notes |
|---|---|---|---|
| BF16 baseline | 5.24 | -- | Full precision |
| PolarQuant Q5 + torchao | 5.37 | +0.13 | Near-lossless |
| torchao INT4 (absmax) | ~5.50 | +0.26 | Standard quantization |

The 27B model degrades less under PolarQuant than the 9B model: +0.13 PPL versus +0.19. This is consistent with larger models being more robust to quantization.
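To put the absolute deltas in proportion, the short sketch below converts the perplexity figures from the comparison table into relative increases over the BF16 baseline. The helper name `relative_increase` is made up for this example, and the torchao INT4 figure is the approximate value listed in the table.

```python
# Perplexity figures copied from the comparison table.
BF16_PPL = 5.24
results = {
    "PolarQuant Q5 + torchao": 5.37,
    "torchao INT4 (absmax)": 5.50,  # listed as approximate (~5.50)
}


def relative_increase(ppl: float, baseline: float = BF16_PPL) -> float:
    """Return the perplexity increase over the baseline, in percent."""
    return (ppl / baseline - 1.0) * 100.0


for name, ppl in results.items():
    delta = ppl - BF16_PPL
    print(f"{name}: +{delta:.2f} PPL ({relative_increase(ppl):.1f}% over BF16)")
```

With these numbers the PolarQuant path sits roughly 2.5% above the baseline and plain INT4 roughly 5.0% above it.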


🔬 Why PolarQuant?

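PolarQuant applies a Hadamard rotation to each normalized weight block so that its entries become approximately Gaussian, then quantizes them with Lloyd-Max centroids. Lloyd-Max centroids minimize the mean squared error of scalar quantization for a given distribution, so a codebook computed for N(0,1) is matched to the rotated weights.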

Original Weights --> Normalize --> Hadamard Rotate --> Lloyd-Max Quantize --> Store Codes
                                                                               |
Inference: Codes --> Centroid Lookup --> Inverse Hadamard --> Scale by Norms --> BF16 Weights
                                                                               |
                                                                 torchao INT4 --> cuBLAS
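The pipeline in the diagram can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function names are invented for the example, the centroids are estimated from random samples rather than computed analytically, and the rotated block is rescaled by sqrt(128) so its entries have unit variance before the N(0,1) codebook is applied.

```python
import numpy as np

BLOCK = 128   # block size from the model card
LEVELS = 32   # 5 bits -> 2**5 centroids


def hadamard(n: int) -> np.ndarray:
    """Normalized Walsh-Hadamard matrix (Sylvester construction); n is a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def lloyd_max_gaussian(levels: int, iters: int = 100, samples: int = 100_000) -> np.ndarray:
    """Approximate Lloyd-Max centroids for N(0,1) from samples (assumption: sample-based)."""
    x = np.sort(np.random.default_rng(0).standard_normal(samples))
    c = np.quantile(x, (np.arange(levels) + 0.5) / levels)  # quantile initialization
    for _ in range(iters):
        edges = (c[:-1] + c[1:]) / 2                  # nearest-centroid decision boundaries
        idx = np.searchsorted(edges, x)               # assign each sample to a cell
        sums = np.bincount(idx, weights=x, minlength=levels)
        counts = np.bincount(idx, minlength=levels)
        c = np.where(counts > 0, sums / np.maximum(counts, 1), c)  # cell means
    return c


def quantize(w: np.ndarray, H: np.ndarray, centroids: np.ndarray):
    """Normalize -> Hadamard rotate -> nearest-centroid codes."""
    blocks = w.reshape(-1, BLOCK)
    norms = np.linalg.norm(blocks, axis=1, keepdims=True)
    unit = blocks / np.maximum(norms, 1e-12)
    rotated = (unit @ H) * np.sqrt(BLOCK)             # entries now roughly N(0,1)
    edges = (centroids[:-1] + centroids[1:]) / 2
    codes = np.searchsorted(edges, rotated).astype(np.int8)
    return codes, norms.astype(np.float16)


def dequantize(codes: np.ndarray, norms: np.ndarray, H: np.ndarray, centroids: np.ndarray):
    """Centroid lookup -> inverse Hadamard -> scale by norms."""
    rotated = centroids[codes] / np.sqrt(BLOCK)
    unit = rotated @ H.T                              # H is orthogonal, so H.T inverts it
    return (unit * norms.astype(np.float32)).reshape(-1)


rng = np.random.default_rng(1)
w = rng.standard_normal(32 * BLOCK) * 0.02           # synthetic stand-in for a weight tensor
H = hadamard(BLOCK)
centroids = lloyd_max_gaussian(LEVELS)
codes, norms = quantize(w, H, centroids)
w_hat = dequantize(codes, norms, H, centroids)
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"relative reconstruction error: {rel_err:.4f}")
```

The codes are kept as int8 and the norms as fp16 to mirror the storage layout listed under Technical Details; packing the 5-bit codes more tightly is left out of the sketch.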

Key Insight

Likely reasons why larger models benefit more from PolarQuant:

  • More parameters means better statistical convergence to Gaussian after rotation
  • The Lloyd-Max centroids become increasingly optimal as block statistics stabilize
  • Quantization error is distributed across more parameters, reducing per-token impact
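The first point rests on rotation pushing block statistics toward a Gaussian shape. The sketch below checks that effect on synthetic data only: it draws heavy-tailed (Laplace) blocks as a stand-in for real weights, applies a 128x128 normalized Hadamard rotation, and compares excess kurtosis before and after (a Gaussian has excess kurtosis 0). It does not test the model-size claim itself, and the helper names are invented for the example.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Normalized Walsh-Hadamard matrix (Sylvester construction); n is a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def excess_kurtosis(x: np.ndarray) -> float:
    """Fourth standardized moment minus 3; close to 0 for Gaussian data."""
    x = np.asarray(x, dtype=np.float64).ravel()
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)


rng = np.random.default_rng(0)
blocks = rng.laplace(size=(4096, 128))   # synthetic heavy-tailed "weights"
rotated = blocks @ hadamard(128)         # each output mixes all 128 inputs

before = excess_kurtosis(blocks)
after = excess_kurtosis(rotated)
print(f"excess kurtosis before rotation: {before:.3f}")
print(f"excess kurtosis after rotation:  {after:.3f}")
```

Because every rotated entry is a signed average of 128 independent inputs, the heavy tails largely cancel out, which is the property the N(0,1) codebook depends on.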

🚀 Quick Start

CUDA (torchao INT4) -- Recommended

from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import quantize_, Int4WeightOnlyConfig

model_id = "caiovicentino1/Qwen3.5-27B-PolarQuant-Q5"

# Load the PolarQuant checkpoint; trust_remote_code=True is needed for this model.
model = AutoModelForCausalLM.from_pretrained(
    model_id, dtype="bfloat16", device_map="auto", trust_remote_code=True
)
# Convert the loaded BF16 weights to INT4 in place for torchao's INT4 kernels.
quantize_(model, Int4WeightOnlyConfig(group_size=128))

tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "Explain the implications of quantum entanglement:"
# Place inputs on the model's device instead of hard-coding "cuda".
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=300)
print(tokenizer.decode(output[0], skip_special_tokens=True))

🖥️ Hardware Requirements

| Configuration | VRAM Required | Notes |
|---|---|---|
| FP16 | ~56 GB | A100 80GB / H100 |
| PolarQuant Q5 + torchao INT4 | ~18-20 GB | RTX 4090 / A6000 / RTX PRO 6000 |
| Multi-GPU (2x) | ~10 GB each | 2x RTX 3090/4090 |
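As a rough sanity check on these figures, the sketch below estimates weight memory alone from a nominal parameter count. It is a back-of-the-envelope calculation under assumptions that are not taken from the model card: exactly 27e9 parameters, and one 16-bit scale plus one 16-bit zero point per 128-weight INT4 group. It ignores activations, recurrent state, and any layers left unquantized, which helps explain why the table's numbers are higher.

```python
PARAMS = 27e9  # nominal parameter count (assumption; the real checkpoint may differ slightly)


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes) for a given average bit width."""
    return params * bits_per_weight / 8 / 1e9


fp16_gb = weight_gb(PARAMS, 16)
# Assumption: each group of 128 INT4 weights carries a 16-bit scale and a 16-bit zero point.
int4_gb = weight_gb(PARAMS, 4 + 32 / 128)

print(f"FP16 weights:        {fp16_gb:.1f} GB")
print(f"INT4 (g=128) weights: {int4_gb:.1f} GB")
```

Under these assumptions the weights come to about 54 GB in FP16 and about 14.3 GB in INT4; the remainder of the ~18-20 GB in the table would be runtime overhead.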

Tip: Qwen3.5-27B uses the DeltaNet architecture with flash-linear-attention for efficient inference. Ensure you have trust_remote_code=True and the latest transformers version.


🔧 Technical Details

| Component | Details |
|---|---|
| Base Model | Qwen3.5-27B (DeltaNet architecture) |
| Quantization | PolarQuant Q5 (5-bit, block_size=128) |
| Rotation | 128x128 normalized Walsh-Hadamard matrix |
| Centroids | Pre-computed MSE-optimal for N(0,1) via 100 Lloyd-Max iterations |
| Storage | int8 codes + fp16 per-block norms + fp32 centroid table |
| Inference | torchao INT4 cuBLAS (group_size=128) |
| Architecture | DeltaNet (stateful, uses flash-linear-attention) |
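The storage row implies a simple bits-per-weight calculation, sketched below. The helper name is invented for the example, and the "packed" figure is a hypothetical comparison point showing what tightly packed 5-bit codes would cost; the model card itself lists int8 codes.

```python
def bits_per_weight(code_bits: int, norm_bits: int = 16, block_size: int = 128) -> float:
    """Average bits per weight: one code per weight plus one norm shared by each block."""
    return code_bits + norm_bits / block_size


stored = bits_per_weight(8)        # int8 codes + fp16 per-block norms, as listed
packed = bits_per_weight(5)        # hypothetical: codes packed to 5 bits each
centroid_table_bytes = 2 ** 5 * 4  # 32 fp32 centroids, negligible next to the weights

print(f"as stored (int8 codes): {stored} bits/weight")
print(f"if packed to 5 bits:    {packed} bits/weight")
print(f"centroid table:         {centroid_table_bytes} bytes")
```

The per-block norm adds only 0.125 bits per weight at block_size=128, so the code width dominates the on-disk size.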

DeltaNet Notes

Qwen3.5 uses DeltaNet (linear attention with delta rule), which is stateful:

  • Requires flash-linear-attention for efficient inference
  • Speculative decoding is not currently supported
  • Use trust_remote_code=True when loading
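To illustrate what "stateful" means here, the sketch below implements one step of the plain delta rule in NumPy. It is a didactic version under assumptions: it omits any gating, normalization, and chunked kernels that the actual model or flash-linear-attention may use, and the function name is invented for the example. The state S is a fixed-size matrix that is updated at every token, in contrast to a key-value cache that grows with sequence length.

```python
import numpy as np


def delta_rule_step(S, q, k, v, beta):
    """One token of delta-rule linear attention.

    S: (d_v, d_k) state matrix; q, k: (d_k,) query and key; v: (d_v,) value;
    beta: write strength in [0, 1].
    """
    pred = S @ k                           # value the state currently returns for key k
    S = S + beta * np.outer(v - pred, k)   # correct the state toward the new value
    return S, S @ q                        # updated state and this token's output


d_k, d_v = 4, 3
S = np.zeros((d_v, d_k))
k = np.array([1.0, 0.0, 0.0, 0.0])         # unit-norm key
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([-1.0, 0.5, 4.0])

S, out1 = delta_rule_step(S, k, k, v1, beta=1.0)  # write v1 under k, then read it back
S, out2 = delta_rule_step(S, k, k, v2, beta=1.0)  # overwrite the same key with v2
print(out1, out2, S.shape)
```

With a unit-norm key and beta=1, reading back with the same key returns the most recently written value, and the state keeps its (d_v, d_k) shape no matter how many tokens are processed.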

📖 Citation

@article{vicentino2026polarquant,
  title={PolarQuant: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.7424577},
  year={2026}
}

🙏 Acknowledgements

Built with PyTorch, torchao, flash-linear-attention, and the Qwen team's open-weight models.
