Paper: Attn-QAT: 4-Bit Attention With Quantization-Aware Training (arXiv:2603.00040)
How to use FastVideo/FastWan-QAD-1.3B-SA2 with Diffusers:
pip install -U diffusers transformers accelerate
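import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# requires a CUDA GPU; see the hardware requirement below
pipe = DiffusionPipeline.from_pretrained("FastVideo/FastWan-QAD-1.3B-SA2", dtype=torch.bfloat16, device_map="cuda")
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
# assuming the repo loads as a Wan text-to-video pipeline, the output holds video frames, not images
frames = pipe(prompt).frames[0]
export_to_video(frames, "output.mp4", fps=16)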
FastWan-QAD-1.3B-SA2 is a variant of FastWan-QAD-1.3B that swaps the SageAttention3 FP4 backend for SageAttention2++, trading a small amount of speed for improved visual quality. It generates a 5-second 480p video in approximately 2 seconds on an RTX 5090.
Like all FastWan-QAD models, it is built on Wan-AI/Wan2.1-T2V-1.3B-Diffusers and trained with quantization-aware distillation (QAD) for 3-step inference with NVFP4 linear layers.
Hardware requirement: RTX 5090 (sm100+). NVFP4 linear layers require Blackwell-native support. See FastWan-QAD-FP8-1.3B for RTX 4090 compatibility.
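Before loading the model, it can help to check the GPU's compute capability. The sketch below assumes the `sm100+` threshold stated above maps to a CUDA compute capability of at least 10.0. The helper name `meets_nvfp4_requirement` is hypothetical and not part of FastVideo, and the check only looks at the capability number, not at whether the required kernels are installed.

```python
from typing import Tuple

# Assumption: "sm100+" above corresponds to compute capability >= (10, 0).
MIN_CAPABILITY: Tuple[int, int] = (10, 0)


def meets_nvfp4_requirement(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability is at least sm100."""
    return tuple(capability) >= MIN_CAPABILITY


if __name__ == "__main__":
    # torch is optional here; the pure check above works without it.
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        print(f"sm{cap[0]}{cap[1]}: meets requirement = {meets_nvfp4_requirement(cap)}")
    else:
        print("No CUDA device detected.")
```

An RTX 4090 reports capability (8, 9) and fails this check, which matches the pointer to FastWan-QAD-FP8-1.3B above. An RTX 5090 reports (12, 0) and passes.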
| Model | Hardware | Generation Time (5s 480p) |
|---|---|---|
| FastWan-QAD-1.3B | RTX 5090 | ~1.78s |
| FastWan-QAD-1.3B-SA2 | RTX 5090 | ~2.0s |
| FastWan-QAD-FP8-1.3B | RTX 4090 | ~3.4s |
| TurboDiffusion | RTX 5090 | 6.10s |
| LightX2V | RTX 5090 | 6.91s |
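
To put the table in relative terms, the short sketch below computes speedup ratios from the listed times. The numbers are copied from the table. The dictionary and function names are illustrative only, and since several entries are approximate ("~"), the ratios should be read as rough figures.

```python
# Generation times in seconds for a 5 s 480p clip, copied from the table above.
generation_time_s = {
    "FastWan-QAD-1.3B": 1.78,
    "FastWan-QAD-1.3B-SA2": 2.0,
    "FastWan-QAD-FP8-1.3B": 3.4,  # measured on an RTX 4090; the others on an RTX 5090
    "TurboDiffusion": 6.10,
    "LightX2V": 6.91,
}


def speedup(baseline: str, model: str) -> float:
    """How many times faster `model` is than `baseline`."""
    return generation_time_s[baseline] / generation_time_s[model]


for baseline in ("TurboDiffusion", "LightX2V"):
    ratio = speedup(baseline, "FastWan-QAD-1.3B-SA2")
    print(f"FastWan-QAD-1.3B-SA2 vs {baseline}: {ratio:.2f}x")

# Relative cost of using SageAttention2++ instead of SageAttention3 FP4.
overhead = generation_time_s["FastWan-QAD-1.3B-SA2"] / generation_time_s["FastWan-QAD-1.3B"] - 1
print(f"SA2 overhead vs FastWan-QAD-1.3B: {overhead:.0%}")
```

On these figures, the SA2 variant is roughly 3x to 3.5x faster than the two RTX 5090 baselines and about 12% slower than FastWan-QAD-1.3B.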
docker run --gpus all --ipc=host --rm -it ghcr.io/hao-ai-lab/fastvideo/fastvideo-dev:py3.12-sha-f889e6b bash
# should drop you in /FastVideo with venv already activated
git fetch && git checkout main
# build fastvideo-kernel
cd fastvideo-kernels/ && ./build.sh && cd ..
git clone https://github.com/madebyollin/taehv
uv pip install ./taehv
# run generation:
FASTVIDEO_DISABLE_ATTENTION_COMPILE=0 FASTVIDEO_ATTENTION_BACKEND=SAGE_ATTN python examples/inference/optimizations/nvfp4_sa2_wan_2_1_3b.py --model FastVideo/FastWan-QAD-1.3B-SA2 --distilled_model "" --taehv_checkpoint taehv/taew2_1.pth
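
For scripted runs, the same command can be launched from Python. This is a minimal sketch that only mirrors the environment variables and arguments shown above. The helper names `build_env` and `build_command` are hypothetical, and it assumes it is run from the FastVideo repository root inside the container.

```python
import os
import subprocess
from typing import List, Mapping, Dict

MODEL_ID = "FastVideo/FastWan-QAD-1.3B-SA2"
SCRIPT = "examples/inference/optimizations/nvfp4_sa2_wan_2_1_3b.py"


def build_env(base: Mapping[str, str]) -> Dict[str, str]:
    """Copy an environment and add the two FastVideo settings from the command above."""
    env = dict(base)
    env["FASTVIDEO_DISABLE_ATTENTION_COMPILE"] = "0"
    env["FASTVIDEO_ATTENTION_BACKEND"] = "SAGE_ATTN"
    return env


def build_command(taehv_checkpoint: str = "taehv/taew2_1.pth") -> List[str]:
    """Argument list equivalent to the shell command above."""
    return [
        "python", SCRIPT,
        "--model", MODEL_ID,
        "--distilled_model", "",  # empty string, as in the shell command
        "--taehv_checkpoint", taehv_checkpoint,
    ]


if __name__ == "__main__" and os.path.exists(SCRIPT):
    # Only launches when the example script is present in the working directory.
    subprocess.run(build_command(), env=build_env(os.environ), check=True)
```

Passing the arguments as a list avoids shell quoting issues with the empty `--distilled_model` value.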
More details coming soon.
If you find this model useful, please cite our paper:
@article{Zhang2026AttnQAT,
title={Attn-QAT: 4-Bit Attention With Quantization-Aware Training},
author={Zhang, Peiyuan and Noto, Matthew and Tan, Wenxuan and Jiang, Chengquan and Lin, Will and Zhou, Wei and Zhang, Hao},
journal={arXiv preprint arXiv:2603.00040},
year={2026}
}