Instructions for using QuantTrio/GLM-5.2-Int4-Int8Mix with libraries and local apps.

Libraries

How to use QuantTrio/GLM-5.2-Int4-Int8Mix with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="QuantTrio/GLM-5.2-Int4-Int8Mix")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForMultimodalLM

tokenizer = AutoTokenizer.from_pretrained("QuantTrio/GLM-5.2-Int4-Int8Mix")
model = AutoModelForMultimodalLM.from_pretrained("QuantTrio/GLM-5.2-Int4-Int8Mix")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use QuantTrio/GLM-5.2-Int4-Int8Mix with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "QuantTrio/GLM-5.2-Int4-Int8Mix"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-5.2-Int4-Int8Mix",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/QuantTrio/GLM-5.2-Int4-Int8Mix

SGLang

How to use QuantTrio/GLM-5.2-Int4-Int8Mix with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "QuantTrio/GLM-5.2-Int4-Int8Mix" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-5.2-Int4-Int8Mix",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "QuantTrio/GLM-5.2-Int4-Int8Mix" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-5.2-Int4-Int8Mix",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner

How to use QuantTrio/GLM-5.2-Int4-Int8Mix with Docker Model Runner:

```
docker model run hf.co/QuantTrio/GLM-5.2-Int4-Int8Mix
```

GLM-5.2-Int4-Int8Mix

Base model: zai-org/GLM-5.2

This repo contains weights quantized with a data-free quantization tool; no calibration dataset was involved.

This release is prepared for vLLM with compressed-tensors weight-only A16 inference. It does not claim compatibility with SGLang or other runtimes.

The default reasoning_effort is changed to medium-high to reduce thinking-token cost. You can override it per request with "chat_template_kwargs": {"reasoning_effort": "max"}, or edit chat_template.jinja to set a different default.
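
As a minimal sketch of the per-request override, the snippet below builds a chat-completions payload for a vLLM OpenAI-compatible server. Only the `chat_template_kwargs` field comes from this card; the helper names `build_chat_request` and `send`, the base URL, and the model name `GLM-5.2` (matching `--served-model-name` in the startup command) are assumptions for illustration.

```python
import json
import urllib.request


def build_chat_request(prompt, reasoning_effort=None, model="GLM-5.2"):
    """Build a /v1/chat/completions payload (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if reasoning_effort is not None:
        # Overrides the chat-template default (medium-high in this release).
        payload["chat_template_kwargs"] = {"reasoning_effort": reasoning_effort}
    return payload


def send(payload, base_url="http://localhost:8000"):
    """POST the payload; assumes a server is already listening at base_url."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Print the payload only; call send(...) once a server is running.
    print(json.dumps(build_chat_request("Who are you?", "max"), indent=2))
```

Leaving `reasoning_effort` unset keeps whatever default `chat_template.jinja` defines, so the override stays opt-in per request.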

【Quantization Policy】

| Scope | Format |
|---|---|
| `model.layers.0` | BF16 |
| `model.layers.1` - `model.layers.2` | W8A16, group size 128 |
| `model.layers.3` - `model.layers.77` ordinary linear weights | W8A16, group size 128 |
| `model.layers.3` - `model.layers.77` MoE expert weights | W4A16, group size 128 |
| `model.layers.78` MTP block | W8A16, channelwise |
| `mlp.gate.*` | FP32 |
| Attention indexer, norms, embeddings, and special heads | BF16 |
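
The policy can be read as a lookup from parameter name to format. The sketch below is one possible encoding, not the tool's actual logic: the parameter-name patterns (`.mlp.experts.`, `indexer`, `norm`, `embed`, `lm_head`), the rule precedence, and the BF16 fallback for names outside `model.layers.*` are all assumptions made for illustration.

```python
import re

LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.")

# Substrings assumed to identify tensors the table keeps in BF16.
BF16_HINTS = ("indexer", "norm", "embed", "lm_head")


def quant_format(name):
    """Return the format the policy table assigns to a parameter name."""
    # Router gates (`mlp.gate.*`) stay in FP32 in every layer.
    if ".mlp.gate." in name:
        return "FP32"
    # Indexer, norms, embeddings and heads stay in BF16 (assumed precedence).
    if any(hint in name for hint in BF16_HINTS):
        return "BF16"
    match = LAYER_RE.match(name)
    if match is None:
        return "BF16"  # assumption: anything outside the layer stack
    layer = int(match.group(1))
    if layer == 0:
        return "BF16"
    if layer == 78:
        return "W8A16, channelwise"  # MTP block
    if 3 <= layer <= 77 and ".mlp.experts." in name:
        return "W4A16, group size 128"
    return "W8A16, group size 128"
```

For example, `quant_format("model.layers.10.mlp.experts.3.up_proj.weight")` falls in the W4A16 row, while the same layer's attention projections fall in the W8A16 row.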

Accuracy reference, using the SGLang GLM-5.2 FP8 H200 / default / low-latency / single-node AIME25 recipe:

| Model | Runtime | Quantization | Reasoning effort | AIME25 pass@1 |
|---|---|---|---|---|
| ZhipuAI/GLM-5.2-FP8 | SGLang | FP8 | `max` | `87.7%` |
| tclf90/GLM-5.2-Int4-Int8Mix | vLLM | Int4-Int8Mix, W4A16/W8A16 | `max` | `92.92%` |
| tclf90/GLM-5.2-Int4-Int8Mix | vLLM | Int4-Int8Mix, W4A16/W8A16 | `medium-high` | `86.46%` |

This is a lightweight reproduction reference rather than a full formal benchmark.

【Dependencies / Installation】

vllm==0.23.0
transformers==5.12.1

As of 2026-06-21, this model has been verified on an 8 x H200 machine with a Python 3.12 virtual environment, vLLM 0.23.0, and Transformers 5.12.1.

Create a fresh Python environment and install vLLM:

python3.12 -m venv venv
source venv/bin/activate
pip install vllm==0.23.0 transformers==5.12.1
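
After installation, a quick check that the environment matches the pinned versions can save a failed launch. This is a minimal sketch using only the standard library; the helper name `check_pins` is hypothetical, and stripping a local build suffix such as `+cu129` before comparing is an assumption about how the wheel may be tagged.

```python
from importlib import metadata

# Versions pinned in this card.
PINS = {"vllm": "0.23.0", "transformers": "5.12.1"}


def check_pins(pins=PINS, get_version=metadata.version):
    """Return {package: (expected, found)} for each mismatch or missing package."""
    problems = {}
    for package, expected in pins.items():
        try:
            found = get_version(package)
        except metadata.PackageNotFoundError:
            found = None
        # Ignore a local build suffix (text after "+") when comparing.
        if found is None or found.split("+", 1)[0] != expected:
            problems[package] = (expected, found)
    return problems


if __name__ == "__main__":
    for package, (expected, found) in check_pins().items():
        print(f"{package}: expected {expected}, found {found}")
```

An empty result means both pins are satisfied in the active virtual environment.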

vLLM Official Guide

【vLLM Startup Command】

Note: When launching with TP=8, include --enable-expert-parallel; otherwise the expert tensors may not be evenly sharded across GPU devices.

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

vllm serve \
    __YOUR_PATH__/GLM-5.2-Int4-Int8Mix \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name GLM-5.2 \
    --trust-remote-code \
    --dtype bfloat16 \
    --quantization compressed-tensors \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --max-model-len auto \
    --gpu-memory-utilization 0.90 \
    --max-num-seqs 32 \
    --enable-auto-tool-choice \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --speculative-config.method mtp \
    --speculative-config.num_speculative_tokens 1 \
    --disable-uvicorn-access-log

The first launch may spend several minutes on vLLM compilation and warmup.
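
Because startup can take minutes, client scripts may want to wait for the server before sending requests. The sketch below polls an HTTP endpoint until it answers with status 200. The `/health` path, the 30-minute timeout, the 10-second interval, and the helper names are assumptions; adjust them to your deployment.

```python
import time
import urllib.error
import urllib.request


def http_probe(url):
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url="http://localhost:8000/health", timeout_s=1800,
                     interval_s=10, probe=http_probe, sleep=time.sleep):
    """Poll until probe(url) succeeds; return the attempt count or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        if probe(url):
            return attempts
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{url} not ready after {timeout_s} s")
        sleep(interval_s)
```

The `probe` and `sleep` parameters are injectable so the loop can be exercised without a running server.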

【Logs】

2026-06-21
1. Initial vLLM-only INT4/INT8 compressed-tensors export.
2. MTP block uses W8A16 channelwise quantization.
3. Verified vLLM 0.23.0 TP=8 EP startup and /v1/chat/completions POST.

【Model Files】

| File Size | Last Updated |
|---|---|
| `378GiB` | `2026-06-21` |
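
For a rough sense of per-GPU memory on the verified 8 x H200 setup, the arithmetic below splits the checkpoint evenly across devices. This is a back-of-the-envelope sketch, not a measurement: the even split ignores replicated tensors and runtime overhead, the 141 GB nominal H200 capacity is an assumption supplied as a parameter, and `rough_per_gpu_budget` is a hypothetical helper.

```python
GIB = 1024 ** 3


def rough_per_gpu_budget(checkpoint_gib=378.0, num_gpus=8,
                         gpu_capacity_gb=141.0, utilization=0.90):
    """Return (weights_gib, headroom_gib) per GPU under an even-split assumption."""
    # Even split of the checkpoint: 378 / 8 = 47.25 GiB per GPU.
    weights_gib = checkpoint_gib / num_gpus
    # Convert nominal decimal GB to GiB, then apply --gpu-memory-utilization.
    usable_gib = gpu_capacity_gb * 1e9 / GIB * utilization
    # What remains is shared by KV cache, activations, and other buffers.
    return weights_gib, usable_gib - weights_gib


if __name__ == "__main__":
    weights, headroom = rough_per_gpu_budget()
    print(f"weights per GPU: {weights:.2f} GiB, headroom: {headroom:.1f} GiB")
```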

【Model Download】

from huggingface_hub import snapshot_download
snapshot_download('QuantTrio/GLM-5.2-Int4-Int8Mix', cache_dir="your_local_path")

【Overview】

GLM-5.2

👋 Join our WeChat or Discord community.
📖 Check out the GLM-5.2 blog and GLM-5 Technical report.
📍 Use GLM-5.2 API services on Z.ai API Platform.
🔜 Try GLM-5.2 here.

[Paper] [GitHub]

Introduction

We're introducing GLM-5.2, our latest flagship model for long-horizon tasks. It marks a substantial leap in long-horizon task capability over its predecessor GLM-5.1 and, for the first time, delivers that capability on a solid 1M-token context. GLM-5.2's new capabilities include:

Solid 1M Context: A solid 1M-token context that stably sustains long-horizon work
Advanced Coding with Flexible Effort: Stronger coding capabilities with multiple thinking effort levels to balance performance and latency
Improved Architecture: We propose IndexShare, which reuses the same indexer across every four sparse attention layers, reducing per-token FLOPs by 2.9× at a 1M context length. We also improve GLM-5.2’s MTP layer for speculative decoding, increasing the acceptance length by up to 20%
Pure Open: An MIT open-source license — no regional limits, technical access without borders
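
To make the IndexShare grouping concrete, the sketch below maps each sparse attention layer to the layer whose indexer it would reuse when groups of four share one indexer. The card does not say which layer in a group computes the index or how groups are aligned; assuming the first layer of each consecutive group owns it is purely illustrative, and the function names are hypothetical.

```python
def indexer_owner(sparse_layer_rank, group_size=4):
    """Rank of the sparse layer whose indexer this layer reuses (assumed: first in group)."""
    if sparse_layer_rank < 0:
        raise ValueError("sparse_layer_rank must be non-negative")
    return (sparse_layer_rank // group_size) * group_size


def indexer_runs(num_sparse_layers, group_size=4):
    """Number of indexer evaluations per token when each group shares one indexer."""
    return -(-num_sparse_layers // group_size)  # ceiling division


if __name__ == "__main__":
    # Ranks count sparse attention layers only, starting at 0.
    print([indexer_owner(rank) for rank in range(8)])
    print(indexer_runs(8))
```

Under this assumption the indexer runs once per group instead of once per layer; the 2.9× per-token FLOPs figure quoted above is the authors' measurement and is not derived from this sketch.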

Benchmark

| Benchmark | GLM-5.2 | GLM-5.1 | Qwen3.7-Max | MiniMax M3 | DeepSeek-V4-Pro | Claude Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|---|---|---|---|
| Reasoning | | | | | | | | |
| HLE | 40.5 | 31 | 41.4 | 37 | 37.7 | 49.8* | 41.4* | 45 |
| HLE (w/ Tools) | 54.7 | 52.3 | 53.5 | - | 48.2 | 57.9* | 52.2* | 51.4* |
| CritPt | 16.7 | 4.6 | 13.4 | 3.7 | 12.9 | 20.9 | 27.1 | 17.7 |
| AIME 2026 | 99.2 | 95.3 | 97 | - | 94.6 | 95.7 | 98.3 | 98.2 |
| HMMT Nov. 2025 | 94.4 | 94 | 95 | 84.4 | 94.4 | 96.5 | 96.5 | 94.8 |
| HMMT Feb. 2026 | 92.5 | 82.6 | 97.1 | 84.4 | 95.2 | 96.7 | 96.7 | 87.3 |
| IMOAnswerBench | 91.0 | 83.8 | 90 | - | 89.8 | 83.5 | - | 81 |
| GPQA-Diamond | 91.2 | 86.2 | 90 | 93 | 90.1 | 93.6 | 93.6 | 94.3 |
| Coding | | | | | | | | |
| SWE-bench Pro | 62.1 | 58.4 | 60.6 | 59 | 55.4 | 69.2 | 58.6 | 54.2 |
| NL2Repo | 48.9 | 42.7 | 47.2 | 42.1 | 35.5 | 69.7 | 50.7 | 33.4 |
| DeepSWE | 46.2 | 18 | 18 | 20 | 8 | 58 | 70 | 10 |
| ProgramBench | 63.7 | 50.9 | - | - | 47.8 | 71.9 | 70.8 | 39.5 |
| Terminal Bench 2.1 (Terminus-2) | 81.0 | 63.5 | 75 | 65 | 64 | 85 | 84 | 74 |
| Terminal Bench 2.1 (Best Reported Harness) | 82.7 | 69 | - | - | - | 78.9 | 83.4 | 70.7 |
| FrontierSWE (Dominance) | 74.4 | 30.5 | - | - | 29.0 | 75.1 | 72.6 | 39.6 |
| PostTrainBench | 34.3 | 20.1 | - | - | - | 37.2 | 28.4 | 21.6 |
| SWE-Marathon | 13.0 | 1.0 | - | - | - | 26.0 | 12.0 | 4.0 |
| Agentic | | | | | | | | |
| MCP-Atlas (Public Set) | 76.8 | 71.8 | 76.4 | 74.2 | 73.6 | 77.8 | 75.3 | 69.2 |
| Tool-Decathlon | 48.2 | 40.7 | - | - | 52.8 | 59.9 | 55.6 | 48.8 |

Serve GLM-5.2 Locally

The following open-source frameworks support local deployment of GLM-5.2:

SGLang (v0.5.13.post1+): see cookbook
vLLM (v0.23.0+): see recipes
xLLM (v0.10.0+): see example
Transformers (v5.12+): see transformers docs
KTransformers (v0.5.12+): see tutorial

Citation

If you find GLM-5.2 useful in your research, please cite our technical report:

@misc{glm5team2026glm5vibecodingagentic,
      title={GLM-5: from Vibe Coding to Agentic Engineering},
      author={GLM-5-Team and Aohan Zeng and Xin Lv and Zhenyu Hou and Zhengxiao Du and Qinkai Zheng and Bin Chen and Da Yin and Chendi Ge and Chenghua Huang and Chengxing Xie and Chenzheng Zhu and Congfeng Yin and Cunxiang Wang and Gengzheng Pan and Hao Zeng and Haoke Zhang and Haoran Wang and Huilong Chen and Jiajie Zhang and Jian Jiao and Jiaqi Guo and Jingsen Wang and Jingzhao Du and Jinzhu Wu and Kedong Wang and Lei Li and Lin Fan and Lucen Zhong and Mingdao Liu and Mingming Zhao and Pengfan Du and Qian Dong and Rui Lu and Shuang-Li and Shulin Cao and Song Liu and Ting Jiang and Xiaodong Chen and Xiaohan Zhang and Xuancheng Huang and Xuezhen Dong and Yabo Xu and Yao Wei and Yifan An and Yilin Niu and Yitong Zhu and Yuanhao Wen and Yukuo Cen and Yushi Bai and Zhongpei Qiao and Zihan Wang and Zikang Wang and Zilin Zhu and Ziqiang Liu and Zixuan Li and Bojie Wang and Bosi Wen and Can Huang and Changpeng Cai and Chao Yu and Chen Li and Chengwei Hu and Chenhui Zhang and Dan Zhang and Daoyan Lin and Dayong Yang and Di Wang and Ding Ai and Erle Zhu and Fangzhou Yi and Feiyu Chen and Guohong Wen and Hailong Sun and Haisha Zhao and Haiyi Hu and Hanchen Zhang and Hanrui Liu and Hanyu Zhang and Hao Peng and Hao Tai and Haobo Zhang and He Liu and Hongwei Wang and Hongxi Yan and Hongyu Ge and Huan Liu and Huanpeng Chu and Jia'ni Zhao and Jiachen Wang and Jiajing Zhao and Jiamin Ren and Jiapeng Wang and Jiaxin Zhang and Jiayi Gui and Jiayue Zhao and Jijie Li and Jing An and Jing Li and Jingwei Yuan and Jinhua Du and Jinxin Liu and Junkai Zhi and Junwen Duan and Kaiyue Zhou and Kangjian Wei and Ke Wang and Keyun Luo and Laiqiang Zhang and Leigang Sha and Liang Xu and Lindong Wu and Lintao Ding and Lu Chen and Minghao Li and Nianyi Lin and Pan Ta and Qiang Zou and Rongjun Song and Ruiqi Yang and Shangqing Tu and Shangtong Yang and Shaoxiang Wu and Shengyan Zhang and Shijie Li and Shuang Li and Shuyi Fan and Wei Qin and Wei Tian and Weining Zhang and Wenbo Yu and Wenjie Liang and Xiang Kuang and Xiangmeng Cheng and Xiangyang Li and Xiaoquan Yan and Xiaowei Hu and Xiaoying Ling and Xing Fan and Xingye Xia and Xinyuan Zhang and Xinze Zhang and Xirui Pan and Xu Zou and Xunkai Zhang and Yadi Liu and Yandong Wu and Yanfu Li and Yidong Wang and Yifan Zhu and Yijun Tan and Yilin Zhou and Yiming Pan and Ying Zhang and Yinpei Su and Yipeng Geng and Yong Yan and Yonglin Tan and Yuean Bi and Yuhan Shen and Yuhao Yang and Yujiang Li and Yunan Liu and Yunqing Wang and Yuntao Li and Yurong Wu and Yutao Zhang and Yuxi Duan and Yuxuan Zhang and Zezhen Liu and Zhengtao Jiang and Zhenhe Yan and Zheyu Zhang and Zhixiang Wei and Zhuo Chen and Zhuoer Feng and Zijun Yao and Ziwei Chai and Ziyuan Wang and Zuzhou Zhang and Bin Xu and Minlie Huang and Hongning Wang and Juanzi Li and Yuxiao Dong and Jie Tang},
      year={2026},
      eprint={2602.15763},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.15763},
}