# gemma-4-26B-A4B-it-FP8

An FP8-quantized version of google/gemma-4-26B-A4B-it (MoE, 26B total / 4B active parameters), produced by protoLabsAI.
## BFCL v4 Function Calling (AST-graded, 900 tests)
| Category | Gemma 4 MoE (26B) | Qwen 35B MoE | Gemma 4 E4B (8B) | Qwen 27B INT4 |
|---|---|---|---|---|
| Simple | 95.7% | 92.2% | 92.3% | 89.8% |
| Multiple | 94.0% | 94.0% | 92.0% | 89.5% |
| Parallel | 95.0% | 90.5% | 87.5% | 89.5% |
| Irrelevance | 92.9% | 87.1% | 78.3% | 87.9% |
| Average | 94.4% | 90.9% | 87.5% | 89.2% |
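"AST-graded" means the benchmark parses the model's emitted call and compares it structurally against the ground truth, so formatting differences don't count as errors. A minimal sketch of that idea using Python's `ast` module (not BFCL's actual grader, just an illustration):

```python
import ast

def parse_call(src: str) -> tuple[str, dict]:
    """Parse a single function-call string into (name, kwargs)."""
    node = ast.parse(src, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("not a function call")
    return ast.unparse(node.func), {
        kw.arg: ast.literal_eval(kw.value) for kw in node.keywords
    }

def ast_match(prediction: str, expected: str) -> bool:
    """Structural match: same function name and same keyword arguments."""
    try:
        return parse_call(prediction) == parse_call(expected)
    except (SyntaxError, ValueError):
        return False

# Argument order and quoting style don't matter; wrong values do.
print(ast_match('get_weather(city="Paris", unit="C")',
                "get_weather(unit='C', city='Paris')"))   # True
print(ast_match('get_weather(city="Berlin")',
                'get_weather(city="Paris")'))             # False
```

The "Irrelevance" row measures the complement: correctly emitting no call when none of the provided tools applies.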
## Speed (RTX PRO 6000 Blackwell, 96 GB VRAM)
| Config | Decode | TTFT | VRAM | Context |
|---|---|---|---|---|
| FP8 1×GPU, FP8 KV | 175 tok/s | 83ms | 25.7 GiB | 256K |
| FP8 TP=2, FP8 KV | 208 tok/s | 254ms | 13 GiB/GPU | 256K |
| BF16 1×GPU | 141 tok/s | 52ms | 48.5 GiB | 32K |
A single GPU serves the full 256K context at 175 tok/s with FP8 model weights and an FP8 KV cache.
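The long-context headroom comes from KV-cache arithmetic: FP8 halves the per-token cache cost relative to BF16. A rough estimate below uses hypothetical architecture parameters (the layer/head counts are illustrative, not this model's published config):

```python
def kv_cache_gib(seq_len: int, layers: int, kv_heads: int,
                 head_dim: int, bytes_per_elem: int) -> float:
    # Two cached tensors (K and V) per layer, per token.
    total = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
    return total / 2**30

# Illustrative values only -- NOT gemma-4-26B-A4B's actual config.
LAYERS, KV_HEADS, HEAD_DIM = 32, 4, 128
CTX = 262_144  # 256K tokens

bf16 = kv_cache_gib(CTX, LAYERS, KV_HEADS, HEAD_DIM, 2)  # BF16: 2 bytes/elem
fp8  = kv_cache_gib(CTX, LAYERS, KV_HEADS, HEAD_DIM, 1)  # FP8:  1 byte/elem
print(f"BF16 KV cache: {bf16:.1f} GiB, FP8: {fp8:.1f} GiB")
# -> BF16 KV cache: 16.0 GiB, FP8: 8.0 GiB
```

Whatever the real constants, the ratio is fixed: FP8 KV always halves the cache footprint, which is what lets FP8 weights plus a 256K cache fit under the ~26 GiB shown in the table.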
## Usage with vLLM
```shell
# Recommended: on-the-fly FP8 (256K context, single GPU)
vllm serve google/gemma-4-26B-A4B-it \
  --quantization fp8 --kv-cache-dtype fp8 \
  --max-model-len 262144 --gpu-memory-utilization 0.92 \
  --enable-auto-tool-choice --tool-call-parser gemma4
```
Requires vLLM from main (>= PR #38826).
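Once serving, the endpoint speaks the OpenAI-compatible chat API, so function calling uses the standard `tools` field of `/v1/chat/completions`. A sketch of the request body (the `get_weather` schema here is a made-up example, not a tool shipped with the model):

```python
import json

# Hypothetical tool schema -- substitute your own functions.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",
}

# POST this JSON to http://localhost:8000/v1/chat/completions
print(json.dumps(payload, indent=2))
```

With `--enable-auto-tool-choice` set as above, the server returns any calls the model makes in the response's `tool_calls` field rather than as raw text.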
## Produced By
protoLabsAI — AI inference lab, 2× RTX PRO 6000 Blackwell (192 GB VRAM).
## Model tree for protoLabsAI/gemma-4-26B-A4B-it-FP8

Base model: google/gemma-4-26B-A4B-it