# gemma-4-26B-A4B-it-FP8

An FP8-quantized version of google/gemma-4-26B-A4B-it (MoE, 26B total / 4B active parameters), produced by protoLabsAI.

## BFCL v4 Function Calling (AST-graded, 900 tests)

| Category | Gemma 4 MoE (26B) | Qwen 35B MoE | Gemma 4 E4B (8B) | Qwen 27B INT4 |
|---|---|---|---|---|
| Simple | 95.7% | 92.2% | 92.3% | 89.8% |
| Multiple | 94.0% | 94.0% | 92.0% | 89.5% |
| Parallel | 95.0% | 90.5% | 87.5% | 89.5% |
| Irrelevance | 92.9% | 87.1% | 78.3% | 87.9% |
| **Average** | **94.4%** | **90.9%** | **87.5%** | **89.2%** |
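To exercise the same capability against a local deployment, the sketch below sends one tool-calling request through vLLM's OpenAI-compatible API. It assumes the server from the Usage with vLLM section is already running on localhost:8000; the `get_weather` schema is a hypothetical example, not a BFCL test case.

```python
from openai import OpenAI

# Points at the local vLLM server started in "Usage with vLLM" (assumption).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical tool schema for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "What's the weather in Zurich?"}],
    tools=tools,
)

# With --enable-auto-tool-choice the parsed call lands in tool_calls,
# e.g. get_weather(city="Zurich").
print(resp.choices[0].message.tool_calls)
```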

## Speed (RTX PRO 6000 Blackwell, 96 GB VRAM)

| Config | Decode | TTFT | VRAM | Context |
|---|---|---|---|---|
| FP8, 1×GPU, FP8 KV | 175 tok/s | 83 ms | 25.7 GiB | 256K |
| FP8, TP=2, FP8 KV | 208 tok/s | 254 ms | 13 GiB/GPU | 256K |
| BF16, 1×GPU | 141 tok/s | 52 ms | 48.5 GiB | 32K |

A single GPU serves the full 256K context at 175 tok/s with FP8 model weights and an FP8 KV cache.
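The decode and TTFT figures above can be sanity-checked from the client side with a streamed request. A minimal sketch, assuming the server from the Usage with vLLM section is up on localhost:8000; it counts stream chunks as a proxy for tokens, so treat the result as approximate.

```python
import time

from openai import OpenAI

# Local server endpoint is an assumption; adjust to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
n_chunks = 0

stream = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "Summarize FP8 quantization in ~300 words."}],
    max_tokens=512,
    stream=True,
)

for chunk in stream:
    # Delta chunks carry incremental text; the first one marks TTFT.
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_chunks += 1

end = time.perf_counter()
if first_token_at is not None:
    print(f"TTFT:   {(first_token_at - start) * 1000:.0f} ms")
    print(f"Decode: {n_chunks / (end - first_token_at):.0f} tok/s (chunk-based approximation)")
```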

## Usage with vLLM

```bash
# Recommended: on-the-fly FP8 (256K context, single GPU)
vllm serve google/gemma-4-26B-A4B-it \
  --quantization fp8 --kv-cache-dtype fp8 \
  --max-model-len 262144 --gpu-memory-utilization 0.92 \
  --enable-auto-tool-choice --tool-call-parser gemma4
```

Requires vLLM from main (>= PR #38826).
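For batch or scripted use without a server, the same on-the-fly FP8 path is available through vLLM's Python API. A minimal sketch; the kwargs mirror the CLI flags above, and `max_model_len` is reduced here only to keep the example light.

```python
from vllm import LLM, SamplingParams

# Same quantization settings as the serve command; shorter context for a
# quick local test (assumption, not a recommended production setting).
llm = LLM(
    model="google/gemma-4-26B-A4B-it",
    quantization="fp8",
    kv_cache_dtype="fp8",
    max_model_len=32768,
    gpu_memory_utilization=0.92,
)

params = SamplingParams(temperature=0.7, max_tokens=256)

# chat() applies the model's chat template, appropriate for an -it model.
outputs = llm.chat(
    [{"role": "user", "content": "Explain FP8 KV-cache quantization in two sentences."}],
    params,
)
print(outputs[0].outputs[0].text)
```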

## Produced By

protoLabsAI, an AI inference lab running 2× RTX PRO 6000 Blackwell GPUs (192 GB VRAM total).
