# gemma-4-26B-A4B-it-FP8

An FP8-quantized version of google/gemma-4-26B-A4B-it (MoE, 26B total / 4B active parameters), produced by protoLabsAI.
## BFCL v4 Function Calling (AST-graded, 900 tests)
| Category | Gemma 4 MoE (26B) | Qwen 35B MoE | Gemma 4 E4B (8B) | Qwen 27B INT4 |
|---|---|---|---|---|
| Simple | 95.7% | 92.2% | 92.3% | 89.8% |
| Multiple | 94.0% | 94.0% | 92.0% | 89.5% |
| Parallel | 95.0% | 90.5% | 87.5% | 89.5% |
| Irrelevance | 92.9% | 87.1% | 78.3% | 87.9% |
| Average | 94.4% | 90.9% | 87.5% | 89.2% |
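"AST-graded" means the benchmark parses the model's emitted call and compares it structurally against the ground truth, so formatting differences don't count as errors. A minimal sketch of that idea using Python's `ast` module (not BFCL's actual grader, just an illustration):

```python
import ast

def parse_call(src: str) -> tuple[str, dict]:
    """Parse a single function-call string into (name, kwargs)."""
    node = ast.parse(src, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("not a function call")
    return ast.unparse(node.func), {
        kw.arg: ast.literal_eval(kw.value) for kw in node.keywords
    }

def ast_match(prediction: str, expected: str) -> bool:
    """Structural match: same function name and same keyword arguments."""
    try:
        return parse_call(prediction) == parse_call(expected)
    except (SyntaxError, ValueError):
        return False

# Argument order and quoting style don't matter; wrong values do.
print(ast_match('get_weather(city="Paris", unit="C")',
                "get_weather(unit='C', city='Paris')"))   # True
print(ast_match('get_weather(city="Berlin")',
                'get_weather(city="Paris")'))             # False
```

The "Irrelevance" row measures the complement: correctly emitting no call when none of the provided tools applies.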
## Speed (RTX PRO 6000 Blackwell, 96 GB VRAM)
| Config | Decode | TTFT | VRAM | Context |
|---|---|---|---|---|
| FP8 1×GPU, FP8 KV | 175 tok/s | 83ms | 25.7 GiB | 256K |
| FP8 TP=2, FP8 KV | 208 tok/s | 254ms | 13 GiB/GPU | 256K |
| BF16 1×GPU | 141 tok/s | 52ms | 48.5 GiB | 32K |
A single GPU serves the full 256K context at 175 tok/s with FP8 model weights and an FP8 KV cache.
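The long-context headroom comes from KV-cache arithmetic: FP8 halves the per-token cache cost relative to BF16. A rough estimate below uses hypothetical architecture parameters (the layer/head counts are illustrative, not this model's published config):

```python
def kv_cache_gib(seq_len: int, layers: int, kv_heads: int,
                 head_dim: int, bytes_per_elem: int) -> float:
    # Two cached tensors (K and V) per layer, per token.
    total = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
    return total / 2**30

# Illustrative values only -- NOT gemma-4-26B-A4B's actual config.
LAYERS, KV_HEADS, HEAD_DIM = 32, 4, 128
CTX = 262_144  # 256K tokens

bf16 = kv_cache_gib(CTX, LAYERS, KV_HEADS, HEAD_DIM, 2)  # BF16: 2 bytes/elem
fp8  = kv_cache_gib(CTX, LAYERS, KV_HEADS, HEAD_DIM, 1)  # FP8:  1 byte/elem
print(f"BF16 KV cache: {bf16:.1f} GiB, FP8: {fp8:.1f} GiB")
# -> BF16 KV cache: 16.0 GiB, FP8: 8.0 GiB
```

Whatever the real constants, the ratio is fixed: FP8 KV always halves the cache footprint, which is what lets FP8 weights plus a 256K cache fit under the ~26 GiB shown in the table.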
## Usage with vLLM
```shell
# Recommended: on-the-fly FP8 (256K context, single GPU)
vllm serve google/gemma-4-26B-A4B-it \
  --quantization fp8 --kv-cache-dtype fp8 \
  --max-model-len 262144 --gpu-memory-utilization 0.92 \
  --enable-auto-tool-choice --tool-call-parser gemma4
```
Requires vLLM from main (>= PR #38826).
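Once serving, the endpoint speaks the OpenAI-compatible chat API, so function calling uses the standard `tools` field of `/v1/chat/completions`. A sketch of the request body (the `get_weather` schema here is a made-up example, not a tool shipped with the model):

```python
import json

# Hypothetical tool schema -- substitute your own functions.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",
}

# POST this JSON to http://localhost:8000/v1/chat/completions
print(json.dumps(payload, indent=2))
```

With `--enable-auto-tool-choice` set as above, the server returns any calls the model makes in the response's `tool_calls` field rather than as raw text.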
## Produced By
protoLabsAI — AI inference lab, 2× RTX PRO 6000 Blackwell (192 GB VRAM).
## Model tree for protoLabsAI/gemma-4-26B-A4B-it-FP8

Base model: google/gemma-4-26B-A4B-it