Model Overview

Inferact/MiniMax-M3-EAGLE3 is an EAGLE3 draft model for accelerating inference of MiniMax-M3. It is served end-to-end with vLLM and was trained using TorchSpec โ€” a torch-native online speculative-decoding training framework that runs FSDP training and vLLM-based target inference concurrently, learning from MiniMax-M3-regenerated responses and live vLLM-generated hidden states to match the base model's exact token distribution.

The draft is a 1-layer dense Llama (LlamaForCausalLMEagle3, ~3.3 B params) operating on MiniMax-M3's hidden_size=6144 / vocab_size=200064; at serve time it shares the target's embedding and LM head (EAGLE3). See config.json for the full architecture.


Performance

All numbers are measured end-to-end against MiniMaxAI/MiniMax-M3-MXFP8 served with vLLM at tensor-parallel-size=4, num_speculative_tokens=3, and --enforce-eager. Greedy draft sampling (topk=1).

Category Dataset n Mean Accept Length Draft Accept Rate Per-pos Accept Rate
Dialogue MT-Bench 80 2.698 56.60% 0.749, 0.547, 0.402
Math GSM8K 200 3.518 83.93% 0.923, 0.839, 0.756
Code HumanEval 164 3.499 83.29% 0.922, 0.832, 0.744
Math MATH500 500 3.517 83.90% 0.929, 0.841, 0.747
Math AIME 30 3.291 76.36% 0.889, 0.763, 0.638
Synthetic speed-bench (16k, low-entropy) 64 2.776 59.21% 0.747, 0.576, 0.453

Training

Data: ~456,881 training conversations (the mix2 dataset: SWE-bench-Pro, SWE-bench, OpenCodeInstruct, kimi-mtp), with all responses regenerated by MiniMax-M3 โ€” preserving the target's reasoning traces and MiniMax-M3 chat formatting.

Method: EAGLE3 TTT, ttt_length=7, max_seq_length=32 768, AdamW at lr=1 ร— 10โปโด (cosine decay to 0, 2 % warmup, max_grad_norm=1.0), bf16 + gradient checkpointing, FlexAttention, 1 epoch (~14,277 steps). Trained on 5 ร— GB300 nodes (2 nodes FSDP2 draft training, dp=8, global batch 32 + 3 nodes vLLM TP=4 target inference). EAGLE3 aux hidden states from target layers (2, 30, 57) + the final layer. Embedding / LM head / final norm are shared from the target (M3 is a VL model, so these live under the language_model.* prefix).

Core training command โ€” torchspec.train_entry spawns the FSDP2 trainer and vLLM inference engines as decoupled Ray actors, streaming hidden states through Mooncake:

python3 -m torchspec.train_entry \
  --config configs/vllm_minimax_m3_mix2.yaml \
  model.draft_model_config=configs/draft_models/minimax_m3_eagle3.json \
  training.training_num_nodes=2 \
  training.training_num_gpus_per_node=4 \
  inference.inference_num_gpus=12 \
  inference.inference_num_gpus_per_engine=4 \
  inference.vllm.tp_size=4

Draft architecture, TTT depth, sequence length, cluster layout, and optimizer are all YAML-configurable โ€” retargeting or scaling is a config change. See the TorchSpec repo for full customization instructions.


Quick Start

Requirements

  • vLLM nightly with MiniMax-M3 support
  • Docker image vllm/vllm-openai:minimax-m3

Launch Server (vLLM)

vllm serve MiniMaxAI/MiniMax-M3-MXFP8 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --block-size 128 \
  --speculative-config '{"method": "eagle3", "model": "Inferact/MiniMax-M3-EAGLE3", "num_speculative_tokens": 3, "attention_backend": "FLASH_ATTN"}'
Downloads last month
1,098
Safetensors
Model size
3B params
Tensor type
BF16
ยท
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for Inferact/MiniMax-M3-EAGLE3

Finetuned
(1)
this model