Instructions for using Inferact/MiniMax-M3-EAGLE3 with libraries, inference servers, and local apps.

Libraries

How to use Inferact/MiniMax-M3-EAGLE3 with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Inferact/MiniMax-M3-EAGLE3")

# Load model directly
from transformers import AutoTokenizer, LlamaForCausalLMEagle3

tokenizer = AutoTokenizer.from_pretrained("Inferact/MiniMax-M3-EAGLE3")
model = LlamaForCausalLMEagle3.from_pretrained("Inferact/MiniMax-M3-EAGLE3")
```

Local Apps

vLLM

How to use Inferact/MiniMax-M3-EAGLE3 with vLLM:

Install from pip and serve model

```bash
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Inferact/MiniMax-M3-EAGLE3"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Inferact/MiniMax-M3-EAGLE3",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```

Use Docker

```bash
docker model run hf.co/Inferact/MiniMax-M3-EAGLE3
```

SGLang

How to use Inferact/MiniMax-M3-EAGLE3 with SGLang:

Install from pip and serve model

```bash
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Inferact/MiniMax-M3-EAGLE3" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Inferact/MiniMax-M3-EAGLE3",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```

Use Docker images

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Inferact/MiniMax-M3-EAGLE3" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Inferact/MiniMax-M3-EAGLE3",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```

Docker Model Runner
How to use Inferact/MiniMax-M3-EAGLE3 with Docker Model Runner:
```
docker model run hf.co/Inferact/MiniMax-M3-EAGLE3
```

Model Overview

Inferact/MiniMax-M3-EAGLE3 is an EAGLE3 draft model for accelerating inference of MiniMax-M3. It is served end-to-end with vLLM and was trained with TorchSpec, a torch-native online speculative-decoding training framework that runs FSDP training and vLLM-based target inference concurrently. The draft learns from MiniMax-M3-regenerated responses and live vLLM-generated hidden states to match the base model's exact token distribution.
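To make the role of a draft model concrete, here is a minimal sketch of one greedy draft-and-verify step. It illustrates generic speculative decoding only: it is not EAGLE3's hidden-state-based drafting and not vLLM's implementation, and `speculative_step`, `toy_target`, and `toy_draft` are hypothetical names that stand in for real models.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(
    context: List[Token],
    draft_next: NextToken,
    target_next: NextToken,
    k: int,
) -> List[Token]:
    """Run one greedy draft-and-verify step and return the tokens emitted."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft_next(context + proposed))

    # 2. The target checks each proposal. A real server scores all k
    #    positions in a single batched forward pass; this loop is only
    #    the sequential equivalent.
    emitted: List[Token] = []
    for tok in proposed:
        expected = target_next(context + emitted)
        if tok != expected:
            # First mismatch: keep the target's own token and stop.
            emitted.append(expected)
            return emitted
        emitted.append(tok)

    # 3. All k proposals accepted: the target contributes one bonus token.
    emitted.append(target_next(context + emitted))
    return emitted


# Toy stand-ins: the "target" counts upward modulo 10, and the "draft"
# agrees with it except right after token 3.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step([1], toy_draft, toy_target, k=3))  # third proposal rejected
print(speculative_step([5], toy_draft, toy_target, k=3))  # all proposals accepted
```

Each step emits between 1 and k + 1 tokens for a single target pass, which is why the number of accepted draft tokens per step determines the speedup.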

The draft is a 1-layer dense Llama (LlamaForCausalLMEagle3, ~3.3B params) operating on MiniMax-M3's hidden_size=6144 and vocab_size=200064. At serve time it shares the target's embedding and LM head (EAGLE3). See config.json for the full architecture.

Performance

All numbers are measured end-to-end against MiniMaxAI/MiniMax-M3-MXFP8 served with vLLM at tensor-parallel-size=4, num_speculative_tokens=3, and --enforce-eager, using greedy draft sampling (topk=1).

| Category | Dataset | n | Mean Accept Length | Draft Accept Rate | Per-pos Accept Rate (positions 1, 2, 3) |
|---|---|---|---|---|---|
| Dialogue | MT-Bench | 80 | 2.698 | 56.60% | 0.749, 0.547, 0.402 |
| Math | GSM8K | 200 | 3.518 | 83.93% | 0.923, 0.839, 0.756 |
| Code | HumanEval | 164 | 3.499 | 83.29% | 0.922, 0.832, 0.744 |
| Math | MATH500 | 500 | 3.517 | 83.90% | 0.929, 0.841, 0.747 |
| Math | AIME | 30 | 3.291 | 76.36% | 0.889, 0.763, 0.638 |
| Synthetic | speed-bench (16k, low-entropy) | 64 | 2.776 | 59.21% | 0.747, 0.576, 0.453 |
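The three metric columns appear to be related by simple arithmetic. The sketch below assumes that each per-position rate is the fraction of verification steps in which the draft token at that position was accepted, that the mean accept length counts one target-generated token per step, and that the draft accept rate is the average of the per-position rates. These definitions are inferred from the reported numbers, which they reproduce to within rounding; they are not taken from the evaluation code, and `acceptance_summary` is a hypothetical helper.

```python
from typing import Sequence, Tuple


def acceptance_summary(per_pos: Sequence[float]) -> Tuple[float, float]:
    """Return (mean_accept_length, draft_accept_rate) from per-position rates."""
    expected_accepted = sum(per_pos)  # expected accepted draft tokens per step
    mean_accept_length = 1.0 + expected_accepted  # plus one token from the target
    draft_accept_rate = expected_accepted / len(per_pos)
    return mean_accept_length, draft_accept_rate


# MT-Bench row of the table above.
length, rate = acceptance_summary([0.749, 0.547, 0.402])
print(f"mean accept length = {length:.3f}, draft accept rate = {rate:.2%}")
```

With num_speculative_tokens=3 the mean accept length is bounded above by 4, so the math and code rows at roughly 3.5 are close to the ceiling, while dialogue accepts fewer draft tokens per step.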

Training

Data: ~456,881 training conversations (the mix2 dataset: SWE-bench-Pro, SWE-bench, OpenCodeInstruct, kimi-mtp), with all responses regenerated by MiniMax-M3 to preserve the target's reasoning traces and MiniMax-M3 chat formatting.

Method: EAGLE3 TTT with ttt_length=7 and max_seq_length=32,768. AdamW at lr=1 × 10⁻⁴ (cosine decay to 0, 2% warmup, max_grad_norm=1.0), bf16 with gradient checkpointing, FlexAttention, 1 epoch (~14,277 steps). Trained on 5 × GB300 nodes: 2 nodes for FSDP2 draft training (dp=8, global batch 32) and 3 nodes for vLLM target inference (TP=4). EAGLE3 aux hidden states are taken from target layers (2, 30, 57) plus the final layer. The embedding, LM head, and final norm are shared from the target; because M3 is a VL model, these live under the language_model.* prefix.
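The step count and schedule can be sanity-checked with a few lines of arithmetic. This sketch assumes one conversation per sample with no sequence packing, and a linear warmup shape; the model card states only "cosine decay to 0, 2% warmup", so `lr_at` is a hypothetical illustration rather than TorchSpec's scheduler.

```python
import math

CONVERSATIONS = 456_881  # training conversations, from the Data line
GLOBAL_BATCH = 32        # global batch size, from the Method line
PEAK_LR = 1e-4
WARMUP_FRAC = 0.02

# One epoch, dropping the final partial batch (assumption).
total_steps = CONVERSATIONS // GLOBAL_BATCH
warmup_steps = round(WARMUP_FRAC * total_steps)


def lr_at(step: int) -> float:
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


print(total_steps, warmup_steps)
print(lr_at(0), lr_at(warmup_steps), lr_at(total_steps))
```

The integer division gives 14,277 steps, which matches the step count quoted above.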

Core training command: torchspec.train_entry spawns the FSDP2 trainer and vLLM inference engines as decoupled Ray actors, streaming hidden states through Mooncake.

```bash
python3 -m torchspec.train_entry \
  --config configs/vllm_minimax_m3_mix2.yaml \
  model.draft_model_config=configs/draft_models/minimax_m3_eagle3.json \
  training.training_num_nodes=2 \
  training.training_num_gpus_per_node=4 \
  inference.inference_num_gpus=12 \
  inference.inference_num_gpus_per_engine=4 \
  inference.vllm.tp_size=4
```

Draft architecture, TTT depth, sequence length, cluster layout, and optimizer are all YAML-configurable, so retargeting or scaling is a config change. See the TorchSpec repo for full customization instructions.
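When changing the cluster layout, it may help to check that the overrides agree with each other before launching. The sketch below mirrors the override names from the training command, but `ClusterLayout` and its checks are hypothetical and not part of TorchSpec. It also assumes that each inference engine spans exactly tp_size GPUs, which holds for the values shown (both are 4).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterLayout:
    training_num_nodes: int
    training_num_gpus_per_node: int
    inference_num_gpus: int
    inference_num_gpus_per_engine: int
    tp_size: int

    @property
    def training_gpus(self) -> int:
        """Total GPUs used by the FSDP2 trainer (the data-parallel size here)."""
        return self.training_num_nodes * self.training_num_gpus_per_node

    @property
    def inference_engines(self) -> int:
        """Number of vLLM engines the inference GPUs are split into."""
        if self.inference_num_gpus % self.inference_num_gpus_per_engine != 0:
            raise ValueError("inference GPUs do not divide evenly into engines")
        if self.inference_num_gpus_per_engine != self.tp_size:
            # Assumption: one engine == one tensor-parallel group.
            raise ValueError("GPUs per engine should equal tp_size")
        return self.inference_num_gpus // self.inference_num_gpus_per_engine


# Values from the training command.
layout = ClusterLayout(
    training_num_nodes=2,
    training_num_gpus_per_node=4,
    inference_num_gpus=12,
    inference_num_gpus_per_engine=4,
    tp_size=4,
)
print(layout.training_gpus, layout.inference_engines)
```

The result, 8 training GPUs and 3 inference engines, agrees with the dp=8 and 3-node TP=4 figures in the Method description.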

Quick Start

Requirements

vLLM nightly with MiniMax-M3 support
Docker image vllm/vllm-openai:minimax-m3

Launch Server (vLLM)

```bash
vllm serve MiniMaxAI/MiniMax-M3-MXFP8 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --block-size 128 \
  --speculative-config '{"method": "eagle3", "model": "Inferact/MiniMax-M3-EAGLE3", "num_speculative_tokens": 3, "attention_backend": "FLASH_ATTN"}'
```
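The inline JSON passed to --speculative-config is easy to break with shell quoting. One way to avoid that is to build the argument list programmatically, as in this sketch. It uses only the flags and JSON keys from the command above; `build_serve_command` is a hypothetical helper, and the code prints the command without launching anything.

```python
import json
import shlex
from typing import List


def build_serve_command(
    target: str,
    draft: str,
    num_speculative_tokens: int = 3,
    tensor_parallel_size: int = 4,
) -> List[str]:
    """Assemble the vLLM launch command as an argv list."""
    speculative_config = {
        "method": "eagle3",
        "model": draft,
        "num_speculative_tokens": num_speculative_tokens,
        "attention_backend": "FLASH_ATTN",
    }
    return [
        "vllm", "serve", target,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--gpu-memory-utilization", "0.90",
        "--block-size", "128",
        # json.dumps guarantees valid JSON regardless of shell quoting.
        "--speculative-config", json.dumps(speculative_config),
    ]


cmd = build_serve_command(
    target="MiniMaxAI/MiniMax-M3-MXFP8",
    draft="Inferact/MiniMax-M3-EAGLE3",
)
# shlex.join produces a shell-safe string for logging or copy-paste.
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd)` on a machine with vLLM installed, which bypasses the shell entirely.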

Downloads last month: 1,098

Safetensors
Model size: 3B params
Tensor type: BF16

Model tree for Inferact/MiniMax-M3-EAGLE3
Base model: MiniMaxAI/Minimax-M3-preview
Finetuned (1): this model