Can MiniMax-M3 run on 2x NVIDIA DGX Spark 4TB?

#16
by abosk - opened

Hi,

I’m evaluating whether MiniMax-M3 can run locally on 2x NVIDIA DGX Spark units, each with 128 GB unified memory and 4 TB NVMe storage.

From what I understand, vLLM seems to be the only inference backend that could realistically work well on DGX Spark. Could you confirm whether MiniMax-M3 can run on this setup using vLLM, and whether this hardware would be enough for practical inference?

Thanks in advance.

At 4-bit quantization it should be able to. I'm running this experiment right now, using llama.cpp with the Unsloth MiniMax-M3 model.

Not possible.

2x NVIDIA DGX Spark with 128GB of Unified Memory = 256 GB of Unified Memory

MiniMax-M3 Q4_K_M = 257 GB

MiniMax-M3 Q4_K_S = 241 GB

There is not enough headroom.

One of the Q3 quants might be possible, though: Q3_K_M (203 GB) or Q3_K_S (184 GB). Q3_K_L (220 GB) might also work if you keep the context length low.
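
To make the arithmetic above easier to check, here is a rough sketch of the fit calculation. The quant sizes are the ones quoted in this thread (not independently verified). `RESERVED_GB` is a hypothetical placeholder for OS, KV cache, and runtime buffers; the real figure depends on context length and backend. It also treats the two units as one pooled 256 GB, which is a simplification.

```python
# Quant sizes in GB as quoted in this thread; not independently verified.
QUANT_SIZES_GB = {
    "Q4_K_M": 257,
    "Q4_K_S": 241,
    "Q3_K_L": 220,
    "Q3_K_M": 203,
    "Q3_K_S": 184,
}

# Two DGX Spark units with 128 GB of unified memory each.
# Simplification: treated as one pooled budget.
TOTAL_MEMORY_GB = 2 * 128

# ASSUMPTION: memory held back for OS, KV cache, and runtime buffers.
# Hypothetical value; tune it for your context length and backend.
RESERVED_GB = 24


def headroom_gb(model_gb, total_gb=TOTAL_MEMORY_GB, reserved_gb=RESERVED_GB):
    """Memory left over (GB) after loading the weights; negative means no fit."""
    return total_gb - reserved_gb - model_gb


def fits(model_gb, total_gb=TOTAL_MEMORY_GB, reserved_gb=RESERVED_GB):
    """True if the weights fit within the budget after the reserve."""
    return headroom_gb(model_gb, total_gb, reserved_gb) >= 0


if __name__ == "__main__":
    for name, size in QUANT_SIZES_GB.items():
        verdict = "fits" if fits(size) else "does not fit"
        print(f"{name}: {size} GB, headroom {headroom_gb(size):+d} GB, {verdict}")
```

With that reserve, both Q4 quants come out negative and the three Q3 quants come out positive, which matches the estimate above. Lowering `RESERVED_GB` (for example, by using a shorter context) changes where the cutoff falls.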

We really need better, more affordable hardware. Seriously.

Yes but how?

Right now people are paying up to 5K for an RTX 5090 with 32 GB of VRAM, and the RTX PRO 6000 now has an MSRP of 13,500 (with no NVLink support, by the way). Even 2x RTX PRO 6000, at a total of 27,000 at MSRP, still isn't enough for this model at Q4.
