# MiniMax-M2.5-NVFP4
NVFP4-quantized version of MiniMaxAI/MiniMax-M2.5 for deployment on NVIDIA Blackwell GPUs.
## Model Details

| Property | Value |
|---|---|
| Base model | MiniMaxAI/MiniMax-M2.5 |
| Architecture | MiniMaxM2ForCausalLM (Mixture-of-Experts) |
| Parameters | 456B total |
| Layers | 62 (all MoE) |
| Experts | 256 per layer, 8 active per token |
| Hidden size | 3072 |
| Intermediate size | 1536 per expert |
| Attention | 48 heads, 8 KV heads (GQA) |
| Context length | 196,608 tokens |
| Vocabulary | 200,064 tokens |
## Quantization

| Property | Value |
|---|---|
| Method | NVFP4 (4-bit floating point) |
| Tool | NVIDIA ModelOpt 0.41.0 |
| Group size | 16 |
| Calibration | 512 samples (Korean, code, creative writing, English), max_seq_length=512 |
| Quantized layers | MoE expert weights only (`gate_up_proj`, `down_proj`) |
| BF16 layers | Attention (Q/K/V/O projections), embeddings, router gates, score correction biases, layer norms, lm_head |
| Source precision | FP8 (dequantized to BF16 for calibration) |
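The group-size-16 scheme above can be illustrated with a small NumPy sketch. This is a conceptual model only: actual NVFP4 stores E2M1 4-bit values with an FP8 (E4M3) scale per 16-element group plus a per-tensor scale, whereas this sketch uses the E2M1 representable magnitudes with plain float scales for clarity.

```python
import numpy as np

# E2M1 representable magnitudes (4-bit float: 1 sign, 2 exponent, 1 mantissa bit)
E2M1_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1_POS[::-1], E2M1_POS])  # all 16 signed values

def quantize_groups(w, group_size=16):
    """Quantize a flat weight vector with one scale per group of 16 elements."""
    g = w.reshape(-1, group_size)
    # Map each group's max magnitude onto the largest grid value (6.0)
    scales = np.abs(g).max(axis=1, keepdims=True) / 6.0
    scales = np.where(scales == 0, 1.0, scales)
    # Round each scaled element to the nearest representable FP4 value
    idx = np.abs((g / scales)[..., None] - GRID).argmin(axis=-1)
    return GRID[idx], scales

def dequantize(q, scales):
    return q * scales

rng = np.random.default_rng(0)
w = rng.standard_normal(8 * 16).astype(np.float32)
q, s = quantize_groups(w)
w_hat = dequantize(q, s).reshape(-1)
max_err = np.abs(w - w_hat).max()
print(max_err)
```

Because each group is rescaled independently, the worst-case rounding error scales with that group's max magnitude rather than the whole tensor's, which is why small group sizes like 16 preserve accuracy better than per-channel scaling.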
## Compression
| Format | Size |
|---|---|
| BF16 (theoretical) | ~456 GB |
| FP8 (source) | 287 GB |
| NVFP4 (this model) | 126 GB |
~3.6x compression vs. the BF16 equivalent.
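The headline ratio follows directly from the table; a quick back-of-envelope check:

```python
# Sizes from the compression table above (GB)
bf16_gb, fp8_gb, nvfp4_gb = 456, 287, 126

bf16_ratio = bf16_gb / nvfp4_gb
fp8_ratio = fp8_gb / nvfp4_gb
print(f"{bf16_ratio:.1f}x vs BF16, {fp8_ratio:.1f}x vs FP8")  # 3.6x vs BF16, 2.3x vs FP8
```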
## Running with vLLM
vLLM >= 0.15.1 supports this model natively with the modelopt quantization backend. Blackwell GPUs (SM100/SM120) are required for NVFP4 inference.
### Requirements

- **VRAM**: ~126 GB of model weights in total. Two GPUs with ≥64 GB VRAM each can run the model via tensor parallelism; heterogeneous setups can use pipeline parallelism with CPU offloading.
- **System RAM**: if using `cpu_offload_gb`, you need sufficient system RAM for pinned memory.
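A rough per-GPU weight budget under tensor parallelism, assuming vLLM shards the 126 GB of weights roughly evenly across TP ranks (activation and KV-cache memory come on top of this, so leave headroom):

```python
# NVFP4 checkpoint size from the compression table above (GB)
MODEL_WEIGHTS_GB = 126

def per_gpu_weights_gb(tensor_parallel_size: int) -> float:
    """Approximate weight memory per GPU when sharding evenly across TP ranks."""
    return MODEL_WEIGHTS_GB / tensor_parallel_size

print(per_gpu_weights_gb(2))  # 63.0 -> fits two >=64 GB GPUs with little headroom
```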
### Installation

```bash
pip install "vllm>=0.15.1"
```
### Environment Variables

```bash
export VLLM_USE_FLASHINFER_MOE_FP4=0  # Use the VLLM_CUTLASS MoE backend (avoids OOM from flashinfer's weight reordering)
export CUDA_DEVICE_ORDER=PCI_BUS_ID   # Consistent GPU ordering
```
### Two-GPU Tensor Parallelism (2x ≥64 GB VRAM)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/MiniMax-M2.5-NVFP4",
    quantization="modelopt",
    trust_remote_code=True,
    tensor_parallel_size=2,
    max_model_len=4096,
    max_num_seqs=64,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```
### Multi-GPU Pipeline Parallelism (Heterogeneous GPUs)

For setups with unequal VRAM (e.g., one large GPU plus smaller GPUs), use pipeline parallelism:

```python
import os

os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "0"
os.environ["VLLM_PP_LAYER_PARTITION"] = "40,11,11"  # Adjust to your GPU VRAM ratios

from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/MiniMax-M2.5-NVFP4",
    quantization="modelopt",
    trust_remote_code=True,
    pipeline_parallel_size=3,
    cpu_offload_gb=10,
    max_model_len=4096,
    max_num_seqs=64,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```
Tuning tips:

- `VLLM_PP_LAYER_PARTITION` controls how many of the 62 layers each GPU gets. Assign more layers to GPUs with more VRAM.
- Each MoE layer is ~2 GB in NVFP4. Distribute so that `(layer_weights - cpu_offload_gb)` fits on each GPU.
- `cpu_offload_gb` is per GPU. Ensure the total pinned memory fits in system RAM.
- `max_num_seqs` may need lowering on GPUs with ≤32 GB VRAM.
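The steps above can be sketched as a small (hypothetical) helper that splits the 62 layers across pipeline stages in proportion to each GPU's usable VRAM, producing the string `VLLM_PP_LAYER_PARTITION` expects:

```python
def partition_layers(total_layers: int, vram_gb: list[float]) -> str:
    """Split layers across pipeline stages roughly proportionally to VRAM."""
    total = sum(vram_gb)
    parts = [int(total_layers * v / total) for v in vram_gb]
    parts[0] += total_layers - sum(parts)  # give the rounding remainder to stage 0
    return ",".join(str(p) for p in parts)

# Example: one 80 GB GPU plus two smaller 22 GB GPUs
print(partition_layers(62, [80, 22, 22]))  # 40,11,11
```

This is only a starting point; stage 0 also holds the embeddings and the last stage holds `lm_head`, so you may need to shave a layer or two off those stages in practice.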
### OpenAI-Compatible API Server

```bash
VLLM_USE_FLASHINFER_MOE_FP4=0 python -m vllm.entrypoints.openai.api_server \
    --model mconcat/MiniMax-M2.5-NVFP4 \
    --quantization modelopt \
    --trust-remote-code \
    --tensor-parallel-size 2 \
    --max-model-len 4096 \
    --max-num-seqs 64 \
    --enforce-eager \
    --gpu-memory-utilization 0.95 \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --port 8000
```

For pipeline parallelism, replace `--tensor-parallel-size` with `--pipeline-parallel-size N --cpu-offload-gb X` and set `VLLM_PP_LAYER_PARTITION`.

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "mconcat/MiniMax-M2.5-NVFP4", "prompt": "Hello", "max_tokens": 64}'
```
## Important Notes

- **Blackwell required**: NVFP4 uses Blackwell's 5th-generation Tensor Cores. This model will NOT run on Hopper (H100/H200), Ada (RTX 4090), or older GPUs.
- **trust-remote-code**: required because MiniMax-M2.5 uses custom configuration code (`auto_map` in `config.json`). vLLM itself has native `MiniMaxM2ForCausalLM` support.
- **vLLM quantization flag**: use `--quantization modelopt`. vLLM auto-detects the NVFP4 algorithm and resolves it to `modelopt_fp4` internally.
- **MoE backend**: set `VLLM_USE_FLASHINFER_MOE_FP4=0` to use the VLLM_CUTLASS MoE backend. The default flashinfer backend can cause OOM from temporary allocations during weight reordering.
- **Tool calling**: vLLM has a built-in `minimax_m2` tool-call parser. Use `--enable-auto-tool-choice --tool-call-parser minimax_m2` for OpenAI-compatible function calling.
- **Reasoning**: use `--reasoning-parser minimax_m2_append_think` to extract `<think>` reasoning tokens.
## Quantization Recipe

Following NVIDIA's MLP-only quantization strategy (similar to the DeepSeek-R1 NVFP4 recipe):

- Only MoE expert weights (`gate_up_proj`, `down_proj`) are quantized to FP4
- All attention projections remain in BF16 to preserve quality
- Router gates (`mlp.gate`) and score correction biases remain in BF16
- Embeddings and `lm_head` remain in BF16
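The recipe above amounts to a name-based filter over the model's parameters. A minimal sketch, assuming parameter names follow the usual Hugging Face convention of `model.layers.<i>.mlp.experts.<j>.<proj>.weight` (the exact naming is an assumption, not taken from this checkpoint):

```python
import re

# Hypothetical filter mirroring the MLP-only recipe: quantize only the MoE
# expert projections; everything else (attention, router gate, embeddings,
# lm_head, norms) stays in BF16.
QUANTIZE_PATTERN = re.compile(r"experts\.\d+\.(gate_up_proj|down_proj)\.weight$")

def should_quantize(param_name: str) -> bool:
    return QUANTIZE_PATTERN.search(param_name) is not None

print(should_quantize("model.layers.0.mlp.experts.17.down_proj.weight"))   # True
print(should_quantize("model.layers.0.self_attn.q_proj.weight"))           # False
print(should_quantize("model.layers.0.mlp.gate.weight"))                   # False
```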
### Calibration Data
| Domain | Samples | Dataset |
|---|---|---|
| Korean | 128 | heegyu/open-korean-instructions |
| Code | 128 | m-a-p/CodeFeedback-Filtered-Instruction |
| Creative Writing | 128 | Gryphe/ChatGPT-4o-Writing-Prompts |
| General English | 128 | teknium/OpenHermes-2.5 |
## Files

| File | Description |
|---|---|
| `model-00001-of-00032.safetensors` ... `model-00032-of-00032.safetensors` | Quantized model weights (32 shards, ~4 GB each) |
| `model.safetensors.index.json` | Weight shard index |
| `config.json` | Model configuration with `quantization_config` |
| `hf_quant_config.json` | ModelOpt quantization metadata |
| `configuration_minimax_m2.py` | Custom model configuration class |
| `modeling_minimax_m2.py` | Custom model implementation |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |
| `chat_template.jinja` | Chat template |
## Hardware

Quantization was performed on 8x NVIDIA A100-SXM4-80GB with ~1.8 TiB of system RAM. Calibration does not require Blackwell hardware; only inference with native FP4 execution does.
## Limitations
- Requires NVIDIA Blackwell GPUs (SM100/SM120) for native NVFP4 inference
- Quality may differ from the original FP8/BF16 model, particularly on tasks sensitive to numerical precision
- Calibration was bilingual (Korean + English) with code; other languages may see slightly higher degradation
- This quantization targets the MLP/expert layers only; KV cache is not quantized
## License
Same license as the base model: Modified MIT.