MiniMax-M2.5-NVFP4

NVFP4-quantized version of MiniMaxAI/MiniMax-M2.5 for deployment on NVIDIA Blackwell GPUs.

Model Details

Base model: MiniMaxAI/MiniMax-M2.5
Architecture: MiniMaxM2ForCausalLM (Mixture-of-Experts)
Parameters: 456B total
Layers: 62 (all MoE)
Experts: 256 per layer, 8 active per token
Hidden size: 3072
Intermediate size: 1536 per expert
Attention: 48 heads, 8 KV heads (GQA)
Context length: 196,608 tokens
Vocabulary: 200,064 tokens

Quantization

Method: NVFP4 (4-bit floating point)
Tool: NVIDIA ModelOpt 0.41.0
Group size: 16
Calibration: 512 samples (Korean, Code, Creative Writing, English), max_seq_length=512
Quantized layers: MoE expert weights only (gate_up_proj, down_proj)
BF16 layers: Attention (Q/K/V/O projections), embeddings, router gates, score correction biases, layer norms, lm_head
Source precision: FP8 (dequantized to BF16 for calibration)
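
Conceptually, NVFP4 stores each weight as a 4-bit E2M1 value sharing one scale per group of 16 (the group size above), plus a per-tensor FP32 scale. A minimal NumPy simulation of the per-group step (illustrative only; the real format stores the group scale in FP8 E4M3, which this sketch keeps in float):

```python
import numpy as np

# Representable magnitudes in FP4 E2M1, the NVFP4 element format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0])

def quantize_group(x):
    """Quantize one group of 16 values to FP4 with a shared scale."""
    scale = np.abs(x).max() / 6.0   # map the largest magnitude onto FP4's max (6.0)
    if scale == 0.0:
        return np.zeros_like(x), 0.0
    mag = np.abs(x) / scale         # magnitudes now lie in [0, 6]
    idx = np.abs(mag[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    q = np.sign(x) * FP4_GRID[idx]  # nearest grid point, sign restored
    return q, scale

rng = np.random.default_rng(0)
x = rng.standard_normal(16).astype(np.float32)
q, scale = quantize_group(x)
err = np.abs(x - q * scale).max()
print(f"max reconstruction error: {err:.4f} (scale {scale:.4f})")
```

Since the widest gap between adjacent FP4 grid points is 1.0, the per-element reconstruction error is bounded by 0.5 × scale.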

Compression

Format              Size
BF16 (theoretical)  ~456 GB
FP8 (source)        287 GB
NVFP4 (this model)  126 GB

3.6x compression vs BF16 equivalent.

Running with vLLM

vLLM >= 0.15.1 supports this model natively with the modelopt quantization backend. Blackwell GPUs (SM100/SM120) are required for NVFP4 inference.

Requirements

  • VRAM: ~126 GB of model weights in total. Two GPUs with ≥64 GB VRAM each can run the model via tensor parallelism; heterogeneous setups can use pipeline parallelism with CPU offloading.
  • System RAM: If using cpu_offload_gb, you need sufficient system RAM for pinned memory.

Installation

pip install "vllm>=0.15.1"

Environment Variables

export VLLM_USE_FLASHINFER_MOE_FP4=0   # Use VLLM_CUTLASS MoE backend (avoids OOM from flashinfer's weight reordering)
export CUDA_DEVICE_ORDER=PCI_BUS_ID     # Consistent GPU ordering

Two-GPU Tensor Parallelism (2x ≥64 GB VRAM)

from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/MiniMax-M2.5-NVFP4",
    quantization="modelopt",
    trust_remote_code=True,
    tensor_parallel_size=2,
    max_model_len=4096,
    max_num_seqs=64,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)

Multi-GPU Pipeline Parallelism (Heterogeneous GPUs)

For setups with unequal VRAM (e.g., one large GPU + smaller GPUs), use pipeline parallelism:

import os
os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "0"
os.environ["VLLM_PP_LAYER_PARTITION"] = "40,11,11"  # Adjust per your GPU VRAM ratios

from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/MiniMax-M2.5-NVFP4",
    quantization="modelopt",
    trust_remote_code=True,
    pipeline_parallel_size=3,
    cpu_offload_gb=10,
    max_model_len=4096,
    max_num_seqs=64,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)

Tuning tips:

  • VLLM_PP_LAYER_PARTITION controls how many of the 62 layers each GPU gets. Assign more layers to GPUs with more VRAM.
  • Each MoE layer is ~2 GB in NVFP4. Distribute layers so that each GPU's share of the weights, minus its cpu_offload_gb, fits in that GPU's VRAM.
  • cpu_offload_gb is per GPU. Ensure total pinned memory fits in system RAM.
  • max_num_seqs may need lowering for GPUs with ≤32 GB VRAM.
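
The per-layer estimate above can be reproduced from the model dimensions on this card; the sketch below also derives a proportional layer split (the 96/24/24 GB VRAM mix is an illustrative assumption):

```python
# Back-of-envelope sizing for the pipeline-parallel partition, using the
# model dimensions from this card. Estimates, not measurements.
hidden, inter, experts, layers = 3072, 1536, 256, 62
group = 16

# Each expert: gate_up_proj (hidden -> 2*inter) + down_proj (inter -> hidden).
params_per_expert = hidden * 2 * inter + inter * hidden
params_per_layer = experts * params_per_expert

# NVFP4: 0.5 byte per weight + one FP8 scale byte per group of 16 weights.
bytes_per_layer = params_per_layer * 0.5 + params_per_layer / group
print(f"~{bytes_per_layer / 2**30:.1f} GiB of expert weights per layer")

# Split the 62 layers in proportion to available VRAM, e.g. 96/24/24 GB GPUs.
vram = [96, 24, 24]
partition = [round(layers * v / sum(vram)) for v in vram]
partition[-1] += layers - sum(partition)  # absorb rounding drift
print("VLLM_PP_LAYER_PARTITION=" + ",".join(map(str, partition)))
```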

OpenAI-Compatible API Server

VLLM_USE_FLASHINFER_MOE_FP4=0 python -m vllm.entrypoints.openai.api_server \
    --model mconcat/MiniMax-M2.5-NVFP4 \
    --quantization modelopt \
    --trust-remote-code \
    --tensor-parallel-size 2 \
    --max-model-len 4096 \
    --max-num-seqs 64 \
    --enforce-eager \
    --gpu-memory-utilization 0.95 \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --port 8000

For pipeline parallelism, replace --tensor-parallel-size with --pipeline-parallel-size N --cpu-offload-gb X and set VLLM_PP_LAYER_PARTITION.

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mconcat/MiniMax-M2.5-NVFP4", "prompt": "Hello", "max_tokens": 64}'
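
With --enable-auto-tool-choice and --tool-call-parser minimax_m2, the server accepts the standard OpenAI tools schema on /v1/chat/completions. A sketch of the request body (the get_weather tool is a hypothetical example, not part of this model):

```python
import json

# Build an OpenAI-style function-calling request for the server above.
payload = {
    "model": "mconcat/MiniMax-M2.5-NVFP4",
    "messages": [{"role": "user", "content": "What's the weather in Seoul?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",
    "max_tokens": 256,
}
body = json.dumps(payload)  # POST this to http://localhost:8000/v1/chat/completions
print(body[:60])
```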

Important Notes

  • Blackwell required: NVFP4 uses Blackwell's 5th-generation Tensor Cores. This model will NOT run on Hopper (H100/H200), Ada (RTX 4090), or older GPUs.
  • trust-remote-code: Required because MiniMax-M2.5 uses custom configuration code (auto_map in config.json). vLLM itself has native MiniMaxM2ForCausalLM support.
  • vLLM quantization flag: Use --quantization modelopt. vLLM auto-detects the NVFP4 algorithm and resolves to modelopt_fp4 internally.
  • MoE backend: Set VLLM_USE_FLASHINFER_MOE_FP4=0 to use the VLLM_CUTLASS MoE backend. The default flashinfer backend can cause OOM from temporary allocations during weight reordering.
  • Tool calling: vLLM has a built-in minimax_m2 tool call parser. Use --enable-auto-tool-choice --tool-call-parser minimax_m2 for OpenAI-compatible function calling.
  • Reasoning: Use --reasoning-parser minimax_m2_append_think to extract <think> reasoning tokens.

Quantization Recipe

Following NVIDIA's MLP-only quantization strategy (similar to the DeepSeek-R1 NVFP4 recipe):

  • Only MoE expert weights (gate_up_proj, down_proj) are quantized to FP4
  • All attention projections remain in BF16 to preserve quality
  • Router gates (mlp.gate) and score correction biases remain in BF16
  • Embeddings and lm_head remain in BF16
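
The recipe amounts to a name-based filter over the checkpoint's tensors. A toy classifier illustrating the selection logic (the parameter paths below are assumptions for illustration, not taken from the actual checkpoint):

```python
# Decide which tensors get FP4 vs. stay in BF16, per the recipe above.
def target_dtype(name: str) -> str:
    expert_weights = ("gate_up_proj", "down_proj")
    if ".experts." in name and any(k in name for k in expert_weights):
        return "nvfp4"
    # Attention, router gate, norms, embeddings, lm_head, biases stay BF16.
    return "bf16"

examples = [
    "model.layers.0.mlp.experts.17.gate_up_proj.weight",  # hypothetical path
    "model.layers.0.mlp.gate.weight",                     # router stays BF16
    "model.layers.0.self_attn.q_proj.weight",
    "lm_head.weight",
]
for n in examples:
    print(n, "->", target_dtype(n))
```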

Calibration Data

Domain            Samples  Dataset
Korean            128      heegyu/open-korean-instructions
Code              128      m-a-p/CodeFeedback-Filtered-Instruction
Creative Writing  128      Gryphe/ChatGPT-4o-Writing-Prompts
General English   128      teknium/OpenHermes-2.5

Files

model-00001-of-00032.safetensors ... model-00032-of-00032.safetensors: Quantized model weights (32 shards, ~4 GB each)
model.safetensors.index.json: Weight shard index
config.json: Model configuration with quantization_config
hf_quant_config.json: ModelOpt quantization metadata
configuration_minimax_m2.py: Custom model configuration class
modeling_minimax_m2.py: Custom model implementation
tokenizer.json: Tokenizer
tokenizer_config.json: Tokenizer configuration
chat_template.jinja: Chat template

Hardware

Quantization was performed on 8x NVIDIA A100-SXM4-80GB with ~1.8 TiB system RAM. Quantization (calibration on A100) does not require Blackwell hardware; only inference with native FP4 execution does.

Limitations

  • Requires NVIDIA Blackwell GPUs (SM100/SM120) for native NVFP4 inference
  • Quality may differ from the original FP8/BF16 model, particularly on tasks sensitive to numerical precision
  • Calibration was bilingual (Korean + English) with code; other languages may see slightly higher degradation
  • This quantization targets the MLP/expert layers only; KV cache is not quantized

License

Same license as the base model: Modified MIT.
