DeepSeek-V3.2 MXFP4 GGUF

This is an MXFP4 quantized version of DeepSeek-V3.2, optimized for Apple Silicon and other hardware with MXFP4 support.

Model Details

  • Base Model: DeepSeek-V3.2-Exp-BF16
  • Parameters: 672B (architecture: deepseek3_2)
  • Quantization: MXFP4_MOE (4-bit mixed precision for MoE layers, Q8_0 for standard layers)
  • Original Size: 1.2 TB (BF16)
  • Quantized Size: 341 GB (72% reduction)
  • Quantization Tool: llama.cpp (December 2024 build)
  • Quantization Time: ~10 minutes on Mac Studio M3 Ultra

Performance

  • Hardware: Optimized for Apple Silicon (Metal), NVIDIA GPUs (CUDA), and other accelerators
  • Expected Speedup: 2-3x faster inference than Q4_K_M on hardware with native MXFP4 support (see the benchmark sketch below)
  • Context Length: 128K tokens
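
To check the speedup claim on your own machine, llama.cpp ships a llama-bench tool; the invocation below is a minimal sketch (prompt and generation lengths are illustrative):

# Measure prompt-processing and token-generation throughput for the merged file
./llama-bench -m DeepSeek-V3.2-MXFP4.gguf -p 512 -n 128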

File Structure

This model is split into 18 chunks to stay under Hugging Face's per-file size limit (a one-command download example follows the chunk list):

DeepSeek-V3.2-MXFP4-chunk-001-of-018.gguf
DeepSeek-V3.2-MXFP4-chunk-002-of-018.gguf
...
DeepSeek-V3.2-MXFP4-chunk-018-of-018.gguf
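
If you use the Hugging Face CLI, all chunks can be fetched in one command; the repository id below is a placeholder and must be replaced with this repo's actual path:

# Placeholder repo id - substitute the actual namespace/name of this repository
huggingface-cli download your-namespace/DeepSeek-V3.2-MXFP4-GGUF --include "DeepSeek-V3.2-MXFP4-chunk-*.gguf" --local-dir .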

Reassembly Instructions

macOS/Linux

cat DeepSeek-V3.2-MXFP4-chunk-*.gguf > DeepSeek-V3.2-MXFP4.gguf

Windows (PowerShell)

# Concatenate the chunks in name order; cmd's copy /b treats the data as binary
cmd /c copy /b DeepSeek-V3.2-MXFP4-chunk-*.gguf DeepSeek-V3.2-MXFP4.gguf

Python

import glob
import shutil

output_file = "DeepSeek-V3.2-MXFP4.gguf"
chunk_files = sorted(glob.glob("DeepSeek-V3.2-MXFP4-chunk-*.gguf"))

if not chunk_files:
    raise SystemExit("No chunk files found in the current directory")

with open(output_file, "wb") as outfile:
    for chunk_file in chunk_files:
        print(f"Merging {chunk_file}...")
        with open(chunk_file, "rb") as infile:
            # Stream block-by-block instead of loading an entire ~19 GB chunk into RAM
            shutil.copyfileobj(infile, outfile)

print(f"Reassembly complete: {output_file}")

Usage with Ollama

After reassembling the model:

# Create Modelfile
cat > Modelfile.deepseek << 'EOF'
FROM ./DeepSeek-V3.2-MXFP4.gguf

TEMPLATE """{{- if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{- range $i, $_ := .Messages }}<|im_start|>{{ .Role }}
{{ .Content }}<|im_end|>
{{ end }}<|im_start|>assistant
"""

PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>
PARAMETER num_ctx 128000
PARAMETER temperature 0.7
EOF

# Build model
ollama create deepseek-v3.2:mxfp4 -f Modelfile.deepseek

# Test
ollama run deepseek-v3.2:mxfp4 "Explain quantum computing"
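
The model can also be run with llama.cpp directly instead of Ollama; the commands below are a sketch, and values such as the context size and -ngl (number of layers to offload) should be tuned to your hardware:

# Interactive chat
./llama-cli -m DeepSeek-V3.2-MXFP4.gguf -c 8192 -ngl 99 -cnv

# Or expose an OpenAI-compatible HTTP API
./llama-server -m DeepSeek-V3.2-MXFP4.gguf -c 8192 -ngl 99 --port 8080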

Quantization Details

  • MoE Expert Layers: MXFP4_MOE (4-bit mixed precision)
  • Standard Layers: Q8_0 (8-bit)
  • Final Model Size: 349,286 MiB (≈341 GiB)
  • Quantization Script: built with a Metal-enabled llama.cpp on Mac Studio (a reproduction sketch follows below)
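
For reference, the conversion can in principle be reproduced with llama.cpp's own tools; this is a sketch that assumes a build whose llama-quantize lists an MXFP4_MOE target (check llama-quantize --help) and uses illustrative paths:

# 1) Convert the original BF16 checkpoint to a BF16 GGUF
python convert_hf_to_gguf.py /path/to/DeepSeek-V3.2-Exp --outtype bf16 --outfile DeepSeek-V3.2-BF16.gguf

# 2) Quantize: MoE expert layers to MXFP4, remaining layers to Q8_0
./llama-quantize DeepSeek-V3.2-BF16.gguf DeepSeek-V3.2-MXFP4.gguf MXFP4_MOE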

System Requirements

  • Disk Space: 341 GB for the merged model (roughly double that during reassembly, while the chunks and the merged file coexist)
  • RAM: Minimum 64 GB (128+ GB recommended for full context; see the partial-offload sketch after this list)
  • Recommended Hardware:
    • Apple Silicon (M1/M2/M3 series with Metal)
    • NVIDIA GPU with CUDA support
    • Other accelerators with MXFP4 support
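
If the model does not fit entirely in GPU or unified memory, llama.cpp can offload only part of the layers and leave the rest memory-mapped on disk; a minimal sketch with an illustrative layer count:

# Offload a subset of layers; raise or lower -ngl until it fits in memory
./llama-cli -m DeepSeek-V3.2-MXFP4.gguf -c 4096 -ngl 24 -cnv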

Credits

  • DeepSeek for the original DeepSeek-V3.2-Exp model
  • The llama.cpp project for the GGUF format and quantization tooling

License

Apache 2.0 (inherits from original DeepSeek-V3.2 license)
