DeepSeek-V3.2-Retro

This repository hosts the model weights for DeepSeek-V3.2-Retro. For usage instructions and further details, please refer to the GitHub repository.

1. Introduction

DeepSeek-V3.2 introduces the DeepSeek Sparse Attention (DSA) architecture, representing a significant architectural evolution over DeepSeek-V3 and DeepSeek-V3.1. However, as of now, an official open-source implementation compatible with Ampere-series GPUs has not been released.

To address this gap, we introduce DeepSeek-V3.2-Retro, targeting the following user groups:

  • Ampere GPU users who do not have access to Hopper or Blackwell architectures.
  • Users of general-purpose GPU platforms where DSA is not yet supported.

Key features of DeepSeek-V3.2-Retro include:

  • DSA Removal: the DSA modules are removed from the original V3.2 architecture.
  • BF16 Conversion: model parameters and computation are converted to the BF16 data format.
  • Broad Compatibility: runs on any hardware platform that supports the V3 architecture.
  • Validated Performance: achieves results on multiple benchmarks that are close to the officially reported numbers.

2. Performance Evaluation

As our primary target scenario is reasoning-oriented usage, we report accuracy on several representative benchmarks with thinking mode enabled. For consistency, the baseline metrics are taken from the corresponding official technical reports.

| Benchmark | DeepSeek-V3.2-Retro | DeepSeek-V3.2-Thinking |
| --- | --- | --- |
| MMLU-Pro | 86.4 | 85.0 |
| GPQA Diamond | 82.12 | 82.4 |
| AIME 2025 | 93.67 | 93.1 |
| LiveCodeBench | 80.72 | 83.3 |

In addition, we evaluate inference efficiency. Using SGLang v0.5.6 under identical settings, we observe that the throughput of DeepSeek-V3.2-Retro is on par with DeepSeek-V3.1. Output throughput is reported in tokens/s.

| Model | Output Throughput (tokens/s; qps=512, input=1k, output=10k) |
| --- | --- |
| DeepSeek-V3.2-Retro | 2510.27 |
| DeepSeek-V3.1 | 2515.34 |

These results indicate that removing the DSA structure and reverting to a V3-compatible architecture does not introduce noticeable performance regression in either reasoning accuracy or inference throughput on Ampere-class hardware.

3. Model Download

The DeepSeek-V3.2-Retro model weights are available for download from Hugging Face and ModelScope. Please ensure that you have at least 1.5 TB of free disk space before downloading.

| Model | Total Params | Hugging Face | ModelScope |
| --- | --- | --- | --- |
| DeepSeek-V3.2-Retro | 684B | 🤗 Hugging Face | 🤖 ModelScope |
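
As a minimal sketch, the weights can be fetched with the standard `huggingface_hub` API. The repo id and local directory below are placeholders (substitute the actual Hugging Face repo), and the free-space check mirrors the 1.5 TB requirement noted above.

```python
import os
import shutil

REQUIRED_FREE_BYTES = int(1.5 * 1024**4)  # ~1.5 TB, per the note above

def has_enough_space(path: str, required: int = REQUIRED_FREE_BYTES) -> bool:
    """True if the filesystem holding `path` has at least `required` free bytes."""
    return shutil.disk_usage(path).free >= required

def download_model(repo_id: str, local_dir: str) -> str:
    """Fetch all weight shards; snapshot_download resumes interrupted transfers."""
    parent = os.path.dirname(os.path.abspath(local_dir))
    if not has_enough_space(parent):
        raise RuntimeError("Need at least 1.5 TB of free disk space.")
    from huggingface_hub import snapshot_download  # pip install huggingface_hub
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)

# Example (placeholder repo id):
# download_model("<org>/DeepSeek-V3.2-Retro", "/data/DeepSeek-V3.2-Retro")
```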

4. Quickstart

We strongly recommend using SGLang for efficient inference of DeepSeek-series models. The example below serves the model across four nodes with 8× A100 GPUs each (32 GPUs in total, matching --tp 32).

SGLang

Using Docker (Recommended)

# Pull the latest image on all four nodes and ensure RDMA network connectivity between them.
# https://hub.docker.com/r/lmsysorg/sglang/tags
docker pull lmsysorg/sglang:latest
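
After pulling the image, a container still has to be started on each node before the server can be launched. A minimal sketch of a docker run invocation, assuming host networking and a host directory /data that holds the downloaded weights (the mount path is a placeholder to adapt):

```shell
# Run on each node; host networking lets the nodes reach 10.0.0.1:5000 directly.
docker run --gpus all \
    --network host \
    --ipc=host \
    --shm-size 32g \
    -v /data:/data \
    -it lmsysorg/sglang:latest bash
# Then run the per-node launch command below inside the container.
```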

Launch Command

# For high-QPS scenarios, --enable-dp-attention and --ep-size boost throughput, and MTP speculative
# decoding (the --speculative-* flags) boosts decoding speed; all are already included below.
# node 1
python3 -m sglang.launch_server \
  --model-path /path/to/DeepSeek-V3.2-Retro \
  --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 0 \
  --trust-remote-code --host 0.0.0.0 --port 30000 \
  --speculative-algorithm NEXTN --speculative-num-steps 3 \
  --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
  --enable-dp-attention --dp 8 --ep-size 32 --enable-dp-lm-head

# node 2
python3 -m sglang.launch_server \
  --model-path /path/to/DeepSeek-V3.2-Retro \
  --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 1 \
  --trust-remote-code \
  --speculative-algorithm NEXTN --speculative-num-steps 3 \
  --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
  --enable-dp-attention --dp 8 --ep-size 32 --enable-dp-lm-head

# node 3
python3 -m sglang.launch_server \
  --model-path /path/to/DeepSeek-V3.2-Retro \
  --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 2 \
  --trust-remote-code \
  --speculative-algorithm NEXTN --speculative-num-steps 3 \
  --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
  --enable-dp-attention --dp 8 --ep-size 32 --enable-dp-lm-head

# node 4
python3 -m sglang.launch_server \
  --model-path /path/to/DeepSeek-V3.2-Retro \
  --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 3 \
  --trust-remote-code \
  --speculative-algorithm NEXTN --speculative-num-steps 3 \
  --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
  --enable-dp-attention --dp 8 --ep-size 32 --enable-dp-lm-head
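
Once all four servers are up, node 1 exposes an OpenAI-compatible HTTP API on port 30000. A stdlib-only client sketch follows; the host and port come from the launch flags above, while `build_chat_request` and `chat` are helper names of our own:

```python
import json
import urllib.request

def build_chat_request(prompt: str, max_tokens: int = 512) -> dict:
    """Assemble an OpenAI-style /v1/chat/completions payload."""
    return {
        "model": "DeepSeek-V3.2-Retro",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, host: str = "10.0.0.1", port: int = 30000) -> str:
    """POST the request to the node-1 server and return the reply text."""
    req = urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# chat("Prove that sqrt(2) is irrational.")  # requires a running server
```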

5. License

This repository and the model weights are licensed under the MIT License, consistent with DeepSeek-V3.2. In addition, if you use DeepSeek-V3.2-Retro, you must also comply with the terms and conditions of the DeepSeek-V3.2 model license.

6. Contact

If you have any questions, please raise an issue or contact us at opensource@zhejianglab.org.
