DeepSeek-V3.2-Retro

This repository hosts the model weights for DeepSeek-V3.2-Retro. For usage instructions and further details, please refer to the GitHub repository.

1. Introduction

DeepSeek-V3.2 introduces the DeepSeek Sparse Attention (DSA) architecture, representing a significant architectural evolution over DeepSeek-V3 and DeepSeek-V3.1. However, as of now, an official open-source implementation compatible with Ampere-series GPUs has not been released.

To address this gap, we introduce DeepSeek-V3.2-Retro, targeting the following user groups:

  • Ampere GPU users who do not have access to Hopper or Blackwell architectures.
  • Users of general-purpose GPU platforms where DSA is not yet supported.

Key features of DeepSeek-V3.2-Retro include:

  • DSA Removal: the DSA modules are removed from the original V3.2 architecture.
  • BF16 Conversion: model parameters and computation are converted to the BF16 data format.
  • Broad Compatibility: runs on any hardware platform that supports the V3 architecture.
  • Validated Performance: achieves results on multiple benchmarks that are close to the officially reported numbers.

2. Performance Evaluation

As our primary target scenario is reasoning-oriented usage, we report accuracy on several representative benchmarks with thinking mode enabled. For consistency, the baseline metrics are taken from the corresponding official technical reports.

| Benchmark | DeepSeek-V3.2-Retro | DeepSeek-V3.2-Thinking |
| --- | --- | --- |
| MMLU-Pro | 86.4 | 85.0 |
| GPQA Diamond | 82.12 | 82.4 |
| AIME 2025 | 93.67 | 93.1 |
| LiveCodeBench | 80.72 | 83.3 |

In addition, we evaluate inference efficiency. Using SGLang v0.5.6 under identical settings, we observe that the throughput of DeepSeek-V3.2-Retro is on par with DeepSeek-V3.1. Output throughput is reported in tokens/s.

| Model | Output Throughput (tokens/s; qps=512, input=1k, output=10k) |
| --- | --- |
| DeepSeek-V3.2-Retro | 2510.27 |
| DeepSeek-V3.1 | 2515.34 |

These results indicate that removing the DSA structure and reverting to a V3-compatible architecture does not introduce noticeable performance regression in either reasoning accuracy or inference throughput on Ampere-class hardware.

3. Model Download

The DeepSeek-V3.2-Retro model weights are available for download from Hugging Face and ModelScope. Please ensure that you have at least 1.5 TB of free disk space before downloading.

| Model | Total Params | Hugging Face | ModelScope |
| --- | --- | --- | --- |
| DeepSeek-V3.2-Retro | 684B | 🤗 Hugging Face | 🤖 ModelScope |
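
As a minimal sketch, the weights can be fetched with the standard `huggingface_hub` API. The repo id and local directory below are placeholders (substitute the actual Hugging Face repo), and the free-space check mirrors the 1.5 TB requirement noted above.

```python
import os
import shutil

REQUIRED_FREE_BYTES = int(1.5 * 1024**4)  # ~1.5 TB, per the note above

def has_enough_space(path: str, required: int = REQUIRED_FREE_BYTES) -> bool:
    """True if the filesystem holding `path` has at least `required` free bytes."""
    return shutil.disk_usage(path).free >= required

def download_model(repo_id: str, local_dir: str) -> str:
    """Fetch all weight shards; snapshot_download resumes interrupted transfers."""
    parent = os.path.dirname(os.path.abspath(local_dir))
    if not has_enough_space(parent):
        raise RuntimeError("Need at least 1.5 TB of free disk space.")
    from huggingface_hub import snapshot_download  # pip install huggingface_hub
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)

# Example (placeholder repo id):
# download_model("<org>/DeepSeek-V3.2-Retro", "/data/DeepSeek-V3.2-Retro")
```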

4. Quickstart

We strongly recommend using SGLang for efficient inference of DeepSeek-series models. The example below serves the model across four nodes with 8× A100 GPUs each (32 GPUs in total, matching --tp 32).

SGLang

Using Docker (Recommended)

# Pull the latest image on all four nodes and ensure RDMA network connectivity between them.
# https://hub.docker.com/r/lmsysorg/sglang/tags
docker pull lmsysorg/sglang:latest
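
After pulling the image, a container still has to be started on each node before the server can be launched. A minimal sketch of a docker run invocation, assuming host networking and a host directory /data that holds the downloaded weights (the mount path is a placeholder to adapt):

```shell
# Run on each node; host networking lets the nodes reach 10.0.0.1:5000 directly.
docker run --gpus all \
    --network host \
    --ipc=host \
    --shm-size 32g \
    -v /data:/data \
    -it lmsysorg/sglang:latest bash
# Then run the per-node launch command below inside the container.
```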

Launch Command

# For high-QPS scenarios, --enable-dp-attention and --ep-size boost throughput, and MTP speculative
# decoding (the --speculative-* flags) boosts decoding speed; all are already included below.
# node 1
python3 -m sglang.launch_server \
  --model-path /path/to/DeepSeek-V3.2-Retro \
  --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 0 \
  --trust-remote-code --host 0.0.0.0 --port 30000 \
  --speculative-algorithm NEXTN --speculative-num-steps 3 \
  --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
  --enable-dp-attention --dp 8 --ep-size 32 --enable-dp-lm-head

# node 2
python3 -m sglang.launch_server \
  --model-path /path/to/DeepSeek-V3.2-Retro \
  --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 1 \
  --trust-remote-code \
  --speculative-algorithm NEXTN --speculative-num-steps 3 \
  --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
  --enable-dp-attention --dp 8 --ep-size 32 --enable-dp-lm-head

# node 3
python3 -m sglang.launch_server \
  --model-path /path/to/DeepSeek-V3.2-Retro \
  --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 2 \
  --trust-remote-code \
  --speculative-algorithm NEXTN --speculative-num-steps 3 \
  --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
  --enable-dp-attention --dp 8 --ep-size 32 --enable-dp-lm-head

# node 4
python3 -m sglang.launch_server \
  --model-path /path/to/DeepSeek-V3.2-Retro \
  --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 3 \
  --trust-remote-code \
  --speculative-algorithm NEXTN --speculative-num-steps 3 \
  --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
  --enable-dp-attention --dp 8 --ep-size 32 --enable-dp-lm-head
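
Once all four servers are up, node 1 exposes an OpenAI-compatible HTTP API on port 30000. A stdlib-only client sketch follows; the host and port come from the launch flags above, while `build_chat_request` and `chat` are helper names of our own:

```python
import json
import urllib.request

def build_chat_request(prompt: str, max_tokens: int = 512) -> dict:
    """Assemble an OpenAI-style /v1/chat/completions payload."""
    return {
        "model": "DeepSeek-V3.2-Retro",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, host: str = "10.0.0.1", port: int = 30000) -> str:
    """POST the request to the node-1 server and return the reply text."""
    req = urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# chat("Prove that sqrt(2) is irrational.")  # requires a running server
```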

5. License

This repository and the model weights are licensed under the MIT License, consistent with DeepSeek-V3.2. In addition, if you use DeepSeek-V3.2-Retro, you must also comply with the terms and conditions of the DeepSeek-V3.2 model license.

6. Contact

If you have any questions, please raise an issue or contact us at opensource@zhejianglab.org.
