# openPangu-Ultra-MoE-718B Deployment Guide for vllm-ascend

## Deployment Environment

The bf16 version of openPangu-Ultra-MoE-718B can be deployed on 64 Atlas 800T A2 (64GB) NPUs, and the int8 version on 32 NPUs. This guide uses the vllm-ascend community image v0.9.1-dev, which must be pulled on every node:

```bash
docker pull quay.io/ascend/vllm-ascend:v0.9.1-dev
```
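
If passwordless SSH between the nodes is set up, the pull can be scripted instead of run by hand on each node (a minimal sketch; `node1`…`node8` are placeholder hostnames, not names from this guide):

```bash
# Pull the image on every node in parallel (replace node1..node8 with your hostnames)
for host in node{1..8}; do
    ssh "$host" "docker pull quay.io/ascend/vllm-ascend:v0.9.1-dev" &
done
wait
```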
**Network environment check.** Run the following commands on every node, in order. Every result must be `success` and every link status must be `UP`:
```bash
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
```
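
Rather than eyeballing the output for all eight ports, the two pass/fail checks can be scripted (a sketch built on the commands above; the exact wording of hccn_tool output may vary with the driver version):

```bash
# Flag any port whose link is not UP or whose health check is not success
for i in {0..7}; do
    hccn_tool -i $i -link -g | grep -q "UP" || echo "port $i: link is DOWN"
    hccn_tool -i $i -net_health -g | grep -qi "success" || echo "port $i: net health check failed"
done
```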

## Starting the Container and Adapting the Inference Code

The following steps must be performed on every node.

Start the container:

```bash
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.9.1-dev  # Use the correct image ID
export NAME=vllm-ascend  # Custom container name

# Run the container using the defined variables
# Note: if you are using Docker bridge networking, expose the ports needed for
# multi-node communication in advance
# To prevent device interference from other Docker containers, add "--privileged"
docker run --rm \
--name $NAME \
--network host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /mnt/sfs_turbo/.cache:/root/.cache \
-it $IMAGE bash
```

If you are not already inside the container, enter it as the root user:

```bash
docker exec -itu root $NAME /bin/bash
```
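
Inside the container, you can confirm that all eight NPUs are visible before going further:

```bash
# Should list all 8 NPUs that were mounted into the container
npu-smi info
```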

Install vLLM v0.9.2 to replace the vllm code built into the image:

```bash
pip install --no-deps vllm==0.9.2 pybase64==1.4.1
```
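
A quick check that the replacement took effect:

```bash
python3 -c "import vllm; print(vllm.__version__)"  # expect 0.9.2
```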

Download vllm-ascend v0.9.2rc1 and replace the vllm-ascend code built into the image (/vllm-workspace/vllm-ascend/). For example, download the Source code (tar.gz) from the release Assets (e.g., vllm-ascend-0.9.2rc1.tar.gz), then extract it over the existing directory:

```bash
tar -zxvf vllm-ascend-0.9.2rc1.tar.gz -C /vllm-workspace/vllm-ascend/ --strip-components=1
export PYTHONPATH=/vllm-workspace/vllm-ascend/:${PYTHONPATH}
```
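
If the node has direct internet access, the same archive can be fetched with wget using GitHub's standard release-archive URL pattern (an assumption on my part; verify against the actual release page if the download fails):

```bash
# Download the v0.9.2rc1 source archive directly from GitHub
wget -O vllm-ascend-0.9.2rc1.tar.gz \
    https://github.com/vllm-project/vllm-ascend/archive/refs/tags/v0.9.2rc1.tar.gz
```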

Overwrite part of the code in /vllm-workspace/vllm-ascend/vllm_ascend/ with the Pangu-adapted vllm-ascend code from this repository:

```bash
yes | cp -r inference/vllm_ascend/* /vllm-workspace/vllm-ascend/vllm_ascend/
```
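
As a sanity check (assuming the `PYTHONPATH` export above is in effect), confirm that Python now resolves `vllm_ascend` from the replaced directory:

```bash
# Should print a path under /vllm-workspace/vllm-ascend/
python3 -c "import vllm_ascend; print(vllm_ascend.__file__)"
```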

## BF16 Inference

The following steps must be performed on every node.

Run:

```bash
# local_ip is the node's first IP address from hostname -I;
# nic_name is the network interface name corresponding to local_ip (looked up via ifconfig)
local_ip=`hostname -I | cut -d' ' -f1`
nic_name=$(ifconfig | grep -B 1 "$local_ip" | head -n 1 | awk '{print $1}' | sed 's/://')
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP=1
export VLLM_ASCEND_ENABLE_TOP_N_SIGMA=1  # enable top-n-sigma sampling
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

MASTER_NODE_IP=xxx.xxx.xxx.xxx  # master/head node IP
NODE_RANK=xxx  # current node rank (0-7)
NUM_NODES=8  # number of nodes
NUM_NPUS_LOCAL=8  # number of NPUs per node
DATA_PARALLEL_SIZE_LOCAL=4  # DP size per node; can be set to 1, 2, or 4
LOCAL_CKPT_DIR=/root/.cache/pangu_ultra_moe  # path to the pangu_ultra_moe bf16 weights
# HOST=127.0.0.1 (localhost) means the server can only be accessed from the master device.
# HOST=0.0.0.0 allows the vLLM server to be accessed from other devices on the same
# network, or even from the internet, provided proper network configuration
# (e.g., firewall rules, port forwarding) is in place.
HOST=xxx.xxx.xxx.xxx

if [[ $NODE_RANK -ne 0 ]]; then
    headless="--headless"
else
    headless=""
fi

vllm serve $LOCAL_CKPT_DIR \
--host $HOST \
--port 8004 \
--data-parallel-size $((NUM_NODES*DATA_PARALLEL_SIZE_LOCAL)) \
--data-parallel-size-local $DATA_PARALLEL_SIZE_LOCAL \
--data-parallel-start-rank $((DATA_PARALLEL_SIZE_LOCAL*NODE_RANK)) \
--data-parallel-address $MASTER_NODE_IP \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size $((NUM_NPUS_LOCAL/DATA_PARALLEL_SIZE_LOCAL)) \
--seed 1024 \
--served-model-name pangu_ultra_moe \
--enable-expert-parallel \
--max-num-seqs 8 \
--max-model-len 32768 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
${headless} \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```

## Sending a Test Request

After the service starts, send a test request to the master node (from the master node itself or from any other node):

```bash
curl http://${MASTER_NODE_IP}:8004/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "pangu_ultra_moe",
        "messages": [
            {
                "role": "user",
                "content": "Who are you?"
            }
        ],
        "max_tokens": 512,
        "temperature": 0.7,
        "top_p": 1.0,
        "top_k": -1,
        "vllm_xargs": {"top_n_sigma": 0.05}
    }'
```
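
If the request hangs or fails, you can first confirm the server is ready via vLLM's standard `/health` endpoint, which returns HTTP 200 once the engine is up:

```bash
# Expect "HTTP/1.1 200 OK" once the server is ready to accept requests
curl -i http://${MASTER_NODE_IP}:8004/health
```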

## Int8 Inference

### ModelSlim Quantization

The openPangu-Ultra-MoE-718B model can be quantized with the open-source quantization framework ModelSlim; W8A8 weight-and-activation quantization is currently supported.

#### openPangu-Ultra-MoE-718B W8A8 Dynamic Quantization

```bash
python3 quant_pangu_ultra_moe_w8a8.py --model_path {float-weight path} --save_path {W8A8 quantized-weight path} --dynamic
```

#### openPangu-Ultra-MoE-718B W8A8 Mixed Quantization + MTP Quantization

Generate the W8A8 quantized weights (including MTP) for the openPangu-Ultra-MoE-718B model:

```bash
python3 quant_pangu_ultra_moe_w8a8.py --model_path {float-weight path} --save_path {W8A8 quantized-weight path} --dynamic --quant_mtp mix
```

Compared with the BF16 model, the int8 quantized model's config.json contains these additional fields:

"mla_quantize": "w8a8",
"quantize": "w8a8_dynamic",

If MTP is quantized as well, this field is also added:

"mtp_quantize": "w8a8_dynamic",

After generating the quantized model, the ModelSlim quantization script appends these fields to config.json automatically.
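
To double-check that the fields were written, a simple grep works (a sketch assuming the quantized weights were saved to `/root/.cache/pangu_ultra_moe_w8a8`; substitute your own `--save_path`):

```bash
# Verify the quantization fields in the generated config.json
grep -E '"(mla_quantize|quantize|mtp_quantize)"' /root/.cache/pangu_ultra_moe_w8a8/config.json
```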

### Int8 Inference

Compared with BF16 inference, int8 quantized inference needs only 4 nodes (32 NPUs). Change the variable:

```bash
NUM_NODES=4
```

In the launch command, point `LOCAL_CKPT_DIR` at the quantized weights and add `--quantization ascend`:

```bash
LOCAL_CKPT_DIR=/root/.cache/pangu_ultra_moe_w8a8

vllm serve $LOCAL_CKPT_DIR \
...
--quantization ascend \
...
```
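
For completeness, the full int8 launch command can be assembled from the BF16 command above; only `NUM_NODES`, `LOCAL_CKPT_DIR`, and the added `--quantization ascend` flag differ (this assumes the same environment variables and `headless` logic as in the BF16 section):

```bash
NUM_NODES=4
LOCAL_CKPT_DIR=/root/.cache/pangu_ultra_moe_w8a8

vllm serve $LOCAL_CKPT_DIR \
--host $HOST \
--port 8004 \
--data-parallel-size $((NUM_NODES*DATA_PARALLEL_SIZE_LOCAL)) \
--data-parallel-size-local $DATA_PARALLEL_SIZE_LOCAL \
--data-parallel-start-rank $((DATA_PARALLEL_SIZE_LOCAL*NODE_RANK)) \
--data-parallel-address $MASTER_NODE_IP \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size $((NUM_NPUS_LOCAL/DATA_PARALLEL_SIZE_LOCAL)) \
--seed 1024 \
--served-model-name pangu_ultra_moe \
--enable-expert-parallel \
--max-num-seqs 8 \
--max-model-len 32768 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--quantization ascend \
${headless} \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```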