feihu.hf committed · Commit b9b7053 · 1 Parent(s): 706865b
update README
README.md CHANGED

@@ -204,6 +204,13 @@ For full technical details, see the [Qwen2.5-1M Technical Report](https://arxiv.
 
 Replace the content of your `config.json` with `config_1m.json`, which includes the config for length extrapolation and sparse attention.
 
+```bash
+export MODELNAME=Qwen3-30B-A3B-Instruct-2507
+huggingface-cli download Qwen/${MODELNAME} --local-dir ${MODELNAME}
+mv ${MODELNAME}/config.json ${MODELNAME}/config.json.bak
+mv ${MODELNAME}/config_1m.json ${MODELNAME}/config.json
+```
+
 #### Step 2: Launch Model Server
 
 After updating the config, proceed with either **vLLM** or **SGLang** for serving the model.
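The Step 1 block added above downloads the checkpoint and swaps `config_1m.json` in as the active `config.json`. As a quick sanity check afterwards, a one-liner along these lines can be used to inspect the active config; it assumes `python3` is available, and the expectation that the long-context config carries a larger `max_position_embeddings` is an assumption rather than something stated in the diff.

```bash
# Illustrative sanity check (not part of the commit): inspect the active config after the swap.
# The assumption that max_position_embeddings grows with config_1m.json is not stated in the diff.
python3 -c "import json; cfg = json.load(open('Qwen3-30B-A3B-Instruct-2507/config.json')); print('max_position_embeddings:', cfg.get('max_position_embeddings')); print('top-level keys:', sorted(cfg))"
```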
@@ -222,8 +229,8 @@ Then launch the server with Dual Chunk Flash Attention enabled:
 
 ```bash
 VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN VLLM_USE_V1=0 \
-vllm serve
-  --tensor-parallel-size
+vllm serve ./Qwen3-30B-A3B-Instruct-2507 \
+  --tensor-parallel-size 4 \
   --max-model-len 1010000 \
   --enable-chunked-prefill \
   --max-num-batched-tokens 131072 \
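Once the updated `vllm serve` command above is running, the OpenAI-compatible endpoint can be smoke-tested. The sketch below assumes vLLM's default port 8000 and no `--served-model-name` override, in which case the model id matches the path passed to `vllm serve`.

```bash
# Illustrative request against the OpenAI-compatible server started above.
# Port 8000 is vLLM's default; the model id mirrors the serve path (an assumption
# that holds only if --served-model-name is not set).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "./Qwen3-30B-A3B-Instruct-2507",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'
```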
@@ -258,11 +265,11 @@ Launch the server with DCA support:
 
 ```bash
 python3 -m sglang.launch_server \
-  --model-path
+  --model-path ./Qwen3-30B-A3B-Instruct-2507 \
   --context-length 1010000 \
   --mem-frac 0.75 \
   --attention-backend dual_chunk_flash_attn \
-  --tp
+  --tp 4 \
   --chunked-prefill-size 131072
 ```
 
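The same kind of smoke test applies to the SGLang server above. The sketch assumes SGLang's default port 30000 and its OpenAI-compatible API; the model id placed in the request body is an assumption and should match whatever `/v1/models` reports.

```bash
# Illustrative check against SGLang's OpenAI-compatible API (default port 30000 assumed).
curl http://localhost:30000/v1/models
# The model id below is an assumption; use the id reported by /v1/models.
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "./Qwen3-30B-A3B-Instruct-2507",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'
```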
@@ -273,7 +280,7 @@ python3 -m sglang.launch_server \
 | `--attention-backend dual_chunk_flash_attn` | Activates Dual Chunk Flash Attention |
 | `--context-length 1010000` | Defines max input length |
 | `--mem-frac 0.75` | The fraction of the memory used for static allocation (model weights and KV cache memory pool). Use a smaller value if you see out-of-memory errors. |
-| `--tp
+| `--tp 4` | Tensor parallelism size (matches model sharding) |
 | `--chunked-prefill-size 131072` | Prefill chunk size for handling long inputs without OOM |
 
 #### Troubleshooting: