feihu.hf committed
Commit b9b7053 · 1 Parent(s): 706865b

update README

Files changed (1): README.md (+12 -5)
README.md CHANGED
@@ -204,6 +204,13 @@ For full technical details, see the [Qwen2.5-1M Technical Report](https://arxiv.

Replace the content of your `config.json` with `config_1m.json`, which includes the config for length extrapolation and sparse attention.

+ ```bash
+ export MODELNAME=Qwen3-30B-A3B-Instruct-2507
+ huggingface-cli download Qwen/${MODELNAME} --local-dir ${MODELNAME}
+ mv ${MODELNAME}/config.json ${MODELNAME}/config.json.bak
+ mv ${MODELNAME}/config_1m.json ${MODELNAME}/config.json
+ ```
+
#### Step 2: Launch Model Server

After updating the config, proceed with either **vLLM** or **SGLang** for serving the model.
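Before launching a server, it can help to confirm the swap took effect. The check below is only a sketch and not part of this commit: `max_position_embeddings` is a standard Hugging Face config field, but the exact fields and values that `config_1m.json` adds are not reproduced here, so adjust it to whatever your downloaded config actually contains.

```bash
# Optional sanity check (not part of this commit): verify the 1M config is now active.
MODELNAME=Qwen3-30B-A3B-Instruct-2507
ls -l ${MODELNAME}/config.json ${MODELNAME}/config.json.bak   # the .bak backup should exist
# Print the context-length field; the expected value depends on the shipped config_1m.json.
python3 -c "import json; print(json.load(open('${MODELNAME}/config.json'))['max_position_embeddings'])"
```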
@@ -222,8 +229,8 @@ Then launch the server with Dual Chunk Flash Attention enabled:

```bash
VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN VLLM_USE_V1=0 \
- vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
- --tensor-parallel-size 8 \
+ vllm serve ./Qwen3-30B-A3B-Instruct-2507 \
+ --tensor-parallel-size 4 \
  --max-model-len 1010000 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 131072 \
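Once this command is running, vLLM exposes an OpenAI-compatible HTTP API (port 8000 unless `--port` is set). A minimal smoke test could look like the sketch below; it assumes the served model name defaults to the path passed to `vllm serve` (override with `--served-model-name` if you prefer a cleaner id).

```bash
# Minimal smoke test against vLLM's OpenAI-compatible endpoint (default port 8000).
# The "model" value mirrors the path given to `vllm serve`; adjust if --served-model-name is set.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "./Qwen3-30B-A3B-Instruct-2507",
        "messages": [{"role": "user", "content": "Reply with OK if you can read this."}],
        "max_tokens": 16
      }'
```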
@@ -258,11 +265,11 @@ Launch the server with DCA support:

```bash
python3 -m sglang.launch_server \
- --model-path Qwen/Qwen3-235B-A22B-Instruct-2507 \
+ --model-path ./Qwen3-30B-A3B-Instruct-2507 \
  --context-length 1010000 \
  --mem-frac 0.75 \
  --attention-backend dual_chunk_flash_attn \
- --tp 8 \
+ --tp 4 \
  --chunked-prefill-size 131072
```
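SGLang's `launch_server` likewise serves an OpenAI-compatible API, by default on port 30000. A hedged smoke-test sketch, assuming that default port and that the model id mirrors the `--model-path` value:

```bash
# Minimal smoke test for the SGLang server (default port 30000; change if --port was set).
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "./Qwen3-30B-A3B-Instruct-2507",
        "messages": [{"role": "user", "content": "Reply with OK if you can read this."}],
        "max_tokens": 16
      }'
```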
 
@@ -273,7 +280,7 @@ python3 -m sglang.launch_server \
| `--attention-backend dual_chunk_flash_attn` | Activates Dual Chunk Flash Attention |
| `--context-length 1010000` | Defines max input length |
| `--mem-frac 0.75` | The fraction of the memory used for static allocation (model weights and KV cache memory pool). Use a smaller value if you see out-of-memory errors. |
- | `--tp 8` | Tensor parallelism size (matches model sharding) |
+ | `--tp 4` | Tensor parallelism size (matches model sharding) |
| `--chunked-prefill-size 131072` | Prefill chunk size for handling long inputs without OOM |

#### Troubleshooting:
 