Update README.md
README.md CHANGED
@@ -322,11 +322,24 @@ vllm serve --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
 --max-model-len 262144 \
 --port 8000 \
 --trust-remote-code \
+--enable-auto-tool-choice \
 --tool-call-parser qwen3_coder \
 --reasoning-parser-plugin nano_v3_reasoning_parser.py \
 --reasoning-parser nano_v3
 ```
 
+Here is an example client code for vLLM. By default, the endpoint has reasoning enabled. We recommend setting a high value (e.g., 10,000) for `max_tokens`.
+
+```shell
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "model",
+    "messages": [{"role": "user", "content": "Write a haiku about GPUs"}],
+    "max_tokens": 10000
+  }'
+```
+
 If you’d like to use reasoning off with vLLM, you can do the following:
 vLLM OpenAI curl request:
 
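The reasoning-off curl request itself falls outside this hunk, so it is not shown above. As a rough sketch only: vLLM's OpenAI-compatible server accepts a `chat_template_kwargs` field in the request body, and several reasoning-capable chat templates expose an `enable_thinking` toggle through it; whether this model's template uses that exact kwarg is an assumption here, so check the model card for the actual switch.

```shell
# Hypothetical reasoning-off request. "chat_template_kwargs" is a real vLLM
# request field, but the "enable_thinking" kwarg is an assumption for this
# model's chat template; verify against the model card before relying on it.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs"}],
    "max_tokens": 10000,
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```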