suhara committed (verified)
Commit b95f63f · 1 Parent(s): eab7377

Update README.md

Files changed (1):
README.md +13 -0
README.md CHANGED
@@ -322,11 +322,24 @@ vllm serve --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
  --max-model-len 262144 \
  --port 8000 \
  --trust-remote-code \
+ --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3
  ```

+ Here is an example client code for vLLM. By default, the endpoint has reasoning enabled. We recommend setting a high value (e.g., 10,000) for `max_tokens`.
+
+ ```shell
+ curl http://localhost:8000/v1/chat/completions \
+ -H "Content-Type: application/json" \
+ -d '{
+ "model": "model",
+ "messages":[{"role": "user", "content": "Write a haiku about GPUs"}],
+ "max_tokens": 10000
+ }'
+ ```
+
  If you’d like to use reasoning off with vLLM, you can do the following:
  vLLM OpenAI curl request:
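
The key addition in this hunk is `--enable-auto-tool-choice`, which, together with the `qwen3_coder` tool-call parser, lets the server handle OpenAI-style tool-calling requests. As a minimal sketch (not part of the committed README), such a request against the same endpoint could look like the following; the `get_weather` function is a hypothetical example tool:

```shell
# Illustrative tool-calling request; the get_weather tool definition is a
# hypothetical example, not taken from the README.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model",
    "messages": [{"role": "user", "content": "What is the weather in Santa Clara today?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string", "description": "City name"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto",
    "max_tokens": 10000
  }'
```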