Update README.md
README.md CHANGED
@@ -322,11 +322,24 @@ vllm serve --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
 --max-model-len 262144 \
 --port 8000 \
 --trust-remote-code \
+--enable-auto-tool-choice \
 --tool-call-parser qwen3_coder \
 --reasoning-parser-plugin nano_v3_reasoning_parser.py \
 --reasoning-parser nano_v3
 ```
 
+Here is an example client code for vLLM. By default, the endpoint has reasoning enabled. We recommend setting a high value (e.g., 10,000) for `max_tokens`.
+
+```shell
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "model",
+    "messages": [{"role": "user", "content": "Write a haiku about GPUs"}],
+    "max_tokens": 10000
+  }'
+```
+
 If you’d like to use reasoning off with vLLM, you can do the following:
 vLLM OpenAI curl request:
 
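The reasoning-off curl request itself falls outside this hunk, so it is not shown above. As a rough sketch only: vLLM's OpenAI-compatible server accepts a `chat_template_kwargs` field in the request body, and several reasoning-capable chat templates expose an `enable_thinking` toggle through it; whether this model's template uses that exact kwarg is an assumption here, so check the model card for the actual switch.

```shell
# Hypothetical reasoning-off request. "chat_template_kwargs" is a real vLLM
# request field, but the "enable_thinking" kwarg is an assumption for this
# model's chat template; verify against the model card before relying on it.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs"}],
    "max_tokens": 10000,
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```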