<think>...</think> in the response
I'm using llama-server (from llama.cpp) to run the model. I know the model supports reasoning, so it includes <think>...</think> tags in the response.
However, unlike some other reasoning models (e.g., https://huggingface.co/unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF), the thinking content is not placed into the response's reasoning_content field. This can break applications that expect to parse the response in that format.
Is there any way to update the response format to support reasoning_content for the thinking content?
Alternatively, is there a way to disable reasoning when using the model with llama-server?
Fixes: NVIDIA Nemotron 3 parsing #18077
A way to disable reasoning:
CLI: --reasoning-budget 0 or --chat-template-kwargs '{"enable_thinking": false}'
In an API request: {"chat_template_kwargs": {"enable_thinking": false}}
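For reference, a minimal request passing that kwarg through the OpenAI-compatible endpoint could look like this (a sketch; the port and prompt are placeholders, and the server typically needs --jinja for chat_template_kwargs to take effect):

# Sketch: disable thinking for a single request via chat_template_kwargs.
# Assumes llama-server is running with --jinja on localhost:8080.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello"}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'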
@duc0812112 look at https://github.com/ggml-org/llama.cpp/tree/master/tools/server for the --reasoning-format option.
You'll find exactly what you are looking for ;)
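For example, a minimal serve command using that option might look like this (a sketch; the model path is a placeholder, auto is the documented default, and none leaves the <think> tags inline in content):

# Sketch: have the server move <think>...</think> into reasoning_content.
llama-server \
  --model /path/to/model.gguf \
  --jinja \
  --reasoning-format auto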
@owao I tried all the options, but the thoughts always remain in the response content, without any visible tags.
Ah! That seems weird. What's your llama-server command?
We'll also see how it turns out for you.
@duc0812112 The model card of this model says:

Note: <think> and </think> are separate tokens, so use --special if needed.
I tried it with --special, and again with all variants of --reasoning-format; no change.
"Nemotron-3-Nano-30B-A3B_131k_think_tool":
cmd: |
/home/user/llama.cpp/build/bin/llama-server
--model /mnt/storage/GGUFs/Nemotron-3-Nano-30B-A3B/Nemotron-3-Nano-30B-A3B-UD-Q4_K_XL.gguf
--no-warmup
--ctx-size 131000
--no-context-shift
--n-gpu-layers 100
--temp 0.6
--top-p 0.95
--repeat-penalty 1.05
--jinja
--host 0.0.0.0
--port ${PORT}
--flash-attn on
--chat-template-kwargs '{"enable_thinking":true}'
curl -X POST http://localhost:8678/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"Nemotron-3-Nano-30B-A3B_131k_think_notool","messages":[{"role":"user","content":"How many Rs are in strawberry?"}],"temperature":0,"max_tokens":1024,"stream":false}'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","reasoning_content":"The user asks: \"How many Rs are in strawberry?\" Likely they want count of letter 'R' in the word \"strawberry\". The word \"strawberry\" letters: s t r a w b e r r y. Count of 'r' (case-insensitive) appears? Let's count: positions: s(1), t(2), r(3) -> one r, then later e r r y: there are two r's at the end? Actually \"strawberry\" spelled s t r a w b e r r y. So letters: s, t, r, a, w, b, e, r, r, y. That's three r's? Let's count: after 'e' we have r, then another r, then y. So total r's = 3? Wait check: The word \"strawberry\" has letters: s t r a w b e r r y. That's indeed three r's: one after t, and two consecutive at the end. So answer: 3 Rs.\n\nThus respond with answer.","content":"The word **“strawberry”** contains **3** instances of the letter **R**."}}],"created":1765908537,"model":"Nemotron-3-Nano-30B-A3B-UD-Q4_K_XL.gguf","system_fingerprint":"b7433-7b1db3d3b","object":"chat.completion","usage":{"completion_tokens":252,"prompt_tokens":25,"total_tokens":277},"id":"chatcmpl-jmRBCdvJ4yqRS2yaeM0f75gq9L86TM6L","timings":{"cache_n":0,"prompt_n":25,"prompt_ms":11.183,"prompt_per_token_ms":0.44732,"prompt_per_second":2235.5360815523563,"predicted_n":252,"predicted_ms":1501.447,"predicted_per_token_ms":5.958123015873015,"predicted_per_second":167.8380921870702}}
reformatted:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": "The user asks: \"How many Rs are in strawberry?\" Likely they want count of letter 'R' in the word \"strawberry\". The word \"strawberry\" letters: s t r a w b e r r y. Count of 'r' (case-insensitive) appears? Let's count: positions: s(1), t(2), r(3) -> one r, then later e r r y: there are two r's at the end? Actually \"strawberry\" spelled s t r a w b e r r y. So letters: s, t, r, a, w, b, e, r, r, y. That's three r's? Let's count: after 'e' we have r, then another r, then y. So total r's = 3? Wait check: The word \"strawberry\" has letters: s t r a w b e r r y. That's indeed three r's: one after t, and two consecutive at the end. So answer: 3 Rs.\n\nThus respond with answer.",
"content": "The word **“strawberry”** contains **3** instances of the letter **R**."
}
}
]
}
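For anyone scripting against this, the two fields can be pulled apart with jq (a sketch; assumes jq is installed and the server from the config above is listening on port 8678):

# Print the reasoning and the final answer as separate blocks.
curl -s http://localhost:8678/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"How many Rs are in strawberry?"}]}' \
  | jq -r '.choices[0].message.reasoning_content, "---", .choices[0].message.content'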
How do you start your server?
I've upgraded llama.cpp to a newer version (I'm on 7442 now, cc @ceoofcapybaras) and it worked like a charm. The thinking content is now placed into the response's reasoning_content field. Thanks for your help, guys!
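If anyone else hits this, checking which build the binary reports is a quick sanity test (a sketch; the system_fingerprint field in responses, e.g. b7433-7b1db3d3b above, carries the same build tag):

# Should print version and build info for the binary.
llama-server --version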
I've compared it to Qwen3-Coder-30B-A3B-Instruct-Q4_0.gguf, and the results don't match what's advertised: 41 t/s for Nemotron-3-Nano-30B-A3B-UD-Q4_K_XL.gguf vs. 57 t/s for Qwen3-Coder-30B-A3B-Instruct-Q4_0.gguf on a Mac M1 Max.
@owao How fast is this model compared to the Qwen3 30B A3B models on your side?
@owao thanks, --chat-template-kwargs '{"enable_thinking":true}' worked!
Never mind, I tested it on the wrong Nemotron.
Yeah, rebuilding llama.cpp helped.
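For completeness, a typical rebuild goes something like this (a sketch following the llama.cpp build docs; drop -DGGML_CUDA=ON on machines without an NVIDIA GPU):

# Update the checkout and rebuild from source.
git pull
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j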
IQ4 on a 4090 gets 232 t/s in full GPU mode and 8.5 t/s with --cpu-moe.
RTX 3090 (slightly undervolted and underclocked, just enough not to trigger throttling):

Name                              Prompt   Generated   Prompt processing   Generation
--------------------------------------------------------------------------------------
Qwen3-Coder-30B-A3B-Q4_K_XL          707       2,048         1491.27 t/s   157.92 t/s
Nemotron-3-Nano-30B-A3B_Q4_K_XL      725       2,048         1598.71 t/s   171.59 t/s
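For reproducible numbers, llama-bench from the same build is the usual tool (a sketch; the model path is a placeholder, and -p/-n mirror the prompt/generated sizes in the table above):

# Measure prompt processing (pp) and token generation (tg) throughput.
llama-bench \
  -m /path/to/Nemotron-3-Nano-30B-A3B_Q4_K_XL.gguf \
  -p 725 -n 2048 \
  -ngl 100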