<think>...</think> in the response

#3 · opened by duc0812112

I'm using llama-server (from llama.cpp) to run the model. I know the model supports reasoning, so it includes `<think>...</think>` tags in the response.

However, unlike some other reasoning models (e.g., https://huggingface.co/unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF), the thinking content is not placed into the response's reasoning_content field. This can break applications that expect to parse the response in that format.

Is there any way to update the response format to support reasoning_content for the thinking content?

Alternatively, is there a way to disable reasoning when using the model with llama-server?
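For context, this is the kind of extraction a client typically does against an OpenAI-compatible endpoint that separates the two fields (the port and model name here are illustrative placeholders, not my actual setup):

```bash
# Read the reasoning and the final answer as separate fields.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"nemotron","messages":[{"role":"user","content":"Hi"}]}' \
  | jq '.choices[0].message | {reasoning_content, content}'
```

When the server doesn't populate reasoning_content, that field comes back null and the thinking stays embedded in content, which is what breaks such clients.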

Fixes: NVIDIA Nemotron 3 parsing #18077

A way to disable reasoning (a concrete request is sketched below):
- CLI: `--reasoning-budget 0` or `--chat-template-kwargs '{"enable_thinking": false}'`
- In an API request: `{"chat_template_kwargs": {"enable_thinking": false}}`
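For example, the per-request variant looks like this (host, port, and model name are placeholders):

```bash
# Disable thinking for a single request by passing chat_template_kwargs
# alongside the usual OpenAI-style fields.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nemotron",
        "messages": [{"role": "user", "content": "How many Rs are in strawberry?"}],
        "chat_template_kwargs": {"enable_thinking": false}
      }'
```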

@duc0812112 look at https://github.com/ggml-org/llama.cpp/tree/master/tools/server for the --reasoning-format option.
You'll find exactly what you are looking for ;)
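If memory serves, a minimal invocation is something along these lines (paths are placeholders; check the server README for the exact set of accepted values):

```bash
# Sketch: have llama-server move <think>...</think> into reasoning_content.
# I believe --reasoning-format accepts at least none and deepseek; verify with --help.
/path/to/llama-server \
  --model /path/to/model.gguf \
  --jinja \
  --reasoning-format deepseek
```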

@owao I tried all the options; the thoughts always remain in the response content, without any visible tags.

Ah, that seems weird. What's your llama-server command?
We'll also see how it turns out for @duc0812112.

The model card of this model says:

> Note: `<think>` and `</think>` are separate tokens, so use `--special` if needed.

I tried it with --special, and again with all variants of --reasoning-format; no change.

@ceoofcapybaras

  "Nemotron-3-Nano-30B-A3B_131k_think_tool":
    cmd: |
      /home/user/llama.cpp/build/bin/llama-server
      --model /mnt/storage/GGUFs/Nemotron-3-Nano-30B-A3B/Nemotron-3-Nano-30B-A3B-UD-Q4_K_XL.gguf
      --no-warmup
      --ctx-size 131000
      --no-context-shift
      --n-gpu-layers 100
      --temp 0.6
      --top-p 0.95
      --repeat-penalty 1.05
      --jinja
      --host 0.0.0.0
      --port ${PORT}
      --flash-attn on
      --chat-template-kwargs '{"enable_thinking":true}'
```bash
curl -X POST http://localhost:8678/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Nemotron-3-Nano-30B-A3B_131k_think_notool","messages":[{"role":"user","content":"How many Rs are in strawberry?"}],"temperature":0,"max_tokens":1024,"stream":false}'
```

Response:

```json
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","reasoning_content":"The user asks: \"How many Rs are in strawberry?\" Likely they want count of letter 'R' in the word \"strawberry\". The word \"strawberry\" letters: s t r a w b e r r y. Count of 'r' (case-insensitive) appears? Let's count: positions: s(1), t(2), r(3) -> one r, then later e r r y: there are two r's at the end? Actually \"strawberry\" spelled s t r a w b e r r y. So letters: s, t, r, a, w, b, e, r, r, y. That's three r's? Let's count: after 'e' we have r, then another r, then y. So total r's = 3? Wait check: The word \"strawberry\" has letters: s t r a w b e r r y. That's indeed three r's: one after t, and two consecutive at the end. So answer: 3 Rs.\n\nThus respond with answer.","content":"The word **“strawberry”** contains **3** instances of the letter **R**."}}],"created":1765908537,"model":"Nemotron-3-Nano-30B-A3B-UD-Q4_K_XL.gguf","system_fingerprint":"b7433-7b1db3d3b","object":"chat.completion","usage":{"completion_tokens":252,"prompt_tokens":25,"total_tokens":277},"id":"chatcmpl-jmRBCdvJ4yqRS2yaeM0f75gq9L86TM6L","timings":{"cache_n":0,"prompt_n":25,"prompt_ms":11.183,"prompt_per_token_ms":0.44732,"prompt_per_second":2235.5360815523563,"predicted_n":252,"predicted_ms":1501.447,"predicted_per_token_ms":5.958123015873015,"predicted_per_second":167.8380921870702}}
```

Reformatted:

```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "The user asks: \"How many Rs are in strawberry?\" Likely they want count of letter 'R' in the word \"strawberry\". The word \"strawberry\" letters: s t r a w b e r r y. Count of 'r' (case-insensitive) appears? Let's count: positions: s(1), t(2), r(3) -> one r, then later e r r y: there are two r's at the end? Actually \"strawberry\" spelled s t r a w b e r r y. So letters: s, t, r, a, w, b, e, r, r, y. That's three r's? Let's count: after 'e' we have r, then another r, then y. So total r's = 3? Wait check: The word \"strawberry\" has letters: s t r a w b e r r y. That's indeed three r's: one after t, and two consecutive at the end. So answer: 3 Rs.\n\nThus respond with answer.",
        "content": "The word **“strawberry”** contains **3** instances of the letter **R**."
      }
    }
  ]
}
```

How do you start your server?

I've upgraded llama.cpp to a newer version (I'm on 7442 now, cc @ceoofcapybaras) and it worked like a charm. The thinking content is now placed into the response's reasoning_content field. Thanks for your help, guys!
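For anyone else hitting this, the rebuild is just the standard llama.cpp flow from its README (the CUDA flag is what I use; drop or swap it for your backend):

```bash
# Pull the latest llama.cpp and rebuild the server binary.
cd ~/llama.cpp                    # assumed clone location; use your own path
git pull
cmake -B build -DGGML_CUDA=ON     # omit the flag for CPU-only builds
cmake --build build --config Release -j
```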

I've compared it to Qwen3-Coder-30B-A3B-Instruct-Q4_0.gguf, and the results don't match what's advertised: 41 t/s for Nemotron-3-Nano-30B-A3B-UD-Q4_K_XL.gguf vs. 57 t/s for Qwen3-Coder-30B-A3B-Instruct-Q4_0.gguf on a Mac M1 Max.

@owao How fast is this model compared to the Qwen3 30B A3B models on your side?
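For a cleaner apples-to-apples number than eyeballing chat output, llama-bench should do it; run it once per model (the path is a placeholder, and the flags are from memory, so double-check with llama-bench --help):

```bash
# Measure prompt processing (pp) and token generation (tg) throughput.
./build/bin/llama-bench \
  -m /path/to/Nemotron-3-Nano-30B-A3B-UD-Q4_K_XL.gguf \
  -p 512 -n 128 -ngl 99
```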

@owao thanks, `--chat-template-kwargs '{"enable_thinking":true}'` worked!
Never mind, I tested it on the wrong Nemotron.
Yeah, rebuilding llama.cpp helped.
IQ4 on a 4090 gets 232 t/s in full-GPU mode and 8.5 t/s with `--cpu-moe`.

@duc0812112

RTX 3090 (slightly undervolted and underclocked - just enough not to trigger throttling):

| Name | Prompt tokens | Generated tokens | Prompt processing | Generation |
|---|---|---|---|---|
| Qwen3-Coder-30B-A3B-Q4_K_XL | 707 | 2,048 | 1491.27 t/s | 157.92 t/s |
| Nemotron-3-Nano-30B-A3B_Q4_K_XL | 725 | 2,048 | 1598.71 t/s | 171.59 t/s |
