Quick PPL comparison

#1
by bullerwins - opened

Hi!

I have to do some actual work with this model, and even though for chat use I typically use ik_llama.cpp, it doesn't support tool calling very well and needs a proxy. So I'm back to llama.cpp, and I had to choose a model that fits in 4x3090 and 2x5090 to pair with a coding assistant, in my case Opencode and Roocode.
Q4_K_M and _S would not fit at 128K tokens and I didn't want to use any KV cache quantization, so the biggest size that would fit is IQ4_XS, 117GB on disk.

Running:
nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | awk '{ total += $1 } END { print total " MiB in use" }'

I get that with the model loaded at 128K context it takes 151388 MiB of VRAM (I have 160GB total).

I had to choose which GGUF to run, so there were basically 3 options:
bartowski IQ4_XS 117GB
unsloth IQ4_XS 117GB
unsloth UD-Q4_K_XL 125GB

So I ran PPL tests for all 3 to check which one to select.
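The runs were along these lines (just a sketch, not the exact command; the model path, the wiki.test.raw dataset file, and the -ngl value are assumptions):

# hypothetical perplexity run with llama.cpp, fully offloaded to GPU:
#   -m   : placeholder path, swap in the quant under test
#   -f   : wikitext-2-raw test split (assumed dataset)
#   -ngl : offload all layers to the GPUs
./llama-perplexity -m model-IQ4_XS.gguf -f wiki.test.raw -ngl 99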

Bart IQ4_XS: PPL = 4.2772 +/- 0.0243
Unsloth UD-Q4_K_XL: PPL = 4.2866 +/- 0.02433
Unsloth IQ4_XS: PPL = 4.2918 +/- 0.02435

Unsloth's has more downloads, but it seems like bartowski's gives a bit better PPL, so I'm sticking with it. The chat template works fine, it outputs the thinking tags, and tool calling works well if I add --jinja to the llama.cpp server command.
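For reference, the server launch is roughly this (a sketch; the model path, port, and -ngl value are placeholders rather than the exact command I used):

# llama.cpp server with 128K context and the GGUF's Jinja chat template enabled:
#   -c 131072 : 128K context
#   -ngl 99   : offload all layers across the GPUs
#   --jinja   : apply the embedded chat template so thinking tags and tool calls work
./llama-server -m model-IQ4_XS.gguf -c 131072 -ngl 99 --jinja --host 0.0.0.0 --port 8080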

Yeah, that's usually the case, because Bartowski doesn't crush the output.weight ( Q5_K vs Q4_K )

Is this the best local model for RooCode with 160GB VRAM at the moment?

If you want fast prompt processing and text gen, probably. If you can stomach the slower speed, maybe using the 480B Coder version is better of course, but you would need to keep about half the layers in normal RAM, so we are probably talking 5-6 t/s maybe?
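For reference, a hybrid CPU/GPU launch for a big MoE like that usually looks something like this in llama.cpp (a sketch, not a tested 480B command; the filename and the -ot expert-offload pattern are assumptions):

# keep attention and shared weights on GPU, push the routed MoE expert tensors to system RAM:
#   -ngl 99        : nominally offload everything to GPU
#   -ot "exps=CPU" : then override tensors matching "exps" (the MoE experts) to CPU
./llama-server -m 480B-coder-IQ4_XS.gguf -c 131072 -ngl 99 -ot "exps=CPU" --jinja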

ik_llama.cpp would be faster, but again, the support for tool calling is not good (or not supported at all, I think, cc @ubergarm, maybe he has done some more tests). I still have to test the proxy https://github.com/Teachings/FastAgentAPI that adds tool calling to unsupported OpenAI APIs.

I've had it suggested to me by @AesSedai to use https://github.com/crashr/llama-stream as a proxy in front of either (ik_)llama.cpp, and with Cline or RooCode it works okay.

Example config.yaml

# Configuration for the llama-stream Reverse Proxy

# Port on which the reverse proxy server will listen
proxy_port: 10000

# Target server configuration
# This is the backend server to which requests will be forwarded.
# Examples:
#   For HTTP: "http://backend-service:8000"
#   For HTTPS with certificate verification: "https://api.example.com"
#   For HTTPS ignoring certificate errors (e.g., self-signed certs in dev): "https://localhost:8443"
target_url: "http://192.168.20.31:9999" # Replace with your actual target

# SSL verification for the target server (only applicable if target_url starts with "https://")
# true: Verify SSL certificate (default if target is HTTPS and this key is missing)
# false: Do not verify SSL certificate (useful for self-signed certificates)
# string: Path to a CA bundle file or directory with certificates of trusted CAs
verify_ssl: false

# For the _simulate_streaming function in POST requests
# Defines the chunk size for simulating streaming of text content.
streaming_chunk_size: 50

# Optional: Specify allowed paths for GET requests.
# If not specified or empty, all GET paths will be attempted (and likely result in 404 if not /v1/models by current logic).
# For now, the logic only explicitly handles /v1/models.
allowed_get_paths:
  - "/v1/models"
  - "/v1/chat/completions"

# Optional: Set default request timeout for requests to the target server (in seconds)
request_timeout: 99999

# Optional: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
log_level: "INFO"

Command to run after pip installing everything

python llama-stream.py /home/username/development/llama-stream/config.yaml
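A quick way to sanity-check the proxy once it's running (assuming the config above, so it listens on port 10000 and forwards to the llama.cpp server at target_url):

# should return the model list forwarded from the backend server
curl http://localhost:10000/v1/models

Then point Cline/RooCode at http://<proxy-host>:10000/v1 as an OpenAI-compatible endpoint instead of the llama.cpp server directly.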

Otherwise I have not personally tested function/tool calling, though there is still some work coming in on this PR here: https://github.com/ikawrakow/ik_llama.cpp/pull/661 and here: https://github.com/ikawrakow/ik_llama.cpp/pull/670

Wow @bullerwins thanks so much for putting that together!

This coming week I'm putting together my new server so I'm hoping to make improvements to imatrix, but it's great to know that as it stands my setup already does a pretty great job :O

Thanks for the replies, I forgot to respond.
I saw some of those proxy solutions but didn't have time to test them yet. I ended up taking the lazy option and using exllamav3 4.0bpw, which works well (albeit slowly on Ampere).

but it's great to know that as it stands my setup already does a pretty great job

Yeah mate; if I'm not using ik_llama, I tend to use your quants when they're available.

The chat template works fine, it outputs the thinking tags, and tool calling works well if I add --jinja to the llama.cpp server command.

Hey, how do you use the --jinja flag with llama-server? It's an unrecognized option when I use it.
Shouldn't the chat template load automatically from the metadata embedded in the GGUF? It's part of the tokenizer config or whatever, right?
For now I'm using the llama-stream reverse proxy that was mentioned earlier. It seems to help a little, but it's still not as good as I'd like, and it's obviously clunky to have to run 2 servers just to get tool calling right.

edit: never mind - I was using ik_llama.cpp. I switched to vanilla llama.cpp and the flag is available. By adding the --jinja flag, tools are now working in Cline!
Thanks guys!

So if anyone can test, --jinja may be available on ik_llama.cpp now in this PR: https://github.com/ikawrakow/ik_llama.cpp/pull/677

I commented in the PR but I'll also mention it here: unfortunately that PR in its current state is unable to get the --jinja flag to pass using ik_llama.cpp.
edit: after getting the PR rebased and synced with the main branch, the --jinja flag now works with the changes in that PR. At least with GLM 4.5.
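In case it helps anyone reproduce that, fetching and rebasing a GitHub PR locally is roughly (a sketch; the local branch name is arbitrary, and the default branch is assumed to be main):

# grab the PR branch from GitHub and rebase it onto the default branch before building
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
git fetch origin pull/677/head:pr-677
git checkout pr-677
git rebase origin/main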
