Custom jinja template and draft model usage
I saw this report over on the ik_llama.cpp repo: https://github.com/ikawrakow/ik_llama.cpp/pull/677#issuecomment-3229022320 by one nimishchaudhari (don't know their hf account).
You can pass in your own chat-template jinja now to support newer/special/different tool calls and that kind of stuff. Also some more features are coming into ik_llama.cpp about OAI-compliant API endpoints with more tool call stuff: https://github.com/ikawrakow/ik_llama.cpp/pull/723
Curious if anyone else is having luck with the draft model, as early reports suggested it was not helping, showing at most maybe a 30% "acceptable tokens pass rate" or whatever it is that prints out on mainline.
A few comments on that user's command:
/home/nimish/Programs/ik_llama.cpp/build/bin/llama-server \
-m '/home/nimish/Dev/Models/ik_llama/GLM-4.5-Air-IQ2_KL.gguf' \
-ctkd q8_0 -ctvd q8_0 -md '/home/nimish/Dev/Models/ik_llama/GLM-4.5-DRAFT-0.6B-32k-Q4_0.gguf' \
-c 32768 \
--chat-template-file '/home/nimish/Dev/Models/ik_llama/glm_4.5_template.txt' --jinja \
-fa -fmoe -amb 512 \
-ctk q8_0 -ctv q8_0 \
-ngl 24 \
--gpu-layers-draft 99 \
--threads 8 \
--port ${PORT}
- I never realized -ctkd and -ctvd existed and that the draft model could have a different kv-cache type than the main model? I'll have to play with that next time.
- No need for -amb 512 here as that only applies to MLA models, e.g. DeepSeek 671B's and Kimi-K2's. It doesn't hurt, it just does nothing.
- Odd that the user has enough VRAM to offload all attn/shexp/first N dense layers but didn't do that. I would suggest they remove -ngl 24 and replace it with the standard MoE approach of:
-ngl 99 \
-ot exps=CPU \
And then add as many more routed exps onto the CUDAs as possible while keeping enough VRAM free for your desired context, with additional -ot ... = CUDA1 rules in the middle (before the exps=CPU catch-all), etc.
Hey, thanks for making this discussion @ubergarm . I tried the config you suggested (-ot exps=CPU, -ngl 99); my VRAM usage went drastically down and so did the token generation speed.
Is there a guide on how to configure "-ot" parameter or something?
A common command for a GLM-4.5-Air architecture model would be like so:
./build/bin/llama-sweep-bench \
--model GLM-Steam-106B-A12B-IQ3_KS-00001-of-00002.gguf \
-c 20480 \
-fa -rtr -fmoe \
--no-mmap \
-ngl 99 \
-ot "blk\.(1|2|3|4|5|6|7|8|9|10|11|12|13|14)\.ffn_.*=CUDA0" \
-ot exps=CPU \
-ub 4096 -b 4096 \
--threads 8 \
--warmup-batch
If you have multiple GPUs you would do like this:
-ngl 99 \
-ot "blk\.(1|2|3|4|5|6|7|8)\.ffn_.*=CUDA0" \
-ot "blk\.(9|10|11|12|13|14)\.ffn_.*=CUDA1" \
-ot exps=CPU \
etc...
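While tuning how many blk numbers go to each CUDA device, I'd keep an eye on VRAM so there is headroom left for the KV cache and compute buffers. Something like this works if you're on NVIDIA GPUs (just a quick sketch, adjust to taste):
# refresh GPU memory usage every second while the model loads and runs
watch -n 1 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
Then add or drop layers from the -ot lists until each card sits comfortably under its limit at your target context size.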
I had someone try the draft model, and most people have said it makes TG go slower, given the draft model takes up a little VRAM so they cannot offload as many routed exps layers...
So I was very surprised you said it goes much faster for you; I'm trying to figure out why, e.g. is it:
- you are using q8_0 kv-cache for both the draft and regular model (though f16 tends to have its TG speed drop off more slowly than q8_0 in some other testing)?
- your baseline with -ngl 24 was just so sub-optimal that using the draft helps it a lot?
Not sure what is going on! Some discussion on it on the ai beaver discord (join link on the model card) : https://discord.com/channels/1238219753324281886/1402010925354979380/1410666422886989914
Hey @ubergarm I tried your suggested config but I never managed to load the model. Here's my command (I'm on 24 GB + 6 GB of VRAM):
/home/nimish/Programs/ik_llama.cpp/build/bin/llama-server \
-m '/home/nimish/Dev/Models/ik_llama/GLM-4.5-Air-IQ2_KL.gguf' -ctk q8_0 -ctv q8_0 \
-c 32768 \
-fa -rtr -fmoe \
--no-mmap \
-ngl 99 \
-ot "blk\.(1|2|3|4|5|6|7|8|9|10)\.ffn_.*=CUDA0" \
-ot "blk\.(11|12|13)\.ffn_.*=CUDA1" \
-ot exps=CPU \
-ub 4096 -b 4096 \
--threads 8 \
--warmup-batch \
--port ${PORT}
llm_load_tensors: offloading 47 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 48/48 layers to GPU
llm_load_tensors: CPU buffer size = 27830.00 MiB
llm_load_tensors: CUDA_Host buffer size = 333.00 MiB
llm_load_tensors: CUDA0 buffer size = 11566.47 MiB
llm_load_tensors: CUDA1 buffer size = 3600.22 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 2652.02 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 544.00 MiB
llama_new_context_with_model: KV self size = 3196.00 MiB, K (q8_0): 1598.00 MiB, V (q8_0): 1598.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.16 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 2432.00 MiB on device 1: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 2550136832
llama_new_context_with_model: failed to allocate compute buffers
llama_init_from_gpt_params: error: failed to create context with model '/home/nimish/Dev/Models/ik_llama/GLM-4.5-Air-IQ2_KL.gguf'
============ Repacked 32 tensors
ERR [ load_model] unable to load model | tid="135385027604480" timestamp=1756458696 model="/home/nimish/Dev/Models/ik_llama/GLM-4.5-Air-IQ2_KL.gguf"
When I do -ngl 24, both my GPUs are approximately full and it gives me 32k context. I don't know what's going on lately, but in the last two builds of ik-llama.cpp the TG speed has dropped drastically; at least it works though.
Update:
Thanks a lot for the -ot parameters, I am now hitting some speeds I never had hit before
Prompt
- Tokens: 271
- Time: 3671.533 ms
- Speed: 73.8 t/s
Generation
- Tokens: 1969
- Time: 137719.307 ms
- Speed: 14.3 t/s
I'm on the following params:
-m '/home/nimish/Dev/Models/ik_llama/GLM-4.5-Air-IQ2_KL.gguf' -ctk q8_0 -ctv q8_0
--chat-template-file '/home/nimish/Dev/Models/ik_llama/glm_4.5_template.txt' --jinja
-c 32768
-fa -rtr -fmoe
--no-mmap
-ngl 99
-ot "blk.(1|2|3|4|5|6|7|8|9|10|11|12|13|15|16).ffn_.=CUDA0"
-ot "blk.(14).ffn_.=CUDA1" \
-ot exps=CPU
-ub 4096 -b 4096
--threads 8
--warmup-batch
--port ${PORT}
I tried using a draft model like I did earlier but now the performance is dropping a lot. I might try to restrict the draft model to GPU 2 (6 GB VRAM) to see if that helps unlock some even better speeds.
I appreciate your help @ubergarm . Thanks a lot.
Can I ask you for a guide / material link to understand these architectural differences? I would like to eventually contribute to ik-llama.cpp :)
Thanks,
Nimish
Thanks a lot for the -ot parameters, I am now hitting some speeds I never had hit before
Okay great! Glad you got it running. Similar to other reports I've heard, the draft model does not increase speed in most cases for now (it might depend on how closely your prompts match the training data of the draft model used).
Can I ask you for a guide / material link to understand these architectural differences? I would like to eventually contribute to ik-llama.cpp :)
Documentation is difficult to come by unfortunately. Your best bet is to read through the closed PRs where a new feature was introduced, as ik generally explains what is going on. I have a couple of rough guides for the basics of running ik_llama.cpp as well as cooking your own quants, but they are getting out of date already:
- https://github.com/ikawrakow/ik_llama.cpp/discussions/258 (some things in this are probably wrong now lol)
- https://github.com/ikawrakow/ik_llama.cpp/discussions/434
Feel free to open a discussion over there or hit me up if there is anything you're interested in working on or contributing. Sometimes people bring over features from llama.cpp to ik though the two forks have diverged quite a bit by now. Also testing new features like this would be welcome: https://github.com/ikawrakow/ik_llama.cpp/pull/723
Thanks!
Curious if anyone else is having luck with the draft model, as early reports suggested it was not helping, showing at most maybe a 30% "acceptable tokens pass rate" or whatever it is that prints out on mainline.
I haven't really tested it that much, but it should probably be higher than this so long as there is some repetitiveness (e.g. coding, summarisation, etc.).
The other thing working against the "air" model is the much smaller active parameters:
106 billion total parameters and 12 billion active parameters
as the potential gains from speculative decoding are relative to the ratio of the model sizes.
I'm not sure if using a draft for models with less than 20-30 billion active parameters will ever really show significant gains.
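To put a rough number on that (a back-of-envelope model of my own, ignoring verification and sampling overheads): with per-token acceptance rate alpha, draft length n, and a draft model costing a fraction c of a main-model forward pass, each round yields on average (1 - alpha^(n+1)) / (1 - alpha) tokens for the cost of one main pass plus n draft passes, so

speedup \approx \frac{1 - \alpha^{\,n+1}}{(1 - \alpha)\,(1 + n c)}

With alpha = 0.3, n = 16 and a 0.6B draft against ~12B active parameters (c ≈ 0.05), that works out to roughly 1 / (0.7 × 1.8) ≈ 0.8x, i.e. a net slowdown, which matches the reports above; the same n and c at alpha = 0.9 give roughly 0.83 / 0.18 ≈ 4.6x in theory. Acceptance rate dominates everything else.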
Just started using the draft with GLM-4.5 (non-Air) and it seems to be working really well for me, so I wonder if there is some bug for GLM-4.5-Air causing it to perform much worse, or if whoever reported it wasn't working was trying to use a draft for tasks that aren't very "draftable"?
For example, here is a 1-shot refactoring task of splitting a class into 2 classes / 2 files:
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 5425
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 5425, n_tokens = 5425, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 5425, n_tokens = 5425
slot release: id 0 | task 0 | stop processing: n_past = 14658, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 31298.75 ms / 5425 tokens ( 5.77 ms per token, 173.33 tokens per second)
eval time = 590306.05 ms / 9234 tokens ( 63.93 ms per token, 15.64 tokens per second)
total time = 621604.81 ms / 14659 tokens
slot print_timing: id 0 | task 0 |
draft acceptance rate = 0.94156 ( 7556 accepted / 8025 generated)
and that's probably about as good as it gets for draft models in llama.cpp... Even though GLM-4.5-Air has much fewer active parameters, you should still be getting a reasonable acceptance rate.
If people are still only getting ~30% acceptance rate on "draftable" tasks then there must be a bug IMO?
Hi all, I am not sure if this is on topic, but I do have a problem with GLM-4.5-Air and tool calling against a recent ik-llama master.
I am using the same template linked above and this is my cmd line on a single 3090:
LLAMA_ARGS="--host 0.0.0.0 --port 10434 --alias GLM-4.5 --model models/ubergarm/GLM-4.5-Air-GGUF/IQ5_KS/GLM-4.5-Air-IQ5_KS-00001-of-00002.gguf --ctx-size 32768 --jinja --chat-template-file chat-templates/GLM-4.5-Air-ik.jinja --n-gpu-layers 99 --no-mmap --n-cpu-moe 42"
Enabling flash attention and setting the caches to q8_0 does not help... I get some garbled text (some in Chinese).
Correction: disabling flash attention and leaving the caches alone is actually better, but then the machine (lol) does not call the tool, only listing it in the chat response.
Has anybody succeeded?
if whoever reported it wasn't working was trying to use a draft for tasks that aren't very "draftable"?
yeah in this case iirc the report was from a role player on the ai beavers discord server, so most likely they were not doing code refactoring, hence the lower-than-expected acceptance rate.
Hrmm, running with or without -fa should give similar quality output (but not using flash attention will likely be slower). -ctk q8_0 -ctv q8_0 will likely not hurt quality much either (in practice it is a noticeable but very small difference in perplexity for wiki.test.raw).
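If you want to sanity check that on your own setup, something like this works (just a sketch, swap in your own model path; I'm assuming your build names the perplexity tool llama-perplexity like the other binaries):
./build/bin/llama-perplexity -m GLM-4.5-Air-IQ5_KS-00001-of-00002.gguf -f wiki.test.raw -fa -ctk q8_0 -ctv q8_0 -ngl 99 -ot exps=CPU --threads 8
Run it once with the quantized cache and once without -ctk/-ctv and compare the final PPL numbers.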
What client are you using to make the tool calls, and is it hitting the correct API endpoint? I know a recent PR on ik_llama.cpp has different behavior depending on which endpoint you're hitting, similar to mainline, e.g. you might need to hit the /completions/ endpoint now for "open ai api compliant style" tool calling. The older behavior is still available on /v1/completions etc., but check the recent PR for details and feel free to open a discussion over there to see if other folks have some more details.
This is probably the best recent PR about tool calling in general and has the api endpoints mentioned: https://github.com/ikawrakow/ik_llama.cpp/pull/723
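If it helps with debugging, here is a minimal OpenAI-style tool call request you could send by hand (a sketch only: which endpoint path the new behavior lives on is described in the PR above, so double-check it there, and get_weather is just a made-up example function):
curl http://localhost:10434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-4.5",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
If the template and endpoint line up, you should get a tool_calls entry back rather than the model just describing the tool in the text.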
yeah in this case iirc the report was from a role player on ai beavers discord server
For creative tasks, the only time I've had a draft model actually help was with Mistral-Large-2407 + Mistral-7B-Instruct-v0.3 using exllamav2 (we're talking ~20 t/s -> ~35 t/s).
I also had a moderate speed boost with Command-A + command-r7b-12-2024 but only for repetitive / coding tasks. But a few things I noticed:
- Samplers have a huge impact. The more deterministic you make it, the more likely it will hit. Using logit bias, and fancy samplers like DRY really kills it.
- With llama.cpp, as you discovered, there's a draft model equivalent of a lot of the params (-ngld etc.). You probably want -cd as well if you're using -c for the main model.
- I got much better draft model performance in ExllamaV2 vs llama.cpp, and I suspect this is due to more sane defaults; the tabbyAPI config yaml is also easier to manage than CLI params.
We can probably find a way to improve things by tweaking e.g.:
--draft-max, --draft, --draft-n N number of tokens to draft for speculative decoding (default: 16)
--draft-p-min P minimum speculative decoding probability (greedy) (default: 0.8)
- With ExllamaV2 and the mistral combo, I saw no noticeable performance hit setting the draft model's cache to Q4
Given I use GLM for coding a lot, I'll have to try the draft model for this one.
p.s. Oh and don't forget -devd to choose which device(s) to put the draft model on.
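For reference, here is a rough sketch of how those draft knobs could fit together, using only flags that have come up in this thread (values are just starting points, and flag names may differ between mainline and ik_llama.cpp, so check --help on whichever you're running):
./build/bin/llama-server \
  -m GLM-4.5-Air-IQ2_KL.gguf -ngl 99 -ot exps=CPU \
  -md GLM-4.5-DRAFT-0.6B-32k-Q4_0.gguf \
  --gpu-layers-draft 99 -devd CUDA1 \
  -ctkd q8_0 -ctvd q8_0 \
  --draft-max 16 --draft-p-min 0.8
The idea being that -devd plus --gpu-layers-draft keeps the whole draft on the second GPU so it doesn't eat into the routed-exps budget on GPU 0, the draft kv-cache is quantized like in the earlier command, and lowering --draft-max is worth trying for less "draftable" tasks.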