Good job, your work is on time and so cool.
And good news: glm4moe is now supported by llama.cpp (b6085).
@huccjj yep, folks have been working together to get support added to (ik_)llama.cpp.
Be aware I'll end up deleting this existing EXPERIMENTAL GGUF and replacing it with new ones once this PR on ik_llama.cpp is updated and finalized: https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3152087986
Coming Soon :tm: haha... thanks!
I'll wait till the new one is out :) been downloading like crazy recently, thanks sir for the hard work!!
I'll wait till the new one is out :) been downloading like crazy recently, thanks sir for the hard work!!
The new Air is available, with instructions on how to get it working until support from the PR is merged into main of ik_llama.cpp.
And right!? so many models OMG haha.. But GLM-4.5 and Air version seem pretty good for the size. I hope to have some quants of the bigger one tomorrow and the imatrix is already cooking!
lol ok! i just downloaded and i am building ik. :)
I purchased 5x 5090s and I am planning to sell the 2x 6000 Pros. I cannot justify keeping them. I will still have plenty of processing power and 160GB of VRAM across the 5x 5090s though.
I will be able to run the GLM Air models on 3x 5090s, one of the R1-0528 models on 1x 5090, and some image generation on the remaining 1x 5090. I think I have this all set up how I want it. :)
Interesting, 3x 5090s for GLM Air...
I'm running a 6000 Pro myself and I'm curious if there's any noticeable performance boost using ik_llama.cpp over standard mainline llama.cpp for pure-CUDA inference, given that there are some MoE-specific optimizations?
I'd otherwise try hybrid inference on the larger GLM 4.5 lol, if it weren't for the fact that the big GPU is confined to my Windows gaming machine for the moment, and Windows is suboptimal for hybrid inference (there doesn't seem to be a way to avoid RAM OOM or paging out if the weights are larger than system RAM).
I'm running a 6000 Pro myself and I'm curious if there's any noticeable performance boost using ik_llama.cpp over standard mainline llama.cpp for pure-CUDA inference, given that there are some MoE-specific optimizations?
For a full-offload situation it can vary; a lot of the CUDA implementations are somewhat similar to mainline, however you have access to better quality quants and stuff like -fmoe. You can A/B test the specific quant and offload configuration using llama-sweep-bench (basically just replace your usual llama-server ... command, add --warmup-batch, and use like 20k context). You can use my mainline branch which has it too here: https://github.com/ubergarm/llama.cpp/tree/ug/port-sweep-bench
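For example, a run might look something like this (just a sketch; the model path, context, and offload flags are whatever you normally use, and drop -fmoe when running the mainline port):

./build/bin/llama-sweep-bench \
    -m /models/GLM-4.5-Air-IQ5_KS.gguf \
    -c 20480 \
    -ngl 99 \
    -fa -fmoe \
    --warmup-batch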
For Windows, right, some folks have complained about multi-GPU issues, and yeah I'd not expect it to handle the paging/full-RAM situation as gracefully (maybe disabling swap on Windows somehow would help?). There are some Windows builds here if you want to test: https://github.com/Thireus/ik_llama.cpp/releases but I'm not sure if the new GLM-4.5 branch that just got merged into main is in that release yet.
Keep us posted what you find and feel free to share your full commands for workshopping etc!
Interestingly enough, running the same Q5_K_S quant on ik_llama.cpp instead of llama.cpp with pure CUDA inference resulted in both slower prompt processing and slower token gen. I essentially took my llama.cpp config and removed all unsupported args and added -fmoe.
"C:\ML\ik_llama.cpp\build\bin\Release\llama-server.exe" -m "C:\ML\GGUF\GLM-4.5-Air-Q5_K_S.gguf" -c 131072 -ngl 999 -fa -fmoe --host 0.0.0.0 --port 5000 -a GLM-4.5-Air-Q5_K_S --no-mmap
Not sure if there are other optimizations that are needed here.
Regarding CPU+GPU inference on Windows, the main problem I've noticed is that for some reason, layers/tensors offloaded to the GPU still occupy system RAM. Not really sure why. This excess memory consumption shows up as "in use" if mmap is enabled, and "committed" if mmap is disabled; regardless, the only solution is to fall back to the pagefile if the model is larger than your system RAM alone. This probably isn't an issue if you have a big server board with 768GB of RAM or something (although I'm not sure why Windows would be installed on such a box lol), but with a consumer CPU/mobo with less RAM it's more problematic.
I essentially took my llama.cpp config and removed all unsupported args and added -fmoe.
Not sure if there are other optimizations that are needed here.
Right, in general Linux seems to be giving better results than Windows from the reports I hear, especially multi-GPU, but you're using a single 96GB VRAM Blackwell, right? Not sure if you need to explicitly compile with flags for that arch (sm120), and I also always use -DGGML_SCHED_MAX_COPIES=1, which may not affect your setup much in this case though.
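For reference, an explicit build for that card might look like this (a sketch; the "120" arch value for Blackwell is my assumption, check what your CUDA toolkit expects):

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="120" -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j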
As to your command, you will want to add a few things to boost PP a lot:
"C:\ML\ik_llama.cpp\build\bin\Release\llama-server.exe" -m "C:\ML\GGUF\GLM-4.5-Air-Q5_K_S.gguf" -c 131072 -ngl 999 -fa -fmoe --host 0.0.0.0 --port 5000 -a GLM-4.5-Air-Q5_K_S --no-mmap`
Given you're fully offloading, try adding: -ub 4096 -b 4096 -t 1. You'll probably get more PP speed with the increased batches, and setting CPU threads to 1 helps a few % when fully offloaded.
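So roughly, taking your command and adding those flags:

"C:\ML\ik_llama.cpp\build\bin\Release\llama-server.exe" -m "C:\ML\GGUF\GLM-4.5-Air-Q5_K_S.gguf" -c 131072 -ngl 999 -fa -fmoe -ub 4096 -b 4096 -t 1 --host 0.0.0.0 --port 5000 -a GLM-4.5-Air-Q5_K_S --no-mmap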
Hmmm -ub 4096 -b 4096 makes me go OOM due to the larger compute buffers (I was already on the edge of full VRAM with the default batch settings). I can't in any situation get ik_llama.cpp to be performant on my setup. It's not a small gap either - mainline smokes ik_llama.cpp here.
Mainline llama.cpp e54d41be:
./llama-bench.exe -m "C:\ML\GGUF\GLM-4.5-Air-Q5_K_S.gguf" -p 8192 -n 512 -t 1 --n-gpu-layers 999 -fa 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | threads | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|
| glm4moe 106B.A12B Q5_K - Small | 72.88 GiB | 110.47 B | CUDA | 999 | 1 | 1 | pp8192 | 2011.47 Β± 25.36 |
| glm4moe 106B.A12B Q5_K - Small | 72.88 GiB | 110.47 B | CUDA | 999 | 1 | 1 | tg512 | 96.17 Β± 0.22 |
ik_llama.cpp 58f3bda0:
./llama-bench.exe -m "C:\ML\GGUF\GLM-4.5-Air-Q5_K_S.gguf" -p 8192 -n 512 -t 1 --n-gpu-layers 999 -fa 1 -fmoe 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | threads | fa | fmoe | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| glm4moe 106B.A12B Q5_K - Small | 70.52 GiB | 106.85 B | CUDA | 999 | 1 | 1 | 1 | pp8192 | 1057.01 Β± 5.70 |
| glm4moe 106B.A12B Q5_K - Small | 70.52 GiB | 106.85 B | CUDA | 999 | 1 | 1 | 1 | tg512 | 62.14 Β± 0.55 |
Interesting, a few thoughts:
- guessing you can't get -ub 1024 or -ub 2048 to fit either then?
- are you compiling yourself with the sm120 arch flags?
- you seem to be using a Q5_K quant and not my ik_llama.cpp quants in your test; the quant used can affect speed, especially PP. The KT quants are probably the best quality for the smallest size in a full GPU offload situation like this. They are trellis quants like the QTIP paper, similar to EXL3. If you've not tried exllamav3, they may have added GLM support recently with their EXL3 quants: https://huggingface.co/turboderp/GLM-4.5-Air-exl3, but they might be a bit small for your rig. You might be able to fit my IQ5_KS, which could offer better speeds or similar speeds at better quality than a mainline Q5_K quant. Keep in mind that mainline Q5_K_S is not at all the same as my IQ5_KS. Confusing, I know, hah.
Thanks for all the testing and also you might consider llama-sweep-bench for a better look at the speed curve across more points than just tg512 etc.
Finally, a Q4_0/Q6_0 quant mix with the Vulkan backend might go even faster for TG hah... I haven't made one yet, but in testing smaller MoEs, q4_0 and q8_0 are very performant compared with CUDA, which blew my mind.
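If anyone wants to cook such a mix themselves, something along these lines should work with llama-quantize (a sketch; I'm assuming the tensor-type override flags accept q6_0 on an ik_llama.cpp build):

# mostly q4_0, with embeddings/output kept at q6_0
./build/bin/llama-quantize \
    --token-embedding-type q6_0 \
    --output-tensor-type q6_0 \
    GLM-4.5-Air-BF16.gguf GLM-4.5-Air-Q4_0-Q6_0.gguf q4_0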
Here is an example of llama-sweep-bench output:
Both mainline and ik binaries are compiled locally via native arch detection (which would only be sm_120 since no other GPU is installed).
I used Q5_K_S because I wanted to do a head-to-head comparison with mainline llama.cpp (your IQ5_KS is almost exactly the same size but IIUC mainline wouldn't run that quant at all). Is ik_llama.cpp expected to only provide improved performance with ik_llama.cpp-specific quants and not with ones built for llama.cpp main?
I've already quanted this model to 5.0bpw-h6 exl3 myself and tested that - the token gen speed is fairly slow despite being a smaller quant (compared to ~5.6bpw for Q5_K_S), capping out around 42 T/s.
Is ik_llama.cpp expected to only provide improved performance with ik_llama.cpp-specific quants and not with ones built for llama.cpp main?
So each quantization type has different performance depending on backend, e.g. CUDA, Vulkan (only mainline quants work on ik's Vulkan for now), CPU avx_vnni, CPU avx2, CPU NEON, etc. It's hard to give a blanket statement. At least for CPU/RAM backends ik tends to be faster across the board.
Oh cool thanks for testing the EXL3 and giving some info there too. I've not found an easy way to compare perplexity between exllamav3 and ik_llama.cpp but you can compare with mainline reasonably well using the built in eval tools.
I rebuilt both ik_llama.cpp and mainline llama.cpp, manually specifying arch 120, and ran llama-sweep-bench on the same Q5_K_S quant. The llama-sweep-bench for mainline from your fork doesn't seem to support --warmup-batch so I'm not sure if that's contributing to the weird shape of the graphs lol.
Also may I ask what the purpose of -DGGML_SCHED_MAX_COPIES=1 is?
Mainline llama.cpp:
./llama-sweep-bench.exe -m "C:\ML\GGUF\GLM-4.5-Air-Q5_K_S.gguf" -c 32768 -ngl 999 -fa --no-mmap -t 1

ik_llama.cpp:
./llama-sweep-bench.exe -m "C:\ML\GGUF\GLM-4.5-Air-Q5_K_S.gguf" -c 32768 -ngl 999 -fa -fmoe --no-mmap -t 1 --warmup-batch

I rebuilt both ik_llama.cpp and mainline llama.cpp, manually specifying arch 120, and ran llama-sweep-bench on the same Q5_K_S quant.
Oh nice, thanks for running that mainline quant on both ik and mainline lcpp. I'm curious how it would perform against one of my quants, though you can't test my quants on mainline.
your fork doesn't seem to support --warmup-batch
My mainline port has the warmup built in, as I didn't want to add an argument outside of the single cpp file, so it behaves the same as ik with --warmup-batch. It shouldn't have anything to do with that cliff at ~8k context, which could have something to do with how the mainline CUDA kernel is utilizing the GPU resources.
Also may I ask what the purpose of -DGGML_SCHED_MAX_COPIES=1 is?
By default it is 4, and mainly I recommend it for multi-GPU situations, as without it on some models it causes the CUDA buffers to allocate way too big and OOM. So with it, it does not OOM, but I forget exactly which models it affects (possibly MLA types like DeepSeek, but I just recommend it across the board now to keep things simple hah). Sometimes on startup it will print out debugging like llama_new_context_with_model: pipeline parallelism enabled (n_copies=1), showing the value that was compiled in. You could look at the early PRs adding it on llama.cpp to find out more details about why it was originally implemented and what they were trying to do.
The title on this next one is wrong; it is really an IQ2_KT trellis quant. Keep in mind that DeepSeek has more active weights at 37B over GLM-4.5-Air's 12B. Interesting to see it running on 2x RTX 6000 PRO Blackwells:
Might want to link this comment on one of the relevant ik_llama.cpp discussions or the closed GLM-4.5 PR and mention you did a comparison using a mainline quant.
Here's another one with your IQ5_K_S on ik_llama.cpp:
./llama-sweep-bench.exe -m "C:\ML\GGUF\GLM-4.5-Air-IQ5_KS-00001-of-00002.gguf" -c 32768 -ngl 999 -fa -fmoe --no-mmap -t 1 --warmup-batch
It's not significantly different from the plain Q5_K_S quant in terms of speed. I'm not sure if there's something in particular about my rig that makes ik so much slower than mainline lol. It's a 7950X with 128GB DDR5 on Windows 11 with a single RTX PRO 6000 96GB. Also are there significant CUDA performance improvements on mainline that haven't been ported to ik?
IQ5_K_S
Thanks much for running my IQ5_KS (which is very different from IQ5_K_S haha...).
Also are there significant CUDA performance improvements on mainline that haven't been ported to ik?
Right, I'm wondering if something changed on mainline recently too. My impression is JG (Johannes Gaessler) does most of the CUDA stuff on mainline lcpp. ik was on vacation for a couple weeks and if something did change it would be good to let him know I guess. If I have some time I'll try to repro on 2x RTX 6000 (non-Pro older versions w/ 48GB VRAM each) with a smaller quant.
It could possibly be something with Windows and CUDA drivers there, but it does warrant checking into, as in general for CUDA implementations ik's fork is similar to mainline for most quants, trading blows depending on the exact quants used. It usually is not this far behind.
I mentioned this thread over on ik_llama.cpp github here: https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3172722318 if you want to join in there or follow along.
I mentioned this thread over on ik_llama.cpp github here:
Thanks for mentioning it there, else I wouldn't have known.
The benchmark results are on Windows, right?
It would be useful to have someone with the same/similar configuration confirm on Linux. I have heard from several users that ik_llama.cpp and also llama.cpp both run significantly slower on Windows. Perhaps mainline has been able to resolve the Windows issue, and this is the reason it is now faster?
The benchmark results are on Windows, right?
Correct, these benchmarks came from a Windows rig with a single RTX 6000 Pro Blackwell GPU, with the binaries compiled by them explicitly specifying arch 120.
Maybe @Thireus can test it out as well.
I can confirm the observations made by @Doctor-Shotgun. The mainline llama.cpp is much faster on NVIDIA RTX PRO 6000 Blackwell Workstation Edition!
llama.cpp - Windows build used: https://github.com/Thireus/llama.cpp/releases/tag/b6130
$ CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 ~/llama-b6130-bin-win-cuda-12.8-x64/llama-bench -m GLM-4.5-Air-THIREUS-BF16-SPECIAL_TENSOR-00001-of-00804.gguf -fa 1 \
-ctk f16 \
-ngl 99 \
-b 4096 -ub 4096 \
--threads 36 \
--main-gpu 0 -p 8192 -n 512 --mmap 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from C:\cygwin64\home\Thireus\llama-b6130-bin-win-cuda-12.8-x64\ggml-cuda.dll
load_backend: loaded RPC backend from C:\cygwin64\home\Thireus\llama-b6130-bin-win-cuda-12.8-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\cygwin64\home\Thireus\llama-b6130-bin-win-cuda-12.8-x64\ggml-cpu-skylakex.dll
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | ---: | --------------: | -------------------: |
| glm4moe 106B.A12B IQ1_S - 1.5625 bpw | 51.14 GiB | 110.47 B | CUDA,RPC | 99 | 36 | 4096 | 4096 | 1 | 0 | pp8192 | 1990.88 Β± 6.36 |
| glm4moe 106B.A12B IQ1_S - 1.5625 bpw | 51.14 GiB | 110.47 B | CUDA,RPC | 99 | 36 | 4096 | 4096 | 1 | 0 | tg512 | 103.09 Β± 0.15 |
build: d7b5465f (6130)
ik_llama.cpp - Windows build used: https://github.com/Thireus/ik_llama.cpp/releases/tag/main-b4071-cd0d7f0
$ CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 ~/ik_llama-main-b4071-cd0d7f0-bin-win-cuda-12.8-x64-avx512/llama-bench -m GLM-4.5-Air-THIREUS-BF16-SPECIAL_TENSOR-00001-of-00804.gguf -fa 1 \
-ctk f16 \
-ngl 99 \
-b 4096 -ub 4096 \
--threads 36 \
--main-gpu 0 -p 8192 -n 512 --mmap 0 -fmoe 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | mmap | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | ---: | ---: | ------------: | ---------------: |
| glm4moe 106B.A12B IQ1_S - 1.5625 bpw | 49.30 GiB | 106.85 B | CUDA | 99 | 36 | 4096 | 4096 | 1 | 0 | 1 | pp8192 | 2458.72 Β± 99.73 |
| glm4moe 106B.A12B IQ1_S - 1.5625 bpw | 49.30 GiB | 106.85 B | CUDA | 99 | 36 | 4096 | 4096 | 1 | 0 | 1 | tg512 | 41.49 Β± 0.05 |
build: cd0d7f0 (1)
Recipe used:
## Quant mix recipe created using Thireus' GGUF Tool Suite - https://gguf.thireus.com/
# Model name: GLM-4.5-Air
# Link to the original model: https://huggingface.co/zai-org/GLM-4.5-Air
## Model head & embeddings - qbits: 32 8 5
output_norm\.weight=f32
token_embd\.weight=q5_K
output\.weight=q8_0
## Multi-headed attention parameters - qbits: 32 4
blk\.([0-9]|[1-3][0-9]|4[0-6])\.attn_k\.bias=f32
blk\.([0-9]|[1-3][0-9]|4[0-6])\.attn_output\.weight=iq4_xs
blk\.([0-9]|[1-3][0-9]|4[0-6])\.attn_k\.weight=iq4_xs
blk\.([0-9]|[1-3][0-9]|4[0-6])\.attn_q\.weight=iq4_xs
blk\.([0-9]|[1-3][0-9]|4[0-6])\.attn_q\.bias=f32
blk\.([0-9]|[1-3][0-9]|4[0-6])\.attn_v\.bias=f32
blk\.([0-9]|[1-3][0-9]|4[0-6])\.attn_v\.weight=iq4_xs
blk\.([0-9]|[1-3][0-9]|4[0-6])\.attn_norm\.weight=f32
## Core FFN weights - qbits: 32 8
blk\.0\.ffn_gate\.weight=q8_0
blk\.([1-9]|[1-3][0-9]|4[0-6])\.ffn_gate_inp\.weight=f32
blk\.0\.ffn_down\.weight=q8_0
blk\.0\.ffn_up\.weight=q8_0
## Other tensors - qbits: 32 4
blk\.([0-9]|[1-3][0-9]|4[0-6])\.post_attention_norm\.weight=f32
blk\.46\.nextn\.shared_head_head\.weight=iq4_xs
blk\.46\.nextn\.embed_tokens\.weight=iq4_xs
blk\.46\.nextn\.shared_head_norm\.weight=f32
blk\.([1-9]|[1-3][0-9]|4[0-6])\.exp_probs_b\.bias=f32
blk\.46\.nextn\.enorm\.weight=f32
blk\.46\.nextn\.hnorm\.weight=f32
blk\.46\.nextn\.eh_proj\.weight=iq4_xs
## GPU-loaded ffn_*_shexp
# ffn_down_shexp (down-projection) - qbits: 4
blk\.([1-9]|[1-3][0-9]|4[0-6])\.ffn_down_shexp\.weight=iq4_xs
# ffn_up_shexp (up-projection) - qbits: 8 6 5
blk\.(1|[3-4]|[7-8]|17|19|30|32|38|2[3-5]|4[0-5]|2[8-9]|2[0-1]|1[2-3]|3[4-5])\.ffn_up_shexp\.weight=q8_0
blk\.([5-6]|22|31|33|39|46|[2-3][6-7]|1[5-6]|1[0-1])\.ffn_up_shexp\.weight=q6_K
blk\.(2|9|14|18)\.ffn_up_shexp\.weight=q5_K
# ffn_gate_shexp (gate-projection) - qbits: 8 6 5
blk\.([1-3]|[6-9]|19|20|30|33|4[0-5]|1[0-2]|2[3-9])\.ffn_gate_shexp\.weight=q8_0
blk\.(4|13|18|46|[2-3][1-2]|3[4-9]|1[5-6])\.ffn_gate_shexp\.weight=q6_K
blk\.(5|14|17)\.ffn_gate_shexp\.weight=q5_K
## CPU-loaded ffn_*_exps
# ffn_down_exps (down-extraction) - qbits: 4
blk\.([1-9]|[1-3][0-9]|4[0-6])\.ffn_down_exps\.weight=iq4_xs
# ffn_up_exps (up-extraction) - qbits: 4 3 2
blk\.(25|43|[2-3][7-9]|3[1-5])\.ffn_up_exps\.weight=q4_K
blk\.46\.ffn_up_exps\.weight=iq4_xs
blk\.(3|[5-6]|10|15|26|30|36|4[0-2]|2[0-4]|1[8-9]|1[2-3]|4[4-5])\.ffn_up_exps\.weight=iq3_xxs
blk\.([1-2]|4|[7-9]|11|14|1[6-7])\.ffn_up_exps\.weight=q2_K
# ffn_gate_exps (gate-extraction) - qbits: 4 3 2
blk\.(10|40|3[1-4]|3[7-9]|2[6-9]|4[4-5])\.ffn_gate_exps\.weight=q4_K
blk\.46\.ffn_gate_exps\.weight=iq4_xs
blk\.(3|6|30|2[0-5]|3[5-6]|4[1-3]|1[3-9])\.ffn_gate_exps\.weight=iq3_xxs
blk\.([1-2]|[4-5]|[7-9]|1[1-2])\.ffn_gate_exps\.weight=q2_K
## Summary of tensor sizes per class
# GPU Total: 5.015 GiB (94.9%) | 5.29 GiB max, if all were q8_0 | 4.64 GiB min, if all were q5_K
# CPU Total: 44.951 GiB (84.3%) | 53.32 GiB max, if all were q4_K | 38.82 GiB min, if all were q2_K
# GPU+CPU Total: 49.966 GiB (89.6%)
## Summary of tensor counts and bpw per qtype
#
# GPU-loaded quants:
# QTYPE Count BPW Assigned GiB % Assigned Max GiB (all)
# +f32 331 32.0 0.09 GiB - -
# +q8_0 1 8.5 0.04 GiB - -
# q8_0 57 8.5 1.01 GiB 54.9% 1.84
# q6_K 31 6 0.14 GiB 9.6% 1.42
# q5_K 8 5 0.42 GiB 35.5% 1.19
# +iq4_xs 237 4.25 3.31 GiB - -
#
# CPU-loaded quants:
# QTYPE Count BPW Assigned GiB % Assigned Max GiB (all)
# +iq4_xs 48 4.25 18.52 GiB - -
# q4_K 28 4 10.83 GiB 31.1% 34.80
# iq3_xxs 43 3.0625 11.32 GiB 47.8% 23.69
# q2_K 19 2 4.29 GiB 21.1% 20.30
#
# -Average BPW: 3.5907
#
# -Notes:
# - '+' means user-defined pre-assigned tensors, or tensor missing from csv data or f32 tensors
# - Recipe produced on the 2025-08-10 20:48:36 UTC+0000 using Thireus' GGUF tools (https://gguf.thireus.com/)
# - Script SHA-256: a02563df96ccec6c78ab7c716771153ae5f5ef4e9ee6a04d372f273eb1662e9c
# - Calibration dataset 'ppl_results.csv' SHA-256: c596235f01c582988d23f97e1e6809a83923ae3f5321e3cde00625c9c92952f3
# - tensors.bf16.map SHA-256: f440313db9b7ce593240c0b0acb723182ee3ae9570eca868dc6eb440112fdd67
# - tensors.bf16.map model name: GLM-4.5-Air-THIREUS-BF16-SPECIAL_TENSOR-00804-of-00804
# - tensors.q4_K.map SHA-256: a08c1fc62d315b440330991c1070e3e1b380bd97a29567eb0554f211e6bb7b8d
# - tensors.q4_K.map model name: GLM-4.5-Air-THIREUS-Q4_K-SPECIAL_TENSOR-00804-of-00804
# - tensors.iq3_xxs.map SHA-256: 772a3925f63494986e1b580feae65f973ff74e9974207791f6bf05563dc239c3
# - tensors.iq3_xxs.map model name: GLM-4.5-Air-THIREUS-IQ3_XXS-SPECIAL_TENSOR-00804-of-00804
# - tensors.q2_K.map SHA-256: 35835d7d56a8161d07c31c73f42008cba7c7e2025d4bb60be2a82b254497f183
# - tensors.q2_K.map model name: GLM-4.5-Air-THIREUS-Q2_K-SPECIAL_TENSOR-00804-of-00804
# - tensors.q8_0.map SHA-256: c00093e70a6c32aab72b404457c12a7b238b0e030975267d93d2b09a30796151
# - tensors.q8_0.map model name: GLM-4.5-Air-THIREUS-Q8_0-SPECIAL_TENSOR-00804-of-00804
# - tensors.q5_K.map SHA-256: b60aadca788055846a572cad5121e1d93bfa9bbbd520ae6350c84f52319f945f
# - tensors.q5_K.map model name: GLM-4.5-Air-THIREUS-Q5_K-SPECIAL_TENSOR-00804-of-00804
# - tensors.q6_K.map SHA-256: 5165939ae192b9008b49432f574da6df0a8df9989faf337cc3a062d04f80aef2
# - tensors.q6_K.map model name: GLM-4.5-Air-THIREUS-Q6_K-SPECIAL_TENSOR-00804-of-00804
# - tensors.iq2_ks.map SHA-256: efed8f3d7d712a6ad99c5904f6e2f4b89387cc78e4008d9ca557bd04da1f2b31
# - tensors.iq2_ks.map model name: GLM-4.5-Air-THIREUS-IQ2_KS-SPECIAL_TENSOR-00804-of-00804
# - tensors.iq4_xs.map SHA-256: 28c799175c45409d6f59d609e82c5f0ed2bba3240b7c5697afbdc76824b1b046
# - tensors.iq4_xs.map model name: GLM-4.5-Air-THIREUS-IQ4_XS-SPECIAL_TENSOR-00804-of-00804
# - GPG signatures: PASSED
# - Command used:
# ../../quant_assign.py ppl_results.csv --tolerance 0.01 --cpu-irq-k 1.5 --gpu-irq-k 1.5 --gpu-assign-qtype iq4_xs \
# --cpu-tensors-max-size 45 --gpu-tensors-max-size 95% --exponential-factor 8 --cpu-tensors \
# 'blk\.([1-9]|[1-3][0-9]|4[0-5])\.ffn_up_exps\.weight' 'blk\.([1-9]|[1-3][0-9]|4[0-5])\.ffn_gate_exps\.weight' \
# --gpu-tensors '.*' --cpu-quants q4_K iq3_xxs q2_K --gpu-quants q8_0 q5_K q6_K --cpu-assign-tensors \
# 'blk\.(46)\.ffn_up_exps\.weight=iq4_xs' 'blk\.(46)\.ffn_gate_exps\.weight=iq4_xs' \
# 'blk\.([1-9]|[1-3][0-9]|4[0-6])\.ffn_down_exps\.weight=iq4_xs' --gpu-assign-tensors \
# 'blk\.(0)\.ffn_down\.weight=q8_0'
## THE END!
Edit: Forgot -fmoe 1 for ik_llama.cpp, that improves the results but still far away from llama.cpp.
Edit2: Also forgot -b 4096 -ub 4096 for ik_llama.cpp... bench results edited. This brings PP t/s above llama.cpp.
Interesting, so ik can potentially exceed mainline's prompt processing speed at high prompt batch sizes, but is slower at the default (512?), while the token gen speed remains significantly slower than mainline in this hardware setup? Would be interesting to see the sweep bench with these settings as well. Unfortunately I don't have the vram to increase -b 4096 -ub 4096 with Q5_K_S or IQ5_KS while keeping the same cache lol.
@Thireus
Your quantization recipe leads to -fmoe not being used for many layers. -fmoe requires the type of ffn_up_exps to be the same as the type of ffn_gate_exps in a given layer. It is of course possible to generalize, but you are the first "quant cook" to set the quantization type of ffn_up_exps and ffn_gate_exps fully independently from each other. With a more sane quantization recipe ik_llama.cpp performance will improve. Another thing is that you shouldn't be using 36 threads with a model fully offloaded to the GPU. In that case I use 1 CPU thread. Mainline uses a thread pool, while ik_llama.cpp starts the CPU threads for each generated token. Hence, unlike mainline, ik_llama.cpp pays thread creation cost for each token. I'm curious to see if using 1 CPU thread will have a significant impact on CPU performance.
But if TG performance difference remains after considering the above points, my guess is that mainline's advantage on Windows is due to usage of CUDA graphs. In ik_llama.cpp CUDA graphs are disabled for MoE models. CUDA graphs for MoE models were enabled in mainline relatively recently (~2 months ago?). I haven't put in the effort to do so because on Linux (which is my development platform) this makes a barely noticeable difference. But on Windows it is known that GPU kernel launches are very expensive, so I guess it makes a significant difference there.
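To make that concrete with the recipe syntax above, pulling two example layers out of the posted recipe (as I read the regexes): layer 30 has matching up/gate expert types so -fmoe can fuse it, while layer 14 mixes types so it cannot:

# layer 30: ffn_up_exps and ffn_gate_exps share a type -> eligible for -fmoe fusion
blk\.30\.ffn_up_exps\.weight=iq3_xxs
blk\.30\.ffn_gate_exps\.weight=iq3_xxs
# layer 14: mismatched types -> falls back to the unfused path
blk\.14\.ffn_up_exps\.weight=q2_K
blk\.14\.ffn_gate_exps\.weight=iq3_xxs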
@ikawrakow
- --threads 1 does not make a difference on my Intel platform (tested numerous times before) and in this context; also, using all CPU threads leads to better benchmarks when some layers are on the CPU (again, this is Intel-specific).
Evidence:
$ CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 ~/ik_llama-main-b4071-cd0d7f0-bin-win-cuda-12.8-x64-avx512/llama-bench -m GLM-4.5-Air-THIREUS-BF16-SPECIAL_TENSOR-00001-of-00804.gguf -fa 1 \
-ctk f16 \
-ngl 99 \
-b 4096 -ub 4096 \
--threads 1 \
--main-gpu 0 -p 8192 -n 512 --mmap 0 -fmoe 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | mmap | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | ---: | ---: | ------------: | ---------------: |
| glm4moe 106B.A12B IQ1_S - 1.5625 bpw | 49.30 GiB | 106.85 B | CUDA | 99 | 1 | 4096 | 4096 | 1 | 0 | 1 | pp8192 | 2442.39 Β± 130.76 |
| glm4moe 106B.A12B IQ1_S - 1.5625 bpw | 49.30 GiB | 106.85 B | CUDA | 99 | 1 | 4096 | 4096 | 1 | 0 | 1 | tg512 | 41.42 Β± 0.11 |
Same outcome on llama.cpp:
$ CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 ~/llama-b6130-bin-win-cuda-12.8-x64/llama-bench -m GLM-4.5-Air-THIREUS-BF16-SPECIAL_TENSOR-00001-of-00804.gguf -fa 1 \
-ctk f16 \
-ngl 99 \
-b 4096 -ub 4096 \
--threads 1 \
--main-gpu 0 -p 8192 -n 512 --mmap 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from C:\cygwin64\home\Thireus\llama-b6130-bin-win-cuda-12.8-x64\ggml-cuda.dll
load_backend: loaded RPC backend from C:\cygwin64\home\Thireus\llama-b6130-bin-win-cuda-12.8-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\cygwin64\home\Thireus\llama-b6130-bin-win-cuda-12.8-x64\ggml-cpu-skylakex.dll
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | ---: | --------------: | -------------------: |
| glm4moe 106B.A12B IQ1_S - 1.5625 bpw | 51.14 GiB | 110.47 B | CUDA,RPC | 99 | 1 | 4096 | 4096 | 1 | 0 | pp8192 | 1984.37 Β± 2.70 |
| glm4moe 106B.A12B IQ1_S - 1.5625 bpw | 51.14 GiB | 110.47 B | CUDA,RPC | 99 | 1 | 4096 | 4096 | 1 | 0 | tg512 | 103.06 Β± 0.16 |
build: d7b5465f (6130)
I can confirm that your harmonizing suggestion bumps the performance when using -fmoe 1 (only available on ik_llama.cpp; no perf improvement observed on llama.cpp). I've set all ffn tensors to iq3_xxs:
$ CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 ~/ik_llama-main-b4071-cd0d7f0-bin-win-cuda-12.8-x64-avx512/llama-bench -m GLM-4.5-Air-THIREUS-BF16-SPECIAL_TENSOR-00001-of-00804.gguf -fa 1 \
-ctk f16 \
-ngl 99 \
-b 4096 -ub 4096 \
--threads 36 \
--main-gpu 0 -p 8192 -n 512 --mmap 0 -fmoe 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | mmap | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | ---: | ---: | ------------: | ---------------: |
| glm4moe 106B.A12B IQ1_S - 1.5625 bpw | 46.20 GiB | 106.85 B | CUDA | 99 | 36 | 4096 | 4096 | 1 | 0 | 1 | pp8192 | 2932.89 Β± 91.90 |
| glm4moe 106B.A12B IQ1_S - 1.5625 bpw | 46.20 GiB | 106.85 B | CUDA | 99 | 36 | 4096 | 4096 | 1 | 0 | 1 | tg512 | 45.92 Β± 0.06 |
This is a 20% gain on PP t/s and 11% gain on TG t/s. I'll most certainly add an option in my tool suite to create harmonised quant recipes when users plan to use fmoe 1. Thank you.
llama_new_context_with_model: graph nodes = 2149
llama_new_context_with_model: graph splits = 2
Full logs:
$ CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 ~/ik_llama-main-b4071-cd0d7f0-bin-win-cuda-12.8-x64-avx512/llama-sweep-bench -m GLM-4.5-Air-THIREUS-BF16-SPECIAL_TENSOR-00001-of-00804.gguf -fa \
-fmoe \
-ctk f16 \
-c 131072 \
-ngl 99 \
-b 4096 -ub 4096 \
--warmup-batch \
--no-mmap \
--threads 36 \
--main-gpu 0
llama_model_loader: max stdio successfully set to 2048
llama_model_loader: additional 803 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 44 key-value pairs and 803 tensors from GLM-4.5-Air-THIREUS-BF16-SPECIAL_TENSOR-00001-of-00804.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = glm4moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = GLM 4.5 Air
llama_model_loader: - kv 3: general.size_label str = 128x9.4B
llama_model_loader: - kv 4: general.license str = mit
llama_model_loader: - kv 5: general.tags arr[str,1] = ["text-generation"]
llama_model_loader: - kv 6: general.languages arr[str,2] = ["en", "zh"]
llama_model_loader: - kv 7: glm4moe.block_count u32 = 47
llama_model_loader: - kv 8: glm4moe.context_length u32 = 131072
llama_model_loader: - kv 9: glm4moe.embedding_length u32 = 4096
llama_model_loader: - kv 10: glm4moe.feed_forward_length u32 = 10944
llama_model_loader: - kv 11: glm4moe.attention.head_count u32 = 96
llama_model_loader: - kv 12: glm4moe.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: glm4moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 14: glm4moe.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: glm4moe.expert_used_count u32 = 8
llama_model_loader: - kv 16: glm4moe.attention.key_length u32 = 128
llama_model_loader: - kv 17: glm4moe.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 24
llama_model_loader: - kv 19: glm4moe.rope.dimension_count u32 = 64
llama_model_loader: - kv 20: glm4moe.expert_count u32 = 128
llama_model_loader: - kv 21: glm4moe.expert_feed_forward_length u32 = 1408
llama_model_loader: - kv 22: glm4moe.expert_shared_count u32 = 1
llama_model_loader: - kv 23: glm4moe.leading_dense_block_count u32 = 1
llama_model_loader: - kv 24: glm4moe.expert_gating_func u32 = 2
llama_model_loader: - kv 25: glm4moe.expert_weights_scale f32 = 1.000000
llama_model_loader: - kv 26: glm4moe.expert_weights_norm bool = true
llama_model_loader: - kv 27: glm4moe.nextn_predict_layers u32 = 1
llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 29: tokenizer.ggml.pre str = glm4
llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151552] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151552] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,318088] = ["βΓ‘ βΓ‘", "βΓ‘ βΓ‘βΓ‘βΓ‘", "βΓ‘βΓ‘ βΓ‘βΓ‘", "...
llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151329
llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151329
llama_model_loader: - kv 35: tokenizer.ggml.bos_token_id u32 = 151331
llama_model_loader: - kv 36: tokenizer.ggml.eot_token_id u32 = 151336
llama_model_loader: - kv 37: tokenizer.ggml.unknown_token_id u32 = 151329
llama_model_loader: - kv 38: tokenizer.ggml.eom_token_id u32 = 151338
llama_model_loader: - kv 39: tokenizer.chat_template str = [gMASK]<sop>\n{%- if tools -%}\n<|syste...
llama_model_loader: - kv 40: general.quantization_version u32 = 2
llama_model_loader: - kv 41: split.no u16 = 0
llama_model_loader: - kv 42: split.count u16 = 804
llama_model_loader: - kv 43: split.tensors.count i32 = 803
llama_model_loader: - type f32: 331 tensors
llama_model_loader: - type q8_0: 48 tensors
llama_model_loader: - type q5_K: 1 tensors
llama_model_loader: - type iq3_xxs: 186 tensors
llama_model_loader: - type iq4_nl: 93 tensors
llama_model_loader: - type iq4_xs: 144 tensors
llm_load_vocab: special tokens cache size = 36
llm_load_vocab: token to piece cache size = 0.9713 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = glm4moe
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151552
llm_load_print_meta: n_merges = 318088
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 47
llm_load_print_meta: n_head = 96
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 12
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 10944
llm_load_print_meta: n_expert = 128
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 106B.A12B
llm_load_print_meta: model ftype = IQ1_S - 1.5625 bpw
llm_load_print_meta: model params = 110.469 B
llm_load_print_meta: model size = 47.828 GiB (3.719 BPW)
llm_load_print_meta: repeating layers = 46.817 GiB (3.682 BPW, 109.227 B parameters)
llm_load_print_meta: general.name = GLM 4.5 Air
llm_load_print_meta: BOS token = 151331 '[gMASK]'
llm_load_print_meta: EOS token = 151329 '<|endoftext|>'
llm_load_print_meta: UNK token = 151329 '<|endoftext|>'
llm_load_print_meta: PAD token = 151329 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'βΓ€'
llm_load_print_meta: FIM PRE token = 151347 '<|code_prefix|>'
llm_load_print_meta: FIM SUF token = 151349 '<|code_suffix|>'
llm_load_print_meta: FIM MID token = 151348 '<|code_middle|>'
llm_load_print_meta: EOT token = 151336 '<|user|>'
llm_load_print_meta: max token length = 1024
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
llm_load_tensors: ggml ctx size = 0.66 MiB
model has unused tensor blk.46.attn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.46.attn_q.weight (size = 26738688 bytes) -- ignoring
model has unused tensor blk.46.attn_k.weight (size = 2228224 bytes) -- ignoring
model has unused tensor blk.46.attn_v.weight (size = 2228224 bytes) -- ignoring
model has unused tensor blk.46.attn_q.bias (size = 49152 bytes) -- ignoring
model has unused tensor blk.46.attn_k.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.46.attn_v.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.46.attn_output.weight (size = 53477376 bytes) -- ignoring
model has unused tensor blk.46.post_attention_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.46.ffn_gate_inp.weight (size = 2097152 bytes) -- ignoring
model has unused tensor blk.46.exp_probs_b.bias (size = 512 bytes) -- ignoring
model has unused tensor blk.46.ffn_gate_exps.weight (size = 282591232 bytes) -- ignoring
model has unused tensor blk.46.ffn_down_exps.weight (size = 415236096 bytes) -- ignoring
model has unused tensor blk.46.ffn_up_exps.weight (size = 282591232 bytes) -- ignoring
model has unused tensor blk.46.ffn_gate_shexp.weight (size = 2207744 bytes) -- ignoring
model has unused tensor blk.46.ffn_down_shexp.weight (size = 3244032 bytes) -- ignoring
model has unused tensor blk.46.ffn_up_shexp.weight (size = 2207744 bytes) -- ignoring
model has unused tensor blk.46.nextn.eh_proj.weight (size = 17825792 bytes) -- ignoring
model has unused tensor blk.46.nextn.embed_tokens.weight (size = 329777152 bytes) -- ignoring
model has unused tensor blk.46.nextn.enorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.46.nextn.hnorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.46.nextn.shared_head_head.weight (size = 329777152 bytes) -- ignoring
model has unused tensor blk.46.nextn.shared_head_norm.weight (size = 16384 bytes) -- ignoring
llm_load_tensors: offloading 47 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 48/48 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 407.00 MiB
llm_load_tensors: CUDA0 buffer size = 46897.99 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 24064.00 MiB
llama_new_context_with_model: KV self size = 24064.00 MiB, K (f16): 12032.00 MiB, V (f16): 12032.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 3584.03 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 2112.05 MiB
llama_new_context_with_model: graph nodes = 2149
llama_new_context_with_model: graph splits = 2
main: n_kv_max = 131072, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 36, n_threads_batch = 36
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 1.258 | 3254.97 | 22.311 | 45.90 |
| 4096 | 1024 | 4096 | 1.509 | 2714.54 | 23.554 | 43.47 |
OK, so PP performance is now in line with the expectation. If you have the time and patience to test one more thing, can you build and test mainline llama.cpp with CUDA graphs disabled?
cmake -DGGML_CUDA_GRAPHS=OFF $other_cmake_args
This will test the hypothesis that CUDA graphs have a huge performance impact on Windows. Thanks!
Can confirm the same issues for speed vs mainline on EPYC + 13x3090s on Ubuntu.
Fully offloading the UD-Q5-K-XL and IQ5_K.
Getting 1.5T/s towards the max context while mainline finishes around 4T/s
@ikawrakow, your hypothesis is confirmed. With -DGGML_CUDA_GRAPHS=OFF the TG t/s performance drops to ik_llama.cpp levels.
$ CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 ~/llama-ik_hypothesis-b6131-ac67504-bin-win-cuda-12.8-x64/llama-bench -m GLM-4.5-Air-THIREUS-BF16-SPECIAL_TENSOR-00001-of-00804.gguf -fa 1 \
-ctk f16 \
-ngl 99 \
-b 4096 -ub 4096 \
--threads 1 \
--main-gpu 0 -p 8192 -n 512 --mmap 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from C:\cygwin64\home\Thireus\llama-ik_hypothesis-b6131-ac67504-bin-win-cuda-12.8-x64\ggml-cuda.dll
load_backend: loaded RPC backend from C:\cygwin64\home\Thireus\llama-ik_hypothesis-b6131-ac67504-bin-win-cuda-12.8-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\cygwin64\home\Thireus\llama-ik_hypothesis-b6131-ac67504-bin-win-cuda-12.8-x64\ggml-cpu-skylakex.dll
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | ---: | --------------: | -------------------: |
| glm4moe 106B.A12B IQ1_S - 1.5625 bpw | 47.83 GiB | 110.47 B | CUDA,RPC | 99 | 1 | 4096 | 4096 | 1 | 0 | pp8192 | 2035.76 Β± 3.06 |
| glm4moe 106B.A12B IQ1_S - 1.5625 bpw | 47.83 GiB | 110.47 B | CUDA,RPC | 99 | 1 | 4096 | 4096 | 1 | 0 | tg512 | 46.52 Β± 0.12 |
build: ac675042 (6131)
Your hypothesis is confirmed. With -DGGML_CUDA_GRAPHS=OFF the TG t/s performance drops to ik_llama.cpp levels.
Thanks! This tells us two things:
- On Linux ik_llama.cpp will be about the same as mainline for TG and 50+% faster for PP for this model
- I should put some effort into making CUDA graphs work with MoE models in ik_llama.cpp. Or, alternatively, Windows users should simply use mainline llama.cpp. The latter may be the better course of action as I don't develop/test on Windows, don't provide Windows builds, etc.
- On Linux ik_llama.cpp will be about the same as mainline for TG and 50+% faster for PP for this model
I'm not observing the same TG performance. It's still about half compared to mainline on Ubuntu as context fills up.
I'm not observing the same TG performance. It's still about half compared to mainline on Ubuntu as context fills up.
Can you post more detailed results (TG performance as a function of context length)? In ik_llama.cpp you can use llama-sweep-bench for that. In mainline you need to either use @ubergarm's ported llama-sweep-bench, or work with the -depth parameter in llama-bench.
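For the llama-bench route on mainline, that would be something along these lines (a sketch; check llama-bench --help on your build for the exact spelling of the depth flag, I believe recent builds call it -d/--n-depth):

# measure PP/TG at several context depths
./llama-bench -m GLM-4.5-Air-Q5_K_S.gguf -fa 1 -ngl 999 -n 128 -d 0,4096,8192,16384,32768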
My go-to for doing A/B inference speed comparison between quants or between ik/mainline lcpp is to use llama-sweep-bench and I maintain a branch for mainline here: https://github.com/ubergarm/llama.cpp/tree/ug/port-sweep-bench
This would be an example command of how to use it:
# get code for mainline llama.cpp
cd llama.cpp
git remote add ubergarm https://github.com/ubergarm/llama.cpp.git
git fetch ubergarm
git checkout ug/port-sweep-bench
# compile like normal for multi-GPU e.g.
cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)
# run a sweep across full kv-cache depth up to the specified context size
./build/bin/llama-sweep-bench \
--model "$model"\
-fa \
-ctk q8_0 -ctv q8_0 \
-c 20480 \
-ngl 99 \
--threads 1
When running on ik_llama.cpp you can add -fmoe -ub 4096 -b 4096 --warmup-batch and increase the context a bit as it will sample every 4096 instead of default 512 batches but no big deal just the end of the graph will clip off.
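So an equivalent ik_llama.cpp run would look roughly like this (same model variable as the mainline example above):

./build/bin/llama-sweep-bench \
    --model "$model" \
    -fa -fmoe \
    -ctk q8_0 -ctv q8_0 \
    -c 24576 \
    -ngl 99 \
    -ub 4096 -b 4096 \
    --warmup-batch \
    --threads 1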
Thanks all for testing and confirming the Windows CUDA graph stuff, I'll let folks know if I see this come up in discussions in other forums.
CUDA graphs for MoE models were enabled in mainline relatively recently (~2 months ago?).
I am unable to locate the PR :(, am I looking into the right code or are there other regions? https://github.com/ggml-org/llama.cpp/blame/master/ggml/src/ggml-cuda/ggml-cuda.cu#L2617
I am unable to locate the PR :(, am I looking into the right code or are there other regions?
CUDA graphs have been in llama.cpp for a while (don't remember if since before I forked, or at least before I last synced ik_llama.cpp with mainline in August of last year).
But they were disabled for MoE models (and that's what ik_llama.cpp inherited).
I'm not finding the actual PR, but IIRC, it came after this one: https://github.com/ggml-org/llama.cpp/pull/12970.
There is also this one that makes changes to the CUDA graph logic: https://github.com/ggml-org/llama.cpp/pull/13814
This PR also mentions CUDA graphs: https://github.com/ggml-org/llama.cpp/pull/13014
The logic around CUDA graphs has changed many times since last August (as with everything else in llama.cpp). Porting the stuff that enables CUDA graphs for MoE models to ik_llama.cpp will be a non-trivial undertaking.
I've started attempting to port all the things, but the work to port everything is truly overwhelming, especially considering that this isn't a clean port as there's been quite a few ik_llama.cpp additions since. I might just switch to Linux or switch back to llama.cpp.
Is there any particular reason why -ub 4096 -b 4096 seems to significantly improve PP performance in ik_llama.cpp but not in mainline?
I wasn't able to test this myself as I didn't have enough VRAM headroom on top of the weights and the k/v cache to push above the default 512, but it was a several-fold increase for ik in Thireus' testing while mainline's PP didn't really change.
Is there any particular reason why -ub 4096 -b 4096 seems to significantly improve PP performance in ik_llama.cpp but not in mainline?
This ik_llama.cpp PR and similar might have some more information on some specifics for at least one possible reason: https://github.com/ikawrakow/ik_llama.cpp/pull/559 ...dequantize+cuBLAS is faster than MMQ for the iqX_k_r4 quants when the batch size is larger than some threshold. I couldn't find another one which I thought was related, where I was showing that even for MoEs, increasing batch sizes enough makes the non-_r4 quants begin to outpace the _r4 quants for PP, which is partially why I stopped releasing pre-repacked _r4 quants, for more flexibility (e.g. the end user can use -rtr if they desire).
So performance can vary quite a bit depending many variables including the exact quantization type used, which backend is computing inference, batch size used, OS, etc. The best place to find detailed information for questions like yours is to search the closed PRs both on ik and mainline.
@Thireus Your quantization recipe leads to -fmoe not being used for many layers. -fmoe requires the type of ffn_up_exps to be the same as the type of ffn_gate_exps in a given layer. It is of course possible to generalize, but you are the first "quant cook" to set the quantization type of ffn_up_exps and ffn_gate_exps fully independently from each other. With a more sane quantization recipe ik_llama.cpp performance will improve.
@ikawrakow, would this apply to ffn_up_shexp and ffn_gate_shexp as well?
would this apply to ffn_up_shexp and ffn_gate_shexp as well?
These are not fused, so it doesn't matter (yet).
But when I come around to do it, I'll fuse them as this saves one quantization of the activations. I expect this to result in a very minor performance gain, that's why it has not been done yet.
would this apply to ffn_up_shexp and ffn_gate_shexp as well?
These are not fused, so it doesn't matter (yet).
But when I come around to do it, I'll fuse them as this saves one quantization of the activations. I expect this to result in a very minor performance gain, that's why it has not been done yet.
Thank you. I've implemented an option to harmonize the mixture of quants for selected tensor groups for the recipes the tool suite produces. Also giving the flexibility to the user to select which tensor groups should be harmonized.
And now I have to update all my recipe examples.
Example of partial output:
...
# ffn_up_exps (up-extraction) - qbits: 4 3 2 1
blk\.(46|48|50)\.ffn_up_exps\.weight=iq4_ks
blk\.(24|47|49|60|4[0-5]|5[1-9]|3[0-9]|2[6-9])\.ffn_up_exps\.weight=iq3_k
blk\.(5|25|1[2-9]|2[0-3])\.ffn_up_exps\.weight=iq2_ks
blk\.([3-4]|[6-9]|1[0-1])\.ffn_up_exps\.weight=iq1_m_r4
# ffn_gate_exps (gate-extraction) - qbits: 4 3 2 1
blk\.(46|48|50)\.ffn_gate_exps\.weight=iq4_ks
blk\.(24|47|49|60|4[0-5]|5[1-9]|3[0-9]|2[6-9])\.ffn_gate_exps\.weight=iq3_k
blk\.(5|25|1[2-9]|2[0-3])\.ffn_gate_exps\.weight=iq2_ks
blk\.([3-4]|[6-9]|1[0-1])\.ffn_gate_exps\.weight=iq1_m_r4
...
Take a peek at this new ik_llama.cpp PR689: https://github.com/ikawrakow/ik_llama.cpp/pull/689 which enables CUDA graphs for MoE models. This could translate into ~~5%-25%~~ quite a bit more token generation speed in mostly offloaded cases and especially help Windows users. So as an RTX 6000 Pro user it might be worth a peek!
This could translate into 5%-25% more token generation speed
In a quick test, @Thireus is reporting 2X acceleration for TG on Windows and full offload, see this comment in the PR.
Take a peek at this new ik_llama.cpp PR689: https://github.com/ikawrakow/ik_llama.cpp/pull/689 which enables CUDA graphs for MoE models. This could translate into ~~5%-25%~~ quite a bit more token generation speed in mostly offloaded cases and especially help Windows users. So as an RTX 6000 Pro user it might be worth a peek!
Drastic speedup in TG on Windows lol, practically matching mainline. Prompt prefill speed is still slow compared to mainline when using the default batch size though (512?). -ub 4096 -b 4096 is faster but I don't have enough VRAM for it when loading the model with full context length.
./llama-sweep-bench.exe -m "C:\ML\GGUF\GLM-4.5-Air-Q5_K_S.gguf" -c 32768 -ngl 999 -fa -fmoe --no-mmap -t 1 --warmup-batch
Oh nice, thanks for confirming much better TG speeds on windows with this PR689!
default batch size though (512?). -ub 4096 -b 4096 is faster but I don't have enough VRAM for it when loading the model with full context length
Right, the default batch sizes are -ub 512 -b 2048, so you could try to squeeze a little more PP (prefill) speed without much extra VRAM by going with the more modest -ub 1024 or -ub 2048, which are both <= the default -b 2048, so no need to add both parameters in that case.
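For example, taking the earlier llama-server command and just bumping the micro-batch (a sketch):

"C:\ML\ik_llama.cpp\build\bin\Release\llama-server.exe" -m "C:\ML\GGUF\GLM-4.5-Air-Q5_K_S.gguf" -c 131072 -ngl 999 -fa -fmoe -ub 2048 -t 1 --host 0.0.0.0 --port 5000 -a GLM-4.5-Air-Q5_K_S --no-mmap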
Right, the default batch sizes are -ub 512 -b 2048, so you could try to squeeze a little more PP (prefill) speed without much extra VRAM by going with the more modest -ub 1024 or -ub 2048, which are both <= the default -b 2048, so no need to add both parameters in that case.
Welp, prompt processing on ik_llama.cpp at -ub 1024 -b 2048 is still slower than mainline llama.cpp at -ub 512 -b 2048. ik_llama.cpp overtakes mainline llama.cpp at -ub 2048 -b 2048 but this also results in OOM when I try to load full context.
Is code too diverged at this point, or would it be possible for ik_llama.cpp to adopt mainline llama.cpp's prompt processing behavior when -ub is low (<2048)?
Is code too diverged at this point, or would it be possible for ik_llama.cpp to adopt mainline llama.cpp's prompt processing behavior when -ub is low (<2048)?
Yes, the code bases have massively diverged. But even if they hadn't, considering that GLM-4.5-Air fully offloaded to a Blackwell GPU and running on a Windows machine is the first and only report of llama.cpp being faster than ik_llama.cpp for MoE models, I wouldn't know why I would want to change ik_llama.cpp to do what llama.cpp does. Also, considering the results reported here, to me it looks like the performance difference comes from the attention implementation, and not from the FFN part.
Fair enough - I'll just stick to mainline llama.cpp because it works better on my setup. I'll perhaps revisit ik_llama.cpp or ktransformers if I end up building the new Linux machine with server hardware for CPU+GPU inference of Deepseek/GLM 355B.
Also specifically for GLM-4.5 this performance PR just got made
You beat me to it. I wanted to wait for the sweep-bench results from @Thireus before announcing the PR to a wider audience.
Concerning the -amb flag: the flag has no effect when using flash attention and not dealing with a DeepSeek model. It was implemented specifically for DeepSeek R1, V3, etc. Initially to reduce the compute buffer size without FA because FA did not work for DeepSeek. Later it was extended to also cover the more advanced MLA implementation in ik_llama.cpp (mla = 2 or mla = 3). So, this will not solve the OOM problem. But depending on use case (e.g., best possible PP performance is more important than TG performance), one can just leave the routed experts for a few layers on the CPU. How many layers are needed depends on how much VRAM one has, the quantization type of the experts, and the -b / -ub one wants to use. If the number of required layers is small, this will have a relatively minor impact on TG performance. For GLM-4.5-Air, the routed experts are 2.2B parameters per layer, so for the Q5_K_S model that was discussed above each layer left on the CPU will free up 1.42 GiB of VRAM to use for compute buffers. I see the CUDA compute buffer being 813 MiB with the default batch/u-batch size, and 2432 MiB with -b 4096 -ub 4096, so potentially just a single layer is enough, and almost for sure not more than 2.
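A minimal sketch of what that looks like on the command line, assuming the -ot/--override-tensor syntax and GLM-4.5-Air's tensor names (pick whichever layers you like):

# keep everything on the GPU except the routed experts of two layers, which stay in RAM
./build/bin/llama-server -m GLM-4.5-Air-Q5_K_S.gguf -c 131072 -ngl 999 -fa -fmoe -ub 4096 -b 4096 \
    -ot "blk\.(44|45)\.ffn_(up|gate|down)_exps\.weight=CPU"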
So, people have confirmed that the ik_llama.cpp performance issue for GLM-4.5 models has been fixed with this PR. Depending on GPU, OS, and context length, ik_llama.cpp is now between slightly slower and much faster than llama.cpp.
Built and tested the new PR for GLM 4.5 - it seems that now TG speed is on par with mainline llama.cpp on my setup.
./llama-sweep-bench.exe -m "C:\ML\GGUF\GLM-4.5-Air-Q5_K_S.gguf" -c 32768 -ngl 999 -fa -fmoe --no-mmap -t 1 --warmup-batch

Prompt processing in ik_llama.cpp is still far slower than mainline llama.cpp at -ub 512 though, which is what I'm still trying to figure out. It's not a small gap at all, around 40-50% slower depending on the context length. ik_llama.cpp seems to be very sensitive to -ub, where smaller values tank the prompt processing speed, while mainline llama.cpp doesn't experience this. Empirically it seems like at least -ub 2048 is needed for ik_llama.cpp to surpass mainline llama.cpp's prompt processing speed on my setup; wondering if this is a bug? Mainline here (also at -ub 512):
Okay this is great! That PR solved the performance issue for me.
GLM-4.5 (big version) is now on par with llama.cpp for text generation, and faster for prompt processing. This is on Linux with 6x3090 fully offloaded.
Question: Is there a way to stop the full prompt/response being dumped to the terminal? The latest build is doing this.
Question: Is there a way to stop the full prompt/response being dumped to the terminal? The latest build is doing this.
I noticed that recently too, that client requests are being logged to stdout or stderr or something on llama-server. I'm not sure how to turn that off; I don't recall that being a thing until maybe a couple weeks ago?
$ ./build/bin/llama-server --help
-v, --verbose print verbose information
--verbosity N set specific verbosity level (default: 0)
--verbose-prompt print a verbose prompt before generation (default: false)
--no-display-prompt don't print prompt at generation (default: false)
I have been looking for a way to turn it off too, --log-disable didn't do it if I recall. Perhaps this --no-display-prompt will do it? I'll have to try too, let me know if you figure it out haha...
Yeah I think it started happening around the time they were working on tool calls, so figure it was leftover hard-coded debug output.
I've been working around it by appending this:
|grep -v 'format_partial_response_oaicompat'
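i.e. appended to the usual server invocation, something like this (a sketch):

./build/bin/llama-server -m "$model" -c 32768 -ngl 99 -fa -fmoe 2>&1 | grep -v 'format_partial_response_oaicompat'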
I'll try that --no-display-prompt


