IQ2_KL Testing - Runs Great Until The Model The Model The Model (lol)
I've noticed this issue also happens with GLM-Air on OpenRouter, so it might just be a model issue, but after a while of working great, the model eventually gets senile and just loops itself repeatedly. This happens more often on the lower quant (it could also be my q4 ctk...), but I'm just reporting some feedback! Even at a lower quant this model can produce great results (it one-shotted a physics-based HTML game!).
--model /home/phone/Downloads/GLM-4.5-Air-IQ2_KL.gguf
--alias ubergarm/GLM-4.5-Air-IQ2_KL
--chat-template chatglm4
--ctx-size 131072
-fa -fmoe
-ctk q4_0 -ctv q4_0
-ub 4096 -b 4096
-ngl 99
--parallel 1
--threads 16
--host 0.0.0.0
--port 8081
--no-mmap
Pretty wild running this lower BPW quant at 128k context! If you have a little more VRAM or reduce the context slightly, you could try bumping the kv-cache quality up a bit to q4_1 or q6_0, which I believe are valid options if you don't want to go all the way up to q8_0.
You shouldn't have to pass --chat-template chatglm4 anymore as it should auto-detect properly now, but it doesn't hurt anything to leave it explicit.
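For example, relative to your flags above, the tweak could be as small as this (just a sketch, assuming your ik_llama.cpp build accepts q6_0 cache types; the lower --ctx-size is only an example of trading context for cache quality):

```
-ctk q6_0 -ctv q6_0
--ctx-size 65536
```

Everything else can stay the same.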
totally forgot about the in-between quants for the kv-cache, I was stuck in my mind on either 4 or 8, sometimes f16 if I'm running a tiny tiny model!
I have one more 3090 coming, then my build is done for this year. Looking forward to future models and quants, the competition is picking up!
also, I noticed both GLM 4.5 Air and its bigger brother suffer from the repeated-token generation loop like above. I'm thinking it's due to the lower quants, lower kv-cache quants, or both, but it could also just be the nature of this model right now, as even the OpenRouter q8 from Chutes does it (rarely)
if anyone has any insight on how to prevent this or minimize it, or even what's causing it, I'm all ears -
I've heard some people mention output getting incoherent or stuck in loops at higher context depth. A few general thoughts:
history truncation to prevent exceeding the 128K context limit, configured with temperature=0.6, top_p=1.0.
https://z.ai/blog/glm-4.5
I'd suggest avoiding that much context depth on any model; personally I try to keep my chats limited to single one-shot prompts with less than 40k tokens, more or less. If you must go to longer context and can't use RAG or other techniques to reduce prompt size, then definitely be careful with your sampling settings, and you might have to tweak those a bit.
The newer Qwen models are quite yappy, and the official Qwen model cards suggest:
For supported frameworks, you can adjust the presence_penalty parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507#best-practices
So possibly try setting presence_penalty to a small positive number, e.g. 0.25, and maybe the same for frequency_penalty, though if you're doing coding you probably don't want to go too high with these.
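If you're hitting the server's OpenAI-compatible endpoint, a request with those samplers might look roughly like this (just a sketch; the prompt is a placeholder and I'm assuming your build passes the penalty fields through):

```
curl http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ubergarm/GLM-4.5-Air-IQ2_KL",
    "messages": [{"role": "user", "content": "your prompt here"}],
    "temperature": 0.6,
    "top_p": 1.0,
    "presence_penalty": 0.25,
    "frequency_penalty": 0.25
  }'
```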
Uploading a slightly larger IQ3_KS now, but it might be a bit too big for your setup.
No worries! Recovering from COVID right now, but if I muster up the strength tonight I'll give it a try! I can still fit it at around 60k context or more, depending on the kv-cache quality...
I'll update the contents of this message in the morning!
BTW I spent all night last night swapping my system over from AM5 to Sapphire Rapids. I put it off for so long because of life and because I had to swap my water cooling pump over to it... It's an engineering sample CPU I got for $100 (QYFS), but god is it so much QUICKER. The CPU at full tilt never goes above 50C, and I can run Qwen3 235B at 20 tk/s at q4... (irrelevant here since I'm running full VRAM offload) And that will only improve when my final 3090 comes Monday! I learned that running it at 56 threads instead of 112 greatly improves inference speed... I'll play around with this later too.
Alas, this hobby isn't for the weak wallet (the 8x32GB RAM cost more than the motherboard... just under $1000), and my recent engagement means I need to start conserving. I'll wait for newer GPUs with more VRAM density in a year or two, and for bigger RAM sticks to drop in price too... Realistically I'd like 8x128GB...
For now, my 256GB of DDR5 and 96GB of VRAM should keep me in a good position as models get more efficient. I hope taking the time to list my upgrade specs helps interested people looking for an idea of a good performance build!
UPDATE:
IQ3_KS runs great, got around 70k context @ q6 ctk. Pretty coherent, solid speeds with full GPU offload. Great model as always @ubergarm!
Normally I use GLM 4.5 Air for TypeScript and JavaScript code tweaking, but to test the coherency of the IQ3_KS I had it one-shot a multi-file site that served as an experimental game system OS! I think it did great. It's amazing that it can write multi-file code and have it all work with no breaks in the functions across different files.
Very cool you're getting solid results with such a compressed model!
Interesting that you went from AM5 to Sapphire Rapids. Did you run mlc (Intel Memory Latency Checker) before and after to see the effective memory bandwidth of the two systems?
And yeah, if it is dual socket there are some games to play with numactl and using different values for --threads 56 --threads-batch 112, assuming two sockets with 56 physical cores each for example. But right, if you're doing full VRAM offload like with this model, then -t 1 is the way to go!
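Purely as an illustration (made-up path and values for a hypothetical dual-socket box with 56 physical cores per socket, not something you need for full VRAM offload), that kind of CPU-heavy launch might look roughly like:

```
numactl --interleave=all \
  ./llama-server --model /path/to/model.gguf \
  --threads 56 --threads-batch 112
```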
Congrats and yes make sure to save a little for your celebration haha! Cheers!
just ran the MLC program. I didn't get a chance to run it before decommissioning the old system, but online sources say the 7950X was about 74.9 GB/s, which I agree with given the dual-channel, 4-slot RAM setup
the new system is a single-CPU motherboard, so I'm limited to 8 channels of DDR5, however each channel is individually accessible by the CPU, which greatly speeds up CPU inference.
my reason for swapping was mainly that I wanted more PCIe slots, but I also noticed the speed degradation for models that couldn't be fully offloaded to VRAM was too high, due to the 74.9 GB/s bandwidth of the 7950X.
so now, if a model can barely fit into my 96GB of VRAM and it overflows into the DDR5 RAM buffer, the impact on t/s generation isn't as great!
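(For anyone wanting to reproduce the numbers below: they're from Intel's Memory Latency Checker run with no arguments; it typically needs root for huge pages, and the path depends on where you extracted it.)

```
sudo ./mlc
```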
Measuring idle latencies for random access (in ns)...
Numa node 0 : 114.6
Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 255643.1
3:1 Reads-Writes : 224798.9
2:1 Reads-Writes : 217866.0
1:1 Reads-Writes : 204116.3
Stream-triad like: 218647.4
tl;dr:
if I were to load a 120GB model with 96GB offloaded to the quad 3090s @ 936.2 GB/s and the rest of the model in (old) system RAM at 74.9 GB/s, it would greatly drag down the performance.
total performance average: 283.76 GB/s
the same 120GB model with 96GB offloaded at 936.2 GB/s, and the rest of the model in system RAM at 255 GB/s:
total performance average: 610.38 GB/s
more than a 2x increase in effective performance! And it really does just feel more fluid using models on this system.
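For reference, those averages are just a byte-weighted harmonic mean: total bytes divided by the time each memory pool needs to stream its share (rough numbers, using the MLC all-reads figure for the new RAM):

```
effective BW = total GB / (GB in VRAM / VRAM BW + GB in RAM / RAM BW)

old: 120 / (96/936.2 + 24/74.9)  ≈ 284 GB/s
new: 120 / (96/936.2 + 24/255.6) ≈ 611 GB/s
```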
I do wonder if Intel's AMX instruction set will be more utilized in the future, but for now I'm very happy with the performance. $2K for the CPU, motherboard, and 256GB of RAM all together. I think this was the most budget-friendly option to greatly speed up local models for me at this point in time.
but online sources say the 7950X was about 74.9 GB/s, which I agree with given the dual-channel, 4-slot RAM setup
Yeah, you'd be maxing out the "verboten 4x DIMM" AM5 gaming rig configuration to get 75 GB/s. I get about 86 GB/s with 2x48GB DDR5 @ 6400MT/s with overclocked Infinity Fabric on my home rig's AMD 9950X.
So your new rig has about 256GB/s in a single socket which is quite nice indeed for token generation!
I do wonder if Intel's AMX instruction set will be more utilized in the future
Right, mainline llama.cpp has some AMX support, but in empirical testing it seems like ik's AVX2 kernels were faster anyway. I know Intel has some folks working on making more use of AMX, but given TG is already memory bandwidth limited, I wouldn't expect that to get faster with AMX extensions anyway. You might get some boost to PP, but in general that is not the thing most people are limited by for daily driving local LLMs.
The Intel PyTorch team did some stuff for bigger Sapphire Rapids Xeons with SGLang recently: https://lmsys.org/blog/2025-07-14-intel-xeon-optimization/#multi-numa-parallelism However, I'm not convinced it is much faster than ik_llama.cpp in terms of "aggregate" throughput, but yes, you can get more in a single generation. It's more about multi-NUMA, so not applicable to you.
This is the guy, Ma Mingfei, who might add more AMX support now that he is done with the above SGLang stuff: https://github.com/ggml-org/llama.cpp/issues/12003#issuecomment-3166734810 so keep an eye on his GitHub.


