general questions

#8
by Hansi2024 - opened

First, thanks to ubergarm for his inspiring contributions to running local LLMs.
I made some successful tests using DeepSeek Terminus IQ2_KL and GLM-4.6 IQ5_K connected to SillyTavern to create CYOA RP stories. I have 32GB VRAM / 256GB DDR5-5600 / 9800X3D.
GLM runs at 4.5 t/s (think/nothink) and DeepSeek at 7 t/s (think/nothink). Is the speed difference because of the different architectures of the models? Would a smaller GLM q4 quant be faster? Both models are OK; GLM is more detailed and engaging.
Is the perceived quality difference a result of the more compressed q2 DeepSeek model, and can you say that a less compressed model, which fills the memory completely, is always better? Is it worth testing the small q2 quants of Ling and Kimi? I'm short of bandwidth ;-) .

First, thanks to ubergarm for his inspiring contributions to running local LLMs.

Thanks! I recently gave a talk with aifoundry.org at their AI Plumbers Unconference in San Francisco and hope to post a link to it soon!

I have 32GB Vram / 256GB DDR 5600 / 9800X3D.

This is probably one of the best low-cost options for running these big MoEs, even though it likely has under ~60 GB/s of memory bandwidth.

DeepSeek Terminus IQ2_KL and GLM-4.6 IQ5_K

The token generation (decode) speed on your rig will be roughly inversely proportional to the size of the active weights held in RAM. DeepSeek is 671B-A37B, GLM-4.6 is 355B-A32B. The reason your GLM is slower is likely that it is a bigger quant: despite having fewer active weights, their overall size in GB is larger. TG is generally memory-bandwidth bottlenecked, so you can calculate your theoretical max speed as the memory bandwidth divided by the size of the active weights per token, e.g.

60 GB/s memory bandwidth / ~13 GB active weights per token ≈ 4.6 tok/sec
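
If you want to play with the numbers, here's a minimal Python sketch of that back-of-the-envelope estimate (the bandwidth and bits-per-weight figures are assumptions, plug in your own):

```python
# Rough decode-speed estimate, assuming TG is purely memory-bandwidth bound.
def estimate_tg_speed(mem_bandwidth_gb_s: float,
                      active_params_billion: float,
                      bytes_per_weight: float) -> float:
    """Theoretical max tokens/sec = bandwidth / GB of active weights read per token."""
    active_gb = active_params_billion * bytes_per_weight
    return mem_bandwidth_gb_s / active_gb

# Example numbers (assumptions): ~60 GB/s DDR5 bandwidth, DeepSeek's ~37B active
# params at roughly 2.8 bits per weight (~0.35 bytes/weight) -> ~13 GB per token.
print(f"{estimate_tg_speed(60, 37, 0.35):.1f} tok/sec")  # ~4.6
```

Real speeds will come in below this ceiling since prompt processing, KV cache reads, and the layers offloaded to GPU all complicate the picture.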

You are asking the right questions, but there are no simple, always-correct answers. Basically you have to try different models on your specific rig and workload to figure out what you prefer. In general, larger models can tolerate more quantization while retaining reasonable performance.

I'd suggest reading and following r/LocalLLaMA on Reddit for gossip, or the AI Beavers Discord, where people like to discuss these questions.

60 GB/s memory bandwidth / ~13 GB active weights per token ≈ 4.6 tok/sec

Where can I see how many active weights a model has?

I'd suggest reading and following r/LocalLLaMA on Reddit for gossip, or the AI Beavers Discord, where people like to discuss these questions.

AI Beavers is new to me, just headed over

@Hansi2024

Where can I see how many active weights a model has?

Generally the original model card on Hugging Face will say. Sometimes it's in the name, for example Qwen3-30B-A3B: the A3B means that out of the 30B total model weights, 3B of them are active per token generated. It can be a bit tricky to calculate exactly, though, as some architectures use shared experts, some use "first N dense layers", etc.
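
If the card doesn't spell it out, a small Python sketch like this can pull the relevant fields from the repo's config.json (field names vary by architecture; the ones below are DeepSeek-style, and the repo id is just an example):

```python
import json
from huggingface_hub import hf_hub_download

# Example repo id; swap in whatever model you're curious about.
cfg_path = hf_hub_download(repo_id="deepseek-ai/DeepSeek-V3", filename="config.json")
with open(cfg_path) as f:
    cfg = json.load(f)

# Fields that determine how many weights fire per token (DeepSeek-style names;
# other architectures call them something slightly different).
for key in ("num_experts_per_tok", "n_routed_experts", "n_shared_experts",
            "first_k_dense_replace", "num_hidden_layers", "moe_intermediate_size"):
    print(key, cfg.get(key))
```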

Enjoy the wild ride!

