Testing smol-IQ4_KSS

#8
by shewin - opened

W790E Sage + QYFS + 512G + RTX5090

Computed blk.60.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
llama_new_context_with_model: n_ctx = 170240
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: grouped er = 0
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: fused_mmad = 1
llama_new_context_with_model: rope_cache = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 50000.0
llama_new_context_with_model: freq_scale = 0.015625
llama_kv_cache_init: CUDA0 KV buffer size = 6061.01 MiB
llama_new_context_with_model: KV self size = 6060.98 MiB, c^KV (q8_0): 6060.98 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.62 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 8705.09 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1441.88 MiB
llama_new_context_with_model: graph nodes = 24144
llama_new_context_with_model: graph splits = 122
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload

main: n_kv_max = 170240, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  4090 |   1022 |      0 |   50.610 |    80.81 |   83.381 |    12.26 |
|  4090 |   1022 |   4090 |   67.035 |    61.01 |   70.408 |    14.52 |
|  4090 |   1022 |   8180 |   49.551 |    82.54 |   71.436 |    14.31 |
|  4090 |   1022 |  12270 |   49.956 |    81.87 |   72.952 |    14.01 |
|  4090 |   1022 |  16360 |   50.518 |    80.96 |   73.210 |    13.96 |

The output of this model is superb. Looks like you got some quite usable numbers as well! I've been enjoying using the "Q8_0-Q4_0", but might drop to the iq4_kss for a slight speed bump and more context.

How does the Lego example work? Procedural generation, or can you place your own blocks? Looks cool!

Hi, I am thinking about the best value-for-money PC build to run this.

Has anyone got any suggestions? I don't know if I should get an Intel Xeon Platinum 8480 (the engineering samples can be picked up really cheap now), or should I get an AMD?

Shewin, what type of RAM did you use? DDR5 or DDR4?

I'm not Shewin, but I run a very similar setup to his: 512GB of DDR5 with an Intel QYFS engineering sample on an ASUS Sage W790E motherboard, plus 3x 3090 and one 4090. I probably get a fraction to one whole t/s more due to the extra VRAM, but I think for the price Intel is the way to go. Comparable AMD systems might have a few more RAM slots though, so keep that in mind. Also, Sapphire Rapids engineering samples can only access 5 of the 7 PCIe slots on the motherboard, and one of those 5 is stuck at x4 speeds. But besides that I love it and can't imagine using another PC at this point in time.

Hi Phakio, yes, I think I am going to get Intel. What speed and brand of DDR5 RAM are you using? I might buy some; DDR5 RAM is expensive lol.

I wonder what the token-per-second difference is on DDR4 RAM?

@infinityai I posted some AMD benchmarks here. Kimi can run very well on a single 32GB VRAM GPU with good CPU RAM bandwidth. But it is quite unfortunate that DDR5 DRAM prices have skyrocketed recently...

> I wonder what the token-per-second difference is on DDR4 RAM?

You can estimate the TG speed assuming a RAM-bandwidth bottleneck for the active weights of the model not offloaded to the GPU. It's easy on dense models: e.g. a 70B model at 4 bpw is about 35GB of active weights; assuming you put 20GB in VRAM, that leaves 15GB in CPU RAM.

You can benchmark your memory read bandwidth with mlc; say you have 100GB/s of bandwidth, that would be:

100 GB/s / 15GB = ~6.7 tok/sec TG theoretical max.

Then for bigger multi-NUMA CPUs you'll get maybe 40-70% of that theoretical max depending on how well you keep things on a single NUMA node and how well NPS1/SNC=Disable is implemented by the vendors.
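A minimal sketch of that estimate in Python, using the numbers from the example above (measured read bandwidth, GB of active weights left in CPU RAM) and an efficiency factor for the 40-70% NUMA-related losses mentioned. The function name and the specific efficiency value are my own illustration, not anything from the post:

```python
# Rough token-generation (TG) speed estimate, assuming decode is bottlenecked
# by reading the CPU-resident active weights once per generated token.

def estimate_tg_tps(read_bw_gbs: float,
                    cpu_active_weights_gb: float,
                    efficiency: float = 1.0) -> float:
    """Theoretical max TG tok/s, scaled by a NUMA/overhead efficiency factor."""
    theoretical = read_bw_gbs / cpu_active_weights_gb  # tokens per second
    return theoretical * efficiency

# Example from the post: 100 GB/s measured with mlc, 15 GB of the 70B@4bpw
# model left in CPU RAM after offloading 20 GB to VRAM.
print(estimate_tg_tps(100, 15))                   # ~6.7 tok/s theoretical max
print(estimate_tg_tps(100, 15, efficiency=0.55))  # ~3.7 tok/s with NUMA losses
```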

You can maybe get 4x 64GB DDR5-6000 stable in the "verboten four slot" configuration now, as shown by Wendell: https://www.youtube.com/watch?v=pA-R1FabTDY. You could combine that with an AMD 9950X (or X3D, though the extra 3D V-Cache isn't gonna help a ton given the weights are so huge) and hope to get about 75GB/s of memory bandwidth with some luck.
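As a back-of-the-envelope sanity check on that ~75GB/s figure (a sketch assuming a standard dual-channel AM5 platform with 64-bit DDR5 DIMMs):

```python
# Theoretical peak bandwidth = channels * bus width in bytes * transfer rate.
channels = 2            # dual-channel consumer AM5 platform
bytes_per_transfer = 8  # 64 bits per DDR5 DIMM
transfers_per_s = 6000e6  # DDR5-6000

peak_gbs = channels * bytes_per_transfer * transfers_per_s / 1e9
print(peak_gbs)  # 96.0 GB/s theoretical; ~75-80% of that in practice
                 # lines up with the ~75 GB/s hoped for above
```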

The rigs that @phakio and @shewin are using are a step up, with more PCIe slots and more memory controllers, and very nice too, so do your price checking.

Finally, ik_llama.cpp takes advantage of avx512_vnni for faster prompt processing, if that is important to your task.

Cheers!

> I wonder what the token-per-second difference is on DDR4 RAM?

512GB DDR4-3200 + EPYC 7532 + 1x 3090 gives me 7 t/s text generation and 30 t/s prompt eval (Windows build of ik_llama.cpp).

@infinityai I started with 256GB of DDR5-5600, but the QYFS is unable to overclock RAM since it's an engineering sample, so it ran at 4800MHz. Memory bandwidth was ~350GB/s if I remember correctly; I tested it a long time ago.

I upgraded to 512GB of 4800MHz a week or two ago and think that is the perfect size right now for being budget conscious while keeping good perplexity. I think prices have gone up since then, but I paid $275 per 64GB stick for this upgrade. It's Samsung memory, but I honestly don't think brand matters too much if you're running stock speeds for stability.
