Testing smol-IQ4_KSS

#8
by shewin - opened

W790E Sage + QYFS + 512G + RTX5090

Computed blk.60.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
llama_new_context_with_model: n_ctx = 170240
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: grouped er = 0
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: fused_mmad = 1
llama_new_context_with_model: rope_cache = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 50000.0
llama_new_context_with_model: freq_scale = 0.015625
llama_kv_cache_init: CUDA0 KV buffer size = 6061.01 MiB
llama_new_context_with_model: KV self size = 6060.98 MiB, c^KV (q8_0): 6060.98 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.62 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 8705.09 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1441.88 MiB
llama_new_context_with_model: graph nodes = 24144
llama_new_context_with_model: graph splits = 122
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload

main: n_kv_max = 170240, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  4090 |   1022 |      0 |   50.610 |    80.81 |   83.381 |    12.26 |
|  4090 |   1022 |   4090 |   67.035 |    61.01 |   70.408 |    14.52 |
|  4090 |   1022 |   8180 |   49.551 |    82.54 |   71.436 |    14.31 |
|  4090 |   1022 |  12270 |   49.956 |    81.87 |   72.952 |    14.01 |
|  4090 |   1022 |  16360 |   50.518 |    80.96 |   73.210 |    13.96 |

The output of this model is superb. Looks like you got some quite usable numbers as well! I've been enjoying using the "Q8_0-Q4_0", but might drop to the iq4_kss for a slight speed bump and more context.

How does the Lego example work? Procedural generation, or can you place your own blocks? Looks cool!

Hi, I am thinking about the best value-for-money PC build to run this.

Has anyone got any suggestions? I don't know if I should get an Intel Xeon Platinum 8480 (the engineering samples can be picked up really cheap now), or should I get an AMD?

Shewin, what type of RAM did you use? DDR5 or DDR4?

I'm not Shewin, but I run a very similar setup to his: 512GB of DDR5 with an Intel QYFS engineering sample on an ASUS Sage W790E motherboard, plus 3x 3090 and one 4090. I probably get a fraction to one whole t/s more due to the extra VRAM, but I think for the price Intel is the way to go. Comparable AMD systems might have a few more RAM slots though, so keep that in mind. Also, Sapphire Rapids engineering samples can only access 5 of the 7 PCIe slots on the motherboard, and one of those 5 is stuck at x4 speeds. But besides that I love it and can't imagine using another PC at this point in time.

Hi Phakio, yes, I think I am going to get Intel. What speed and brand of DDR5 RAM are you using? I might buy some; DDR5 RAM is expensive lol.

I wonder what the token-per-second difference is on DDR4 RAM?

@infinityai I posted some AMD benchmarks here. Kimi can run very well on a single 32GB VRAM GPU with good CPU RAM bandwidth. But it is quite unfortunate that DDR5 DRAM prices have skyrocketed recently...

> I wonder what the token-per-second difference is on DDR4 RAM?

You can estimate the TG speed assuming a RAM-bandwidth bottleneck for the active weights of the model not offloaded to the GPU. It's easy on dense models: e.g. a 70B model at 4 bpw is about 35GB of active weights; assuming you put 20GB in VRAM, that leaves 15GB in CPU RAM.

You can benchmark your memory read bandwidth with mlc; say you have 100GB/s of bandwidth, that would be:

100 GB/s / 15GB = ~6.7 tok/sec TG theoretical max.

Then for bigger multi-NUMA CPUs you'll get maybe 40-70% of that theoretical max depending on how well you keep things on a single NUMA node and how well NPS1/SNC=Disable is implemented by the vendors.
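A minimal sketch of that estimate in Python, using the numbers from the example above (measured read bandwidth, GB of active weights left in CPU RAM) and an efficiency factor for the 40-70% NUMA-related losses mentioned. The function name and the specific efficiency value are my own illustration, not anything from the post:

```python
# Rough token-generation (TG) speed estimate, assuming decode is bottlenecked
# by reading the CPU-resident active weights once per generated token.

def estimate_tg_tps(read_bw_gbs: float,
                    cpu_active_weights_gb: float,
                    efficiency: float = 1.0) -> float:
    """Theoretical max TG tok/s, scaled by a NUMA/overhead efficiency factor."""
    theoretical = read_bw_gbs / cpu_active_weights_gb  # tokens per second
    return theoretical * efficiency

# Example from the post: 100 GB/s measured with mlc, 15 GB of the 70B@4bpw
# model left in CPU RAM after offloading 20 GB to VRAM.
print(estimate_tg_tps(100, 15))                   # ~6.7 tok/s theoretical max
print(estimate_tg_tps(100, 15, efficiency=0.55))  # ~3.7 tok/s with NUMA losses
```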

You can maybe get 4x 64GB DDR5-6000 stable in the "verboten four slot" configuration now, as shown by Wendell: https://www.youtube.com/watch?v=pA-R1FabTDY. You could combine that with an AMD 9950X (or X3D, though the extra 3D V-Cache isn't gonna help a ton given the weights are so huge) and hope to get about 75GB/s of memory bandwidth with some luck.
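As a back-of-the-envelope sanity check on that ~75GB/s figure (a sketch assuming a standard dual-channel AM5 platform with 64-bit DDR5 DIMMs):

```python
# Theoretical peak bandwidth = channels * bus width in bytes * transfer rate.
channels = 2            # dual-channel consumer AM5 platform
bytes_per_transfer = 8  # 64 bits per DDR5 DIMM
transfers_per_s = 6000e6  # DDR5-6000

peak_gbs = channels * bytes_per_transfer * transfers_per_s / 1e9
print(peak_gbs)  # 96.0 GB/s theoretical; ~75-80% of that in practice
                 # lines up with the ~75 GB/s hoped for above
```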

The rigs that @phakio and @shewin are using are a step up, with more PCIe slots and more memory controllers, and very nice too, so do your price checking.

Finally, ik_llama.cpp takes advantage of avx512_vnni for faster prompt processing, if that is important to your task.

Cheers!

> I wonder what the token-per-second difference is on DDR4 RAM?

512GB DDR4-3200 + EPYC 7532 + 1x 3090 gives me 7 t/s text generation and 30 t/s prompt eval (Windows build of ik_llama.cpp).

@infinityai I started with 256GB of DDR5-5600, but the QYFS is unable to overclock RAM since it's an engineering sample, so it ran at 4800MHz. Memory bandwidth was ~350GB/s if I remember correctly; I tested it a long time ago.

I upgraded to 512GB of 4800MHz a week or two ago and think that is the perfect size right now for being budget conscious while keeping good perplexity. I think prices have gone up since then, but I paid $275 per 64GB stick for this upgrade. It's Samsung memory, but I honestly don't think brand matters too much if you're running stock speeds for stability.
