IQ5_K on EPYC with a single GPU
#9 opened by sousekd
Here are benchmarks of IQ5_K on EPYC 9355 with a single GPU.
This model is much slower than DeepSeek or Kimi, and fitting 32K of context into 32 GB of VRAM is quite challenging (see the back-of-envelope KV-cache estimate after the first table). I had to use low -ub values, which hurts prefill performance:
RTX 5090
```bash
./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -ger \
    -b 2048 -ub 1024 \
    -ctk q8_0 -ctv q8_0 -c 32768 \
    -ngl 999 -ot exps=CPU \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch
```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 1024 | 256 | 0 | 16.140 | 63.45 | 23.695 | 10.80 |
| 1024 | 256 | 8192 | 15.819 | 64.73 | 24.752 | 10.34 |
| 1024 | 256 | 16384 | 16.287 | 62.87 | 25.449 | 10.06 |
| 1024 | 256 | 24576 | 16.328 | 62.72 | 26.330 | 9.72 |
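To see why 32 GB gets tight, here is a back-of-envelope KV-cache estimate. The architecture numbers below are placeholders, not Ling-1T's actual config (read the real values from the GGUF metadata); the formula is just the generic one for a GQA attention cache:

```bash
# Sketch only: n_layers / n_kv_heads / head_dim are HYPOTHETICAL placeholders,
# not Ling-1T's real values.
n_layers=80 n_kv_heads=8 head_dim=128 n_ctx=32768
bytes_per_elt=2   # f16; q8_0 stores ~1.06 bytes/element, roughly halving this
# K and V each hold n_kv_heads*head_dim elements per layer per token
echo "$n_layers $n_kv_heads $head_dim $n_ctx $bytes_per_elt" |
  awk '{ printf "KV cache: %.1f GiB\n", 2*$1*$2*$3*$4*$5 / 2^30 }'
```

With these placeholder numbers an f16 cache alone would cost ~10 GiB at 32K, on top of the attention and shared tensors that -ngl 999 keeps on the GPU, hence -ctk q8_0 -ctv q8_0 on the 32 GB card.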
VRAM is obviously not an issue on the RTX 6000, which allows -ctk f16 -ctv f16 and much larger batch sizes. Prefill performance improves significantly:
RTX 6000
```bash
./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -ger \
    -b 8192 -ub 8192 \
    -ctk f16 -ctv f16 -c 32768 \
    -ngl 999 -ot exps=CPU \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch
```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 8192 | 2048 | 0 | 23.666 | 346.15 | 200.409 | 10.22 |
| 8192 | 2048 | 8192 | 24.225 | 338.16 | 203.324 | 10.07 |
| 8192 | 2048 | 16384 | 25.270 | 324.18 | 207.112 | 9.89 |
| 8192 | 2048 | 24576 | 25.771 | 317.88 | 210.673 | 9.72 |
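Since -ub dominates prefill throughput here, one quick way to find the largest micro-batch that still fits in VRAM is simply to sweep it. A minimal sketch reusing the flags from the runs above (the -ub values are illustrative):

```bash
# Try increasingly large micro-batches until one fails (most likely an OOM).
for ub in 1024 2048 4096 8192; do
  echo "=== -ub $ub ==="
  ./llama-sweep-bench \
      --model "$MODEL_PATH" --no-mmap -ger \
      -b 8192 -ub "$ub" -ctk q8_0 -ctv q8_0 -c 32768 \
      -ngl 999 -ot exps=CPU \
      --threads 16 --threads-batch 28 \
      --warmup-batch || break
done
```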
These runs are inside a VM, with the GPUs and CPU power-limited, so better results should be achievable on the same hardware.
Thank you for the quants, @ubergarm.
Thanks for the numbers on both Blackwell GPUs! Yes, Ling-1T with its ~51B active parameters is definitely chonky and slows down inference compared to the ~37B active on DeepSeek.
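For rough intuition on why (all numbers approximate): with the routed experts in system RAM, TG is mostly bound by how many expert bytes each token has to read. Assuming ~5.5 bits/weight for IQ5_K and ignoring the tensors that stay on the GPU:

```bash
# Back-of-envelope: ~51B active params * ~5.5 bpw, all assumed read from RAM.
echo | awk '{ bytes_per_tok = 51e9 * 5.5 / 8;   # ~35 GB touched per token
              printf "implied RAM traffic at 10.8 t/s: ~%.0f GB/s\n",
                     bytes_per_tok * 10.8 / 1e9 }'
```

That lands around ~380 GB/s, plausible for 12-channel DDR5 on this EPYC, and it is why the ~37B-active DeepSeek generates noticeably faster on the same box.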