IQ5_K on EPYC with a single GPU
#9 opened by sousekd
Here are benchmarks of IQ5_K on EPYC 9355 with a single GPU.
This model is much slower than DeepSeek or Kimi, and fitting 32K of context into 32 GB of VRAM is quite challenging (see the back-of-envelope KV-cache estimate after the first table). I had to use low -ub values, which hurts prefill performance:
RTX 5090
```bash
./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -ger \
    -b 2048 -ub 1024 \
    -ctk q8_0 -ctv q8_0 -c 32768 \
    -ngl 999 -ot exps=CPU \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch
```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 1024 | 256 | 0 | 16.140 | 63.45 | 23.695 | 10.80 |
| 1024 | 256 | 8192 | 15.819 | 64.73 | 24.752 | 10.34 |
| 1024 | 256 | 16384 | 16.287 | 62.87 | 25.449 | 10.06 |
| 1024 | 256 | 24576 | 16.328 | 62.72 | 26.330 | 9.72 |
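To see why 32 GB gets tight, here is a back-of-envelope KV-cache estimate. The architecture numbers below are placeholders, not Ling-1T's actual config (read the real values from the GGUF metadata); the formula is just the generic one for a GQA attention cache:

```bash
# Sketch only: n_layers / n_kv_heads / head_dim are HYPOTHETICAL placeholders,
# not Ling-1T's real values.
n_layers=80 n_kv_heads=8 head_dim=128 n_ctx=32768
bytes_per_elt=2   # f16; q8_0 stores ~1.06 bytes/element, roughly halving this
# K and V each hold n_kv_heads*head_dim elements per layer per token
echo "$n_layers $n_kv_heads $head_dim $n_ctx $bytes_per_elt" |
  awk '{ printf "KV cache: %.1f GiB\n", 2*$1*$2*$3*$4*$5 / 2^30 }'
```

With these placeholder numbers an f16 cache alone would cost ~10 GiB at 32K, on top of the attention and shared tensors that -ngl 999 keeps on the GPU, hence -ctk q8_0 -ctv q8_0 on the 32 GB card.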
VRAM is obviously not an issue on the RTX 6000, which allows -ctk f16 -ctv f16 and much larger batch sizes. Prefill performance improves significantly:
RTX 6000
```bash
./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -ger \
    -b 8192 -ub 8192 \
    -ctk f16 -ctv f16 -c 32768 \
    -ngl 999 -ot exps=CPU \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch
```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 8192 | 2048 | 0 | 23.666 | 346.15 | 200.409 | 10.22 |
| 8192 | 2048 | 8192 | 24.225 | 338.16 | 203.324 | 10.07 |
| 8192 | 2048 | 16384 | 25.270 | 324.18 | 207.112 | 9.89 |
| 8192 | 2048 | 24576 | 25.771 | 317.88 | 210.673 | 9.72 |
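Since -ub dominates prefill throughput here, one quick way to find the largest micro-batch that still fits in VRAM is simply to sweep it. A minimal sketch reusing the flags from the runs above (the -ub values are illustrative):

```bash
# Try increasingly large micro-batches until one fails (most likely an OOM).
for ub in 1024 2048 4096 8192; do
  echo "=== -ub $ub ==="
  ./llama-sweep-bench \
      --model "$MODEL_PATH" --no-mmap -ger \
      -b 8192 -ub "$ub" -ctk q8_0 -ctv q8_0 -c 32768 \
      -ngl 999 -ot exps=CPU \
      --threads 16 --threads-batch 28 \
      --warmup-batch || break
done
```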
These runs are inside a VM, with the GPUs and CPU power-limited, so better results should be achievable on the same hardware.
Thank you for the quants, @ubergarm.
Thanks for the numbers on both Blackwell GPUs! Yes, Ling-1T with its ~51B active parameters is definitely chonky and slows down inference compared to the ~37B active on DeepSeek.
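For rough intuition on why (all numbers approximate): with the routed experts in system RAM, TG is mostly bound by how many expert bytes each token has to read. Assuming ~5.5 bits/weight for IQ5_K and ignoring the tensors that stay on the GPU:

```bash
# Back-of-envelope: ~51B active params * ~5.5 bpw, all assumed read from RAM.
echo | awk '{ bytes_per_tok = 51e9 * 5.5 / 8;   # ~35 GB touched per token
              printf "implied RAM traffic at 10.8 t/s: ~%.0f GB/s\n",
                     bytes_per_tok * 10.8 / 1e9 }'
```

That lands around ~380 GB/s, plausible for 12-channel DDR5 on this EPYC, and it is why the ~37B-active DeepSeek generates noticeably faster on the same box.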