IQ5_K on EPYC with a single GPU

#9 by sousekd

Here are benchmarks of IQ5_K on an EPYC 9355 with a single GPU.
This model is much slower than DeepSeek or Kimi, and fitting a 32K context into 32 GB of VRAM is quite challenging.
I had to use low -ub values, which hurts prefill performance:

RTX 5090

```
./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -ger \
    -b 2048 -ub 1024 \
    -ctk q8_0 -ctv q8_0 -c 32768 \
    -ngl 999 -ot exps=CPU \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch
```
| PP   | TG  | N_KV  | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|----:|------:|-------:|---------:|-------:|---------:|
| 1024 | 256 |     0 | 16.140 |    63.45 | 23.695 |    10.80 |
| 1024 | 256 |  8192 | 15.819 |    64.73 | 24.752 |    10.34 |
| 1024 | 256 | 16384 | 16.287 |    62.87 | 25.449 |    10.06 |
| 1024 | 256 | 24576 | 16.328 |    62.72 | 26.330 |     9.72 |
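
For a sense of where the VRAM goes: the KV cache scales linearly with context length and with bytes per element, which is why the q8_0 cache was needed on the 32 GB card. A back-of-the-envelope sketch, with hypothetical layer/head counts standing in for Ling-1T's actual config (read the real values from the GGUF metadata):

```python
# Rough KV-cache sizing for a GQA-style cache.
# The architecture numbers are HYPOTHETICAL placeholders,
# not Ling-1T's actual config.
n_layers, n_kv_heads, head_dim = 80, 8, 128
ctx = 32768

def kv_cache_gib(bytes_per_elem: float) -> float:
    # K and V each hold n_kv_heads * head_dim values per layer per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * ctx / 2**30

print(f"f16 : {kv_cache_gib(2.0):.1f} GiB")     # -ctk f16  -ctv f16
print(f"q8_0: {kv_cache_gib(1.0625):.1f} GiB")  # -ctk q8_0 -ctv q8_0 (~8.5 bits/elem)
```

On top of the cache, the compute buffers grow with -ub, which is the other reason this card forces -ub 1024 while the RTX 6000 run below can afford -ub 8192.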

VRAM is obviously not an issue with the RTX 6000, which allows -ctk f16 -ctv f16 and larger batch sizes.
Prefill performance improves significantly:

RTX 6000

```
./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -ger \
    -b 8192 -ub 8192 \
    -ctk f16 -ctv f16 -c 32768 \
    -ngl 999 -ot exps=CPU \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch
```
| PP   | TG   | N_KV  | T_PP s | S_PP t/s | T_TG s  | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|--------:|---------:|
| 8192 | 2048 |     0 | 23.666 |   346.15 | 200.409 |    10.22 |
| 8192 | 2048 |  8192 | 24.225 |   338.16 | 203.324 |    10.07 |
| 8192 | 2048 | 16384 | 25.270 |   324.18 | 207.112 |     9.89 |
| 8192 | 2048 | 24576 | 25.771 |   317.88 | 210.673 |     9.72 |
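
In case it helps anyone comparing runs: here is a minimal sketch that summarizes sweep-bench results, assuming the seven-number row format shown in the tables above (it only keys off lines containing exactly seven numbers, so the whole log can be piped through it):

```python
import re
import sys

# Summarize llama-sweep-bench rows (PP TG N_KV T_PP S_PP T_TG S_TG):
# report prefill/decode throughput at the shallowest vs. deepest KV position.
rows = []
for line in sys.stdin:
    nums = re.findall(r"\d+\.?\d*", line)
    if len(nums) == 7:
        _pp, _tg, n_kv, _t_pp, s_pp, _t_tg, s_tg = map(float, nums)
        rows.append((int(n_kv), s_pp, s_tg))

if rows:
    rows.sort()
    (kv0, pp0, tg0), (kv1, pp1, tg1) = rows[0], rows[-1]
    print(f"PP: {pp0:.2f} -> {pp1:.2f} t/s ({100 * (pp1 / pp0 - 1):+.1f}%)")
    print(f"TG: {tg0:.2f} -> {tg1:.2f} t/s ({100 * (tg1 / tg0 - 1):+.1f}%) over N_KV 0..{kv1}")
```

Run as e.g. python3 summarize.py < sweep.log. On the RTX 6000 numbers above it would report prefill dropping roughly 8% and decode roughly 5% as the cache fills to 24576 tokens.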

This is running in a VM, with GPUs and CPU power-limited, so better results are achievable on the same hardware.

Thank you for the quants, @ubergarm.

Thanks for the numbers on both Blackwell GPUs! Yes, Ling-1T with its ~51B active parameters is definitely chonky and slows down inference compared to the ~37B active on DeepSeek.
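
As a rough sanity check on those numbers: with the experts on CPU, decode is approximately memory-bandwidth-bound, so tokens/s is capped by how fast the active weights can be streamed from RAM. A back-of-the-envelope estimate, assuming IQ5_K averages roughly 5.5 bits per weight and using a hypothetical sustained-bandwidth figure for a 12-channel DDR5 EPYC (and ignoring KV reads and the GPU-resident layers):

```python
# Crude bandwidth-bound decode estimate. Both the bits/weight and the
# sustained-bandwidth figures are rough assumptions, not measurements.
active_params = 51e9      # ~51B active parameters per token (Ling-1T)
bits_per_weight = 5.5     # rough IQ5_K average
mem_bw_gb_s = 400         # HYPOTHETICAL sustained DDR5 bandwidth, GB/s

bytes_per_token = active_params * bits_per_weight / 8
print(f"~{mem_bw_gb_s * 1e9 / bytes_per_token:.1f} t/s upper bound")
```

With these placeholder numbers the bound comes out around 11 t/s, consistent with the ~10 t/s measured above and with why the ~37B-active DeepSeek decodes faster on the same box.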
