IQ5_K benchmarks on AMD EPYC with a single GPU (RTX 5090/6000)

#7
by sousekd - opened

Here are some performance tests of the IQ5_K version running on EPYC 9355, RTX 5090, and RTX Pro 6000.
Sharing these benchmarks for anyone thinking about this kind of setup.

  • CPU power capped at 280 W, GPUs at 450 W
  • Running inside a Proxmox/QEMU VM, with only 32 real cores assigned (SMT cores reserved for host)
  • Testing various combinations of context size and batch size that fit in VRAM
```bash
# Flags: -fa enables flash attention; -fmoe fuses MoE ops (ik_llama.cpp);
# -ngl 999 offloads all layers to the GPU, while -ot exps=CPU overrides
# that for the routed-expert tensors, keeping them in system RAM;
# --no-mmap loads weights into RAM instead of memory-mapping the file.
# The model path (-m) and the per-run context/batch flags (-c, -ub)
# are omitted here and vary between the runs below.
./llama-sweep-bench \
    -fa -fmoe \
    -ngl 999 \
    -ot exps=CPU \
    --no-mmap \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch
```

In the tables below, PP and TG are the prompt and generation tokens per step, N_KV is the KV-cache fill at the start of the step, and S_PP/S_TG are the corresponding speeds in tokens per second.

RTX 5090 - 32k @ f16 (-ub 2048)

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 512 | 0 | 7.465 | 274.33 | 29.334 | 17.45 |
| 2048 | 512 | 2048 | 7.546 | 271.39 | 29.636 | 17.28 |
| 2048 | 512 | 4096 | 7.629 | 268.45 | 29.921 | 17.11 |
| 2048 | 512 | 6144 | 7.715 | 265.45 | 30.586 | 16.74 |
| 2048 | 512 | 8192 | 7.808 | 262.29 | 30.479 | 16.80 |
| 2048 | 512 | 10240 | 7.930 | 258.27 | 30.871 | 16.58 |
| 2048 | 512 | 12288 | 7.981 | 256.62 | 31.220 | 16.40 |
| 2048 | 512 | 14336 | 8.061 | 254.07 | 31.598 | 16.20 |
| 2048 | 512 | 16384 | 8.172 | 250.61 | 31.927 | 16.04 |
| 2048 | 512 | 18432 | 8.266 | 247.78 | 32.196 | 15.90 |
| 2048 | 512 | 20480 | 8.101 | 252.82 | 32.558 | 15.73 |
| 2048 | 512 | 22528 | 8.251 | 248.23 | 32.963 | 15.53 |
| 2048 | 512 | 24576 | 8.305 | 246.59 | 34.211 | 14.97 |
| 2048 | 512 | 26624 | 8.456 | 242.20 | 34.201 | 14.97 |
| 2048 | 512 | 28672 | 8.563 | 239.18 | 34.049 | 15.04 |
| 2048 | 512 | 30720 | 8.641 | 237.00 | 34.399 | 14.88 |

RTX 5090 - 32k @ q8_0 (-ub 8192)

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 8192 | 2048 | 0 | 12.104 | 676.80 | 123.036 | 16.65 |
| 8192 | 2048 | 8192 | 13.802 | 593.52 | 131.376 | 15.59 |
| 8192 | 2048 | 16384 | 15.529 | 527.52 | 138.550 | 14.78 |
| 8192 | 2048 | 24576 | 17.297 | 473.62 | 150.579 | 13.60 |

RTX 6000 - 32k @ q8_0 (-ub 16384, -ot "\.(2[6-9]|[3-9][0-9])\.ffn_(gate|up|down)_exps.=CPU")

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 16384 | 4096 | 0 | 16.896 | 969.68 | 219.313 | 18.68 |
| 16384 | 4096 | 16384 | 23.399 | 700.21 | 247.435 | 16.55 |
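
For reference, the -ot regex in this last run keeps the routed-expert tensors of layers 26-99 on the CPU, so only the experts of the first 26 layers are moved to VRAM. A quick way to sanity-check such a pattern against the usual blk.N GGUF tensor names (a sketch, assuming GLM-4.6's 92 layers; not taken from the thread):

```bash
# Count how many of the 92 expert layers the override pattern keeps on CPU
for i in $(seq 0 91); do echo "blk.$i.ffn_up_exps.weight"; done \
  | grep -Ec '\.(2[6-9]|[3-9][0-9])\.ffn_(gate|up|down)_exps\.'
# => 66 (layers 26-91 stay on CPU; layers 0-25 go to the GPU)
```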

RTX 6000 - 192k @ f16 (-ub 8192)

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 8192 | 2048 | 0 | 11.294 | 725.36 | 121.533 | 16.85 |
| 8192 | 2048 | 8192 | 12.679 | 646.12 | 125.605 | 16.31 |
| 8192 | 2048 | 16384 | 13.913 | 588.79 | 132.592 | 15.45 |
| 8192 | 2048 | 24576 | 15.433 | 530.81 | 138.509 | 14.79 |
| 8192 | 2048 | 32768 | 17.222 | 475.67 | 144.749 | 14.15 |
| 8192 | 2048 | 40960 | 19.431 | 421.58 | 151.219 | 13.54 |
| 8192 | 2048 | 49152 | 21.785 | 376.03 | 157.030 | 13.04 |
| 8192 | 2048 | 57344 | 24.127 | 339.54 | 163.342 | 12.54 |
| 8192 | 2048 | 65536 | 26.443 | 309.80 | 171.141 | 11.97 |
| 8192 | 2048 | 73728 | 28.571 | 286.73 | 186.935 | 10.96 |
| 8192 | 2048 | 81920 | 30.809 | 265.90 | 218.675 | 9.37 |
| 8192 | 2048 | 90112 | 33.121 | 247.33 | 245.795 | 8.33 |
| 8192 | 2048 | 98304 | 35.498 | 230.78 | 260.610 | 7.86 |
| 8192 | 2048 | 106496 | 38.019 | 215.47 | 273.867 | 7.48 |
| 8192 | 2048 | 114688 | 40.699 | 201.28 | 285.800 | 7.17 |
| 8192 | 2048 | 122880 | 43.860 | 186.78 | 297.138 | 6.89 |
| 8192 | 2048 | 131072 | 46.158 | 177.48 | 308.663 | 6.64 |
| 8192 | 2048 | 139264 | 47.971 | 170.77 | 319.498 | 6.41 |
| 8192 | 2048 | 147456 | 50.163 | 163.31 | 331.344 | 6.18 |
| 8192 | 2048 | 155648 | 52.666 | 155.55 | 343.088 | 5.97 |
| 8192 | 2048 | 163840 | 55.144 | 148.56 | 354.767 | 5.77 |
| 8192 | 2048 | 172032 | 57.282 | 143.01 | 365.980 | 5.60 |
| 8192 | 2048 | 180224 | 59.690 | 137.24 | 380.191 | 5.39 |
| 8192 | 2048 | 188416 | 62.231 | 131.64 | 389.775 | 5.25 |

I could squeeze out slightly better results by using more threads or lifting the power limits, but it's not really worth the extra heat and noise.
Any feedback, configuration advice, or questions are very welcome!

Thanks for sharing your results! The command looks reasonable, and it's cool you can hit a batch size of 8192 (some folks have trouble with larger batch sizes on their rigs).

> CPU power capped at 280 W, GPUs at 450 W

The only tip I can think of off-hand is to consider not power-capping your GPUs with nvidia-smi -pl 450 or whatever, and instead using the LACT "indirect undervolt" method (assuming Linux), or something like MSI Afterburner on Windoze.

On the Beaver AI Discord, a guy with an RTX 6000 PRO said that by locking the boost clock and setting a 1000 MHz offset (the offset seems to be scaled differently on Blackwell vs older-arch GPUs), the card runs at the same performance and never goes much over 300 W anyway. I have a post about it on Reddit and Level1Techs, and there is a good GitHub issue thread describing it on the LACT repo. It works on headless systems as well and can be persisted across boots. @Panchovix helped spread this method and explain how Blackwell GPUs use something like 10x the offset of earlier models for some reason.

https://github.com/ilya-zlobintsev/LACT/issues/486#issuecomment-3303917115
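
If you want to try the clock-locking half of this with stock NVIDIA tooling before setting up LACT (the offset itself is what LACT or Afterburner handle), here is a minimal sketch, with the GPU index and clock value as placeholders:

```bash
# Pin the GPU core clock to a fixed frequency instead of letting it boost
sudo nvidia-smi -i 0 --lock-gpu-clocks=1800,1800
# Revert to default boost behaviour
sudo nvidia-smi -i 0 --reset-gpu-clocks
```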

Thank you @ubergarm for the tip about LACT, I'll be sure to check it out!

I started power-limiting when some people here stressed me out about my 2200 W PSU, claiming it is not strong enough for 2x CPU, 2x GPU, 4x HDD, and 3x SSD 😀. Although I disagree, I decided that putting some power limits in place might not be a bad idea...

Yeah, higher -ub values work well on my machine; here I tested -ub 16384 to get over 1000 t/s PP.
What I find quite surprising is that offloading a few more expert layers has very little effect... but I guess it is just math - I'd need to offload many more to see a noticeable improvement.

This comment has been hidden (marked as Off-Topic)

Yeah, as @anikifoss points out on the linked thread:

> GLM-4.6 has 92 layers

So each additional routed-expert layer offloaded to VRAM isn't a large overall percentage, especially given only 8 or however many experts are active per token - it's very sparse.
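
A quick back-of-the-envelope check (assuming the routed-expert weights dominate the model size and are spread evenly across layers):

```bash
# With 92 layers, each additional expert layer moved to VRAM shifts
# only about 1/92 of the expert weights off the CPU
echo "scale=2; 100/92" | bc   # => 1.08 (percent per layer)
```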

You can still add a max power cap in addition to LACT if you're really worried about tripping the power supply, but yeah, play around and get your GPUs tuned up one at a time, then save the config into /etc/lact/config.yaml and you'll be gucci!
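
For that extra cap, the same nvidia-smi power-limit call mentioned above works as a backstop (GPU index and wattage are examples):

```bash
# Enable persistence mode so the setting survives the driver idling
sudo nvidia-smi -pm 1
# Hard power cap on top of the LACT undervolt profile
sudo nvidia-smi -i 0 -pl 450
```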

No idea what this was about - it already said this by the time I read it:

> This comment has been hidden (marked as Off-Topic)

> No idea what this was about (...)

HF is acting strange lately... but this was my mistake ;)

> Here are some performance tests of the IQ5_K version running on EPYC 9355, RTX 5090, and RTX Pro 6000. [...]

This is really neat! How much RAM do you have? I assume not everything fits in the 96 GB of VRAM.
