IQ5_K benchmarks on AMD EPYC with a single GPU (RTX 5090/6000)
Here are some performance tests of the IQ5_K version running on EPYC 9355, RTX 5090, and RTX Pro 6000.
Sharing these benchmarks for anyone thinking about this kind of setup.
- CPU power capped at 280 W, GPUs at 450 W
- Running inside a Proxmox/QEMU VM, with only 32 real cores assigned (SMT cores reserved for host)
- Testing various combinations of context size and batch size fitting the VRAM
```
./llama-sweep-bench \
    -fa -fmoe \
    -ngl 999 \
    -ot exps=CPU \
    --no-mmap \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch
```
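For picking context size and cache type combinations that fit in VRAM, a rough KV-cache size estimate helps. This is a generic sketch of the usual formula; the `n_kv_heads` and `head_dim` values below are placeholders, not GLM-4.6's actual config (only the 92-layer count comes from this thread):

```python
# Rough KV-cache VRAM estimate (a sketch; n_kv_heads/head_dim are
# hypothetical placeholders, NOT GLM-4.6's real dimensions).
def kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elt):
    # factor 2 = one K cache and one V cache per layer
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elt

F16 = 2.0        # 2 bytes per element
Q8_0 = 34 / 32   # q8_0 block: 32 int8 values + one fp16 scale per 32 elements

gib = 1024 ** 3
for name, bpe in [("f16", F16), ("q8_0", Q8_0)]:
    size = kv_cache_bytes(ctx=32768, n_layers=92, n_kv_heads=8,
                          head_dim=128, bytes_per_elt=bpe)
    print(f"32k ctx @ {name}: ~{size / gib:.1f} GiB")
```

With these placeholder dims, q8_0 roughly halves the cache footprint vs f16, which is why it buys headroom for larger `-ub` values.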
**RTX 5090 - 32k @ f16 (`-ub 2048`)**
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 2048 | 512 | 0 | 7.465 | 274.33 | 29.334 | 17.45 |
| 2048 | 512 | 2048 | 7.546 | 271.39 | 29.636 | 17.28 |
| 2048 | 512 | 4096 | 7.629 | 268.45 | 29.921 | 17.11 |
| 2048 | 512 | 6144 | 7.715 | 265.45 | 30.586 | 16.74 |
| 2048 | 512 | 8192 | 7.808 | 262.29 | 30.479 | 16.80 |
| 2048 | 512 | 10240 | 7.930 | 258.27 | 30.871 | 16.58 |
| 2048 | 512 | 12288 | 7.981 | 256.62 | 31.220 | 16.40 |
| 2048 | 512 | 14336 | 8.061 | 254.07 | 31.598 | 16.20 |
| 2048 | 512 | 16384 | 8.172 | 250.61 | 31.927 | 16.04 |
| 2048 | 512 | 18432 | 8.266 | 247.78 | 32.196 | 15.90 |
| 2048 | 512 | 20480 | 8.101 | 252.82 | 32.558 | 15.73 |
| 2048 | 512 | 22528 | 8.251 | 248.23 | 32.963 | 15.53 |
| 2048 | 512 | 24576 | 8.305 | 246.59 | 34.211 | 14.97 |
| 2048 | 512 | 26624 | 8.456 | 242.20 | 34.201 | 14.97 |
| 2048 | 512 | 28672 | 8.563 | 239.18 | 34.049 | 15.04 |
| 2048 | 512 | 30720 | 8.641 | 237.00 | 34.399 | 14.88 |
**RTX 5090 - 32k @ q8_0 (`-ub 8192`)**
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 8192 | 2048 | 0 | 12.104 | 676.80 | 123.036 | 16.65 |
| 8192 | 2048 | 8192 | 13.802 | 593.52 | 131.376 | 15.59 |
| 8192 | 2048 | 16384 | 15.529 | 527.52 | 138.550 | 14.78 |
| 8192 | 2048 | 24576 | 17.297 | 473.62 | 150.579 | 13.60 |
**RTX 6000 - 32k @ q8_0 (`-ub 16384`, `-ot "\.(2[6-9]|[3-9][0-9])\.ffn_(gate|up|down)_exps.=CPU"`)**
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 16384 | 4096 | 0 | 16.896 | 969.68 | 219.313 | 18.68 |
| 16384 | 4096 | 16384 | 23.399 | 700.21 | 247.435 | 16.55 |
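For anyone wondering what that `-ot` regex actually does: it keeps the routed-expert FFN tensors of layers 26-99 on the CPU, while earlier expert layers go to VRAM. A quick sketch to check which tensor names it matches (the `blk.N....weight` naming is the usual GGUF convention):

```python
import re

# The tensor-override pattern from the RTX 6000 run above:
# offload expert FFN tensors of layers 26-99 to CPU.
pattern = re.compile(r"\.(2[6-9]|[3-9][0-9])\.ffn_(gate|up|down)_exps.")

for name in ["blk.10.ffn_up_exps.weight",    # layer 10: not matched, stays on GPU
             "blk.26.ffn_gate_exps.weight",  # layer 26: matched, goes to CPU
             "blk.91.ffn_down_exps.weight"]: # layer 91: matched, goes to CPU
    dest = "CPU" if pattern.search(name) else "GPU"
    print(f"{name} -> {dest}")
```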
**RTX 6000 - 192k @ f16 (`-ub 8192`)**
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 8192 | 2048 | 0 | 11.294 | 725.36 | 121.533 | 16.85 |
| 8192 | 2048 | 8192 | 12.679 | 646.12 | 125.605 | 16.31 |
| 8192 | 2048 | 16384 | 13.913 | 588.79 | 132.592 | 15.45 |
| 8192 | 2048 | 24576 | 15.433 | 530.81 | 138.509 | 14.79 |
| 8192 | 2048 | 32768 | 17.222 | 475.67 | 144.749 | 14.15 |
| 8192 | 2048 | 40960 | 19.431 | 421.58 | 151.219 | 13.54 |
| 8192 | 2048 | 49152 | 21.785 | 376.03 | 157.030 | 13.04 |
| 8192 | 2048 | 57344 | 24.127 | 339.54 | 163.342 | 12.54 |
| 8192 | 2048 | 65536 | 26.443 | 309.80 | 171.141 | 11.97 |
| 8192 | 2048 | 73728 | 28.571 | 286.73 | 186.935 | 10.96 |
| 8192 | 2048 | 81920 | 30.809 | 265.90 | 218.675 | 9.37 |
| 8192 | 2048 | 90112 | 33.121 | 247.33 | 245.795 | 8.33 |
| 8192 | 2048 | 98304 | 35.498 | 230.78 | 260.610 | 7.86 |
| 8192 | 2048 | 106496 | 38.019 | 215.47 | 273.867 | 7.48 |
| 8192 | 2048 | 114688 | 40.699 | 201.28 | 285.800 | 7.17 |
| 8192 | 2048 | 122880 | 43.860 | 186.78 | 297.138 | 6.89 |
| 8192 | 2048 | 131072 | 46.158 | 177.48 | 308.663 | 6.64 |
| 8192 | 2048 | 139264 | 47.971 | 170.77 | 319.498 | 6.41 |
| 8192 | 2048 | 147456 | 50.163 | 163.31 | 331.344 | 6.18 |
| 8192 | 2048 | 155648 | 52.666 | 155.55 | 343.088 | 5.97 |
| 8192 | 2048 | 163840 | 55.144 | 148.56 | 354.767 | 5.77 |
| 8192 | 2048 | 172032 | 57.282 | 143.01 | 365.980 | 5.60 |
| 8192 | 2048 | 180224 | 59.690 | 137.24 | 380.191 | 5.39 |
| 8192 | 2048 | 188416 | 62.231 | 131.64 | 389.775 | 5.25 |
I could squeeze out slightly better results by using more threads or lifting the power limits, but it's not really worth the extra heat and noise.
Any feedback, configuration advice, or questions are very welcome!
Thanks for sharing your results! The command looks reasonable, and it's cool that you can hit 8192 batch sizes (some folks have trouble with larger batches on their rigs).
> CPU power capped at 280 W, GPUs at 450 W
The only tip I can think of off-hand is to consider not power-capping your GPUs with `nvidia-smi -pl 450` (or whatever), and instead using the LACT "indirect undervolt" method on Linux, or something like MSI Afterburner on Windows.
On the Beaver AI Discord, a guy with an RTX 6000 PRO said that by locking the boost clock and setting a 1000 MHz offset (the offset seems to be scaled differently on Blackwell vs. older-architecture GPUs), the card runs at the same performance and never goes much over 300 W anyway. I have posts about it on Reddit and l1t, and there is a good GitHub issue thread describing it on the LACT repo. It works on headless systems as well and can be persisted across boots. @Panchovix helped spread this method and explain how Blackwell GPUs use something like 10x the offset of earlier models for some reason.
https://github.com/ilya-zlobintsev/LACT/issues/486#issuecomment-3303917115
Thank you @ubergarm for the tip about LACT, I'll definitely check it out!
I started power-limiting when some people here stressed me out about my 2200 W PSU, claiming it is not strong enough for 2x CPU, 2x GPU, 4x HDD, and 3x SSD 😀. Although I disagree, I decided that putting some power limits in place might not be a bad idea...
Yeah, higher `-ub` values work well on my machine. Here I tested `-ub 16384` to get over 1000 t/s PP.
What I find quite surprising is that offloading some/more expert layers to VRAM has very little effect... but I guess it is just math: I'd need to offload much more to see a noticeable improvement.
Yeah, as @anikifoss points out on the linked thread:
> GLM-4.6 has 92 layers

So each additional routed-expert layer offloaded to VRAM isn't a large overall percentage, especially given only 8 (or however many) experts are active per token - it's very sparse.
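A back-of-envelope sketch of that math. It assumes TG is bound purely by CPU memory bandwidth over the expert weights that remain on CPU (a simplification that ignores attention, the dense layers, and the GPU's share of the work):

```python
# Why offloading a handful of expert layers barely moves TG:
# best-case speedup is just the shrinkage of the CPU-resident fraction.
# Assumes TG is purely CPU-bandwidth-bound over expert weights (simplified).
total_expert_layers = 92  # GLM-4.6 layer count, from the thread

for offloaded in [0, 5, 10, 20]:
    frac_on_cpu = (total_expert_layers - offloaded) / total_expert_layers
    best_case_speedup = 1 / frac_on_cpu
    print(f"offload {offloaded:2d}/{total_expert_layers} expert layers -> "
          f"best-case TG speedup ~{best_case_speedup:.2f}x")
```

So even offloading 10 extra expert layers only promises roughly a 12% TG gain in the best case, which matches the "very little effect" observation above.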
You can still add a max power cap in addition to LACT if you're really worried about tripping the power supply, but yeah, play around and get your GPUs tuned up one at a time, then save the config into `/etc/lact/config.yaml` and you'll be gucci!
No idea what this was about; it already said this by the time I read it:
> This comment has been hidden (marked as Off-Topic)
> No idea what this was about (...)
HF is acting strange lately... but this was my mistake ;)
> Here are some performance tests of the IQ5_K version running on EPYC 9355, RTX 5090, and RTX Pro 6000.
> Sharing these benchmarks for anyone thinking about this kind of setup.
> - CPU power capped at 280 W, GPUs at 450 W
> - Running inside a Proxmox/QEMU VM, with only 32 real cores assigned (SMT cores reserved for host)
> - Testing various combinations of context size and batch size fitting the VRAM
This is really neat! How much RAM do you have? I assume not everything fits in the 96 GB of VRAM.