Comparison of PPL and speed vs exl3

#2
by bullerwins - opened

Hi!

Now that some new models that aren't 400B+ are finally coming out and I can fit everything in VRAM, I made some exl3 quants, and for reference I'm leaving my PPL results here. I tried to make the bpw of the exl3 quants as close as possible to uber's.
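For anyone who wants to reproduce the quants, the exllamav3 conversion looks roughly like this (a minimal sketch with placeholder paths; the flag names are from memory, so double-check them against the exllamav3 README):

```bash
# Rough sketch of an exllamav3 conversion run. -i is the source HF model dir,
# -o the output dir, -w a working/scratch dir, -b the target bits per weight.
# Paths are placeholders and flag names should be verified against the repo.
python convert.py \
    -i /path/to/Qwen3-30B-A3B-Instruct-2507 \
    -o /path/to/Qwen3-30B-A3B-Instruct-2507-exl3-4.0bpw \
    -w /path/to/workdir \
    -b 4.0
```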

Disclaimer: both PPL tests use wikitext, but I'm not sure whether the evaluation corpus and formulas are apples to apples between exllamav3's eval/ppl and ik_llama.cpp's, so take these results with a grain of salt.
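For context on how the two numbers are produced, the measurements look roughly like this (a minimal sketch with placeholder paths; the exllamav3 eval script's flags in particular are from memory, so verify against the repo):

```bash
# GGUF perplexity via ik_llama.cpp's llama-perplexity (same interface as
# mainline llama.cpp): wikitext test set, full GPU offload.
./build/bin/llama-perplexity \
    -m Qwen3-30B-A3B-Instruct-2507-IQ3_K.gguf \
    -f wiki.test.raw \
    -ngl 99

# EXL3 perplexity via exllamav3's eval/ppl script (flag names assumed; verify).
python eval/ppl.py -m /path/to/Qwen3-30B-A3B-Instruct-2507-exl3-4.0bpw
```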

For quick reference, the bold GGUF bpw in each row is the closest match from uber's own results taken from the model card:

| EXL3 bpw | Size (EXL3) | EXL3 PPL | GGUF (uber) | GGUF PPL |
|---|---|---|---|---|
| 8.0 | 29G | 7.343258 | **8.510 BPW** | 7.3606 +/- 0.05171 |
| 6.0 | 22G | 7.346819 | **5.999 BPW** | 7.3806 +/- 0.05170 |
| 5.0 | 19G | 7.398807 | **5.030 BPW** | 7.3951 +/- 0.05178 |
| 4.0 | 15G | 7.413587 | **4.082 BPW** | 7.4991 +/- 0.05269 |
| 3.0 | 12G | 7.831189 | **3.240 BPW** | 7.7121 +/- 0.05402 |

PS: Ongoing quants (exl3 takes a long time):
- 2.66bpw
- 4.37bpw
- 3.24bpw

I also made exl3 quants for Qwen3-235B-A22B-Instruct-2507 and I can fit up to 4-4.5bpw, so I may test those next, but each quant takes 24h+ to make.

Quick speed benchmark:

Tested with a backend-agnostic OpenAI API pp/tg benchmark tool at 8192-token prompt and response lengths, both runs from a fresh start without any caching (a rough curl-based sketch of this kind of measurement follows the results below):

Exl3 4.0bpw:

| PromptTok | RespTok | AvgPromptLatency (s) | PromptProcThroughput (tok/s) | AvgGenThroughput (tok/s) |
|---|---|---|---|---|
| 8192 | 8192 | 2.156 | 3799.6 | 65.2 |

Note: 4100 t/s pp and 65.18 t/s tg reported by tabbyAPI.
On a 5090, it takes 54% of VRAM with 24K context at FP16.

GGUF 4.082 BPW:

| PromptTok | RespTok | AvgPromptLatency (s) | PromptProcThroughput (tok/s) | AvgGenThroughput (tok/s) |
|---|---|---|---|---|
| 8192 | 8192 | 1.837 | 4460.2 | 71.6 |

Note: 4477.29 t/s pp and 71.57 t/s tg reported by ik_llama.cpp.
On a 5090, it takes 55% of VRAM using:

```bash
CUDA_VISIBLE_DEVICES="2" ./build/bin/llama-server \
    --model /mnt/llms/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ3_K.gguf \
    -c 24000 \
    -ngl 99 --host 0.0.0.0 --port 5001
```
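For a quick sanity check without the benchmark tool, a single request against either OpenAI-compatible endpoint gives rough numbers (a minimal sketch; it assumes jq is installed, PROMPT_FILE points at a file holding an ~8192-token prompt, and the port matches the llama-server command above, so adjust it for tabbyAPI):

```bash
# Rough single-request throughput check against an OpenAI-compatible server
# (llama-server or tabbyAPI). PROMPT_FILE and the port are placeholders.
START=$(date +%s.%N)
RESP=$(curl -s http://localhost:5001/v1/completions \
  -H "Content-Type: application/json" \
  -d "{\"prompt\": $(jq -Rs . < "$PROMPT_FILE"), \"max_tokens\": 8192}")
END=$(date +%s.%N)

echo "$RESP" | jq '.usage'                        # prompt/completion token counts
echo "wall time: $(echo "$END - $START" | bc) s"  # tokens / wall time = rough tok/s
```

This lumps prompt processing and generation into one wall-clock number, so it's only a rough cross-check rather than a replacement for the per-phase numbers above.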

@bullerwins

Wow, amazing work! That is a lot of effort!

Very interesting to see these comparisons, though you're right that in my own testing with exl3 it may not be exactly apples-to-apples: unfortunately exllamav3 uses llama-cpp-python bindings for its internal evaluation suite, which only work with mainline llama.cpp quants. I have a little more info from my own experiments here: https://github.com/turboderp-org/exllamav3/pull/26#issuecomment-2900663206

I eventually ended up with a crazy graph comparing some mainline llama.cpp quants using the exllamav3 internal eval tool (these are the older Qwen3 models from late May 2025):

(plots: plot-ppl-Qwen3-30B-A3B-exl3.png and plot-kld-Qwen3-30B-A3B-exl3.png, showing PPL and KLD comparisons for Qwen3-30B-A3B)

These graphs suggest to me that EXL3 is likely higher-quality quantization than some commonly available mainline llama.cpp recipes. But yeah, if one can compare apples to oranges (and assuming the same 512 context length; I did confirm the wiki.test.raw text is the same), then ik's new quants are quite competitive at least. EXL3 does seem among the best available for full-GPU-offload situations, for sure!

I'm curious to see how those lower-BPW KT quants work out, given ik's KT quants are a similar QTIP-style trellis quant using Louie's magic number, actually, haha...

Also, thanks for your https://huggingface.co/bullerwins/Wan2.2-I2V-A14B-GGUF; I appreciate you sharing so much good stuff!


Also pinging @ArtusDev here, as they provide a lot of high-quality EXL3 quants!
