Absurd sizes.

#12
by ZeroWw - opened

It's absurd that different quantizations all come out at the same 11 GB size.
I don't see the advantage.
Also:
the quantized versions don't work well in llama.cpp

Unsloth AI org

For quantizing, llama.cpp has limitations atm and I think they're working on fixing it. Then we can make proper quants for it with many different sizes :)

Could you explain what you mean by "they don't work well"? Accuracy, speed?

(Note: I wrote this before I realised there is a more detailed discussion of the same points in https://huggingface.co/unsloth/gpt-oss-20b-GGUF/discussions/2.) Besides, the sizes for fixed bit-width quantisations don't add up: a 20B model at 16 bits should be around 40 GB, and at 8 bits at least 20 GB. Edit: I just read in the other thread that it seems to be generated from an FP4 original. While the size calculations still apply, they could be completely irrelevant if there isn't any more information than 4 bits per parameter anyway (and it is not obvious to me how any quant above 4 bits could make sense, at least not information-wise, though maybe for utilizing specific hardware optimizations).
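
For reference, the naive arithmetic behind those numbers, as a small sketch (it only counts the raw weight payload, so real GGUF files, with metadata, block scales, and higher-precision embedding/output layers, come out somewhat larger):

```python
# Rough GGUF size estimate for a ~20B-parameter model at fixed bit widths.
# Only the weight payload (params * bits / 8) is counted here.
params = 20e9  # assumption: ~20 billion parameters, all quantised uniformly

for bits in (16, 8, 6, 4, 2):
    size_gb = params * bits / 8 / 1e9
    print(f"{bits:>2}-bit: ~{size_gb:.0f} GB")

# 16-bit: ~40 GB
#  8-bit: ~20 GB
#  6-bit: ~15 GB
#  4-bit: ~10 GB
#  2-bit: ~5 GB
```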

Yes, I understand... but the Q2 size should be almost half of the Q4 size, for example.

At first I was disappointed by the small difference in sizes across all the quants. The layers in the original OpenAI files are already mostly in MXFP4 format, so I just went and used the original and didn't bother with the Unsloth GGUF.

Then I decided to give the UD Q6_XL a try. In my case it is awesome compared to the original OpenAI version. I only have a 3060 Ti with 8 GB and an Intel i3-10100 CPU, so saving 1.8 GB of VRAM helps.

I'm able to get 15-16 tokens/sec with a context of 16,384 and 15 layers in VRAM using llama.cpp, compared to 11-12 tokens/sec with a context of 8,196 and 14 layers in VRAM using the original.
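
In case it helps anyone reproduce a similar split, a rough sketch of the kind of invocation I mean (the GGUF filename and prompt are placeholders; `-ngl` is the number of layers offloaded to VRAM, `-c` is the context size):

```python
# Sketch: launching llama.cpp's llama-cli with partial GPU offload,
# roughly matching the settings above (16,384 context, 15 layers in VRAM).
import subprocess

model = "gpt-oss-20b-UD-Q6_K_XL.gguf"  # placeholder path, adjust to your file

subprocess.run([
    "llama-cli",
    "-m", model,
    "-c", "16384",   # context size
    "-ngl", "15",    # layers kept in VRAM
    "-p", "Hello",   # placeholder prompt
])
```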
