Repetitions, repetitions everywhere with top-p: 0.95

by mratsim - opened

Hey, thank you for your models.

I actually spent the past 5 days raising issues on the vLLM and llm-compressor trackers while trying to quantize your v1 (to prepare for the GLM-4.6-Air drop soon™), and right in the middle of quantization you drop v2.

Anyway, I have noticed a weird behavior with the model: if top-p is 0.95, whether temp is 0.8 or 0.9, I very easily get repetition in either thinking or non-thinking mode (on the first reply!).
[screenshot: model output looping on the first reply]

This didn't happen in the v1.

This is fixed by unsetting top-p (or setting it to 1). No action needed on your end; this is just informational in case others report the same to you.

This is when running with vLLM in Chat Completions mode, so I can't do backend reordering like in traditional SillyTavern (in Text Completions mode it's somewhat better).
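For concreteness, here's roughly how I'm hitting the server; the endpoint, served model name, and prompt below are placeholders, but setting top_p to 1.0 (and keeping min_p via extra_body) is what made the repetition go away for me:

```python
# Minimal sketch: assumes a local vLLM OpenAI-compatible server; the endpoint,
# API key and model name are placeholders, not the real deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="iceblink-v2-awq",     # hypothetical served model name
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.8,
    top_p=1.0,                   # effectively unsets top-p, which avoids the repetition
    extra_body={"min_p": 0.05},  # vLLM-specific sampling parameter passed through the OpenAI client
)
print(resp.choices[0].message.content)
```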

It might be that the base model is quite sampler/quantization sensitive; see the discussion here: https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF/discussions/4#68960ee29e427b197ddde54c

I'll try to do other quants like FP8 to see if that happens there as well.

I have heard anecdotally from some others that GLM Air finetunes degrade quite heavily when quanted, although I've mostly tested sizes around Q4KS myself.

I haven't tested this model specifically, but I did run the original Iceblink at top-p ~0.9 and temp 0.8 using chat completions in vLLM and had similar issues. I didn't investigate too much into why at the time (it's surprisingly common for finetunes to degrade under these conditions, although Iceblink did seem particularly bad). I think min_p hides a world of hurt in the low probs.

There are also Steam by TheDrummer and Animus by Darkhn, which are other GLM 4.5 Air finetunes you could potentially test for comparison.

Actually, your Iceblink v1 quantized to AWQ 4-bit with your recommended settings '{"temperature": 0.8, "min_p": 0.05, "top_p": 0.95}' didn't have the issue, which is why I was surprised.
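If anyone wants to reproduce the comparison offline, this is roughly the sampling setup using vLLM's Python API (the model path is a placeholder for whichever quant you're testing; the settings are the recommended v1 ones):

```python
# Sketch only: the local path is hypothetical, the sampling values are the v1 recommendations.
from vllm import LLM, SamplingParams

llm = LLM(model="/models/Iceblink-v1-AWQ")  # placeholder path to the AWQ 4-bit quant

params = SamplingParams(
    temperature=0.8,
    top_p=0.95,   # the value that triggers repetition for me on v2
    min_p=0.05,
    max_tokens=512,
)

outputs = llm.generate(["Tell me about yourself."], params)
print(outputs[0].outputs[0].text)
```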

I quantized GLM-Steam as well and it seemed to work.

That said, the quantization ecosystem around vLLM is being overhauled, and at the time (and still now) there was no easy way to ensure that, for an MoE, all experts are activated by the calibration inputs, so that a seldom-used expert (but critical in a niche scenario) is not misquantized.

I thought I could counteract that with a rich calibration dataset and went overkill. Quoting myself:

and calibrated with over 1600 samples, up to 8192 sequence length of:

According to the AWQ presentation, only 64 samples are needed. However, due to the Mixture-of-Experts topology, this implies that all 127 experts need to see at least 64 samples, or alternatively that we activate all experts during calibration, which requires reimplementing the attention block of the model in llmcompressor's modeling DB.

but it seems like that wasn't enough, and I will have to write code: https://github.com/vllm-project/llm-compressor/blob/0.8.1/examples/quantization_w4a4_fp4/README.md#quantizing-moes

in order to ensure the experts are quantized correctly, we can pass in calibrate_moe_context which temporarily updates the model definition to use Qwen3MoeSparseMoeBlock which updates how the forward pass is handled in the MoE block during calibration.

Code that will soon be deprecated, though :/ https://github.com/vllm-project/llm-compressor/issues/2036
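For reference, this is roughly what that calibration call looks like with llm-compressor's oneshot API; the model and dataset names are placeholders (my real run used a much larger mix), and calibrate_moe_context is the flag from the README quoted above:

```python
# Rough sketch, not my exact script: model/dataset are placeholders and the
# AWQModifier arguments follow the llm-compressor AWQ examples.
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

recipe = AWQModifier(
    targets=["Linear"],
    scheme="W4A16",
    ignore=["lm_head"],  # output head stays unquantized; MoE router/gate layers are typically added here too
)

oneshot(
    model="zai-org/GLM-4.5-Air",   # placeholder for the finetune being quantized
    dataset="open_platypus",       # placeholder: the real calibration set was a richer mix
    recipe=recipe,
    max_seq_length=8192,
    num_calibration_samples=1600,
    calibrate_moe_context=True,    # per the README: temporarily swaps in an MoE block that routes calibration data through all experts
)
```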

Anyway, I'll also add FP8 and NVFP4A16 quantizations when they can be run without loading the full FP16 model in VRAM (220 GB!), as those are calibration-free and so less at risk of MoE issues.
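For the FP8 side the recipe is data-free, something along these lines (model name is a placeholder; the remaining problem is still having to load the full-precision weights somewhere to quantize them):

```python
# Sketch of the calibration-free FP8 path: FP8_DYNAMIC needs no calibration data,
# so the MoE expert-coverage problem doesn't apply. Model name is a placeholder.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",  # static per-channel weight scales, dynamic per-token activation scales
    ignore=["lm_head"],
)

oneshot(model="zai-org/GLM-4.5-Air", recipe=recipe)  # no calibration dataset needed
```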
