Kind of broken.

#7
by ItzPingCat - opened

SOMETHING went wrong in the making of these quants, as Ollama's default quant outperforms all of them.

IK. I am just putting this out so people know.

Wrote about this as well: https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF/discussions/6

Ollama might be better, but still messes up code so easily.

Unsloth AI org

IK. I am just putting this out so people know.

How did you test Ollama's quant? From our tests, it performs very similarly to our quants and LM Studio's.

ollama pull glm-4.7-flash:Q4_K_M

Write a snippet of python code that draws a cute kitty with Matplotlib
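For reference, a working answer to that prompt looks something like the sketch below (one possible cartoon kitty built from basic Matplotlib patches, using the non-interactive Agg backend so it runs headless):

```python
# Minimal sketch of the kind of answer the prompt asks for:
# a cartoon kitty drawn from basic Matplotlib shapes.
import matplotlib
matplotlib.use("Agg")  # headless backend: render without a display
import matplotlib.pyplot as plt
from matplotlib.patches import Circle, Polygon

fig, ax = plt.subplots(figsize=(4, 4))

# Head
ax.add_patch(Circle((0.5, 0.5), 0.3, facecolor="lightgray", edgecolor="black"))
# Ears: two triangles on top of the head
ax.add_patch(Polygon([(0.28, 0.72), (0.33, 0.95), (0.45, 0.76)],
                     facecolor="lightgray", edgecolor="black"))
ax.add_patch(Polygon([(0.72, 0.72), (0.67, 0.95), (0.55, 0.76)],
                     facecolor="lightgray", edgecolor="black"))
# Eyes and nose
ax.add_patch(Circle((0.4, 0.58), 0.03, facecolor="black"))
ax.add_patch(Circle((0.6, 0.58), 0.03, facecolor="black"))
ax.add_patch(Polygon([(0.47, 0.47), (0.53, 0.47), (0.5, 0.42)],
                     facecolor="pink", edgecolor="black"))
# Whiskers: two lines on each side of the face
for y in (0.42, 0.46):
    ax.plot([0.15, 0.35], [y, y + 0.02], color="black", linewidth=1)
    ax.plot([0.65, 0.85], [y + 0.02, y], color="black", linewidth=1)

ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_aspect("equal")
ax.axis("off")
fig.savefig("kitty.png")
```

A healthy quant should produce something structurally similar; the broken quants discussed in this thread tend to loop or emit malformed code on prompts like this.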

Unsloth AI org

Jan 21 UPDATE: llama.cpp has fixed a bug that caused the model to loop and produce poor outputs. We have reconverted and reuploaded the model, so outputs should be much better now.

You can now use Z.ai's recommended parameters and get great results:

  • For general use-case: --temp 1.0 --top-p 0.95
  • For tool-calling: --temp 0.7 --top-p 1.0

If you can, please test and let us know whether you get better results. Thanks so much!

CC: @ItzPingCat @zoyer
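Those two knobs map to standard sampling: temperature rescales the logits before softmax, and top-p (nucleus) sampling keeps only the most probable tokens until their cumulative mass reaches p. A rough pure-Python sketch of the mechanism (an illustration, not llama.cpp's actual implementation):

```python
import math, random

def sample(logits, temperature=1.0, top_p=0.95, rng=random.Random(0)):
    """Sketch of temperature + top-p (nucleus) sampling over a logit list."""
    # Temperature: divide logits before softmax (lower temp -> sharper distribution)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the most probable tokens until their cumulative mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize over the kept set and draw one token id
    kept_mass = sum(probs[i] for i in kept)
    r = rng.random() * kept_mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

token = sample([2.0, 1.0, 0.1, -1.0], temperature=0.7, top_p=1.0)
```

With --temp 0.7 --top-p 1.0 (the tool-calling setting), nothing is truncated and the distribution is sharpened; with --top-p 0.95, the long tail of unlikely tokens is cut off before sampling.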

it still seems schizo

Unsloth AI org

it still seems schizo

What are you running this on? llama.cpp, LM Studio?

Ollama. Here's an image of the new Unsloth IQ3_XXS going off the rails:
image

Now here's an image of Ollama's Q4 solving the same problem, with the exact same config and prompts:

image

Here's an image of schizo thinking in the Unsloth quant: tons of misspellings, weird tokens, etc.

image

Note: this is due to the current chat template compatibility issues.

Unsloth AI org

For now, do not use these GGUFs in Ollama due to compatibility issues. We are working with Ollama to fix the issue.

llama.cpp works fine via ./llama.cpp/llama-cli --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf --special --jinja

llamacpp glm

But using ollama run hf.co/unsloth/GLM-4.7-Flash-GGUF:Q8_0 with the community chat template from https://ollama.com/MichelRosselli/GLM-4.5-Air:BF16/blobs/e683b5dab156 doesn't work:

ollama glm
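The chat template is the crux here: the model is trained to see conversations wrapped in specific special tokens, and a template that emits different markers produces out-of-distribution prompts (hence the garbled "schizo" output). A toy illustration of the failure mode, where the marker strings are made up for the example and are NOT GLM-4.7's real special tokens:

```python
# Toy illustration: the same chat rendered through two different templates
# yields different prompt strings. A model trained on one set of markers
# degrades badly when served the other. The markers below are hypothetical.

def render(messages, bos="<|im_start|>", eos="<|im_end|>"):
    """Render a chat as one prompt string using the given turn markers."""
    parts = []
    for msg in messages:
        parts.append(f"{bos}{msg['role']}\n{msg['content']}{eos}")
    parts.append(f"{bos}assistant\n")  # leave the assistant turn open for generation
    return "".join(parts)

chat = [{"role": "user", "content": "hi"}]
good = render(chat)                              # markers the model expects
bad = render(chat, bos="[INST]", eos="[/INST]")  # markers from a different model family
```

This is why the same GGUF can behave fine under llama.cpp's --jinja (which uses the template embedded in the file) but break in a frontend that substitutes its own template.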

In Ollama, if you edit the config file, changing deepseek2 to
"model_format":"gguf","model_family":"glm4moelite","model_families":["glm4moelite"],"model_type":"29.9B","file_type":"Q4_K_M","renderer":"glm-4.7","parser":"glm-4.7"
you get a nicely formatted answer with a thinking block.
Still struggling with output quality, though: a lot of incorrect answers vs. Ollama's glm-4.7-flash:q4_K_M.

Would prefer to use Unsloth and the new REAP model! Will stay patient.
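The config edit described above can be scripted; a rough sketch, with the caveat that the file path and the keys left untouched are assumptions (locate the actual config blob in your local Ollama store and back it up before editing):

```python
import json

def patch_config(path):
    """Rewrite a model config JSON, swapping the deepseek2 family fields
    for the glm4moelite fields quoted in the comment above.
    Assumption: the file is plain JSON; back it up before editing for real."""
    with open(path) as f:
        cfg = json.load(f)
    cfg.update({
        "model_format": "gguf",
        "model_family": "glm4moelite",
        "model_families": ["glm4moelite"],
        "model_type": "29.9B",
        "file_type": "Q4_K_M",
        "renderer": "glm-4.7",
        "parser": "glm-4.7",
    })
    with open(path, "w") as f:
        json.dump(cfg, f)
    return cfg

# Demo against a scratch file standing in for the real config blob:
with open("config.json", "w") as f:
    json.dump({"model_format": "gguf", "model_family": "deepseek2",
               "model_families": ["deepseek2"]}, f)
patched = patch_config("config.json")
```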
