Kind of broken.

#7
by ItzPingCat - opened

SOMETHING went wrong in the making of these quants, as Ollama's default quant outperforms all of them.

IK. I am just putting this out so people know.

Wrote about this as well: https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF/discussions/6

Ollama might be better, but still messes up code so easily.

Unsloth AI org

IK. I am just putting this out so people know.

How did you test Ollama's quant? From our tests, it performs very similarly to our quants and LM Studio's.

ollama pull glm-4.7-flash:Q4_K_M

Write a snippet of python code that draws a cute kitty with Matplotlib
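For reference, a working answer to that prompt looks something like the sketch below (one possible cartoon kitty built from basic Matplotlib patches, using the non-interactive Agg backend so it runs headless):

```python
# Minimal sketch of the kind of answer the prompt asks for:
# a cartoon kitty drawn from basic Matplotlib shapes.
import matplotlib
matplotlib.use("Agg")  # headless backend: render without a display
import matplotlib.pyplot as plt
from matplotlib.patches import Circle, Polygon

fig, ax = plt.subplots(figsize=(4, 4))

# Head
ax.add_patch(Circle((0.5, 0.5), 0.3, facecolor="lightgray", edgecolor="black"))
# Ears: two triangles on top of the head
ax.add_patch(Polygon([(0.28, 0.72), (0.33, 0.95), (0.45, 0.76)],
                     facecolor="lightgray", edgecolor="black"))
ax.add_patch(Polygon([(0.72, 0.72), (0.67, 0.95), (0.55, 0.76)],
                     facecolor="lightgray", edgecolor="black"))
# Eyes and nose
ax.add_patch(Circle((0.4, 0.58), 0.03, facecolor="black"))
ax.add_patch(Circle((0.6, 0.58), 0.03, facecolor="black"))
ax.add_patch(Polygon([(0.47, 0.47), (0.53, 0.47), (0.5, 0.42)],
                     facecolor="pink", edgecolor="black"))
# Whiskers: two lines on each side of the face
for y in (0.42, 0.46):
    ax.plot([0.15, 0.35], [y, y + 0.02], color="black", linewidth=1)
    ax.plot([0.65, 0.85], [y + 0.02, y], color="black", linewidth=1)

ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_aspect("equal")
ax.axis("off")
fig.savefig("kitty.png")
```

A healthy quant should produce something structurally similar; the broken quants discussed in this thread tend to loop or emit malformed code on prompts like this.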

Unsloth AI org

Jan 21 UPDATE: llama.cpp has fixed a bug that caused the model to loop and produce poor outputs. We have reconverted and reuploaded the model, so outputs should be much better now.

You can now use Z.ai's recommended parameters and get great results:

  • For general use-case: --temp 1.0 --top-p 0.95
  • For tool-calling: --temp 0.7 --top-p 1.0

If you can, please test and let us know whether you get better results. Thanks so much!

CC: @ItzPingCat @zoyer
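Those two knobs map to standard sampling: temperature rescales the logits before softmax, and top-p (nucleus) sampling keeps only the most probable tokens until their cumulative mass reaches p. A rough pure-Python sketch of the mechanism (an illustration, not llama.cpp's actual implementation):

```python
import math, random

def sample(logits, temperature=1.0, top_p=0.95, rng=random.Random(0)):
    """Sketch of temperature + top-p (nucleus) sampling over a logit list."""
    # Temperature: divide logits before softmax (lower temp -> sharper distribution)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the most probable tokens until their cumulative mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize over the kept set and draw one token id
    kept_mass = sum(probs[i] for i in kept)
    r = rng.random() * kept_mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

token = sample([2.0, 1.0, 0.1, -1.0], temperature=0.7, top_p=1.0)
```

With --temp 0.7 --top-p 1.0 (the tool-calling setting), nothing is truncated and the distribution is sharpened; with --top-p 0.95, the long tail of unlikely tokens is cut off before sampling.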

it still seems schizo

Unsloth AI org

it still seems schizo

What are you running this on? llama.cpp, LM Studio?

Ollama. Here's an image of the new Unsloth IQ3_XXS going off the rails:
image

Now here's an image of Ollama's Q4 solving the same problem, with the exact same config and prompts:

image

Here's an image of schizo thinking in the Unsloth quant: tons of misspellings, weird tokens, etc.

image

Note: this is due to the current chat template compatibility issues.

Unsloth AI org

For now, do not use these GGUFs in Ollama due to compatibility issues. We are working with Ollama to fix the issue.

llama.cpp works fine via ./llama.cpp/llama-cli --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf --special --jinja

llamacpp glm

But using ollama run hf.co/unsloth/GLM-4.7-Flash-GGUF:Q8_0 with the community chat template from https://ollama.com/MichelRosselli/GLM-4.5-Air:BF16/blobs/e683b5dab156 doesn't work:

ollama glm
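The chat template is the crux here: the model is trained to see conversations wrapped in specific special tokens, and a template that emits different markers produces out-of-distribution prompts (hence the garbled "schizo" output). A toy illustration of the failure mode, where the marker strings are made up for the example and are NOT GLM-4.7's real special tokens:

```python
# Toy illustration: the same chat rendered through two different templates
# yields different prompt strings. A model trained on one set of markers
# degrades badly when served the other. The markers below are hypothetical.

def render(messages, bos="<|im_start|>", eos="<|im_end|>"):
    """Render a chat as one prompt string using the given turn markers."""
    parts = []
    for msg in messages:
        parts.append(f"{bos}{msg['role']}\n{msg['content']}{eos}")
    parts.append(f"{bos}assistant\n")  # leave the assistant turn open for generation
    return "".join(parts)

chat = [{"role": "user", "content": "hi"}]
good = render(chat)                              # markers the model expects
bad = render(chat, bos="[INST]", eos="[/INST]")  # markers from a different model family
```

This is why the same GGUF can behave fine under llama.cpp's --jinja (which uses the template embedded in the file) but break in a frontend that substitutes its own template.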

In Ollama, if you edit the config file, changing deepseek2 to
"model_format":"gguf","model_family":"glm4moelite","model_families":["glm4moelite"],"model_type":"29.9B","file_type":"Q4_K_M","renderer":"glm-4.7","parser":"glm-4.7"
you get a nicely formatted answer with a thinking block.
Still struggling with output quality, though: a lot of incorrect answers vs. Ollama's glm-4.7-flash:q4_K_M.

Would prefer to use Unsloth and the new REAP model! Will stay patient.
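The config edit described above can be scripted; a rough sketch, with the caveat that the file path and the keys left untouched are assumptions (locate the actual config blob in your local Ollama store and back it up before editing):

```python
import json

def patch_config(path):
    """Rewrite a model config JSON, swapping the deepseek2 family fields
    for the glm4moelite fields quoted in the comment above.
    Assumption: the file is plain JSON; back it up before editing for real."""
    with open(path) as f:
        cfg = json.load(f)
    cfg.update({
        "model_format": "gguf",
        "model_family": "glm4moelite",
        "model_families": ["glm4moelite"],
        "model_type": "29.9B",
        "file_type": "Q4_K_M",
        "renderer": "glm-4.7",
        "parser": "glm-4.7",
    })
    with open(path, "w") as f:
        json.dump(cfg, f)
    return cfg

# Demo against a scratch file standing in for the real config blob:
with open("config.json", "w") as f:
    json.dump({"model_format": "gguf", "model_family": "deepseek2",
               "model_families": ["deepseek2"]}, f)
patched = patch_config("config.json")
```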
