Kind of broken.
SOMETHING went wrong in the making of these quants, as Ollama's default quant outperforms all of them.
For now, do not use these GGUFs in Ollama due to compatibility issues. We are working with Ollama to fix the issue.
llama.cpp works fine via ./llama.cpp/llama-cli --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf --special --jinja
But running ollama run hf.co/unsloth/GLM-4.7-Flash-GGUF:Q8_0 with the community chat template from https://ollama.com/MichelRosselli/GLM-4.5-Air:BF16/blobs/e683b5dab156 doesn't work.
IK. I am just putting this out so people know.
wrote as well: https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF/discussions/6
Ollama might be better, but still messes up code so easily.
How did you test Ollama's quant? In our tests it performs very similarly to our quants and LM Studio's.
ollama pull glm-4.7-flash:Q4_K_M
Write a snippet of python code that draws a cute kitty with Matplotlib
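For reference, a minimal correct answer to that test prompt could look something like the sketch below (this is an illustrative baseline, not the model's actual output):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
from matplotlib.patches import Circle, Polygon

fig, ax = plt.subplots(figsize=(4, 4))

# Face
ax.add_patch(Circle((0, 0), 1.0, facecolor="lightgray", edgecolor="black"))
# Ears (two triangles)
ax.add_patch(Polygon([(-0.9, 0.5), (-0.4, 0.95), (-1.0, 1.2)],
                     facecolor="lightgray", edgecolor="black"))
ax.add_patch(Polygon([(0.9, 0.5), (0.4, 0.95), (1.0, 1.2)],
                     facecolor="lightgray", edgecolor="black"))
# Eyes
ax.add_patch(Circle((-0.35, 0.2), 0.1, facecolor="black"))
ax.add_patch(Circle((0.35, 0.2), 0.1, facecolor="black"))
# Nose
ax.add_patch(Polygon([(-0.1, -0.1), (0.1, -0.1), (0, -0.25)], facecolor="pink"))
# Whiskers
for y in (-0.15, -0.25, -0.35):
    ax.plot([-0.2, -1.2], [y, y], color="black", linewidth=0.8)
    ax.plot([0.2, 1.2], [y, y], color="black", linewidth=0.8)

ax.set_xlim(-1.5, 1.5)
ax.set_ylim(-1.5, 1.5)
ax.set_aspect("equal")
ax.axis("off")
fig.savefig("kitty.png")
```

A quant that is working well should produce something at least this coherent; the broken ones tend to loop or emit syntactically invalid code on prompts like this.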
Jan 21 UPDATE: llama.cpp has fixed a bug which caused the model to loop and produce poor outputs. We have reconverted and reuploaded the model so outputs should be much much better now.
You can now use Z.ai's recommended parameters and get great results:
- For general use-case: --temp 1.0 --top-p 0.95
- For tool-calling: --temp 0.7 --top-p 1.0
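Combined with the llama-cli invocation earlier in the thread, the general use-case settings would be passed like this (same model path as above; adjust to where your GGUF lives):

```shell
# General use-case sampling: temp 1.0, top-p 0.95
./llama.cpp/llama-cli \
  --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
  --special --jinja \
  --temp 1.0 --top-p 0.95
```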
Could you test and let us know if you get better results? Thanks so much!
CC: @ItzPingCat @zoyer
It still seems schizo.
What are you running this on? llama.cpp, LM Studio?
Note: this is due to current chat template compatibility issues.
In Ollama, if you edit the model's config file, changing the deepseek2 family to
"model_format":"gguf","model_family":"glm4moelite","model_families":["glm4moelite"],"model_type":"29.9B","file_type":"Q4_K_M","renderer":"glm-4.7","parser":"glm-4.7"
you get a nicely formatted answer with a thinking block.
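That edit can be scripted. A rough sketch, assuming the config is a JSON file you have already located under your Ollama models directory (the demo path below is a placeholder, not Ollama's actual layout):

```python
import json
from pathlib import Path

def patch_config(config_path: Path) -> None:
    """Rewrite an Ollama model config to use the glm4moelite family
    and the glm-4.7 renderer/parser, as described above."""
    config = json.loads(config_path.read_text())
    config["model_family"] = "glm4moelite"
    config["model_families"] = ["glm4moelite"]
    config["renderer"] = "glm-4.7"
    config["parser"] = "glm-4.7"
    config_path.write_text(json.dumps(config))

# Demo on a throwaway copy; point this at the real config file
# (location varies by Ollama install) to apply the fix for real.
demo = Path("demo-config.json")
demo.write_text(json.dumps({"model_format": "gguf",
                            "model_family": "deepseek2"}))
patch_config(demo)
```

Back up the original config first, as Ollama may not validate the edited fields.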
Still struggling with output quality though: a lot of incorrect answers compared to Ollama's glm-4.7-flash:Q4_K_M.
Would prefer to use Unsloth and the new REAP model! Will stay patient.




