Support added to ik_llama.cpp main branch now, yay!
## `ik_llama.cpp` imatrix Quantizations of zai-org/GLM-4.5-Air
This quant collection **REQUIRES** the [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support ik's latest SOTA quants and optimizations! Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

*NOTE* `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants.
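For example, once the server is built (see the build steps below), a quick smoke test with a quant you already have might look like this. This is a minimal sketch and the model path is hypothetical:

```bash
# Hypothetical path: any standard GGUF you already have should load
$ ./build/bin/llama-server --model /models/some-mainline-quant-Q4_K_M.gguf
```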
Some of ik's new quants are supported by the [Nexesenex/croco.cpp](https://github.com/Nexesenex/croco.cpp) fork of KoboldCPP, which provides Windows builds for CUDA 12.9. Also check the [Windows builds by Thireus here](https://github.com/Thireus/ik_llama.cpp/releases), which have been built against CUDA 12.8.

These quants provide best-in-class perplexity for the given memory footprint.

If you want to disable thinking, add `/nothink` (correct, no underscore) at the *end* of your prompt.
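For instance, once the API server below is running, a request with thinking disabled might look like this. This is a minimal sketch assuming the server's default port (8080) and its OpenAI-compatible chat endpoint; adjust host, port, and prompt to your setup:

```bash
# Minimal sketch: note /nothink at the very end of the user prompt
$ curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "ubergarm/GLM-4.5-Air-IQ4_KSS",
      "messages": [
        {"role": "user", "content": "Briefly explain imatrix quantization. /nothink"}
      ]
    }'
```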
```bash
# Clone and checkout
$ git clone https://github.com/ikawrakow/ik_llama.cpp
$ cd ik_llama.cpp

# Build for hybrid CPU+CUDA
$ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
$ cmake --build build --config Release -j $(nproc)
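
# (Optional) CPU-only variant: a hedged sketch assuming no CUDA GPU is present;
# it drops the CUDA backend from the same checkout and rebuilds.
$ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=OFF -DGGML_BLAS=OFF
$ cmake --build build --config Release -j $(nproc)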

# Run API server
$ ./build/bin/llama-server \
    --model GLM-4.5-Air-IQ4_KSS-00001-of-00002.gguf \
    --alias ubergarm/GLM-4.5-Air-IQ4_KSS \