Support added to ik_llama.cpp main branch now, yay!
## `ik_llama.cpp` imatrix Quantizations of zai-org/GLM-4.5-Air
This quant collection **REQUIRES** the [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support ik's latest SOTA quants and optimizations! Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

*NOTE* `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants.
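For example, once the server is built (see the build steps below), a quick smoke test with a quant you already have might look like this. This is a minimal sketch and the model path is hypothetical:

```bash
# Hypothetical path: any standard GGUF you already have should load
$ ./build/bin/llama-server --model /models/some-mainline-quant-Q4_K_M.gguf
```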
Some of ik's new quants are supported by the [Nexesenex/croco.cpp](https://github.com/Nexesenex/croco.cpp) fork of KoboldCPP, which provides Windows builds for CUDA 12.9. Also check the [Windows builds by Thireus here](https://github.com/Thireus/ik_llama.cpp/releases), which have been built against CUDA 12.8.

These quants provide best-in-class perplexity for the given memory footprint.

If you want to disable thinking, add `/nothink` (correct, no underscore) at the *end* of your prompt.
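For instance, once the API server below is running, a request with thinking disabled might look like this. This is a minimal sketch assuming the server's default port (8080) and its OpenAI-compatible chat endpoint; adjust host, port, and prompt to your setup:

```bash
# Minimal sketch: note /nothink at the very end of the user prompt
$ curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "ubergarm/GLM-4.5-Air-IQ4_KSS",
      "messages": [
        {"role": "user", "content": "Briefly explain imatrix quantization. /nothink"}
      ]
    }'
```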
```bash
# Clone and checkout
$ git clone https://github.com/ikawrakow/ik_llama.cpp
$ cd ik_llama.cpp

# Build for hybrid CPU+CUDA
$ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
$ cmake --build build --config Release -j $(nproc)
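
# (Optional) CPU-only variant: a hedged sketch assuming no CUDA GPU is present;
# it drops the CUDA backend from the same checkout and rebuilds.
$ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=OFF -DGGML_BLAS=OFF
$ cmake --build build --config Release -j $(nproc)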

# Run API server
$ ./build/bin/llama-server \
    --model GLM-4.5-Air-IQ4_KSS-00001-of-00002.gguf \
    --alias ubergarm/GLM-4.5-Air-IQ4_KSS \