---
quantized_by: AesSedai
pipeline_tag: text-generation
base_model: zai-org/GLM-4.5
license: mit
base_model_relation: quantized
---

## `ik_llama.cpp` imatrix Quantizations of zai-org/GLM-4.5

This quant collection **REQUIRES** the [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support ik's latest SOTA quants and optimizations! Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!
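
If you are new to the fork, building it works much like mainline llama.cpp. Below is a minimal sketch of a CUDA build; the cmake options are assumptions that you should adjust for your own hardware (CPU-only, ROCm, etc.):

```bash
# Minimal sketch: clone and build ik_llama.cpp with CUDA support.
# Adjust the cmake flags for your hardware before building.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j "$(nproc)"
```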

*NOTE* `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants.
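
As a hedged example of serving one of these GGUFs, here is a rough `llama-server` invocation. The model path, context size, and tensor-offload pattern are placeholders, and options like `-fmoe` and `--override-tensor` are ik_llama.cpp-specific, so check `llama-server --help` in your build:

```bash
# Sketch only: serve a GGUF with ik_llama.cpp's llama-server.
# Model filename, context size, and offload settings are illustrative placeholders.
./build/bin/llama-server \
    --model /models/GLM-4.5-IQ2_KT-00001-of-00003.gguf \
    --ctx-size 32768 \
    -ngl 99 \
    -fa -fmoe \
    --override-tensor exps=CPU \
    --threads 16
```

The `--override-tensor exps=CPU` pattern is a common way to keep the routed-expert tensors in system RAM while offloading everything else to the GPU; drop it if the whole model fits in VRAM.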

Some of ik's new quants are supported by the [Nexesenex/croco.cpp](https://github.com/Nexesenex/croco.cpp) fork of KoboldCPP, which ships Windows builds for CUDA 12.9. Also check the [Windows builds by Thireus](https://github.com/Thireus/ik_llama.cpp/releases), which are built against CUDA 12.8.

See [Ubergarm's GLM-4.5 quants](https://huggingface.co/ubergarm/GLM-4.5-GGUF) for info on how to use these recipes or make your own quants.
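
As a rough, hedged sketch of how a recipe like the ones below can be applied: ik_llama.cpp's `llama-quantize` accepts per-tensor overrides via `--custom-q`, a comma-separated list of `regex=type` pairs. The paths, imatrix file, and final fallback type here are placeholders (the recipe text is shortened), so adapt everything to your setup and verify the flags against `llama-quantize --help`:

```bash
# Hedged sketch: apply a per-tensor recipe with ik_llama.cpp's llama-quantize.
# All file paths are placeholders; paste a full recipe from this card into $custom.
custom="
blk\..*\.attn_q.*=iq4_k
blk\..*\.ffn_down_exps\.weight=iq3_kt
"

# Strip comments and blank lines, then join the recipe into one comma-separated list.
custom=$(echo "$custom" | grep -v '^#' | sed -Ez 's:\n+:,:g;s:^,::;s:,$::')

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /models/imatrix-GLM-4.5.dat \
    /models/GLM-4.5-BF16-00001-of-00046.gguf \
    /models/GLM-4.5-IQ2_KT.gguf \
    IQ2_KT 32
```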

## IQ2_KT: 109.269 GiB (2.619 BPW), Final estimate: PPL = 4.1170 +/- 0.02457

<details>

<summary>Recipe</summary>

```bash
# 93 Repeating Layers [0-92]

# Attention
blk\..*\.attn_q.*=iq4_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_ks

# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=iq4_ks
blk\..*\.ffn_(gate|up)\.weight=iq3_ks

# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=iq6_k
blk\..*\.ffn_(gate|up)_shexp\.weight=iq6_k

# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq3_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kt

# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq4_k
blk\..*\.nextn\.shared_head_head\.weight=iq6_k
blk\..*\.nextn\.eh_proj\.weight=iq6_k

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
```

</details>
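
The "Final estimate: PPL" figures in these headings come from a perplexity run. A hedged sketch of reproducing such a number is shown below; the test corpus, context size, and offload settings are assumptions, not necessarily the exact configuration used for the values reported here:

```bash
# Sketch only: measure perplexity with ik_llama.cpp's llama-perplexity.
# Corpus file, context size, and offload settings are illustrative assumptions.
./build/bin/llama-perplexity \
    --model /models/GLM-4.5-IQ2_KT-00001-of-00003.gguf \
    -f wiki.test.raw \
    --ctx-size 512 \
    -ngl 99 --threads 16
```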

## IQ4_KSS: 176.499 GiB (4.231 BPW), Final estimate: PPL = 3.3031 +/- 0.01871

<details>

<summary>Recipe</summary>

```bash
# 93 Repeating Layers [0-92]

# Attention
blk\.(0|1|2)\.attn_q.*=q8_0
blk\.(0|1|2)\.attn_k.*=q8_0
blk\.(0|1|2)\.attn_v.*=q8_0
blk\.(0|1|2)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq6_k

# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=iq5_ks
blk\..*\.ffn_(gate|up)\.weight=iq4_ks

# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
blk\..*\.nextn\.eh_proj\.weight=q8_0

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
```

</details>

## IQ4_KS-IQ4_KS-IQ5_KS: 200.326 GiB (4.802 BPW), Final estimate: PPL = TBD (but better than IQ5_K)

<details>

<summary>Recipe</summary>

```bash
# Default quant level: q8_0 (applies to all tensors not matched below)

# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [3-92]
blk\..*\.ffn_up_exps\.weight=iq4_ks
blk\..*\.ffn_gate_exps\.weight=iq4_ks
blk\..*\.ffn_down_exps\.weight=iq5_ks
```

</details>

## IQ5_K: 204.948 GiB (4.913 BPW), Final estimate: PPL = 3.1992 +/- 0.01801

<details>

<summary>Recipe</summary>

```bash
# 93 Repeating Layers [0-92]

# Attention
blk\.(0|1|2)\.attn_q.*=q8_0
blk\.(0|1|2)\.attn_k.*=q8_0
blk\.(0|1|2)\.attn_v.*=q8_0
blk\.(0|1|2)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq5_k
blk\..*\.attn_v.*=iq5_k
blk\..*\.attn_output.*=iq5_k

# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq5_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_k

# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq5_k
blk\..*\.nextn\.shared_head_head\.weight=iq5_k
blk\..*\.nextn\.eh_proj\.weight=q8_0

# Non-Repeating Layers
token_embd\.weight=q8_0
output\.weight=q8_0
```

</details>