---
quantized_by: AesSedai
pipeline_tag: text-generation
base_model: zai-org/GLM-4.5
license: mit
base_model_relation: quantized
---

## `ik_llama.cpp` imatrix Quantizations of zai-org/GLM-4.5

This quant collection **REQUIRES** the [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support ik's latest SOTA quants and optimizations! Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!
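
If you are new to the fork, building it works much like mainline llama.cpp. Below is a minimal sketch of a CUDA build; the cmake options are assumptions that you should adjust for your own hardware (CPU-only, ROCm, etc.):

```bash
# Minimal sketch: clone and build ik_llama.cpp with CUDA support.
# Adjust the cmake flags for your hardware before building.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j "$(nproc)"
```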

*NOTE* `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants.
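
As a hedged example of serving one of these GGUFs, here is a rough `llama-server` invocation. The model path, context size, and tensor-offload pattern are placeholders, and options like `-fmoe` and `--override-tensor` are ik_llama.cpp-specific, so check `llama-server --help` in your build:

```bash
# Sketch only: serve a GGUF with ik_llama.cpp's llama-server.
# Model filename, context size, and offload settings are illustrative placeholders.
./build/bin/llama-server \
    --model /models/GLM-4.5-IQ2_KT-00001-of-00003.gguf \
    --ctx-size 32768 \
    -ngl 99 \
    -fa -fmoe \
    --override-tensor exps=CPU \
    --threads 16
```

The `--override-tensor exps=CPU` pattern is a common way to keep the routed-expert tensors in system RAM while offloading everything else to the GPU; drop it if the whole model fits in VRAM.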

Some of ik's new quants are supported by the [Nexesenex/croco.cpp](https://github.com/Nexesenex/croco.cpp) fork of KoboldCPP, which ships Windows builds for CUDA 12.9. Also check the [Windows builds by Thireus](https://github.com/Thireus/ik_llama.cpp/releases), which are built against CUDA 12.8.

See [Ubergarm's GLM-4.5 quants](https://huggingface.co/ubergarm/GLM-4.5-GGUF) for info on how to use these recipes or make your own quants.
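
As a rough, hedged sketch of how a recipe like the ones below can be applied: ik_llama.cpp's `llama-quantize` accepts per-tensor overrides via `--custom-q`, a comma-separated list of `regex=type` pairs. The paths, imatrix file, and final fallback type here are placeholders (the recipe text is shortened), so adapt everything to your setup and verify the flags against `llama-quantize --help`:

```bash
# Hedged sketch: apply a per-tensor recipe with ik_llama.cpp's llama-quantize.
# All file paths are placeholders; paste a full recipe from this card into $custom.
custom="
blk\..*\.attn_q.*=iq4_k
blk\..*\.ffn_down_exps\.weight=iq3_kt
"

# Strip comments and blank lines, then join the recipe into one comma-separated list.
custom=$(echo "$custom" | grep -v '^#' | sed -Ez 's:\n+:,:g;s:^,::;s:,$::')

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /models/imatrix-GLM-4.5.dat \
    /models/GLM-4.5-BF16-00001-of-00046.gguf \
    /models/GLM-4.5-IQ2_KT.gguf \
    IQ2_KT 32
```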

## IQ2_KT: 109.269 GiB (2.619 BPW), Final estimate: PPL = 4.1170 +/- 0.02457

<details>

<summary>Recipe</summary>

```bash
# 93 Repeating Layers [0-92]

# Attention
blk\..*\.attn_q.*=iq4_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_ks

# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=iq4_ks
blk\..*\.ffn_(gate|up)\.weight=iq3_ks

# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=iq6_k
blk\..*\.ffn_(gate|up)_shexp\.weight=iq6_k

# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq3_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kt

# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq4_k
blk\..*\.nextn\.shared_head_head\.weight=iq6_k
blk\..*\.nextn\.eh_proj\.weight=iq6_k

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
```

</details>
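
The "Final estimate: PPL" figures in these headings come from a perplexity run. A hedged sketch of reproducing such a number is shown below; the test corpus, context size, and offload settings are assumptions, not necessarily the exact configuration used for the values reported here:

```bash
# Sketch only: measure perplexity with ik_llama.cpp's llama-perplexity.
# Corpus file, context size, and offload settings are illustrative assumptions.
./build/bin/llama-perplexity \
    --model /models/GLM-4.5-IQ2_KT-00001-of-00003.gguf \
    -f wiki.test.raw \
    --ctx-size 512 \
    -ngl 99 --threads 16
```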

## IQ4_KSS: 176.499 GiB (4.231 BPW), Final estimate: PPL = 3.3031 +/- 0.01871

<details>

<summary>Recipe</summary>

```bash
# 93 Repeating Layers [0-92]

# Attention
blk\.(0|1|2)\.attn_q.*=q8_0
blk\.(0|1|2)\.attn_k.*=q8_0
blk\.(0|1|2)\.attn_v.*=q8_0
blk\.(0|1|2)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq6_k

# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=iq5_ks
blk\..*\.ffn_(gate|up)\.weight=iq4_ks

# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
blk\..*\.nextn\.eh_proj\.weight=q8_0

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
```

</details>

## IQ4_KS-IQ4_KS-IQ5_KS: 200.326 GiB (4.802 BPW), Final estimate: PPL = TBD (but better than IQ5_K)

<details>

<summary>Recipe</summary>

```bash
# Default quant level: q8_0 (applies to all tensors not matched below)

# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [3-92]
blk\..*\.ffn_up_exps\.weight=iq4_ks
blk\..*\.ffn_gate_exps\.weight=iq4_ks
blk\..*\.ffn_down_exps\.weight=iq5_ks
```

</details>

## IQ5_K: 204.948 GiB (4.913 BPW), Final estimate: PPL = 3.1992 +/- 0.01801

<details>

<summary>Recipe</summary>

```bash
# 93 Repeating Layers [0-92]

# Attention
blk\.(0|1|2)\.attn_q.*=q8_0
blk\.(0|1|2)\.attn_k.*=q8_0
blk\.(0|1|2)\.attn_v.*=q8_0
blk\.(0|1|2)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq5_k
blk\..*\.attn_v.*=iq5_k
blk\..*\.attn_output.*=iq5_k

# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq5_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_k

# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq5_k
blk\..*\.nextn\.shared_head_head\.weight=iq5_k
blk\..*\.nextn\.eh_proj\.weight=q8_0

# Non-Repeating Layers
token_embd\.weight=q8_0
output\.weight=q8_0
```

</details>