Add IQ1_KT (with iq4_nl ffn_down_exps lmao)
- README.md +60 -0
- images/perplexity.png +2 -2
README.md
CHANGED
@@ -347,6 +347,66 @@ numactl -N 0 -m 0 \
 
 </details>
 
+## IQ1_KT 36.039 GiB (2.802 BPW)
+Final estimate: PPL = 5.8214 +/- 0.03767
+
+<details>
+
+<summary>👈 Secret Recipe</summary>
+
+```bash
+#!/usr/bin/env bash
+
+custom="
+# 47 Repeating Layers [0-46]
+# Note: All ffn_down.* layers are not divisible by 256 so have limited quantization options.
+
+# Attention
+blk\..*\.attn_q.*=iq4_kt
+blk\..*\.attn_k.*=iq4_kt
+blk\..*\.attn_v.*=iq4_kt
+blk\..*\.attn_output.*=iq4_kt
+
+# First 1 Dense Layers [0]
+blk\..*\.ffn_down\.weight=iq4_nl
+blk\..*\.ffn_(gate|up)\.weight=iq4_kt
+
+# Shared Expert Layers [1-46]
+blk\..*\.ffn_down_shexp\.weight=iq4_nl
+blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_kt
+
+# Routed Experts Layers [1-46]
+blk\..*\.ffn_down_exps\.weight=iq4_nl
+blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt
+
+# NextN MTP Layer [46]
+blk\..*\.nextn\.embed_tokens\.weight=iq4_kt
+blk\..*\.nextn\.shared_head_head\.weight=iq4_kt
+blk\..*\.nextn\.eh_proj\.weight=q8_0
+
+# Non-Repeating Layers
+token_embd\.weight=iq4_k
+output\.weight=iq6_k
+"
+
+custom=$(
+echo "$custom" | grep -v '^#' | \
+sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+)
+
+numactl -N 1 -m 1 \
+./build/bin/llama-quantize \
+--custom-q "$custom" \
+--imatrix /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/imatrix-GLM-4.5-Air-BF16.dat \
+/mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-128x9.4B-BF16-00001-of-00005.gguf \
+/mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-IQ1_KT.gguf \
+IQ1_KT \
+192
+```
+
+</details>
+
+
 ## Quick Start
 If you want to disable thinking, add `/nothink` (correct, no underscore) at the *end* of your prompt.
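
The `custom=$( ... )` step in the recipe added above drops the comment lines and joins the remaining rules with commas before the result is handed to `--custom-q`. Below is a minimal sketch of just that step, not the commit's full recipe. It assumes GNU `grep` and GNU `sed` (the `-z` flag is a GNU extension); the three-rule recipe and the variable name `flat` are hypothetical and exist only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical shortened recipe, used only to show the flattening step.
custom="
# Attention
blk\..*\.attn_q.*=iq4_kt

# Routed experts
blk\..*\.ffn_down_exps\.weight=iq4_nl
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt
"

# 1. grep -v '^#' drops the comment lines.
# 2. sed -z reads the whole input as one record, so runs of newlines
#    (including the blank separator lines) collapse into single commas.
# 3. The last two substitutions trim the trailing and leading comma.
flat=$(
  printf '%s\n' "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

printf '%s\n' "$flat"
```

Printing the flattened string before running the quantizer is a cheap way to confirm that no comment or blank line leaked into the rule list.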
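
The README heading added in this commit quotes both a file size (36.039 GiB) and an average bits per weight (2.802 BPW). The two figures together imply a weight count, which can be worked out as a sanity check. This is a rough sketch: it assumes the size is in binary gibibytes and that BPW is averaged over every tensor in the file. The variable names are illustrative.

```shell
#!/usr/bin/env bash
# Figures quoted in the README heading for this quant.
size_gib=36.039
bpw=2.802

# bits = GiB * 2^30 bytes/GiB * 8 bits/byte; weights = bits / BPW.
# The result is expressed in billions of weights.
weights_b=$(awk -v s="$size_gib" -v b="$bpw" \
  'BEGIN { printf "%.1f", s * 2^30 * 8 / b / 1e9 }')

echo "implied weight count: ~${weights_b} billion"
```

This works out to roughly 110 billion weights, which can be compared against the model you expected to quantize.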
images/perplexity.png
CHANGED