Python script to decompress tensors?
Can you provide a Python script to decompress the safetensors back to normal bf16 format?
Ideally the script should run on the CPU backend and require as little RAM as possible, iterating across the tensors lazily and saving output safetensors incrementally (to avoid loading the whole model into RAM).
The model card says:
The checkpoints are saved in compressed-tensors format, supported by most of mainstream inference engine. If you need the checkpoints in higher precision such as FP8 or BF16, you can refer to official repo of compressed-tensors to unpack the int4 weights and convert to any higher precision.
However, llama.cpp (and by extension ollama) is one of the most mainstream inference engines, and it is not currently possible to quantize this model until it is properly decompressed into a normal bf16 safetensor format like your previous models.
Thanks!
In theory, assuming everything in transformers and compressed-tensors is nice and lazy and efficient, it could be straightforward, something like:
import torch
from transformers import AutoModelForCausalLM
# first download compressed safetensors with hf cli e.g.
# $ hf download --local-dir ./Kimi-K2-Thinking moonshotai/Kimi-K2-Thinking
input_dir = "/mnt/data/models/moonshotai/Kimi-K2-Thinking/"
model = AutoModelForCausalLM.from_pretrained(input_dir, dtype=torch.bfloat16, trust_remote_code=True)
print("Half way there!")
# make sure you have ~2TB of disk space free for this
# and pre-make the output dir just in case e.g.
# $ mkdir -p /mnt/data/models/moonshotai/Kimi-K2-Thinking-bf16-safetensors/
output_dir = "/mnt/data/models/moonshotai/Kimi-K2-Thinking-bf16-safetensors/"
model.save_pretrained(output_dir)
However, it says "Compressing" and not "Decompressing" when starting?? It also failed on some smaller test models I found on Hugging Face that use compressed-tensors.
Here is the current situation, and I expect it to fail after a while:
$ uv pip freeze | grep -P '(torch|trans|compress)'
Using Python 3.12.11 environment at: ~/projects/llama.cpp/venv
compressed-tensors==0.12.2
torch==2.9.0
transformers==4.57.1
$ python decompress-kimi-k2-thinking.py
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
You are using a model of type kimi_k2 to instantiate a model of type deepseek_v3. This is not supported for all configurations of models and can yield errors.
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
Compressing model: 0it [00:00, ?it/s] ~/projects/llama.cpp/venv/lib/python3.12/site-packages/torch/cuda/__init__.py:827: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
Compressing model: 7270it [04:34, 25.38it/s]
Thanks to anyone who has figured this out!
I've also opened an issue with compressed-tensors, asking for an example of just decompressing models into normal bf16 safetensors, here: https://github.com/vllm-project/compressed-tensors/issues/511
You can also try my dequant_compressed_tensor function from https://github.com/ggml-org/llama.cpp/pull/17064
You just need to provide the correct _packed and _scale tensors; the _shape tensor is unused for now.
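For reference, here is a minimal sketch of pulling a matching _packed/_scale pair out of a shard with safetensors. The shard filename is just an example (pick one that actually contains quantized layers), the .weight_packed/.weight_scale suffixes follow the compressed-tensors layout this model uses, and the commented-out call is only a placeholder, so check the PR for the actual dequant_compressed_tensor signature:
import torch
from safetensors import safe_open

# example shard name; use any shard that actually contains quantized layers
shard = "Kimi-K2-Thinking/model-00002-of-000062.safetensors"

with safe_open(shard, framework="pt", device="cpu") as f:
    # grab the first compressed weight in this shard
    packed_name = next(k for k in f.keys() if k.endswith(".weight_packed"))
    packed = f.get_tensor(packed_name)  # int32 words, 8 int4 nibbles per element
    scale = f.get_tensor(packed_name.replace(".weight_packed", ".weight_scale"))  # per-group scales
    print(packed_name, packed.shape, packed.dtype, scale.shape, scale.dtype)
    # weight_bf16 = dequant_compressed_tensor(packed, scale)  # placeholder; see the PR for the exact signature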
ggerganov should stop being lazy and just add INT4 support. FP8 should also have been added a long time ago, fuck converting everything into a big-ass bf16 just to quant it down again anyway.
Well thanks to the llama.cpp team there is a GGUF available for testing:
https://www.reddit.com/r/LocalLLaMA/comments/1oqo57j/ubergarmkimik2thinkinggguf_hugging_face/
Great job ngxson, compilade, DevQuasar, Bartowski, AesSedai, and all the other folks who pulled together hacking on this one today! 🫶
@ubergarm Re BF16 - I think I might have gotten BF16 conversion to work - https://huggingface.co/unsloth/Kimi-K2-Thinking-BF16, but I'm not fully sure :)
It took ages to convert, but the following works, I think (you need to have 2.5 TB of RAM):
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.utils.quantization_config import CompressedTensorsConfig

model_name = "Kimi-K2-Thinking"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="cpu",
    local_files_only=True,
    trust_remote_code=True,
    quantization_config=CompressedTensorsConfig(run_compressed=False),
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model.save_pretrained("uncompressed")
tokenizer.save_pretrained("uncompressed")
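One note on the save step: transformers shards the output at 5GB per file by default. If you would rather have bigger shards, save_pretrained accepts the standard max_shard_size argument, for example:
# optional: write bigger shards instead of the default 5GB ones
model.save_pretrained("uncompressed", max_shard_size="50GB")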
I think I've made it work an alternative way with my own dequantize script (it does NOT require 2.5 TB of RAM).
Uploading experimental GGUF for test here:
DevQuasar/moonshotai.Kimi-K2-Thinking-GGUF
Note: it's currently uploading!
Okay, thanks for the example script! I tried some things very close to that but didn't try CompressedTensorsConfig(run_compressed=False), and I also don't have access to enough RAM on a single machine, even across multiple sockets/NUMA nodes. Fortunately the llama.cpp folks put something together, baked directly into the convert_hf_to_gguf.py script, that worked for me.
Sweet, yeah, the patch in the mainline llama.cpp convert script also seems to use its own method to decompress the tensors rather than the compressed-tensors library directly (given it's unclear whether that has a lazy method). The patch version I used had a memory high-water mark of less than 80GB or so of RAM while converting to bf16 in 50GB splits.
Interestingly, it looks like you're getting <think> and </think> tags on yours. Is that true with llama-server output as well? I haven't seen any thinking tags in my own testing yet!
If this is helpful, you can try this script, which will lazily dequantize all safetensors to BF16 (it uses much less RAM than the other approaches).
Please note that it will output thousands of new safetensors, one tensor per file. It won't copy the tokenizer/config files; you need to copy those manually.
import torch
import os.path as path
import os
from safetensors import safe_open
from safetensors.torch import save_file

IN_DIR = "models/Kimi-K2-Thinking"
OUT_DIR = "models/Kimi-K2-Thinking-Dequant"

if not path.exists(OUT_DIR):
    os.makedirs(OUT_DIR)

src_files = []
for i in range(62):
    part_num = str(i + 1).zfill(5)
    src_files.append(f"{IN_DIR}/model-{part_num}-of-000062.safetensors")


def main():
    # first pass: count how many output tensors there will be
    total_tensors = 0
    for src in src_files:
        with safe_open(src, framework="pt", device="cpu") as in_f:
            for name in in_f.keys():
                if name.endswith(".weight_scale") or name.endswith(".weight_shape"):
                    continue  # ignore
                total_tensors += 1

    # second pass: dequantize and save, one tensor per output file
    i = 0
    for src in src_files:
        packed = None
        with safe_open(src, framework="pt", device="cpu") as in_f:
            for name in in_f.keys():
                data = in_f.get_tensor(name)
                # assuming that the packed tensor is always followed by its scale tensor (not the other way around)
                if name.endswith(".weight_packed"):
                    packed = data
                    continue
                elif name.endswith(".weight_scale"):
                    scales = data
                    assert packed is not None
                    data = dequant(packed, scales)
                    name = name.replace(".weight_scale", ".weight")
                    packed = None
                elif name.endswith(".weight_shape"):
                    # skip shape tensors
                    continue
                outfile = save_single_tensor(name, data, i + 1, total_tensors)
                i += 1
                print(f"{name} shape={data.shape} dtype={data.dtype} --> {outfile}")
                del data


def save_single_tensor(name, data, i, total):
    outfile = f"model-{i:05d}-of-{total:05d}.safetensors"
    outpath = path.join(OUT_DIR, outfile)
    tensors = {name: data}
    save_file(tensors, outpath)
    return outfile  # return the filename


def dequant(packed, scale):
    num_bits = 4
    group_size = 32
    pack_factor = group_size // num_bits  # = 8 int4 values per int32 word (group_size happens to equal the 32-bit word size)
    mask = (1 << num_bits) - 1
    unpacked = torch.zeros(
        (packed.shape[0], packed.shape[1] * pack_factor),
        device=packed.device,
        dtype=torch.int32,
    )
    for i in range(pack_factor):
        unpacked[:, i::pack_factor] = (packed >> (num_bits * i)) & mask
    # convert uint4 to int4 (shift the zero point)
    unpacked = unpacked - (mask + 1) // 2
    scale = scale.unsqueeze(2)
    unpacked = unpacked.to(torch.float32)
    unpacked = unpacked.reshape(-1, unpacked.shape[1] // group_size, group_size)
    dequantized = (unpacked * scale).reshape(-1, unpacked.shape[1] * group_size)
    dequantized = dequantized.to(torch.bfloat16)
    return dequantized


main()
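Not part of the script above, but if you want to convince yourself the dequant() math is right, here is a small self-contained round-trip check. It packs a random int4 tensor in the layout dequant() expects (8 nibbles per int32 word, group size 32, zero point 8) and verifies that dequant() reproduces the expected bf16 values. pack_int4 is a helper written just for this test, not something from compressed-tensors; run it in the same file as dequant() with main() commented out.
import torch

def pack_int4(weight_q):
    # weight_q: integer values in [-8, 7], shape [rows, cols] with cols divisible by 8
    unsigned = (weight_q + 8).to(torch.int64)  # shift to unsigned nibbles 0..15
    rows, cols = unsigned.shape
    words = torch.zeros((rows, cols // 8), dtype=torch.int64)
    for i in range(8):
        words |= unsigned[:, i::8] << (4 * i)  # place nibble i of every 32-bit word
    return words.to(torch.int32)  # truncate into signed int32, matching the stored weight_packed layout

rows, cols, group_size = 4, 64, 32
weight_q = torch.randint(-8, 8, (rows, cols), dtype=torch.int64)
scale = torch.rand(rows, cols // group_size, dtype=torch.float32) + 0.5

expected = weight_q.reshape(rows, -1, group_size).to(torch.float32) * scale.unsqueeze(2)
expected = expected.reshape(rows, cols).to(torch.bfloat16)
recovered = dequant(pack_int4(weight_q), scale)
assert torch.equal(expected, recovered), "dequant round-trip mismatch"
print("dequant round-trip OK")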
Tested my Q3 quant. It managed to do the zero-shot hexagon test (on the 3rd try).
Prompt:
Write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically
I consider it a working model

