Python script to decompress tensors?
Can you provide a Python script to decompress the safetensors back to normal bf16 format?
Ideally the script should run on the CPU backend and require as little RAM as possible, iterating across the tensors lazily and saving output safetensors incrementally (to avoid loading the whole model into RAM).
The model card says:
The checkpoints are saved in compressed-tensors format, supported by most of mainstream inference engine. If you need the checkpoints in higher precision such as FP8 or BF16, you can refer to official repo of compressed-tensors to unpack the int4 weights and convert to any higher precision.
However, llama.cpp (and by extension ollama) is one of the most mainstream inference engines, and it is not currently possible to quantize this model until it is properly decompressed into a normal bf16 safetensor format like your previous models.
Thanks!
In theory, assuming everything in transformers and compressed-tensors is nice and lazy and efficient, it could be straightforward, something like:
import torch
from transformers import AutoModelForCausalLM
# first download compressed safetensors with hf cli e.g.
# $ hf download --local-dir ./Kimi-K2-Thinking moonshotai/Kimi-K2-Thinking
input_dir = "/mnt/data/models/moonshotai/Kimi-K2-Thinking/"
model = AutoModelForCausalLM.from_pretrained(input_dir, dtype=torch.bfloat16, trust_remote_code=True)
print("Half way there!")
# make sure you have ~2TB of disk space free for this
# and pre-make the output dir just in case e.g.
# $ mkdir -p /mnt/data/models/moonshotai/Kimi-K2-Thinking-bf16-safetensors/
output_dir = "/mnt/data/models/moonshotai/Kimi-K2-Thinking-bf16-safetensors/"
model.save_pretrained(output_dir)
However, it says "Compressing" and not "Decompressing" when starting?? It also failed on some smaller test models I found on Hugging Face that use compressed-tensors.
Here is the current situation, and I expect it to fail after a while:
$ uv pip freeze | grep -P '(torch|trans|compress)'
Using Python 3.12.11 environment at: ~/projects/llama.cpp/venv
compressed-tensors==0.12.2
torch==2.9.0
transformers==4.57.1
$ python decompress-kimi-k2-thinking.py
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
You are using a model of type kimi_k2 to instantiate a model of type deepseek_v3. This is not supported for all configurations of models and can yield errors.
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
Compressing model: 0it [00:00, ?it/s] ~/projects/llama.cpp/venv/lib/python3.12/site-packages/torch/cuda/__init__.py:827: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
Compressing model: 7270it [04:34, 25.38it/s]
Thanks to anyone who has figured this out!
I've also opened an issue with compressed-tensors, asking for an example of just decompressing models into normal bf16 safetensors, here: https://github.com/vllm-project/compressed-tensors/issues/511
You can also try my dequant_compressed_tensor function from https://github.com/ggml-org/llama.cpp/pull/17064
You just need to provide the correct _packed and _scale tensors; the _shape tensor is unused for now.
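For reference, here is a minimal sketch of pulling a matching _packed/_scale pair out of a shard with safetensors. The shard filename is just an example (pick one that actually contains quantized layers), the .weight_packed/.weight_scale suffixes follow the compressed-tensors layout this model uses, and the commented-out call is only a placeholder, so check the PR for the actual dequant_compressed_tensor signature:
import torch
from safetensors import safe_open

# example shard name; use any shard that actually contains quantized layers
shard = "Kimi-K2-Thinking/model-00002-of-000062.safetensors"

with safe_open(shard, framework="pt", device="cpu") as f:
    # grab the first compressed weight in this shard
    packed_name = next(k for k in f.keys() if k.endswith(".weight_packed"))
    packed = f.get_tensor(packed_name)  # int32 words, 8 int4 nibbles per element
    scale = f.get_tensor(packed_name.replace(".weight_packed", ".weight_scale"))  # per-group scales
    print(packed_name, packed.shape, packed.dtype, scale.shape, scale.dtype)
    # weight_bf16 = dequant_compressed_tensor(packed, scale)  # placeholder; see the PR for the exact signature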
ggerganov should stop being lazy and just add INT4 support. FP8 should also have been added a long time ago, fuck converting everything into a big-ass bf16 just to quant it down again anyway.
Well thanks to the llama.cpp team there is a GGUF available for testing:
https://www.reddit.com/r/LocalLLaMA/comments/1oqo57j/ubergarmkimik2thinkinggguf_hugging_face/
Great job ngxson, compilade, DevQuasar, Bartowski, AesSedai, and all the other folks who pulled together hacking on this one today! 🫶
@ubergarm Re BF16 - I think I might have gotten BF16 conversion to work - https://huggingface.co/unsloth/Kimi-K2-Thinking-BF16, but I'm not fully sure :)
It took ages to convert, but the following works, I think (you need to have 2.5 TB of RAM):
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.utils.quantization_config import CompressedTensorsConfig

model_name = "Kimi-K2-Thinking"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="cpu",
    local_files_only=True,
    trust_remote_code=True,
    quantization_config=CompressedTensorsConfig(run_compressed=False),
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model.save_pretrained("uncompressed")
tokenizer.save_pretrained("uncompressed")
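One note on the save step: transformers shards the output at 5GB per file by default. If you would rather have bigger shards, save_pretrained accepts the standard max_shard_size argument, for example:
# optional: write bigger shards instead of the default 5GB ones
model.save_pretrained("uncompressed", max_shard_size="50GB")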
I think I've made it work an alternative way with my own dequantize script (it does NOT require 2.5 TB of RAM).
Uploading experimental GGUF for test here:
DevQuasar/moonshotai.Kimi-K2-Thinking-GGUF
Note: it's currently uploading!
Okay, thanks for the example script! I tried some things very close to that but didn't try CompressedTensorsConfig(run_compressed=False), and I also don't have access to enough RAM on a single machine, even across multiple sockets/NUMA nodes. Fortunately the llama.cpp folks put something together, baked directly into the convert_hf_to_gguf.py script, that worked for me.
Sweet, yeah, the patch in the mainline llama.cpp convert script also seems to use its own method to decompress the tensors rather than the compressed-tensors library directly (given it's unclear whether that has a lazy method). The patch version I used had a memory high-water mark of less than 80GB or so of RAM while converting to bf16 in 50GB splits.
Interestingly, it looks like you're getting <think> and </think> tags on yours. Is that true with llama-server output as well? I haven't seen any thinking tags in my own testing yet!
If this is helpful, you can try this script, which will lazily dequantize all safetensors to BF16 (it uses much less RAM than the other approaches).
Please note that it will output thousands of new safetensors, one tensor per file. It won't copy the tokenizer/config files; you need to copy those manually.
import torch
import os.path as path
import os
from safetensors import safe_open
from safetensors.torch import save_file

IN_DIR = "models/Kimi-K2-Thinking"
OUT_DIR = "models/Kimi-K2-Thinking-Dequant"

if not path.exists(OUT_DIR):
    os.makedirs(OUT_DIR)

src_files = []
for i in range(62):
    part_num = str(i + 1).zfill(5)
    src_files.append(f"{IN_DIR}/model-{part_num}-of-000062.safetensors")


def main():
    # first pass: count how many output tensors there will be
    total_tensors = 0
    for src in src_files:
        with safe_open(src, framework="pt", device="cpu") as in_f:
            for name in in_f.keys():
                if name.endswith(".weight_scale") or name.endswith(".weight_shape"):
                    continue  # ignore
                total_tensors += 1

    # second pass: dequantize and save, one tensor per output file
    i = 0
    for src in src_files:
        packed = None
        with safe_open(src, framework="pt", device="cpu") as in_f:
            for name in in_f.keys():
                data = in_f.get_tensor(name)
                # assuming that the packed tensor is always followed by its scale tensor (not the other way around)
                if name.endswith(".weight_packed"):
                    packed = data
                    continue
                elif name.endswith(".weight_scale"):
                    scales = data
                    assert packed is not None
                    data = dequant(packed, scales)
                    name = name.replace(".weight_scale", ".weight")
                    packed = None
                elif name.endswith(".weight_shape"):
                    # skip shape tensors
                    continue
                outfile = save_single_tensor(name, data, i + 1, total_tensors)
                i += 1
                print(f"{name} shape={data.shape} dtype={data.dtype} --> {outfile}")
                del data


def save_single_tensor(name, data, i, total):
    outfile = f"model-{i:05d}-of-{total:05d}.safetensors"
    outpath = path.join(OUT_DIR, outfile)
    tensors = {name: data}
    save_file(tensors, outpath)
    return outfile  # return the filename


def dequant(packed, scale):
    num_bits = 4
    group_size = 32
    pack_factor = group_size // num_bits  # = 8 int4 values per int32 word (group_size happens to equal the 32-bit word size)
    mask = (1 << num_bits) - 1
    unpacked = torch.zeros(
        (packed.shape[0], packed.shape[1] * pack_factor),
        device=packed.device,
        dtype=torch.int32,
    )
    for i in range(pack_factor):
        unpacked[:, i::pack_factor] = (packed >> (num_bits * i)) & mask
    # convert uint4 to int4 (shift the zero point)
    unpacked = unpacked - (mask + 1) // 2
    scale = scale.unsqueeze(2)
    unpacked = unpacked.to(torch.float32)
    unpacked = unpacked.reshape(-1, unpacked.shape[1] // group_size, group_size)
    dequantized = (unpacked * scale).reshape(-1, unpacked.shape[1] * group_size)
    dequantized = dequantized.to(torch.bfloat16)
    return dequantized


main()
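Not part of the script above, but if you want to convince yourself the dequant() math is right, here is a small self-contained round-trip check. It packs a random int4 tensor in the layout dequant() expects (8 nibbles per int32 word, group size 32, zero point 8) and verifies that dequant() reproduces the expected bf16 values. pack_int4 is a helper written just for this test, not something from compressed-tensors; run it in the same file as dequant() with main() commented out.
import torch

def pack_int4(weight_q):
    # weight_q: integer values in [-8, 7], shape [rows, cols] with cols divisible by 8
    unsigned = (weight_q + 8).to(torch.int64)  # shift to unsigned nibbles 0..15
    rows, cols = unsigned.shape
    words = torch.zeros((rows, cols // 8), dtype=torch.int64)
    for i in range(8):
        words |= unsigned[:, i::8] << (4 * i)  # place nibble i of every 32-bit word
    return words.to(torch.int32)  # truncate into signed int32, matching the stored weight_packed layout

rows, cols, group_size = 4, 64, 32
weight_q = torch.randint(-8, 8, (rows, cols), dtype=torch.int64)
scale = torch.rand(rows, cols // group_size, dtype=torch.float32) + 0.5

expected = weight_q.reshape(rows, -1, group_size).to(torch.float32) * scale.unsqueeze(2)
expected = expected.reshape(rows, cols).to(torch.bfloat16)
recovered = dequant(pack_int4(weight_q), scale)
assert torch.equal(expected, recovered), "dequant round-trip mismatch"
print("dequant round-trip OK")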
Tested my Q3 quant. It managed to do the zero-shot hexagon test (on the 3rd try).
Prompt:
Write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically
I consider it a working model

