Definitely interested in this one!

#1 by mtcl

😀

Would this run on a DGX Spark at all? I doubt it, lol.

But with 512GB RAM + 160GB VRAM I think I can run it on my other machine. Can't wait!

I am definitely not stalking you at all.

lol you're gonna need 5x DGX Spark to run this one I think 💀 lol... the full model size is 543.617 GiB (4.549 BPW), and it supports MLA so the kv-cache won't be too heavy on VRAM usage, so you'll be :gucci:!!
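Quick sanity check on those numbers (my own arithmetic, not from the model card): 543.617 GiB × 2^30 bytes/GiB × 8 bits/byte ÷ 4.549 BPW ≈ 1.03T weights, right in line with Kimi-K2's ~1T total parameters. In decimal units that's ≈ 584 GB on disk, so five Sparks at 128GB of unified memory each (640GB) really would just clear it.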

uploading now!

$ hf upload ubergarm/Kimi-K2-Thinking-GGUF ./Q8_0-Q4_0 Q8_0-Q4_0
Start hashing 13 files.
Finished hashing 13 files.
Processing Files (0 / 8)      :  13%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰                                                                         | 76.0GB /  584GB,  314MB/s
New Data Upload               :  99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 76.0GB / 76.7GB,  314MB/s
  ...-Q4_0-00001-of-00013.gguf:  20%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ                                                                   | 9.40GB / 47.8GB
  ...-Q4_0-00003-of-00013.gguf:  20%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š                                                                   | 9.55GB / 47.6GB
  ...-Q4_0-00013-of-00013.gguf:  75%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š                     | 9.48GB / 12.7GB
  ...-Q4_0-00004-of-00013.gguf:  20%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š                                                                   | 9.52GB / 47.6GB
  ...-Q4_0-00010-of-00013.gguf:  20%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š                                                                   | 9.52GB / 47.6GB
  ...-Q4_0-00002-of-00013.gguf:  20%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š                                                                   | 9.52GB / 47.6GB
  ...-Q4_0-00008-of-00013.gguf:  20%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š                                                                   | 9.54GB / 47.6GB
  ...-Q4_0-00007-of-00013.gguf:  20%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹                                                                   | 9.46GB / 47.6GB

Is there a possibility to make a quantization that can fit into 512GB of RAM?

So it should fit in 512GB RAM + 160GB VRAM, hopefully.

What would be the startup command for it? Similar to the Kimi non-thinking version, I'd assume.
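For what it's worth, here's a sketch of the kind of ik_llama.cpp launch ubergarm's other Kimi model cards describe for hybrid CPU+GPU inference. Treat the flag set as an assumption to check against this repo's model card, especially the ik_llama.cpp-specific ones (-mla, -amb, -fmoe):

$ ./build/bin/llama-server \
    --model Kimi-K2-Thinking-Q8_0-Q4_0-00001-of-00013.gguf \
    --ctx-size 32768 \
    -fa -mla 3 -amb 512 -fmoe \
    -ngl 99 \
    -ot exps=CPU \
    --threads 32 \
    --host 127.0.0.1 --port 8080

The trick for big MoE models is -ngl 99 to put everything on GPU, then -ot exps=CPU to override the routed-expert tensors back to system RAM; attention and shared layers stay in VRAM, which is why the MLA kv-cache point above matters.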

Total file size is 47.6 × 12 + 12.9 = 584.1 GB.

I was able to run it on a 265GB VRAM + 512GB RAM (not fully used) system with half of the expert layers offloaded, at a decent 9 t/s. It works quite well, but I haven't been able to fix the chat template: even using the Jinja chat template as uber says in the model card, I cannot get the model to output the thinking tags. I have to explore a bit more.
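On the missing thinking tags: one thing worth trying (an assumption on my part; flag names are from mainline llama-server) is forcing an explicit template file alongside --jinja, where kimi-k2-thinking.jinja below is a hypothetical local file holding the chat template extracted from the HF repo:

$ ./build/bin/llama-server \
    --model Kimi-K2-Thinking-Q8_0-Q4_0-00001-of-00013.gguf \
    --jinja \
    --chat-template-file ./kimi-k2-thinking.jinja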

I'll have some performance numbers up later for 512GB DDR5 and 96GB VRAM! The t/s might be too slow for conversation, but for tasks where intricate thought is required this might be worth it!

--- edit

Solid 13 t/s; the thought process isn't longer than it needs to be, and the outputs are very solid. I'll use it more throughout the week! I'm limited to a very small context window of 10k, so I can't ask it to do anything big, but everything is impressive so far! Maybe I'll try a smaller quant when they're published, to squeeze more context into the VRAM and have it do some coding tasks.

@perelmanych

Is there a possibility to make a quantization that can fit into 512GB of RAM?

I've cooked one for you that tries to shave down the full quality just enough to fit more comfortably. I tried to preserve q4_0 where possible, given it likely "fits" the original QAT best for the routed experts. Testing its perplexity and validating it now before uploading:

{
  "name": "IQ3_K",
  "ppl": "TODO",
  "size": 474.772,
  "bpw": 3.973,
  "legend": "ubergarm",
  "comment": ""
}
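Sanity check on the fit (my arithmetic): 512GB of DIMMs is really 512 GiB, so 474.772 GiB of weights leaves roughly 37 GiB of headroom for the kv-cache, compute buffers, and the OS before you even touch VRAM.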

Alright... here comes another 500 gig download this week. My new 2TB NVMe is already crying after I downloaded MiniMax M2 bf16... really excited for this one based on the hype online.

Edit: sorry, I just realized the subtle difference between GiB and GB, haha... so the IQ3_K is ~510 GB total already.
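If anyone else is queuing this up, a sketch of pulling a single quant folder with the same CLI used for the upload above; the IQ3_K/ folder name is my assumption, mirroring the Q8_0-Q4_0 layout:

$ hf download ubergarm/Kimi-K2-Thinking-GGUF \
    --include "IQ3_K/*" \
    --local-dir ./Kimi-K2-Thinking-GGUF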

Sorry to be so picky, but what about a quant between the Q4 and the IQ3_K, around 510-520 GB? That way we could run it on 512GB RAM + 24/32/48GB VRAM, and it might get very close to the Q4's perplexity. Many thanks for all the quants!

@facedwithahug

This one is a touch larger than the IQ3_K, but sits a nice dip lower ("better") on the updated perplexity chart:

smol-IQ4_KSS 485.008 GiB (4.059 BPW)

Final estimate: PPL = 2.1343 +/- 0.00934

It lands at 520 GB (485 GiB), yup!
