Quant in the 320GiB and 360GiB range?

#11
by Panchovix - opened

Hello, great work as always! I'm hoping for a ~3bpw model, as I probably can't load the smol-IQ3_KS when it gets released (~400GB, and I think I can load at most ~360GB of base weights), so in the meantime I'm using Q2_K_XL from unsloth.

But I have noticed that using the "Moonshot AI" preset on SillyTavern doesn't work correctly: it outputs things like "commentary:", then "subject:", then "action:", etc.

Thanks!

Panchovix changed discussion status to closed

Heya @Panchovix I'm getting back to my desk today and catching up.

I saw you closed this, are you still looking for something in that intermediate range that isn't covered?

Panchovix changed discussion title from Does someone know what instruct and context preset to use for this model on SillyTavern? to Quant in the 360-380GiB range?
Panchovix changed discussion title from Quant in the 360-380GiB range? to Quant in the 360GiB range?

Hello! I closed it because the ST issue I had was fixed by updating to the latest release version.

But I will re-open it to ask for a quant then haha.

A quant in the ~320GiB range and a quant in the ~360GiB range would be great! Unsloth's Q2_K_XL is listed at 384GB but comes out to 360GiB once downloaded, and that is basically my limit.

I have: 5090x2, 4090x2, A6000, A40 and 3090 + 192GB RAM with a 9900X

Got the A40 for 1200USD, and I sold 2 other 3090s for, well, 1200USD lol, so mostly to save power and space.
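Rough back-of-envelope on why ~360GiB is about my ceiling (a sketch, treating all the marketed sizes as GiB):

# 2x 5090 (32) + 2x 4090 (24) + A6000 (48) + A40 (48) + 3090 (24), plus 192GB system RAM
echo "VRAM:       $((2*32 + 2*24 + 48 + 48 + 24)) GiB"        # 232
echo "VRAM + RAM: $((2*32 + 2*24 + 48 + 48 + 24 + 192)) GiB"  # 424
# ~360 GiB of weights leaves roughly 64 GiB for KV cache, compute buffers and the OS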

Thanks!

Panchovix changed discussion status to open
Panchovix changed discussion title from Quant in the 360GiB range? to Quant in the 320GiB and 360GiB range?

@Panchovix I just released some comparisons here which might be of interest to you: https://huggingface.co/ubergarm/Kimi-K2-Thinking-GGUF/discussions/1

My current collection is missing something in the 350GiB range so I might take a look. I have to be a bit more prudent releasing the big ones as storage space is more limited than before (but I still have plenty for now).

@ubergarm many thanks! Yes something like 350GB would be great.

And pretty interesting graph: you get basically the same ppl on the smol-IQ3_KS vs UD-Q3_K_XL, but at a much smaller size!

Something in the lower end of that 320GiB - 350GiB range would be nice :)

(Largest I can run is the smol-IQ2_KL @ 329.2 GiB for K2-Instruct)

@gghfez

Check over there: https://huggingface.co/ubergarm/Kimi-K2-Thinking-GGUF/discussions/1#6913a72ba152f17d2be301bf

Just uploaded IQ2_KL 348.883 GiB (2.920 BPW) and really hope it fits for you! Looking strong on perplexity!

If it is too big, I could look into a smol-IQ2_KL but not sure if there is demand?

Yeah, that would be slightly too large for my rig (I tried the IQ2_KL last time), as my 144GB of VRAM is 6x24GB and doesn't fill each card cleanly.

No need to make a quant just for me though, I can wait for https://huggingface.co/collections/Thireus/kimi-k2-thinking-thireus-special-split to be populated.

The only other demand I could think of would be those guys with 256GB on consumer AM5 and a single 96GB GPU.

The only other demand I could think of would be those guys with 256GB on consumer AM5 and a single 96GB GPU.

Yup, I know a guy who might be interested in that. I've cooked the smol-IQ2_KL to at least check perplexity; I'll let you know how the ppl turns out for the smol-IQ2_KL 329.195 GiB (2.755 BPW).
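(For anyone sizing quants by BPW: file size is roughly total_params × BPW / 8. A rough sanity check with K2 at ~1.03T weights, numbers approximate:)

awk 'BEGIN { printf "%.1f GiB\n", 1.03e12 * 2.755 / 8 / 2^30 }'   # ~330 GiB, close to the smol-IQ2_KL above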

@gghfez ok it is uploaded! cheers!

Wow thanks! Downloading now. You have great upload bandwidth.

I'll use this one to try training control-vectors.

Okay testing IQ2_KL and it seems very promising!

Running with

./llama-server \
  -m '/Kimi-K2-Thinking-IQ2_KL-00001-of-00008.gguf' \
  -c 32768 \
  --no-mmap \
  -ngl 999 \
  -ot "blk.(0|1|2|3|4).ffn.=CUDA0" \
  -ot "blk.(5|6|7).ffn.=CUDA1" \
  -ot "blk.(8|9|10).ffn.=CUDA2" \
  -ot "blk.(11|12|13|14).ffn.=CUDA3" \
  -ot "blk.(15|16|17).ffn.=CUDA4" \
  -ot "blk.(18|19|20|21|22|23|24).ffn.=CUDA5" \
  -ot "blk.(25|26|27|28|29|30|31).ffn.=CUDA6" \
  -ot "exps=CPU" \
  -mg 0 \
  -ub 2048 \
  -mla 1

I get 177.11 t/s PP, 10.59 t/s TG. -ub can maybe be increased a bit more, and with this one I have 12GiB RAM left.
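To check per-GPU headroom while tuning the -ot splits and -ub, something like this is handy:

watch -n 1 'nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader'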

Also running with my cursed partial layers (to use more of the GPUs' VRAM):

./llama-server \
  -m '/Kimi-K2-Thinking-IQ2_KL-00001-of-00008.gguf' \
  -c 32768 \
  --no-mmap \
  -ngl 999 \
  -ot "blk.(0|1|2|3|4).ffn.=CUDA0" \
  -ot "blk.(5|6|7).ffn.=CUDA1" \
  -ot "blk.(8|9|10).ffn.=CUDA2" \
  -ot "blk.(11|12|13|14).ffn.=CUDA3" \
  -ot "blk.(15|16|17).ffn.=CUDA4" \
  -ot "blk.(18|19|20|21|22|23|24).ffn.=CUDA5" \
  -ot "blk.(25|26|27|28|29|30|31).ffn.=CUDA6" \
  -ot "blk.32.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA1" \
  -ot "blk.32.ffn_gate_exps.weight=CUDA0" \
  -ot "blk.32.ffn_up_exps.weight=CUDA1" \
  -ot "blk.33.ffn_gate_exps.weight=CUDA1" \
  -ot "blk.34.ffn_gate_exps.weight=CUDA2" \
  -ot "blk.34.ffn_up_exps.weight=CUDA2" \
  -ot "blk.35.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA3" \
  -ot "blk.35.ffn_down_exps.weight=CUDA3" \
  -ot "blk.35.ffn_up_exps.weight=CUDA3" \
  -ot "blk.36.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA4" \
  -ot "blk.36.ffn_gate_exps.weight=CUDA4" \
  -ot "blk.37.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA5" \
  -ot "blk.37.ffn_gate_exps.weight=CUDA5" \
  -ot "exps=CPU" \
  -mg 0 \
  -ub 2048 \
  -mla 1 -no-fmoe

I get 180 t/s PP, 11 t/s TG. With this one I have 27 GiB RAM left (so partial layers basically moved 15 GiB of weights from RAM into VRAM).

Maybe I will try to experiment a bit more to aim for 200 t/s PP.
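If anyone wants to reproduce the partial-layer overrides, the tensor names that the -ot regexes match can be listed from the GGUF itself; with the gguf Python package installed, something like this should work (a sketch, exact output format may differ):

# list tensor names in the first split and filter for the ffn tensors
gguf-dump '/Kimi-K2-Thinking-IQ2_KL-00001-of-00008.gguf' | grep ffn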

Heya @Panchovix I'm getting back to my desk today and catching up.

I saw you closed this, are you still looking for something in that intermediate range that isn't covered?

YES,PLZ

@nwzjk

Great, check out the model card graph to see the available quants. Both the smol-IQ2_KL and the slightly larger IQ2_KL are the best quality quants available for their size!

Thanks, I'm downloading

[image attached]

@ubergarm
I'm running unsloth Q2_K_L with the following command line, TG about 30 t/s.
GPUs: 8x48GB

[image attached]


Would you please give me some advice on how to run the Kimi-K2-Thinking-IQ2_KL model, especially the ik_llama.cpp arguments?

GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
numactl --cpunodebind=0 --membind=0 ./build/bin/llama-server \
  --model /data/models/unsloth/Kimi-K2-Thinking-GGUF/Q2_K_L/Kimi-K2-Thinking-Q2_K_L-00001-of-00008.gguf \
  --alias "coder" \
  --threads 48 \
  --threads-batch 24 \
  --cpu-range 0-47 \
  --cpu-range-batch 24-47 \
  --cpu-strict 1 \
  --prio 2 \
  --poll 80 \
  --split-mode layer \
  --tensor-split 1,1,1,1,1,1,1,1 \
  --batch-size 4096 \
  --ubatch-size 1024 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -fa on \
  --n-gpu-layers 999 \
  --temp 1.0 \
  --min_p 0.01 \
  --ctx-size 128000 \
  --host 0.0.0.0 \
  --port 8000 \
  --jinja \
  --reasoning-budget -1 \
  --numa numactl \
  --special

@nwzjk

I'd build ik_llama.cpp first, then try running your unsloth quant with it.
I'm not up to date with some of these (--poll 80, --reasoning-budget -1, --cpu-strict) but I can tell you that this one is gone:

-fa on -> flash attention is on by default and that flag will fail.

What does this do?

GGML_CUDA_ENABLE_UNIFIED_MEMORY=1

One hypothetical gotcha: I noticed mainline llama.cpp has a new server UI. I don't know if it'll break your local IndexedDB if you use the ik_llama.cpp one, so if you're using that, maybe test in an incognito window or another browser.
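If it helps as a starting point, something like this is roughly where I'd begin on ik_llama.cpp (an untested sketch: the model path is just illustrative, and the even tensor-split / thread count are guesses for your 8x48GB box):

GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
./build/bin/llama-server \
  -m /data/models/ubergarm/Kimi-K2-Thinking-GGUF/IQ2_KL/Kimi-K2-Thinking-IQ2_KL-00001-of-00008.gguf \
  --alias coder \
  -c 32768 \
  --no-mmap \
  -ngl 999 \
  --split-mode layer \
  --tensor-split 1,1,1,1,1,1,1,1 \
  -ub 2048 \
  --threads 48 \
  --host 0.0.0.0 --port 8000
# if it doesn't fit, keep the expert tensors on CPU with -ot "exps=CPU" (as in the commands above) and tune from there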

@Panchovix

Also running with my cursed partial layers (to use more of the GPUs' VRAM):

"cursed" how? Looks like if anything, you're getting better performance doing that?

-mla 1

Why mla 1 btw?

And the -no-fmoe - is that necessary because of the cursed partial layers?

https://github.com/ggml-org/llama.cpp/blob/97d5117217e4ad904493345e2f71dfe441a08e25/docs/build.md
Unified Memory
The environment variable GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 can be used to enable unified memory in Linux. This allows swapping to system RAM instead of crashing when the GPU VRAM is exhausted. In Windows this setting is available in the NVIDIA control panel as System Memory Fallback

@ubergarm
I compiled ik_llama.cpp as follows (after git pull):

GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 DLLAMA_CURL=ON DLLAMA_NUMA=ON \
cmake -B build \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_FORCE_MMQ=ON \
  -DGGML_CUDA_FORCE_DMMV=ON \
  -DGGML_CUDA_F16=1 \
  -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DGGML_SCHED_MAX_COPIES=1 \
  -DLLAMA_CURL=ON \
  -DLLAMA_NUMA=ON \
  -DLLAMA_ACCELERATE=ON \
  -DCMAKE_CUDA_ARCHITECTURES="89"

When running

GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 GGML_CUDA_FORCE_MMQ=1 \
numactl --cpunodebind=0 --membind=0 ./build/bin/llama-server \
    --model /data/models/unsloth/Kimi-K2-Thinking-GGUF/Q2_K_L/Kimi-K2-Thinking-Q2_K_L-00001-of-00008.gguf \
    --alias "coder" \
--threads 48 \
--threads-batch 24 \
--split-mode layer \
--tensor-split 1,1,1,1,1,1,1,1 \
--batch-size 4096 \
--ubatch-size 1024 \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
    -fa on \
--n-gpu-layers 999 \
--temp 1.0 \
    --min_p 0.01 \
--ctx-size 128000 \
--host 0.0.0.0 \
    --port 8000 \
--jinja \
--reasoning-budget -1 \
--numa numactl \
--special

First, these args are not allowed [but llama.cpp supports them]:
--cpu-range 0-47
--cpu-range-batch 24-47
--cpu-strict 1
--prio 2
--poll 80

Second, TG is only about 20 t/s [vs. 30 t/s on llama.cpp].

Why?

==============
Update: setting these compile options to OFF made the difference! Now ik_llama.cpp gets 38 t/s:
-DGGML_CUDA_FORCE_MMQ=OFF
-DGGML_CUDA_FORCE_DMMV=OFF
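In other words, roughly just a plain CUDA configure with those options left at their defaults (a sketch):

cmake -B build \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DCMAKE_CUDA_ARCHITECTURES="89"
cmake --build build --config Release -j $(nproc)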

@gghfez cursed to the eyes haha.

I use -mla 1 because it uses way less VRAM than -mla 3, and when testing, the only difference on my setup was 5% better PP. So I'd rather put more weights on the GPU than take the extra 5%.

And yes, fmoe with partial layers tanks TG performance, like 1-2 t/s instead of 10+

@nwzjk

Why?

Woah, you use a ton of stuff I never touch. Just because bells and whistles exist does not mean you have to use them all. Some of the other folks are doing unusual things as well, but that is because they are trying to optimize VRAM utilization across a bunch of GPUs and are willing to give up other things to do it.

[ik_llama.cpp 38t/s]

Okay I guess you found a good setting for your rig then, that is great! I advise testing with llama-sweep-bench to compare speeds and tune your commands for your rig.
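Something along these lines (a sketch; it takes roughly the same arguments as llama-server, so reuse whatever offload flags you settle on):

./build/bin/llama-sweep-bench \
  -m /data/models/unsloth/Kimi-K2-Thinking-GGUF/Q2_K_L/Kimi-K2-Thinking-Q2_K_L-00001-of-00008.gguf \
  -c 32768 \
  -ngl 999 \
  --split-mode layer \
  --tensor-split 1,1,1,1,1,1,1,1 \
  --threads 48 \
  -ub 1024
# reports PP/TG speeds at increasing context depths so you can compare settings apples-to-apples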

Keep at it and have fun tuning all the knobs haha...

I use -mla 1 because it uses way less VRAM than -mla 3, and when testing, the only difference on my setup was 5% better PP. So I'd rather put more weights on the GPU than take the extra 5%.

And yes, fmoe with partial layers tanks TG performance, like 1-2 t/s instead of 10+

Must be a Blackwell issue. I get the opposite result.

MLA3:

llama_new_context_with_model: KV self size  = 1612.69 MiB, c^KV (f16): 1612.69 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.62 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  3958.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  1770.01 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =  1770.01 MiB
llama_new_context_with_model:      CUDA3 compute buffer size =  1770.01 MiB
llama_new_context_with_model:      CUDA4 compute buffer size =  1770.01 MiB
llama_new_context_with_model:      CUDA5 compute buffer size =  1770.02 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   150.02 MiB

MLA1:

llama_new_context_with_model: KV self size  = 1612.69 MiB, c^KV (f16): 1612.69 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.62 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  3677.75 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   947.06 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =   947.06 MiB
llama_new_context_with_model:      CUDA3 compute buffer size =   947.06 MiB
llama_new_context_with_model:      CUDA4 compute buffer size =   947.06 MiB
llama_new_context_with_model:      CUDA5 compute buffer size =  1336.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   150.02 MiB

Looks good at first, but I hit OOM pretty fast during textgen with MLA1, whereas the cache size doesn't grow much with MLA3.
I suspect it pre-allocates when using MLA3?
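Summing the compute buffers from the two logs above (rough arithmetic, just to put a number on it):

awk 'BEGIN {
  mla3 = 3958.01 + 1770.01*4 + 1770.02 + 150.02   # CUDA0-CUDA5 + CUDA_Host, MLA3 log
  mla1 = 3677.75 +  947.06*4 + 1336.00 + 150.02   # same buffers, MLA1 log
  printf "MLA3: %.0f MiB  MLA1: %.0f MiB  diff: ~%.1f GiB\n", mla3, mla1, (mla3 - mla1)/1024
}'

So MLA3 reserves roughly 4 GiB more in compute buffers up front on this setup.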

I just copied your cursed offload for layers 32 and 35 (just randomly) and that's working so I'm glad you posted it.

It seems both pre-allocate, but the MLA 3 compute buffers are a bit heavier, as you can see in your output.

I have GPUs from multiple generations, so I'm not sure if that affects it.

Yeah, partial layers can increase PP/TG a bit by using more VRAM, basically, since the layers are sometimes too big to fit an entire one on some GPUs.
