Is it normal that this model hoards 130 GB of RAM, 5x more than the model size? *solved*
Update: This issue was solved simply by setting CTX=10240 (10k tokens) in the launch script instead of letting ik_llama.cpp reserve the maximum context window the model allows (1M tokens). Thanks owao for the help!
After that change, memory consumption is not far from the model size.
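For intuition on why reserving the full 1M-token window eats RAM: the KV cache grows linearly with the reserved context. Below is a rough sizing sketch; the layer/head numbers are placeholders I made up, not this model's real config (the actual values are in the gguf metadata and the llama-server startup log):

# Rough f16 KV-cache size estimate (all hyperparameters here are assumptions, not read from the model)
N_LAYERS=48     # assumed transformer layer count
N_KV_HEADS=8    # assumed KV heads (GQA)
HEAD_DIM=128    # assumed head dimension
BYTES=2         # f16 = 2 bytes per element
CTX=10240       # reserved context window
# K + V: 2 tensors per layer, each kv_heads * head_dim wide, stored per token
echo "KV cache: $(( 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES * CTX / 1024 / 1024 )) MiB"

With these placeholder numbers that comes to about 1.9 GiB at 10k tokens, but roughly 190 GiB at the 1M-token maximum, which is the right order of magnitude for the blow-up described below.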
[2025-11-09_15:11:31]
Per-node process memory usage (in MBs) for PID 3030 (llama-server)
             Node 0      Node 1       Total
          ---------   ---------   ---------
Huge           0.00        0.00        0.00
Heap          41.41        0.00       41.41
Stack          0.09        0.00        0.09
Private    25970.79        0.11    25970.90
Total      26012.29        0.11    26012.40

Timestamp             Total(kB)    RSS(kB)   Dirty(kB)
2025-11-09_15:13:06    29853816   26636704     1530988
2025-11-09_15:13:07    29853816   26636704     1530988
2025-11-09_15:13:08    29853816   26636704     1530988
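(For anyone who wants to reproduce this monitoring: the per-node table above is what numastat -p prints, and similar RSS/dirty totals can be read from /proc. Roughly, assuming the numactl tools are installed, with 3030 being the server PID from the log:)

# Per-node breakdown, refreshed every second
watch -n 1 numastat -p 3030
# RSS / dirty totals, from the kernel's per-process rollup
grep -E 'Rss|Dirty' /proc/3030/smaps_rollup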
I have mradermacher's i1 quant of this model, but I'm still scratching my head. When launching it via launch_ik_llama_CPU0_8080_aquif-3.5-Max-42B-A3B.i1-Q4_K_M.gguf, htop shows RAM consumption going from 1.5 GB to 131 GB, and Grafana shows 152 GB. I have never seen an inference process take about 5x more RAM than the model size.
aquif-3.5-Max-42B-A3B.i1-Q4_K_M.gguf
I've heard some explanations about its unique architecture etc., but those don't sound credible.
I have dual Xeon 4116 CPUs + 192 GB RAM. I'm using ik_llama.cpp and Ubuntu Server 24.04. I would like to be able to run this on just one socket's memory (96 GB).
MODEL=/datahub/models/aquif-3.5-Max-42B-A3B.i1-Q4_K_M.gguf
BIN=/datahub/ik_llama.cpp/build/bin/llama-server
PORT=8080
SOCKET=0

# Cores for Socket 0
CORES="0-11,24-35"

# Change MEM to allow access to ALL memory nodes (0 and 1)
MEM="0,1"

numactl -C $CORES -m $MEM \
  "$BIN" \
  --model "$MODEL" \
  --threads 24 \
  --parallel 2 \
  --port $PORT \
  --host 0.0.0.0
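For reference, a minimal sketch of the launch with the two changes that came out of this thread: memory bound to node 0 only (the 96 GB goal) and the context capped at 10k tokens instead of the model's 1M maximum. Flag names are the usual llama-server ones; double-check against ik_llama.cpp's --help.

# Bind CPU and memory to socket 0 only, and cap the reserved context window
numactl -C $CORES -m 0 \
  "$BIN" \
  --model "$MODEL" \
  --ctx-size 10240 \
  --threads 24 \
  --parallel 2 \
  --port $PORT \
  --host 0.0.0.0

Note that in upstream llama.cpp the context is split between the --parallel slots, so each of the two slots would get about half of those 10k tokens; I assume the fork behaves the same.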
Speeds I'm getting:

Prompt
- Tokens: 499
- Time: 15823.784 ms
- Speed: 31.5 t/s

Generation
- Tokens: 2120
- Time: 303319.544 ms
- Speed: 7.0 t/s
Can't tell what is wrong, but just to share: in my case with llama.cpp (not the fork you are using), I'm fitting the Q3_K_XL with a 36k-token window into 23 GB of RAM, keeping the KV cache in f16; flash attention is enabled, though.
Thank you! That is a helpful data point!