Is it normal that this model hoards 130 GB of RAM, 5x more than the model size? *solved*

#2
by MartinPatterson - opened

Update: This issue was solved simply by defining CTX=10240 (10k tokens) instead of letting IK-llama reserve the maximum context window the model allows (1M tokens). Thanks owao for the help!

After that change, memory consumption is not far from the model size.
[2025-11-09_15:11:31]

Per-node process memory usage (in MBs) for PID 3030 (llama-server)
                   Node 0          Node 1           Total
          --------------- --------------- ---------------
Huge                 0.00            0.00            0.00
Heap                41.41            0.00           41.41
Stack                0.09            0.00            0.09
Private          25970.79            0.11        25970.90

Total            26012.29            0.11        26012.40

Timestamp Total(kB) RSS(kB) Dirty(kB)

2025-11-09_15:13:06 total kB 29853816 26636704 1530988
2025-11-09_15:13:07 total kB 29853816 26636704 1530988
2025-11-09_15:13:08 total kB 29853816 26636704 1530988
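
For anyone wondering why capping the context makes such a dramatic difference: llama-server pre-allocates the KV cache for the whole configured context at startup, and at f16 that grows linearly with the token count. Below is a rough back-of-the-envelope sketch; the layer/head numbers are placeholders I picked for illustration, not this model's real values (those are printed in the llama-server startup log), so treat the output as an order-of-magnitude estimate only.

# Rough KV-cache size estimate (sketch only). The architecture numbers are
# ASSUMED placeholders -- substitute the n_layer / n_head_kv / head_dim values
# printed in the llama-server startup log.
N_LAYER=48      # assumed transformer layer count
N_KV_HEADS=8    # assumed KV heads (GQA)
HEAD_DIM=128    # assumed per-head dimension
BYTES=2         # f16 cache = 2 bytes per element

for N_CTX in 1000000 10240; do
  # K + V caches: 2 * layers * context * kv_heads * head_dim * bytes_per_element
  KV_BYTES=$((2 * N_LAYER * N_CTX * N_KV_HEADS * HEAD_DIM * BYTES))
  echo "n_ctx=$N_CTX -> KV cache ~ $((KV_BYTES / 1024 / 1024)) MiB"
done
# With these placeholder numbers: ~187500 MiB at 1,000,000 tokens vs ~1920 MiB at 10,240.

Whatever the exact architecture numbers are, the point from the update holds: after capping the context, what remains is essentially the ~26 GB of weights plus a small cache.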


I have mradermacher's i1 version of this model, but I'm still scratching my head. When launching the model with launch_ik_llama_CPU0_8080_aquif-3.5-Max-42B-A3B.i1-Q4_K_M.gguf, htop shows RAM consumption going from 1.5 GB to 131 GB, and Grafana shows 152 GB. I have never seen an inference process take about 5x more RAM than the model size.
aquif-3.5-Max-42B-A3B.i1-Q4_K_M.gguf

I've heard some explanations about its unique architecture and so on, but those don't sound credible.

I have dual Xeon 4116 CPUs + 192 GB RAM. I'm using IK-llama and Ubuntu Server 24.04. I would like to be able to run this on just one socket's memory (96 GB).

MODEL=/datahub/models/aquif-3.5-Max-42B-A3B.i1-Q4_K_M.gguf
BIN=/datahub/ik_llama.cpp/build/bin/llama-server
PORT=8080
SOCKET=0

# Cores for Socket 0
CORES="0-11,24-35"

# Change MEM to allow access to ALL memory nodes (0 and 1)
MEM="0,1"

numactl -C $CORES -m $MEM \
  "$BIN" \
  --model "$MODEL" \
  --threads 24 \
  --parallel 2 \
  --port $PORT \
  --host 0.0.0.0
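
For completeness, here is a sketch of what the same script looks like with the fix from the update applied. The CTX variable name is mine; --ctx-size / -c is the standard llama.cpp server option and, as far as I can tell, ik_llama.cpp accepts it too since it is a fork. I also switched the memory binding to node 0 only, since that is the stated goal and the capped reservation should now fit in 96 GB.

CTX=10240   # 10k tokens instead of the model's 1M-token maximum

# Keep CPU cores and memory allocations on socket 0 only
numactl -C $CORES -m 0 \
  "$BIN" \
  --model "$MODEL" \
  --ctx-size $CTX \
  --threads 24 \
  --parallel 2 \
  --port $PORT \
  --host 0.0.0.0

One caveat from upstream llama.cpp behaviour: with --parallel 2 the context is, as far as I remember, split across the two slots, so each request effectively gets about half of CTX.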

The speeds I'm getting:
Prompt
  • Tokens: 499
  • Time: 15823.784 ms
  • Speed: 31.5 t/s

Generation
  • Tokens: 2120
  • Time: 303319.544 ms
  • Speed: 7.0 t/s

I can't tell what is wrong, but just to share: in my case with llama.cpp (not the fork you are using), I'm fitting the Q3_K_XL with a 36k-token window into 23 GB of RAM while keeping the KV cache in f16; flash attention is enabled, though.
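
For readers who want to reproduce that comparison on mainline llama.cpp, a sketch of such an invocation is below. This is my reconstruction, not owao's exact command: the GGUF file name is a guess, 36864 is one reading of "36k", and the flash-attention flag spelling varies a bit across llama.cpp versions (plain -fa/--flash-attn in older builds, an on/off/auto value in newer ones).

# Mainline llama.cpp, CPU-only (sketch; file name and exact flags assumed).
# KV cache stays in the default f16.
./llama-server \
  --model aquif-3.5-Max-42B-A3B-Q3_K_XL.gguf \
  --ctx-size 36864 \
  --flash-attn \
  --host 0.0.0.0 \
  --port 8080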

Thank you! That is a helpful data point!

