High CPU Usage / Slow Context Processing
#5
by
PussyHut - opened
To save you time:
If you encounter high CPU usage or slow context processing, it is unrelated to quantization; it's a llama.cpp issue:
- https://github.com/ggml-org/llama.cpp/issues/18948
- https://github.com/ggml-org/llama.cpp/issues/18944
A temporary quick fix is to disable flash attention: `--flash-attn off`
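As a sketch of the workaround above, here is how the flag might be passed when starting `llama-server` (the model path and port are placeholders, not from this thread):

```shell
# Workaround: start llama-server with flash attention disabled.
# Model path and port below are illustrative placeholders.
llama-server \
  -m ./models/model.gguf \
  --flash-attn off \
  --port 8080
```

Once llama.cpp ships a fix, the flag can be switched back to `--flash-attn on` (or the default).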
PussyHut changed discussion title from
High CPU usage when set `--flash-attn on/-fa on`
to High CPU Usage / Slow Context Processing
Thank you, very helpful! We shall put it in our guide in case anyone experiences this.
NOTE: this is now outdated! llama.cpp has since patched it, so you can enable flash attention again.
shimmyshimmer changed discussion status to
closed