High CPU Usage / Slow Context Processing
#5
by
PussyHut - opened
To save you time:
If you encounter high CPU usage or slow context processing, it is unrelated to quantization; it's a llama.cpp issue:
- https://github.com/ggml-org/llama.cpp/issues/18948
- https://github.com/ggml-org/llama.cpp/issues/18944
A temporary quick fix is to disable flash attention: `--flash-attn off`
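As a sketch of the workaround above, here is how the flag might be passed when starting `llama-server` (the model path and port are placeholders, not from this thread):

```shell
# Workaround: start llama-server with flash attention disabled.
# Model path and port below are illustrative placeholders.
llama-server \
  -m ./models/model.gguf \
  --flash-attn off \
  --port 8080
```

Once llama.cpp ships a fix, the flag can be switched back to `--flash-attn on` (or the default).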
PussyHut changed discussion title from
High CPU usage when set `--flash-attn on/-fa on`
to High CPU Usage / Slow Context Processing
Thank you, very helpful! We shall put it in our guide in case anyone experiences this.
NOTE: this is now outdated! llama.cpp has since patched it, so you can enable flash attention again.
shimmyshimmer changed discussion status to
closed