High CPU Usage / Slow Context Processing

#5
by PussyHut - opened

To save your time:

If you encounter high CPU usage or slow context processing, it is unrelated to quantization; it's a llama.cpp issue:

A temporary quick fix is to disable flash attention: `--flash-attn off`
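For anyone unsure where the flag goes, here is a minimal sketch of a `llama-server` launch with flash attention disabled. The model path and port are placeholders, not from this thread; adjust them for your setup.

```shell
# Hypothetical example: serve a GGUF model with flash attention
# explicitly disabled to work around the high-CPU/slow-prefill bug.
# Replace the model path with your own quantized file.
llama-server \
  --model ./models/your-model-Q4_K_M.gguf \
  --flash-attn off \
  --port 8080
```

The same `--flash-attn off` flag also works with `llama-cli` for local one-off runs.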

PussyHut changed discussion title from High CPU usage when set `--flash-attn on/-fa on` to High CPU Usage / Slow Context Processing
Unsloth AI org


Thank you, very helpful! We shall put it in our guide in case anyone experiences this!

Unsloth AI org
•
edited Jan 24

NOTE: this is now outdated! llama.cpp has patched the issue, so you can enable flash attention again.

shimmyshimmer changed discussion status to closed
