CUDA error when prompt processing starts
I don't know why, but this error occurs with every model quantization I've tried.
```
/deploy/ai/ik_llama.cpp/ggml/src/ggml-cuda.cu:119: CUDA error
CUDA error: an illegal memory access was encountered
  current device: 0, in function launch_mul_mat_q at /deploy/ai/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:4122
  cudaFuncSetAttribute(mul_mat_q<type, mmq_x, 8, false>, cudaFuncAttributeMaxDynamicSharedMemorySize, shmem)
/deploy/ai/ik_llama.cpp/ggml/src/ggml-cuda.cu:119: CUDA error
```
Hrmm, you'll have to give more information including:
- what GPU(s) you have (e.g. are you trying to use multiple older P40s, as those can have issues with many quants)
- what OS (e.g. Linux and kernel version, or Windows), plus CUDA version and driver, etc.
- the exact command you're using to start it (e.g. are you using --jinja or not, -fmoe or not, etc.)
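For the first two points on Linux, pasting the output of the commands below is usually enough (this assumes the NVIDIA driver and CUDA toolkit are installed; adjust for your setup):

```bash
# GPU model(s), VRAM, driver version, and runtime CUDA version
nvidia-smi
# kernel and distro
uname -a
# CUDA toolkit the binaries were built with
nvcc --version
```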
My hunch is that you are using -ot to split the (gate|up) tensors across two different GPUs while still using -fmoe fused MoE ops or similar, which might cause this issue.
The quick thing to try is to run without -fmoe and also add --no-fused-up-gate to see if that works. Generally you do want -fmoe and don't want --no-fused-up-gate for the speed-up, but then you'll need to adjust your -ot.
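As a rough sketch (not your exact command; the binary path, model path, and layer ranges in the -ot regexes are just placeholders you'd adapt to your model and GPUs):

```bash
# 1) Quick test: fused paths disabled (-fmoe omitted on purpose),
#    plus whatever other flags you normally use
./build/bin/llama-server -m /path/to/model.gguf --no-fused-up-gate

# 2) If that is stable, re-enable -fmoe but adjust -ot so each layer's
#    ffn_gate/ffn_up tensors stay on the same GPU, e.g. whole layers per
#    device (hypothetical split: layers 0-29 on GPU 0, 30-99 on GPU 1)
./build/bin/llama-server -m /path/to/model.gguf -fmoe \
    -ot "blk\.[0-2]?[0-9]\.ffn_.*=CUDA0" \
    -ot "blk\.[3-9][0-9]\.ffn_.*=CUDA1"
```

If both runs still hit the illegal memory access, then the -ot split probably isn't the culprit, and the full command plus the system info above will be needed to dig further.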