Performance problem: 32B 4x slower than 30B

#27
by jagusztinl - opened

The 32B and 30B models are similar in size, but there is a huge difference in performance:

x86 CUDA (4x difference):
llama.cpp build: 6a2bc8bf (5415)
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA A10-24Q, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 32B Q4_0 | 17.41 GiB | 32.76 B | CUDA | 99 | 1 | pp512 | 672.53 ± 3.28 |
| qwen3 32B Q4_0 | 17.41 GiB | 32.76 B | CUDA | 99 | 1 | tg128 | 23.22 ± 0.01 |
| qwen3moe 30B.A3B Q4_0 | 16.42 GiB | 30.53 B | CUDA | 99 | 1 | pp512 | 1328.67 ± 15.35 |
| qwen3moe 30B.A3B Q4_0 | 16.42 GiB | 30.53 B | CUDA | 99 | 1 | tg128 | 103.50 ± 0.17 |

ARM (2.5x difference):
llama.cpp build: 814f795e (5307)
| model | size | params | backend | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 32B Q4_0 | 17.41 GiB | 32.76 B | BLAS | 64 | 1 | pp512 | 131.40 ± 0.21 |
| qwen3 32B Q4_0 | 17.41 GiB | 32.76 B | BLAS | 64 | 1 | tg128 | 14.44 ± 0.10 |
| qwen3moe 30B.A3B Q4_0 | 16.42 GiB | 30.53 B | BLAS | 64 | 1 | pp512 | 383.43 ± 3.12 |
| qwen3moe 30B.A3B Q4_0 | 16.42 GiB | 30.53 B | BLAS | 64 | 1 | tg128 | 39.50 ± 0.16 |

What is the explanation for this?

You should first learn the difference between an MoE model (Qwen3 30B-A3B) and a dense model (Qwen3 32B). The dense 32B model activates all ~32.8B parameters for every token, while the MoE model routes each token through only a few experts, activating only ~3B parameters per token (the "A3B" in the name). So even though the two files are nearly the same size on disk, the MoE model does far less compute and reads far less weight data from memory per generated token.
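A minimal back-of-envelope sketch of the gap (the ~3B active-parameter figure is taken from the "A3B" suffix, and the ~2 FLOPs per active parameter per token is the usual rough estimate, so treat these numbers as order-of-magnitude only):

```python
# Decode (token-generation) cost scales roughly with the number of
# parameters that must be touched per generated token, not total size.
dense_active_params = 32.76e9  # Qwen3 32B: every parameter is active each token
moe_active_params = 3.0e9      # Qwen3 30B-A3B: only ~3B active per token

# ~2 FLOPs (one multiply + one add) per active parameter per token.
dense_flops_per_token = 2 * dense_active_params
moe_flops_per_token = 2 * moe_active_params

ratio = dense_flops_per_token / moe_flops_per_token
print(f"dense : {dense_flops_per_token:.2e} FLOPs/token")
print(f"MoE   : {moe_flops_per_token:.2e} FLOPs/token")
print(f"ratio : {ratio:.1f}x")
```

This naive estimate predicts roughly a 10x compute gap; the measured tg128 gap is only ~4.5x because costs that the MoE design does not shrink (attention, KV-cache reads, kernel launch overhead) still apply equally to both models.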

jagusztinl changed discussion status to closed
