Performance problem: 32B 4x slower than 30B

#27
by jagusztinl - opened

The 32B and 30B models are similar in size, but there is a huge difference in performance:

x86 CUDA (4x difference):
llama.cpp build: 6a2bc8bf (5415)
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA A10-24Q, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 32B Q4_0 | 17.41 GiB | 32.76 B | CUDA | 99 | 1 | pp512 | 672.53 ± 3.28 |
| qwen3 32B Q4_0 | 17.41 GiB | 32.76 B | CUDA | 99 | 1 | tg128 | 23.22 ± 0.01 |
| qwen3moe 30B.A3B Q4_0 | 16.42 GiB | 30.53 B | CUDA | 99 | 1 | pp512 | 1328.67 ± 15.35 |
| qwen3moe 30B.A3B Q4_0 | 16.42 GiB | 30.53 B | CUDA | 99 | 1 | tg128 | 103.50 ± 0.17 |

ARM (2.5x difference):
llama.cpp build: 814f795e (5307)
| model | size | params | backend | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 32B Q4_0 | 17.41 GiB | 32.76 B | BLAS | 64 | 1 | pp512 | 131.40 ± 0.21 |
| qwen3 32B Q4_0 | 17.41 GiB | 32.76 B | BLAS | 64 | 1 | tg128 | 14.44 ± 0.10 |
| qwen3moe 30B.A3B Q4_0 | 16.42 GiB | 30.53 B | BLAS | 64 | 1 | pp512 | 383.43 ± 3.12 |
| qwen3moe 30B.A3B Q4_0 | 16.42 GiB | 30.53 B | BLAS | 64 | 1 | tg128 | 39.50 ± 0.16 |

What is the explanation for this?

You should first learn the difference between an MoE model (Qwen3 30B-A3B) and a dense model (Qwen3 32B). The dense 32B model activates all ~32.8B parameters for every token, while the MoE model routes each token through only a few experts, activating only ~3B parameters per token (the "A3B" in the name). So even though the two files are nearly the same size on disk, the MoE model does far less compute and reads far less weight data from memory per generated token.
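A minimal back-of-envelope sketch of the gap (the ~3B active-parameter figure is taken from the "A3B" suffix, and the ~2 FLOPs per active parameter per token is the usual rough estimate, so treat these numbers as order-of-magnitude only):

```python
# Decode (token-generation) cost scales roughly with the number of
# parameters that must be touched per generated token, not total size.
dense_active_params = 32.76e9  # Qwen3 32B: every parameter is active each token
moe_active_params = 3.0e9      # Qwen3 30B-A3B: only ~3B active per token

# ~2 FLOPs (one multiply + one add) per active parameter per token.
dense_flops_per_token = 2 * dense_active_params
moe_flops_per_token = 2 * moe_active_params

ratio = dense_flops_per_token / moe_flops_per_token
print(f"dense : {dense_flops_per_token:.2e} FLOPs/token")
print(f"MoE   : {moe_flops_per_token:.2e} FLOPs/token")
print(f"ratio : {ratio:.1f}x")
```

This naive estimate predicts roughly a 10x compute gap; the measured tg128 gap is only ~4.5x because costs that the MoE design does not shrink (attention, KV-cache reads, kernel launch overhead) still apply equally to both models.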

jagusztinl changed discussion status to closed
