
Benchmark Results for Qwen3-4B-Instruct-2507-UD-Q4_K_XL.gguf

Commands used to run the models (version info shown first):

$ llama-server.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\Users\user\Desktop\llama.cpp\ggml-cuda.dll
load_backend: loaded RPC backend from C:\Users\user\Desktop\llama.cpp\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Users\user\Desktop\llama.cpp\ggml-cpu-icelake.dll
version: 6327 (4d74393b)
built with clang version 19.1.5 for x86_64-pc-windows-msvc

$ llama-server --host 0.0.0.0 --port 8000 --no-mmap -c 32768 -ub 4096 -fa on -ngl 99 --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 --jinja -np 8 -ctk f16 -ctv f16 -m MODEL.gguf
$ llama-server --host 0.0.0.0 --port 8000 --no-mmap -c 32768 -ub 4096 -fa on -ngl 99 --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 --jinja -np 8 -ctk q8_0 -ctv q8_0 -m MODEL.gguf
$ llama-server --host 0.0.0.0 --port 8000 --no-mmap -c 32768 -ub 4096 -fa on -ngl 99 --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 --jinja -np 8 -ctk q4_0 -ctv q4_0 -m MODEL.gguf
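The only difference between the three runs is the `-ctk`/`-ctv` cache type. A rough sketch of what that choice costs in VRAM at the `-c 32768` context used here: the architecture numbers below (36 layers, 8 KV heads, head dim 128) are Qwen3-4B's published config and are an assumption on my part, and the bytes-per-value figures follow ggml's block layouts (q8_0: 34 bytes per 32 values, q4_0: 18 bytes per 32 values).

```python
# Rough KV cache size estimate for the three -ctk/-ctv settings above.
# Assumed Qwen3-4B config: 36 layers, 8 KV heads, head dim 128 (not from this doc).
N_LAYERS = 36
N_KV_HEADS = 8
HEAD_DIM = 128
CTX = 32768  # matches -c 32768

BYTES_PER_VALUE = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one fp16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values + one fp16 scale per block
}

def kv_cache_bytes(cache_type: str) -> float:
    # K and V each hold CTX * N_KV_HEADS * HEAD_DIM values per layer.
    values = 2 * CTX * N_KV_HEADS * HEAD_DIM * N_LAYERS
    return values * BYTES_PER_VALUE[cache_type]

for t in ("f16", "q8_0", "q4_0"):
    print(f"{t:>5}: {kv_cache_bytes(t) / 2**30:.2f} GiB")
```

Under these assumptions the full 32k cache is about 4.5 GiB at f16, 2.4 GiB at q8_0, and 1.3 GiB at q4_0, which matters on a 10 GB card that also holds the ~2.5 GB Q4_K_XL weights.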

Model Specifications:

  • Tested Model: Qwen3-4B-Instruct-2507-UD-Q4_K_XL.gguf (4-bit quantized)
  • Original Model Reference: Qwen3-4B-Instruct-2507 (bf16)
  • All tests performed using quantized weights with different KV cache configurations

MMLU-PRO and GPQA Results

| Quantization | MMLU-PRO (overall) | GPQA (accuracy) | GPQA (refusal fraction) |
|---|---|---|---|
| Original (bf16 weights) | 69.60% | N/A | N/A |
| f16 cache | 66.53% | 26.79% | 35.49% |
| q8_0 cache | 66.26% | 25.89% | 35.94% |
| q4_0 cache | 63.70% | 26.12% | 37.72% |
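Read against the f16 cache baseline, the overall MMLU-PRO numbers amount to the following deltas (simple arithmetic on the table values):

```python
# MMLU-PRO overall scores from the table above, in percent.
mmlu_pro = {"f16": 66.53, "q8_0": 66.26, "q4_0": 63.70}
baseline = mmlu_pro["f16"]
deltas = {cache: round(score - baseline, 2) for cache, score in mmlu_pro.items()}
for cache, d in deltas.items():
    print(f"{cache:>5} cache: {d:+.2f} pp vs f16")
```

So q8_0 costs only 0.27 percentage points overall while q4_0 costs 2.83; the GPQA accuracy differences between cache types are within about one point of each other.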

MMLU-PRO by Subject (Quantized Model)

| Subject | f16 cache | q8_0 cache | q4_0 cache |
|---|---|---|---|
| Overall | 66.53% | 66.26% | 63.70% |
| Biology | 82.29% | 81.45% | 79.36% |
| Business | 73.64% | 72.50% | 72.24% |
| Chemistry | 73.14% | 72.88% | 70.14% |
| Computer Science | 71.71% | 73.41% | 69.27% |
| Economics | 76.42% | 76.18% | 72.27% |
| Engineering | 47.27% | 47.57% | 43.34% |
| Health | 65.28% | 66.14% | 62.96% |
| History | 49.34% | 51.44% | 48.29% |
| Law | 36.60% | 33.61% | 31.34% |
| Math | 84.60% | 84.53% | 82.68% |
| Philosophy | 57.72% | 55.11% | 53.31% |
| Physics | 75.60% | 75.13% | 71.67% |
| Psychology | 68.55% | 69.92% | 67.54% |
| Other | 56.71% | 57.14% | 56.28% |
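A quick way to read the per-subject table is to rank subjects by the f16 → q4_0 drop. The figures below are copied from the table above:

```python
# Per-subject MMLU-PRO scores (percent) from the table above.
f16 = {
    "Biology": 82.29, "Business": 73.64, "Chemistry": 73.14,
    "Computer Science": 71.71, "Economics": 76.42, "Engineering": 47.27,
    "Health": 65.28, "History": 49.34, "Law": 36.60, "Math": 84.60,
    "Philosophy": 57.72, "Physics": 75.60, "Psychology": 68.55,
    "Other": 56.71,
}
q4_0 = {
    "Biology": 79.36, "Business": 72.24, "Chemistry": 70.14,
    "Computer Science": 69.27, "Economics": 72.27, "Engineering": 43.34,
    "Health": 62.96, "History": 48.29, "Law": 31.34, "Math": 82.68,
    "Philosophy": 53.31, "Physics": 71.67, "Psychology": 67.54,
    "Other": 56.28,
}
drop = {s: round(f16[s] - q4_0[s], 2) for s in f16}
for subject, d in sorted(drop.items(), key=lambda kv: -kv[1]):
    print(f"{subject:<16} -{d:.2f} pp")
```

The damage is uneven: Law loses the most (5.26 points), followed by Economics and Philosophy, while Other, Psychology, and History barely move.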

Note: all tests were performed with Qwen3-4B-Instruct-2507-UD-Q4_K_XL.gguf (4-bit quantized weights), varying only the KV cache precision.
Raw results are available in the Qwen3_4B_kv_cache_f16_vs_q8_vs_q4 folder.

Scripts used (default settings with temperature=0.7 and top_p=0.8):

Hardware:

  • RTX 3080 10GB