
Benchmark Results for Qwen3-4B-Instruct-2507-UD-Q4_K_XL.gguf

Commands used to run the models (version info shown first):

$ llama-server.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\Users\user\Desktop\llama.cpp\ggml-cuda.dll
load_backend: loaded RPC backend from C:\Users\user\Desktop\llama.cpp\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Users\user\Desktop\llama.cpp\ggml-cpu-icelake.dll
version: 6327 (4d74393b)
built with clang version 19.1.5 for x86_64-pc-windows-msvc

$ llama-server --host 0.0.0.0 --port 8000 --no-mmap -c 32768 -ub 4096 -fa on -ngl 99 --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 --jinja -np 8 -ctk f16 -ctv f16 -m MODEL.gguf
$ llama-server --host 0.0.0.0 --port 8000 --no-mmap -c 32768 -ub 4096 -fa on -ngl 99 --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 --jinja -np 8 -ctk q8_0 -ctv q8_0 -m MODEL.gguf
$ llama-server --host 0.0.0.0 --port 8000 --no-mmap -c 32768 -ub 4096 -fa on -ngl 99 --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 --jinja -np 8 -ctk q4_0 -ctv q4_0 -m MODEL.gguf
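The only difference between the three runs is the `-ctk`/`-ctv` cache type. A rough sketch of what that choice costs in VRAM at the `-c 32768` context used here: the architecture numbers below (36 layers, 8 KV heads, head dim 128) are Qwen3-4B's published config and are an assumption on my part, and the bytes-per-value figures follow ggml's block layouts (q8_0: 34 bytes per 32 values, q4_0: 18 bytes per 32 values).

```python
# Rough KV cache size estimate for the three -ctk/-ctv settings above.
# Assumed Qwen3-4B config: 36 layers, 8 KV heads, head dim 128 (not from this doc).
N_LAYERS = 36
N_KV_HEADS = 8
HEAD_DIM = 128
CTX = 32768  # matches -c 32768

BYTES_PER_VALUE = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one fp16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values + one fp16 scale per block
}

def kv_cache_bytes(cache_type: str) -> float:
    # K and V each hold CTX * N_KV_HEADS * HEAD_DIM values per layer.
    values = 2 * CTX * N_KV_HEADS * HEAD_DIM * N_LAYERS
    return values * BYTES_PER_VALUE[cache_type]

for t in ("f16", "q8_0", "q4_0"):
    print(f"{t:>5}: {kv_cache_bytes(t) / 2**30:.2f} GiB")
```

Under these assumptions the full 32k cache is about 4.5 GiB at f16, 2.4 GiB at q8_0, and 1.3 GiB at q4_0, which matters on a 10 GB card that also holds the ~2.5 GB Q4_K_XL weights.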

Model Specifications:

  • Tested Model: Qwen3-4B-Instruct-2507-UD-Q4_K_XL.gguf (4-bit quantized)
  • Original Model Reference: Qwen3-4B-Instruct-2507 (bf16)
  • All tests performed using quantized weights with different KV cache configurations

MMLU-PRO and GPQA Results

| Quantization | MMLU-PRO (overall) | GPQA (accuracy) | GPQA (refusal fraction) |
|---|---|---|---|
| Original (bf16 weights) | 69.60% | N/A | N/A |
| f16 cache | 66.53% | 26.79% | 35.49% |
| q8_0 cache | 66.26% | 25.89% | 35.94% |
| q4_0 cache | 63.70% | 26.12% | 37.72% |
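Read against the f16 cache baseline, the overall MMLU-PRO numbers amount to the following deltas (simple arithmetic on the table values):

```python
# MMLU-PRO overall scores from the table above, in percent.
mmlu_pro = {"f16": 66.53, "q8_0": 66.26, "q4_0": 63.70}
baseline = mmlu_pro["f16"]
deltas = {cache: round(score - baseline, 2) for cache, score in mmlu_pro.items()}
for cache, d in deltas.items():
    print(f"{cache:>5} cache: {d:+.2f} pp vs f16")
```

So q8_0 costs only 0.27 percentage points overall while q4_0 costs 2.83; the GPQA accuracy differences between cache types are within about one point of each other.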

MMLU-PRO by Subject (Quantized Model)

| Subject | f16 cache | q8_0 cache | q4_0 cache |
|---|---|---|---|
| Overall | 66.53% | 66.26% | 63.70% |
| Biology | 82.29% | 81.45% | 79.36% |
| Business | 73.64% | 72.50% | 72.24% |
| Chemistry | 73.14% | 72.88% | 70.14% |
| Computer Science | 71.71% | 73.41% | 69.27% |
| Economics | 76.42% | 76.18% | 72.27% |
| Engineering | 47.27% | 47.57% | 43.34% |
| Health | 65.28% | 66.14% | 62.96% |
| History | 49.34% | 51.44% | 48.29% |
| Law | 36.60% | 33.61% | 31.34% |
| Math | 84.60% | 84.53% | 82.68% |
| Philosophy | 57.72% | 55.11% | 53.31% |
| Physics | 75.60% | 75.13% | 71.67% |
| Psychology | 68.55% | 69.92% | 67.54% |
| Other | 56.71% | 57.14% | 56.28% |
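A quick way to read the per-subject table is to rank subjects by the f16 → q4_0 drop. The figures below are copied from the table above:

```python
# Per-subject MMLU-PRO scores (percent) from the table above.
f16 = {
    "Biology": 82.29, "Business": 73.64, "Chemistry": 73.14,
    "Computer Science": 71.71, "Economics": 76.42, "Engineering": 47.27,
    "Health": 65.28, "History": 49.34, "Law": 36.60, "Math": 84.60,
    "Philosophy": 57.72, "Physics": 75.60, "Psychology": 68.55,
    "Other": 56.71,
}
q4_0 = {
    "Biology": 79.36, "Business": 72.24, "Chemistry": 70.14,
    "Computer Science": 69.27, "Economics": 72.27, "Engineering": 43.34,
    "Health": 62.96, "History": 48.29, "Law": 31.34, "Math": 82.68,
    "Philosophy": 53.31, "Physics": 71.67, "Psychology": 67.54,
    "Other": 56.28,
}
drop = {s: round(f16[s] - q4_0[s], 2) for s in f16}
for subject, d in sorted(drop.items(), key=lambda kv: -kv[1]):
    print(f"{subject:<16} -{d:.2f} pp")
```

The damage is uneven: Law loses the most (5.26 points), followed by Economics and Philosophy, while Other, Psychology, and History barely move.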

Note: all tests were performed with Qwen3-4B-Instruct-2507-UD-Q4_K_XL.gguf (4-bit quantized weights), varying only the KV cache precision.
Raw results are available in the Qwen3_4B_kv_cache_f16_vs_q8_vs_q4 folder.

Scripts used (default settings with temperature=0.7 and top_p=0.8):

Hardware:

  • RTX 3080 10GB