# Benchmark Results for Qwen3-4B-Instruct-2507-UD-Q4_K_XL.gguf

Commands used to run the models (the three runs differ only in KV cache precision, set via `-ctk`/`-ctv`):

```
$ llama-server.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\Users\user\Desktop\llama.cpp\ggml-cuda.dll
load_backend: loaded RPC backend from C:\Users\user\Desktop\llama.cpp\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Users\user\Desktop\llama.cpp\ggml-cpu-icelake.dll
version: 6327 (4d74393b)
built with clang version 19.1.5 for x86_64-pc-windows-msvc

$ llama-server --host 0.0.0.0 --port 8000 --no-mmap -c 32768 -ub 4096 -fa on -ngl 99 --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 --jinja -np 8 -ctk f16 -ctv f16 -m MODEL.gguf
$ llama-server --host 0.0.0.0 --port 8000 --no-mmap -c 32768 -ub 4096 -fa on -ngl 99 --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 --jinja -np 8 -ctk q8_0 -ctv q8_0 -m MODEL.gguf
$ llama-server --host 0.0.0.0 --port 8000 --no-mmap -c 32768 -ub 4096 -fa on -ngl 99 --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 --jinja -np 8 -ctk q4_0 -ctv q4_0 -m MODEL.gguf
```

**Model Specifications:**

- Tested Model: Qwen3-4B-Instruct-2507-UD-Q4_K_XL.gguf (4-bit quantized weights)
- Original Model Reference: Qwen3-4B-Instruct-2507 (bf16)
- All tests were performed with the quantized weights under different KV cache configurations

### MMLU-PRO and GPQA Results

| Configuration | MMLU-PRO (Overall) | GPQA (Accuracy) | GPQA (Refusal Fraction) |
| :--- | :--- | :--- | :--- |
| **Original (bf16 weights)** | **69.60%** | N/A | N/A |
| **f16 cache** | 66.53% | 26.79% | 35.49% |
| **q8_0 cache** | 66.26% | 25.89% | 35.94% |
| **q4_0 cache** | 63.70% | 26.12% | 37.72% |

### MMLU-PRO by Subject (Quantized Model)

| Subject | f16 cache | q8_0 cache | q4_0 cache |
| :--- | :--- | :--- | :--- |
| **Overall** | **66.53%** | **66.26%** | **63.70%** |
| Biology | 82.29% | 81.45% | 79.36% |
| Business | 73.64% | 72.50% | 72.24% |
| Chemistry | 73.14% | 72.88% | 70.14% |
| Computer Science | 71.71% | 73.41% | 69.27% |
| Economics | 76.42% | 76.18% | 72.27% |
| Engineering | 47.27% | 47.57% | 43.34% |
| Health | 65.28% | 66.14% | 62.96% |
| History | 49.34% | 51.44% | 48.29% |
| Law | 36.60% | 33.61% | 31.34% |
| Math | 84.60% | 84.53% | 82.68% |
| Philosophy | 57.72% | 55.11% | 53.31% |
| Physics | 75.60% | 75.13% | 71.67% |
| Psychology | 68.55% | 69.92% | 67.54% |
| Other | 56.71% | 57.14% | 56.28% |

*Note: All tests were performed using Qwen3-4B-Instruct-2507-UD-Q4_K_XL.gguf (4-bit quantized weights) with different KV cache precision settings.*

*Raw results are in the folder `Qwen3_4B_kv_cache_f16_vs_q8_vs_q4`.*

### Scripts used (default settings with temperature=0.7 and top_p=0.8):

- https://github.com/chigkim/Ollama-MMLU-Pro
- https://github.com/chigkim/openai-api-gpqa

Both scripts drive the server through its OpenAI-compatible API; a minimal request sketch is included at the end of this page.

### Hardware:

- RTX 3080 10GB
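### Overall deltas vs. the bf16 baseline

The accuracy cost of each configuration can be read off the first results table. A minimal sketch that reproduces the arithmetic (the scores are copied verbatim from the tables above):

```python
# Overall MMLU-PRO scores, copied from the results tables above.
baseline_bf16 = 69.60  # original bf16 weights (reference score)
kv_cache_scores = {"f16": 66.53, "q8_0": 66.26, "q4_0": 63.70}

for cache, score in kv_cache_scores.items():
    delta = score - baseline_bf16
    print(f"{cache} cache: {score:.2f}% ({delta:+.2f} pts vs. bf16)")

# Output:
# f16 cache: 66.53% (-3.07 pts vs. bf16)
# q8_0 cache: 66.26% (-3.34 pts vs. bf16)
# q4_0 cache: 63.70% (-5.90 pts vs. bf16)
```

Most of the loss comes from the 4-bit weight quantization itself (-3.07 pts with an f16 cache); quantizing the KV cache to q8_0 costs a further 0.27 pts, while q4_0 costs an additional 2.56 pts on top of q8_0.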
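### Querying the server

Both benchmark scripts send requests to the `/v1/chat/completions` endpoint that `llama-server` exposes. A minimal sketch for sanity-checking a run before benchmarking, assuming the server is reachable on `localhost:8000` as started above (`llama-server` serves the single loaded model, so no `model` field is required in the request):

```python
# Smoke test against llama-server's OpenAI-compatible chat endpoint.
# Assumes the server is running on localhost:8000 as started above.
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
    "temperature": 0.7,  # match the sampling settings used for the benchmarks
    "top_p": 0.8,
    "max_tokens": 16,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```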