# Benchmark Results for Qwen3-4B-Instruct-2507-UD-Q4_K_XL.gguf

Commands used to run the models (the three runs differ only in KV cache precision, set via `-ctk`/`-ctv`):

```
$ llama-server.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\Users\user\Desktop\llama.cpp\ggml-cuda.dll
load_backend: loaded RPC backend from C:\Users\user\Desktop\llama.cpp\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Users\user\Desktop\llama.cpp\ggml-cpu-icelake.dll
version: 6327 (4d74393b)
built with clang version 19.1.5 for x86_64-pc-windows-msvc

$ llama-server --host 0.0.0.0 --port 8000 --no-mmap -c 32768 -ub 4096 -fa on -ngl 99 --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 --jinja -np 8 -ctk f16 -ctv f16 -m MODEL.gguf
$ llama-server --host 0.0.0.0 --port 8000 --no-mmap -c 32768 -ub 4096 -fa on -ngl 99 --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 --jinja -np 8 -ctk q8_0 -ctv q8_0 -m MODEL.gguf
$ llama-server --host 0.0.0.0 --port 8000 --no-mmap -c 32768 -ub 4096 -fa on -ngl 99 --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 --jinja -np 8 -ctk q4_0 -ctv q4_0 -m MODEL.gguf
```

**Model Specifications:**

- Tested Model: Qwen3-4B-Instruct-2507-UD-Q4_K_XL.gguf (4-bit quantized weights)
- Original Model Reference: Qwen3-4B-Instruct-2507 (bf16)
- All tests were performed with the quantized weights under different KV cache configurations

### MMLU-PRO and GPQA Results

| Configuration | MMLU-PRO (Overall) | GPQA (Accuracy) | GPQA (Refusal Fraction) |
| :--- | :--- | :--- | :--- |
| **Original (bf16 weights)** | **69.60%** | N/A | N/A |
| **f16 cache** | 66.53% | 26.79% | 35.49% |
| **q8_0 cache** | 66.26% | 25.89% | 35.94% |
| **q4_0 cache** | 63.70% | 26.12% | 37.72% |

### MMLU-PRO by Subject (Quantized Model)

| Subject | f16 cache | q8_0 cache | q4_0 cache |
| :--- | :--- | :--- | :--- |
| **Overall** | **66.53%** | **66.26%** | **63.70%** |
| Biology | 82.29% | 81.45% | 79.36% |
| Business | 73.64% | 72.50% | 72.24% |
| Chemistry | 73.14% | 72.88% | 70.14% |
| Computer Science | 71.71% | 73.41% | 69.27% |
| Economics | 76.42% | 76.18% | 72.27% |
| Engineering | 47.27% | 47.57% | 43.34% |
| Health | 65.28% | 66.14% | 62.96% |
| History | 49.34% | 51.44% | 48.29% |
| Law | 36.60% | 33.61% | 31.34% |
| Math | 84.60% | 84.53% | 82.68% |
| Philosophy | 57.72% | 55.11% | 53.31% |
| Physics | 75.60% | 75.13% | 71.67% |
| Psychology | 68.55% | 69.92% | 67.54% |
| Other | 56.71% | 57.14% | 56.28% |

*Note: All tests were performed using Qwen3-4B-Instruct-2507-UD-Q4_K_XL.gguf (4-bit quantized weights) with different KV cache precision settings.*

*Raw results are in the folder `Qwen3_4B_kv_cache_f16_vs_q8_vs_q4`.*

### Scripts used (default settings with temperature=0.7 and top_p=0.8):

- https://github.com/chigkim/Ollama-MMLU-Pro
- https://github.com/chigkim/openai-api-gpqa

Both scripts drive the server through its OpenAI-compatible API; a minimal request sketch is included at the end of this page.

### Hardware:

- RTX 3080 10GB
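### Overall deltas vs. the bf16 baseline

The accuracy cost of each configuration can be read off the first results table. A minimal sketch that reproduces the arithmetic (the scores are copied verbatim from the tables above):

```python
# Overall MMLU-PRO scores, copied from the results tables above.
baseline_bf16 = 69.60  # original bf16 weights (reference score)
kv_cache_scores = {"f16": 66.53, "q8_0": 66.26, "q4_0": 63.70}

for cache, score in kv_cache_scores.items():
    delta = score - baseline_bf16
    print(f"{cache} cache: {score:.2f}% ({delta:+.2f} pts vs. bf16)")

# Output:
# f16 cache: 66.53% (-3.07 pts vs. bf16)
# q8_0 cache: 66.26% (-3.34 pts vs. bf16)
# q4_0 cache: 63.70% (-5.90 pts vs. bf16)
```

Most of the loss comes from the 4-bit weight quantization itself (-3.07 pts with an f16 cache); quantizing the KV cache to q8_0 costs a further 0.27 pts, while q4_0 costs an additional 2.56 pts on top of q8_0.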
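### Querying the server

Both benchmark scripts send requests to the `/v1/chat/completions` endpoint that `llama-server` exposes. A minimal sketch for sanity-checking a run before benchmarking, assuming the server is reachable on `localhost:8000` as started above (`llama-server` serves the single loaded model, so no `model` field is required in the request):

```python
# Smoke test against llama-server's OpenAI-compatible chat endpoint.
# Assumes the server is running on localhost:8000 as started above.
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
    "temperature": 0.7,  # match the sampling settings used for the benchmarks
    "top_p": 0.8,
    "max_tokens": 16,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```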