Aider Polyglot Benchmark
opened by whoisjeremylam
Here are the results of a single pass through the current Aider polyglot benchmark. These results are for the non-quantized weights from this repo.
pass_rate_2 seems to be a little lower than the rate reported for the default BF16 model; in the Aider Discord, BF16 is reported to be 32.4.
I'm a newb when it comes to Aider and vLLM, so I've also included the command lines I used in case there is something wrong.
Serve:
CUDA_VISIBLE_DEVICES=0,1 \
vllm serve \
--port 5000 \
--gpu-memory-utilization=0.95 \
--max_model_len=131072 \
--pipeline-parallel-size 2 \
--model=/home/ai/models/cerebras/Qwen3-Coder-REAP-25B-A3B \
--served-model-name Qwen3-Coder-REAP-25B-A3B
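To sanity-check the endpoint before benchmarking (a minimal check, assuming the server is reachable on localhost at the port above), listing the models should return the served model name:

# Should list Qwen3-Coder-REAP-25B-A3B if the server came up correctly.
curl -s http://localhost:5000/v1/models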
Aider:
./benchmark/benchmark.py Qwen3-Coder-REAP-25B-A3B --model openai/Qwen3-Coder-REAP-25B-A3B --edit-format diff --threads 1 --exercises-dir polyglot-benchmark
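Note: the openai/ model prefix means aider talks to an OpenAI-compatible endpoint, so the benchmark run presumably also needs something like the following pointed at the vLLM server (base URL assumed from the serve command above; the API key is just a placeholder since vLLM doesn't require one by default):

# Route aider's OpenAI-compatible client to the local vLLM server.
export OPENAI_API_BASE=http://localhost:5000/v1
export OPENAI_API_KEY=dummy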
Results:
- dirname: 2025-10-22-11-21-46--Qwen3-Coder-REAP-25B-A3B
  test_cases: 225
  model: openai/Qwen3-Coder-REAP-25B-A3B
  edit_format: diff
  commit_hash: 11516d6
  pass_rate_1: 12.0
  pass_rate_2: 27.6
  pass_num_1: 27
  pass_num_2: 62
  percent_cases_well_formed: 94.7
  error_outputs: 20
  num_malformed_responses: 19
  num_with_malformed_responses: 12
  user_asks: 114
  lazy_comments: 1
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 1
  prompt_tokens: 3341547
  completion_tokens: 549320
  test_timeouts: 5
  total_tests: 225
  command: aider --model openai/Qwen3-Coder-REAP-25B-A3B
  date: 2025-10-22
  versions: 0.86.2.dev
  seconds_per_case: 35.2
  total_cost: 0.0000
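As a quick cross-check of the raw counts above (plain arithmetic, nothing model-specific), the pass rates are consistent with pass_num_1 / test_cases and pass_num_2 / test_cases:

# Recompute the two pass rates from the counts in the results block.
python3 -c "print(round(100*27/225, 1), round(100*62/225, 1))"   # prints: 12.0 27.6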
Interesting, so the pass rate did decrease a bit. Thank you.