MARS Achieves Strong Results on Google DeepMind's IMO-Bench
We evaluated OptiLLM's MARS (Multi-Agent Reasoning System) approach on IMO-Bench, Google DeepMind's challenging mathematical reasoning benchmark built from International Mathematical Olympiad-level problems.
What is MARS?
MARS is a multi-agent reasoning technique that works with any LLM. It runs three parallel reasoning agents that solve the problem independently, then verifies their solutions through consensus and iterative refinement. The key advantage: it is model-agnostic and can be applied to any base model through OptiLLM's inference proxy.
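To make the idea concrete, here is a minimal sketch of the parallel-agents-plus-consensus loop. This is illustrative only, not OptiLLM's actual implementation: the prompts, the extract_answer() heuristic, and the refinement strategy shown here are all assumptions.

```python
# Conceptual sketch of MARS-style reasoning: sample several independent
# solutions, take a majority vote, and refine when the agents disagree.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # point base_url at any OpenAI-compatible server

def extract_answer(text: str) -> str:
    # Illustrative heuristic: treat the last non-empty line as the answer.
    return [l for l in text.strip().splitlines() if l.strip()][-1]

def mars_sketch(problem: str, model: str, n_agents: int = 3,
                max_rounds: int = 2) -> str:
    prompt = problem
    answer = ""
    for _ in range(max_rounds):
        # Run the agents (sequentially here for simplicity; MARS runs them
        # in parallel) with temperature > 0 for diverse solution paths.
        solutions = [
            client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.8,
            ).choices[0].message.content
            for _ in range(n_agents)
        ]
        answers = [extract_answer(s) for s in solutions]
        answer, votes = Counter(answers).most_common(1)[0]
        if votes > n_agents // 2:  # consensus reached
            return answer
        # No consensus: feed the disagreeing attempts back for refinement.
        joined = "\n\n---\n\n".join(solutions)
        prompt = (f"{problem}\n\nPrior attempts disagree:\n{joined}\n\n"
                  "Identify the errors and produce a corrected solution.")
    return answer  # fall back to the plurality answer
```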
Results on IMO-Bench:
AnswerBench (400 short-answer problems):
- MARS: 36.0% (144/400 correct)
- Baseline: 24.5% (98/400 correct)
- Improvement: +11.5pp, with gains in all four categories
Category breakdown:
- Algebra: 33% (vs 21% baseline)
- Combinatorics: 26% (vs 19% baseline)
- Geometry: 43% (vs 28% baseline)
- Number Theory: 42% (vs 30% baseline)
ProofBench (60 proof construction problems):
- MARS: 26.7% (16/60 correct)
- Baseline: 18.3% (11/60 correct)
- Improvement: +8.4pp
Category breakdown:
- Number Theory: 42.9% (vs 14.3% baseline)
- Combinatorics: 37.5% (vs 31.2% baseline)
- Algebra: 18.8% (vs 25.0% baseline)
- Geometry: 7.1% (vs 0.0% baseline)
All results were achieved using google/gemini-2.5-flash-lite-preview-09-2025 as the base model. The same MARS approach can enhance reasoning for any model through OptiLLM's OpenAI-compatible API.
Datasets available at:
AnswerBench: huggingface.co/datasets/Hwilner/imo-answerbench
ProofBench: huggingface.co/datasets/Hwilner/imo-proofbench
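To run your own evaluation, both benchmarks load directly with the Hugging Face datasets library. The split printed below depends on how the datasets are published; check the dataset cards for the exact split and column names.

```python
# Load the IMO-Bench datasets from the Hugging Face Hub.
from datasets import load_dataset

answerbench = load_dataset("Hwilner/imo-answerbench")
proofbench = load_dataset("Hwilner/imo-proofbench")
print(answerbench)  # inspect available splits and columns
```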
Try it yourself:
python optillm.py --approach mars --model google/gemini-2.5-flash-lite-preview-09-2025
Or via API with approach prefix:
model: "mars-google/gemini-2.5-flash-lite-preview-09-2025"
Full evaluation code and results available at: github.com/algorithmicsuperintelligence/optillm