OJBench: A Competition Level Code Benchmark For Large Language Models Paper • 2506.16395 • Published Jun 19 • 4
Multi-Docker-Eval: A 'Shovel of the Gold Rush' Benchmark on Automatic Environment Building for Software Engineering Paper • 2512.06915 • Published 8 days ago • 12