Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench Paper • 2510.26865 • Published 10 days ago • 11
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? Paper • 2509.16941 • Published Sep 21 • 20
FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions Paper • 2509.17177 • Published Sep 21 • 13
Beyond Solving Math Quiz: Evaluating the Ability of Large Reasoning Models to Ask for Information Paper • 2508.11252 • Published Aug 15 • 3
SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications Paper • 2506.18951 • Published Jun 23 • 21