SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale Paper • 2602.23866 • Published 11 days ago • 80
Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models Paper • 2602.24264 • Published 11 days ago • 14
Open LLM Leaderboard 🏆 Space • 13.9k • Track, rank and evaluate open LLMs and chatbots
MMLU-Pro Leaderboard 🥇 Space • 244 • More advanced and challenging multi-task evaluation
Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols Paper • 2510.09462 • Published Oct 10, 2025 • 6
DISCO: Diversifying Sample Condensation for Efficient Model Evaluation Paper • 2510.07959 • Published Oct 9, 2025 • 15
Diffusion Classifiers Understand Compositionality, but Conditions Apply Paper • 2505.17955 • Published May 23, 2025 • 22