SWE-Universe: Scale Real-World Verifiable Environments to Millions Paper • 2602.02361 • Published 10 days ago • 59
PingPong: A Natural Benchmark for Multi-Turn Code-Switching Dialogues Paper • 2601.17277 • Published 20 days ago • 6
VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos Paper • 2510.19488 • Published Oct 22, 2025 • 20
SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark Paper • 2402.05138 • Published Feb 6, 2024 • 2
MTSQL-R1: Towards Long-Horizon Multi-Turn Text-to-SQL via Agentic Training Paper • 2510.12831 • Published Oct 12, 2025 • 5
Talk Less, Call Right: Enhancing Role-Play LLM Agents with Automatic Prompt Optimization and Role Prompting Paper • 2509.00482 • Published Aug 30, 2025
Thai Semantic End-of-Turn Detection for Real-Time Voice Agents Paper • 2510.04016 • Published Oct 5, 2025 • 4
Predicting the Order of Upcoming Tokens Improves Language Modeling Paper • 2508.19228 • Published Aug 26, 2025 • 23
Mangosteen: An Open Thai Corpus for Language Model Pretraining Paper • 2507.14664 • Published Jul 19, 2025 • 7
Running in CIRCLE? A Simple Benchmark for LLM Code Interpreter Security Paper • 2507.19399 • Published Jul 25, 2025 • 2
LionGuard 2: Building Lightweight, Data-Efficient & Localised Multilingual Content Moderators Paper • 2507.15339 • Published Jul 21, 2025 • 1
Toxicity-Aware Few-Shot Prompting for Low-Resource Singlish Translation Paper • 2507.11966 • Published Jul 16, 2025
Measuring What Matters: A Framework for Evaluating Safety Risks in Real-World LLM Applications Paper • 2507.09820 • Published Jul 13, 2025
RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages Paper • 2507.05980 • Published Jul 8, 2025 • 2
OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling Paper • 2506.20512 • Published Jun 25, 2025 • 47