DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints
Abstract
The DeepPlanning benchmark addresses limitations of current LLM planning assessments by introducing complex, real-world tasks that require both global optimization and local constraint reasoning.
While agent evaluation has shifted toward long-horizon tasks, most benchmarks still emphasize local, step-level reasoning rather than the global constrained optimization (e.g., time and financial budgets) that demands genuine planning ability. Meanwhile, existing LLM planning benchmarks underrepresent the active information gathering and fine-grained local constraints typical of real-world settings. To address this, we introduce DeepPlanning, a challenging benchmark for practical long-horizon agent planning. It features multi-day travel planning and multi-product shopping tasks that require proactive information acquisition, local constrained reasoning, and global constrained optimization. Evaluations on DeepPlanning show that even frontier agentic LLMs struggle with these problems, highlighting the importance of reliable explicit reasoning patterns and parallel tool use for achieving better effectiveness-efficiency trade-offs. Error analysis further points to promising directions for improving agentic LLMs over long planning horizons. We open-source the code and data to support future research.
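To make the distinction between local and global constraints concrete, here is a minimal sketch of checking a multi-product shopping plan. It is not the benchmark's actual verifier: the `Item` fields, the `check_plan` function, and the rating and budget thresholds are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Item:
    # Hypothetical product record; the field names are assumptions.
    name: str
    price: float
    rating: float


def check_plan(items, budget, min_rating):
    """Return a list of violated constraints (an empty list means the plan passes)."""
    violations = []
    # Local constraints: each item can be checked on its own.
    for item in items:
        if item.rating < min_rating:
            violations.append(
                f"local: {item.name} rating {item.rating} < {min_rating}"
            )
    # Global constraint: depends on the plan as a whole (here, a financial budget).
    total = sum(item.price for item in items)
    if total > budget:
        violations.append(
            f"global: total {total:.2f} exceeds budget {budget:.2f}"
        )
    return violations


# Illustrative plan: one item fails the local rating check,
# and the combined price breaks the global budget.
plan = [Item("tent", 120.0, 4.6), Item("stove", 45.0, 3.9)]
for violation in check_plan(plan, budget=150.0, min_rating=4.0):
    print(violation)
```

In this toy setup, fixing a local violation (swapping the stove for a better-rated one) can create or remove a global violation, which is why the two kinds of constraint have to be reasoned about together.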
Community
DeepPlanning — a new benchmark for long-horizon agent planning in real-world scenarios!