Geometrically-Constrained Agent for Spatial Reasoning
Abstract
Geometrically-Constrained Agent (GCA) addresses the semantic-to-geometric gap in vision language models by decoupling semantic analysis and task solving with formal constraints, achieving state-of-the-art performance in spatial reasoning.
Vision Language Models (VLMs) exhibit a fundamental semantic-to-geometric gap in spatial reasoning: they excel at qualitative semantic inference, but their reasoning operates within a lossy semantic space that is misaligned with high-fidelity geometry. Current paradigms fail to bridge this gap. Training-based methods suffer from an "oracle paradox," learning flawed spatial logic from imperfect oracles. Tool-integrated methods constrain the final computation but leave the VLM's planning process unconstrained, resulting in geometrically flawed plans. In this work, we propose Geometrically-Constrained Agent (GCA), a training-free agentic paradigm that resolves this gap by introducing a formal task constraint. Specifically, we decouple the VLM's role into two stages. First, acting as a semantic analyst, the VLM translates the user's ambiguous query into a formal, verifiable task constraint, which defines the reference frame and objective. Second, acting as a task solver, the VLM generates and executes tool calls strictly within the deterministic bounds defined by the constraint. This geometrically-constrained reasoning strategy resolves the semantic-to-geometric gap, yielding a robust and verifiable reasoning pathway for spatial reasoning. Comprehensive experiments demonstrate that GCA achieves SOTA performance on multiple spatial reasoning benchmarks, surpassing existing training-based and tool-integrated methods by ~27%. Please see our homepage at https://gca-spatial-reasoning.github.io.
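The two-stage split described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `ReferenceFrame`, `TaskConstraint`, `solve`, and the `"left_or_right"` objective are hypothetical names introduced here, and the example assumes a 2D ground plane with x pointing right and y pointing up when viewed from above. The constraint object stands in for the output of the semantic-analyst stage; the solver stands in for a deterministic computation that is only allowed to answer the objective fixed by that constraint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceFrame:
    """Hypothetical reference frame: where the anchor stands and which way it faces."""
    origin: tuple[float, float]   # anchor position on the ground plane
    forward: tuple[float, float]  # facing direction of the anchor


@dataclass(frozen=True)
class TaskConstraint:
    """Hypothetical formal constraint: a reference frame plus an objective."""
    frame: ReferenceFrame
    objective: str                # only "left_or_right" is supported in this sketch
    target: tuple[float, float]   # position of the queried object


def solve(constraint: TaskConstraint) -> str:
    """Answer the objective deterministically inside the given reference frame."""
    if constraint.objective != "left_or_right":
        raise ValueError(f"unsupported objective: {constraint.objective}")
    fx, fy = constraint.frame.forward
    if fx == 0 and fy == 0:
        raise ValueError("forward direction must be non-zero")
    ox, oy = constraint.frame.origin
    tx, ty = constraint.target
    dx, dy = tx - ox, ty - oy
    # z-component of the 2D cross product forward x (target - origin):
    # positive means the target lies counter-clockwise of forward, i.e. to the left.
    cross = fx * dy - fy * dx
    if cross > 0:
        return "left"
    if cross < 0:
        return "right"
    return "aligned"


# The same target yields different answers under different reference frames,
# which is why the frame has to be fixed before any computation happens.
facing_up = TaskConstraint(ReferenceFrame((0.0, 0.0), (0.0, 1.0)), "left_or_right", (1.0, 0.0))
facing_down = TaskConstraint(ReferenceFrame((0.0, 0.0), (0.0, -1.0)), "left_or_right", (1.0, 0.0))
print(solve(facing_up))    # right
print(solve(facing_down))  # left
```

In this sketch the answer flips when only the facing direction changes, which is the kind of ambiguity a query such as "is the chair to the left of the sofa?" leaves open until a reference frame is made explicit.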
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Pursuing Minimal Sufficiency in Spatial Reasoning (2025)
- TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics (2025)
- Executable Analytic Concepts as the Missing Link Between VLM Insight and Precise Manipulation (2025)
- SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models (2025)
- Vision-Language Memory for Spatial Reasoning (2025)
- ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use (2025)
- Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks (2025)