G2VLM-2B-MoT
Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
We present G2VLM, a geometry-grounded vision-language model proficient in both 3D reconstruction and spatial understanding tasks. For spatial reasoning questions, G2VLM can natively predict 3D geometry and employ interleaved reasoning to reach an answer.
This repository hosts the model weights for G2VLM. For installation, usage instructions, and further documentation, please visit our GitHub repository.
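As a quick orientation, a minimal loading sketch is below. It assumes the checkpoint loads through Hugging Face `transformers` with `trust_remote_code=True`; the classes and arguments here are illustrative assumptions, not the officially supported entry point, so defer to the GitHub repository for the actual instructions.

```python
# Hypothetical loading sketch -- see the GitHub repository for the
# supported installation and inference workflow.
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained(
    "InternRobotics/G2VLM-2B-MoT",
    trust_remote_code=True,  # assumes a custom G2VLM architecture is shipped with the repo
)
processor = AutoProcessor.from_pretrained(
    "InternRobotics/G2VLM-2B-MoT",
    trust_remote_code=True,
)
```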

Method
G2VLM is a unified model that integrates a geometric perception expert for 3D reconstruction with a semantic perception expert for multimodal understanding and spatial reasoning. All tokens participate in shared multimodal self-attention in each transformer block, as sketched below.
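To make the token flow concrete, here is a minimal sketch of one such block, assuming a standard pre-norm transformer design with one feed-forward expert per modality. `MoTBlock`, the `is_geo` routing mask, and the layer sizes are our illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    """Sketch of one Mixture-of-Transformers block: every token joins a
    shared self-attention, then routes to a per-modality FFN expert."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Two experts: geometric perception (3D reconstruction) and
        # semantic perception (multimodal understanding / reasoning).
        self.geo_ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.sem_ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, is_geo: torch.Tensor) -> torch.Tensor:
        # Shared multimodal self-attention over the full token sequence,
        # so geometry and text tokens see each other's context.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Route each token to its expert FFN by a boolean modality mask.
        h = self.norm2(x)
        ffn_out = torch.where(is_geo.unsqueeze(-1),
                              self.geo_ffn(h), self.sem_ffn(h))
        return x + ffn_out
```

In a forward pass, geometry tokens and text tokens would sit in the same sequence, so both experts see the full multimodal context through the shared attention while keeping modality-specific feed-forward weights.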

License
G2VLM is licensed under the Apache 2.0 license.
Citation
```bibtex
@article{hu2025g2vlmgeometrygroundedvision,
  title={G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning},
  author={Wenbo Hu and Jingli Lin and Yilin Long and Yunlong Ran and Lihan Jiang and Yifan Wang and Chenming Zhu and Runsen Xu and Tai Wang and Jiangmiao Pang},
  year={2025},
  eprint={2511.21688},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.21688},
}
```
Base model: Qwen/Qwen2-VL-2B