---
license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
library_name: transformers
base_model:
- Qwen/Qwen2-VL-2B
---
# G2VLM-2B-MoT
## Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
> We present G2VLM, a geometry-grounded vision-language model proficient in both spatial 3D reconstruction and spatial understanding tasks. For spatial reasoning questions, G2VLM can natively predict 3D geometry and employ interleaved reasoning to arrive at an answer.

This repository hosts the model weights for G2VLM. For installation, usage instructions, and further documentation, please visit our [GitHub repository](https://github.com/InternRobotics/G2VLM).
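
## 🚀 Quick Start

The snippet below is a minimal loading sketch, not the officially supported path: the Hub repository id `InternRobotics/G2VLM-2B-MoT`, the use of `AutoProcessor`/`AutoModelForImageTextToText`, and the plain-text prompt format are assumptions. Please follow the GitHub repository for the exact loading code and chat template.

```python
# Minimal loading sketch (assumed repo id, Auto* classes, and prompt format).
# Refer to the GitHub repository for the officially supported usage.
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "InternRobotics/G2VLM-2B-MoT"  # assumed Hub id for this checkpoint
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

image = Image.open("scene.jpg")  # any scene image
prompt = "How far is the chair from the table?"  # example spatial reasoning question
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```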

## 🧠 Method
G2VLM is a unified model that integrates a geometric perception expert for 3D reconstruction with a semantic perception expert for multimodal understanding and spatial reasoning. In each transformer block, tokens from both experts attend to one another through shared multi-modal self-attention, as illustrated in the sketch below.
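
The following PyTorch sketch only illustrates the idea of per-expert weights combined with shared self-attention; the module names, dimensions, and boolean-mask routing are illustrative assumptions, not the released G2VLM implementation.

```python
# Conceptual sketch of one block: separate feed-forward experts for geometry and
# semantic tokens, with a single self-attention shared across all tokens.
# Names and dimensions are illustrative, not taken from the released code.
import torch
import torch.nn as nn

class SharedAttentionMoTBlock(nn.Module):
    def __init__(self, dim: int = 1024, num_heads: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # One feed-forward expert per token type.
        self.ffn = nn.ModuleDict({
            "geometry": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
            "semantic": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
        })
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, is_geometry: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); is_geometry: (batch, seq) bool, True for geometry tokens.
        h = self.norm1(x)
        # Shared multi-modal self-attention: every token attends to every other token.
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Route each token to its own expert feed-forward network.
        h = self.norm2(x)
        ffn_out = torch.where(
            is_geometry.unsqueeze(-1), self.ffn["geometry"](h), self.ffn["semantic"](h)
        )
        return x + ffn_out
```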

## License
G2VLM is licensed under the Apache 2.0 license.
## ✍️ Citation
```bibtex
@article{hu2025g2vlmgeometrygroundedvision,
  title={G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning},
  author={Wenbo Hu and Jingli Lin and Yilin Long and Yunlong Ran and Lihan Jiang and Yifan Wang and Chenming Zhu and Runsen Xu and Tai Wang and Jiangmiao Pang},
  year={2025},
  eprint={2511.21688},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.21688},
}
```