---
license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
library_name: transformers
base_model:
- Qwen/Qwen2-VL-2B
---

# G2VLM-2B-MoT

## Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

<p align="left">
<img src="https://huggingface.co/InternRobotics/G2VLM-2B-MoT/resolve/main/assets/icon.png" alt="G2VLM" width="200"/>
</p>
<p align="left">
<a href="https://gordonhu608.github.io/g2vlm.github.io/">
<img
src="https://img.shields.io/badge/G2VLM-Website-0A66C2?logo=safari&logoColor=white" style="display: inline-block; vertical-align: middle;"
alt="G2VLM Website"
/>
</a>
<a href="https://arxiv.org/abs/2511.21688">
<img
src="https://img.shields.io/badge/G2VLM-Paper-red?logo=arxiv&logoColor=red" style="display: inline-block; vertical-align: middle;"
alt="G2VLM Paper on arXiv"
/>
</a>
<a href="https://github.com/InternRobotics/G2VLM" target="_blank" style="margin: 2px;">
<img
src="https://img.shields.io/badge/G2VLM-Codebase-536af5?color=536af5&logo=github" style="display: inline-block; vertical-align: middle;"
alt="G2VLM Codebase"
/>
</a>
</p>

> We present <b>G<sup>2</sup>VLM</b>, a geometry-grounded vision-language model proficient in both 3D reconstruction and spatial understanding tasks. For spatial reasoning questions, G<sup>2</sup>VLM natively predicts 3D geometry and employs interleaved reasoning to arrive at an answer.

This repository hosts the model weights for <b>G<sup>2</sup>VLM</b>. For installation, usage instructions, and further documentation, please visit our [GitHub repository](https://github.com/InternRobotics/G2VLM).
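As a quick, unofficial orientation, the snippet below sketches how the weights could be loaded through the 🤗 Transformers Auto classes declared in this card's metadata (`library_name: transformers`, `pipeline_tag: image-text-to-text`); the model class, processor behavior, and prompt format shown here are assumptions, so defer to the GitHub instructions wherever they differ.

```python
# Minimal, unofficial loading sketch. Assumes the checkpoint exposes the
# standard transformers image-text-to-text interface; see the GitHub repo
# for the authoritative usage instructions.
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "InternRobotics/G2VLM-2B-MoT"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to(device)

# "scene.jpg" is a placeholder path; any RGB image of a scene will do.
image = Image.open("scene.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Which object is closer to the camera, the chair or the lamp?"},
        ],
    }
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```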
<p align="left"><img src="https://huggingface.co/InternRobotics/G2VLM-2B-MoT/resolve/main/assets/teaser.png" width="100%"></p>

## 🧠 Method
<i>G<sup>2</sup>VLM is a unified model that integrates both a geometric perception expert for 3D reconstruction and a semantic perception expert for multimodal understanding and spatial reasoning tasks. All tokens participate in shared multimodal self-attention in each transformer block.</i>
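To make this layout concrete, the toy PyTorch block below sketches the pattern under stated assumptions: one self-attention shared by all tokens, followed by feed-forward experts selected per token by a modality mask. Class, method, and tensor names are invented for illustration; this is a conceptual sketch, not the released implementation.

```python
# Conceptual sketch only, NOT the released G2VLM architecture: all tokens
# attend to each other in one shared self-attention, then each token is
# routed to a modality-specific feed-forward "expert" (geometry vs. semantic).
import torch
import torch.nn as nn


class SharedAttentionExpertBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One FFN per perception expert; the names are illustrative only.
        self.geometry_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.semantic_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, tokens: torch.Tensor, is_geometry: torch.Tensor) -> torch.Tensor:
        # Shared multimodal self-attention over ALL tokens (geometry + semantic).
        h = self.norm1(tokens)
        attn_out, _ = self.attn(h, h, h)
        tokens = tokens + attn_out
        # Route each token to its expert FFN via a boolean modality mask.
        h = self.norm2(tokens)
        ffn_out = torch.where(
            is_geometry.unsqueeze(-1),
            self.geometry_ffn(h),
            self.semantic_ffn(h),
        )
        return tokens + ffn_out


# Toy usage: 16 geometry tokens followed by 32 semantic (vision-language) tokens.
block = SharedAttentionExpertBlock()
tokens = torch.randn(1, 48, 256)
mask = torch.tensor([[True] * 16 + [False] * 32])
print(block(tokens, mask).shape)  # torch.Size([1, 48, 256])
```

The mask-based routing above evaluates both expert branches for every token purely for readability; an efficient implementation would instead index the geometry and semantic tokens separately before applying each expert.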
<p align="left"><img src="https://huggingface.co/InternRobotics/G2VLM-2B-MoT/resolve/main/assets/method.png" width="100%"></p>

## License
G2VLM is licensed under the Apache 2.0 license.
## ✍️ Citation
```bibtex
@article{hu2025g2vlmgeometrygroundedvision,
  title={G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning},
  author={Wenbo Hu and Jingli Lin and Yilin Long and Yunlong Ran and Lihan Jiang and Yifan Wang and Chenming Zhu and Runsen Xu and Tai Wang and Jiangmiao Pang},
  year={2025},
  eprint={2511.21688},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.21688},
}
```