---
license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
library_name: transformers
base_model:
- Qwen/Qwen2-VL-2B
---

# G2VLM-2B-MoT

## Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

<p align="left">
<img src="https://huggingface.co/InternRobotics/G2VLM-2B-MoT/resolve/main/assets/icon.png" alt="G2VLM" width="200"/>
</p>
<p align="left">
<a href="https://gordonhu608.github.io/g2vlm.github.io/">
<img
src="https://img.shields.io/badge/G2VLM-Website-0A66C2?logo=safari&logoColor=white" style="display: inline-block; vertical-align: middle;"
alt="G2VLM Website"
/>
</a>
<a href="https://arxiv.org/abs/2511.21688">
<img
src="https://img.shields.io/badge/G2VLM-Paper-red?logo=arxiv&logoColor=red" style="display: inline-block; vertical-align: middle;"
alt="G2VLM Paper on arXiv"
/>
</a>
<a href="https://github.com/InternRobotics/G2VLM" target="_blank" style="margin: 2px;">
<img
src="https://img.shields.io/badge/G2VLM-Codebase-536af5?color=536af5&logo=github" style="display: inline-block; vertical-align: middle;"
alt="G2VLM Codebase"
/>
</a>
</p>

> We present <b>G<sup>2</sup>VLM</b>, a geometry-grounded vision-language model proficient in both 3D reconstruction and spatial understanding tasks. For spatial reasoning questions, G<sup>2</sup>VLM natively predicts 3D geometry and employs interleaved reasoning to arrive at an answer.

This repository hosts the model weights for <b>G<sup>2</sup>VLM</b>. For installation, usage instructions, and further documentation, please visit our [GitHub repository](https://github.com/InternRobotics/G2VLM).
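As a quick, unofficial orientation, the snippet below sketches how the weights could be loaded through the 🤗 Transformers Auto classes declared in this card's metadata (`library_name: transformers`, `pipeline_tag: image-text-to-text`); the model class, processor behavior, and prompt format shown here are assumptions, so defer to the GitHub instructions wherever they differ.

```python
# Minimal, unofficial loading sketch. Assumes the checkpoint exposes the
# standard transformers image-text-to-text interface; see the GitHub repo
# for the authoritative usage instructions.
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "InternRobotics/G2VLM-2B-MoT"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to(device)

# "scene.jpg" is a placeholder path; any RGB image of a scene will do.
image = Image.open("scene.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Which object is closer to the camera, the chair or the lamp?"},
        ],
    }
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```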
<p align="left"><img src="https://huggingface.co/InternRobotics/G2VLM-2B-MoT/resolve/main/assets/teaser.png" width="100%"></p>

## 🧠 Method
<i>G<sup>2</sup>VLM is a unified model that integrates both a geometric perception expert for 3D reconstruction and a semantic perception expert for multimodal understanding and spatial reasoning tasks. All tokens participate in shared multimodal self-attention in each transformer block.</i>
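To make this layout concrete, the toy PyTorch block below sketches the pattern under stated assumptions: one self-attention shared by all tokens, followed by feed-forward experts selected per token by a modality mask. Class, method, and tensor names are invented for illustration; this is a conceptual sketch, not the released implementation.

```python
# Conceptual sketch only, NOT the released G2VLM architecture: all tokens
# attend to each other in one shared self-attention, then each token is
# routed to a modality-specific feed-forward "expert" (geometry vs. semantic).
import torch
import torch.nn as nn


class SharedAttentionExpertBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One FFN per perception expert; the names are illustrative only.
        self.geometry_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.semantic_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, tokens: torch.Tensor, is_geometry: torch.Tensor) -> torch.Tensor:
        # Shared multimodal self-attention over ALL tokens (geometry + semantic).
        h = self.norm1(tokens)
        attn_out, _ = self.attn(h, h, h)
        tokens = tokens + attn_out
        # Route each token to its expert FFN via a boolean modality mask.
        h = self.norm2(tokens)
        ffn_out = torch.where(
            is_geometry.unsqueeze(-1),
            self.geometry_ffn(h),
            self.semantic_ffn(h),
        )
        return tokens + ffn_out


# Toy usage: 16 geometry tokens followed by 32 semantic (vision-language) tokens.
block = SharedAttentionExpertBlock()
tokens = torch.randn(1, 48, 256)
mask = torch.tensor([[True] * 16 + [False] * 32])
print(block(tokens, mask).shape)  # torch.Size([1, 48, 256])
```

The mask-based routing above evaluates both expert branches for every token purely for readability; an efficient implementation would instead index the geometry and semantic tokens separately before applying each expert.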
<p align="left"><img src="https://huggingface.co/InternRobotics/G2VLM-2B-MoT/resolve/main/assets/method.png" width="100%"></p>

## License
G2VLM is licensed under the Apache 2.0 license.
## ✍️ Citation
```bibtex
@article{hu2025g2vlmgeometrygroundedvision,
  title={G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning},
  author={Wenbo Hu and Jingli Lin and Yilin Long and Yunlong Ran and Lihan Jiang and Yifan Wang and Chenming Zhu and Runsen Xu and Tai Wang and Jiangmiao Pang},
  year={2025},
  eprint={2511.21688},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.21688},
}
```