Commit e7809ba
Parent(s): a8916ec

init

Files changed:
- .gitattributes +11 -0
- README.md +71 -1
- assets/icon.png +3 -0
- assets/method.png +3 -0
- assets/teaser.png +3 -0
- chat_template.json +3 -0
- config.json +3 -0
- dino_config.json +3 -0
- generation_config.json +3 -0
- preprocessor_config.json +3 -0
- text_config.json +3 -0
- tokenizer.json +3 -0
- tokenizer_config.json +3 -0
- vit_config.json +3 -0
- vocab.json +3 -0
.gitattributes CHANGED

@@ -33,3 +33,14 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+generation_config.json filter=lfs diff=lfs merge=lfs -text
+preprocessor_config.json filter=lfs diff=lfs merge=lfs -text
+text_config.json filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
+vit_config.json filter=lfs diff=lfs merge=lfs -text
+chat_template.json filter=lfs diff=lfs merge=lfs -text
+config.json filter=lfs diff=lfs merge=lfs -text
+dino_config.json filter=lfs diff=lfs merge=lfs -text
+tokenizer_config.json filter=lfs diff=lfs merge=lfs -text
+vocab.json filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
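With the new `.gitattributes` rules above, the JSON configs, tokenizer files, and PNG assets added in this commit are stored through Git LFS, so a plain `git clone` without LFS installed only yields three-line pointer stubs. As a minimal sketch (assuming the standard `huggingface_hub` client, which resolves LFS pointers to the real file content), a single file from this repo can be fetched like this:

```python
# Sketch: fetch one LFS-tracked file from this repo via huggingface_hub.
# hf_hub_download returns a local path to the resolved file content,
# not the 3-line Git LFS pointer stub stored in the git tree.
from huggingface_hub import hf_hub_download

config_path = hf_hub_download(
    repo_id="InternRobotics/G2VLM-2B-MoT",  # repo referenced in the README below
    filename="config.json",                  # one of the newly LFS-tracked files
)
print(config_path)
```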
README.md CHANGED

@@ -1,3 +1,73 @@
 ---
-license:
+license: apache-2.0
+language:
+- en
+pipeline_tag: image-text-to-text
+tags:
+- multimodal
+library_name: transformers
+base_model:
+- Qwen/Qwen2-VL-2B
 ---
+
+# G2VLM-2B-MoT
+## Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
+
+<p align="left">
+  <img src="https://huggingface.co/InternRobotics/G2VLM-2B-MoT/resolve/main/assets/icon.png" alt="G2VLM" width="200"/>
+</p>
+
+
+<p align="left">
+  <a href="https://gordonhu608.github.io/g2vlm.github.io/">
+    <img
+      src="https://img.shields.io/badge/G2VLM-Website-0A66C2?logo=safari&logoColor=white" style="display: inline-block; vertical-align: middle;"
+      alt="G2VLM Website"
+    />
+  </a>
+  <a href="https://arxiv.org/abs/2511.21688">
+    <img
+      src="https://img.shields.io/badge/G2VLM-Paper-red?logo=arxiv&logoColor=red" style="display: inline-block; vertical-align: middle;"
+      alt="G2VLM Paper on arXiv"
+    />
+  </a>
+  <a href="https://github.com/InternRobotics/G2VLM" target="_blank" style="margin: 2px;">
+    <img
+      src="https://img.shields.io/badge/G2VLM-Codebase-536af5?color=536af5&logo=github" style="display: inline-block; vertical-align: middle;"
+      alt="G2VLM Codebase"
+    />
+  </a>
+</p>
+
+
+> We present <b>G<sup>2</sup>VLM</b>, a geometry-grounded vision-language model proficient in both spatial 3D reconstruction and spatial understanding tasks. For spatial reasoning questions, G<sup>2</sup>VLM can natively predict 3D geometry and employ interleaved reasoning to reach an answer.
+
+
+This repository hosts the model weights for <b>G<sup>2</sup>VLM</b>. For installation, usage instructions, and further documentation, please visit our [GitHub repository](https://github.com/InternRobotics/G2VLM).
+
+
+<p align="left"><img src="https://huggingface.co/InternRobotics/G2VLM-2B-MoT/resolve/main/assets/teaser.png" width="100%"></p>
+
+
+
+## 🧠 Method
+<i>G<sup>2</sup>VLM is a unified model that integrates a geometric perception expert for 3D reconstruction and a semantic perception expert for multimodal understanding and spatial reasoning tasks. All tokens share multi-modal self-attention in each transformer block.</i>
+
+<p align="left"><img src="https://huggingface.co/InternRobotics/G2VLM-2B-MoT/resolve/main/assets/method.png" width="100%"></p>
+
+
+## License
+G2VLM is licensed under the Apache 2.0 license.
+
+## ✍️ Citation
+```bibtex
+@article{hu2025g2vlmgeometrygroundedvision,
+  title={G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning},
+  author={Wenbo Hu and Jingli Lin and Yilin Long and Yunlong Ran and Lihan Jiang and Yifan Wang and Chenming Zhu and Runsen Xu and Tai Wang and Jiangmiao Pang},
+  year={2025},
+  eprint={2511.21688},
+  archivePrefix={arXiv},
+  primaryClass={cs.CV},
+  url={https://arxiv.org/abs/2511.21688},
+}
+```
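The card metadata above declares `library_name: transformers` and `pipeline_tag: image-text-to-text`, but this commit ships only weights and configs and defers installation and usage to the linked GitHub repository. Purely as a speculative sketch, assuming the project exposes a standard auto-class entry point with remote code (not confirmed by this commit), loading might look roughly like:

```python
# Speculative sketch only: this commit contains no modeling code, and the README
# points to https://github.com/InternRobotics/G2VLM for the supported workflow.
# The auto-class pattern below is an assumption based on library_name: transformers.
from transformers import AutoModel, AutoProcessor

repo_id = "InternRobotics/G2VLM-2B-MoT"
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
```

For the authors' supported usage, follow the instructions in the GitHub repository referenced in the card.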
assets/icon.png ADDED (Git LFS)

assets/method.png ADDED (Git LFS)

assets/teaser.png ADDED (Git LFS)
chat_template.json ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ad60d90252ed0b0705ba14e2d0ad0fec0beac1ea955642b54059b36052d8bc96
+size 1050

config.json ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:422adefa19e62dd175961cec85bc0400344fe5bf9b22bd1182e05aaae78556e0
+size 1196

dino_config.json ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:03eee42f646659a9480f8911a81fdd81efeedd7ff39083c8e36398068daf72f5
+size 1003

generation_config.json ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d2864bf1edea5863d331edfff48106b586a366f5a2c41aa77731fadc53aa25d2
+size 272

preprocessor_config.json ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b5eaad0c2815f07631535dcc58f3c462b0d73693638ad21d19f3c50820eae1cc
+size 347

text_config.json ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:422adefa19e62dd175961cec85bc0400344fe5bf9b22bd1182e05aaae78556e0
+size 1196

tokenizer.json ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:cb63a0a23eef3d5b01063a9880a1925a65aaf4d1591d519910ee3527852950a0
+size 7029741

tokenizer_config.json ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ff5c4fd898fe8c39591eb70e5d39d2782802d4204d6ae9ba1223252f354842a0
+size 4190

vit_config.json ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e376158b1b95be08e1aab39196db5103a9b7961b8a7afe9682b066cd744c6964
+size 218

vocab.json ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ca10d7e9fb3ed18575dd1e277a2579c16d108e32f27439684afa0e10b1440910
+size 2776833
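Each of the files above was committed as a Git LFS pointer with the three fields shown in the diffs: a spec `version`, a SHA-256 `oid`, and a byte `size`. A minimal sketch for parsing such a pointer and checking a downloaded blob against it (the helper names are illustrative, not part of any repository tooling):

```python
# Sketch: parse a Git LFS pointer stub (version / oid sha256:<hex> / size <bytes>)
# and verify that a locally downloaded blob matches its hash and size.
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Turn a three-line LFS pointer into {'version', 'oid', 'size'}."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    return {
        "version": fields["version"],
        "oid": fields["oid"].removeprefix("sha256:"),
        "size": int(fields["size"]),
    }


def matches_pointer(blob_path: Path, pointer: dict) -> bool:
    """True if the file's size and SHA-256 digest match the pointer."""
    data = blob_path.read_bytes()
    return (len(data) == pointer["size"]
            and hashlib.sha256(data).hexdigest() == pointer["oid"])


# Example with the chat_template.json pointer from this commit:
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:ad60d90252ed0b0705ba14e2d0ad0fec0beac1ea955642b54059b36052d8bc96\n"
    "size 1050\n"
)
# matches_pointer(Path("chat_template.json"), pointer) should be True
# for the resolved (non-pointer) file.
```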