gordonhubackup committed on
Commit
e7809ba
·
1 Parent(s): a8916ec
.gitattributes CHANGED
@@ -33,3 +33,14 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ generation_config.json filter=lfs diff=lfs merge=lfs -text
37
+ preprocessor_config.json filter=lfs diff=lfs merge=lfs -text
38
+ text_config.json filter=lfs diff=lfs merge=lfs -text
39
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
40
+ vit_config.json filter=lfs diff=lfs merge=lfs -text
41
+ chat_template.json filter=lfs diff=lfs merge=lfs -text
42
+ config.json filter=lfs diff=lfs merge=lfs -text
43
+ dino_config.json filter=lfs diff=lfs merge=lfs -text
44
+ tokenizer_config.json filter=lfs diff=lfs merge=lfs -text
45
+ vocab.json filter=lfs diff=lfs merge=lfs -text
46
+ *.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,73 @@
1
  ---
2
- license: bsd-3-clause
3
  ---
1
  ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ pipeline_tag: image-text-to-text
6
+ tags:
7
+ - multimodal
8
+ library_name: transformers
9
+ base_model:
10
+ - Qwen/Qwen2-VL-2B
11
  ---
12
+
13
+ # G2VLM-2B-MoT
14
+ ## Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
15
+
16
+ <p align="left">
17
+ <img src="https://huggingface.co/InternRobotics/G2VLM-2B-MoT/resolve/main/assets/icon.png" alt="G2VLM" width="200"/>
18
+ </p>
19
+
20
+
21
+ <p align="left">
22
+ <a href="https://gordonhu608.github.io/g2vlm.github.io/">
23
+ <img
24
+ src="https://img.shields.io/badge/G2VLM-Website-0A66C2?logo=safari&logoColor=white" style="display: inline-block; vertical-align: middle;"
25
+ alt="G2VLM Website"
26
+ />
27
+ </a>
28
+ <a href="https://arxiv.org/abs/2511.21688">
29
+ <img
30
+ src="https://img.shields.io/badge/G2VLM-Paper-red?logo=arxiv&logoColor=red" style="display: inline-block; vertical-align: middle;"
31
+ alt="G2VLM Paper on arXiv"
32
+ />
33
+ </a>
34
+ <a href="https://github.com/InternRobotics/G2VLM" target="_blank" style="margin: 2px;">
35
+ <img
36
+ src="https://img.shields.io/badge/G2VLM-Codebase-536af5?color=536af5&logo=github" style="display: inline-block; vertical-align: middle;"
37
+ alt="G2VLM Codebase"
38
+ />
39
+ </a>
40
+ </p>
41
+
42
+
43
+ > We present <b>G<sup>2</sup>VLM</b>, a geometry grounded vision-language model proficient in both spatial 3D reconstruction and spatial understanding tasks. For spatial reasoning questions, G<sup>2</sup>VLM can natively predict 3D geometry and employ interleaved reasoning for an answer.
44
+
45
+
46
+ This repository hosts the model weights for <b>G<sup>2</sup>VLM</b>. For installation, usage instructions, and further documentation, please visit our [GitHub repository](https://github.com/InternRobotics/G2VLM).
47
+
48
+
49
+ <p align="left"><img src="https://huggingface.co/InternRobotics/G2VLM-2B-MoT/resolve/main/assets/teaser.png" width="100%"></p>
50
+
51
+
52
+
53
+ ## 🧠 Method
54
+ <i>G<sup>2</sup>VLM is a unified model that integrates a geometric perception expert for 3D reconstruction and a semantic perception expert for multimodal understanding and spatial reasoning. All tokens share multimodal self-attention in each transformer block.</i>
55
+
56
+ <p align="left"><img src="https://huggingface.co/InternRobotics/G2VLM-2B-MoT/resolve/main/assets/method.png" width="100%"></p>
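One way to read the Method description is that each token is projected by the weights of its own expert, while attention itself runs over the full token sequence. The sketch below illustrates that reading only. It is not taken from the G2VLM codebase, and the per-expert query/key/value/output projections, the residual connection, and every name in it (`make_expert`, `shared_attention_block`, the `"geometric"` and `"semantic"` labels) are assumptions made for illustration.

```python
import math
import random


def matvec(weight, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(rng, dim):
    """Hypothetical expert: its own query/key/value/output projections."""
    def rand_matrix():
        return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
    return {name: rand_matrix() for name in ("q", "k", "v", "out")}


def shared_attention_block(tokens, expert_ids, experts):
    """Single-head attention in which projections are per expert
    but every token attends over all tokens (assumed design)."""
    dim = len(tokens[0])
    # Step 1: project each token with the weights of its own expert.
    q = [matvec(experts[e]["q"], t) for t, e in zip(tokens, expert_ids)]
    k = [matvec(experts[e]["k"], t) for t, e in zip(tokens, expert_ids)]
    v = [matvec(experts[e]["v"], t) for t, e in zip(tokens, expert_ids)]
    outputs = []
    for i, (token, expert_id) in enumerate(zip(tokens, expert_ids)):
        # Step 2: shared attention, scores span tokens of both experts.
        scores = [sum(a * b for a, b in zip(q[i], key)) / math.sqrt(dim) for key in k]
        weights = softmax(scores)
        mixed = [sum(w * value[d] for w, value in zip(weights, v)) for d in range(dim)]
        # Step 3: expert-specific output projection plus a residual connection.
        projected = matvec(experts[expert_id]["out"], mixed)
        outputs.append([a + b for a, b in zip(token, projected)])
    return outputs


rng = random.Random(0)
experts = {"geometric": make_expert(rng, 4), "semantic": make_expert(rng, 4)}
tokens = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(3)]
expert_ids = ["geometric", "geometric", "semantic"]
outputs = shared_attention_block(tokens, expert_ids, experts)
print(len(outputs), len(outputs[0]))
```

Because the attention scores span the whole sequence, changing a token handled by the geometric expert also changes the output of the token handled by the semantic expert, which is the property the description attributes to shared self-attention.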
57
+
58
+
59
+ ## License
60
+ G2VLM is licensed under the Apache 2.0 license.
61
+
62
+ ## ✍️ Citation
63
+ ```bibtex
64
+ @article{hu2025g2vlmgeometrygroundedvision,
65
+ title={G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning},
66
+ author={Wenbo Hu and Jingli Lin and Yilin Long and Yunlong Ran and Lihan Jiang and Yifan Wang and Chenming Zhu and Runsen Xu and Tai Wang and Jiangmiao Pang},
67
+ year={2025},
68
+ eprint={2511.21688},
69
+ archivePrefix={arXiv},
70
+ primaryClass={cs.CV},
71
+ url={https://arxiv.org/abs/2511.21688},
72
+ }
73
+ ```
assets/icon.png ADDED

Git LFS Details

  • SHA256: 99a01bb4656afbc75a4aaa214a938638a7c282e76d627f7a1c9595bb0cc48e74
  • Pointer size: 131 Bytes
  • Size of remote file: 732 kB
assets/method.png ADDED

Git LFS Details

  • SHA256: 4fc1e81b11de6fcf7a93ed85c148b62e91262485130935b352628f2ead1a45f0
  • Pointer size: 131 Bytes
  • Size of remote file: 598 kB
assets/teaser.png ADDED

Git LFS Details

  • SHA256: 08dec49bd34395370157a949235d685536cb4bc8d5717f4a751ee62d877c7727
  • Pointer size: 132 Bytes
  • Size of remote file: 1.48 MB
chat_template.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ad60d90252ed0b0705ba14e2d0ad0fec0beac1ea955642b54059b36052d8bc96
3
+ size 1050
config.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:422adefa19e62dd175961cec85bc0400344fe5bf9b22bd1182e05aaae78556e0
3
+ size 1196
dino_config.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:03eee42f646659a9480f8911a81fdd81efeedd7ff39083c8e36398068daf72f5
3
+ size 1003
generation_config.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d2864bf1edea5863d331edfff48106b586a366f5a2c41aa77731fadc53aa25d2
3
+ size 272
preprocessor_config.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b5eaad0c2815f07631535dcc58f3c462b0d73693638ad21d19f3c50820eae1cc
3
+ size 347
text_config.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:422adefa19e62dd175961cec85bc0400344fe5bf9b22bd1182e05aaae78556e0
3
+ size 1196
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cb63a0a23eef3d5b01063a9880a1925a65aaf4d1591d519910ee3527852950a0
3
+ size 7029741
tokenizer_config.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ff5c4fd898fe8c39591eb70e5d39d2782802d4204d6ae9ba1223252f354842a0
3
+ size 4190
vit_config.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e376158b1b95be08e1aab39196db5103a9b7961b8a7afe9682b066cd744c6964
3
+ size 218
vocab.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ca10d7e9fb3ed18575dd1e277a2579c16d108e32f27439684afa0e10b1440910
3
+ size 2776833
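
Each JSON file added in this commit is stored as a Git LFS pointer: three `key value` lines giving the spec version, the object ID (hash algorithm and digest), and the size of the real file in bytes. A minimal sketch of reading such a pointer follows, using the `vit_config.json` values from this diff as sample input. The `LfsPointer` class and `parse_lfs_pointer` function are hypothetical helpers written for this example, not part of Git LFS or any Hugging Face library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LfsPointer:
    version: str    # URL of the pointer spec
    hash_algo: str  # e.g. "sha256"
    digest: str     # hex digest of the real file
    size: int       # size of the real file in bytes


def parse_lfs_pointer(text: str) -> LfsPointer:
    """Parse the `key value` lines of a Git LFS pointer file."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first space only; the value is kept verbatim.
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return LfsPointer(fields["version"], hash_algo, digest, int(fields["size"]))


# Pointer content of vit_config.json as shown in this commit.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:e376158b1b95be08e1aab39196db5103a9b7961b8a7afe9682b066cd744c6964\n"
    "size 218\n"
)
print(pointer.hash_algo, pointer.size)
```

A parser like this can be used to check that a checkout contains pointers rather than the resolved files, or to compare the recorded digest against a downloaded file.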