nielsr HF Staff committed on
Commit 88bcd09 · verified · 1 Parent(s): 545bb9e

Add model card for VGT


This PR adds a comprehensive model card for the VGT (Visual Generation Tuning) model, based on the official paper and GitHub repository.

The updates include:
- A clear description of the model and its capabilities.
- Links to the paper ([Visual Generation Tuning](https://huggingface.co/papers/2511.23469)) and the official GitHub repository ([https://github.com/hustvl/VGT](https://github.com/hustvl/VGT)).
- The `pipeline_tag: text-to-image` metadata for improved discoverability on the Hugging Face Hub, reflecting the model's core functionality.
- Installation instructions and inference code snippets reproduced verbatim from the GitHub README, so users can get started quickly and no code is invented.
- Integration of key highlights, a "What is VGT?" summary, a table of pretrained models, and the citation information.
- Visual examples from the GitHub repository.

Please review and merge this PR.

Files changed (1)
  1. README.md +125 -3
README.md CHANGED
@@ -1,3 +1,125 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ pipeline_tag: text-to-image
+ ---
+
+ <div align="center">
+ <img src="https://github.com/hustvl/VGT/raw/main/asserts/vgt_logo.png" alt="VGT" width="200">
+ <h2>🚀 VGT: Visual Generation Tuning</h2>
+ **_Unleashing Visual Generation Capabilities from Any Pretrained VLM_**
+ </div>
+
+ This repository hosts models from the paper [Visual Generation Tuning](https://huggingface.co/papers/2511.23469).
+
+ **VGT (Visual Generation Tuning)** is a paradigm designed to unlock the latent visual generation capabilities of any pretrained Vision-Language Model (VLM). It significantly reduces alignment costs and accelerates the convergence of autoregressive modeling in continuous space, enabling efficient, high-quality image generation from text descriptions.
+
+ * **Paper**: [Visual Generation Tuning](https://huggingface.co/papers/2511.23469)
+ * **Code**: [https://github.com/hustvl/VGT](https://github.com/hustvl/VGT)
+
+ <div align="center">
+ <img src="https://github.com/hustvl/VGT/raw/main/asserts/case_show.png" alt="VGT Generated Images">
+ </div>
+
+ ---
+
+ ## ✨ Highlights
+
+ - **🎯 Novel Paradigm**: Transform ANY pretrained Vision-Language Model into a powerful image generator through efficient visual generation tuning
+ - **⚡ 20× Speedup**: Achieve dramatically faster convergence compared to vanilla VAE-based autoregressive models
+ - **📊 SOTA Performance**: GenEval **0.83** and DPG-Bench **81.28** with minimal training data
+ - **🚀 Extreme Data Efficiency**: Reach GenEval 0.55 in just 10K iterations and 0.60 in 30K iterations
+ - **🔄 Parallel Inference**: QueryAR mechanism enables 16× parallel decoding while maintaining high-quality generation
+ - **🎨 Superior Reconstruction**: 26.67 PSNR and 0.50 rFID at a 28× compression ratio, outperforming specialized VAEs
+
+ ---
+
+ ## 💡 What is VGT?
+
+ **VGT (Visual Generation Tuning)** is a groundbreaking paradigm that answers a fundamental question:
+
+ *Can we directly leverage the well-aligned semantic representations in pretrained VLMs to enable visual generation capabilities?*
+
+ VGT bridges this gap through two key innovations:
+
+ **1. VGT-AE (Visual Generation Tuning - AutoEncoder)**
+ - Aligns semantic encoders from pretrained VLMs with latent representations of pixel decoders
+ - Achieves **26.67 PSNR** and **0.50 rFID** at **28× compression**, outperforming specialized VAEs
+
+ **2. VGT-AR (Visual Generation Tuning - AutoRegressive)**
+ - Position-query mechanism for autoregressive formulation with partial parallel decoding
+ - Dramatically accelerates convergence (**20× speedup**) compared to vanilla VAE-based models
+
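+ To make the first stage concrete, below is a minimal, purely illustrative PyTorch sketch of the VGT-AE idea: a VLM vision encoder produces semantic tokens, a small trainable head projects them into a compact continuous latent, and a toy pixel decoder learns to reconstruct image patches from that latent. All class names and dimensions are assumptions for illustration only (the encoder is kept frozen here purely for simplicity); the actual architecture, losses, and the QueryAR decoding used by VGT-AR live in the paper and the official repository.
+
+ ```python
+ # Illustrative sketch only - NOT the official VGT implementation.
+ import torch
+ import torch.nn as nn
+
+ class DummyVLMEncoder(nn.Module):
+     """Stand-in for the pretrained VLM's vision tower."""
+     def __init__(self, patch=16, dim=1024):
+         super().__init__()
+         self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
+     def forward(self, x):                                   # x: (B, 3, H, W)
+         return self.proj(x).flatten(2).transpose(1, 2)      # (B, N, dim) semantic tokens
+
+ class ToyVGTAE(nn.Module):
+     def __init__(self, encoder, sem_dim=1024, latent_dim=32, patch=16):
+         super().__init__()
+         self.encoder = encoder
+         for p in self.encoder.parameters():                 # frozen in this toy sketch
+             p.requires_grad_(False)
+         self.to_latent = nn.Linear(sem_dim, latent_dim)     # trainable alignment head
+         self.decoder = nn.Linear(latent_dim, 3 * patch * patch)  # toy pixel decoder
+
+     def forward(self, images):
+         sem = self.encoder(images)      # semantic tokens from the VLM encoder
+         z = self.to_latent(sem)         # compact continuous latents (what VGT-AR would model)
+         patches = self.decoder(z)       # reconstructed 16x16 RGB patches
+         return z, patches
+
+ # Smoke test: reconstruct 16x16 patches of a 256x256 image.
+ model = ToyVGTAE(DummyVLMEncoder())
+ imgs = torch.randn(2, 3, 256, 256)
+ z, patches = model(imgs)
+ target = nn.functional.unfold(imgs, kernel_size=16, stride=16).transpose(1, 2)
+ loss = nn.functional.mse_loss(patches, target)              # reconstruction objective
+ print(z.shape, patches.shape, loss.item())
+ ```
+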
+ ---
+
+ ## 🚀 Getting Started
+
+ ### Installation
+
+ ```bash
+ # Clone the repository
+ git clone https://github.com/hustvl/VGT.git
+ cd VGT
+
+ # Install dependencies
+ conda create -n vgt python=3.10
+ conda activate vgt
+ pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
+ pip install mmengine xtuner tqdm timm
+ pip install diffusers transformers==4.57.1
+ pip install flash-attn --no-build-isolation
+ ```
+
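+ A quick sanity check (not part of the official README) can confirm that the pinned packages are importable and that PyTorch sees a CUDA device before you download any checkpoints:
+
+ ```python
+ # Environment check - illustrative only, not from the VGT repository.
+ import torch
+ import transformers
+ import diffusers
+
+ print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
+ print("transformers:", transformers.__version__)   # the install above pins 4.57.1
+ print("diffusers:", diffusers.__version__)
+
+ try:
+     import flash_attn
+     print("flash-attn:", flash_attn.__version__)
+ except ImportError:
+     print("flash-attn missing - re-run: pip install flash-attn --no-build-isolation")
+ ```
+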
+ ### Pretrained Models
+
+ We provide VGT-tuned models based on Qwen2.5-VL and InternVL3 (448px):
+
+ | Model | Base Model | GenEval | DPG-Bench | Download |
+ |:------|:-----------|:-------:|:---------:|:--------:|
+ | VGT-InternVL3-1.6B-Pretrain | InternVL3-1.6B | 0.58 | 73.05 | [🤗 HuggingFace](https://huggingface.co/hustvl/vgt_internvl3_1_6B_pretrain) |
+ | VGT-InternVL3-1.6B-SFT | InternVL3-1.6B | 0.83 | 76.33 | [🤗 HuggingFace](https://huggingface.co/hustvl/vgt_internvl3_1_6B_sft) |
+ | VGT-Qwen2.5-VL-2B-Pretrain | Qwen2.5-VL-2B | 0.63 | 78.02 | [🤗 HuggingFace](https://huggingface.co/hustvl/vgt_qwen25vl_2B_pretrain) |
+ | VGT-Qwen2.5-VL-2B-SFT | Qwen2.5-VL-2B | 0.83 | 81.28 | [🤗 HuggingFace](https://huggingface.co/hustvl/vgt_qwen25vl_2B_sft) |
+
+ ### Inference
+
+ Download the SFT model checkpoints:
+
+ ```bash
+ cd VGT
+ mkdir ckpts
+ hf download hustvl/vgt_qwen25vl_2B_sft --repo-type model --local-dir ckpts/hustvl/vgt_qwen25vl_2B_sft
+ hf download hustvl/vgt_internvl3_1_6B_sft --repo-type model --local-dir ckpts/hustvl/vgt_internvl3_1_6B_sft
+ ```
+
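+ If you prefer the Python API over the `hf` CLI, the same checkpoints can be fetched with `huggingface_hub` (a small sketch that mirrors the `--local-dir` layout used above):
+
+ ```python
+ # Download the SFT checkpoints via huggingface_hub instead of the hf CLI.
+ from huggingface_hub import snapshot_download
+
+ for repo_id in ("hustvl/vgt_qwen25vl_2B_sft", "hustvl/vgt_internvl3_1_6B_sft"):
+     local_dir = f"ckpts/{repo_id}"   # same layout as the CLI commands above
+     snapshot_download(repo_id=repo_id, repo_type="model", local_dir=local_dir)
+     print("downloaded", repo_id, "->", local_dir)
+ ```
+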
+ Generate images from text prompts:
+
+ ```bash
+ export PYTHONPATH=./:$PYTHONPATH
+
+ # Generate with the InternVL3-based model
+ python scripts/sample_text_list_vgt_intervl3_0.6B.py
+ ```
+
+ > Note: We found that, under the same training method, VGT-Qwen2.5-VL-2B performs better at face generation, while VGT-InternVL3-1.6B performs better at landscapes, light and shadow, and animals. Feel free to explore on your own.
+
+ ---
+
+ ## 📝 Citation
+
+ If you find our work useful, please cite our paper:
+
+ ```bibtex
+ @misc{guo2025vgt,
+   title={Visual Generation Tuning},
+   author={Jiahao Guo and Sinan Du and Jingfeng Yao and Wenyu Liu and Bo Li and Haoxiang Cao and Kun Gai and Chun Yuan and Kai Wu and Xinggang Wang},
+   year={2025},
+   eprint={2511.23469},
+   archivePrefix={arXiv},
+ }
+ ```
+
+ ---
+
+ ## 📄 License
+
+ This project is released under the MIT License. See [LICENSE](LICENSE) for details.