nielsr (HF Staff) committed
Commit 63bdabf · verified · 1 parent: 568e816

Enhance model card for VGT: Add metadata, paper, code, and usage


This PR significantly enhances the model card for 'Visual Generation Tuning (VGT)' by:
- Adding the `pipeline_tag: text-to-image` to improve discoverability on the Hugging Face Hub.
- Including a detailed overview of the model and its key highlights.
- Providing a direct link to the official paper: [Visual Generation Tuning](https://huggingface.co/papers/2511.23469).
- Linking to the GitHub repository: https://github.com/hustvl/VGT.
- Adding comprehensive installation instructions and a sample inference code snippet directly from the GitHub README to guide users on how to use the model.
- Ensuring the academic citation information is present.

This update aims to make the model card more informative, user-friendly, and discoverable.

Files changed (1): README.md (+105 −3)

README.md CHANGED
@@ -1,3 +1,105 @@
- ---
- license: mit
- ---

The updated README.md:
---
license: mit
pipeline_tag: text-to-image
---

# VGT: Visual Generation Tuning

**_Unleashing Visual Generation Capabilities from Any Pretrained VLM_**

VGT (Visual Generation Tuning) is a novel paradigm for unlocking the latent visual generation capabilities of any pretrained Vision-Language Model (VLM). By performing efficient visual generation tuning on a well-pretrained VLM, VGT sharply reduces alignment costs and accelerates the convergence of autoregressive modeling in continuous space by roughly 20×. As a result, any pretrained VLM can be turned into a powerful image generator, reaching state-of-the-art results in visual generation with dramatically faster convergence and extreme data efficiency.

**GenEval 0.83 | DPG-Bench 81.28 | 20× Faster Convergence**

<div align="center">
<img src="https://github.com/hustvl/VGT/raw/main/asserts/case_show.png" alt="VGT Generated Images">
</div>

## ✨ Highlights

- **🎯 Novel Paradigm**: Transform ANY pretrained Vision-Language Model into a powerful image generator through efficient visual generation tuning.
- **⚡ 20× Speedup**: Achieve dramatically faster convergence compared to vanilla VAE-based autoregressive models.
- **📊 SOTA Performance**: GenEval **0.83** and DPG-Bench **81.28** with minimal training data.
- **🚀 Extreme Data Efficiency**: Reach GenEval 0.55 in just 10K iterations and 0.60 in 30K iterations.
- **🔄 Parallel Inference**: The QueryAR mechanism enables 16× parallel decoding while maintaining high-quality generation.
- **🎨 Superior Reconstruction**: 26.67 PSNR and 0.50 rFID at a 28× compression ratio, outperforming specialized VAEs.

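The reconstruction figure above uses the standard peak signal-to-noise ratio. As a reference for how that metric is computed (not code from the VGT repo), a minimal sketch in plain Python over flattened pixel sequences, assuming an 8-bit peak value:

```python
import math

def psnr(ref, recon, max_val=255.0):
    """Peak signal-to-noise ratio between a reference image and a reconstruction.

    ref and recon are flat sequences of pixel values; max_val is the peak
    possible pixel value (255 for 8-bit images). Higher is better.
    """
    mse = sum((r - x) ** 2 for r, x in zip(ref, recon)) / len(ref)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_val ** 2 / mse)
```

VGT's reported 26.67 PSNR is measured at a 28× compression ratio, where higher PSNR means the tokenizer loses less image detail.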
## Paper

The model was presented in the paper [**Visual Generation Tuning**](https://huggingface.co/papers/2511.23469).

## Code

The official implementation is available in the [GitHub repository](https://github.com/hustvl/VGT).

## 🚀 Getting Started

### Installation

Clone the repository and install the required dependencies:

```bash
# Clone the repository
git clone https://github.com/hustvl/VGT.git
cd VGT

# Create and activate the environment
conda create -n vgt python=3.10
conda activate vgt

# Install dependencies
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install mmengine xtuner tqdm timm
pip install diffusers transformers==4.57.1
pip install flash-attn --no-build-isolation
```
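After installation, a quick import check can confirm the environment resolved cleanly. This is a convenience sketch, not part of the VGT repo; the module names are taken from the pip commands above:

```python
import importlib

def missing_packages(names):
    """Return the subset of module names that fail to import."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

# Modules installed by the commands above (import names, not pip names).
required = ["torch", "torchvision", "mmengine", "timm", "diffusers", "transformers"]
```

In a correctly set-up environment, `missing_packages(required)` should return an empty list.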

### Pretrained Models

Pretrained VGT models based on Qwen2.5-VL and InternVL3 (448px) are available for download:

| Model | Base Model | GenEval | DPG-Bench | Download |
|:------|:-----------|:-------:|:---------:|:--------:|
| VGT-InternVL3-1.6B-Pretrain | InternVL3-1.6B | 0.58 | 73.05 | [🤗 HuggingFace](https://huggingface.co/hustvl/vgt_internvl3_1_6B_pretrain) |
| VGT-InternVL3-1.6B-SFT | InternVL3-1.6B | 0.83 | 76.33 | [🤗 HuggingFace](https://huggingface.co/hustvl/vgt_internvl3_1_6B_sft) |
| VGT-Qwen2.5-VL-2B-Pretrain | Qwen2.5-VL-2B | 0.63 | 78.02 | [🤗 HuggingFace](https://huggingface.co/hustvl/vgt_qwen25vl_2B_pretrain) |
| VGT-Qwen2.5-VL-2B-SFT | Qwen2.5-VL-2B | 0.83 | 81.28 | [🤗 HuggingFace](https://huggingface.co/hustvl/vgt_qwen25vl_2B_sft) |

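Each table row maps onto an `hf download` invocation of the form used in the inference instructions; for scripted setups, a small purely illustrative helper (not part of the repo) that builds those commands from the repo ids above:

```python
def download_cmd(repo_id, root="ckpts"):
    """Build the `hf` CLI command that fetches a checkpoint into root/<repo_id>."""
    return f"hf download {repo_id} --repo-type model --local-dir {root}/{repo_id}"

# Repo ids from the table above.
repo_ids = [
    "hustvl/vgt_internvl3_1_6B_pretrain",
    "hustvl/vgt_internvl3_1_6B_sft",
    "hustvl/vgt_qwen25vl_2B_pretrain",
    "hustvl/vgt_qwen25vl_2B_sft",
]
```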

### Inference (Sample Usage)

Download the SFT model checkpoints and generate images from text prompts:

```bash
# Ensure you are in the 'VGT' directory
cd VGT

# Create a directory for checkpoints
mkdir -p ckpts/hustvl

# Download the SFT model checkpoints
hf download hustvl/vgt_qwen25vl_2B_sft --repo-type model --local-dir ckpts/hustvl/vgt_qwen25vl_2B_sft
hf download hustvl/vgt_internvl3_1_6B_sft --repo-type model --local-dir ckpts/hustvl/vgt_internvl3_1_6B_sft

# Set the Python path and run the InternVL3 sampling script
export PYTHONPATH=./:$PYTHONPATH
python scripts/sample_text_list_vgt_intervl3_0.6B.py
```

*Note: VGT-Qwen2.5-VL-2B performs better on face generation, while VGT-InternVL3-1.6B excels at landscapes, light and shadow, and animals. You can explore these differences yourself.*

## 📝 Citation

If you find our work useful, please cite our paper:

```bibtex
@misc{guo2025vgt,
  title={Visual Generation Tuning},
  author={Jiahao Guo and Sinan Du and Jingfeng Yao and Wenyu Liu and Bo Li and Haoxiang Cao and Kun Gai and Chun Yuan and Kai Wu and Xinggang Wang},
  year={2025},
  eprint={2511.23469},
  archivePrefix={arXiv},
}
```

## 📄 License

This project is released under the MIT License. See the [LICENSE](https://github.com/hustvl/VGT/blob/main/LICENSE) file for details.