---
license: mit
pipeline_tag: text-to-image
---

# VGT: Visual Generation Tuning

**_Unleashing Visual Generation Capabilities from Any Pretrained VLM_**

VGT (Visual Generation Tuning) is a new paradigm that unlocks the latent visual generation capabilities of any Vision-Language Model (VLM). By applying efficient visual generation tuning to a well-pretrained VLM, VGT sharply reduces alignment costs and accelerates the convergence of autoregressive modeling in continuous space by roughly 20×. In short, it turns any pretrained VLM into a powerful image generator, achieving state-of-the-art visual generation results with dramatically faster convergence and extreme data efficiency.

**GenEval 0.83 | DPG-Bench 81.28 | 20× Faster Convergence**

<div align="center">
<img src="https://github.com/hustvl/VGT/raw/main/asserts/case_show.png" alt="VGT Generated Images">
</div>

## ✨ Highlights

- **🎯 Novel Paradigm**: Transform ANY pretrained Vision-Language Model into a powerful image generator through efficient visual generation tuning.
- **⚡ 20× Speedup**: Achieve dramatically faster convergence compared to vanilla VAE-based autoregressive models.
- **🏆 SOTA Performance**: GenEval **0.83** and DPG-Bench **81.28** with minimal training data.
- **📈 Extreme Data Efficiency**: Reach GenEval 0.55 in just 10K iterations and 0.60 in 30K iterations.
- **🚀 Parallel Inference**: The QueryAR mechanism enables 16× parallel decoding while maintaining high-quality generation (see the toy sketch after this list).
- **🎨 Superior Reconstruction**: 26.67 PSNR and 0.50 rFID at a 28× compression ratio, outperforming specialized VAEs.
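
To make the parallel-decoding point concrete, here is a purely illustrative, framework-free sketch of block-parallel generation: each forward pass emits a block of 16 continuous tokens conditioned on everything generated so far, so 256 tokens take 16 passes instead of 256. All names here (`toy_decoder`, `generate`) are hypothetical stand-ins, not the actual QueryAR implementation.

```python
# Toy illustration of block-parallel decoding (NOT the actual QueryAR code).
import numpy as np

rng = np.random.default_rng(0)

def toy_decoder(context: np.ndarray, num_queries: int, dim: int) -> np.ndarray:
    """Stand-in for one VLM forward pass: maps the tokens generated so far
    plus `num_queries` query slots to `num_queries` continuous latent tokens."""
    base = context.mean(axis=0) if len(context) else np.zeros(dim)
    return base + 0.1 * rng.standard_normal((num_queries, dim))

def generate(total_tokens: int = 256, parallel: int = 16, dim: int = 8) -> np.ndarray:
    tokens = np.zeros((0, dim))
    for _ in range(total_tokens // parallel):       # 256 / 16 = 16 forward passes
        block = toy_decoder(tokens, parallel, dim)  # emit 16 tokens per pass
        tokens = np.concatenate([tokens, block], axis=0)
    return tokens

latents = generate()
print(latents.shape)  # (256, 8)
```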

## Paper

The model was presented in the paper [**Visual Generation Tuning**](https://huggingface.co/papers/2511.23469).

## Code

The official implementation is available in the [GitHub repository](https://github.com/hustvl/VGT).

## 🚀 Getting Started

### Installation

To get started, clone the repository and install the required dependencies:

```bash
# Clone the repository
git clone https://github.com/hustvl/VGT.git
cd VGT

# Install dependencies
conda create -n vgt python=3.10
conda activate vgt
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install mmengine xtuner tqdm timm
pip install diffusers transformers==4.57.1
pip install flash-attn --no-build-isolation
```

### Pretrained Models

Pretrained VGT-tuned models based on Qwen2.5-VL and InternVL3 (448px) are available for download:

| Model | Base Model | GenEval | DPG-Bench | Download |
|:------|:-----------|:-------:|:---------:|:--------:|
| VGT-InternVL3-1.6B-Pretrain | InternVL3-1.6B | 0.58 | 73.05 | [🤗 HuggingFace](https://huggingface.co/hustvl/vgt_internvl3_1_6B_pretrain) |
| VGT-InternVL3-1.6B-SFT | InternVL3-1.6B | 0.83 | 76.33 | [🤗 HuggingFace](https://huggingface.co/hustvl/vgt_internvl3_1_6B_sft) |
| VGT-Qwen2.5-VL-2B-Pretrain | Qwen2.5-VL-2B | 0.63 | 78.02 | [🤗 HuggingFace](https://huggingface.co/hustvl/vgt_qwen25vl_2B_pretrain) |
| VGT-Qwen2.5-VL-2B-SFT | Qwen2.5-VL-2B | 0.83 | 81.28 | [🤗 HuggingFace](https://huggingface.co/hustvl/vgt_qwen25vl_2B_sft) |
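
If you prefer Python over the `hf` CLI used in the next section, a checkpoint can also be fetched with `huggingface_hub`. This is a minimal sketch; the `local_dir` below is just an example path that mirrors the CLI commands shown later.

```python
# Minimal sketch: fetch a VGT checkpoint with huggingface_hub.
# The local_dir below is only an example path.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="hustvl/vgt_qwen25vl_2B_sft",
    local_dir="ckpts/hustvl/vgt_qwen25vl_2B_sft",
)
```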

### Inference (Sample Usage)

Download the SFT model checkpoints and generate images from text prompts:
```bash |
|
|
# Ensure you are in the 'VGT' directory |
|
|
cd VGT |
|
|
|
|
|
# Create a directory for checkpoints |
|
|
mkdir -p ckpts/hustvl |
|
|
|
|
|
# Download the sft model checkpoints |
|
|
hf download hustvl/vgt_qwen25vl_2B_sft --repo-type model --local-dir ckpts/hustvl/vgt_qwen25vl_2B_sft |
|
|
hf download hustvl/vgt_internvl3_1_6B_sft --repo-type model --local-dir ckpts/hustvl/vgt_internvl3_1_6B_sft |
|
|
|
|
|
# Set Python path and run the inference script for InternVL3-1.6B |
|
|
export PYTHONPATH=./:$PYTHONPATH |
|
|
python scripts/sample_text_list_vgt_intervl3_0.6B.py |
|
|
``` |

*Note: VGT-Qwen2.5-VL-2B performs better at face generation, while VGT-InternVL3-1.6B excels at landscapes, light and shadow, and animals. Feel free to explore these differences yourself.*

## 📝 Citation

If you find our work useful, please cite our paper:

```bibtex
@misc{guo2025vgt,
      title={Visual Generation Tuning},
      author={Jiahao Guo and Sinan Du and Jingfeng Yao and Wenyu Liu and Bo Li and Haoxiang Cao and Kun Gai and Chun Yuan and Kai Wu and Xinggang Wang},
      year={2025},
      eprint={2511.23469},
      archivePrefix={arXiv}
}
```

## 📄 License

This project is released under the MIT License. See the [LICENSE](https://github.com/hustvl/VGT/blob/main/LICENSE) file for details.