---
license: mit
pipeline_tag: text-to-image
---

# VGT: Visual Generation Tuning

**_Unleashing Visual Generation Capabilities from Any Pretrained VLM_**

VGT (Visual Generation Tuning) is a new paradigm that unlocks the latent visual generation capabilities of any Vision-Language Model (VLM). By applying efficient visual generation tuning to a well-pretrained VLM, VGT sharply reduces alignment costs and accelerates the convergence of autoregressive modeling in continuous space by roughly 20×. In short, it turns any pretrained VLM into a powerful image generator, achieving state-of-the-art visual generation results with dramatically faster convergence and extreme data efficiency.

**GenEval 0.83 | DPG-Bench 81.28 | 20× Faster Convergence**

<div align="center">
<img src="https://github.com/hustvl/VGT/raw/main/asserts/case_show.png" alt="VGT Generated Images">
</div>

## ✨ Highlights

- **🎯 Novel Paradigm**: Transform ANY pretrained Vision-Language Model into a powerful image generator through efficient visual generation tuning.
- **⚡ 20× Speedup**: Achieve dramatically faster convergence compared to vanilla VAE-based autoregressive models.
- **🏆 SOTA Performance**: GenEval **0.83** and DPG-Bench **81.28** with minimal training data.
- **📈 Extreme Data Efficiency**: Reach GenEval 0.55 in just 10K iterations and 0.60 in 30K iterations.
- **🚀 Parallel Inference**: The QueryAR mechanism enables 16× parallel decoding while maintaining high-quality generation (see the toy sketch after this list).
- **🎨 Superior Reconstruction**: 26.67 PSNR and 0.50 rFID at a 28× compression ratio, outperforming specialized VAEs.
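
To make the parallel-decoding point concrete, here is a purely illustrative, framework-free sketch of block-parallel generation: each forward pass emits a block of 16 continuous tokens conditioned on everything generated so far, so 256 tokens take 16 passes instead of 256. All names here (`toy_decoder`, `generate`) are hypothetical stand-ins, not the actual QueryAR implementation.

```python
# Toy illustration of block-parallel decoding (NOT the actual QueryAR code).
import numpy as np

rng = np.random.default_rng(0)

def toy_decoder(context: np.ndarray, num_queries: int, dim: int) -> np.ndarray:
    """Stand-in for one VLM forward pass: maps the tokens generated so far
    plus `num_queries` query slots to `num_queries` continuous latent tokens."""
    base = context.mean(axis=0) if len(context) else np.zeros(dim)
    return base + 0.1 * rng.standard_normal((num_queries, dim))

def generate(total_tokens: int = 256, parallel: int = 16, dim: int = 8) -> np.ndarray:
    tokens = np.zeros((0, dim))
    for _ in range(total_tokens // parallel):       # 256 / 16 = 16 forward passes
        block = toy_decoder(tokens, parallel, dim)  # emit 16 tokens per pass
        tokens = np.concatenate([tokens, block], axis=0)
    return tokens

latents = generate()
print(latents.shape)  # (256, 8)
```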

## Paper

The model was presented in the paper [**Visual Generation Tuning**](https://huggingface.co/papers/2511.23469).

## Code

The official implementation is available in the [GitHub repository](https://github.com/hustvl/VGT).

## 🚀 Getting Started

### Installation

To get started, clone the repository and install the required dependencies:

```bash
# Clone the repository
git clone https://github.com/hustvl/VGT.git
cd VGT

# Install dependencies
conda create -n vgt python=3.10
conda activate vgt
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install mmengine xtuner tqdm timm
pip install diffusers transformers==4.57.1
pip install flash-attn --no-build-isolation
```

### Pretrained Models

Pretrained VGT-tuned models based on Qwen2.5-VL and InternVL3 (448px) are available for download:

| Model | Base Model | GenEval | DPG-Bench | Download |
|:------|:-----------|:-------:|:---------:|:--------:|
| VGT-InternVL3-1.6B-Pretrain | InternVL3-1.6B | 0.58 | 73.05 | [🤗 HuggingFace](https://huggingface.co/hustvl/vgt_internvl3_1_6B_pretrain) |
| VGT-InternVL3-1.6B-SFT | InternVL3-1.6B | 0.83 | 76.33 | [🤗 HuggingFace](https://huggingface.co/hustvl/vgt_internvl3_1_6B_sft) |
| VGT-Qwen2.5-VL-2B-Pretrain | Qwen2.5-VL-2B | 0.63 | 78.02 | [🤗 HuggingFace](https://huggingface.co/hustvl/vgt_qwen25vl_2B_pretrain) |
| VGT-Qwen2.5-VL-2B-SFT | Qwen2.5-VL-2B | 0.83 | 81.28 | [🤗 HuggingFace](https://huggingface.co/hustvl/vgt_qwen25vl_2B_sft) |
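
If you prefer Python over the `hf` CLI used in the next section, a checkpoint can also be fetched with `huggingface_hub`. This is a minimal sketch; the `local_dir` below is just an example path that mirrors the CLI commands shown later.

```python
# Minimal sketch: fetch a VGT checkpoint with huggingface_hub.
# The local_dir below is only an example path.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="hustvl/vgt_qwen25vl_2B_sft",
    local_dir="ckpts/hustvl/vgt_qwen25vl_2B_sft",
)
```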

### Inference (Sample Usage)

Download the SFT model checkpoints and generate images from text prompts:
```bash |
|
|
# Ensure you are in the 'VGT' directory |
|
|
cd VGT |
|
|
|
|
|
# Create a directory for checkpoints |
|
|
mkdir -p ckpts/hustvl |
|
|
|
|
|
# Download the sft model checkpoints |
|
|
hf download hustvl/vgt_qwen25vl_2B_sft --repo-type model --local-dir ckpts/hustvl/vgt_qwen25vl_2B_sft |
|
|
hf download hustvl/vgt_internvl3_1_6B_sft --repo-type model --local-dir ckpts/hustvl/vgt_internvl3_1_6B_sft |
|
|
|
|
|
# Set Python path and run the inference script for InternVL3-1.6B |
|
|
export PYTHONPATH=./:$PYTHONPATH |
|
|
python scripts/sample_text_list_vgt_intervl3_0.6B.py |
|
|
``` |

*Note: VGT-Qwen2.5-VL-2B performs better at face generation, while VGT-InternVL3-1.6B excels at landscapes, light and shadow, and animals. Feel free to explore these differences yourself.*

## 📝 Citation

If you find our work useful, please cite our paper:

```bibtex
@misc{guo2025vgt,
      title={Visual Generation Tuning},
      author={Jiahao Guo and Sinan Du and Jingfeng Yao and Wenyu Liu and Bo Li and Haoxiang Cao and Kun Gai and Chun Yuan and Kai Wu and Xinggang Wang},
      year={2025},
      eprint={2511.23469},
      archivePrefix={arXiv}
}
```

## 📄 License

This project is released under the MIT License. See the [LICENSE](https://github.com/hustvl/VGT/blob/main/LICENSE) file for details.