Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration

📚 Paper | 🌐 Project Page | 💻 Code

For more quantitative and visual results, check out our project page.


🎬 Overview

[Figure: overall structure]

🔧 Dependencies and Installation

  1. Clone Repo

    git clone https://github.com/csbhr/Vivid-VR.git
    cd Vivid-VR
    
  2. Create Conda Environment and Install Dependencies

    # create new conda env
    conda create -n Vivid-VR python=3.10
    conda activate Vivid-VR
    
    # install pytorch
    pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu121
    
    # install python dependencies
    pip install -r requirements.txt
    
    # install easyocr [Optional, for text fix]
    pip install easyocr
    pip install numpy==1.26.4  # installing easyocr may pull in numpy 2.x, which causes conflicts
    
  3. Download Models

    • [Required] Download CogVideoX1.5-5B checkpoints from [huggingface].
    • [Required] Download cogvlm2-llama3-caption checkpoints from [huggingface].
      • Please replace modeling_cogvlm.py with ./VRDiT/cogvlm2-llama3-caption/modeling_cogvlm.py to remove the dependency on pytorchvideo.
    • [Required] Download Vivid-VR checkpoints from [huggingface].
    • [Optional, for text fix] Download easyocr checkpoints [english_g2] [zh_sim_g2] [craft_mlt_25k].
    • [Optional, for text fix] Download Real-ESRGAN checkpoints [RealESRGAN_x2plus].
    • Put them under the ./ckpts folder.

    The ckpts directory structure should be arranged as:

    ├── ckpts
    │   ├── CogVideoX1.5-5B
    │   │   ├── ...
    │   ├── cogvlm2-llama3-caption
    │   │   ├── ...
    │   ├── Vivid-VR
    │   │   ├── controlnet
    │   │   │   ├── config.json
    │   │   │   ├── diffusion_pytorch_model.safetensors
    │   │   ├── connectors.pt
    │   │   ├── control_feat_proj.pt
    │   │   ├── control_patch_embed.pt
    │   ├── easyocr
    │   │   ├── craft_mlt_25k.pth
    │   │   ├── english_g2.pth
    │   │   ├── zh_sim_g2.pth
    │   ├── RealESRGAN
    │   │   ├── RealESRGAN_x2plus.pth
    
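Before running inference, it can help to confirm that the ckpts folder matches the layout above. The following is a minimal sketch, not part of the repository: the helper name `missing_entries` and the `REQUIRED`/`OPTIONAL` lists are illustrative assumptions built only from the directory tree shown here.

```python
from pathlib import Path

# Required entries, relative to the ckpts folder (taken from the tree above).
REQUIRED = [
    "CogVideoX1.5-5B",
    "cogvlm2-llama3-caption",
    "Vivid-VR/controlnet/config.json",
    "Vivid-VR/controlnet/diffusion_pytorch_model.safetensors",
    "Vivid-VR/connectors.pt",
    "Vivid-VR/control_feat_proj.pt",
    "Vivid-VR/control_patch_embed.pt",
]

# Optional entries, only needed for the text fix.
OPTIONAL = [
    "easyocr/craft_mlt_25k.pth",
    "easyocr/english_g2.pth",
    "easyocr/zh_sim_g2.pth",
    "RealESRGAN/RealESRGAN_x2plus.pth",
]


def missing_entries(ckpt_dir, entries):
    """Return the entries (hypothetical helper) that do not exist under ckpt_dir."""
    root = Path(ckpt_dir)
    return [entry for entry in entries if not (root / entry).exists()]


if __name__ == "__main__":
    for label, entries in (("required", REQUIRED), ("optional", OPTIONAL)):
        for entry in missing_entries("./ckpts", entries):
            print(f"missing {label} checkpoint: {entry}")
```

The check only looks at whether each path exists; it does not validate file contents or sizes.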

☕️ Quick Inference

Run the following commands to try it out:

# Option notes:
#   --num_temporal_process_frames: for long videos; if a video has more frames than this value, aggregate sampling is enabled in the temporal dimension
#   --upscale: optional; if set to 0, the short side of the output videos will be 1024
#   --textfix: optional; if given, text regions are replaced by the output of Real-ESRGAN
#   --save_images: optional; if given, the video frames will be saved
python VRDiT/inference.py \
    --ckpt_dir=./ckpts \
    --cogvideox_ckpt_path=./ckpts/CogVideoX1.5-5B \
    --cogvlm2_ckpt_path=./ckpts/cogvlm2-llama3-caption \
    --input_dir=/dir/to/input/videos \
    --output_dir=/dir/to/output/videos \
    --num_temporal_process_frames=121 \
    --upscale=0 \
    --textfix \
    --save_images
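If you prefer to launch inference from a Python script, the same command can be assembled programmatically. This is a minimal sketch under stated assumptions: `build_inference_command` is a hypothetical helper, not part of the repository, and it only uses the flags listed in the command above.

```python
import shlex
import sys


def build_inference_command(input_dir, output_dir, ckpt_dir="./ckpts",
                            num_frames=121, upscale=0,
                            textfix=False, save_images=False):
    """Assemble the argument list for VRDiT/inference.py (hypothetical helper)."""
    cmd = [
        sys.executable, "VRDiT/inference.py",
        f"--ckpt_dir={ckpt_dir}",
        f"--cogvideox_ckpt_path={ckpt_dir}/CogVideoX1.5-5B",
        f"--cogvlm2_ckpt_path={ckpt_dir}/cogvlm2-llama3-caption",
        f"--input_dir={input_dir}",
        f"--output_dir={output_dir}",
        f"--num_temporal_process_frames={num_frames}",
        f"--upscale={upscale}",
    ]
    # Boolean flags are appended only when requested.
    if textfix:
        cmd.append("--textfix")
    if save_images:
        cmd.append("--save_images")
    return cmd


if __name__ == "__main__":
    cmd = build_inference_command("/dir/to/input/videos",
                                  "/dir/to/output/videos",
                                  textfix=True)
    # Print the command for inspection; to run it, pass the list to
    # subprocess.run(cmd, check=True) from the repository root.
    print(shlex.join(cmd))
```

Building the command as a list avoids shell quoting problems when input or output paths contain spaces.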

GPU memory usage:

  • For a 121-frame video, inference requires approximately 43 GB of GPU memory.
  • To reduce GPU memory usage, replace "pipe.enable_model_cpu_offload" with "pipe.enable_sequential_cpu_offload" in ./VRDiT/inference.py. This lowers GPU memory usage to 25 GB, but inference takes longer.
  • For the "--num_temporal_process_frames" argument, smaller values require less GPU memory but increase inference time.
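The memory figures above can be turned into a small rule of thumb for picking an offload mode. This is a rough sketch, not the repository's logic: `suggest_offload` is a hypothetical helper, and the thresholds are simply the approximate 43 GB and 25 GB figures quoted for a 121-frame video. The free-memory value is passed in by the caller; on a CUDA machine it could be obtained from `torch.cuda.mem_get_info()`, which reports free and total bytes.

```python
# Approximate figures quoted above for a 121-frame video (in GB).
MODEL_OFFLOAD_GB = 43
SEQUENTIAL_OFFLOAD_GB = 25


def suggest_offload(free_gb):
    """Suggest a pipeline offload method name for the given free GPU memory in GB.

    Hypothetical helper; returns None when free memory is below both figures.
    """
    if free_gb >= MODEL_OFFLOAD_GB:
        return "enable_model_cpu_offload"
    if free_gb >= SEQUENTIAL_OFFLOAD_GB:
        return "enable_sequential_cpu_offload"
    # Below both figures: a smaller --num_temporal_process_frames may help.
    return None


if __name__ == "__main__":
    for free_gb in (48, 32, 16):
        print(free_gb, "GB free ->", suggest_offload(free_gb))
```

Actual requirements depend on the video resolution and on "--num_temporal_process_frames", so treat the result as a starting point only.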

📧 Citation

If you find our repo useful for your research, please consider citing it:

@article{bai2025vividvr,
   title={Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration}, 
   author={Bai, Haoran and Chen, Xiaoxu and Yang, Canqian and He, Zongyao and Deng, Sibin and Chen, Ying},
   journal={arXiv preprint arXiv:2508.14483},
   year={2025},
   url={https://arxiv.org/abs/2508.14483}
 }

📄 License
