kkail8 committed · Commit 900c1f6 (verified) · Parent: 45a17fb

Update README.md

---
license: apache-2.0
arxiv: 2503.23377
---

## <div align="center"> JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization</div>

<div align="center">

[[`HomePage`](https://javisdit.github.io/)]
[[`ArXiv Paper`](https://arxiv.org/pdf/2503.23377)]
[[`HF Paper`](https://huggingface.co/papers/2503.23377)]
[[`GitHub`](https://github.com/JavisDiT/JavisDiT/)]
[[`Models`](https://huggingface.co/collections/JavisDiT/javisdit-v01-67f2ac8a0def71591f7e2974)]

</div>

We introduce **JavisDiT**, a novel, state-of-the-art Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG) from open-ended user prompts.
## 📰 News

- **[2025.08.11]** 🔥 We released the data and code for JAVG evaluation. For more details, refer to [here](#evaluation) and [eval/javisbench/README.md](eval/javisbench/README.md).
- **[2025.04.15]** 🔥 We released the data preparation and model training instructions. You can train JavisDiT on your own dataset!
- **[2025.04.07]** 🔥 We released the inference code and a preview model of **JavisDiT-v0.1** at [HuggingFace](https://huggingface.co/JavisDiT), which includes **JavisDiT-v0.1-audio**, **JavisDiT-v0.1-prior**, and **JavisDiT-v0.1-jav** (with a [low-resolution version](https://huggingface.co/JavisDiT/JavisDiT-v0.1-jav-240p4s) and a [full-resolution version](https://huggingface.co/JavisDiT/JavisDiT-v0.1-jav)).
- **[2025.04.03]** We released the repository of [JavisDiT](https://arxiv.org/pdf/2503.23377). Code, model, and data are coming soon.
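As a minimal sketch of fetching the released checkpoints listed above, one could use `snapshot_download` from `huggingface_hub`. The repo ids below come from the model links in this README; the `javisdit_repo_id` helper itself is a hypothetical convenience, not part of the official codebase (see the [GitHub repo](https://github.com/JavisDiT/JavisDiT/) for the supported inference workflow):

```python
def javisdit_repo_id(variant: str) -> str:
    """Build the Hugging Face Hub repo id for a JavisDiT-v0.1 variant.

    Illustrative helper only; valid variants per the links above include
    "audio", "prior", "jav", and "jav-240p4s".
    """
    return f"JavisDiT/JavisDiT-v0.1-{variant}"


if __name__ == "__main__":
    # Requires `pip install huggingface_hub`; downloads large checkpoint files.
    from huggingface_hub import snapshot_download

    # Fetch the full-resolution joint audio-video checkpoint to the local cache.
    local_dir = snapshot_download(repo_id=javisdit_repo_id("jav"))
    print(local_dir)
```

The guarded `__main__` block keeps the actual (network-heavy) download separate from the pure repo-id construction.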
### 👉 TODO
- [ ] Release the data and evaluation code for JavisScore.
- [ ] Derive a more efficient and powerful JAVG model.

## Brief Introduction

**JavisDiT** addresses the key bottleneck of JAVG with Hierarchical Spatio-Temporal Prior Synchronization.

- We introduce **JavisDiT**, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG) from open-ended user prompts.
- We propose **JavisBench**, a new benchmark consisting of 10,140 high-quality text-captioned sounding videos spanning diverse scenes and complex real-world scenarios.
- We devise **JavisScore**, a robust metric for evaluating the synchronization between generated audio-video pairs in complex real-world content.
- We curate **JavisEval**, a dataset with 3,000 human-annotated samples to quantitatively evaluate the accuracy of synchronization estimation metrics.

We hope to set a new standard for the JAVG community. For more technical details, kindly refer to the original [paper](https://arxiv.org/pdf/2503.23377.pdf).
## Citation

If you find JavisDiT useful and use it in your project, please kindly cite:

```bibtex
@inproceedings{liu2025javisdit,
  title={JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization},
  author={Kai Liu and Wei Li and Lai Chen and Shengqiong Wu and Yanhao Zheng and Jiayi Ji and Fan Zhou and Rongxin Jiang and Jiebo Luo and Hao Fei and Tat-Seng Chua},
  booktitle={arxiv},
  year={2025},
  eprint={2503.23377},
}
```