Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy (arXiv:2502.05177, published Feb 7, 2025)
VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation (arXiv:2510.09607, published Oct 10, 2025)
VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting (arXiv:2510.21817, published Oct 21, 2025)