TriVLA: A Triple-System-Based Unified Vision-Language-Action Model for General Robot Control Paper • 2507.01424 • Published Jul 2, 2025
A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding Paper • 2507.06719 • Published Jul 9, 2025
ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning Paper • 2503.23297 • Published Mar 30, 2025
VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control Paper • 2601.05138 • Published 13 days ago
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers Paper • 2012.15840 • Published Dec 31, 2020
Intelligent Director: An Automatic Framework for Dynamic Visual Composition using ChatGPT Paper • 2402.15746 • Published Feb 24, 2024
VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation Paper • 2502.07531 • Published Feb 11, 2025
ContextualStory: Consistent Visual Storytelling with Spatially-Enhanced and Storyline Context Paper • 2407.09774 • Published Jul 13, 2024