MSDF: A General Open-Domain Multi-Skill Dialog Framework Paper • 2206.08626 • Published Jun 17, 2022 • 2
A Neural Divide-and-Conquer Reasoning Framework for Image Retrieval from Linguistically Complex Text Paper • 2305.02265 • Published May 3, 2023 • 2
LMEye: An Interactive Perception Network for Large Language Models Paper • 2305.03701 • Published May 5, 2023 • 2
A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering Paper • 2311.07536 • Published Nov 13, 2023 • 3
LLMs Meet Long Video: Advancing Long Video Comprehension with An Interactive Visual Adapter in LLMs Paper • 2402.13546 • Published Feb 21, 2024 • 3
Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment Paper • 2402.13561 • Published Feb 21, 2024 • 1
A Multimodal In-Context Tuning Approach for E-Commerce Product Description Generation Paper • 2402.13587 • Published Feb 21, 2024 • 2
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts Paper • 2405.11273 • Published May 18, 2024 • 19
VideoVista: A Versatile Benchmark for Video Understanding and Reasoning Paper • 2406.11303 • Published Jun 17, 2024 • 3
Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation Paper • 2408.09787 • Published Aug 19, 2024 • 10
UI-TARS: Pioneering Automated GUI Interaction with Native Agents Paper • 2501.12326 • Published Jan 21 • 65
VideoVista-CulturalLingo: 360° Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension Paper • 2504.17821 • Published Apr 23 • 24
Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models Paper • 2505.04921 • Published May 8 • 185