MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots
Abstract
A unified vision-language-action framework, MobileVLA-R1, enhances reasoning and control for quadruped robots through supervised chain-of-thought alignment and GRPO reinforcement learning, achieving superior performance in complex environments.
Grounding natural-language instructions into continuous control for quadruped robots remains a fundamental challenge in vision-language-action (VLA) learning. Existing methods struggle to bridge high-level semantic reasoning and low-level actuation, leading to unstable grounding and weak generalization in the real world. To address these issues, we present MobileVLA-R1, a unified vision-language-action framework that enables explicit reasoning and continuous control for quadruped robots. We construct MobileVLA-CoT, a large-scale dataset of multi-granularity chain-of-thought (CoT) annotations for embodied trajectories, providing structured reasoning supervision for alignment. Built upon this foundation, we introduce a two-stage training paradigm that combines supervised CoT alignment with GRPO reinforcement learning to enhance reasoning consistency, control stability, and long-horizon execution. Extensive evaluations on VLN and VLA tasks demonstrate superior performance over strong baselines, with an improvement of approximately 5%. Real-world deployment on a quadruped robot validates robust performance in complex environments. Code: https://github.com/AIGeeksGroup/MobileVLA-R1. Website: https://aigeeksgroup.github.io/MobileVLA-R1.
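For readers unfamiliar with GRPO, the sketch below illustrates the group-relative advantage and clipped policy objective that the second training stage builds on. This is a minimal, generic example: the function `grpo_loss`, the variable names, and the toy reward values are hypothetical and not drawn from the MobileVLA-R1 release, which pairs this style of objective with task-specific rewards (e.g., reasoning consistency and control stability).

```python
# Minimal GRPO-style loss sketch (hypothetical names; not the MobileVLA-R1 code).
import torch

def grpo_loss(logp_new, logp_old, logp_ref, group_rewards,
              clip_eps=0.2, kl_coef=0.04):
    """GRPO objective for one prompt, given G sampled rollouts.

    logp_new / logp_old / logp_ref: (G,) summed log-probs of the rollouts
        under the current, behavior, and frozen reference (CoT-aligned) policy.
    group_rewards: (G,) scalar rewards for the rollouts.
    """
    # Group-relative advantage: normalize rewards within the sampled group,
    # so no learned value function is needed.
    adv = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)

    # PPO-style clipped surrogate on the importance ratio.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    policy_term = -torch.min(unclipped, clipped).mean()

    # KL penalty keeping the policy close to the supervised reference
    # (unbiased k3 estimator: exp(x) - x - 1 with x = logp_ref - logp_new).
    log_diff = logp_ref - logp_new
    kl_term = (torch.exp(log_diff) - log_diff - 1.0).mean()

    return policy_term + kl_coef * kl_term

# Toy usage with G = 4 rollouts sampled for a single instruction.
G = 4
logp_new = torch.randn(G, requires_grad=True)
logp_old = logp_new.detach() + 0.01 * torch.randn(G)
logp_ref = logp_new.detach() + 0.05 * torch.randn(G)
rewards = torch.tensor([1.0, 0.2, 0.7, 0.0])  # hypothetical scalar rewards
loss = grpo_loss(logp_new, logp_old, logp_ref, rewards)
loss.backward()
```

The group normalization is the key design choice: advantages are computed relative to other rollouts for the same instruction, which avoids training a separate critic while still rewarding trajectories whose reasoning and actions outperform their peers.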
Community
MobileVLA-R1 introduces a unified vision-language-action framework for quadruped robots that combines multi-granularity chain-of-thought (MobileVLA-CoT) with GRPO reinforcement learning. This two-stage training improves reasoning consistency and long-horizon control, and we validate it both in simulation and on a real quadruped platform. Feedback on the CoT design and RL reward formulation is very welcome!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VLA-R1: Enhancing Reasoning in Vision-Language-Action Models (2025)
- Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation (2025)
- Reinforced Embodied Planning with Verifiable Reward for Real-World Robotic Manipulation (2025)
- SITCOM: Scaling Inference-Time COMpute for VLAs (2025)
- LaViRA: Language-Vision-Robot Actions Translation for Zero-Shot Vision Language Navigation in Continuous Environments (2025)
- Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment (2025)
- DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models (2025)