
$\pi_{RL}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models

This repository contains artifacts related to the $\pi_{RL}$ framework, as introduced in the paper $\pi_{RL}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models.

The $\pi_{RL}$ framework is an open-source solution for training flow-based Vision-Language-Action (VLA) models in parallel simulation, addressing the challenges of applying large-scale reinforcement learning to systems with intractable action log-likelihoods from iterative denoising.
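For intuition (a minimal sketch, not the $\pi_{RL}$ code), the snippet below shows why action log-likelihoods are intractable for a flow-based action head: the action is produced by deterministically integrating a learned velocity field from Gaussian noise, so there is no closed-form density to plug into a policy-gradient objective. The names `velocity_net`, `obs_embedding`, and `num_steps` are illustrative assumptions.

```python
import torch

def sample_action_ode(velocity_net, obs_embedding, action_dim, num_steps=10):
    """Integrate the flow ODE x_{k+1} = x_k + v_theta(x_k, t_k, obs) * dt."""
    x = torch.randn(action_dim)                  # x_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = torch.tensor(k * dt)                 # scalar flow time in [0, 1)
        v = velocity_net(x, t, obs_embedding)    # learned velocity field (assumed interface)
        x = x + v * dt                           # deterministic Euler step
    # The action is a deterministic function of the initial noise, so
    # log pi(action | obs) has no closed form for standard policy-gradient RL.
    return x
```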

Abstract

Vision-Language-Action (VLA) models enable robots to understand and perform complex tasks from multimodal input. Although recent work explores using reinforcement learning (RL) to automate the laborious data collection process in scaling supervised fine-tuning (SFT), applying large-scale RL to flow-based VLAs (e.g., $\pi_0$, $\pi_{0.5}$) remains challenging due to intractable action log-likelihoods from iterative denoising. We address this challenge with $\pi_{RL}$, an open-source framework for training flow-based VLAs in parallel simulation. $\pi_{RL}$ implements two RL algorithms: (1) Flow-Noise models the denoising process as a discrete-time MDP with a learnable noise network for exact log-likelihood computation. (2) Flow-SDE integrates denoising with agent-environment interaction, formulating a two-layer MDP that employs ODE-to-SDE conversion for efficient RL exploration. We evaluate $\pi_{RL}$ on the LIBERO and ManiSkill benchmarks. On LIBERO, $\pi_{RL}$ boosts few-shot SFT models $\pi_0$ and $\pi_{0.5}$ from 57.6% to 97.6% and from 77.1% to 98.3%, respectively. On ManiSkill, we train $\pi_{RL}$ in 320 parallel environments, improving $\pi_0$ from 41.6% to 85.7% and $\pi_{0.5}$ from 40.0% to 84.8% across 4352 pick-and-place tasks, demonstrating scalable multitask RL under heterogeneous simulation. Overall, $\pi_{RL}$ achieves significant performance gains and stronger generalization over SFT models, validating the effectiveness of online RL for flow-based VLAs.
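As a rough illustration of the idea behind Flow-Noise and Flow-SDE (a minimal sketch under assumed interfaces, not the released implementation): once each denoising step injects Gaussian noise, the step becomes a Gaussian transition with an exact log-probability, so the denoising chain can be treated as an MDP and optimized with a PPO-style objective. A full ODE-to-SDE conversion also corrects the drift with a score term, omitted here for brevity; `velocity_net` and `sigma` are illustrative assumptions.

```python
import torch

def stochastic_denoise_step(velocity_net, x, t, obs_embedding, dt, sigma=0.1):
    """One Euler-Maruyama-style denoising step with an exact Gaussian log-prob."""
    drift = velocity_net(x, t, obs_embedding)    # learned velocity field (assumed interface)
    mean = x + drift * dt                        # mean of the next denoising state
    std = sigma * dt ** 0.5                      # diffusion scale (illustrative choice)
    dist = torch.distributions.Normal(mean, std)
    x_next = dist.sample()
    log_prob = dist.log_prob(x_next).sum()       # exact per-step log-likelihood
    return x_next, log_prob
```

Summing the per-step log-probabilities over the denoising chain then gives a tractable log-likelihood term for the policy-gradient ratio.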

Further Resources

Paper: https://arxiv.org/abs/2510.25889
RLinf framework paper: https://arxiv.org/abs/2509.15965

Citation

If you find this work helpful, please cite the following paper:

@misc{chen2025pitextttrlonlinerlfinetuning,
      title={$\pi_\texttt{RL}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models}, 
      author={Kang Chen and Zhihao Liu and Tonghe Zhang and Zhen Guo and Si Xu and Hao Lin and Hongzhi Zang and Quanlu Zhang and Zhaofei Yu and Guoliang Fan and Tiejun Huang and Yu Wang and Chao Yu},
      year={2025},
      eprint={2510.25889},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.25889}, 
}

If you use the broader RLinf framework, please also cite its main paper:

@misc{yu2025rlinfflexibleefficientlargescale,
  title={RLinf: Flexible and Efficient Large-scale Reinforcement Learning via Macro-to-Micro Flow Transformation}, 
  author={Chao Yu and Yuanqing Wang and Zhen Guo and Hao Lin and Si Xu and Hongzhi Zang and Quanlu Zhang and Yongji Wu and Chunyang Zhu and Junhao Hu and Zixiao Huang and Mingjie Wei and Yuqing Xie and Ke Yang and Bo Dai and Zhexuan Xu and Xiangyuan Wang and Xu Fu and Zhihao Liu and Kang Chen and Weilin Liu and Gang Liu and Boxun Li and Jianlei Yang and Zhi Yang and Guohao Dai and Yu Wang},
  year={2025},
  eprint={2509.15965},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2509.15965}, 
}
Model details: Safetensors checkpoint, 4B parameters, tensor types F32 and BF16.