## Introduction

### Model Description
We present MUA-RL, a multi-turn user-interacting agent reinforcement learning framework for agentic tool use. The model is designed for multi-turn conversation scenarios in which an agent must maintain context across turns while effectively using tools to complete complex tasks. MUA-RL is the first framework to integrate LLM-simulated users into the reinforcement learning loop for agentic tool use, enabling models to autonomously learn to communicate with users efficiently and to use various tools to solve practical problems in dynamic multi-turn interactions.
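The interaction pattern described above can be sketched as a rollout loop in which a simulated user, the agent policy, and tools take turns. All component implementations below (the stub user, policy, and `issue_refund` tool) are hypothetical stand-ins for illustration, not the actual MUA-RL code:

```python
# Minimal sketch of a multi-turn agent/user/tool rollout.
# The agent alternates between replying to the (LLM-simulated) user and
# calling tools; the episode ends when the user signals the task is done.

def simulated_user(history):
    """Stand-in for the LLM-simulated user (GPT-4o in MUA-RL)."""
    if any("refund issued" in turn for turn in history):
        return "###DONE###"  # user considers the task complete
    return "I want a refund for order 42."

def agent(history):
    """Stand-in for the policy model: decides to call a tool or reply."""
    if any("order 42" in turn for turn in history):
        return ("tool", "issue_refund", {"order_id": 42})
    return ("reply", "How can I help you?")

def issue_refund(order_id):
    """Stand-in tool exposed to the agent."""
    return f"refund issued for order {order_id}"

TOOLS = {"issue_refund": issue_refund}

def rollout(max_turns=8):
    """Run one multi-turn episode and return the dialogue history."""
    history = []
    for _ in range(max_turns):
        user_msg = simulated_user(history)
        if user_msg == "###DONE###":
            break
        history.append(user_msg)
        kind, *payload = agent(history)
        if kind == "tool":
            name, args = payload
            history.append(TOOLS[name](**args))  # tool result fed back
        else:
            history.append(payload[0])
    return history
```

In the actual framework each completed rollout like this is scored and used as a training trajectory for reinforcement learning.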
## Performance
MUA-RL achieves competitive performance across multiple multi-turn tool-using benchmarks:
| Model | TAU2 Retail | TAU2 Airline | TAU2 Telecom | BFCL-V3 Multi Turn | ACEBench Agent |
|---|---|---|---|---|---|
| GPT-4.1 | 70.2 | 53.0 | 38.9 | 40.5 | 86.7 |
| DeepSeek-V3-0324 | 64.7 | 37.0 | 32.9 | 29.8 | 74.2 |
| Qwen3-235B-A22B Non-thinking | 64.9 | 36.0 | 24.6 | 30.0 | 71.7 |
| MUA-RL-32B Non-thinking | 67.3 | 45.4 | 28.3 | 28.4 | 82.5 |
| Qwen3-32B Non-thinking | 50.2 | 23.5 | 24.8 | 19.6 | 72.5 |
| MUA-RL-14B Non-thinking | 66.0 | 38.0 | 33.4 | 25.3 | 78.3 |
| Qwen3-14B Non-thinking | 43.1 | 14.8 | 29.9 | 17.6 | 60.0 |
| MUA-RL-8B Non-thinking | 49.8 | 19.0 | 21.8 | 14.6 | 53.3 |
| Qwen3-8B Non-thinking | 41.0 | 12.5 | 19.1 | 11.8 | 39.2 |
MUA-RL matches or outperforms larger open-source models such as DeepSeek-V3-0324 and Qwen3-235B-A22B in the non-thinking setting.
## Training Details

### Architecture
- Model Size: 8B parameters
- Train Context Length: 32K tokens
### Training Process
- Reinforcement Learning: Group Relative Policy Optimization (GRPO)
- User Simulation: LLM-simulated users integrated into RL loop (GPT-4o-2024-11-20 as user)
- Environment Management: Environment creation for each rollout
- Tool Integration: Seamless tool calling and response handling
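GRPO, named above, scores each rollout relative to a group of rollouts for the same task rather than against a learned value function. A minimal sketch of the standard group-relative advantage computation (illustrative only, not the MUA-RL training code):

```python
# Group-relative advantages as used in GRPO: each rollout's reward is
# normalized by the mean and standard deviation of its group, so rollouts
# that beat their siblings get positive advantage and vice versa.

def grpo_advantages(rewards, eps=1e-8):
    """Compute group-relative advantages for one group of rollout rewards."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # eps guards against division by zero when all rewards are equal
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, a group with rewards `[1.0, 0.0, 1.0, 0.0]` yields advantages close to `[1, -1, 1, -1]`: successful rollouts are reinforced, unsuccessful ones penalized.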
## Citation
If you use MUA-RL in your research, please cite our paper:
```bibtex
@misc{zhao2025mua,
  title={MUA-RL: Multi-turn User-interacting Agent Reinforcement Learning for Agentic Tool Use},
  author={Weikang Zhao and Xili Wang and Chengdi Ma and Lingbin Kong and Zhaohua Yang and Mingxiang Tuo and Xiaowei Shi and Yitao Zhai and Xunliang Cai},
  year={2025},
  eprint={2508.18669},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2508.18669}
}
```