## Introduction

### Model Description
We present MUA-RL, a multi-turn user-interacting agent reinforcement learning framework for agentic tool use. The model is designed for multi-turn conversation scenarios in which an agent must maintain context across turns while effectively using tools to complete complex tasks. MUA-RL is the first framework to integrate LLM-simulated users into the reinforcement learning loop for agentic tool use, enabling models to autonomously learn to communicate with users efficiently and to use various tools to solve practical problems in dynamic multi-turn interactions.
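The interaction pattern described above can be sketched as a rollout loop in which a simulated user, the agent policy, and tools take turns. All component implementations below (the stub user, policy, and `issue_refund` tool) are hypothetical stand-ins for illustration, not the actual MUA-RL code:

```python
# Minimal sketch of a multi-turn agent/user/tool rollout.
# The agent alternates between replying to the (LLM-simulated) user and
# calling tools; the episode ends when the user signals the task is done.

def simulated_user(history):
    """Stand-in for the LLM-simulated user (GPT-4o in MUA-RL)."""
    if any("refund issued" in turn for turn in history):
        return "###DONE###"  # user considers the task complete
    return "I want a refund for order 42."

def agent(history):
    """Stand-in for the policy model: decides to call a tool or reply."""
    if any("order 42" in turn for turn in history):
        return ("tool", "issue_refund", {"order_id": 42})
    return ("reply", "How can I help you?")

def issue_refund(order_id):
    """Stand-in tool exposed to the agent."""
    return f"refund issued for order {order_id}"

TOOLS = {"issue_refund": issue_refund}

def rollout(max_turns=8):
    """Run one multi-turn episode and return the dialogue history."""
    history = []
    for _ in range(max_turns):
        user_msg = simulated_user(history)
        if user_msg == "###DONE###":
            break
        history.append(user_msg)
        kind, *payload = agent(history)
        if kind == "tool":
            name, args = payload
            history.append(TOOLS[name](**args))  # tool result fed back
        else:
            history.append(payload[0])
    return history
```

In the actual framework each completed rollout like this is scored and used as a training trajectory for reinforcement learning.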
## Performance
MUA-RL achieves competitive performance across multiple multi-turn tool-using benchmarks:
| Model | TAU2 Retail | TAU2 Airline | TAU2 Telecom | BFCL-V3 Multi Turn | ACEBench Agent |
|---|---|---|---|---|---|
| GPT-4.1 | 70.2 | 53.0 | 38.9 | 40.5 | 86.7 |
| DeepSeek-V3-0324 | 64.7 | 37.0 | 32.9 | 29.8 | 74.2 |
| Qwen3-235B-A22B Non-thinking | 64.9 | 36.0 | 24.6 | 30.0 | 71.7 |
| MUA-RL-32B Non-thinking | 67.3 | 45.4 | 28.3 | 28.4 | 82.5 |
| Qwen3-32B Non-thinking | 50.2 | 23.5 | 24.8 | 19.6 | 72.5 |
| MUA-RL-14B Non-thinking | 66.0 | 38.0 | 33.4 | 25.3 | 78.3 |
| Qwen3-14B Non-thinking | 43.1 | 14.8 | 29.9 | 17.6 | 60.0 |
| MUA-RL-8B Non-thinking | 49.8 | 19.0 | 21.8 | 14.6 | 53.3 |
| Qwen3-8B Non-thinking | 41.0 | 12.5 | 19.1 | 11.8 | 39.2 |
MUA-RL matches or outperforms larger open-source models such as DeepSeek-V3-0324 and Qwen3-235B-A22B in the non-thinking setting.
## Training Details

### Architecture
- Model Size: 8B parameters
- Train Context Length: 32K tokens
### Training Process
- Reinforcement Learning: Group Relative Policy Optimization (GRPO)
- User Simulation: LLM-simulated users integrated into RL loop (GPT-4o-2024-11-20 as user)
- Environment Management: Environment creation for each rollout
- Tool Integration: Seamless tool calling and response handling
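GRPO, named above, scores each rollout relative to a group of rollouts for the same task rather than against a learned value function. A minimal sketch of the standard group-relative advantage computation (illustrative only, not the MUA-RL training code):

```python
# Group-relative advantages as used in GRPO: each rollout's reward is
# normalized by the mean and standard deviation of its group, so rollouts
# that beat their siblings get positive advantage and vice versa.

def grpo_advantages(rewards, eps=1e-8):
    """Compute group-relative advantages for one group of rollout rewards."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # eps guards against division by zero when all rewards are equal
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, a group with rewards `[1.0, 0.0, 1.0, 0.0]` yields advantages close to `[1, -1, 1, -1]`: successful rollouts are reinforced, unsuccessful ones penalized.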
## Citation
If you use MUA-RL in your research, please cite our paper:
```bibtex
@misc{zhao2025mua,
  title={MUA-RL: Multi-turn User-interacting Agent Reinforcement Learning for Agentic Tool Use},
  author={Weikang Zhao and Xili Wang and Chengdi Ma and Lingbin Kong and Zhaohua Yang and Mingxiang Tuo and Xiaowei Shi and Yitao Zhai and Xunliang Cai},
  year={2025},
  eprint={2508.18669},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2508.18669}
}
```