# Introduction

## Model Description

We present **MUA-RL**, a multi-turn user-interacting agent reinforcement learning model for agentic tool use. It targets **multi-turn conversation scenarios** in which an agent must maintain context across turns while using tools effectively to complete complex tasks.

MUA-RL is the first framework to integrate LLM-simulated users into the reinforcement learning loop for agentic tool use, enabling models to autonomously learn to communicate with users efficiently and to use a variety of tools to solve practical problems in dynamic multi-turn interactions.

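The interaction pattern can be sketched as follows. This is a minimal illustration only, under the assumption of a simple agent/user/tool turn-taking loop; `agent_step`, `simulated_user_reply`, and `call_tool` are hypothetical stand-ins for the policy model, the LLM-simulated user, and the tool environment, not the released API:

```python
# Minimal sketch of one multi-turn rollout with an LLM-simulated user.
# All three functions are hypothetical placeholders for real model/API calls.

def simulated_user_reply(history):
    """Stand-in for the LLM-simulated user driving the dialogue."""
    turns = sum(1 for m in history if m["role"] == "user")
    return "Thanks, that is all." if turns >= 2 else "Please check my order status."

def agent_step(history):
    """Stand-in for the policy model: emits a tool call or a user-facing reply."""
    if history and history[-1]["role"] == "tool":
        return {"role": "assistant", "content": "Your order has shipped."}
    return {"role": "assistant", "tool_call": {"name": "get_order", "args": {"id": 42}}}

def call_tool(tool_call):
    """Stand-in for executing a tool inside the per-rollout environment."""
    return {"role": "tool", "content": f"{tool_call['name']} -> shipped"}

def rollout(max_turns=8):
    history = [{"role": "user", "content": simulated_user_reply([])}]
    for _ in range(max_turns):
        action = agent_step(history)
        if "tool_call" in action:      # agent acts in the environment
            history.append(action)
            history.append(call_tool(action["tool_call"]))
        else:                          # agent replies; simulated user answers
            history.append(action)
            reply = simulated_user_reply(history)
            history.append({"role": "user", "content": reply})
            if reply == "Thanks, that is all.":  # user signals completion
                break
    return history

history = rollout()
```

In training, the reward for the finished rollout would then score whether the task was actually completed, rather than scoring individual turns.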
## Performance

MUA-RL achieves competitive performance across multiple multi-turn tool-using benchmarks:

| Model | TAU2 Retail | TAU2 Airline | TAU2 Telecom | BFCL-V3 Multi Turn | ACEBench Agent |
|------------------------------|-------------|--------------|--------------|--------------------|----------------|
| GPT-4.1 | 70.2 | 53.0 | 38.9 | 40.5 | 86.7 |
| DeepSeek-V3-0324 | 64.7 | 37.0 | 32.9 | 29.8 | 74.2 |
| Qwen3-235B-A22B Non-thinking | 64.9 | 36.0 | 24.6 | 30.0 | 71.7 |
| MUA-RL-32B Non-thinking | 67.3 | 45.4 | 28.3 | 28.4 | 82.5 |
| Qwen3-32B Non-thinking | 50.2 | 23.5 | 24.8 | 19.6 | 72.5 |
| MUA-RL-14B Non-thinking | 66.0 | 38.0 | 33.4 | 25.3 | 78.3 |
| Qwen3-14B Non-thinking | 43.1 | 14.8 | 29.9 | 17.6 | 60.0 |
| MUA-RL-8B Non-thinking | 49.8 | 19.0 | 21.8 | 14.6 | 53.3 |
| Qwen3-8B Non-thinking | 41.0 | 12.5 | 19.1 | 11.8 | 39.2 |

In non-thinking settings, MUA-RL-32B matches or outperforms larger open-source models such as DeepSeek-V3-0324 and Qwen3-235B-A22B.

## Training Details

### Architecture
- **Base Model**: Qwen3
- **Model Size**: 32B parameters
- **Training Context Length**: 32K tokens

### Training Process
- **Reinforcement Learning**: Group Relative Policy Optimization (GRPO)
- **User Simulation**: LLM-simulated users integrated into the RL loop (GPT-4o-2024-11-20 as the user)
- **Environment Management**: A fresh environment is created for each rollout
- **Tool Integration**: Tool calling and tool-response handling built into the rollout loop
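As a rough illustration of the group-relative part of GRPO (a sketch of the standard formulation, not the exact training code): for each user task, a group of rollouts is sampled, and each rollout's advantage is its reward normalized by the group's mean and standard deviation, so rollouts are compared against their own group rather than a learned value baseline.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each rollout's reward by the
    mean and (population) std of its sampled group, one group per task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One group of 4 rollouts for the same simulated-user task
# (e.g., binary task-completion rewards):
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts in the group receive positive advantages and failed ones negative, which is what pushes the policy toward behaviors that complete the user's task.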
## Citation

If you use MUA-RL in your research, please cite our paper:

```bibtex
@misc{zhao2025mua,
      title={MUA-RL: Multi-turn User-interacting Agent Reinforcement Learning for Agentic Tool Use},
      author={Weikang Zhao and Xili Wang and Chengdi Ma and Lingbin Kong and Zhaohua Yang and Mingxiang Tuo and Xiaowei Shi and Yitao Zhai and Xunliang Cai},
      year={2025},
      eprint={2508.18669},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.18669}
}
```