# Introduction

## Model Description

We present **MUA-RL**, a multi-turn user-interacting agent reinforcement learning model for agentic tool use. It targets **multi-turn conversation scenarios** in which an agent must maintain context across turns while using tools effectively to complete complex tasks.

MUA-RL is the first framework to integrate LLM-simulated users into the reinforcement learning loop for agentic tool use, enabling models to autonomously learn to communicate with users efficiently and to use a variety of tools to solve practical problems in dynamic multi-turn interactions.

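The interaction pattern can be sketched as follows. This is a minimal illustration only, under the assumption of a simple agent/user/tool turn-taking loop; `agent_step`, `simulated_user_reply`, and `call_tool` are hypothetical stand-ins for the policy model, the LLM-simulated user, and the tool environment, not the released API:

```python
# Minimal sketch of one multi-turn rollout with an LLM-simulated user.
# All three functions are hypothetical placeholders for real model/API calls.

def simulated_user_reply(history):
    """Stand-in for the LLM-simulated user driving the dialogue."""
    turns = sum(1 for m in history if m["role"] == "user")
    return "Thanks, that is all." if turns >= 2 else "Please check my order status."

def agent_step(history):
    """Stand-in for the policy model: emits a tool call or a user-facing reply."""
    if history and history[-1]["role"] == "tool":
        return {"role": "assistant", "content": "Your order has shipped."}
    return {"role": "assistant", "tool_call": {"name": "get_order", "args": {"id": 42}}}

def call_tool(tool_call):
    """Stand-in for executing a tool inside the per-rollout environment."""
    return {"role": "tool", "content": f"{tool_call['name']} -> shipped"}

def rollout(max_turns=8):
    history = [{"role": "user", "content": simulated_user_reply([])}]
    for _ in range(max_turns):
        action = agent_step(history)
        if "tool_call" in action:      # agent acts in the environment
            history.append(action)
            history.append(call_tool(action["tool_call"]))
        else:                          # agent replies; simulated user answers
            history.append(action)
            reply = simulated_user_reply(history)
            history.append({"role": "user", "content": reply})
            if reply == "Thanks, that is all.":  # user signals completion
                break
    return history

history = rollout()
```

In training, the reward for the finished rollout would then score whether the task was actually completed, rather than scoring individual turns.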
## Performance

MUA-RL achieves competitive performance across multiple multi-turn tool-using benchmarks:

| Model | TAU2 Retail | TAU2 Airline | TAU2 Telecom | BFCL-V3 Multi Turn | ACEBench Agent |
|------------------------------|-------------|--------------|--------------|--------------------|----------------|
| GPT-4.1 | 70.2 | 53.0 | 38.9 | 40.5 | 86.7 |
| DeepSeek-V3-0324 | 64.7 | 37.0 | 32.9 | 29.8 | 74.2 |
| Qwen3-235B-A22B Non-thinking | 64.9 | 36.0 | 24.6 | 30.0 | 71.7 |
| MUA-RL-32B Non-thinking | 67.3 | 45.4 | 28.3 | 28.4 | 82.5 |
| Qwen3-32B Non-thinking | 50.2 | 23.5 | 24.8 | 19.6 | 72.5 |
| MUA-RL-14B Non-thinking | 66.0 | 38.0 | 33.4 | 25.3 | 78.3 |
| Qwen3-14B Non-thinking | 43.1 | 14.8 | 29.9 | 17.6 | 60.0 |
| MUA-RL-8B Non-thinking | 49.8 | 19.0 | 21.8 | 14.6 | 53.3 |
| Qwen3-8B Non-thinking | 41.0 | 12.5 | 19.1 | 11.8 | 39.2 |

In non-thinking settings, MUA-RL-32B matches or outperforms larger open-source models such as DeepSeek-V3-0324 and Qwen3-235B-A22B.

## Training Details

### Architecture
- **Base Model**: Qwen3
- **Model Size**: 32B parameters
- **Training Context Length**: 32K tokens

### Training Process
- **Reinforcement Learning**: Group Relative Policy Optimization (GRPO)
- **User Simulation**: LLM-simulated users integrated into the RL loop (GPT-4o-2024-11-20 as the user)
- **Environment Management**: A fresh environment is created for each rollout
- **Tool Integration**: Tool calling and tool-response handling built into the rollout loop
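As a rough illustration of the group-relative part of GRPO (a sketch of the standard formulation, not the exact training code): for each user task, a group of rollouts is sampled, and each rollout's advantage is its reward normalized by the group's mean and standard deviation, so rollouts are compared against their own group rather than a learned value baseline.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each rollout's reward by the
    mean and (population) std of its sampled group, one group per task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One group of 4 rollouts for the same simulated-user task
# (e.g., binary task-completion rewards):
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts in the group receive positive advantages and failed ones negative, which is what pushes the policy toward behaviors that complete the user's task.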
## Citation

If you use MUA-RL in your research, please cite our paper:

```bibtex
@misc{zhao2025mua,
      title={MUA-RL: Multi-turn User-interacting Agent Reinforcement Learning for Agentic Tool Use},
      author={Weikang Zhao and Xili Wang and Chengdi Ma and Lingbin Kong and Zhaohua Yang and Mingxiang Tuo and Xiaowei Shi and Yitao Zhai and Xunliang Cai},
      year={2025},
      eprint={2508.18669},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.18669}
}
```