# xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning
Welcome to xRouter, Salesforce AI Research's intelligent LLM routing system trained with reinforcement learning to dynamically select optimal models from 20+ available LLMs while optimizing for both performance and cost.
Modern LLM deployments face a widening cost-performance spectrum: premium models deliver strong reasoning but are expensive, while lightweight models are economical yet brittle on complex tasks. xRouter learns end-to-end routing policies that balance quality and cost through explicit cost-aware reward shaping, eliminating the need for hand-engineered routing rules.
## Highlights
- Cost-Aware Optimization: RL-trained policies minimize costs (up to 60% reduction) while maintaining quality
- Adaptive Routing: Selects models dynamically by query complexity, routing simple queries to budget models and complex ones to premium models
- Tool-Calling Architecture: Learns to effectively invoke 20+ models (GPT-5, o3/o4, DeepSeek R1, Qwen3, Kimi K2, etc.) and select the best response (see the sketch after this list)
- Multi-Model Orchestration: Coordinates responses from multiple LLMs for complex reasoning tasks
- Learned Prompt Engineering: Automatically generates optimized system prompts for target models
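The checkpoint's actual tool schema is internal, but as a rough sketch, assuming the router exposes downstream models through OpenAI-style function calling, a model-invocation tool might look like the following (the `invoke_model` name, its parameters, and the abbreviated model list are hypothetical):

```python
# Hypothetical sketch of a routing tool in the OpenAI function-calling
# format. Tool name, parameters, and model list are illustrative only,
# not the released checkpoint's actual schema.
invoke_model_tool = {
    "type": "function",
    "function": {
        "name": "invoke_model",  # hypothetical name
        "description": "Call a downstream LLM and return its response.",
        "parameters": {
            "type": "object",
            "properties": {
                "model": {
                    "type": "string",
                    # abbreviated, illustrative subset of the 20+ models
                    "enum": ["gpt-5", "gpt-4.1-mini", "deepseek-r1", "qwen3-coder-480b"],
                    "description": "Downstream model to route the query to.",
                },
                "system_prompt": {
                    "type": "string",
                    "description": "System prompt generated by the router for the target model.",
                },
                "query": {
                    "type": "string",
                    "description": "The (possibly rewritten) user query.",
                },
            },
            "required": ["model", "query"],
        },
    },
}
```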
## Model Details
- Developed by: Salesforce AI Research
- Base Model: Qwen/Qwen2.5-7B-Instruct
- Model Type: Instruction-tuned language model with tool-calling capabilities
- Training Algorithm: DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) with cost-aware reward shaping
- Training Data: Derived from Reasoning360 - math, code, reasoning, and STEM tasks
- License: CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International)
## Key Results
- Substantial cost reductions (up to 60%) at comparable task completion rates
- Evaluated on 17 diverse benchmarks spanning math, coding, reasoning, and out-of-distribution (OOD) tasks
- Adaptive behavior: Learns when to use premium vs. budget models without explicit rules
- Multi-turn reasoning: Effectively coordinates multiple model calls for complex tasks
For detailed results, see our paper.
## Usage

### Installation
```bash
# Clone the repository
git clone https://github.com/SalesforceAIResearch/xRouter.git
cd xRouter

# Set up environment
conda create -n xrouter python=3.12
conda activate xrouter
pip install uv
uv pip install torch==2.6.0
uv pip install flash-attn==2.7.3 --no-build-isolation
uv pip install -e .[gpu,math,vllm,test]
pip install litellm rich python-dotenv
```
### Configure API Keys

```bash
export OPENAI_API_KEY="your_openai_key"
export TOGETHER_API_KEY="your_together_key"
export GEMINI_API_KEY="your_gemini_key"  # optional
```
## Deployment

```bash
# Host the router model
cd evaluation
bash host_router.sh   # Serves on port 8000

# Launch the router API (in another terminal)
bash serve_router.sh  # Serves on port 8800
```
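Once both services are up, you can sanity-check the router endpoint. A minimal check, assuming the router API is OpenAI-compatible and serves the standard `/v1/models` listing:

```python
# Minimal smoke test, assuming the router API at port 8800 is
# OpenAI-compatible and exposes the standard /v1/models endpoint.
import openai

client = openai.OpenAI(base_url="http://localhost:8800/v1", api_key="dummy")
for model in client.models.list():
    print(model.id)  # should include the router model, e.g. "router-tool-rl"
```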
## Usage Example

```python
import openai

# Initialize client pointed at the local router API
client = openai.OpenAI(
    base_url="http://localhost:8800/v1",
    api_key="dummy"
)

# Send request
response = client.chat.completions.create(
    model="router-tool-rl",
    messages=[
        {"role": "user", "content": "Solve: If x^2 + 2x + 1 = 0, what are the values of x?"}
    ],
    max_tokens=1000
)

print(response.choices[0].message.content)

# Access routing metadata
metadata = response.router_metadata
print(f"Model used: {metadata['model_used']}")
print(f"Total cost: ${metadata['total_cost']:.6f}")
```
## Training Methodology
xRouter uses DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) with cost-aware reward shaping:

reward = quality − λ × normalized_cost
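The paper's exact quality signal and cost normalization are not reproduced here; a minimal sketch of the shaped reward, assuming a binary task-success quality score, per-token pricing, and a fixed normalization constant, might look like this:

```python
# Minimal sketch of cost-aware reward shaping. The binary quality signal,
# pricing inputs, and normalization constant are assumptions for
# illustration, not the paper's exact formulation.
def shaped_reward(
    success: bool,            # task-level correctness signal (assumed binary)
    prompt_tokens: int,
    completion_tokens: int,
    price_in: float,          # $ per 1M prompt tokens for the routed model
    price_out: float,         # $ per 1M completion tokens
    lam: float = 0.5,         # cost penalty weight λ (assumed value)
    max_cost: float = 0.05,   # normalization constant in $ (assumed)
) -> float:
    cost = (prompt_tokens * price_in + completion_tokens * price_out) / 1e6
    normalized_cost = min(cost / max_cost, 1.0)
    quality = 1.0 if success else 0.0
    return quality - lam * normalized_cost

# Example: a correct answer from a cheap model keeps most of its reward,
# while the same answer from an expensive model is penalized.
print(shaped_reward(True, 500, 800, price_in=0.15, price_out=0.60))   # budget model
print(shaped_reward(True, 500, 800, price_in=10.0, price_out=40.0))   # premium model
```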
Training Features:
- Cost-aware rewards penalize expensive routing decisions
- Multi-turn credit assignment across conversation turns
- Tool augmentation with 20+ model-invocation tools plus response selection
- Curriculum learning from simple to complex tasks
Supported Model Tiers:
| Tier | Models | Best For |
|---|---|---|
| Premium | GPT-5, GPT-4.1, o3, Qwen3-235B-Instruct, Kimi K2 | Mission-critical tasks |
| Standard | GPT-5-Mini, GPT-4.1-Mini, o4-Mini, GPT-OSS-120B | Balanced performance |
| Budget | GPT-5-Nano, GPT-4.1-Nano, GPT-4o-Mini, GPT-OSS-20B | High-volume tasks |
| Specialized | o3, DeepSeek-R1, Qwen3-235B-Thinking, Qwen3-Coder-480B | Domain-specific |
## Citation

```bibtex
@article{qian2025xrouter,
  title={xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning},
  author={Qian, Cheng and Liu, Zuxin and Kokane, Shirley and Prabhakar, Akshara and Qiu, Jielin and Chen, Haolin and Liu, Zhiwei and Ji, Heng and Yao, Weiran and Heinecke, Shelby and Savarese, Silvio and Xiong, Caiming and Wang, Huan},
  journal={arXiv preprint arXiv:2510.08439},
  year={2025}
}
```
## Resources

- Paper: arXiv:2510.08439
- Code Repository: github.com/SalesforceAIResearch/xRouter
- Model Hub: Salesforce/xRouter
## Acknowledgements
This project builds upon exceptional work from the open-source community:
- Reasoning360: Foundational RL training framework
- VERL: RL infrastructure for distributed LLM training
- SGLang: High-performance LLM serving backend
- LiteLLM: Unified API interface for 20+ LLM providers
Developed by Salesforce AI Research