xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning


Welcome to xRouter, Salesforce AI Research's intelligent LLM routing system, trained with reinforcement learning to dynamically select the best model from a pool of 20+ LLMs while optimizing for both performance and cost.

Modern LLM deployments face a widening cost-performance spectrum: premium models deliver strong reasoning but are expensive, while lightweight models are economical yet brittle on complex tasks. xRouter learns end-to-end routing policies that balance quality and cost through explicit cost-aware reward shaping, eliminating the need for hand-engineered routing rules.

⭐ Highlights

  • Cost-Aware Optimization: RL-trained policies minimize costs (up to 60% reduction) while maintaining quality
  • Adaptive Routing: Dynamic model selection based on query complexity, routing simple queries to budget models and complex ones to premium models
  • Tool-Calling Architecture: Learns to invoke 20+ models (GPT-5, o3/o4, DeepSeek R1, Qwen3, Kimi K2, etc.) as tools and select the best response
  • Multi-Model Orchestration: Coordinates responses from multiple LLMs for complex reasoning tasks
  • Learned Prompt Engineering: Automatically generates optimized system prompts for target models
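The tool-calling design above can be pictured as an OpenAI-style function schema that the router policy emits calls against. This is a minimal sketch: the tool name `call_model` and its parameters are illustrative assumptions, not xRouter's actual schema.

```python
# Hypothetical tool schema a router policy might expose to the base model.
# The name "call_model" and parameter layout are illustrative, not xRouter's API.
ROUTE_TOOL = {
    "type": "function",
    "function": {
        "name": "call_model",
        "description": "Invoke a downstream LLM and return its response.",
        "parameters": {
            "type": "object",
            "properties": {
                "model": {
                    "type": "string",
                    "description": "Target model, e.g. 'gpt-4.1-mini' or 'deepseek-r1'.",
                },
                "system_prompt": {
                    "type": "string",
                    "description": "System prompt the router generates for the target model.",
                },
                "user_prompt": {"type": "string"},
            },
            "required": ["model", "user_prompt"],
        },
    },
}
```

The router can emit several such calls in one turn (multi-model orchestration) and then select among the returned responses.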

πŸ“Š Model Details

  • Developed by: Salesforce AI Research
  • Base Model: Qwen/Qwen2.5-7B-Instruct
  • Model Type: Instruction-tuned language model with tool-calling capabilities
  • Training Algorithm: DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) with cost-aware reward shaping
  • Training Data: Derived from Reasoning360 - math, code, reasoning, and STEM tasks
  • License: CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International)

πŸ“ˆ Key Results

  • Substantial cost reductions (up to 60%) at comparable task completion rates
  • Evaluated on 17 diverse benchmarks spanning math, coding, reasoning, and out-of-distribution (OOD) tasks
  • Adaptive behavior: Learns when to use premium vs. budget models without explicit rules
  • Multi-turn reasoning: Effectively coordinates multiple model calls for complex tasks

For detailed results, see our paper.

πŸ› οΈ Usage

Installation

# Clone the repository
git clone https://github.com/SalesforceAIResearch/xRouter.git
cd xRouter

# Set up environment
conda create -n xrouter python=3.12
conda activate xrouter

pip install uv
uv pip install torch==2.6.0
uv pip install flash-attn==2.7.3 --no-build-isolation
uv pip install -e .[gpu,math,vllm,test]
pip install litellm rich python-dotenv

Configure API Keys

export OPENAI_API_KEY="your_openai_key"
export TOGETHER_API_KEY="your_together_key"
export GEMINI_API_KEY="your_gemini_key"  # optional
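A quick sanity check before starting the router can catch a forgotten export. This is a minimal sketch; the key names mirror the export lines above, and `missing_keys` is a hypothetical helper, not part of xRouter.

```python
import os

def missing_keys(required=("OPENAI_API_KEY", "TOGETHER_API_KEY")) -> list:
    """Return the required provider keys that are unset or empty in the environment."""
    return [k for k in required if not os.environ.get(k)]

if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        raise SystemExit(f"Missing API keys: {', '.join(missing)}")
```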

πŸš€ Deployment

# Host the router model
cd evaluation
bash host_router.sh  # Serves on port 8000

# Launch the router API (in another terminal)
bash serve_router.sh  # Serves on port 8800

πŸ’¬ Usage Example

import openai

# Initialize client
client = openai.OpenAI(
    base_url="http://localhost:8800/v1",
    api_key="dummy"
)

# Send request
response = client.chat.completions.create(
    model="router-tool-rl",
    messages=[
        {"role": "user", "content": "Solve: If x^2 + 2x + 1 = 0, what are the values of x?"}
    ],
    max_tokens=1000
)

print(response.choices[0].message.content)

# Access routing metadata
metadata = response.router_metadata
print(f"Model used: {metadata['model_used']}")
print(f"Total cost: ${metadata['total_cost']:.6f}")

πŸŽ“ Training Methodology

xRouter is trained with DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) using cost-aware reward shaping:

reward = quality - Ξ» Γ— normalized_cost
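A minimal sketch of this shaping, assuming quality lies in [0, 1] (e.g. a task pass rate); the value of λ and the cost-normalization constant below are illustrative choices, not the paper's settings.

```python
def shaped_reward(quality: float, cost_usd: float,
                  max_cost_usd: float = 1.0, lam: float = 0.5) -> float:
    """Cost-aware reward: quality minus a scaled, normalized cost.

    quality is assumed in [0, 1]; lam and max_cost_usd are illustrative,
    not the values used in the paper.
    """
    normalized_cost = min(cost_usd / max_cost_usd, 1.0)  # clip to [0, 1]
    return quality - lam * normalized_cost

# A correct answer from a cheap model outscores the same answer from a
# premium model, which is what pushes the policy toward budget routing:
cheap = shaped_reward(quality=1.0, cost_usd=0.002)   # 0.999
premium = shaped_reward(quality=1.0, cost_usd=0.40)  # 0.80
```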

Training Features:

  • Cost-aware rewards penalize expensive routing decisions
  • Multi-turn credit assignment across conversation turns
  • Tool augmentation with 20+ model tools + response selection
  • Curriculum learning from simple to complex tasks

Supported Model Tiers:

| Tier | Models | Best For |
|------|--------|----------|
| Premium | GPT-5, GPT-4.1, o3, Qwen3-235B-Instruct, Kimi K2 | Mission-critical tasks |
| Standard | GPT-5-Mini, GPT-4.1-Mini, o4-Mini, GPT-OSS-120B | Balanced performance |
| Budget | GPT-5-Nano, GPT-4.1-Nano, GPT-4o-Mini, GPT-OSS-20B | High-volume tasks |
| Specialized | o3, DeepSeek-R1, Qwen3-235B-Thinking, Qwen3-Coder-480B | Domain-specific tasks |
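A static lookup over the tiers above illustrates the cost/quality trade-off the learned policy internalizes (the trained router replaces any such hand-written table). The lowercased model identifiers and the `tier_of` helper are illustrative, not part of xRouter's code.

```python
# Illustrative tier table mirroring the one above; a learned RL policy
# replaces this static mapping in practice.
MODEL_TIERS = {
    "premium":  ["gpt-5", "gpt-4.1", "o3", "qwen3-235b-instruct", "kimi-k2"],
    "standard": ["gpt-5-mini", "gpt-4.1-mini", "o4-mini", "gpt-oss-120b"],
    "budget":   ["gpt-5-nano", "gpt-4.1-nano", "gpt-4o-mini", "gpt-oss-20b"],
}

def tier_of(model: str) -> str:
    """Return the tier a model belongs to, or 'unknown' if unlisted."""
    for tier, models in MODEL_TIERS.items():
        if model in models:
            return tier
    return "unknown"
```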

πŸ“š Citation

@article{qian2025xrouter,
  title={xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning},
  author={Qian, Cheng and Liu, Zuxin and Kokane, Shirley and Prabhakar, Akshara and Qiu, Jielin and Chen, Haolin and Liu, Zhiwei and Ji, Heng and Yao, Weiran and Heinecke, Shelby and Savarese, Silvio and Xiong, Caiming and Wang, Huan},
  journal={arXiv preprint arXiv:2510.08439},
  year={2025}
}

πŸ”— Resources

πŸ™ Acknowledgements

This project builds upon exceptional work from the open-source community:

  • Reasoning360: Foundational RL training framework
  • VERL: RL infrastructure for distributed LLM training
  • SGLang: High-performance LLM serving backend
  • LiteLLM: Unified API interface for 20+ LLM providers

🏒 Developed by Salesforce AI Research
