xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning


Welcome to xRouter, Salesforce AI Research's intelligent LLM routing system, trained with reinforcement learning to dynamically select the best model from a pool of 20+ LLMs while optimizing for both performance and cost.

Modern LLM deployments face a widening cost-performance spectrum: premium models deliver strong reasoning but are expensive, while lightweight models are economical yet brittle on complex tasks. xRouter learns end-to-end routing policies that balance quality and cost through explicit cost-aware reward shaping, eliminating the need for hand-engineered routing rules.

⭐ Highlights

  • Cost-Aware Optimization: RL-trained policies minimize costs (up to 60% reduction) while maintaining quality
  • Adaptive Routing: Dynamic model selection based on query complexity, routing simple queries to budget models and complex ones to premium models
  • Tool-Calling Architecture: Learns to invoke 20+ models (GPT-5, o3/o4, DeepSeek R1, Qwen3, Kimi K2, etc.) as tools and select the best response
  • Multi-Model Orchestration: Coordinates responses from multiple LLMs for complex reasoning tasks
  • Learned Prompt Engineering: Automatically generates optimized system prompts for target models
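The tool-calling design above can be pictured as an OpenAI-style function schema that the router policy emits calls against. This is a minimal sketch: the tool name `call_model` and its parameters are illustrative assumptions, not xRouter's actual schema.

```python
# Hypothetical tool schema a router policy might expose to the base model.
# The name "call_model" and parameter layout are illustrative, not xRouter's API.
ROUTE_TOOL = {
    "type": "function",
    "function": {
        "name": "call_model",
        "description": "Invoke a downstream LLM and return its response.",
        "parameters": {
            "type": "object",
            "properties": {
                "model": {
                    "type": "string",
                    "description": "Target model, e.g. 'gpt-4.1-mini' or 'deepseek-r1'.",
                },
                "system_prompt": {
                    "type": "string",
                    "description": "System prompt the router generates for the target model.",
                },
                "user_prompt": {"type": "string"},
            },
            "required": ["model", "user_prompt"],
        },
    },
}
```

The router can emit several such calls in one turn (multi-model orchestration) and then select among the returned responses.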

πŸ“Š Model Details

  • Developed by: Salesforce AI Research
  • Base Model: Qwen/Qwen2.5-7B-Instruct
  • Model Type: Instruction-tuned language model with tool-calling capabilities
  • Training Algorithm: DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) with cost-aware reward shaping
  • Training Data: Derived from Reasoning360 - math, code, reasoning, and STEM tasks
  • License: CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International)

πŸ“ˆ Key Results

  • Substantial cost reductions (up to 60%) at comparable task completion rates
  • Evaluated on 17 diverse benchmarks spanning math, coding, reasoning, and out-of-distribution (OOD) tasks
  • Adaptive behavior: Learns when to use premium vs. budget models without explicit rules
  • Multi-turn reasoning: Effectively coordinates multiple model calls for complex tasks

For detailed results, see our paper.

πŸ› οΈ Usage

Installation

# Clone the repository
git clone https://github.com/SalesforceAIResearch/xRouter.git
cd xRouter

# Set up environment
conda create -n xrouter python=3.12
conda activate xrouter

pip install uv
uv pip install torch==2.6.0
uv pip install flash-attn==2.7.3 --no-build-isolation
uv pip install -e .[gpu,math,vllm,test]
pip install litellm rich python-dotenv

Configure API Keys

export OPENAI_API_KEY="your_openai_key"
export TOGETHER_API_KEY="your_together_key"
export GEMINI_API_KEY="your_gemini_key"  # optional
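A quick sanity check before starting the router can catch a forgotten export. This is a minimal sketch; the key names mirror the export lines above, and `missing_keys` is a hypothetical helper, not part of xRouter.

```python
import os

def missing_keys(required=("OPENAI_API_KEY", "TOGETHER_API_KEY")) -> list:
    """Return the required provider keys that are unset or empty in the environment."""
    return [k for k in required if not os.environ.get(k)]

if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        raise SystemExit(f"Missing API keys: {', '.join(missing)}")
```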

πŸš€ Deployment

# Host the router model
cd evaluation
bash host_router.sh  # Serves on port 8000

# Launch the router API (in another terminal)
bash serve_router.sh  # Serves on port 8800

πŸ’¬ Usage Example

import openai

# Initialize client
client = openai.OpenAI(
    base_url="http://localhost:8800/v1",
    api_key="dummy"
)

# Send request
response = client.chat.completions.create(
    model="router-tool-rl",
    messages=[
        {"role": "user", "content": "Solve: If x^2 + 2x + 1 = 0, what are the values of x?"}
    ],
    max_tokens=1000
)

print(response.choices[0].message.content)

# Access routing metadata
metadata = response.router_metadata
print(f"Model used: {metadata['model_used']}")
print(f"Total cost: ${metadata['total_cost']:.6f}")

πŸŽ“ Training Methodology

xRouter is trained with DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) using cost-aware reward shaping:

reward = quality - Ξ» Γ— normalized_cost
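A minimal sketch of this shaping, assuming quality lies in [0, 1] (e.g. a task pass rate); the value of λ and the cost-normalization constant below are illustrative choices, not the paper's settings.

```python
def shaped_reward(quality: float, cost_usd: float,
                  max_cost_usd: float = 1.0, lam: float = 0.5) -> float:
    """Cost-aware reward: quality minus a scaled, normalized cost.

    quality is assumed in [0, 1]; lam and max_cost_usd are illustrative,
    not the values used in the paper.
    """
    normalized_cost = min(cost_usd / max_cost_usd, 1.0)  # clip to [0, 1]
    return quality - lam * normalized_cost

# A correct answer from a cheap model outscores the same answer from a
# premium model, which is what pushes the policy toward budget routing:
cheap = shaped_reward(quality=1.0, cost_usd=0.002)   # 0.999
premium = shaped_reward(quality=1.0, cost_usd=0.40)  # 0.80
```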

Training Features:

  • Cost-aware rewards penalize expensive routing decisions
  • Multi-turn credit assignment across conversation turns
  • Tool augmentation with 20+ model tools + response selection
  • Curriculum learning from simple to complex tasks

Supported Model Tiers:

| Tier | Models | Best For |
|------|--------|----------|
| Premium | GPT-5, GPT-4.1, o3, Qwen3-235B-Instruct, Kimi K2 | Mission-critical tasks |
| Standard | GPT-5-Mini, GPT-4.1-Mini, o4-Mini, GPT-OSS-120B | Balanced performance |
| Budget | GPT-5-Nano, GPT-4.1-Nano, GPT-4o-Mini, GPT-OSS-20B | High-volume tasks |
| Specialized | o3, DeepSeek-R1, Qwen3-235B-Thinking, Qwen3-Coder-480B | Domain-specific tasks |
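A static lookup over the tiers above illustrates the cost/quality trade-off the learned policy internalizes (the trained router replaces any such hand-written table). The lowercased model identifiers and the `tier_of` helper are illustrative, not part of xRouter's code.

```python
# Illustrative tier table mirroring the one above; a learned RL policy
# replaces this static mapping in practice.
MODEL_TIERS = {
    "premium":  ["gpt-5", "gpt-4.1", "o3", "qwen3-235b-instruct", "kimi-k2"],
    "standard": ["gpt-5-mini", "gpt-4.1-mini", "o4-mini", "gpt-oss-120b"],
    "budget":   ["gpt-5-nano", "gpt-4.1-nano", "gpt-4o-mini", "gpt-oss-20b"],
}

def tier_of(model: str) -> str:
    """Return the tier a model belongs to, or 'unknown' if unlisted."""
    for tier, models in MODEL_TIERS.items():
        if model in models:
            return tier
    return "unknown"
```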

πŸ“š Citation

@article{qian2025xrouter,
  title={xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning},
  author={Qian, Cheng and Liu, Zuxin and Kokane, Shirley and Prabhakar, Akshara and Qiu, Jielin and Chen, Haolin and Liu, Zhiwei and Ji, Heng and Yao, Weiran and Heinecke, Shelby and Savarese, Silvio and Xiong, Caiming and Wang, Huan},
  journal={arXiv preprint arXiv:2510.08439},
  year={2025}
}

πŸ”— Resources

πŸ™ Acknowledgements

This project builds upon exceptional work from the open-source community:

  • Reasoning360: Foundational RL training framework
  • VERL: RL infrastructure for distributed LLM training
  • SGLang: High-performance LLM serving backend
  • LiteLLM: Unified API interface for 20+ LLM providers

🏒 Developed by Salesforce AI Research
