katanemo/Plano-Orchestrator-30B-A3B

Overview

Plano-Orchestrator is a family of state-of-the-art routing and orchestration models that decide which agent(s) or LLM(s) should handle each request, and in what sequence. Built for multi-agent orchestration systems, Plano-Orchestrator excels at analyzing user intent and conversation context to make precise routing and orchestration decisions. Designed for real-world deployments, it delivers strong performance across general conversations, coding tasks, and long-context multi-turn conversations, while remaining efficient enough for low-latency production environments.

Key capabilities

  • Multi-turn Context Understanding: Makes routing decisions based on full conversation history, maintaining contextual awareness across extended dialogues with evolving user needs.
  • Multi-intent Detection: Identifies when a single user message requires multiple agents simultaneously, enabling parallel/sequential routing to fulfill complex requests.
  • Context-dependent Routing: Correctly interprets ambiguous or referential messages by leveraging prior conversation context for accurate routing decisions.
  • Conversational Flow Handling: Understands diverse interaction patterns including follow-ups, clarifications, confirmations, and corrections within ongoing conversations.
  • Negative Case Detection: Recognizes when no specialized routing is needed, avoiding unnecessary LLM or agent calls for casual conversation.
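The capabilities above all surface in the model's output as a single JSON route list: one name for a single intent, several names for multi-intent requests, and an empty list for the negative case. A minimal sketch of how a dispatcher might consume these three shapes (the route names and the `dispatch to:` wording are illustrative assumptions, not part of the model card):

```python
import json

# Hypothetical model outputs illustrating the three decision shapes;
# the agent names here are examples, not prescribed by the model.
single_intent = '{"route": ["WeatherAgent"]}'
multi_intent = '{"route": ["WeatherAgent", "CalendarAgent"]}'  # one message, two agents
negative_case = '{"route": []}'  # casual chat: no agent call needed

for raw in (single_intent, multi_intent, negative_case):
    routes = json.loads(raw)["route"]
    if routes:
        print("dispatch to:", ", ".join(routes))
    else:
        print("no routing needed; answer directly")
```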

Benchmark

We evaluate on 1,958 user messages across 605 multi-turn conversations with more than 130 different agents, covering three scenarios:

  • General (1,438 messages): Everyday conversational queries spanning diverse topics and agent types
  • Coding (285 messages): Development-focused conversations including debugging, code generation, and technical assistance
  • Long-context (235 messages): Extended conversations requiring understanding of extensive prior context

Each message is annotated with routing-relevant attributes, including but not limited to intent multiplicity, context dependency, and continuation type. The evaluation results are shown below.

Note that all models were evaluated with minimal reasoning to keep routing latency low.
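One annotated benchmark message might look like the sketch below. Only the three attribute categories named above come from this card; the exact field names and values are assumptions for illustration:

```python
# Hypothetical annotation record for a single benchmark message.
# Field names beyond "intent multiplicity", "context dependency", and
# "continuation type" are illustrative assumptions, not the actual schema.
annotation = {
    "message": "San Francisco",
    "intent_multiplicity": "single",       # one vs. multiple agents required
    "context_dependent": True,             # meaning relies on prior turns
    "continuation_type": "clarification",  # e.g. follow-up, confirmation, correction
    "expected_route": ["WeatherAgent"],
}
print(annotation["expected_route"])
```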

Example

import json
import torch

from transformers import AutoTokenizer, AutoModelForCausalLM


ORCHESTRATION_PROMPT = (
    "You are a helpful assistant that selects the most suitable routes based on user intent.\n"
    "You are provided with a list of available routes enclosed within <routes></routes> XML tags:\n"
    "<routes>\n{routes}\n</routes>\n\n"
    "You are also given the conversation context enclosed within <conversation></conversation> XML tags:\n"
    "<conversation>\n{conversation}\n</conversation>\n\n"
    "## Instructions\n"
    "1. Analyze the latest user intent from the conversation.\n"
    "2. Compare it against the available routes to find which routes can help fulfill the request.\n"
    "3. Respond only with the exact route names from <routes>.\n"
    "4. If no routes can help or the intent is already fulfilled, return an empty list.\n\n"
    "## Response Format\n"
    "Return your answer strictly in JSON as follows:\n"
    '{{"route": ["route_name_1", "route_name_2", "..."]}}\n'
    "If no routes are needed, return an empty list for `route`."
)

def convert_agents_to_routes(agents):
    tools = [
        {
            "name": agent["name"],
            "description": agent["description"],
        }
        for agent in agents
    ]
    return "\n".join([json.dumps(tool, ensure_ascii=False) for tool in tools])

def build_messages(available_agents, conversation):
    routes = convert_agents_to_routes(available_agents)
    conversation_str = json.dumps(conversation, indent=4, ensure_ascii=False)
    prompt = ORCHESTRATION_PROMPT.format(routes=routes, conversation=conversation_str)
    return [{"role": "user", "content": prompt}]

# Load model
model_name = "katanemo/Plano-Orchestrator-30B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Define available agents
available_agents = [
    {"name": "WeatherAgent", "description": "Provides weather forecasts and current conditions for any location"},
    {"name": "CodeAgent", "description": "Generates, debugs, explains, and reviews code in multiple programming languages"}
]

# Conversation history
conversation = [
    {"role": "user", "content": "What's the weather like today?"},
    {"role": "assistant", "content": "I can help you with that. Could you tell me your location?"},
    {"role": "user", "content": "San Francisco"},
]

# Build messages and generate
messages = build_messages(available_agents, conversation)
model_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=32768)
generated_ids = [
    output_ids[len(input_ids) :]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
# Output: {"route": ["WeatherAgent"]}
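In production, the decoded response should not be trusted to be well-formed JSON on every generation. A defensive parsing helper (an assumption for illustration, not part of the model card) might look like:

```python
import json

def parse_routes(response: str) -> list[str]:
    """Extract the route list from the model's JSON response.

    Hypothetical helper: falls back to an empty list (the "no routing
    needed" case) if the output is not JSON of the expected shape.
    """
    try:
        routes = json.loads(response).get("route", [])
    except (json.JSONDecodeError, AttributeError):
        return []
    # Keep only string route names, dropping any malformed entries.
    return [r for r in routes if isinstance(r, str)]

print(parse_routes('{"route": ["WeatherAgent"]}'))  # ['WeatherAgent']
print(parse_routes("not json"))                     # []
```

Treating a malformed reply as the empty-route case keeps the orchestrator from crashing and simply falls back to answering directly.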

License

The Plano-Orchestrator collection is distributed under the Katanemo license.
