Apriel-H1-15b-Thinker

/ˈɑː.pri.əl/

A 15B-parameter hybrid reasoning model combining Transformer attention and Mamba State Space layers for high efficiency and scalability. Derived from Apriel-Nemotron-15B-Thinker through progressive distillation, Apriel-H1 replaces less critical attention layers with linear Mamba blocks—achieving over 2× higher inference throughput in vLLM with minimal loss in reasoning, math, and coding performance.

  • Model Size: 15B parameters
  • Context Length: 65K (target; runtime dependent)
  • Languages: English (best)

Highlights

  • Hybrid Transformer–SSM architecture
  • ~2× throughput improvement over the base Thinker model
  • Retains strong reasoning, math, and coding capabilities
  • Built via efficient distillation—no training from scratch required

Model Overview

Apriel-H1-15b-Thinker is designed for agentic tasks, code assistance, and multi-step reasoning. It follows Apriel’s “think then answer” style: the model first produces a hidden chain-of-thought and then a concise final response. Where reasoning traces are undesired, configure prompts to favor concise outputs.

Technical report: Apriel-H1 Report

Efficient and strong among hybrids

[Figure: serving throughput at context lengths from 1K to 16K]

All models were evaluated against vLLM server endpoints using FlashInfer (except AI21-Jamba-Reasoning-3B, which used FlashAttention-2). For NVIDIA-Nemotron-Nano-9B-v2 and AI21-Jamba-Reasoning-3B, mamba_cache was set to fp32.

[Figure: compared with the base Thinker model, Apriel-H1 achieves roughly a 2× speedup]

How to Use

Install dependencies:

pip install transformers==4.53.2

Basic usage with Transformers generate:

import re
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ServiceNow-AI/Apriel-H1-15b-Thinker-SFT"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

prompt = "Positive real numbers $x$ and $y$ satisfy $y^3=x^2$ and $(y-x)^2=4y^2$. What is $x+y$?\nMark your solution with \\boxed"
messages = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    tools=[]
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=1024)
output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

response = re.findall(r"\[BEGIN FINAL RESPONSE\](.*?)\[END FINAL RESPONSE\]", output, re.DOTALL)[0].strip()
print("response:", response)

Recommended settings: temperature 0.6; increase max_new_tokens for complex reasoning.
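
For instance, a sampled-decoding variant of the generate call above (a sketch; the exact token budget depends on task difficulty):

generated_ids = model.generate(
    **model_inputs,
    do_sample=True,
    temperature=0.6,
    max_new_tokens=4096,  # raise further for long reasoning traces
)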

Use it with vLLM

💻 Local Installation

1. Create and activate a Python environment

You can use any environment manager. The example below uses uv:

uv venv --python 3.12 --seed
source .venv/bin/activate

2. Install vLLM and the Apriel plugin

Find our plugin at https://github.com/ServiceNow/apriel. You may need to install a version of vLLM compatible with your CUDA version.

In this example, we use the default CUDA version and let vLLM automatically select the correct backend.

git clone git@github.com:ServiceNow/apriel.git
cd apriel
uv pip install vllm==0.10.2 --torch-backend=auto
pip install .
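
Optionally, verify that the pinned vLLM version imports cleanly before serving:

python -c "import vllm; print(vllm.__version__)"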

🧠 Running a vLLM Server

Option 1: Run locally (from source install)

Once installed, you can launch a vLLM OpenAI-compatible API server with your Apriel model:

vllm serve ServiceNow-AI/Apriel-H1-15b-Thinker-SFT \
  --port 8000
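
Once the server is up, any OpenAI-compatible client can query it. A minimal sketch using the openai Python package (the prompt is illustrative; vLLM does not require a real API key by default):

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API on the configured port
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="ServiceNow-AI/Apriel-H1-15b-Thinker-SFT",
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
    temperature=0.6,
    max_tokens=1024,
)
print(response.choices[0].message.content)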

Option 2: Run via Docker

You can run the server directly using the prebuilt container:

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    ghcr.io/servicenow/apriel:latest \
    --model ServiceNow-AI/Apriel-H1-15b-Thinker-SFT
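
Either way, the server listens on port 8000. A quick smoke test with curl (illustrative payload):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ServiceNow-AI/Apriel-H1-15b-Thinker-SFT",
        "messages": [{"role": "user", "content": "Say hello."}],
        "temperature": 0.6
      }'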

Chat Template

<|system|>
You are a thoughtful and systematic AI assistant built by ServiceNow Language Models (SLAM) lab. Before providing an answer, analyze the problem carefully and present your reasoning step by step. After explaining your thought process, provide the final solution in the following format: [BEGIN FINAL RESPONSE] ... [END FINAL RESPONSE].
<|end|>
<|user|>
# user message here
<|end|>
<|assistant|>
Here are my reasoning steps:
# thoughts here
[BEGIN FINAL RESPONSE]
# assistant response here
[END FINAL RESPONSE]
<|end|>

The model will first generate its thinking process and then generate its final response between [BEGIN FINAL RESPONSE] and [END FINAL RESPONSE]. Here is a code snippet demonstrating the application of the chat template:

from transformers import AutoTokenizer
model_name = "ServiceNow-AI/Apriel-H1-15b-Thinker-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# prepare the model input
custom_system_prompt = "Answer like a pirate."
prompt = "You are an expert assistant in the implementation of customer experience management aspect of retail applications \n \nYou will be using Python as the programming language. \n \nYou will utilize a factory design pattern for the implementation and following the dependency inversion principle \n \nYou will modify the implementation based on user requirements. \n \nUpon user request, you will add, update, and remove the features & enhancements in the implementation provided by you. \n \nYou will ask whether the user wants to refactor the provided code or needs a sample implementation for reference. Upon user confirmation, I will proceed accordingly. \n \n**Guidelines:** \n 1. **User Requirements:** \n - You have to ask users about their requirements, clarify the user expectations, and suggest the best possible solution by providing examples of Python code snippets. \n - Ask users about which type of reports they need to assess the AI model's performance, accuracy, and reliability. \n - After providing the solution, you have to ask the user about the trial of the solution and modify the solution based on the user feedback. \n \n 2. **Libraries/Frameworks:** \n - You will be utilizing Python as a programming language. \n - You will be using Flask framework for REST APIS implementation \n \n 3. **Communication Gesture:** \n - Your conversation with the user should be interactive, supportive, courageous, and professional. \n - You have to break down the complex concepts into sub-concepts and try to explain them to the user. \n - You have to ask the user for the required parameters. If the user refuses to provide in 2 attempts, politely exit the conversation. \n - You have to provide your supported parameters to the user, if the user refuses to accept them then you have to put an apology note and exit the conversation. \n - You have to track the conversation about unasked questions by the user. If some/one of the questions remain then you have to remind the user about these questions and proceed to answer them based on the user's confirmation \n \n 4. **Implementation:** \n - Your code/implementations should be reliable, scalable, modular, and reusable. \n - You will be providing unit tests for the implementation upon user request. \n - You will be following MVC architecture for the applications \n - Your implementations must be well-commented and readable \n \n \n- Today's date is 23rd August 2024. \n- The default sender email is sender-assistant@email.com.\nHi, I am conducting research on retail customer feedback systems and I need assistance with designing and implementing them. Could you kindly provide me with a list of general customer feedback system modules?"
messages = [
    {"role": "user", "content": custom_system_prompt + "\n\n" + prompt}
]
# example tools
tools = [{"type": "function", "function": {"name": "getRetailFeedbackModules", "description": "Returns the list of modules usually present in the retail industry", "parameters": {"type": "object", "properties": {"page": {"type": "integer", "description": "The current page number.", "default": 1}, "page_size": {"type": "integer", "description": "The number of items per page.", "default": 3}}}}}, {"type": "function", "function": {"name": "verifyImplementation", "description": "Verifies the provided code implementation", "parameters": {"type": "object", "properties": {"coding_language": {"type": "string", "description": "The supported languages for verification of implementation.", "default": "python", "enum": ["python", "java", "php"]}, "code": {"type": "string", "description": "The code which needs verification"}, "design_pattern": {"type": "string", "description": "The design pattern to verify in the implementation", "enum": ["factory", "strategy", "singleton"]}, "verify_best_practices": {"type": "boolean", "description": "The verification of the coding style based on the language selected", "default": true}}}}}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    tools=tools
)
model_inputs = tokenizer([text], return_tensors="pt")
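
From here, model_inputs can be passed to model.generate exactly as in the basic example above, and the final answer extracted with the same [BEGIN FINAL RESPONSE] ... [END FINAL RESPONSE] pattern.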

Usage Guidelines

  1. Use the model’s default chat template, which already includes a system prompt. We recommend adding all other instructions within the user message.
  2. We recommend setting temperature to 0.6.
  3. We ensure the model's response starts with "Here are my reasoning steps:\n" during all our evaluations; this is implemented in the default chat template. A helper for splitting the output accordingly follows below.
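
A minimal helper for splitting a completion into the reasoning trace and the final answer, using the markers documented above (a sketch; split_output is a hypothetical name, not part of the model's API):

import re

# Hypothetical helper: split a raw completion into (reasoning, final_answer)
# using the documented [BEGIN FINAL RESPONSE] ... [END FINAL RESPONSE] markers.
def split_output(output: str) -> tuple[str, str]:
    match = re.search(r"\[BEGIN FINAL RESPONSE\](.*?)\[END FINAL RESPONSE\]", output, re.DOTALL)
    if match is None:
        # no final-response block found; treat everything as reasoning
        return output.strip(), ""
    reasoning = output[:match.start()].strip()
    return reasoning, match.group(1).strip()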

Intended Use

The Apriel family of models is designed for a variety of general-purpose instruction tasks, including:

  • Code assistance and generation
  • Logical reasoning and multi-step tasks
  • Question answering and information retrieval
  • Function calling, complex instruction following and agent use cases

They are not intended for use in safety-critical applications without human oversight or in scenarios requiring guaranteed factual accuracy.


Limitations

  • Factual accuracy: May produce incorrect, misleading, or outdated content. Outputs should be verified before use in critical contexts.
  • Bias: May reflect societal, cultural, or systemic biases present in training data.
  • Ethics: Do not use the model to produce harmful, unlawful, or unethical content.
  • Language: Strongest performance is in English. Output quality may degrade in underrepresented languages.
  • Critical use: Not suitable for medical, legal, financial, or other high-risk applications without safeguards.

Security and Responsible Use

Security Responsibilities:
Deployers and users are strongly encouraged to align their security practices with established frameworks and regulatory guidelines such as the EU AI Act and the NIST AI Risk Management Framework (RMF).

Guidelines for Deployers:

  • Regularly conduct robustness assessments to identify and mitigate adversarial inputs.
  • Implement validation and filtering processes to prevent harmful or biased outputs.
  • Continuously perform data privacy checks to guard against unintended data leaks.
  • Document and communicate the model's limitations, intended usage, and known security risks to all end-users.
  • Schedule periodic security reviews and updates to address emerging threats and vulnerabilities.

Guidelines for Users:

  • Follow established security policies and usage guidelines provided by deployers.
  • Protect and manage sensitive information when interacting with the model.
  • Report anomalies, suspicious behavior, or unsafe outputs to deployers or developers.
  • Maintain human oversight and apply judgment to mitigate potential security or ethical risks during interactions.

Disclaimer:
Users accept responsibility for securely deploying, managing, and using this open-source LLM. The model is provided "as-is," without explicit or implied warranty regarding security or fitness for any specific application or environment.


License

MIT


Citation

@misc{apriel_h1_2025,
  title        = {Apriel-H1: Towards Efficient Enterprise Reasoning Models},
  author       = {ServiceNow Language Models Lab},
  howpublished = {https://huggingface.co/ServiceNow-AI/Apriel-H1-15b-Thinker-SFT},
  year         = {2025}
}