Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Ministral 3 8B Reasoning 2512

A balanced model in the Ministral 3 family, Ministral 3 8B is a powerful, efficient tiny language model with vision capabilities.

This is the reasoning post-trained version, tuned for multi-step reasoning tasks, making it well suited for math, coding, and STEM-related use cases.

The Ministral 3 family is designed for edge deployment and runs on a wide range of hardware. Ministral 3 8B can even be deployed locally, fitting in 24 GB of VRAM in BF16 and in less than 12 GB of RAM/VRAM when quantized.

Key Features

Ministral 3 8B consists of two main architectural components:

  • 8.4B Language Model
  • 0.4B Vision Encoder

The Ministral 3 8B Reasoning model offers the following capabilities:

  • Vision: Enables the model to analyze images and provide insights based on visual content, in addition to text.
  • Multilingual: Supports dozens of languages, including English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, and Arabic.
  • System Prompt: Maintains strong adherence and support for system prompts.
  • Agentic: Offers best-in-class agentic capabilities with native function calling and JSON output.
  • Reasoning: Excels at complex, multi-step reasoning and dynamic problem-solving.
  • Edge-Optimized: Delivers best-in-class performance at a small scale, deployable anywhere.
  • Apache 2.0 License: Open-source license allowing usage and modification for both commercial and non-commercial purposes.
  • Large Context Window: Supports a 256k context window.

Use Cases

Perfect for balanced performance in local or embedded systems, combining versatility with efficiency.

  • Chat interfaces in constrained environments
  • Local daily-driver AI assistant
  • Image/document description and understanding
  • Translation and content generation
  • Specialized agentic use cases
  • Fine-tuning and specialization
  • And more...

Bringing advanced AI capabilities to resource-constrained environments.

Ministral 3 Family

Model Name                      | Type                  | Precision | Link
Ministral 3 3B Base 2512        | Base pre-trained      | BF16      | Hugging Face
Ministral 3 3B Instruct 2512    | Instruct post-trained | BF16      | Hugging Face
Ministral 3 3B Reasoning 2512   | Reasoning capable     | BF16      | Hugging Face
Ministral 3 8B Base 2512        | Base pre-trained      | BF16      | Hugging Face
Ministral 3 8B Instruct 2512    | Instruct post-trained | BF16      | Hugging Face
Ministral 3 8B Reasoning 2512   | Reasoning capable     | BF16      | Hugging Face
Ministral 3 14B Base 2512       | Base pre-trained      | BF16      | Hugging Face
Ministral 3 14B Instruct 2512   | Instruct post-trained | BF16      | Hugging Face
Ministral 3 14B Reasoning 2512  | Reasoning capable     | BF16      | Hugging Face

Other formats available here.

Benchmark Results

We compare Ministral 3 to similarly sized models.

Reasoning

Model                 | AIME25 | AIME24 | GPQA Diamond | LiveCodeBench
Ministral 3 14B       | 0.850  | 0.898  | 0.712        | 0.646
Qwen3-14B (Thinking)  | 0.737  | 0.837  | 0.663        | 0.593
Ministral 3 8B        | 0.787  | 0.860  | 0.668        | 0.616
Qwen3-VL-8B-Thinking  | 0.798  | 0.860  | 0.671        | 0.580
Ministral 3 3B        | 0.721  | 0.775  | 0.534        | 0.548
Qwen3-VL-4B-Thinking  | 0.697  | 0.729  | 0.601        | 0.513

Instruct

Model                     | Arena Hard | WildBench | MATH Maj@1 | MM MTBench
Ministral 3 14B           | 0.551      | 68.5      | 0.904      | 8.49
Qwen3 14B (Non-Thinking)  | 0.427      | 65.1      | 0.870      | NOT MULTIMODAL
Gemma3-12B-Instruct       | 0.436      | 63.2      | 0.854      | 6.70
Ministral 3 8B            | 0.509      | 66.8      | 0.876      | 8.08
Qwen3-VL-8B-Instruct      | 0.528      | 66.3      | 0.946      | 8.00
Ministral 3 3B            | 0.305      | 56.8      | 0.830      | 7.83
Qwen3-VL-4B-Instruct      | 0.438      | 56.8      | 0.900      | 8.01
Qwen3-VL-2B-Instruct      | 0.163      | 42.2      | 0.786      | 6.36
Gemma3-4B-Instruct        | 0.318      | 49.1      | 0.759      | 5.23

Base

Model             | Multilingual MMLU | MATH CoT 2-Shot | AGIEval 5-shot | MMLU Redux 5-shot | MMLU 5-shot | TriviaQA 5-shot
Ministral 3 14B   | 0.742             | 0.676           | 0.648          | 0.820             | 0.794       | 0.749
Qwen3 14B Base    | 0.754             | 0.620           | 0.661          | 0.837             | 0.804       | 0.703
Gemma 3 12B Base  | 0.690             | 0.487           | 0.587          | 0.766             | 0.745       | 0.788
Ministral 3 8B    | 0.706             | 0.626           | 0.591          | 0.793             | 0.761       | 0.681
Qwen 3 8B Base    | 0.700             | 0.576           | 0.596          | 0.794             | 0.760       | 0.639
Ministral 3 3B    | 0.652             | 0.601           | 0.511          | 0.735             | 0.707       | 0.592
Qwen 3 4B Base    | 0.677             | 0.405           | 0.570          | 0.759             | 0.713       | 0.530
Gemma 3 4B Base   | 0.516             | 0.294           | 0.430          | 0.626             | 0.589       | 0.640

Usage

The model can be used with the following frameworks:

vLLM

We recommend using this model with vLLM.

Installation

Make sure to install vLLM >= 0.12.0:

pip install vllm --upgrade

Doing so should automatically install mistral_common >= 1.8.6.

To check:

python -c "import mistral_common; print(mistral_common.__version__)"

You can also make use of a ready-to-go Docker image available on Docker Hub.

Serve

Due to their size, Ministral-3-3B-Reasoning-2512 and Ministral-3-8B-Reasoning-2512 can each run on a single H200 GPU.

A simple launch command is:


vllm serve mistralai/Ministral-3-8B-Reasoning-2512-FP8 \
  --enable-auto-tool-choice --tool-call-parser mistral \
  --reasoning-parser mistral

Key parameter notes:

  • --enable-auto-tool-choice: Required when enabling tool usage.
  • --tool-call-parser mistral: Required when enabling tool usage (a minimal request sketch follows this list).
  • --reasoning-parser mistral: Required when enabling reasoning.
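
To illustrate what these parsers enable, here is a minimal, non-streaming request sketch against a server launched as above. It assumes the default localhost:8000 endpoint used later in this card; the get_weather tool is purely hypothetical and only shows the shape of a tool definition.

from openai import OpenAI

# Assumes the vLLM server launched above is reachable at its default address.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id

# Hypothetical tool definition, following the OpenAI function-calling schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "What is the weather in Paris right now?"}],
    tools=tools,
    temperature=0.7,
    top_p=0.95,
)

message = response.choices[0].message
# --reasoning-parser mistral surfaces the reasoning trace as a separate field.
print("Reasoning:", getattr(message, "reasoning_content", None))
# --tool-call-parser mistral converts the model's call into structured tool_calls.
print("Tool calls:", message.tool_calls)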

Additional flags:

  • You can set --max-model-len to save memory. By default it is set to 262144, which is quite large and unnecessary for most scenarios (an example launch command follows this list).
  • You can set --max-num-batched-tokens to balance throughput and latency: higher values increase throughput at the cost of higher latency.
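
For example, a launch command combining these flags might look like the following; the values below are purely illustrative, not recommendations:

vllm serve mistralai/Ministral-3-8B-Reasoning-2512-FP8 \
  --enable-auto-tool-choice --tool-call-parser mistral \
  --reasoning-parser mistral \
  --max-model-len 65536 \
  --max-num-batched-tokens 8192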

Usage of the model

Here we assume that the model mistralai/Ministral-3-8B-Reasoning-2512 is being served and that you can reach it at localhost on port 8000, the default for vLLM.

Vision Reasoning

Let's see if the Ministral 3 model knows when to pick a fight!

from typing import Any

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.7
TOP_P = 0.95
MAX_TOK = 262144
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> dict[str, Any]:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()

    index_begin_think = system_prompt.find("[THINK]")
    index_end_think = system_prompt.find("[/THINK]")

    return {
        "role": "system",
        "content": [
            {"type": "text", "text": system_prompt[:index_begin_think]},
            {
                "type": "thinking",
                "thinking": system_prompt[
                    index_begin_think + len("[THINK]") : index_end_think
                ],
                "closed": True,
            },
            {
                "type": "text",
                "text": system_prompt[index_end_think + len("[/THINK]") :],
            },
        ],
    }


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    SYSTEM_PROMPT,
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]


stream = client.chat.completions.create(
    model=model,
    messages=messages,
    stream=True,
    temperature=TEMP,
    top_p=TOP_P,
    max_tokens=MAX_TOK,
)

print("client: Start streaming chat completions...:\n")
printed_reasoning_content = False
answer = []

for chunk in stream:
    reasoning_content = None
    content = None
    # Check whether the delta carries reasoning_content or regular content
    if hasattr(chunk.choices[0].delta, "reasoning_content"):
        reasoning_content = chunk.choices[0].delta.reasoning_content
    if hasattr(chunk.choices[0].delta, "content"):
        content = chunk.choices[0].delta.content

    if reasoning_content is not None:
        if not printed_reasoning_content:
            printed_reasoning_content = True
            print("Start reasoning:\n", end="", flush=True)
        print(reasoning_content, end="", flush=True)
    elif content is not None:
        # Extract and print the content
        if not reasoning_content and printed_reasoning_content:
            answer.extend(content)
        print(content, end="", flush=True)

if answer:
    print("\n\n=============\nAnswer\n=============\n")
    print("".join(answer))
else:
    print("\n\n=============\nNo Answer\n=============\n")
    print(
        "No answer was generated by the model, probably because the maximum number of tokens was reached."
    )

Now we'll make it compute some maths!

from typing import Any

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.7
TOP_P = 0.95
MAX_TOK = 262144
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> dict[str, Any]:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()

    index_begin_think = system_prompt.find("[THINK]")
    index_end_think = system_prompt.find("[/THINK]")

    return {
        "role": "system",
        "content": [
            {"type": "text", "text": system_prompt[:index_begin_think]},
            {
                "type": "thinking",
                "thinking": system_prompt[
                    index_begin_think + len("[THINK]") : index_end_think
                ],
                "closed": True,
            },
            {
                "type": "text",
                "text": system_prompt[index_end_think + len("[/THINK]") :],
            },
        ],
    }


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

image_url = "https://i.ytimg.com/vi/5Y3xLHeyKZU/hqdefault.jpg"

messages = [
    SYSTEM_PROMPT,
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Solve the equations. If they contain only numbers, use your calculator, else only think. Answer in the language of the image.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

stream = client.chat.completions.create(
    model=model,
    messages=messages,
    stream=True,
    temperature=TEMP,
    top_p=TOP_P,
    max_tokens=MAX_TOK,
)

print("client: Start streaming chat completions...:\n")
printed_reasoning_content = False
answer = []

for chunk in stream:
    reasoning_content = None
    content = None
    # Check whether the delta carries reasoning_content or regular content
    if hasattr(chunk.choices[0].delta, "reasoning_content"):
        reasoning_content = chunk.choices[0].delta.reasoning_content
    if hasattr(chunk.choices[0].delta, "content"):
        content = chunk.choices[0].delta.content

    if reasoning_content is not None:
        if not printed_reasoning_content:
            printed_reasoning_content = True
            print("Start reasoning:\n", end="", flush=True)
        print(reasoning_content, end="", flush=True)
    if content is not None:
        # Extract and print the content
        if not reasoning_content and printed_reasoning_content:
            answer.extend(content)
        print(content, end="", flush=True)

if answer:
    print("\n\n=============\nAnswer\n=============\n")
    print("".join(answer))
else:
    print("\n\n=============\nNo Answer\n=============\n")
    print(
        "No answer was generated by the model, probably because the maximum number of tokens was reached."
    )

Text-Only Request

Let's do more maths and leave it up to the model to figure out how to achieve a result.

from typing import Any
from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.7
TOP_P = 0.95
MAX_TOK = 262144
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> dict[str, Any]:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()

    index_begin_think = system_prompt.find("[THINK]")
    index_end_think = system_prompt.find("[/THINK]")

    return {
        "role": "system",
        "content": [
            {"type": "text", "text": system_prompt[:index_begin_think]},
            {
                "type": "thinking",
                "thinking": system_prompt[
                    index_begin_think + len("[THINK]") : index_end_think
                ],
                "closed": True,
            },
            {
                "type": "text",
                "text": system_prompt[index_end_think + len("[/THINK]") :],
            },
        ],
    }


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

query = "Use each number in 2,5,6,3 exactly once, along with any combination of +, -, ×, ÷ (and parentheses for grouping), to make the number 24."

messages = [
    SYSTEM_PROMPT,
    {"role": "user", "content": query}
]
stream = client.chat.completions.create(
  model=model,
  messages=messages,
  stream=True,
  temperature=TEMP,
  top_p=TOP_P,
  max_tokens=MAX_TOK,
)

print("client: Start streaming chat completions...:\n")
printed_reasoning_content = False
answer = []

for chunk in stream:
    reasoning_content = None
    content = None
    # Check whether the delta carries reasoning_content or regular content
    if hasattr(chunk.choices[0].delta, "reasoning_content"):
        reasoning_content = chunk.choices[0].delta.reasoning_content
    if hasattr(chunk.choices[0].delta, "content"):
        content = chunk.choices[0].delta.content

    if reasoning_content is not None:
        if not printed_reasoning_content:
            printed_reasoning_content = True
            print("Start reasoning:\n", end="", flush=True)
        print(reasoning_content, end="", flush=True)
    if content is not None:
        # Extract and print the content
        if not reasoning_content and printed_reasoning_content:
            answer.extend(content)
        print(content, end="", flush=True)

if answer:
    print("\n\n=============\nAnswer\n=============\n")
    print("".join(answer))
else:
    print("\n\n=============\nNo Answer\n=============\n")
    print("No answer was generated by the model, probably because the maximum number of tokens was reached.")

Transformers

You can also use Ministral 3 8B Reasoning 2512 with Transformers!

To make the best use of our model with Transformers, make sure you have mistral-common >= 1.8.6 installed to use our tokenizer.

pip install mistral-common --upgrade

Then load our tokenizer along with the model and generate:

import torch
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend

model_id = "mistralai/Ministral-3-8B-Reasoning-2512"

tokenizer = MistralCommonBackend.from_pretrained(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

tokenized = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True)

tokenized["input_ids"] = tokenized["input_ids"].to(device="cuda")
tokenized["pixel_values"] = tokenized["pixel_values"].to(dtype=torch.bfloat16, device="cuda")
image_sizes = [tokenized["pixel_values"].shape[-2:]]

output = model.generate(
    **tokenized,
    image_sizes=image_sizes,
    max_new_tokens=8092,
)[0]

decoded_output = tokenizer.decode(output[len(tokenized["input_ids"][0]):])
print(decoded_output)
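
For a text-only prompt, here is a minimal sketch reusing the tokenizer and model loaded above; the prompt is just an example, and with no image in the messages there are no pixel_values to handle:

messages = [
    {
        "role": "user",
        "content": "Give a short proof that the square root of 2 is irrational.",
    },
]

tokenized = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True)
# Text-only prompt: only the token ids need to be moved to the GPU.
input_ids = tokenized["input_ids"].to(device="cuda")

output = model.generate(input_ids, max_new_tokens=2048)[0]

decoded_output = tokenizer.decode(output[len(input_ids[0]):])
print(decoded_output)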

License

This model is licensed under the Apache 2.0 License.

You must not use this model in a manner that infringes, misappropriates, or otherwise violates any third party’s rights, including intellectual property rights.
