|
|
--- |
|
|
library_name: vllm |
|
|
language: |
|
|
- en |
|
|
- fr |
|
|
- es |
|
|
- de |
|
|
- it |
|
|
- pt |
|
|
- nl |
|
|
- zh |
|
|
- ja |
|
|
- ko |
|
|
- ar |
|
|
license: apache-2.0 |
|
|
inference: false |
|
|
base_model: |
|
|
- mistralai/Ministral-3-8B-Base-2512 |
|
|
extra_gated_description: >- |
|
|
If you want to learn more about how we process your personal data, please read |
|
|
our <a href="https://mistral.ai/terms/">Privacy Policy</a>. |
|
|
tags: |
|
|
- mistral-common |
|
|
--- |
|
|
|
|
|
# Ministral 3 8B Reasoning 2512 |
|
|
A balanced model in the Ministral 3 family, **Ministral 3 8B** is a powerful, efficient tiny language model with vision capabilities. |
|
|
|
|
|
This model is the reasoning post-trained version, making it ideal for math, coding, and STEM-related use cases.
|
|
|
|
|
The Ministral 3 family is designed for edge deployment and runs on a wide range of hardware. Ministral 3 8B can even be deployed locally, fitting in 24GB of VRAM in BF16 and in less than 12GB of RAM/VRAM when quantized.
|
|
|
|
|
## Key Features |
|
|
Ministral 3 8B consists of two main architectural components: |
|
|
- **8.4B Language Model** |
|
|
- **0.4B Vision Encoder** |
|
|
|
|
|
The Ministral 3 8B Reasoning model offers the following capabilities: |
|
|
- **Vision**: Enables the model to analyze images and provide insights based on visual content, in addition to text. |
|
|
- **Multilingual**: Supports dozens of languages, including English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, and Arabic.
|
|
- **System Prompt**: Maintains strong adherence and support for system prompts. |
|
|
- **Agentic**: Offers best-in-class agentic capabilities with native function calling and JSON output.
|
|
- **Reasoning**: Excels at complex, multi-step reasoning and dynamic problem-solving. |
|
|
- **Edge-Optimized**: Delivers best-in-class performance at a small scale, deployable anywhere. |
|
|
- **Apache 2.0 License**: Open-source license allowing usage and modification for both commercial and non-commercial purposes. |
|
|
- **Large Context Window**: Supports a 256k context window. |
|
|
|
|
|
### Use Cases |
|
|
Perfect for balanced performance in local or embedded systems, combining versatility with efficiency. |
|
|
- Chat interfaces in constrained environments |
|
|
- Local daily-driver AI assistant |
|
|
- Image/document description and understanding |
|
|
- Translation and content generation |
|
|
- Specialized agentic use cases |
|
|
- Fine-tuning and specialization |
|
|
- And more... |
|
|
|
|
|
Bringing advanced AI capabilities to resource-constrained environments. |
|
|
|
|
|
### Recommended Settings |
|
|
|
|
|
We recommend deploying with the following best practices: |
|
|
- System Prompt: Use our provided [system prompt](https://huggingface.co/mistralai/Ministral-3-8B-Reasoning-2512/blob/main/SYSTEM_PROMPT.txt), and append it to your custom system prompt to define a clear environment and use case, including guidance on how to effectively leverage tools in agentic systems (see the sketch after this list).
|
|
- Multi-turn Traces: We highly recommend keeping the reasoning traces in context. |
|
|
- Sampling Parameters: Use a **temperature of 0.7** for most environments; developers are encouraged to experiment with different temperatures for specific use cases.
|
|
- Tools: Keep the set of tools well-defined and limit their number to the minimum required for the use case, to avoid overloading the model with an excessive number of tools.
|
|
- Vision: When deploying with vision capabilities, we recommend maintaining an aspect ratio close to 1:1 (width-to-height) for images. Avoid overly thin or wide images; crop them as needed to ensure optimal performance.
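
As a minimal sketch of the system-prompt guidance above (the custom instructions here are a hypothetical example), you can fetch the provided prompt and combine it with your own:

```python
from huggingface_hub import hf_hub_download

# Fetch the provided system prompt from the model repository.
file_path = hf_hub_download(
    repo_id="mistralai/Ministral-3-8B-Reasoning-2512", filename="SYSTEM_PROMPT.txt"
)
with open(file_path, "r") as f:
    provided_prompt = f.read()

# Hypothetical custom instructions defining your environment and use case.
custom_prompt = "You are an assistant embedded in a Python IDE. Keep answers concise."

# Per the guidance above, the provided prompt is appended to your custom one.
# The usage examples below additionally split this text around its
# [THINK]...[/THINK] markers to build structured system-message content.
system_prompt = custom_prompt + "\n\n" + provided_prompt
```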
|
|
|
|
|
## Ministral 3 Family |
|
|
|
|
|
| Model Name | Type | Precision | Link | |
|
|
|--------------------------------|--------------------|-----------|------------------------------------------------------------------------------------------| |
|
|
| Ministral 3 3B Base 2512 | Base pre-trained | BF16 | [Hugging Face](https://huggingface.co/mistralai/Ministral-3-3B-Base-2512) | |
|
|
| Ministral 3 3B Instruct 2512 | Instruct post-trained | FP8 | [Hugging Face](https://huggingface.co/mistralai/Ministral-3-3B-Instruct-2512) | |
|
|
| Ministral 3 3B Reasoning 2512 | Reasoning capable | BF16 | [Hugging Face](https://huggingface.co/mistralai/Ministral-3-3B-Reasoning-2512) | |
|
|
| Ministral 3 8B Base 2512 | Base pre-trained | BF16 | [Hugging Face](https://huggingface.co/mistralai/Ministral-3-8B-Base-2512) | |
|
|
| Ministral 3 8B Instruct 2512 | Instruct post-trained | FP8 | [Hugging Face](https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512) | |
|
|
| **Ministral 3 8B Reasoning 2512** | **Reasoning capable** | **BF16** | [Hugging Face](https://huggingface.co/mistralai/Ministral-3-8B-Reasoning-2512) | |
|
|
| Ministral 3 14B Base 2512 | Base pre-trained | BF16 | [Hugging Face](https://huggingface.co/mistralai/Ministral-3-14B-Base-2512) | |
|
|
| Ministral 3 14B Instruct 2512 | Instruct post-trained | FP8 | [Hugging Face](https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512) | |
|
|
| Ministral 3 14B Reasoning 2512 | Reasoning capable | BF16 | [Hugging Face](https://huggingface.co/mistralai/Ministral-3-14B-Reasoning-2512) | |
|
|
|
|
|
Other formats available [here](https://huggingface.co/collections/mistralai/ministral-3-additional-checkpoints). |
|
|
|
|
|
## Benchmark Results |
|
|
|
|
|
We compare Ministral 3 to similarly sized models.
|
|
|
|
|
### Reasoning |
|
|
|
|
|
| Model | AIME25 | AIME24 | GPQA Diamond | LiveCodeBench | |
|
|
|---------------------------|-------------|-------------|--------------|---------------| |
|
|
| **Ministral 3 14B** | <u>0.850</u>| <u>0.898</u>| <u>0.712</u> | <u>0.646</u> | |
|
|
| Qwen3-14B (Thinking) | 0.737 | 0.837 | 0.663 | 0.593 | |
|
|
| | | | | | |
|
|
| **Ministral 3 8B** | 0.787 | <u>0.860</u>| 0.668 | <u>0.616</u> | |
|
|
| Qwen3-VL-8B-Thinking | <u>0.798</u>| <u>0.860</u>| <u>0.671</u> | 0.580 | |
|
|
| | | | | | |
|
|
| **Ministral 3 3B** | <u>0.721</u>| <u>0.775</u>| 0.534 | <u>0.548</u> | |
|
|
| Qwen3-VL-4B-Thinking | 0.697 | 0.729 | <u>0.601</u> | 0.513 | |
|
|
|
|
|
### Instruct |
|
|
|
|
|
| Model | Arena Hard | WildBench | MATH Maj@1 | MM MTBench | |
|
|
|---------------------------|-------------|------------|-------------|------------------| |
|
|
| **Ministral 3 14B** | <u>0.551</u>| <u>68.5</u>| <u>0.904</u>| <u>8.49</u> | |
|
|
| Qwen3 14B (Non-Thinking) | 0.427 | 65.1 | 0.870 | NOT MULTIMODAL | |
|
|
| Gemma3-12B-Instruct | 0.436 | 63.2 | 0.854 | 6.70 | |
|
|
| | | | | | |
|
|
| **Ministral 3 8B** | 0.509 | <u>66.8</u>| 0.876 | <u>8.08</u> | |
|
|
| Qwen3-VL-8B-Instruct | <u>0.528</u>| 66.3 | <u>0.946</u>| 8.00 | |
|
|
| | | | | | |
|
|
| **Ministral 3 3B** | 0.305 | <u>56.8</u>| 0.830 | 7.83 | |
|
|
| Qwen3-VL-4B-Instruct | <u>0.438</u>| <u>56.8</u>| <u>0.900</u>| <u>8.01</u> | |
|
|
| Qwen3-VL-2B-Instruct | 0.163 | 42.2 | 0.786 | 6.36 | |
|
|
| Gemma3-4B-Instruct | 0.318 | 49.1 | 0.759 | 5.23 | |
|
|
|
|
|
### Base |
|
|
|
|
|
| Model | Multilingual MMLU | MATH CoT 2-Shot | AGIEval 5-shot | MMLU Redux 5-shot | MMLU 5-shot | TriviaQA 5-shot | |
|
|
|---------------------|-------------------|-----------------|----------------|-------------------|-------------|-----------------| |
|
|
| **Ministral 3 14B** | 0.742 | <u>0.676</u> | 0.648 | 0.820 | 0.794 | 0.749 | |
|
|
| Qwen3 14B Base | <u>0.754</u> | 0.620 | <u>0.661</u> | <u>0.837</u> | <u>0.804</u>| 0.703 | |
|
|
| Gemma 3 12B Base | 0.690 | 0.487 | 0.587 | 0.766 | 0.745 | <u>0.788</u> | |
|
|
| | | | | | | | |
|
|
| **Ministral 3 8B** | <u>0.706</u> | <u>0.626</u> | 0.591 | 0.793 | <u>0.761</u>| <u>0.681</u> | |
|
|
| Qwen 3 8B Base | 0.700 | 0.576 | <u>0.596</u> | <u>0.794</u> | 0.760 | 0.639 | |
|
|
| | | | | | | | |
|
|
| **Ministral 3 3B** | 0.652 | <u>0.601</u> | 0.511 | 0.735 | 0.707 | 0.592 | |
|
|
| Qwen 3 4B Base | <u>0.677</u> | 0.405 | <u>0.570</u> | <u>0.759</u> | <u>0.713</u>| 0.530 | |
|
|
| Gemma 3 4B Base | 0.516 | 0.294 | 0.430 | 0.626 | 0.589 | <u>0.640</u> | |
|
|
|
|
|
## Usage |
|
|
|
|
|
The model can be used with the following frameworks:
|
|
- [`vllm`](https://github.com/vllm-project/vllm): See [here](#vllm) |
|
|
- [`transformers`](https://github.com/huggingface/transformers): See [here](#transformers) |
|
|
|
|
|
### vLLM |
|
|
|
|
|
We recommend using this model with [vLLM](https://github.com/vllm-project/vllm). |
|
|
|
|
|
#### Installation |
|
|
|
|
|
Make sure to install **vllm >= 0.12.0**: |
|
|
|
|
|
```bash
|
|
pip install vllm --upgrade |
|
|
``` |
|
|
|
|
|
Doing so should automatically install [`mistral_common >= 1.8.6`](https://github.com/mistralai/mistral-common/releases/tag/v1.8.6). |
|
|
|
|
|
To check: |
|
|
```bash
|
|
python -c "import mistral_common; print(mistral_common.__version__)" |
|
|
``` |
|
|
|
|
|
You can also make use of a ready-to-go [Docker image](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile) or pull one directly from [Docker Hub](https://hub.docker.com/layers/vllm/vllm-openai/latest).
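
As a sketch, serving this model from the official image could look as follows (the cache mount and the `latest` tag are assumptions; adjust them to your environment):

```bash
docker run --gpus all -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    --model mistralai/Ministral-3-8B-Reasoning-2512 \
    --tokenizer_mode mistral --config_format mistral --load_format mistral \
    --enable-auto-tool-choice --tool-call-parser mistral \
    --reasoning-parser mistral
```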
|
|
|
|
|
#### Serve |
|
|
|
|
|
Due to their size, `Ministral-3-3B-Reasoning-2512` and `Ministral-3-8B-Reasoning-2512` can run on a single H200 GPU.
|
|
|
|
|
A simple launch command is: |
|
|
|
|
|
```bash |
|
|
|
|
|
vllm serve mistralai/Ministral-3-8B-Reasoning-2512 \ |
|
|
--tokenizer_mode mistral --config_format mistral --load_format mistral \ |
|
|
--enable-auto-tool-choice --tool-call-parser mistral \ |
|
|
--reasoning-parser mistral |
|
|
``` |
|
|
|
|
|
Key parameter notes: |
|
|
|
|
|
* `--enable-auto-tool-choice`: Required when enabling tool usage.
|
|
* `--tool-call-parser mistral`: Required when enabling tool usage.
|
|
* `--reasoning-parser mistral`: Required when enabling reasoning.
|
|
|
|
|
Additional flags: |
|
|
|
|
|
* You can set `--max-model-len` to save memory. By default it is set to `262144`, which is quite large and not necessary for most scenarios.
|
|
* You can set `--max-num-batched-tokens` to balance throughput and latency: higher values increase throughput at the cost of latency.
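
For example, a more memory-conscious launch might look as follows (the values here are illustrative choices, not recommendations):

```bash
vllm serve mistralai/Ministral-3-8B-Reasoning-2512 \
    --tokenizer_mode mistral --config_format mistral --load_format mistral \
    --enable-auto-tool-choice --tool-call-parser mistral \
    --reasoning-parser mistral \
    --max-model-len 32768 --max-num-batched-tokens 8192
```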
|
|
|
|
|
#### Usage of the model |
|
|
|
|
|
Here we assume that the model `mistralai/Ministral-3-8B-Reasoning-2512` is served and reachable at the domain `localhost` on port `8000`, the default for vLLM.
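
To quickly confirm the server is up, you can query the OpenAI-compatible models endpoint:

```bash
curl http://localhost:8000/v1/models
```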
|
|
|
|
|
<details> |
|
|
<summary>Vision Reasoning</summary> |
|
|
|
|
|
Let's see if the Ministral 3 model knows when to pick a fight!
|
|
|
|
|
```python |
|
|
from typing import Any |
|
|
|
|
|
from openai import OpenAI |
|
|
from huggingface_hub import hf_hub_download |
|
|
|
|
|
# Modify OpenAI's API key and API base to use vLLM's API server. |
|
|
openai_api_key = "EMPTY" |
|
|
openai_api_base = "http://localhost:8000/v1" |
|
|
|
|
|
TEMP = 0.7 |
|
|
TOP_P = 0.95 |
|
|
MAX_TOK = 262144 |
|
|
client = OpenAI( |
|
|
api_key=openai_api_key, |
|
|
base_url=openai_api_base, |
|
|
) |
|
|
|
|
|
models = client.models.list() |
|
|
model = models.data[0].id |
|
|
|
|
|
|
|
|
def load_system_prompt(repo_id: str, filename: str) -> dict[str, Any]: |
|
|
file_path = hf_hub_download(repo_id=repo_id, filename=filename) |
|
|
with open(file_path, "r") as file: |
|
|
system_prompt = file.read() |
|
|
|
|
|
index_begin_think = system_prompt.find("[THINK]") |
|
|
index_end_think = system_prompt.find("[/THINK]") |
|
|
|
|
|
return { |
|
|
"role": "system", |
|
|
"content": [ |
|
|
{"type": "text", "text": system_prompt[:index_begin_think]}, |
|
|
{ |
|
|
"type": "thinking", |
|
|
"thinking": system_prompt[ |
|
|
index_begin_think + len("[THINK]") : index_end_think |
|
|
], |
|
|
"closed": True, |
|
|
}, |
|
|
{ |
|
|
"type": "text", |
|
|
"text": system_prompt[index_end_think + len("[/THINK]") :], |
|
|
}, |
|
|
], |
|
|
} |
|
|
|
|
|
|
|
|
SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt") |
|
|
|
|
|
image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438" |
|
|
|
|
|
messages = [ |
|
|
SYSTEM_PROMPT, |
|
|
{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{ |
|
|
"type": "text", |
|
|
"text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.", |
|
|
}, |
|
|
{"type": "image_url", "image_url": {"url": image_url}}, |
|
|
], |
|
|
}, |
|
|
] |
|
|
|
|
|
|
|
|
stream = client.chat.completions.create( |
|
|
model=model, |
|
|
messages=messages, |
|
|
stream=True, |
|
|
temperature=TEMP, |
|
|
top_p=TOP_P, |
|
|
max_tokens=MAX_TOK, |
|
|
) |
|
|
|
|
|
print("client: Start streaming chat completions...:\n") |
|
|
printed_reasoning_content = False |
|
|
answer = [] |
|
|
|
|
|
for chunk in stream: |
|
|
reasoning_content = None |
|
|
content = None |
|
|
# Check the content is reasoning_content or content |
|
|
if hasattr(chunk.choices[0].delta, "reasoning_content"): |
|
|
reasoning_content = chunk.choices[0].delta.reasoning_content |
|
|
if hasattr(chunk.choices[0].delta, "content"): |
|
|
content = chunk.choices[0].delta.content |
|
|
|
|
|
if reasoning_content is not None: |
|
|
if not printed_reasoning_content: |
|
|
printed_reasoning_content = True |
|
|
print("Start reasoning:\n", end="", flush=True) |
|
|
print(reasoning_content, end="", flush=True) |
|
|
elif content is not None: |
|
|
# Extract and print the content |
|
|
if not reasoning_content and printed_reasoning_content: |
|
|
answer.extend(content) |
|
|
print(content, end="", flush=True) |
|
|
|
|
|
if answer: |
|
|
print("\n\n=============\nAnswer\n=============\n") |
|
|
print("".join(answer)) |
|
|
else: |
|
|
print("\n\n=============\nNo Answer\n=============\n") |
|
|
print( |
|
|
"No answer was generated by the model, probably because the maximum number of tokens was reached." |
|
|
) |
|
|
``` |
|
|
|
|
|
Now we'll make it compute some maths!
|
|
|
|
|
```python |
|
|
from typing import Any |
|
|
|
|
|
from openai import OpenAI |
|
|
from huggingface_hub import hf_hub_download |
|
|
|
|
|
# Modify OpenAI's API key and API base to use vLLM's API server. |
|
|
openai_api_key = "EMPTY" |
|
|
openai_api_base = "http://localhost:8000/v1" |
|
|
|
|
|
TEMP = 0.7 |
|
|
TOP_P = 0.95 |
|
|
MAX_TOK = 262144 |
|
|
client = OpenAI( |
|
|
api_key=openai_api_key, |
|
|
base_url=openai_api_base, |
|
|
) |
|
|
|
|
|
models = client.models.list() |
|
|
model = models.data[0].id |
|
|
|
|
|
|
|
|
def load_system_prompt(repo_id: str, filename: str) -> dict[str, Any]: |
|
|
file_path = hf_hub_download(repo_id=repo_id, filename=filename) |
|
|
with open(file_path, "r") as file: |
|
|
system_prompt = file.read() |
|
|
|
|
|
index_begin_think = system_prompt.find("[THINK]") |
|
|
index_end_think = system_prompt.find("[/THINK]") |
|
|
|
|
|
return { |
|
|
"role": "system", |
|
|
"content": [ |
|
|
{"type": "text", "text": system_prompt[:index_begin_think]}, |
|
|
{ |
|
|
"type": "thinking", |
|
|
"thinking": system_prompt[ |
|
|
index_begin_think + len("[THINK]") : index_end_think |
|
|
], |
|
|
"closed": True, |
|
|
}, |
|
|
{ |
|
|
"type": "text", |
|
|
"text": system_prompt[index_end_think + len("[/THINK]") :], |
|
|
}, |
|
|
], |
|
|
} |
|
|
|
|
|
|
|
|
SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt") |
|
|
|
|
|
image_url = "https://i.ytimg.com/vi/5Y3xLHeyKZU/hqdefault.jpg" |
|
|
|
|
|
messages = [ |
|
|
SYSTEM_PROMPT, |
|
|
{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{ |
|
|
"type": "text", |
|
|
"text": "Solve the equations. If they contain only numbers, use your calculator, else only think. Answer in the language of the image.", |
|
|
}, |
|
|
{"type": "image_url", "image_url": {"url": image_url}}, |
|
|
], |
|
|
}, |
|
|
] |
|
|
|
|
|
stream = client.chat.completions.create( |
|
|
model=model, |
|
|
messages=messages, |
|
|
stream=True, |
|
|
temperature=TEMP, |
|
|
top_p=TOP_P, |
|
|
max_tokens=MAX_TOK, |
|
|
) |
|
|
|
|
|
print("client: Start streaming chat completions...:\n") |
|
|
printed_reasoning_content = False |
|
|
answer = [] |
|
|
|
|
|
for chunk in stream: |
|
|
reasoning_content = None |
|
|
content = None |
|
|
# Check the content is reasoning_content or content |
|
|
if hasattr(chunk.choices[0].delta, "reasoning_content"): |
|
|
reasoning_content = chunk.choices[0].delta.reasoning_content |
|
|
if hasattr(chunk.choices[0].delta, "content"): |
|
|
content = chunk.choices[0].delta.content |
|
|
|
|
|
if reasoning_content is not None: |
|
|
if not printed_reasoning_content: |
|
|
printed_reasoning_content = True |
|
|
print("Start reasoning:\n", end="", flush=True) |
|
|
print(reasoning_content, end="", flush=True) |
|
|
    elif content is not None:
|
|
# Extract and print the content |
|
|
if not reasoning_content and printed_reasoning_content: |
|
|
answer.extend(content) |
|
|
print(content, end="", flush=True) |
|
|
|
|
|
if answer: |
|
|
print("\n\n=============\nAnswer\n=============\n") |
|
|
print("".join(answer)) |
|
|
else: |
|
|
print("\n\n=============\nNo Answer\n=============\n") |
|
|
print( |
|
|
"No answer was generated by the model, probably because the maximum number of tokens was reached." |
|
|
) |
|
|
``` |
|
|
|
|
|
</details> |
|
|
|
|
|
<details> |
|
|
<summary>Text-Only Request</summary> |
|
|
|
|
|
Let's do more maths and leave it up to the model to figure out how to achieve a result. |
|
|
|
|
|
```python |
|
|
from typing import Any |
|
|
from openai import OpenAI |
|
|
from huggingface_hub import hf_hub_download |
|
|
|
|
|
# Modify OpenAI's API key and API base to use vLLM's API server. |
|
|
openai_api_key = "EMPTY" |
|
|
openai_api_base = "http://localhost:8000/v1" |
|
|
|
|
|
TEMP = 0.7 |
|
|
TOP_P = 0.95 |
|
|
MAX_TOK = 262144 |
|
|
client = OpenAI( |
|
|
api_key=openai_api_key, |
|
|
base_url=openai_api_base, |
|
|
) |
|
|
|
|
|
models = client.models.list() |
|
|
model = models.data[0].id |
|
|
|
|
|
|
|
|
def load_system_prompt(repo_id: str, filename: str) -> dict[str, Any]: |
|
|
file_path = hf_hub_download(repo_id=repo_id, filename=filename) |
|
|
with open(file_path, "r") as file: |
|
|
system_prompt = file.read() |
|
|
|
|
|
index_begin_think = system_prompt.find("[THINK]") |
|
|
index_end_think = system_prompt.find("[/THINK]") |
|
|
|
|
|
return { |
|
|
"role": "system", |
|
|
"content": [ |
|
|
{"type": "text", "text": system_prompt[:index_begin_think]}, |
|
|
{ |
|
|
"type": "thinking", |
|
|
"thinking": system_prompt[ |
|
|
index_begin_think + len("[THINK]") : index_end_think |
|
|
], |
|
|
"closed": True, |
|
|
}, |
|
|
{ |
|
|
"type": "text", |
|
|
"text": system_prompt[index_end_think + len("[/THINK]") :], |
|
|
}, |
|
|
], |
|
|
} |
|
|
|
|
|
|
|
|
SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt") |
|
|
|
|
|
query = "Use each number in 2,5,6,3 exactly once, along with any combination of +, -, ×, ÷ (and parentheses for grouping), to make the number 24." |
|
|
|
|
|
messages = [ |
|
|
SYSTEM_PROMPT, |
|
|
{"role": "user", "content": query} |
|
|
] |
|
|
stream = client.chat.completions.create( |
|
|
model=model, |
|
|
messages=messages, |
|
|
stream=True, |
|
|
temperature=TEMP, |
|
|
top_p=TOP_P, |
|
|
max_tokens=MAX_TOK, |
|
|
) |
|
|
|
|
|
print("client: Start streaming chat completions...:\n") |
|
|
printed_reasoning_content = False |
|
|
answer = [] |
|
|
|
|
|
for chunk in stream: |
|
|
reasoning_content = None |
|
|
content = None |
|
|
# Check the content is reasoning_content or content |
|
|
if hasattr(chunk.choices[0].delta, "reasoning_content"): |
|
|
reasoning_content = chunk.choices[0].delta.reasoning_content |
|
|
if hasattr(chunk.choices[0].delta, "content"): |
|
|
content = chunk.choices[0].delta.content |
|
|
|
|
|
if reasoning_content is not None: |
|
|
if not printed_reasoning_content: |
|
|
printed_reasoning_content = True |
|
|
print("Start reasoning:\n", end="", flush=True) |
|
|
print(reasoning_content, end="", flush=True) |
|
|
    elif content is not None:
|
|
# Extract and print the content |
|
|
if not reasoning_content and printed_reasoning_content: |
|
|
answer.extend(content) |
|
|
print(content, end="", flush=True) |
|
|
|
|
|
if answer: |
|
|
print("\n\n=============\nAnswer\n=============\n") |
|
|
print("".join(answer)) |
|
|
else: |
|
|
print("\n\n=============\nNo Answer\n=============\n") |
|
|
print("No answer was generated by the model, probably because the maximum number of tokens was reached.") |
|
|
``` |
|
|
|
|
|
</details> |
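
<details>

<summary>Function Calling</summary>

Since the server was launched with `--enable-auto-tool-choice` and `--tool-call-parser mistral`, tool calls follow the standard OpenAI function-calling API. Below is a minimal sketch; the `get_weather` tool and its schema are hypothetical, and a production setup should also include the recommended system prompt:

```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id

# A hypothetical tool, defined in the standard OpenAI function-calling schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a given city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "The city name."}
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "What is the weather like in Paris right now?"}],
    tools=tools,
    tool_choice="auto",
    temperature=0.7,
)

# If the model decided to call the tool, the parsed call is available on the message.
message = response.choices[0].message
if message.tool_calls:
    for tool_call in message.tool_calls:
        print(tool_call.function.name, tool_call.function.arguments)
else:
    print(message.content)
```

</details>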
|
|
|
|
|
### Transformers |
|
|
|
|
|
You can also use Ministral 3 8B Reasoning 2512 with `Transformers`!
|
|
Make sure to install `Transformers` from its first v5 release candidate or from `main`:
|
|
|
|
|
```bash
|
|
pip install transformers==5.0.0rc0 |
|
|
``` |
|
|
|
|
|
To make the best use of our model with `Transformers`, make sure to have `mistral-common >= 1.8.6` [installed](https://github.com/mistralai/mistral-common) to use our tokenizer.
|
|
|
|
|
```bash |
|
|
pip install mistral-common --upgrade |
|
|
``` |
|
|
|
|
|
Then load our tokenizer along with the model and generate: |
|
|
|
|
|
<details> |
|
|
<summary>Python snippet</summary> |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend |
|
|
|
|
|
model_id = "mistralai/Ministral-3-8B-Reasoning-2512" |
|
|
|
|
|
tokenizer = MistralCommonBackend.from_pretrained(model_id) |
|
|
model = Mistral3ForConditionalGeneration.from_pretrained( |
|
|
model_id, torch_dtype=torch.bfloat16, device_map="auto" |
|
|
) |
|
|
|
|
|
image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438" |
|
|
|
|
|
messages = [ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{ |
|
|
"type": "text", |
|
|
"text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.", |
|
|
}, |
|
|
{"type": "image_url", "image_url": {"url": image_url}}, |
|
|
], |
|
|
}, |
|
|
] |
|
|
|
|
|
tokenized = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True) |
|
|
|
|
|
tokenized["input_ids"] = tokenized["input_ids"].to(device="cuda") |
|
|
tokenized["pixel_values"] = tokenized["pixel_values"].to(dtype=torch.bfloat16, device="cuda") |
|
|
image_sizes = [tokenized["pixel_values"].shape[-2:]] |
|
|
|
|
|
output = model.generate( |
|
|
**tokenized, |
|
|
image_sizes=image_sizes, |
|
|
max_new_tokens=8092, |
|
|
)[0] |
|
|
|
|
|
decoded_output = tokenizer.decode(output[len(tokenized["input_ids"][0]):]) |
|
|
print(decoded_output) |
|
|
``` |
|
|
|
|
|
</details> |
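
For text-only prompts, a reduced sketch of the same setup follows (assuming `apply_chat_template` returns no pixel values when the conversation contains no image):

<details>

<summary>Text-only snippet</summary>

```python
import torch
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend

model_id = "mistralai/Ministral-3-8B-Reasoning-2512"

tokenizer = MistralCommonBackend.from_pretrained(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Prove that the square root of 2 is irrational."}]

# Text-only conversations produce only token ids; no pixel values are needed.
tokenized = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True)
input_ids = tokenized["input_ids"].to(device="cuda")

output = model.generate(input_ids=input_ids, max_new_tokens=8092)[0]

decoded_output = tokenizer.decode(output[len(input_ids[0]):])
print(decoded_output)
```

</details>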
|
|
|
|
|
## License |
|
|
|
|
|
This model is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0.txt). |
|
|
|
|
|
*You must not use this model in a manner that infringes, misappropriates, or otherwise violates any third party’s rights, including intellectual property rights.* |