Olmo-3-32B-Think-MLX-4bit
This is Olmo 3 32B Think converted to MLX 4-bit quantized format for efficient inference on Apple Silicon hardware. This variant is optimized for speed with minimal memory requirements while preserving Olmo's signature thinking capabilities.
MLX Model Variants - Complete Collection
Choose the best variant for your hardware and performance needs:
| Model | Precision | Model Size | Bits/Weight | Memory Usage | Performance | Repository |
|---|---|---|---|---|---|---|
| Olmo-3-32B-Think-MLX-4bit | 4-bit quantized | 17GB | 4.500 | 18.4GB | 20.96 tok/s | Plurigrid/Olmo-3-32B-Think-MLX-4bit |
| Olmo-3-32B-Think-MLX-6bit | 6-bit quantized | 24GB | 6.500 | 26.7GB | 15.50 tok/s | Plurigrid/Olmo-3-32B-Think-MLX-6bit |
| Olmo-3-32B-Think-MLX-8bit | 8-bit quantized | 32GB | 8.500 | 35.1GB | 15.10 tok/s | Plurigrid/Olmo-3-32B-Think-MLX-8bit |
| Olmo-3-32B-Think-MLX-bf16 | bfloat16 (full) | 60GB | ~16.000 | 64.7GB | 7.58 tok/s | Plurigrid/Olmo-3-32B-Think-MLX-bf16 |
Why Choose 4-bit?
- Optimized Performance: 20.96 tokens/sec (2.8x faster than bf16)
- Minimal Memory: 18.4GB RAM usage (3.5x less than bf16)
- Device Compatibility: Runs on 32GB+ Apple Silicon devices
- Preserved Thinking: Full Olmo <think> capabilities intact
Quick Start
Command Line Interface
# Interactive chat (recommended)
uvx --from mlx-lm mlx_lm.chat --model Plurigrid/Olmo-3-32B-Think-MLX-4bit
# Generate text with thinking
uvx --from mlx-lm mlx_lm.generate --model Plurigrid/Olmo-3-32B-Think-MLX-4bit \
--prompt "Who would win in a fight - a dinosaur or a cow named Moo Moo?" \
--max-tokens 500 --temp 0.6
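Beyond one-off generation, mlx-lm also bundles an OpenAI-compatible HTTP server; a minimal sketch (port 8080 is the usual default, check mlx_lm.server --help on your installed version):
# Serve the model over an OpenAI-compatible API (default port 8080)
uvx --from mlx-lm mlx_lm.server --model Plurigrid/Olmo-3-32B-Think-MLX-4bit
# Query it from another terminal
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello, Olmo!"}], "max_tokens": 256}'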
Python API
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler
# Load the 4-bit quantized model
model, tokenizer = load("Plurigrid/Olmo-3-32B-Think-MLX-4bit")
prompt = "Explain quantum computing step by step."
# Apply chat template if available
if tokenizer.chat_template is not None:
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
messages, add_generation_prompt=True
)
# Generate response with Olmo thinking (sampling settings from the CLI example above)
# Note: recent mlx-lm versions take sampling parameters via a sampler object
sampler = make_sampler(temp=0.6, top_p=0.95)
response = generate(model, tokenizer, prompt=prompt, verbose=True,
                    sampler=sampler, max_tokens=2048)
print(response)
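For interactive use you may prefer streaming output instead of waiting for the full completion. A minimal sketch using mlx-lm's stream_generate, reusing the model, tokenizer, and templated prompt from above (assumes a recent mlx-lm where each yielded chunk exposes a .text field):
from mlx_lm import stream_generate
# Stream tokens to stdout as they are generated
for chunk in stream_generate(model, tokenizer, prompt=prompt, max_tokens=2048):
    print(chunk.text, end="", flush=True)
print()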
Installation
# Install MLX-LM
pip install mlx-lm
# or with uv
uv add mlx-lm
Hardware Requirements
| Component | Minimum | Recommended |
|---|---|---|
| Platform | Apple Silicon (M1/M2/M3/M4/M5) | M1 Pro/Max or newer |
| Memory | 32GB unified memory | 64GB+ unified memory |
| Storage | 20GB free space | 40GB+ free space |
| OS | macOS 12+ | macOS 14+ (Sonoma) |
Tested Configuration: Mac Studio M1 Ultra (20-core CPU, 128GB unified memory), macOS Sequoia 15.2
Technical Specifications
4-bit Quantization Details:
- Quantization Method: MLX native affine quantization
- Effective Bits: 4.500 bits per weight
- Group Size: 64 (default)
- Conversion Command:
mlx_lm.convert --quantize --q-bits 4
- Quality Preservation: Excellent (maintains thinking patterns)
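The 4.500 bits/weight figure is the 4-bit payload plus the per-group overhead of the affine scheme, which stores a scale and bias for each group of weights; a quick back-of-the-envelope check (a sketch, assuming fp16 scale and bias per group):
# Effective bits per weight for group-wise affine quantization
q_bits = 4                # bits stored per weight
group_size = 64           # weights sharing one scale/bias pair
group_overhead = 16 + 16  # fp16 scale + fp16 bias, in bits
bits_per_weight = q_bits + group_overhead / group_size
print(bits_per_weight)    # 4.5 -> matches the 4.500 bits/weight above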
Performance Metrics:
- Inference Speed: 20.96 tokens/second
- Memory Efficiency: 18.4GB peak usage
- Model Loading: ~8-12 seconds
- Quality: Preserves Olmo's signature <think> reasoning
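As a rough sanity check on the figures above, the weight footprint follows directly from the parameter count and the effective bits per weight (a sketch, assuming roughly 32B parameters; runtime memory adds KV cache and activations on top):
# Approximate weight footprint of the 4-bit model
params = 32e9            # ~32B parameters (approximate)
bits_per_weight = 4.5
weight_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weight_gb:.0f} GB of weights")  # ~18 GB, consistent with 17GB on disk / 18.4GB peak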
About Olmo 3 32B Think
Olmo 3 is a new family of open language models designed to enable the science of language models. The Think variant uses long chains of thought to improve performance on reasoning tasks like math and coding.
Key Capabilities:
- Step-by-step reasoning with visible <think> tags
- Advanced mathematical and scientific reasoning
- Tool-use and multi-turn conversations
- Research-grade analysis and problem-solving
Chat Template
Default System Message
The default system prompt for this model is:
<|im_start|>system
You are a helpful AI assistant.<|im_end|>
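To use a different system message, prepend a system turn before applying the chat template; a minimal sketch reusing the model and tokenizer loaded earlier (the system text here is just an illustrative example):
messages = [
    {"role": "system", "content": "You are a careful, step-by-step mathematical reasoner."},
    {"role": "user", "content": "Prove that the square root of 2 is irrational."},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, max_tokens=2048)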
Chat Format
The chat template for this model is formatted as:
<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
Who would win in a fight - a dinosaur or a cow named Moo Moo?<|im_end|>
<|im_start|>assistant
<think>Okay, so the question is who would win in a fight between a dinosaur and a cow named Moo Moo.
Hmm, first I need to break this down. Let me think about the different factors involved here..... </think>
Moo Moo the cow would certainly win.
<|endoftext|>
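Because the visible reasoning is wrapped in <think>...</think>, downstream code often wants to split it from the final answer; a minimal sketch (assumes a single, well-formed think block in the output):
import re

def split_thinking(text):
    # Separate the <think>...</think> reasoning from the final answer
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()

thinking, answer = split_thinking(
    "<think>Okay, so the question is who would win...</think> Moo Moo the cow would certainly win."
)
print(answer)  # Moo Moo the cow would certainly win.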
Evaluation Results
Results from the original Olmo-3-32B-Think model (quality preserved in 4-bit variant):
| Benchmark | Olmo 3 Think 32B | Qwen 3 32B | Gemma 3 27B Instruct | DeepSeek-R1-Distill-Qwen-32B |
|---|---|---|---|---|
| Math | | | | |
| MATH | 96.1 | 95.4 | 87.4 | 92.6 |
| AIME 2024 | 76.8 | 80.8 | 28.9 | 70.3 |
| AIME 2025 | 72.5 | 70.9 | 22.9 | 56.3 |
| Reasoning | | | | |
| BigBenchHard | 89.8 | 90.6 | 82.4 | 89.7 |
| ZebraLogic | 76.0 | 88.3 | 24.8 | 69.4 |
| AGI Eval English | 88.2 | 90.0 | 76.9 | 88.1 |
| Coding | | | | |
| HumanEvalPlus | 91.4 | 91.2 | 79.2 | 92.3 |
| MBPP+ | 68.0 | 70.6 | 65.7 | 70.1 |
| LiveCodeBench v3 | 83.5 | 90.2 | 39.0 | 79.5 |
Advanced Usage
Multi-turn Conversation
messages = [
{"role": "user", "content": "What is category theory?"},
{"role": "assistant", "content": "Category theory is a mathematical framework..."},
{"role": "user", "content": "How does it apply to computer science?"}
]
formatted_prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=formatted_prompt, max_tokens=2048)
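To continue the dialogue, append the assistant's reply (optionally with the <think> block stripped, as sketched earlier) and the next user turn before re-applying the template; a sketch:
# Extend the conversation with the model's reply and a follow-up question
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "Can you give a concrete example with functors?"})
formatted_prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=formatted_prompt, max_tokens=2048)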
Research-Style Analysis
research_prompt = """
Analyze the relationship between quantum mechanics and information theory.
Think step by step and provide a comprehensive analysis.
"""
# Apply the chat template, as in the earlier examples, before generating
messages = [{"role": "user", "content": research_prompt}]
formatted = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=formatted, max_tokens=4096, verbose=True)
Related Links
- Olmo Project Page
- Original Model
- Olmo-Core Repository
- Open-Instruct Repository
- MLX Framework
- Complete MLX Variants
Model Details
- Developed by: Allen Institute for AI (Ai2)
- Model type: Transformer-style autoregressive language model
- Language(s): English
- License: Apache 2.0 - intended for research and educational use in accordance with Ai2's Responsible Use Guidelines
- Contact: Technical inquiries: olmo@allenai.org, Press: press@allenai.org
- Date cutoff: December 2024
Bias, Risks, and Limitations
Like any base language model or fine-tuned model without safety filtering, these models can easily be prompted by users to generate harmful and sensitive content. Such content may also be produced unintentionally, especially in cases involving bias, so we recommend that users consider the risks when applying this technology. Additionally, statements from OLMo or any LLM are often inaccurate, so facts should be verified.
Citation
@article{olmo3,
title = {{OLMo 3: Open Language Models for Research and Education}},
author = {{Allen Institute for AI}},
  year = {2025},
}
Conversion Details
- Conversion Date: November 22, 2025
- Converter: MLX community via Plurigrid
- Command:
uvx --from mlx-lm mlx_lm.convert --hf-path allenai/Olmo-3-32B-Think --mlx-path ./Olmo-3-32B-Think-4bit --quantize --q-bits 4
- Framework Version: mlx-lm latest (November 2025)
- Validation: Tested on philosophical reasoning prompts; quality and thinking patterns were maintained
4-bit Specific Considerations:
- Optimized for Apple Silicon hardware only
- Excellent quality preservation with 4.500 bits per weight
- Fastest inference in the MLX model series for 32B models
- Ideal for real-time applications and resource-constrained environments
Base model: allenai/Olmo-3-1125-32B