# Olmo-3.1-32B-Think-MLX-4bit

This is a 4-bit quantized MLX version of Olmo-3.1-32B-Think for Apple Silicon Macs.

## Model Details

| Property | Value |
|---|---|
| Original Model | allenai/Olmo-3.1-32B-Think |
| Parameters | 32B |
| Quantization | 4-bit (group size 64) |
| Size | ~17 GB |
| Framework | MLX |
| Context Length | 32K tokens |
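
As a rough sanity check on the size figure: 4-bit weights with one scale and bias per 64-weight group cost about 4.5 bits per parameter (assuming an fp16 scale and bias per group, MLX's affine quantization layout), which lines up with the ~17 GB above.

```python
# Back-of-the-envelope size estimate (assumption: fp16 scale + bias per 64-weight group)
params = 32e9
bits_per_weight = 4 + (16 + 16) / 64  # 4-bit weights + amortized scale/bias = 4.5 bits
size_gib = params * bits_per_weight / 8 / 2**30
print(f"~{size_gib:.1f} GiB")  # ~16.8 GiB, i.e. the ~17 GB listed above
```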

## Features

- **Extended reasoning:** Chain-of-thought with `<think>` tags (see the parsing sketch after this list)
- **Strong on math/coding:** Trained with RLVR for verifiable reasoning
- **Open weights:** Apache 2.0 license from Allen AI
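
Because the chain of thought is wrapped in `<think>` tags, downstream code often wants the final answer separated from the reasoning. A minimal parsing sketch (assumes a single well-formed `<think>...</think>` block; exact tag emission can vary with the chat template version):

```python
import re

def split_reasoning(output: str) -> tuple[str, str]:
    """Split generated text into (reasoning, answer), assuming <think>...</think> framing."""
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        return "", output.strip()  # no think block found; treat everything as the answer
    return match.group(1).strip(), output[match.end():].strip()
```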

## Requirements

- macOS 15.0+ (Sequoia)
- Apple Silicon Mac (M1/M2/M3/M4)
- 24 GB+ unified memory recommended (uses ~18 GB peak; see the headroom check after this list)
- Python 3.10+
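
If you want to verify headroom before loading, here is a small check using `psutil` (a third-party package, not an mlx-lm dependency; the 18 GB figure comes from the peak-memory measurement below):

```python
import psutil  # pip install psutil

PEAK_GB = 18  # observed peak for this model (see Performance)
total_gb = psutil.virtual_memory().total / 1e9
if total_gb < PEAK_GB + 6:  # leave ~6 GB headroom for macOS and other apps
    print(f"Warning: {total_gb:.0f} GB unified memory may be tight for this model.")
```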

## Installation

```bash
pip install mlx-lm
```
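
You can smoke-test the install directly from the command line with the `mlx_lm.generate` CLI that ships with mlx-lm:

```bash
mlx_lm.generate \
  --model hunterbown/Olmo-3.1-32B-Think-MLX-4bit \
  --prompt "What is 17 * 23?" \
  --max-tokens 256
```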

## Usage

```python
from mlx_lm import load, generate

# Load the quantized model and tokenizer
model, tokenizer = load("hunterbown/Olmo-3.1-32B-Think-MLX-4bit")

# Generate with reasoning; the chat template triggers the <think> formatting
question = "If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets?"
messages = [{"role": "user", "content": question}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

output = generate(model, tokenizer, prompt=prompt, max_tokens=500)
print(output)
```
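
For interactive use, mlx-lm also exposes `stream_generate`, which yields tokens as they are produced. A minimal sketch (in recent mlx-lm releases each response object carries the decoded chunk in `.text`; check your installed version):

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("hunterbown/Olmo-3.1-32B-Think-MLX-4bit")

messages = [{"role": "user", "content": "Prove that the square root of 2 is irrational."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Print each chunk as it arrives instead of waiting for the full completion
for response in stream_generate(model, tokenizer, prompt, max_tokens=1000):
    print(response.text, end="", flush=True)
print()
```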

## Performance

Benchmarked on a Mac Studio (M4) with 36 GB unified memory:

- Generation speed: ~19 tokens/sec
- Prompt processing: ~5 tokens/sec
- Peak memory: ~18 GB
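
To reproduce these numbers on your own hardware, pass `verbose=True` to `generate`; it reports prompt and generation throughput plus peak memory after each call (the exact report format varies across mlx-lm versions):

```python
from mlx_lm import load, generate

model, tokenizer = load("hunterbown/Olmo-3.1-32B-Think-MLX-4bit")

# verbose=True prints prompt tok/s, generation tok/s, and peak memory
generate(model, tokenizer, prompt="Explain the Pythagorean theorem.",
         max_tokens=200, verbose=True)
```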

## Conversion Details

Converted using `mlx-lm`:

```bash
mlx_lm.convert \
  --hf-path allenai/Olmo-3.1-32B-Think \
  --mlx-path ./olmo-3.1-32b-think-4bit \
  -q --q-bits 4 --q-group-size 64
```
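
The same conversion can be scripted from Python via `mlx_lm.convert`; the keyword arguments mirror the CLI flags above (confirm the exact signature against your installed mlx-lm version):

```python
from mlx_lm import convert

# Python equivalent of the CLI call above: 4-bit quantization, group size 64
convert(
    hf_path="allenai/Olmo-3.1-32B-Think",
    mlx_path="./olmo-3.1-32b-think-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)
```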

## License

Apache 2.0, the same license as the original model.

## Credits

- Original model: allenai/Olmo-3.1-32B-Think by Allen AI
- Conversion and quantization: `mlx-lm`
