gpt-oss 120b Collection
The first lossless gpt-oss-20b/120b quantisation repo on Hugging Face or ModelScope, covering 4-bit AWQ and 8-bit GPTQ schemes, mixed precision, and sparsity.
A 4-bit AWQ-quantised release of gpt-oss-120b
TL;DR – We convert the original FP16/FP32 checkpoint (≈ 234 GB) of gpt-oss-120b into a 4-bit weight-only model with 16-bit activations (W4A16).
The resulting 11-shard safetensors bundle is ≈ 33.4 GB, a roughly 7× size reduction with negligible quality loss.
| Property | Value |
|---|---|
| Architecture | Mixture-of-Experts Transformer |
| Total parameters | 117 B |
| Active parameters / token | 5.1 B |
| Layers | 36 |
| Experts | 128 (4 routed per token) |
| Hidden size / head dim | 2880 / 64 |
| Context window (max RoPE) | 131 072 tokens |
| Activation function | SwiGLU |
| Norm | RMSNorm (ε = 1e-5) |
| RoPE scaling | YaRN (θ = 150 000) |
| Training data cut-off | 2024-06-01 |
AWQ protects the ~1 % most activation-sensitive channels by rescaling them before 4-bit rounding, vastly reducing quantisation error compared with vanilla GPTQ.
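To make the idea concrete, here is a minimal, self-contained sketch of AWQ-style weight-only quantisation (not the exact recipe used for this release): the most activation-sensitive input channels are scaled up before symmetric 4-bit rounding, and the inverse scale is folded back afterwards so the layer output is mathematically unchanged. All names and the boost factor are illustrative.

```python
import torch

def awq_style_quantize(weight: torch.Tensor, act_scale: torch.Tensor,
                       n_bits: int = 4, protect_frac: float = 0.01,
                       boost: float = 2.0):
    """weight: [out_features, in_features]; act_scale: mean |activation| per input channel."""
    # Pick the ~1% of input channels with the largest average activation magnitude.
    n_protect = max(1, int(protect_frac * weight.shape[1]))
    important = torch.topk(act_scale, n_protect).indices

    # Per-channel scales: boost the important channels so rounding error hits them less.
    scales = torch.ones(weight.shape[1])
    scales[important] = boost

    w_scaled = weight * scales            # scale the input-channel columns

    # Symmetric per-output-channel 4-bit quantisation of the scaled weights.
    qmax = 2 ** (n_bits - 1) - 1          # 7 for int4
    w_absmax = w_scaled.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    step = w_absmax / qmax
    w_q = torch.clamp(torch.round(w_scaled / step), -qmax - 1, qmax)

    # Dequantised weights; at inference the activations are divided by `scales`
    # (or the scales are folded into the previous layer), so y = x @ W.T is preserved.
    w_dq = w_q * step / scales
    return w_q.to(torch.int8), step, scales, w_dq
```

The key difference from plain round-to-nearest is that the rounding grid is chosen with the activation statistics in mind, which is why the salient channels lose far less precision.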
| Module | Precision |
|---|---|
| All dense & attention weights | int4 (AWQ) |
| LayerNorm, rotary embeddings, router MLP | fp16 |
| lm_head | fp16 |
| Shard | Size (GB) | Shard | Size (GB) |
|---|---|---|---|
| 1 | 1.21 | 7 | 2.18 |
| 2 | 4.25 | 8 | 4.25 |
| 3 | 2.18 | 9 | 2.18 |
| 4 | 4.25 | 10 | 4.25 |
| 5 | 2.18 | 11 | 2.18 |
| 6 | 4.25 | Total | 33.36 |
Compression vs original FP16 checkpoint:
234 GB / 33.36 GB ≈ 7× smaller
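As a quick sanity check, the shard total and the compression ratio can be recomputed locally. The script below assumes the shards have been downloaded under `./gpt-oss-120b-awq` (a hypothetical path) with the usual `model-*.safetensors` naming.

```python
import glob
import os

shard_paths = sorted(glob.glob("./gpt-oss-120b-awq/model-*.safetensors"))
total_gb = sum(os.path.getsize(p) for p in shard_paths) / 1e9

print(f"{len(shard_paths)} shards, {total_gb:.2f} GB total")
print(f"compression vs 234 GB full-precision checkpoint: {234 / total_gb:.1f}x")
```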
Base model
openai/gpt-oss-120b
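A minimal loading sketch for the W4A16 weights, assuming a vLLM build that supports this MoE + AWQ combination; `<this-repo-id>` is a placeholder for wherever the shards above are published, and the prompt and sampling settings are illustrative only.

```python
from vllm import LLM, SamplingParams

# Placeholder repo id; replace with the actual AWQ repo for this release.
llm = LLM(model="<this-repo-id>", quantization="awq", max_model_len=8192)

out = llm.generate(
    ["Explain mixture-of-experts routing in one paragraph."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(out[0].outputs[0].text)
```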