gpt-oss 120b Collection
The first lossless gpt-oss-20b/120b quantisation repo on Hugging Face or ModelScope, covering 4-bit AWQ and 8-bit GPTQ schemes, mixed precision, and sparsity.
A 4-bit AWQ-quantised release of gpt-oss-120b
TL;DR – We convert the original FP16/FP32 checkpoint (≈ 234 GB) of gpt-oss-120b into a 4-bit weight-only model with 16-bit activations (W4A16).
The resulting 11-shard safetensors bundle is ≈ 33.4 GB, a roughly 7× size reduction with negligible quality loss.
| Property | Value |
|---|---|
| Architecture | Mixture-of-Experts Transformer |
| Total parameters | 117 B |
| Active parameters / token | 5.1 B |
| Layers | 36 |
| Experts | 128 (4 routed per token) |
| Hidden size / head dim | 2880 / 64 |
| Context window (max RoPE) | 131 072 tokens |
| Activation function | SwiGLU |
| Norm | RMSNorm (ε = 1e-5) |
| RoPE scaling | YaRN (θ = 150 000) |
| Training data cut-off | 2024-06-01 |
AWQ protects the ~1 % most activation-sensitive channels by rescaling them before 4-bit rounding, vastly reducing quantisation error compared with vanilla GPTQ.
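To make the idea concrete, here is a minimal, self-contained sketch of AWQ-style weight-only quantisation (not the exact recipe used for this release): the most activation-sensitive input channels are scaled up before symmetric 4-bit rounding, and the inverse scale is folded back afterwards so the layer output is mathematically unchanged. All names and the boost factor are illustrative.

```python
import torch

def awq_style_quantize(weight: torch.Tensor, act_scale: torch.Tensor,
                       n_bits: int = 4, protect_frac: float = 0.01,
                       boost: float = 2.0):
    """weight: [out_features, in_features]; act_scale: mean |activation| per input channel."""
    # Pick the ~1% of input channels with the largest average activation magnitude.
    n_protect = max(1, int(protect_frac * weight.shape[1]))
    important = torch.topk(act_scale, n_protect).indices

    # Per-channel scales: boost the important channels so rounding error hits them less.
    scales = torch.ones(weight.shape[1])
    scales[important] = boost

    w_scaled = weight * scales            # scale the input-channel columns

    # Symmetric per-output-channel 4-bit quantisation of the scaled weights.
    qmax = 2 ** (n_bits - 1) - 1          # 7 for int4
    w_absmax = w_scaled.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    step = w_absmax / qmax
    w_q = torch.clamp(torch.round(w_scaled / step), -qmax - 1, qmax)

    # Dequantised weights; at inference the activations are divided by `scales`
    # (or the scales are folded into the previous layer), so y = x @ W.T is preserved.
    w_dq = w_q * step / scales
    return w_q.to(torch.int8), step, scales, w_dq
```

The key difference from plain round-to-nearest is that the rounding grid is chosen with the activation statistics in mind, which is why the salient channels lose far less precision.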
| Module | Precision |
|---|---|
| All dense & attention weights | int4 (AWQ) |
| LayerNorm, rotary embeddings, router MLP | fp16 |
| lm_head | fp16 |
| Shard | Size (GB) | Shard | Size (GB) |
|---|---|---|---|
| 1 | 1.21 | 7 | 2.18 |
| 2 | 4.25 | 8 | 4.25 |
| 3 | 2.18 | 9 | 2.18 |
| 4 | 4.25 | 10 | 4.25 |
| 5 | 2.18 | 11 | 2.18 |
| 6 | 4.25 | Total | 33.36 |
Compression vs original FP16 checkpoint:
234 GB / 33.36 GB ≈ 7× smaller
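As a quick sanity check, the shard total and the compression ratio can be recomputed locally. The script below assumes the shards have been downloaded under `./gpt-oss-120b-awq` (a hypothetical path) with the usual `model-*.safetensors` naming.

```python
import glob
import os

shard_paths = sorted(glob.glob("./gpt-oss-120b-awq/model-*.safetensors"))
total_gb = sum(os.path.getsize(p) for p in shard_paths) / 1e9

print(f"{len(shard_paths)} shards, {total_gb:.2f} GB total")
print(f"compression vs 234 GB full-precision checkpoint: {234 / total_gb:.1f}x")
```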
Base model
openai/gpt-oss-120b
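A minimal loading sketch for the W4A16 weights, assuming a vLLM build that supports this MoE + AWQ combination; `<this-repo-id>` is a placeholder for wherever the shards above are published, and the prompt and sampling settings are illustrative only.

```python
from vllm import LLM, SamplingParams

# Placeholder repo id; replace with the actual AWQ repo for this release.
llm = LLM(model="<this-repo-id>", quantization="awq", max_model_len=8192)

out = llm.generate(
    ["Explain mixture-of-experts routing in one paragraph."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(out[0].outputs[0].text)
```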