TheDrummer/GLM-Steam-106B-A12B-v1 (AWQ 4-bit quant)
This repo contains GLM-Steam-106B-A12B-v1 quantized with AWQ to mixed 4-bit/16-bit precision, following state-of-the-art Mixture-of-Experts quantization practice. Calibration used a careful selection of datasets covering math, science, philosophy, business, fiction, roleplay, creative writing, general knowledge and multilingual data, to plausibly ensure that all 127 experts of the model were activated by enough calibration samples.
- Original Model: TheDrummer/GLM-Steam-106B-A12B-v1
The model requires ~65.7 GiB of VRAM for the weights plus ~23 GiB for a KV cache at 131072 tokens of context. This fits neatly on 4x24GB, 2x48GB or 1x96GB GPUs.
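The KV-cache figure can be sanity-checked from the attention shape of the base architecture. A minimal sketch, assuming GLM-4.5-Air's published configuration (46 layers, 8 grouped-query KV heads, head dimension 128) and a BF16 cache:

```python
# Rough KV-cache sizing for a 131072-token context.
# Assumed values (check config.json of the base model):
# 46 layers, 8 KV heads (GQA), head_dim 128, BF16 cache (2 bytes/element).
num_layers = 46
num_kv_heads = 8
head_dim = 128
bytes_per_elem = 2          # BF16
context_len = 131072

# Factor 2 accounts for keys + values.
kv_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * context_len
print(f"KV cache: {kv_bytes / 2**30:.1f} GiB")   # ~23.0 GiB
```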
🔥 Usage & Running Instructions
The model was tested with vLLM on 1x RTX Pro 6000; below is a launch script suitable for this configuration with a 131072-token context length.
Recommendations
It is, however, recommended to use only 65K of context to avoid significant degradation (https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87).
The recommended sampler is min-p sampling. It is available through
both the older Text Completions API and the Chat Completions API (and the newer Responses API);
however, most LLM frontends only support modifying min-p when using Text Completions.
You can, however, use `--override-generation-config "${SAMPLER_OVERRIDE}"` to override the sampler (the effective configuration is a merge of `generation_config.json` and vLLM defaults).
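If your frontend cannot set min-p, it can also be passed directly in the request body: vLLM's OpenAI-compatible server accepts `min_p` as an extra sampling parameter. A minimal sketch against the Text Completions endpoint started by the script below (host, port and served model name follow that script's defaults; adjust as needed):

```python
import requests

# Assumes the vLLM server from the script below, listening on the default port 8000.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "GLM-Steam-v1",       # matches --served-model-name
        "prompt": "Once upon a time",
        "max_tokens": 256,
        # Recommended sampler settings (see SAMPLER_OVERRIDE below)
        "temperature": 1.0,
        "top_p": 1.0,
        "min_p": 0.01,                 # vLLM extension to the OpenAI schema
    },
    timeout=120,
)
print(resp.json()["choices"][0]["text"])
```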
Running script
```bash
# Model configuration (Mandatory)
MODEL="mratsim/GLM-Steam-106B-A12B-v1-AWQ"
MODELNAME="GLM-Steam-v1"
GPU_UTIL=0.97
MAX_LEN=131072

# Sampling configuration (Optional, if departing from `generation_config.json`)
# Values from the model card https://rentry.org/geechan#model-specific-presets
SAMPLER_OVERRIDE='{"temperature": 1, "min_p": 0.01, "top_p": 1}'

# Prevent vLLM from using 100% CPU when idle (strongly recommended)
export VLLM_SLEEP_WHEN_IDLE=1

# Prevent memory fragmentation
export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512

# Use the FlashInfer backend (fastest, recommended, "instant" context reprocessing);
# it however requires reducing MAX_LEN to 120000 tokens and GPU_UTIL to 0.95
# export VLLM_ATTENTION_BACKEND=FLASHINFER

vllm serve "${MODEL}" \
    --served-model-name "${MODELNAME}" \
    --max-model-len ${MAX_LEN} \
    --gpu-memory-utilization ${GPU_UTIL} \
    --override-generation-config "${SAMPLER_OVERRIDE}"
```
ℹ️ The FlashInfer backend may fail with an error similar to:

```
Failed to allocate memory for batch_prefill_tmp_v with size XYZ and alignment 16 in AlignedAllocator.
```

A workaround is to run a sed replacement inside the vLLM installation to increase the workspace buffer size:

```bash
sed -i 's/FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 \* 1024 \* 1024/FLASHINFER_WORKSPACE_BUFFER_SIZE = 768 \* 1024 \* 1024/g' vllm/v1/attention/backends/flashinfer.py
```

This will be fixed by PR https://github.com/vllm-project/vllm/pull/25344 or https://github.com/vllm-project/vllm/pull/28269.
🔬 Quantization method
The llmcompressor library was used with the following recipe:
```yaml
default_stage:
  default_modifiers:
    AWQModifier:
      config_groups:
        group_0:
          targets: ['re:.*mlp\.experts\.[0-9]+\.(down|gate|up)_proj$']
          weights:
            num_bits: 4
            type: int
            symmetric: true
            group_size: 32
            strategy: group
            block_structure: null
            dynamic: false
            actorder: null
            observer: mse
            observer_kwargs: {}
          input_activations: null
          output_activations: null
          format: null
      targets: ['re:.*mlp\.experts\.[0-9]+\.(down|gate|up)_proj$']
      ignore: []
      mappings:
      - smooth_layer: re:.*post_attention_layernorm$
        balance_layers: ['re:.*gate_proj$', 're:.*up_proj$']
      - smooth_layer: re:.*up_proj$
        balance_layers: ['re:.*down_proj$']
      duo_scaling: true
```
and calibrated with over 1600 samples, each up to 8192 tokens of sequence length, drawn from:
- neuralmagic/calibration
- HuggingFaceH4/ultrachat_200k
- nvidia/OpenCodeInstruct
- CSJianYang/CodeArena
- nvidia/OpenScienceReasoning-2
- MegaScience/MegaScience
- Gryphe/Opus-WritingPrompts
- ServiceNow-AI/M2Lingual
- anthracite-org/stheno-filtered-v1.1
- zerofata/Roleplay-Anime-Characters
- zerofata/Instruct-Anime
- zerofata/Instruct-Anime-CreativeWriting
- sam-paech/gutenberg3-generalfiction-scifi-fantasy-romance-adventure-dpo
- nvidia/OpenMathInstruct-2
- fka/awesome-chatgpt-prompts
- databricks/databricks-dolly-15k
- FreedomIntelligence/SocraticChat
- ruggsea/stanford-encyclopedia-of-philosophy_instruct
- mlfoundations-dev/stackexchange_philosophy
- theoldmandthesea/17k_business_book
- anthracite-org/nopm_claude_writing_fixed
According to the AWQ presentation, only 64 samples are needed. However, due to the Mixture-of-Experts topology, this implies that each of the 127 experts needs to see at least 64 samples, or alternatively that all experts are forced active during calibration, which requires reimplementing the model's MoE block in llmcompressor's modeling DB.
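For reference, a calibration run of this kind looks roughly like the sketch below using llmcompressor's `oneshot` entry point. This is not the exact script used: the recipe is assumed to be saved as `awq_recipe.yaml` (the YAML above), only one of the ~20 calibration datasets is shown, and dataset mixing as well as any MoE-specific calibration handling are omitted.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot

MODEL_ID = "TheDrummer/GLM-Steam-106B-A12B-v1"
MAX_SEQ_LEN = 8192
NUM_SAMPLES = 128  # per-dataset share of the >1600 total samples

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# One of the calibration sources listed above; the real run mixed ~20 datasets.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], max_length=MAX_SEQ_LEN, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

oneshot(
    model=model,
    dataset=ds,
    recipe="awq_recipe.yaml",   # the AWQ recipe shown above, saved to disk
    max_seq_length=MAX_SEQ_LEN,
    num_calibration_samples=NUM_SAMPLES,
)

model.save_pretrained("GLM-Steam-106B-A12B-v1-AWQ", save_compressed=True)
tokenizer.save_pretrained("GLM-Steam-106B-A12B-v1-AWQ")
```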
Deep-dive
Quantization should focus on Linear layers (also called Dense or Fully-Connected layers, i.e. MatMul+Bias). In particular, quantizing LayerNorm/RMSNorm layers is strongly discouraged, see [1]:
LayerNorm in Quantization. Kovaleva et al. (2021); Wei et al. (2022) find that outliers in the LayerNorm parameters of BERT (Devlin et al., 2019) cause difficulties in model compression. Given the importance of LayerNorm, all the quantization methods we discuss above leave LayerNorm unquantized.
Note: expert layers might not be stored as Linear layers, meaning they might be skipped when using llmcompressor with a plain Linear target.
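One way to check this without materializing the full 106B of weights is to instantiate the model on the meta device and inspect the classes of the expert projections. A minimal sketch, assuming the usual transformers/accelerate APIs and the standard GLM-4.5-Air module layout:

```python
import re

from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("TheDrummer/GLM-Steam-106B-A12B-v1")

# Build the module tree without allocating any weights.
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

expert_proj = re.compile(r".*mlp\.experts\.[0-9]+\.(down|gate|up)_proj$")
classes = {type(m).__name__ for name, m in model.named_modules() if expert_proj.match(name)}
print(classes)  # e.g. {'Linear'}; a fused/custom class here means a plain 'Linear' target would miss it
```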
Some layers have a higher impact on LLM performance than others. According to [2], spending more bits on attention layers results in larger gains than spending them on FFN layers. According to [3], at 2-bit quantization:
- quantizing expert FFN layers does not seriously impact model quality
- quantizing cross-attention has some impact
- quantizing self-attention has a large impact
- quantizing dense FFN has a very significant impact
Hence, to preserve model quality, we choose not to quantize dense FFN layers (i.e. shared experts) and self-attention layers.
We notice that:
- official MXFP4 weights of gpt-oss-120b from OpenAI keep self-attention in BF16
- NVFP4 weights of DeepSeek-R1 quantized by Nvidia also keep self-attention in BF16
According to [2], giving more bits to the first k blocks has a significantly higher impact on model quality than giving them to the last k blocks.
In this case, we keep the first layer unquantized: "first_k_dense_replace": 1 in config.json makes it a dense (non-MoE) layer, which our expert-only targets never match.
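Putting this together, the target regex from the recipe above only matches routed-expert projections; shared experts, self-attention and the dense first layer never match and therefore stay in 16-bit. A small illustration with representative module names (the paths follow the usual GLM-4.5-Air layout and are only illustrative):

```python
import re

# Target regex from the AWQ recipe above.
target = re.compile(r".*mlp\.experts\.[0-9]+\.(down|gate|up)_proj$")

candidates = [
    "model.layers.5.mlp.experts.42.down_proj",      # routed expert  -> quantized to 4-bit
    "model.layers.5.mlp.shared_experts.down_proj",  # shared expert  -> kept in 16-bit
    "model.layers.5.self_attn.q_proj",              # self-attention -> kept in 16-bit
    "model.layers.0.mlp.gate_proj",                 # layer 0 is dense (first_k_dense_replace=1) -> kept in 16-bit
]
for name in candidates:
    print(f"{name:48s} quantized={bool(target.match(name))}")
```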
References
[1] Why Do Some Inputs Break Low-Bit LLM Quantization? (2025)
Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia
https://arxiv.org/pdf/2506.12044

[2] Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark (2024)
Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen
https://arxiv.org/pdf/2406.08155v1

[3] Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (2023)
Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla
https://arxiv.org/pdf/2310.02410