TheDrummer/GLM-Steam-106B-A12B-v1 (AWQ 4-bit quant)
This repo contains GLM-Steam-106B-A12B-v1 quantized with AWQ to mixed 4-bit/16-bit precision, following state-of-the-art Mixture-of-Experts quantization practice. Calibration used a careful selection of datasets covering math, science, philosophy, business, fiction, roleplay, creative writing, general knowledge and multilingual data, to plausibly ensure that all 127 experts of the model were activated by enough calibration samples.
- Original Model: TheDrummer/GLM-Steam-106B-A12B-v1
The model requires ~65.7 GiB of VRAM for the weights plus ~23 GiB for a KV cache at 131072 tokens of context. This fits neatly on 4x24GB, 2x48GB or 1x96GB GPUs.
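The KV-cache figure can be sanity-checked from the attention shape of the base architecture. A minimal sketch, assuming GLM-4.5-Air's published configuration (46 layers, 8 grouped-query KV heads, head dimension 128) and a BF16 cache:

```python
# Rough KV-cache sizing for a 131072-token context.
# Assumed values (check config.json of the base model):
# 46 layers, 8 KV heads (GQA), head_dim 128, BF16 cache (2 bytes/element).
num_layers = 46
num_kv_heads = 8
head_dim = 128
bytes_per_elem = 2          # BF16
context_len = 131072

# Factor 2 accounts for keys + values.
kv_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * context_len
print(f"KV cache: {kv_bytes / 2**30:.1f} GiB")   # ~23.0 GiB
```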
🔥 Usage & Running Instructions
The model was tested with vLLM on 1x RTX Pro 6000; below is a launch script suitable for this configuration with a 131072-token context length.
Recommendations
It is, however, recommended to use only 65K of context to avoid significant degradation (https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87).
The recommended sampler is min-p sampling. It is available through
both the older Text Completions API and the Chat Completions API (and the newer Responses API);
however, most LLM frontends only support modifying min-p when using Text Completions.
You can, however, use `--override-generation-config "${SAMPLER_OVERRIDE}"` to override the sampler (the effective configuration is a merge of `generation_config.json` and vLLM defaults).
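If your frontend cannot set min-p, it can also be passed directly in the request body: vLLM's OpenAI-compatible server accepts `min_p` as an extra sampling parameter. A minimal sketch against the Text Completions endpoint started by the script below (host, port and served model name follow that script's defaults; adjust as needed):

```python
import requests

# Assumes the vLLM server from the script below, listening on the default port 8000.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "GLM-Steam-v1",       # matches --served-model-name
        "prompt": "Once upon a time",
        "max_tokens": 256,
        # Recommended sampler settings (see SAMPLER_OVERRIDE below)
        "temperature": 1.0,
        "top_p": 1.0,
        "min_p": 0.01,                 # vLLM extension to the OpenAI schema
    },
    timeout=120,
)
print(resp.json()["choices"][0]["text"])
```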
Running script
```bash
# Model configuration (Mandatory)
MODEL="mratsim/GLM-Steam-106B-A12B-v1-AWQ"
MODELNAME="GLM-Steam-v1"
GPU_UTIL=0.97
MAX_LEN=131072

# Sampling configuration (Optional, if departing from `generation_config.json`)
# Values from the model card https://rentry.org/geechan#model-specific-presets
SAMPLER_OVERRIDE='{"temperature": 1, "min_p": 0.01, "top_p": 1}'

# Prevent vLLM from using 100% CPU when idle (strongly recommended)
export VLLM_SLEEP_WHEN_IDLE=1

# Prevent memory fragmentation
export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512

# Use the FlashInfer backend (fastest, recommended, "instant" context reprocessing);
# it however requires reducing MAX_LEN to 120000 tokens and GPU_UTIL to 0.95
# export VLLM_ATTENTION_BACKEND=FLASHINFER

vllm serve "${MODEL}" \
    --served-model-name "${MODELNAME}" \
    --max-model-len ${MAX_LEN} \
    --gpu-memory-utilization ${GPU_UTIL} \
    --override-generation-config "${SAMPLER_OVERRIDE}"
```
ℹ️ The FlashInfer backend may fail with an error similar to:

```
Failed to allocate memory for batch_prefill_tmp_v with size XYZ and alignment 16 in AlignedAllocator.
```

A workaround is to run a sed replacement inside the vLLM installation to increase the workspace buffer size:

```bash
sed -i 's/FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 \* 1024 \* 1024/FLASHINFER_WORKSPACE_BUFFER_SIZE = 768 \* 1024 \* 1024/g' vllm/v1/attention/backends/flashinfer.py
```

This will be fixed by PR https://github.com/vllm-project/vllm/pull/25344 or https://github.com/vllm-project/vllm/pull/28269.
🔬 Quantization method
The llmcompressor library was used with the following recipe:
```yaml
default_stage:
  default_modifiers:
    AWQModifier:
      config_groups:
        group_0:
          targets: ['re:.*mlp\.experts\.[0-9]+\.(down|gate|up)_proj$']
          weights:
            num_bits: 4
            type: int
            symmetric: true
            group_size: 32
            strategy: group
            block_structure: null
            dynamic: false
            actorder: null
            observer: mse
            observer_kwargs: {}
          input_activations: null
          output_activations: null
          format: null
      targets: ['re:.*mlp\.experts\.[0-9]+\.(down|gate|up)_proj$']
      ignore: []
      mappings:
      - smooth_layer: re:.*post_attention_layernorm$
        balance_layers: ['re:.*gate_proj$', 're:.*up_proj$']
      - smooth_layer: re:.*up_proj$
        balance_layers: ['re:.*down_proj$']
      duo_scaling: true
```
and calibrated with over 1600 samples, each up to 8192 tokens of sequence length, drawn from:
- neuralmagic/calibration
- HuggingFaceH4/ultrachat_200k
- nvidia/OpenCodeInstruct
- CSJianYang/CodeArena
- nvidia/OpenScienceReasoning-2
- MegaScience/MegaScience
- Gryphe/Opus-WritingPrompts
- ServiceNow-AI/M2Lingual
- anthracite-org/stheno-filtered-v1.1
- zerofata/Roleplay-Anime-Characters
- zerofata/Instruct-Anime
- zerofata/Instruct-Anime-CreativeWriting
- sam-paech/gutenberg3-generalfiction-scifi-fantasy-romance-adventure-dpo
- nvidia/OpenMathInstruct-2
- fka/awesome-chatgpt-prompts
- databricks/databricks-dolly-15k
- FreedomIntelligence/SocraticChat
- ruggsea/stanford-encyclopedia-of-philosophy_instruct
- mlfoundations-dev/stackexchange_philosophy
- theoldmandthesea/17k_business_book
- anthracite-org/nopm_claude_writing_fixed
According to the AWQ presentation, only 64 samples are needed. However, due to the Mixture-of-Experts topology, this implies that each of the 127 experts needs to see at least 64 samples, or alternatively that all experts are forced active during calibration, which requires reimplementing the model's MoE block in llmcompressor's modeling DB.
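For reference, a calibration run of this kind looks roughly like the sketch below using llmcompressor's `oneshot` entry point. This is not the exact script used: the recipe is assumed to be saved as `awq_recipe.yaml` (the YAML above), only one of the ~20 calibration datasets is shown, and dataset mixing as well as any MoE-specific calibration handling are omitted.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot

MODEL_ID = "TheDrummer/GLM-Steam-106B-A12B-v1"
MAX_SEQ_LEN = 8192
NUM_SAMPLES = 128  # per-dataset share of the >1600 total samples

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# One of the calibration sources listed above; the real run mixed ~20 datasets.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], max_length=MAX_SEQ_LEN, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

oneshot(
    model=model,
    dataset=ds,
    recipe="awq_recipe.yaml",   # the AWQ recipe shown above, saved to disk
    max_seq_length=MAX_SEQ_LEN,
    num_calibration_samples=NUM_SAMPLES,
)

model.save_pretrained("GLM-Steam-106B-A12B-v1-AWQ", save_compressed=True)
tokenizer.save_pretrained("GLM-Steam-106B-A12B-v1-AWQ")
```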
Deep-dive
Quantization should focus on Linear layers (also called Dense or Fully-Connected layers, i.e. MatMul+Bias). In particular, quantizing LayerNorm/RMSNorm layers is strongly discouraged, see [1]:
LayerNorm in Quantization. Kovaleva et al. (2021); Wei et al. (2022) find that outliers in the LayerNorm parameters of BERT (Devlin et al., 2019) cause difficulties in model compression. Given the importance of LayerNorm, all the quantization methods we discuss above leave LayerNorm unquantized.
Note: expert layers might not be stored as Linear layers, meaning they might be skipped when using llmcompressor with a plain Linear target.
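One way to check this without materializing the full 106B of weights is to instantiate the model on the meta device and inspect the classes of the expert projections. A minimal sketch, assuming the usual transformers/accelerate APIs and the standard GLM-4.5-Air module layout:

```python
import re

from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("TheDrummer/GLM-Steam-106B-A12B-v1")

# Build the module tree without allocating any weights.
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

expert_proj = re.compile(r".*mlp\.experts\.[0-9]+\.(down|gate|up)_proj$")
classes = {type(m).__name__ for name, m in model.named_modules() if expert_proj.match(name)}
print(classes)  # e.g. {'Linear'}; a fused/custom class here means a plain 'Linear' target would miss it
```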
Some layers have a higher impact on LLM performance than others. According to [2], spending more bits on attention layers results in larger gains than spending them on FFN layers. According to [3], at 2-bit quantization:
- quantizing expert FFN layers does not seriously impact model quality
- quantizing cross-attention has some impact
- quantizing self-attention has a large impact
- quantizing dense FFN has a very significant impact
Hence, to preserve model quality, we choose not to quantize dense FFN layers (i.e. shared experts) and self-attention layers.
We notice that:
- official MXFP4 weights of gpt-oss-120b from OpenAI keep self-attention in BF16
- NVFP4 weights of DeepSeek-R1 quantized by Nvidia also keep self-attention in BF16
According to [2], giving more bits to the first k blocks has a significantly higher impact on model quality than giving them to the last k blocks.
In this case, we keep the first layer unquantized: "first_k_dense_replace": 1 in config.json makes it a dense (non-MoE) layer, which our expert-only targets never match.
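Putting this together, the target regex from the recipe above only matches routed-expert projections; shared experts, self-attention and the dense first layer never match and therefore stay in 16-bit. A small illustration with representative module names (the paths follow the usual GLM-4.5-Air layout and are only illustrative):

```python
import re

# Target regex from the AWQ recipe above.
target = re.compile(r".*mlp\.experts\.[0-9]+\.(down|gate|up)_proj$")

candidates = [
    "model.layers.5.mlp.experts.42.down_proj",      # routed expert  -> quantized to 4-bit
    "model.layers.5.mlp.shared_experts.down_proj",  # shared expert  -> kept in 16-bit
    "model.layers.5.self_attn.q_proj",              # self-attention -> kept in 16-bit
    "model.layers.0.mlp.gate_proj",                 # layer 0 is dense (first_k_dense_replace=1) -> kept in 16-bit
]
for name in candidates:
    print(f"{name:48s} quantized={bool(target.match(name))}")
```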
References
[1] Why Do Some Inputs Break Low-Bit LLM Quantization? (2025)
Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia
https://arxiv.org/pdf/2506.12044

[2] Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark (2024)
Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen
https://arxiv.org/pdf/2406.08155v1

[3] Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (2023)
Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla
https://arxiv.org/pdf/2310.02410