Qwen3-30B-A3B-NotaMoEQuant-Int4

Overview

We developed a weight-only quantization method specialized for the Mixture-of-Experts (MoE) architecture, and we release Qwen3-30B-A3B quantized with our algorithm. The quantized weights are packed using an AutoRound-based quantization format.
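Weight-only quantization stores the linear-layer weights in low precision (here INT4) while activations stay in 16-bit floats. As a rough illustration of the general idea, not of the released algorithm, a group-wise symmetric INT4 round-trip can be sketched like this (the function names and the group size of 128 are illustrative choices):

```python
import numpy as np

def quantize_int4(w: np.ndarray, group_size: int = 128):
    """Group-wise symmetric INT4 quantization of a flat weight vector.

    Each group of `group_size` values shares one floating-point scale;
    the integer codes fit in 4 bits ([-8, 7]).
    """
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scale = np.maximum(scale, 1e-12)  # avoid division by zero for all-zero groups
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an FP32 approximation of the original weights."""
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize_int4(q, scale)
max_err = float(np.abs(w - w_hat).max())  # bounded by half a scale step per group
```

Real INT4 formats such as AutoRound's additionally pack two 4-bit codes per byte and store per-group scales (and possibly zero points) alongside the weights.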

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nota-ai/Qwen3-30B-A3B-NotaMoEQuant-Int4"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "What is a large language model?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # enable Qwen3's thinking mode
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=100
)

print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
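With enable_thinking=True, Qwen3 wraps its reasoning in <think>…</think> tags before the final answer. A small helper (the function name is ours; decode with skip_special_tokens=False so the tags survive) can separate the two parts of the decoded text:

```python
def split_thinking(text: str):
    """Split decoded Qwen3 output into (thinking, answer) at the </think> tag."""
    marker = "</think>"
    if marker not in text:
        return "", text.strip()
    thinking, answer = text.split(marker, 1)
    return thinking.replace("<think>", "").strip(), answer.strip()

# e.g. split_thinking(tokenizer.decode(generated_ids[0], skip_special_tokens=False))
```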

Performance

| Model | PPL (WikiText2) | MMLU-Pro | AIME25 | LiveCodeBench v6 | Total TPS (Tokens/Sec.) | Memory (GB) |
|---|---|---|---|---|---|---|
| Qwen3-30B-A3B (BF16) | 10.8955 | 75.47 | 70.00 | 55.70 | 1136.56 | 58.23 |
| Nota MoEQuant (INT4) | 11.3046 | 74.84 | 70.00 | 60.18 | 1262.21 | 16.01 |

Note

  • Nota MoEQuant uses 8-bit quantization for the gate layer, while all other linear layers are quantized to 4 bits.
  • Tokens per second (TPS) is measured with 16 concurrent requests, using 20,000 tokens for prefill and 20,000 tokens for decoding.
  • Memory indicates the GPU memory allocated for model parameters.
  • Model evaluations were conducted using AutoRound==0.8.0 and vLLM==0.12.0.
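The memory column can be sanity-checked with a back-of-envelope calculation, assuming roughly 30.5B total parameters (the exact parameter count and per-layer bit widths are not reproduced here):

```python
# Approximate parameter memory, ignoring per-group quantization scales/zeros
# and the 8-bit gate layers.
PARAMS = 30.5e9  # assumed total parameter count

bf16_gb = PARAMS * 2 / 1024**3    # 2 bytes per BF16 parameter -> ~57 GB
int4_gb = PARAMS * 0.5 / 1024**3  # 4 bits per parameter       -> ~14 GB
```

The measured 58.23 GB and 16.01 GB are slightly higher than these estimates, which is consistent with the extra storage for quantization metadata and the 8-bit gate layers.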