Qwen3-30B-A3B-NotaMoeQuant-Int4
Overview
We developed a weight-only quantization method specialized for the Mixture-of-Experts (MoE) architecture, and we release Qwen3-30B-A3B quantized with our algorithm. The quantized weights are packed using an AutoRound-based quantization format.
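Because the weights are packed in an AutoRound-based format, the quantization settings travel with the checkpoint and can be inspected before downloading the full weights. The snippet below is a minimal sketch, assuming the settings are serialized under `quantization_config` in the checkpoint's `config.json` (typical for AutoRound-format exports); exact field names may differ.

```python
from transformers import AutoConfig

# Fetch only config.json and print the packed quantization settings, if present.
config = AutoConfig.from_pretrained("nota-ai/Qwen3-30B-A3B-NotaMoEQuant-Int4")
print(getattr(config, "quantization_config", "no quantization_config found"))
```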
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nota-ai/Qwen3-30B-A3B-NotaMoEQuant-Int4"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "What is a large language model?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # toggle the model's thinking mode
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=100
)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```
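The checkpoint can also be served with vLLM, which is what the evaluation in the Note section below uses. The following is a minimal sketch, assuming vLLM's quantized-weight loaders accept the AutoRound-packed format directly from the checkpoint config; the sampling settings are illustrative only.

```python
from vllm import LLM, SamplingParams

# Load the INT4 checkpoint; the quantization scheme is read from the checkpoint config.
llm = LLM(model="nota-ai/Qwen3-30B-A3B-NotaMoEQuant-Int4")

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=100)
outputs = llm.generate(["What is a large language model?"], sampling_params)
print(outputs[0].outputs[0].text)
```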
Performance
| Model | PPL (WikiText2) | MMLU-Pro | AIME25 | LiveCodeBench v6 | Total TPS (Tokens/Sec.) | Memory (GB) |
|---|---|---|---|---|---|---|
| Qwen3-30B-A3B (BF16) | 10.8955 | 75.47 | 70.00 | 55.70 | 1136.56 | 58.23 |
| Nota MoEQuant (INT4) | 11.3046 | 74.84 | 70.00 | 60.18 | 1262.21 | 16.01 |
Note
- Nota MoEQuant uses 8-bit quantization for the gate layer, while all other linear layers are quantized to 4 bits (see the illustrative sketch after these notes).
- Tokens per sec. (TPS) is measured with 16 requests, using 20,000 tokens for prefill and 20,000 tokens for decoding.
- Memory indicates the allocated GPU memory for model parameters.
- Model evaluations were conducted using AutoRound==0.8.0 and vLLM==0.12.0.
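For readers who want to reproduce a similar mixed-precision layout, the sketch below is a hypothetical illustration (not the released recipe) of how per-layer bit widths can be assigned with the public auto-round `layer_config` option: MoE gate/router layers are kept at 8 bits while everything else defaults to 4 bits. The layer-name pattern, group size, and calibration defaults are assumptions and may need adjustment for your auto-round version.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

base = "Qwen/Qwen3-30B-A3B"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

# Hypothetical per-layer override: keep MoE router ("gate") layers at 8 bits,
# quantize every other linear layer at the default 4 bits.
layer_config = {
    name: {"bits": 8}
    for name, module in model.named_modules()
    if name.endswith("mlp.gate")  # assumed naming pattern for Qwen3 MoE routers
}

autoround = AutoRound(model, tokenizer, bits=4, group_size=128, layer_config=layer_config)
autoround.quantize_and_save("./Qwen3-30B-A3B-int4-mixed", format="auto_round")
```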