Qwen3 Collection
Qwen3 models converted to the CTranslate2 format.
CTranslate2-compatible AWQ 4-bit quantization of Qwen/Qwen3-14B.
- This model requires pull request https://github.com/OpenNMT/CTranslate2/pull/1951 to be accepted; it will only work after that happens.
- This model was made from the AWQ version of the original Qwen3 model. That AWQ version, in turn, was made with a custom fork of the AutoAWQ repository, since the original repository was archived in May 2025. Feel free to message if you run into any issues, but the model has been tested. A rough sketch of the general conversion flow is shown after this list.
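For reference, here is a minimal sketch of what such a conversion generally looks like. It is not the exact procedure used for these checkpoints (the AutoAWQ fork and the converter options used are not documented here); it assumes a CTranslate2 build that includes the AWQ support from the pull request above, and it uses a placeholder path for the AWQ source model.

```python
# Minimal sketch (not the exact procedure used for these checkpoints):
# convert an AWQ-quantized Hugging Face checkpoint to the CTranslate2 format.
# Assumes a CTranslate2 build with the AWQ support from the PR linked above.
import ctranslate2

# "path/to/qwen3-14b-awq" is a placeholder for a locally quantized AWQ model.
converter = ctranslate2.converters.TransformersConverter("path/to/qwen3-14b-awq")
converter.convert("Qwen3-14B-ct2-AWQ", force=True)
```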
| Model | VRAM Usage |
|---|---|
| Qwen3-32B-ct2-awq | ~18.3 GB |
| Qwen3-14B-ct2-awq | ~9.5 GB |
| Qwen3-8B-ct2-awq | ~5.8 GB |
| Qwen3-4B-ct2-awq | ~2.6 GB |
| Qwen3-1.7B-ct2-awq | ~1.3 GB |
| Qwen3-0.6B-ct2-awq | ~0.6 GB |
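The example below downloads the 14B model from the Hub, formats a prompt with the Qwen3 chat template, and generates a response.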
import ctranslate2
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

MODEL_ID = "CTranslate2HQ/Qwen3-14B-ct2-AWQ"

# Download the converted model from the Hugging Face Hub;
# ctranslate2.Generator expects a local model directory.
model_path = snapshot_download(MODEL_ID)
generator = ctranslate2.Generator(model_path, device="cuda")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Format the prompt using the chat template
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Write a short poem about a cat."},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
# Tokenize and generate
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
# Note: with the "ct2-AWQ" models, do not pass a "compute_type" argument when creating the Generator
results = generator.generate_batch(
    [tokens],
    max_length=8192,
    sampling_temperature=0.7,
    sampling_topk=50,
)
# Decode and print response
output_ids = results[0].sequences_ids[0]
response = tokenizer.decode(output_ids, skip_special_tokens=True)
print(response)
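If you prefer to stream tokens as they are produced, CTranslate2 also exposes Generator.generate_tokens. Below is a minimal sketch, reusing the generator, tokenizer, and tokens variables from the example above.

```python
# Minimal streaming sketch: Generator.generate_tokens yields one result per
# generated token, so output can be consumed as generation proceeds.
output_ids = []
for step in generator.generate_tokens(
    tokens,
    max_length=8192,
    sampling_temperature=0.7,
    sampling_topk=50,
):
    # Each step result carries the id of the newly generated token.
    output_ids.append(step.token_id)

print(tokenizer.decode(output_ids, skip_special_tokens=True))
```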
Requirements:
- ctranslate2
- transformers
- torch
- huggingface_hub
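These can typically be installed with `pip install ctranslate2 transformers torch huggingface_hub`. A CUDA-capable GPU is assumed by the `device="cuda"` setting in the example above.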