ELK-AI | Qwen3-Omni-30B-A3B-Instruct-NVFP4
Alibaba's Omni-Modal Foundation Model: Now 63% Smaller
NVFP4 Quantization | 25.68 GB (was 70+ GB) | Text/Vision/Audio Input | Text/Speech Output
Mutaz Al Awamleh • ELK-AI • December 2025
What Is This?
This is Qwen3-Omni-30B-A3B-Instruct, Alibaba's state-of-the-art omni-modal foundation model, with the Thinker component quantized to NVFP4 using NVIDIA's Model Optimizer.
Key Features
- Omni-Modal: Understands text, images, and audio
- Speech Output: Can generate speech responses via Talker component
- MoE Architecture: 30B total params, 3B active (efficient inference)
- NVFP4 Quantized: Thinker LLM compressed by ~69% (≈64 GB → 20 GB)
Size Reduction
| Component | Before | After | Change |
|---|---|---|---|
| Thinker (LLM) | ~64 GB | 20 GB | -69% |
| Talker (Speech) | 6.2 GB | 6.2 GB | BF16 preserved |
| Code2Wav (Audio) | 413 MB | 413 MB | BF16 preserved |
| Total | ~70 GB | 25.68 GB | -63% |
Model Architecture
┌──────────────────────────────────────────────────────────────┐
│                          Qwen3-Omni                          │
├──────────────────────────────────────────────────────────────┤
│  ┌────────────────────────────────────────────────────────┐  │
│  │  THINKER (31.72B params) - NVFP4 Quantized             │  │
│  │    - 48 Transformer layers                             │  │
│  │    - MoE (Mixture of Experts)                          │  │
│  │    - Audio Encoder (embedded)                          │  │
│  │    - Vision Encoder (embedded)                         │  │
│  └────────────────────────────────────────────────────────┘  │
├──────────────────────────────────────────────────────────────┤
│  ┌───────────────────────┐  ┌─────────────────────────────┐  │
│  │  TALKER (3.32B)       │  │  CODE2WAV (0.22B)           │  │
│  │  BF16 - Speech Gen    │  │  BF16 - Audio Synthesis     │  │
│  └───────────────────────┘  └─────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘
Quick Start
Download
huggingface-cli download ELK-AI/Qwen3-Omni-30B-A3B-Instruct-NVFP4 --local-dir ./model
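If you prefer to script the download, the same snapshot can be pulled with the huggingface_hub Python API. A minimal sketch, targeting the same repo as the CLI command above:
from huggingface_hub import snapshot_download

# Download the full repository (thinker, talker, code2wav, tokenizer files)
snapshot_download(
    repo_id="ELK-AI/Qwen3-Omni-30B-A3B-Instruct-NVFP4",
    local_dir="./model",
)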
Docker (Recommended)
docker pull mutazai/qwen3omni-30b-nvfp4:1.0
docker run -d --gpus all \
-v $(pwd)/model:/model \
-p 8000:8000 \
mutazai/qwen3omni-30b-nvfp4:1.0
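Once the container is running, a quick way to confirm the OpenAI-compatible server is up (assuming it listens on the mapped port 8000) is to list the loaded models:
import requests

# The server should report the served model under /v1/models
print(requests.get("http://localhost:8000/v1/models", timeout=10).json())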
vLLM (Thinker Only)
from vllm import LLM, SamplingParams
# Load the quantized Thinker component
llm = LLM(
model="./model/thinker", # Point to thinker subdirectory
quantization="modelopt_fp4",
trust_remote_code=True,
kv_cache_dtype="fp8",
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain quantum computing simply."], sampling_params)
print(outputs[0].outputs[0].text)
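The generate() call above sends a raw prompt. For instruction-style prompts you can instead let vLLM apply the chat template shipped with the checkpoint; a small sketch reusing the llm and sampling_params objects above (LLM.chat is available in recent vLLM releases):
# Chat-style generation: vLLM formats the messages with the bundled chat template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize NVFP4 quantization in two sentences."},
]
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)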
API Usage
OpenAI-Compatible API
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/model/thinker",
"messages": [
{"role": "user", "content": "Write a poem about AI."}
],
"temperature": 0.7,
"max_tokens": 200
}'
Python OpenAI SDK
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
model="/model/thinker",
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=100
)
print(response.choices[0].message.content)
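Streaming works through the same endpoint, since the OpenAI SDK supports it natively. A short sketch reusing the client defined above:
# Stream tokens as they are generated
stream = client.chat.completions.create(
    model="/model/thinker",
    messages=[{"role": "user", "content": "Give me three uses for an omni-modal model."}],
    max_tokens=150,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()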
Hardware Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 32 GB | 48+ GB |
| GPU Model | A100 40GB | B200 / GB10 |
| CUDA | 12.0+ | 13.0 |
| System RAM | 64 GB | 128 GB |
Tested Configurations
- NVIDIA GB10 / DGX Spark
- NVIDIA B200 (Blackwell)
- NVIDIA A100 80GB
- NVIDIA L40S 48GB
Quantization Details
import copy

import modelopt.torch.quantization as mtq

# NVFP4 DEFAULT configuration (deep-copied so the library default is not mutated)
config = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)

# Exclusions for multimodal components (preserved at BF16)
exclusions = {
    "*audio*": {"enable": False},
    "*visual*": {"enable": False},
    "*talker*": {"enable": False},
    "*code2wav*": {"enable": False},
    "*embed_tokens*": {"enable": False},
}
config["quant_cfg"].update(exclusions)

# Quantize the loaded Thinker model in place
mtq.quantize(model, config)
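After quantization, the Thinker still has to be written back out in a Hugging Face layout that vLLM can load. A minimal sketch, assuming your Model Optimizer version ships the unified HF export helper (verify the exact name against your modelopt release; the output path is a placeholder):
from modelopt.torch.export import export_hf_checkpoint

# Write the quantized Thinker as an HF-style checkpoint loadable with quantization="modelopt_fp4"
export_hf_checkpoint(model, export_dir="./thinker-nvfp4")  # choose your own output directory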
Why NVFP4 DEFAULT?
| Config | Accuracy Loss | Time | Calibration |
|---|---|---|---|
| DEFAULT | ~1.0% | 63 min | None |
| AWQ_LITE | ~0.5% | 2-3 hours | 128 samples |
| AWQ_FULL | <0.3% | 11+ hours | 512 samples |
We use DEFAULT for faster deployment. For highest accuracy, consider AWQ_FULL.
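If you do re-run quantization with one of the AWQ variants, Model Optimizer expects a calibration forward loop in addition to the config. A rough sketch, reusing model and exclusions from the block above; the config name NVFP4_AWQ_LITE_CFG and the calibration data are assumptions to check against your modelopt version:
import copy

import modelopt.torch.quantization as mtq

# AWQ-style config (name assumed; check the configs exposed by your modelopt release)
awq_config = copy.deepcopy(mtq.NVFP4_AWQ_LITE_CFG)
awq_config["quant_cfg"].update(exclusions)  # keep the same BF16 exclusions as above

def calib_forward_loop(model):
    # calibration_batches: ~128 tokenized prompts you prepare yourself
    for batch in calibration_batches:
        model(**batch)

mtq.quantize(model, awq_config, forward_loop=calib_forward_loop)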
Directory Structure
Qwen3-Omni-30B-A3B-Instruct-NVFP4/
├── thinker/                 # NVFP4 quantized (20 GB)
│   ├── config.json
│   ├── model-00001-of-00005.safetensors
│   ├── model-00002-of-00005.safetensors
│   ├── model-00003-of-00005.safetensors
│   ├── model-00004-of-00005.safetensors
│   ├── model-00005-of-00005.safetensors
│   └── model.safetensors.index.json
├── talker/                  # BF16 (6.2 GB)
│   ├── config.json
│   ├── model-00001-of-00002.safetensors
│   └── model-00002-of-00002.safetensors
├── code2wav/                # BF16 (413 MB)
│   ├── config.json
│   └── model.safetensors
├── tokenizer.json
├── tokenizer_config.json
├── preprocessor_config.json
└── chat_template.jinja
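Because the components live in separate subdirectories, you can fetch only what you need; text-only serving with vLLM requires just thinker/ plus the root tokenizer and preprocessor files. A sketch using huggingface_hub's allow_patterns:
from huggingface_hub import snapshot_download

# Pull only the quantized Thinker and the shared tokenizer/preprocessor/template files
snapshot_download(
    repo_id="ELK-AI/Qwen3-Omni-30B-A3B-Instruct-NVFP4",
    local_dir="./model",
    allow_patterns=["thinker/*", "*.json", "*.jinja"],
)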
More ELK-AI Optimized Models
| Model | Size | Type | Quantization |
|---|---|---|---|
| Qwen3-VL-32B | 21 GB | Vision-Language | NVFP4 AWQ |
| Qwen3-Omni-30B | 25.68 GB | Omni-Modal | NVFP4 DEFAULT |
| Nemotron3-30B | 31.5 GB | Text | NVFP4 |
| Devstral-24B | 53.8 GB | Code | FP8 |
License
- Model Weights: Qwen License
- Container: Apache 2.0
Acknowledgments
- Alibaba Qwen Team for the Qwen3-Omni model
- NVIDIA for Model Optimizer and NVFP4
- vLLM Team for the inference engine
Built with care by ELK-AI
Mutaz Al Awamleh • December 2025