ELK-AI | Qwen3-Omni-30B-A3B-Instruct-NVFP4

Alibaba's Omni-Modal Foundation Model β€” Now 63% Smaller

NVFP4 Quantization | 25.68 GB (was 70+ GB) | Text/Vision/Audio Input | Text/Speech Output

Docker Hub | CUDA 13 | Blackwell


Mutaz Al Awamleh β€’ ELK-AI β€’ December 2025


What Is This?

This is Qwen3-Omni-30B-A3B-Instruct β€” Alibaba's state-of-the-art omni-modal foundation model β€” with the Thinker component quantized to NVFP4 using NVIDIA's Model Optimizer.

Key Features

  • Omni-Modal: Understands text, images, and audio
  • Speech Output: Can generate speech responses via Talker component
  • MoE Architecture: 30B total params, 3B active (efficient inference)
  • NVFP4 Quantized: Thinker LLM compressed ~69% (~64 GB → 20 GB)

Size Reduction

Component         Before    After      Change
Thinker (LLM)     ~64 GB    20 GB      -69%
Talker (Speech)   6.2 GB    6.2 GB     BF16 preserved
Code2Wav (Audio)  413 MB    413 MB     BF16 preserved
Total             ~70 GB    25.68 GB   -63%
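
To reproduce these numbers against a local download, summing the safetensors shards per component is enough (a minimal Python sketch; it assumes the layout shown under Directory Structure below):

from pathlib import Path

# Sum safetensors shard sizes per component to reproduce the table above.
# Assumes the "Directory Structure" layout: thinker/, talker/, and
# code2wav/ under the download root.
root = Path("./model")
for component in ("thinker", "talker", "code2wav"):
    size_gb = sum(f.stat().st_size for f in (root / component).glob("*.safetensors")) / 1e9
    print(f"{component}: {size_gb:.2f} GB")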

Model Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      Qwen3-Omni                              β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚  THINKER (31.72B params) - NVFP4 Quantized             β”‚  β”‚
β”‚  β”‚  - 48 Transformer layers                               β”‚  β”‚
β”‚  β”‚  - MoE (Mixture of Experts)                            β”‚  β”‚
β”‚  β”‚  - Audio Encoder (embedded)                            β”‚  β”‚
β”‚  β”‚  - Vision Encoder (embedded)                           β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚  TALKER (3.32B)     β”‚  β”‚  CODE2WAV (0.22B)           β”‚    β”‚
β”‚  β”‚  BF16 - Speech Gen  β”‚  β”‚  BF16 - Audio Synthesis     β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Quick Start

Download

huggingface-cli download ELK-AI/Qwen3-Omni-30B-A3B-Instruct-NVFP4 --local-dir ./model
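
The same download can be scripted from Python with huggingface_hub (a sketch, assuming the hub ID above):

from huggingface_hub import snapshot_download

# Fetches all components (thinker/, talker/, code2wav/) plus tokenizer
# files into ./model, resuming interrupted transfers automatically.
snapshot_download(
    repo_id="ELK-AI/Qwen3-Omni-30B-A3B-Instruct-NVFP4",
    local_dir="./model",
)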

Docker (Recommended)

docker pull mutazai/qwen3omni-30b-nvfp4:1.0

docker run -d --gpus all \
  -v $(pwd)/model:/model \
  -p 8000:8000 \
  mutazai/qwen3omni-30b-nvfp4:1.0
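
Once the container is up, a quick liveness check against the OpenAI-compatible endpoint (assuming the image runs vLLM's server, which exposes /v1/models):

curl http://localhost:8000/v1/models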

vLLM (Thinker Only)

from vllm import LLM, SamplingParams

# Load the quantized Thinker component
llm = LLM(
    model="./model/thinker",  # Point to thinker subdirectory
    quantization="modelopt_fp4",
    trust_remote_code=True,
    kv_cache_dtype="fp8",
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain quantum computing simply."], sampling_params)
print(outputs[0].outputs[0].text)
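
To expose the same Thinker over HTTP for the API examples below, vLLM's built-in server can be launched directly (a sketch for running outside Docker; flag names assume a recent vLLM release):

vllm serve ./model/thinker \
  --quantization modelopt_fp4 \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --port 8000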

API Usage

OpenAI-Compatible API

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model/thinker",
    "messages": [
      {"role": "user", "content": "Write a poem about AI."}
    ],
    "temperature": 0.7,
    "max_tokens": 200
  }'

Python OpenAI SDK

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="/model/thinker",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)
print(response.choices[0].message.content)
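
For incremental output, the same endpoint also supports streaming (standard OpenAI SDK usage, nothing model-specific):

# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="/model/thinker",
    messages=[{"role": "user", "content": "Summarize NVFP4 in two sentences."}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)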

Hardware Requirements

Requirement   Minimum      Recommended
GPU VRAM      32 GB        48+ GB
GPU Model     A100 40GB    B200 / GB10
CUDA          12.0+        13.0
System RAM    64 GB        128 GB
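
A quick pre-flight check that your GPUs meet the VRAM floor (a sketch using PyTorch; the 32 GB threshold mirrors the table above):

import torch

# Report each visible GPU and flag anything under the 32 GB minimum.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    vram_gb = props.total_memory / 1e9
    status = "OK" if vram_gb >= 32 else "below minimum"
    print(f"GPU {i}: {props.name}, {vram_gb:.0f} GB ({status})")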

Tested Configurations

  • NVIDIA GB10 / DGX Spark
  • NVIDIA B200 (Blackwell)
  • NVIDIA A100 80GB
  • NVIDIA L40S 48GB

Quantization Details

import copy

import modelopt.torch.quantization as mtq

# NVFP4 DEFAULT configuration (deep-copied so the library's shared
# default config is not mutated by the exclusions below)
config = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)

# Exclusions for multimodal components (preserved at BF16)
exclusions = {
    "*audio*": {"enable": False},
    "*visual*": {"enable": False},
    "*talker*": {"enable": False},
    "*code2wav*": {"enable": False},
    "*embed_tokens*": {"enable": False},
}
config["quant_cfg"].update(exclusions)

# Quantize the Thinker in place
mtq.quantize(model, config)
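
To confirm the exclusions took effect before exporting, Model Optimizer can print a per-module quantizer summary (a quick check; assumes a recent nvidia-modelopt release where this helper is available):

# One line per module: the audio/visual/talker/code2wav/embedding entries
# should show their quantizers disabled, everything else NVFP4.
mtq.print_quant_summary(model)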

Why NVFP4 DEFAULT?

Config     Accuracy Loss   Quantization Time   Calibration
DEFAULT    ~1.0%           63 min              None
AWQ_LITE   ~0.5%           2-3 hours           128 samples
AWQ_FULL   <0.3%           11+ hours           512 samples

We ship the DEFAULT recipe because it quantizes in about an hour with no calibration set; for the lowest accuracy loss, re-quantize with AWQ_FULL.


Directory Structure

Qwen3-Omni-30B-A3B-Instruct-NVFP4/
β”œβ”€β”€ thinker/                    # NVFP4 quantized (20 GB)
β”‚   β”œβ”€β”€ config.json
β”‚   β”œβ”€β”€ model-00001-of-00005.safetensors
β”‚   β”œβ”€β”€ model-00002-of-00005.safetensors
β”‚   β”œβ”€β”€ model-00003-of-00005.safetensors
β”‚   β”œβ”€β”€ model-00004-of-00005.safetensors
β”‚   β”œβ”€β”€ model-00005-of-00005.safetensors
β”‚   └── model.safetensors.index.json
β”œβ”€β”€ talker/                     # BF16 (6.2 GB)
β”‚   β”œβ”€β”€ config.json
β”‚   β”œβ”€β”€ model-00001-of-00002.safetensors
β”‚   └── model-00002-of-00002.safetensors
β”œβ”€β”€ code2wav/                   # BF16 (413 MB)
β”‚   β”œβ”€β”€ config.json
β”‚   └── model.safetensors
β”œβ”€β”€ tokenizer.json
β”œβ”€β”€ tokenizer_config.json
β”œβ”€β”€ preprocessor_config.json
└── chat_template.jinja
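
Note that the tokenizer and preprocessor configs sit at the repo root while weights are split per component, so the shared processor loads from the root (a sketch, assuming transformers resolves Qwen3-Omni's custom processor via trust_remote_code):

from transformers import AutoProcessor

# Tokenizer/preprocessor files live at the download root, not inside
# thinker/, so point at the root directory.
processor = AutoProcessor.from_pretrained("./model", trust_remote_code=True)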

More ELK-AI Optimized Models

Model            Size       Type             Quantization
Qwen3-VL-32B     21 GB      Vision-Language  NVFP4 AWQ
Qwen3-Omni-30B   25.68 GB   Omni-Modal       NVFP4 DEFAULT
Nemotron3-30B    31.5 GB    Text             NVFP4
Devstral-24B     53.8 GB    Code             FP8

License

The upstream Qwen3-Omni-30B-A3B-Instruct model is released under the Apache 2.0 license; this quantized repackaging follows the same terms.

Acknowledgments

  • Alibaba Qwen Team for the Qwen3-Omni model
  • NVIDIA for Model Optimizer and NVFP4
  • vLLM Team for the inference engine

Built with care by ELK-AI

Mutaz Al Awamleh β€’ December 2025
