ELK-AI | Qwen3-Omni-30B-A3B-Instruct-NVFP4

Alibaba's Omni-Modal Foundation Model β€” Now 63% Smaller

NVFP4 Quantization | 25.68 GB (was 70+ GB) | Text/Vision/Audio Input | Text/Speech Output

Docker Hub | CUDA 13 | Blackwell


Mutaz Al Awamleh β€’ ELK-AI β€’ December 2025


What Is This?

This is Qwen3-Omni-30B-A3B-Instruct β€” Alibaba's state-of-the-art omni-modal foundation model β€” with the Thinker component quantized to NVFP4 using NVIDIA's Model Optimizer.

Key Features

  • Omni-Modal: Understands text, images, and audio
  • Speech Output: Can generate speech responses via Talker component
  • MoE Architecture: 30B total params, 3B active (efficient inference)
  • NVFP4 Quantized: Thinker LLM compressed ~69% (~64 GB → 20 GB)

Size Reduction

Component         Before    After      Change
Thinker (LLM)     ~64 GB    20 GB      -69%
Talker (Speech)   6.2 GB    6.2 GB     BF16 preserved
Code2Wav (Audio)  413 MB    413 MB     BF16 preserved
Total             ~70 GB    25.68 GB   -63%
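
To reproduce these numbers against a local download, summing the safetensors shards per component is enough (a minimal Python sketch; it assumes the layout shown under Directory Structure below):

from pathlib import Path

# Sum safetensors shard sizes per component to reproduce the table above.
# Assumes the "Directory Structure" layout: thinker/, talker/, and
# code2wav/ under the download root.
root = Path("./model")
for component in ("thinker", "talker", "code2wav"):
    size_gb = sum(f.stat().st_size for f in (root / component).glob("*.safetensors")) / 1e9
    print(f"{component}: {size_gb:.2f} GB")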

Model Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      Qwen3-Omni                              β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚  THINKER (31.72B params) - NVFP4 Quantized             β”‚  β”‚
β”‚  β”‚  - 48 Transformer layers                               β”‚  β”‚
β”‚  β”‚  - MoE (Mixture of Experts)                            β”‚  β”‚
β”‚  β”‚  - Audio Encoder (embedded)                            β”‚  β”‚
β”‚  β”‚  - Vision Encoder (embedded)                           β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚  TALKER (3.32B)     β”‚  β”‚  CODE2WAV (0.22B)           β”‚    β”‚
β”‚  β”‚  BF16 - Speech Gen  β”‚  β”‚  BF16 - Audio Synthesis     β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Quick Start

Download

huggingface-cli download ELK-AI/Qwen3-Omni-30B-A3B-Instruct-NVFP4 --local-dir ./model
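
The same download can be scripted from Python with huggingface_hub (a sketch, assuming the hub ID above):

from huggingface_hub import snapshot_download

# Fetches all components (thinker/, talker/, code2wav/) plus tokenizer
# files into ./model, resuming interrupted transfers automatically.
snapshot_download(
    repo_id="ELK-AI/Qwen3-Omni-30B-A3B-Instruct-NVFP4",
    local_dir="./model",
)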

Docker (Recommended)

docker pull mutazai/qwen3omni-30b-nvfp4:1.0

docker run -d --gpus all \
  -v $(pwd)/model:/model \
  -p 8000:8000 \
  mutazai/qwen3omni-30b-nvfp4:1.0
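
Once the container is up, a quick liveness check against the OpenAI-compatible endpoint (assuming the image runs vLLM's server, which exposes /v1/models):

curl http://localhost:8000/v1/models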

vLLM (Thinker Only)

from vllm import LLM, SamplingParams

# Load the quantized Thinker component
llm = LLM(
    model="./model/thinker",  # Point to thinker subdirectory
    quantization="modelopt_fp4",
    trust_remote_code=True,
    kv_cache_dtype="fp8",
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain quantum computing simply."], sampling_params)
print(outputs[0].outputs[0].text)
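
To expose the same Thinker over HTTP for the API examples below, vLLM's built-in server can be launched directly (a sketch for running outside Docker; flag names assume a recent vLLM release):

vllm serve ./model/thinker \
  --quantization modelopt_fp4 \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --port 8000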

API Usage

OpenAI-Compatible API

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model/thinker",
    "messages": [
      {"role": "user", "content": "Write a poem about AI."}
    ],
    "temperature": 0.7,
    "max_tokens": 200
  }'

Python OpenAI SDK

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="/model/thinker",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)
print(response.choices[0].message.content)
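
For incremental output, the same endpoint also supports streaming (standard OpenAI SDK usage, nothing model-specific):

# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="/model/thinker",
    messages=[{"role": "user", "content": "Summarize NVFP4 in two sentences."}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)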

Hardware Requirements

Requirement   Minimum      Recommended
GPU VRAM      32 GB        48+ GB
GPU Model     A100 40GB    B200 / GB10
CUDA          12.0+        13.0
System RAM    64 GB        128 GB
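
A quick pre-flight check that your GPUs meet the VRAM floor (a sketch using PyTorch; the 32 GB threshold mirrors the table above):

import torch

# Report each visible GPU and flag anything under the 32 GB minimum.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    vram_gb = props.total_memory / 1e9
    status = "OK" if vram_gb >= 32 else "below minimum"
    print(f"GPU {i}: {props.name}, {vram_gb:.0f} GB ({status})")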

Tested Configurations

  • NVIDIA GB10 / DGX Spark
  • NVIDIA B200 (Blackwell)
  • NVIDIA A100 80GB
  • NVIDIA L40S 48GB

Quantization Details

import copy

import modelopt.torch.quantization as mtq

# NVFP4 DEFAULT configuration (deep-copied so the library's shared
# default config is not mutated by the exclusions below)
config = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)

# Exclusions for multimodal components (preserved at BF16)
exclusions = {
    "*audio*": {"enable": False},
    "*visual*": {"enable": False},
    "*talker*": {"enable": False},
    "*code2wav*": {"enable": False},
    "*embed_tokens*": {"enable": False},
}
config["quant_cfg"].update(exclusions)

# Quantize the Thinker in place
mtq.quantize(model, config)
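
To confirm the exclusions took effect before exporting, Model Optimizer can print a per-module quantizer summary (a quick check; assumes a recent nvidia-modelopt release where this helper is available):

# One line per module: the audio/visual/talker/code2wav/embedding entries
# should show their quantizers disabled, everything else NVFP4.
mtq.print_quant_summary(model)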

Why NVFP4 DEFAULT?

Config     Accuracy Loss   Quantization Time   Calibration
DEFAULT    ~1.0%           63 min              None
AWQ_LITE   ~0.5%           2-3 hours           128 samples
AWQ_FULL   <0.3%           11+ hours           512 samples

We ship the DEFAULT recipe because it quantizes in about an hour with no calibration set; for the lowest accuracy loss, re-quantize with AWQ_FULL.


Directory Structure

Qwen3-Omni-30B-A3B-Instruct-NVFP4/
β”œβ”€β”€ thinker/                    # NVFP4 quantized (20 GB)
β”‚   β”œβ”€β”€ config.json
β”‚   β”œβ”€β”€ model-00001-of-00005.safetensors
β”‚   β”œβ”€β”€ model-00002-of-00005.safetensors
β”‚   β”œβ”€β”€ model-00003-of-00005.safetensors
β”‚   β”œβ”€β”€ model-00004-of-00005.safetensors
β”‚   β”œβ”€β”€ model-00005-of-00005.safetensors
β”‚   └── model.safetensors.index.json
β”œβ”€β”€ talker/                     # BF16 (6.2 GB)
β”‚   β”œβ”€β”€ config.json
β”‚   β”œβ”€β”€ model-00001-of-00002.safetensors
β”‚   └── model-00002-of-00002.safetensors
β”œβ”€β”€ code2wav/                   # BF16 (413 MB)
β”‚   β”œβ”€β”€ config.json
β”‚   └── model.safetensors
β”œβ”€β”€ tokenizer.json
β”œβ”€β”€ tokenizer_config.json
β”œβ”€β”€ preprocessor_config.json
└── chat_template.jinja
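
Note that the tokenizer and preprocessor configs sit at the repo root while weights are split per component, so the shared processor loads from the root (a sketch, assuming transformers resolves Qwen3-Omni's custom processor via trust_remote_code):

from transformers import AutoProcessor

# Tokenizer/preprocessor files live at the download root, not inside
# thinker/, so point at the root directory.
processor = AutoProcessor.from_pretrained("./model", trust_remote_code=True)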

More ELK-AI Optimized Models

Model            Size       Type             Quantization
Qwen3-VL-32B     21 GB      Vision-Language  NVFP4 AWQ
Qwen3-Omni-30B   25.68 GB   Omni-Modal       NVFP4 DEFAULT
Nemotron3-30B    31.5 GB    Text             NVFP4
Devstral-24B     53.8 GB    Code             FP8

License

The upstream Qwen3-Omni-30B-A3B-Instruct model is released under the Apache 2.0 license; this quantized repackaging follows the same terms.

Acknowledgments

  • Alibaba Qwen Team for the Qwen3-Omni model
  • NVIDIA for Model Optimizer and NVFP4
  • vLLM Team for the inference engine

Built with care by ELK-AI

Mutaz Al Awamleh β€’ December 2025
