# Qwen3-VL-8B-Thinking NVFP4 W4A16

**First NVFP4 quantization of Qwen3-VL-8B-Thinking**

By Mutaz Al Awamleh | ELK-AI
## Model Description
This is the first publicly available NVFP4 W4A16 quantized version of Qwen3-VL-8B-Thinking, a vision-language reasoning model, optimized for the NVIDIA Blackwell (SM121) architecture.
| Attribute | Original | NVFP4 Quantized |
|---|---|---|
| Parameters | 8B | Same |
| Architecture | Vision-Language + Thinking | Same |
| Model Size | ~17 GB | ~7.1 GB |
| Memory Savings | - | 58% |
| Precision | BF16 | FP4 W4A16 |
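The ~58% reduction is roughly what 4-bit block quantization predicts. A back-of-envelope sketch (the split between quantized and kept-in-BF16 parameters below is an assumption for illustration; the checkpoint's actual layer assignment determines the exact size):

```python
# Rough NVFP4 W4A16 size estimate -- illustrative assumptions,
# not measured from the checkpoint.
quantized_params = 6.7e9   # assumed: most linear-layer weights go to FP4
bf16_params      = 1.3e9   # assumed: embeddings, norms, vision tower stay BF16

weights_gb = quantized_params * 0.5 / 1e9   # 4 bits = 0.5 bytes per weight
scales_gb  = quantized_params / 16 / 1e9    # one FP8 (1-byte) scale per block of 16
bf16_gb    = bf16_params * 2 / 1e9          # 2 bytes per BF16 parameter

print(f"~{weights_gb + scales_gb + bf16_gb:.1f} GB")  # ~6.4 GB, near the reported ~7.1 GB
```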
## Quick Start

### Using vLLM (Recommended)
```python
from vllm import LLM, SamplingParams

# Load the NVFP4 checkpoint with ModelOpt FP4 kernels and an FP8 KV cache
model = LLM(
    model="cybermotaz/qwen3-vl-8b-thinking-nvfp4-w4a16",
    trust_remote_code=True,
    quantization="modelopt_fp4",
    kv_cache_dtype="fp8",
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)

# Text-only prompt; see below for attaching an actual image
prompt = "Think step by step: What is shown in this image?"
outputs = model.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
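The prompt above is text-only; to actually attach an image, vLLM expects multi-modal input. A minimal sketch (the special tokens follow the Qwen-VL convention and are an assumption here; building the prompt with the model's own chat template via `tokenizer.apply_chat_template` is more robust):

```python
from PIL import Image

# Attach an image through vLLM's multi-modal input dict.
image = Image.open("example.jpg")
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>"
    "Think step by step: what is shown in this image?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
outputs = model.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params,
)
print(outputs[0].outputs[0].text)
```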
### Using Docker (Pre-loaded)
```bash
# Pull the optimized container
docker pull elkaioptimization/vllm-nvfp4-cuda-13:qwen3-vl-8b-thinking-nvfp4-1.0

# Run with an OpenAI-compatible API on port 8000
docker run --gpus all -p 8000:8000 \
  elkaioptimization/vllm-nvfp4-cuda-13:qwen3-vl-8b-thinking-nvfp4-1.0
```
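Once the container is up, it speaks the OpenAI chat API. A minimal client sketch (the served model name and port are assumptions based on the run command above; the image URL is a placeholder):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="cybermotaz/qwen3-vl-8b-thinking-nvfp4-w4a16",  # assumed served name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Think step by step: what is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
    max_tokens=512,
)
print(resp.choices[0].message.content)
```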
## Quantization Details
| Parameter | Value |
|---|---|
| Quantization Format | NVFP4 (FP4 E2M1) |
| Weight Precision | 4-bit (W4) |
| Activation Precision | 16-bit (A16) |
| Block Size | 16 elements |
| Scale Format | FP8 E4M3 |
| Calibration Dataset | CNN/DailyMail (512 samples) |
| Calibration Method | AWQ-style |
| Tool Used | NVIDIA TensorRT-Model-Optimizer |
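To make the table concrete: NVFP4 stores weights as FP4 E2M1 values in blocks of 16, each block carrying an FP8 E4M3 scale. A simplified numpy sketch of the idea (illustrative only; the real TensorRT-Model-Optimizer pipeline also applies a second-level per-tensor scale, packs two FP4 values per byte, and actually stores the block scale in FP8):

```python
import numpy as np

# Representable magnitudes of FP4 E2M1: 0, 0.5, 1, 1.5, 2, 3, 4, 6
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_nvfp4(block: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one 16-element block: pick a scale so the largest
    element maps to 6.0 (max E2M1 magnitude), then round each element
    to the nearest representable FP4 value. The scale itself would be
    stored in FP8 E4M3; that step is omitted here."""
    amax = float(np.abs(block).max())
    scale = amax / 6.0 if amax > 0 else 1.0
    scaled = block / scale
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q, scale  # dequantize as q * scale

w = np.random.randn(16).astype(np.float32)
q, s = quantize_block_nvfp4(w)
print("max abs error:", np.abs(w - q * s).max())
```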
## Hardware Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| GPU | RTX 4070 (12GB) | RTX 4090 / DGX Spark |
| GPU Memory | 12 GB | 24 GB+ |
| CUDA | 12.4+ | 13.0 |
| Driver | 560+ | 570+ |
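A quick way to check which execution path your GPU supports (a sketch; the table's Ada-generation minimums imply a non-native fallback on pre-Blackwell GPUs, which you should verify against your vLLM build):

```python
import torch

# Native FP4 tensor-core math is a Blackwell feature (SM 10.0 / 12.x);
# earlier GPUs would need a dequantize-to-16-bit fallback, if the
# installed vLLM build provides one.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: sm_{major}{minor}")
if (major, minor) >= (10, 0):
    print("Blackwell-class GPU: native NVFP4 kernels available")
else:
    print("pre-Blackwell GPU: expect a weight-dequant fallback path")
```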
## Model Architecture
Qwen3-VL-8B-Thinking features:
- **Vision-Language**: Processes both image and text inputs
- **Enhanced Reasoning**: Optimized for step-by-step thinking and complex reasoning (see the parsing sketch after this list)
- **Extended Context**: 32K native, 262K extended context length
- **Multilingual**: Strong performance in English and Chinese
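Since the Thinking variant interleaves a reasoning trace with its final answer, downstream code usually wants to split the two. A minimal sketch, assuming the model wraps its chain of thought in `<think>...</think>` tags as the Qwen3 thinking models do:

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Split generated text into (reasoning, answer).
    Assumes a single <think>...</think> block precedes the answer."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        return "", text.strip()
    return m.group(1).strip(), text[m.end():].strip()

sample = "<think>Sofa, whiskers, tail... it's a cat.</think>A cat on a sofa."
reasoning, answer = split_thinking(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```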
## Links
| Resource | Link |
|---|---|
| Original Model | Qwen/Qwen3-VL-8B-Thinking |
| Docker (Org) | elkaioptimization/vllm-nvfp4-cuda-13 |
| Docker (Personal) | mutazai/vllm-spark-blackwell-nvfp4-optimized |
| Author | Mutaz Al Awamleh |
| Organization | ELK-AI |
## License
This model is released under the Apache 2.0 license, the same license as the original Qwen3-VL model.
Built by Mutaz Al Awamleh | ELK-AI
First to quantize Qwen3-VL-8B-Thinking to NVFP4 for Blackwell