Qwen3-32B-Uncensored-Autoround-int4
High-performance 4-bit quantized version of Qwen3-32B using Auto-Round quantization (GPTQ-compatible).
Model Description
This is a 4-bit quantized version of Qwen3-32B, optimized for efficient deployment on consumer GPUs while retaining most of the base model's output quality. The model uses Auto-Round quantization with a group size of 128, which provides better quality retention than standard GPTQ quantization.
- Base Model: Qwen3-32B
- Quantization: 4-bit Auto-Round (GPTQ-compatible)
- Group Size: 128
- Model Size: ~18GB (4x reduction from FP16)
- Context Length: Up to 40,960 tokens (12,288 tested)
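For reference, this is roughly how a 4-bit Auto-Round export with these settings can be produced using Intel's auto-round library. It is a minimal sketch under assumed defaults, not the exact recipe used for this repository; the output path and calibration settings are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

base = "Qwen/Qwen3-32B"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

# 4-bit weights, group size 128, exported in a GPTQ-compatible layout
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()

# Export format name may differ across auto-round versions
autoround.save_quantized("./Qwen3-32B-int4", format="auto_gptq")
```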
Performance Benchmarks
Speed Performance (RTX 3090 24GB)
| Test Type | Tokens Generated | Speed | Time |
|---|---|---|---|
| Short (100 tokens) | 100 | 34.1 tok/s | 2.93s |
| Medium (500 tokens) | 500 | 33.7 tok/s | 14.8s |
| Long (12,000 tokens) | 12,000 | 30.7 tok/s | 6m 30s |
Average Speed: 30-34 tokens/second on single RTX 3090
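For reference, a minimal sketch of how a throughput figure like those in the table can be measured with vLLM's offline Python API. The prompt, sampling settings, and timing approach are illustrative assumptions, not the exact benchmark harness used for the numbers above.

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="groxaxo/Qwen3-32B-Uncensored-Autoround-int4",
    max_model_len=12288,
    gpu_memory_utilization=0.95,
    dtype="bfloat16",
)

params = SamplingParams(temperature=0.0, max_tokens=500)

start = time.perf_counter()
result = llm.generate(["Write a detailed explanation of quantum computing:"], params)[0]
elapsed = time.perf_counter() - start

# Tokens generated divided by wall-clock time gives the tok/s figure
n_tokens = len(result.outputs[0].token_ids)
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```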
Quality Metrics - Perplexity Analysis
Comprehensive testing across 16 diverse domains (3,015 tokens analyzed):
Overall Results:
- Mean Perplexity: 3.28 (Excellent)
- Median Perplexity: 3.15
- High Confidence Predictions: 60.9%
Domain-Specific Performance:
| Domain | Perplexity | Rating |
|---|---|---|
| Scientific Text | 2.10 | Outstanding |
| Historical Text | 2.33 | Excellent |
| Medical Text | 2.32 | Excellent |
| Technical Text | 3.07 | Very Good |
| Literary Text | 3.43 | Good |
| News Articles | 5.11 | Good |
Quality Rating: 5/5 - Excellent
Comparison to Official Benchmarks
vs Official Qwen3 4-bit GPTQ:
- Auto-Round shows better perplexity retention than standard GPTQ
- At 32B scale, the model is comparatively robust to 4-bit quantization
- Quality is comparable to the official quantized variants
vs Community Deployments:
- Roughly 50% faster than typical 32B deployments (15-25 tok/s)
- Exceeds the throughput reported for Qwen3-32B on a Quadro GV100 (~20 tok/s)
- Production-ready performance
Hardware Requirements
Minimum
- GPU: 24GB VRAM (RTX 3090, RTX 4090, A5000, etc.)
- RAM: 32GB system RAM
- Storage: 20GB
Recommended
- GPU: RTX 3090/4090 or better
- RAM: 64GB system RAM
- Fast SSD storage
Quick Start
Using vLLM (Recommended)
```bash
pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model groxaxo/Qwen3-32B-Uncensored-Autoround-int4 \
  --dtype bfloat16 \
  --max-model-len 12288 \
  --gpu-memory-utilization 0.95 \
  --port 8000
```
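The server exposes an OpenAI-compatible API. A minimal client sketch, assuming the default `/v1` route and the `openai` Python package:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server started above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="groxaxo/Qwen3-32B-Uncensored-Autoround-int4",
    messages=[{"role": "user", "content": "Summarize Auto-Round quantization in two sentences."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```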
Using Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "groxaxo/Qwen3-32B-Uncensored-Autoround-int4"

# device_map="auto" places the quantized weights on the available GPU(s)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Write a detailed explanation of quantum computing:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
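Continuing from the snippet above, multi-turn chat can go through the tokenizer's chat template. This is a sketch; the `enable_thinking` switch follows the upstream Qwen3 convention and is assumed to carry over to this quantized repack.

```python
messages = [
    {"role": "user", "content": "Give me three uses for 4-bit quantization."},
]

# Build the prompt with the model's chat template; enable_thinking toggles
# Qwen3's reasoning mode (assumed to be inherited from the base model)
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=300)

# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```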
Performance Optimization Tips
For Maximum Speed
- Use vLLM with `--enforce-eager` for consistent performance
- Set `--gpu-memory-utilization 0.95` to maximize throughput
- Use `--dtype bfloat16` for an optimal speed/quality balance
For Maximum Context
- Max context tested: 12,288 tokens on a 24GB GPU
- For the full 40K context: use 2 GPUs with tensor parallelism (see the sketch after this list)
- FP8 KV cache: experimental, may be unstable
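A minimal sketch of such a 2-GPU, long-context configuration through vLLM's Python API. The values mirror the CLI flags shown in Quick Start, and the commented-out FP8 KV-cache option is the experimental setting mentioned above.

```python
from vllm import LLM, SamplingParams

# Two-GPU tensor-parallel config for the full 40,960-token context window
llm = LLM(
    model="groxaxo/Qwen3-32B-Uncensored-Autoround-int4",
    tensor_parallel_size=2,        # split the model across 2 x 24GB GPUs
    max_model_len=40960,           # full advertised context window
    gpu_memory_utilization=0.95,
    dtype="bfloat16",
    # kv_cache_dtype="fp8",        # experimental FP8 KV cache, may be unstable
)

params = SamplingParams(temperature=0.7, max_tokens=1024)
print(llm.generate(["Summarize the history of quantization."], params)[0].outputs[0].text)
```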
GPU Utilization
- Achieved: 97% compute, 84% memory during inference
- Memory usage: ~23.6GB on RTX 3090
- Optimal batch size: 1-4 sequences
Benchmark Details
Confidence Distribution
- Very High Confidence (log prob > -0.1): 43.3%
- High Confidence (log prob -0.5 to -0.1): 17.6%
- Medium Confidence (log prob -1.5 to -0.5): 15.3%
- Low Confidence (log prob -3.0 to -1.5): 12.3%
- Very Low Confidence (log prob < -3.0): 11.5%
Statistical Analysis
- Standard Deviation: 1.34
- 25th Percentile: 2.22
- 75th Percentile: 4.51
- Range: 1.34 - 5.53
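The perplexity figures above are derived from per-token log-probabilities. A minimal sketch of how comparable numbers can be reproduced with Transformers; the evaluation texts for the 16 domains are not included here, so `sample_text` is a placeholder.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "groxaxo/Qwen3-32B-Uncensored-Autoround-int4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean negative log-likelihood per token
    return math.exp(out.loss.item())

sample_text = "Photosynthesis converts light energy into chemical energy."  # placeholder
print(f"Perplexity: {perplexity(sample_text):.2f}")
```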
Use Cases
Excellent for:
- Scientific and technical content generation
- Medical and academic writing
- Historical and factual text
- Code generation and analysis
- Long-form content creation
- Multi-turn conversations
Good for:
- Creative writing and literature
- News article generation
- Business and legal documents
- Multilingual tasks (119 languages)
Limitations
- Quantization may introduce minor quality degradation compared to FP16
- Single GPU deployment limited to ~12K context
- Requires GPU with at least 24GB VRAM
- Not suitable for real-time applications requiring <50ms latency
Citation
If you use this model, please cite:
```bibtex
@misc{qwen3-32b-autoround-int4,
  title        = {Qwen3-32B-Uncensored-Autoround-int4},
  author       = {groxaxo},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/groxaxo/Qwen3-32B-Uncensored-Autoround-int4}}
}
```
License
Apache 2.0 - Same as base Qwen3 model
Acknowledgments
- Base model: Qwen3-32B by Alibaba Cloud
- Quantization: Auto-Round technique
- Testing framework: vLLM 0.11.0
Model Card Contact
For questions or issues, please open an issue on the model repository.
Last Updated: October 2025
Quantization Method: Auto-Round (GPTQ-compatible)
Tested On: RTX 3090 24GB, vLLM 0.11.0