Qwen3-32B-Uncensored-Autoround-int4

High-performance 4-bit quantized version of Qwen3-32B using Auto-Round quantization (GPTQ-compatible).

Model Description

This is a 4-bit quantized version of Qwen3-32B, optimized for efficient deployment on consumer GPUs while retaining most of the full-precision model's quality. The model uses Auto-Round quantization with a group size of 128, which retains quality better than standard GPTQ (a quantization sketch follows the specification list below).

  • Base Model: Qwen3-32B
  • Quantization: 4-bit Auto-Round (GPTQ-compatible)
  • Group Size: 128
  • Model Size: ~18 GB (roughly 3.5x smaller than the ~65 GB FP16 weights)
  • Context Length: Up to 40,960 tokens (12,288 tested)
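
The exact recipe used to produce these weights is not included in this card. As a minimal sketch, Intel's auto-round library can produce a 4-bit, group-size-128 checkpoint exported in a GPTQ-compatible layout as described above; the calibration defaults, output directory, and export format argument below are assumptions, not the verified settings for this repository.

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

base = "Qwen/Qwen3-32B"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

# 4-bit weights with group size 128, matching the specification above;
# calibration settings are left at the library defaults (an assumption)
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()

# Export in a GPTQ-compatible layout so standard GPTQ loaders can read it
autoround.save_quantized("./Qwen3-32B-autoround-int4", format="auto_gptq")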

🚀 Performance Benchmarks

Speed Performance (RTX 3090 24GB)

Test Type              Tokens Generated   Speed        Time
Short (100 tokens)     100                34.1 tok/s   2.93s
Medium (500 tokens)    500                33.7 tok/s   14.8s
Long (12,000 tokens)   12,000             30.7 tok/s   6m 30s

Average Speed: 30-34 tokens/second on a single RTX 3090
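
Throughput numbers like these can be roughly reproduced with vLLM's offline Python API; the sketch below is illustrative (the prompt, sampling settings, and single-request batching are assumptions), and results will vary with GPU, driver, and vLLM version.

import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="groxaxo/Qwen3-32B-Uncensored-Autoround-int4",
    dtype="bfloat16",
    max_model_len=12288,
    gpu_memory_utilization=0.95,
)

params = SamplingParams(max_tokens=500, temperature=0.7)
start = time.time()
outputs = llm.generate(["Explain how transformers use attention."], params)
elapsed = time.time() - start

# Count generated tokens and report decode throughput
generated = len(outputs[0].outputs[0].token_ids)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")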

Quality Metrics - Perplexity Analysis

Comprehensive testing across 16 diverse domains (3,015 tokens analyzed):

Overall Results:

  • Mean Perplexity: 3.28 ⭐ (Excellent)
  • Median Perplexity: 3.15
  • High Confidence Predictions: 60.9%

Domain-Specific Performance:

Domain            Perplexity   Rating
Scientific Text   2.10         Outstanding
Historical Text   2.33         Excellent
Medical Text      2.32         Excellent
Technical Text    3.07         Very Good
Literary Text     3.43         Good
News Articles     5.11         Good

Quality Rating: ⭐⭐⭐⭐⭐ (5/5) - Excellent
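
The evaluation script itself is not part of this card. As a rough illustration of the kind of measurement summarized above, perplexity over a text sample can be computed from the model's token-level loss as sketched below; the sample text is a placeholder.

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "groxaxo/Qwen3-32B-Uncensored-Autoround-int4"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

text = "Sample passage from one evaluation domain..."  # placeholder
ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    # labels=input_ids gives the mean negative log-likelihood over the sequence
    loss = model(ids, labels=ids).loss

print(f"perplexity = {math.exp(loss.item()):.2f}")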

Comparison to Official Benchmarks

vs Official Qwen3 4-bit GPTQ:

  • Auto-Round shows better perplexity retention than standard GPTQ
  • The 32B model holds up well under 4-bit quantization
  • Quality is comparable to the official quantized variants

vs Community Deployments:

  • Roughly 50% faster than typical 32B deployments (15-25 tok/s)
  • Exceeds the ~20 tok/s reported for Qwen3-32B on a Quadro GV100
  • Production-ready performance

Hardware Requirements

Minimum

  • GPU: 24GB VRAM (RTX 3090, RTX 4090, A5000, etc.)
  • RAM: 32GB system RAM
  • Storage: 20GB

Recommended

  • GPU: RTX 3090/4090 or better
  • RAM: 64GB system RAM
  • Fast SSD storage

Quick Start

Using vLLM (Recommended)

pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model groxaxo/Qwen3-32B-Uncensored-Autoround-int4 \
  --dtype bfloat16 \
  --max-model-len 12288 \
  --gpu-memory-utilization 0.95 \
  --port 8000
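
Once the server is running, it speaks the OpenAI-compatible API; a minimal client call could look like the following (the openai package, base URL, and prompt are assumptions, not part of this repository).

from openai import OpenAI

# vLLM's OpenAI-compatible server does not require a real API key by default
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="groxaxo/Qwen3-32B-Uncensored-Autoround-int4",
    messages=[{"role": "user", "content": "Summarize the idea behind 4-bit quantization."}],
    max_tokens=200,
)
print(response.choices[0].message.content)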

Using Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "groxaxo/Qwen3-32B-Uncensored-Autoround-int4",
    device_map="auto",
    trust_remote_code=False
)

tokenizer = AutoTokenizer.from_pretrained("groxaxo/Qwen3-32B-Uncensored-Autoround-int4")

prompt = "Write a detailed explanation of quantum computing:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
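
Qwen3 is an instruction-tuned chat model, so for conversational use the prompt is usually built with the tokenizer's chat template rather than passed as raw text. A minimal sketch, reusing the model and tokenizer loaded above; the enable_thinking flag follows upstream Qwen3 usage and is assumed to apply to this quantized variant as well.

messages = [{"role": "user", "content": "Explain quantum computing in three sentences."}]

# Build the chat-formatted prompt; enable_thinking toggles Qwen3's reasoning mode
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))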

Performance Optimization Tips

For Maximum Speed

  • Use vLLM with --enforce-eager for consistent performance
  • Set --gpu-memory-utilization 0.95 to maximize throughput
  • Use --dtype bfloat16 for optimal speed/quality balance

For Maximum Context

  • Max context tested: 12,288 tokens on 24GB GPU
  • For 40K context: use 2 GPUs with tensor parallelism (see the sketch after this list)
  • FP8 KV cache: Experimental, may be unstable
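
For the two-GPU, long-context setup mentioned above, a minimal sketch with vLLM's Python API follows; whether 40,960 tokens actually fits on two specific cards is not verified here and depends on available VRAM.

from vllm import LLM

# Shard the model across 2 GPUs and raise the context window toward the 40K limit
llm = LLM(
    model="groxaxo/Qwen3-32B-Uncensored-Autoround-int4",
    tensor_parallel_size=2,
    max_model_len=40960,
    gpu_memory_utilization=0.95,
    dtype="bfloat16",
)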

GPU Utilization

  • Achieved: 97% compute, 84% memory during inference
  • Memory usage: ~23.6GB on RTX 3090
  • Optimal batch size: 1-4 sequences

Benchmark Details

Confidence Distribution

  • Very High Confidence (log prob > -0.1): 43.3%
  • High Confidence (-0.5 to -0.1): 17.6%
  • Medium Confidence (-1.5 to -0.5): 15.3%
  • Low Confidence (-3.0 to -1.5): 12.3%
  • Very Low Confidence (< -3.0): 11.5%
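
These buckets are simple thresholds on per-token log probabilities. A sketch of how such a distribution can be tallied with Transformers is shown below; the thresholds are copied from the list above, while the scoring loop itself is illustrative.

import torch
import torch.nn.functional as F

def confidence_buckets(model, tokenizer, text):
    """Count tokens per log-probability bucket, using the thresholds listed above."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(ids).logits
    # Log probability the model assigned to each actual next token
    logprobs = F.log_softmax(logits[0, :-1], dim=-1)
    token_lp = logprobs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)

    bins = {"very_high": 0, "high": 0, "medium": 0, "low": 0, "very_low": 0}
    for lp in token_lp.tolist():
        if lp > -0.1:
            bins["very_high"] += 1
        elif lp > -0.5:
            bins["high"] += 1
        elif lp > -1.5:
            bins["medium"] += 1
        elif lp > -3.0:
            bins["low"] += 1
        else:
            bins["very_low"] += 1
    return bins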

Statistical Analysis

  • Standard Deviation: 1.34
  • 25th Percentile: 2.22
  • 75th Percentile: 4.51
  • Range: 1.34 - 5.53

Use Cases

Excellent for:

  • ✅ Scientific and technical content generation
  • ✅ Medical and academic writing
  • ✅ Historical and factual text
  • ✅ Code generation and analysis
  • ✅ Long-form content creation
  • ✅ Multi-turn conversations

Good for:

  • ✅ Creative writing and literature
  • ✅ News article generation
  • ✅ Business and legal documents
  • ✅ Multilingual tasks (119 languages)

Limitations

  • Quantization may introduce minor quality degradation compared to FP16
  • Single GPU deployment limited to ~12K context
  • Requires GPU with at least 24GB VRAM
  • Not suitable for real-time applications requiring <50ms latency

Citation

If you use this model, please cite:

@misc{qwen3-32b-autoround-int4,
  title={Qwen3-32B-Uncensored-Autoround-int4},
  author={groxaxo},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/groxaxo/Qwen3-32B-Uncensored-Autoround-int4}}
}

License

Apache 2.0 - Same as base Qwen3 model

Acknowledgments

  • Base model: Qwen3-32B by Alibaba Cloud
  • Quantization: Auto-Round technique
  • Testing framework: vLLM 0.11.0

Model Card Contact

For questions or issues, please open an issue on the model repository.


Last Updated: October 2025
Quantization Method: Auto-Round (GPTQ-compatible)
Tested On: RTX 3090 24GB, vLLM 0.11.0
