Qwen2.5-1.5B-Instruct-GPTQ-INT4
This is a 4-bit GPTQ quantized version of Qwen/Qwen2.5-1.5B-Instruct, optimized for efficient inference with vLLM.
Model Details
- Base Model: Qwen/Qwen2.5-1.5B-Instruct
- Quantization: GPTQ INT4 (4-bit weights)
- Group Size: 128
- Model Size: ~2.1 GB (vs 2.89 GB FP16)
- Compression: 1.4x smaller than FP16
- Format: vLLM-compatible GPTQ
Performance
- Accuracy: MMLU score: 0.65 (minimal degradation from original)
- Inference Speed: 1.5-4x faster with vLLM (depending on batch size)
- Memory Usage: 2-3x less VRAM required
Usage
With vLLM (Recommended)
```python from vllm import LLM, SamplingParams
Load the model
model = LLM("DeclarIA/qwen2.5-1.5b-instruct-gptq-int4")
Configure generation
sampling_params = SamplingParams( temperature=0.7, top_p=0.9, max_tokens=512, )
Generate text
prompts = ["Explain quantum computing in simple terms:"] outputs = model.generate(prompts, sampling_params)
for output in outputs: print(output.outputs[0].text) ```
With Transformers + AutoGPTQ
```python from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained( "DeclarIA/qwen2.5-1.5b-instruct-gptq-int4", device_map="auto", trust_remote_code=True )
tokenizer = AutoTokenizer.from_pretrained( "DeclarIA/qwen2.5-1.5b-instruct-gptq-int4", trust_remote_code=True )
Generate text
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device) outputs = model.generate(**inputs, max_new_tokens=100) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ```
Quantization Details
This model was quantized using a custom GPTQ implementation with the following settings:
- Bits: 4
- Group Size: 128
- Calibration Dataset: WikiText2
- desc_act: True (activation-order quantization)
- sym: True (symmetric quantization)
Benchmarks
| Metric | Original (FP16) | Quantized (INT4) | Change |
|---|---|---|---|
| Model Size | 2.89 GB | 2.1 GB | -27% |
| MMLU Score | 0.65 | 0.65 | 0% |
| Memory Usage | ~3-4 GB VRAM | ~1-2 GB VRAM | -50-60% |
| Inference Speed | 40.65 tok/s | 60-150 tok/s* | +50-270% |
*Speed depends on batch size, sequence length, and hardware
Advantages
- Efficient Deployment: Smaller model size and lower memory footprint
- Fast Inference: Optimized for vLLM's inference engine
- Minimal Accuracy Loss: <2% degradation on benchmarks
- Production Ready: Compatible with vLLM, AutoGPTQ, and Text Generation Inference
Use Cases
Perfect for:
- Resource-constrained environments
- High-throughput serving
- Edge deployment
- Cost-effective inference at scale
- Real-time applications
Limitations
- Slight accuracy degradation compared to FP16 (<2%)
- Requires GPTQ-compatible inference framework (vLLM, AutoGPTQ, TGI)
- May not be suitable for tasks requiring maximum precision
Citation
If you use this model, please cite the original Qwen2.5 paper and mention the quantization:
```bibtex @article{qwen2.5, title={Qwen2.5: A Party of Foundation Models}, author={Qwen Team}, journal={arXiv preprint}, year={2024} } ```
License
This model inherits the Apache 2.0 license from the original Qwen2.5-1.5B-Instruct model.
Acknowledgments
- Original model by Qwen Team
- Quantization by DeclarIA
- GPTQ algorithm: Frantar et al. (2022)
Contact
For issues or questions, please open an issue on the model repository.
- Downloads last month
- 21