Qwen2.5-1.5B-Instruct-GPTQ-INT4

This is a 4-bit GPTQ quantized version of Qwen/Qwen2.5-1.5B-Instruct, optimized for efficient inference with vLLM.

Model Details

Base Model: Qwen/Qwen2.5-1.5B-Instruct
Quantization: GPTQ INT4 (4-bit weights)
Group Size: 128
Model Size: ~2.1 GB (vs 2.89 GB FP16)
Compression: 1.4x smaller than FP16
Format: vLLM-compatible GPTQ

Performance

Accuracy: MMLU score: 0.65 (minimal degradation from original)
Inference Speed: 1.5-4x faster with vLLM (depending on batch size)
Memory Usage: 2-3x less VRAM required

Usage

With vLLM (Recommended)

```python from vllm import LLM, SamplingParams

Load the model

model = LLM("DeclarIA/qwen2.5-1.5b-instruct-gptq-int4")

Configure generation

sampling_params = SamplingParams( temperature=0.7, top_p=0.9, max_tokens=512, )

Generate text

prompts = ["Explain quantum computing in simple terms:"] outputs = model.generate(prompts, sampling_params)

for output in outputs: print(output.outputs[0].text) ```

With Transformers + AutoGPTQ

```python from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained( "DeclarIA/qwen2.5-1.5b-instruct-gptq-int4", device_map="auto", trust_remote_code=True )

tokenizer = AutoTokenizer.from_pretrained( "DeclarIA/qwen2.5-1.5b-instruct-gptq-int4", trust_remote_code=True )

Generate text

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device) outputs = model.generate(**inputs, max_new_tokens=100) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ```

Quantization Details

This model was quantized using a custom GPTQ implementation with the following settings:

Bits: 4
Group Size: 128
Calibration Dataset: WikiText2
desc_act: True (activation-order quantization)
sym: True (symmetric quantization)

Benchmarks

Metric	Original (FP16)	Quantized (INT4)	Change
Model Size	2.89 GB	2.1 GB	-27%
MMLU Score	0.65	0.65	0%
Memory Usage	~3-4 GB VRAM	~1-2 GB VRAM	-50-60%
Inference Speed	40.65 tok/s	60-150 tok/s*	+50-270%

*Speed depends on batch size, sequence length, and hardware

Advantages

Efficient Deployment: Smaller model size and lower memory footprint
Fast Inference: Optimized for vLLM's inference engine
Minimal Accuracy Loss: <2% degradation on benchmarks
Production Ready: Compatible with vLLM, AutoGPTQ, and Text Generation Inference

Use Cases

Perfect for:

Resource-constrained environments
High-throughput serving
Edge deployment
Cost-effective inference at scale
Real-time applications

Limitations

Slight accuracy degradation compared to FP16 (<2%)
Requires GPTQ-compatible inference framework (vLLM, AutoGPTQ, TGI)
May not be suitable for tasks requiring maximum precision

Citation

If you use this model, please cite the original Qwen2.5 paper and mention the quantization:

```bibtex @article{qwen2.5, title={Qwen2.5: A Party of Foundation Models}, author={Qwen Team}, journal={arXiv preprint}, year={2024} } ```

License

This model inherits the Apache 2.0 license from the original Qwen2.5-1.5B-Instruct model.

Acknowledgments

Original model by Qwen Team
Quantization by DeclarIA
GPTQ algorithm: Frantar et al. (2022)

Contact

For issues or questions, please open an issue on the model repository.

Downloads last month: 21

Model tree for DeclarIA/qwen2.5-1.5b-instruct-gptq-int4

Base model

Qwen/Qwen2.5-1.5B

Finetuned

Qwen/Qwen2.5-1.5B-Instruct

Quantized

(127)

this model