Qwen2.5-1.5B-Instruct-GPTQ-INT4

This is a 4-bit GPTQ quantized version of Qwen/Qwen2.5-1.5B-Instruct, optimized for efficient inference with vLLM.

Model Details

  • Base Model: Qwen/Qwen2.5-1.5B-Instruct
  • Quantization: GPTQ INT4 (4-bit weights)
  • Group Size: 128
  • Model Size: ~2.1 GB (vs 2.89 GB FP16)
  • Compression: 1.4x smaller than FP16
  • Format: vLLM-compatible GPTQ

Performance

  • Accuracy: MMLU score of 0.65 (minimal degradation from the original model)
  • Inference Speed: 1.5-4x faster with vLLM (depending on batch size)
  • Memory Usage: 2-3x less VRAM required

Usage

With vLLM (Recommended)

```python
from vllm import LLM, SamplingParams

# Load the quantized model
model = LLM("DeclarIA/qwen2.5-1.5b-instruct-gptq-int4")

# Configure generation
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

# Generate text
prompts = ["Explain quantum computing in simple terms:"]
outputs = model.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```
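
Because this is an instruct-tuned model, raw prompts like the one above work, but responses are typically better when the input is wrapped in the Qwen chat template. A minimal sketch of that pattern, assuming the tokenizer in this repository ships the standard Qwen2.5 chat template:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "DeclarIA/qwen2.5-1.5b-instruct-gptq-int4"

# Build a chat-formatted prompt using the tokenizer's chat template
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate with vLLM exactly as above
model = LLM(model_id)
outputs = model.generate([prompt], SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512))
print(outputs[0].outputs[0].text)
```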

With Transformers + AutoGPTQ

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "DeclarIA/qwen2.5-1.5b-instruct-gptq-int4",
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(
    "DeclarIA/qwen2.5-1.5b-instruct-gptq-int4",
    trust_remote_code=True
)

# Generate text
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
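
For interactive use, generation can also be streamed token by token. A short sketch using transformers' TextStreamer (the prompt text here is just an example):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "DeclarIA/qwen2.5-1.5b-instruct-gptq-int4"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Print decoded tokens to stdout as soon as they are generated
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
inputs = tokenizer("Write a haiku about autumn.", return_tensors="pt").to(model.device)
model.generate(**inputs, max_new_tokens=100, streamer=streamer)
```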

Quantization Details

This model was quantized using a custom GPTQ implementation with the following settings (an illustrative AutoGPTQ-style sketch of the same configuration follows the list):

  • Bits: 4
  • Group Size: 128
  • Calibration Dataset: WikiText2
  • desc_act: True (activation-order quantization)
  • sym: True (symmetric quantization)
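
The exact quantization script is not included here, but an equivalent configuration expressed with the AutoGPTQ library would look roughly like the sketch below; `calibration_texts` is a placeholder for the WikiText2 passages used as calibration data, not an actual variable from this repository:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

base_model = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

# Mirror the settings above: 4-bit weights, group size 128, activation-order, symmetric
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True,
    sym=True,
)

model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config)

# calibration_texts: placeholder list of raw WikiText2 passages used for calibration
examples = [tokenizer(text) for text in calibration_texts]
model.quantize(examples)
model.save_quantized("qwen2.5-1.5b-instruct-gptq-int4")
```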

Benchmarks

| Metric | Original (FP16) | Quantized (INT4) | Change |
|---|---|---|---|
| Model Size | 2.89 GB | ~2.1 GB | -27% |
| MMLU Score | 0.65 | 0.65 | 0% |
| Memory Usage | ~3-4 GB VRAM | ~1-2 GB VRAM | -50% to -60% |
| Inference Speed | 40.65 tok/s | 60-150 tok/s* | +50% to +270% |

*Speed depends on batch size, sequence length, and hardware.
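
Memory figures also depend on context length, batch size, and framework overhead. A quick way to check peak VRAM on your own hardware is to load the model with transformers and read PyTorch's memory counters (a minimal sketch, assuming a single CUDA device):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DeclarIA/qwen2.5-1.5b-instruct-gptq-int4"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("Explain quantum computing in simple terms:", return_tensors="pt").to("cuda:0")
model.generate(**inputs, max_new_tokens=256)

# Peak VRAM (weights + KV cache + activations) observed so far, in GB
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```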

Advantages

  • Efficient Deployment: Smaller model size and lower memory footprint
  • Fast Inference: Optimized for vLLM's inference engine
  • Minimal Accuracy Loss: <2% degradation on benchmarks
  • Production Ready: Compatible with vLLM, AutoGPTQ, and Text Generation Inference

Use Cases

Perfect for:

  • Resource-constrained environments
  • High-throughput serving
  • Edge deployment
  • Cost-effective inference at scale
  • Real-time applications

Limitations

  • Slight accuracy degradation compared to FP16 (<2%)
  • Requires GPTQ-compatible inference framework (vLLM, AutoGPTQ, TGI)
  • May not be suitable for tasks requiring maximum precision

Citation

If you use this model, please cite the original Qwen2.5 paper and mention the quantization:

```bibtex
@article{qwen2.5,
  title={Qwen2.5: A Party of Foundation Models},
  author={Qwen Team},
  journal={arXiv preprint},
  year={2024}
}
```

License

This model inherits the Apache 2.0 license from the original Qwen2.5-1.5B-Instruct model.

Contact

For issues or questions, please open an issue on the model repository.
