Qwen2.5-Math-7B-Instruct-4bit

Model Description

Qwen2.5-Math-7B-Instruct-4bit is a 4-bit quantized version of the Qwen/Qwen2.5-Math-7B-Instruct model using GPTQ quantization (W4A16 - 4-bit weights, 16-bit activations).

This model is optimized to:

  • Reduce model size by ~75% compared to the original model
  • Reduce GPU memory requirements during inference
  • Increase inference speed
  • Maintain high accuracy for mathematical tasks

Model Details

  • Developed by: Community
  • Model type: Causal Language Model (Quantized)
  • Language(s): English, Mathematics
  • License: MIT
  • Finetuned from model: Qwen/Qwen2.5-Math-7B-Instruct
  • Quantization method: GPTQ (W4A16) via LLM Compressor
  • Calibration dataset: GSM8K (256 samples)

Model Sources

  • Base model: Qwen/Qwen2.5-Math-7B-Instruct (Hugging Face Hub)

Uses

Direct Use

This model is designed for direct use in mathematical and reasoning tasks, including:

  • Solving arithmetic, algebra, and geometry problems
  • Mathematical reasoning and proofs
  • Analyzing and explaining mathematical concepts
  • Educational mathematics support

Example Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "your-username/qwen2.5-math-7b-instruct-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    dtype="float16",
    trust_remote_code=True,
    low_cpu_mem_usage=False,  # Important for compressed models
)

# Create prompt
prompt = "<|im_start|>user\nSolve for x: 3x + 5 = 14<|im_end|>\n<|im_start|>assistant\n"

# Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
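
Instead of building the ChatML prompt by hand, you can let the tokenizer's chat template produce it. This uses standard Transformers functionality and should yield an equivalent prompt for this Qwen-based model, assuming the repository ships the usual Qwen2.5 chat template:

# Equivalent prompt construction via the tokenizer's chat template
messages = [{"role": "user", "content": "Solve for x: 3x + 5 = 14"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # appends the assistant header so the model starts answering
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))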

Downstream Use

This model can be further fine-tuned for specific mathematical tasks or integrated into educational applications.
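
A minimal adapter-tuning sketch with the peft library is shown below. Whether LoRA adapters can be attached directly to this 4-bit GPTQ checkpoint depends on your transformers/peft versions, so treat this as an illustration rather than a verified recipe; the rank, alpha, and target module names are illustrative assumptions (following the usual Qwen2-style attention projections), not tuned values.

from peft import LoraConfig, get_peft_model

# Illustrative LoRA settings; not tuned values
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Qwen2-style attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Attach trainable adapters to the quantized model loaded in the example above
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()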

Out-of-Scope Use

This model is NOT designed for:

  • Generating harmful or inappropriate content
  • Use in applications requiring absolute accuracy (such as critical financial calculations)
  • Tasks unrelated to mathematics or reasoning

Bias, Risks, and Limitations

Limitations

  • Quantization may reduce accuracy slightly compared to the original full-precision model
  • The model may make mistakes on complex problems or edge cases
  • The base model was trained primarily on English data

Recommendations

Users should:

  • Verify results for important mathematical problems
  • Use the original model (full precision) if maximum accuracy is required
  • Be aware that quantization can degrade accuracy on some tasks

How to Get Started with the Model

Installation

pip install transformers torch accelerate
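
Depending on your transformers version, loading W4A16 checkpoints produced by LLM Compressor may also require the compressed-tensors package (this is an assumption about how this particular checkpoint is serialized):

pip install compressed-tensors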

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-username/qwen2.5-math-7b-instruct-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    dtype="float16",
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)

# Use the model
prompt = "<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
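
Because the checkpoint was produced with the vLLM project's LLM Compressor, it should also load in vLLM, which can exploit the W4A16 format during decoding. A minimal sketch, assuming vLLM is installed and reusing the model_name defined above:

from vllm import LLM, SamplingParams

llm = LLM(model=model_name)  # vLLM reads the quantization config from the checkpoint
params = SamplingParams(temperature=0.0, max_tokens=200)
prompt = "<|im_start|>user\nSolve for x: 3x + 5 = 14<|im_end|>\n<|im_start|>assistant\n"
print(llm.generate([prompt], params)[0].outputs[0].text)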

Training Details

Quantization Procedure

The model was quantized using:

  • Method: GPTQ (W4A16)
  • Tool: vLLM LLM Compressor
  • Calibration dataset: GSM8K (256 samples)
  • Max sequence length: 2048 tokens
  • Target layers: All Linear layers except lm_head

Quantization Hyperparameters

  • Scheme: W4A16 (4-bit weights, 16-bit activations)
  • Block size: 128
  • Dampening fraction: 0.01
  • Calibration samples: 256
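
A rough reproduction sketch of the settings above using LLM Compressor follows. Exact argument names vary between llmcompressor releases, and the way the GSM8K samples are flattened into a text column is an assumption, so treat this as an outline rather than the exact script used:

from datasets import load_dataset
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

model_id = "Qwen/Qwen2.5-Math-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# 256 GSM8K samples for calibration; flattening Q/A into a single text field is an assumption
calibration = load_dataset("gsm8k", "main", split="train[:256]")
calibration = calibration.map(lambda ex: {"text": ex["question"] + "\n" + ex["answer"]})

recipe = GPTQModifier(
    targets="Linear",          # quantize all Linear layers...
    ignore=["lm_head"],        # ...except the output head
    scheme="W4A16",            # 4-bit weights, 16-bit activations
    block_size=128,
    dampening_frac=0.01,
)

oneshot(
    model=model,
    dataset=calibration,
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
)

model.save_pretrained("qwen2.5-math-7b-instruct-4bit", save_compressed=True)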

Evaluation

Testing Data

The model was evaluated on the GSM8K test set.
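
The card does not state which evaluation harness was used. One common way to measure GSM8K accuracy is EleutherAI's lm-evaluation-harness, sketched here under that assumption (the repository name is the placeholder used throughout this card):

import lm_eval

# GSM8K accuracy of the quantized checkpoint; 5-shot matches the harness default for gsm8k
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-username/qwen2.5-math-7b-instruct-4bit,dtype=float16",
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])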

Metrics

  • Accuracy: Measured on GSM8K test set
  • Model size: ~3.5 GB (compared to ~14 GB for the original 16-bit model)
  • Compression ratio: ~75% reduction
  • Memory usage: Significantly reduced compared to the original model

Results

The compressed model maintains high accuracy for mathematical tasks while significantly reducing size and memory requirements.

Technical Specifications

Model Architecture

  • Base Architecture: Qwen2.5 (Transformer-based)
  • Parameters: 7B (quantized to 4-bit)
  • Context Length: 8192 tokens (inherited from the base model); quantization calibration used a 2048-token maximum sequence length
  • Quantization: GPTQ W4A16

Compute Infrastructure

Hardware

  • Training/Quantization: NVIDIA RTX 3060 12GB (or equivalent)
  • Minimum Inference: GPU with at least 8GB VRAM

Software

  • Quantization Tool: vLLM LLM Compressor
  • Framework: PyTorch, Transformers
  • Python: >=3.12

Citation

If you use this model, please cite:

Base Model:

@article{yang2024qwen25math,
  title={Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement},
  author={Yang, An and Zhang, Beichen and Hui, Binyuan and Gao, Bofei and Yu, Bowen and Li, Chengpeng and Liu, Dayiheng and Tu, Jianhong and Zhou, Jingren and Lin, Junyang and others},
  journal={arXiv preprint arXiv:2409.12122},
  year={2024}
}

Quantization Method:

@article{gptq,
  title={GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers},
  author={Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan},
  journal={arXiv preprint arXiv:2210.17323},
  year={2022}
}

Model Card Contact

To report issues or ask questions, please open an issue on the repository.

Acknowledgments

  • Qwen Team for the original Qwen2.5-Math-7B-Instruct model
  • vLLM team for the LLM Compressor tool
  • Hugging Face for infrastructure and support