
granite-3.3b-2b-instruct Q4_0 Quantized Model

This is a Q4_0 quantized version of the ibm-granite/granite-3.3b-2b-instruct model, converted to GGUF format and optimized for efficient code generation on resource-constrained devices. It was created using llama.cpp in Google Colab, following a workflow inspired by bartowski. The model is designed for tasks like code completion, generation, and editing across 80+ programming languages.

Model Details

  • Base Model: ibm-granite/granite-3.3b-2b-instruct
  • Quantization: Q4_0 (4-bit quantization)
  • Format: GGUF
  • Size: ~0.7–1.0 GB
  • Llama.cpp Version: Recent commit (July 2025 or later)
  • License: Apache 2.0 (see ibm-granite/granite-3.3b-2b-instruct for details)
  • Hardware Optimization: Supports online repacking for ARM and AVX CPU inference (e.g., Snapdragon, AMD Zen5, Intel AVX2)

Usage

Run the model with llama.cpp for command-line code generation:

./llama-cli -m granite-3.3b-2b-instruct-Q4_0.gguf --prompt "def fibonacci(n):" -n 128 -no-cnv

Alternatively, use LM Studio for a user-friendly interface:

  1. Download the GGUF file from this repository.
  2. Load it in LM Studio.
  3. Enter code-related prompts (e.g., "Write a Python function to sort a list").

The model is compatible with llama.cpp-based projects (e.g., Ollama, LM Studio) and handles tasks such as code completion, debugging, and generation in languages including Python, Java, and C++.
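
As a rough sketch of the Ollama route (the tag name granite-q4 and the local Modelfile are illustrative assumptions, not part of this repository):

# Point a minimal Modelfile at the downloaded GGUF file
cat > Modelfile <<'EOF'
FROM ./granite-3.3b-2b-instruct-Q4_0.gguf
EOF
# Register the model under an example tag and run a code prompt
ollama create granite-q4 -f Modelfile
ollama run granite-q4 "Write a Python function to sort a list"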

Creation Process

This model was created in Google Colab with the following steps (a command sketch follows the list):

  1. Downloaded the Base Model: Retrieved ibm-granite/granite-3.3b-2b-instruct from Hugging Face using huggingface-cli.
  2. Converted to GGUF: Used llama.cpp's convert_hf_to_gguf.py to convert the model to GGUF format (granite-3.3b-2b-instruct-f16.gguf).
  3. Quantized to Q4_0: Applied Q4_0 quantization using llama-quantize from llama.cpp.
  4. Tested: Verified functionality with llama-cli using a code-related prompt (e.g., "def fibonacci(n):") in non-interactive mode.
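
A minimal command sketch of these steps, assuming the llama.cpp binaries are built and on PATH in the Colab session; the local directory name granite-hf is an illustrative choice:

# 1. Download the base model from Hugging Face
huggingface-cli download ibm-granite/granite-3.3b-2b-instruct --local-dir granite-hf
# 2. Convert the Hugging Face checkpoint to an f16 GGUF
python convert_hf_to_gguf.py granite-hf --outtype f16 --outfile granite-3.3b-2b-instruct-f16.gguf
# 3. Quantize the f16 GGUF to Q4_0
llama-quantize granite-3.3b-2b-instruct-f16.gguf granite-3.3b-2b-instruct-Q4_0.gguf Q4_0
# 4. Smoke-test the quantized model with a code prompt
llama-cli -m granite-3.3b-2b-instruct-Q4_0.gguf --prompt "def fibonacci(n):" -n 128 -no-cnv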

Optional: An importance matrix (imatrix) generated from a code-focused dataset (e.g., a subset of The Stack or GitHub code) can be supplied during quantization to enhance quality and reduce accuracy loss.
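
A hedged sketch of that optional calibration step; the calibration file code_calibration.txt stands in for whatever code dataset is used:

# Build an importance matrix from a code-heavy calibration text file
llama-imatrix -m granite-3.3b-2b-instruct-f16.gguf -f code_calibration.txt -o granite.imatrix
# Re-quantize to Q4_0 using the importance matrix
llama-quantize --imatrix granite.imatrix granite-3.3b-2b-instruct-f16.gguf granite-3.3b-2b-instruct-Q4_0.gguf Q4_0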

Performance

  • Efficiency: The Q4_0 quantization reduces the model size to ~0.7–1.0 GB, enabling fast CPU inference on low-memory devices such as laptops and phones.
  • Code Generation: Retains strong performance for code completion and generation across 80+ programming languages, though minor accuracy loss may occur compared to the original bfloat16 model due to 4-bit quantization.
  • Hardware Optimization: Online repacking optimizes inference speed on ARM (e.g., mobile devices) and AVX CPUs (e.g., modern laptops, servers), with potential 2–3x faster prompt processing on ARM devices.
  • Quality Note: For higher accuracy, consider the Q5_K_M or Q8_0 quantizations, which trade a larger file size for better output quality (example commands below).
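
For reference, producing those higher-precision variants would look like this, assuming the intermediate f16 GGUF from the creation steps is still available:

# Higher-precision quantizations from the same f16 GGUF
llama-quantize granite-3.3b-2b-instruct-f16.gguf granite-3.3b-2b-instruct-Q5_K_M.gguf Q5_K_M
llama-quantize granite-3.3b-2b-instruct-f16.gguf granite-3.3b-2b-instruct-Q8_0.gguf Q8_0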

Limitations

  • Accuracy Trade-off: Q4_0 quantization may lead to minor accuracy loss in complex code generation tasks compared to higher-precision formats (e.g., Q8_0 or bfloat16).
  • Software Requirements: Requires a recent build of llama.cpp, or compatible software such as LM Studio, for inference.
  • Imatrix: If no importance matrix was used, this quantization relies on standard Q4_0 and may show slightly higher accuracy loss; an imatrix-calibrated version (using a code dataset) would improve quality.
  • License: The base model's Apache 2.0 license applies; follow IBM's responsible use guidance (see ibm-granite/granite-3.3b-2b-instruct).
  • Code-Specific: Optimized for code tasks; may not perform well for general text generation without fine-tuning.

Acknowledgments

  • Bartowski: For inspiration and guidance on GGUF quantization workflows (e.g., bartowski/Llama-3.2-1B-Instruct-GGUF).
  • Llama.cpp: By Georgi Gerganov for providing the quantization and inference tools.
  • The Stack: For the training dataset enabling code generation capabilities.

Contact

For issues or feedback, please open a discussion on this repository or contact the maintainer on Hugging Face or X.


Created in July 2025 by tanujrai.
