---
license: apache-2.0
tags:
- text-generation
- llama.cpp
- gguf
- quantized
- q3_k
model_type: llama
inference: false
base_model:
- sarvamai/sarvam-m
---

# sarvam-m-24b - Q3_K GGUF

This repository contains the **Q3_K** quantized version of sarvam-m-24b in GGUF format.

## Model Details

- **Quantization**: Q3_K
- **File Size**: ~10.7 GB
- **Description**: Standard Q3 quantization
- **Format**: GGUF (compatible with llama.cpp)

## Usage

### With llama.cpp

```bash
# Download the model file
huggingface-cli download tifin-india/sarvam-m-24b-q3_k-gguf sarvam-m-24b-Q3_K.gguf --local-dir .

# Run inference (the binary is named ./main in llama.cpp builds from before mid-2024)
./llama-cli -m sarvam-m-24b-Q3_K.gguf -p "Your prompt here"
```

### With Python (llama-cpp-python)

```python
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="./sarvam-m-24b-Q3_K.gguf",
    n_ctx=2048,        # Context length
    n_gpu_layers=35,   # Adjust based on your GPU; set to 0 for CPU-only
    verbose=False,
)

# Generate text
response = llm("Your prompt here", max_tokens=100)
print(response["choices"][0]["text"])
```

### With Transformers

Transformers (v4.41 and later) can load GGUF checkpoints directly via the `gguf_file` argument. Note that the weights are dequantized to full precision on load, so this path needs far more memory than llama.cpp.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tifin-india/sarvam-m-24b-q3_k-gguf"
gguf_file = "sarvam-m-24b-Q3_K.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)
```

## Performance Characteristics

| Aspect | Rating |
|--------|--------|
| **Speed** | ⭐⭐⭐⭐ |
| **Quality** | ⭐⭐ |
| **Memory** | ⭐⭐⭐⭐ |

Ratings are relative to other GGUF quantization levels of this model: Q3_K trades some output quality for a smaller footprint and faster inference.

## Original Model

This is a quantized version of [sarvamai/sarvam-m](https://huggingface.co/sarvamai/sarvam-m). For the full-precision weights and further details, please refer to the original model repository.

## Quantization Details

This model was quantized using llama.cpp's quantization tools; a sketch of the typical workflow is shown under "Reproducing the Quantization" below. The Q3_K format provides a reasonable balance of model size, inference speed, and output quality for most use cases.

## License

This model is released under the same license as the original model (Apache 2.0).

## Citation

If you use this model, please cite the original model authors and acknowledge the quantization.
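
## Reproducing the Quantization

For reference, this is a minimal sketch of how a quant like this one is typically produced, not the exact commands used for this repository. The input path is a placeholder for a local copy of the original sarvamai/sarvam-m checkpoint, and the script and binary names (`convert_hf_to_gguf.py`, `llama-quantize`) match current llama.cpp checkouts.

```bash
# 1. Convert the original Hugging Face checkpoint to an unquantized GGUF file
python convert_hf_to_gguf.py /path/to/sarvam-m \
    --outtype f16 --outfile sarvam-m-24b-F16.gguf

# 2. Quantize the F16 file down to Q3_K
#    (llama-quantize accepts Q3_K as an alias for Q3_K_M)
./llama-quantize sarvam-m-24b-F16.gguf sarvam-m-24b-Q3_K.gguf Q3_K
```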