---
license: apache-2.0
tags:
- text-generation
- llama.cpp
- gguf
- quantized
- q3_k
model_type: llama
inference: false
base_model:
- sarvamai/sarvam-m
---

# sarvam-m-24b - Q3_K GGUF

This repository contains the **Q3_K** quantized version of sarvam-m-24b in GGUF format.

## Model Details

- **Quantization**: Q3_K
- **File Size**: ~10.7 GB
- **Description**: Standard Q3 quantization
- **Format**: GGUF (compatible with llama.cpp)

## Usage

### With llama.cpp

```bash
# Download the model file
huggingface-cli download tifin-india/sarvam-m-24b-q3_k-gguf sarvam-m-24b-Q3_K.gguf --local-dir .

# Run inference (the binary is named ./main in llama.cpp builds from before mid-2024)
./llama-cli -m sarvam-m-24b-Q3_K.gguf -p "Your prompt here"
```

### With Python (llama-cpp-python)

```python
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="./sarvam-m-24b-Q3_K.gguf",
    n_ctx=2048,        # Context length
    n_gpu_layers=35,   # Adjust based on your GPU; set to 0 for CPU-only
    verbose=False,
)

# Generate text
response = llm("Your prompt here", max_tokens=100)
print(response["choices"][0]["text"])
```

### With Transformers

Transformers (v4.41 and later) can load GGUF checkpoints directly via the `gguf_file` argument. Note that the weights are dequantized to full precision on load, so this path needs far more memory than llama.cpp.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tifin-india/sarvam-m-24b-q3_k-gguf"
gguf_file = "sarvam-m-24b-Q3_K.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)
```

## Performance Characteristics

| Aspect | Rating |
|--------|--------|
| **Speed** | ⭐⭐⭐⭐ |
| **Quality** | ⭐⭐ |
| **Memory** | ⭐⭐⭐⭐ |

Ratings are relative to other GGUF quantization levels of this model: Q3_K trades some output quality for a smaller footprint and faster inference.

## Original Model

This is a quantized version of [sarvamai/sarvam-m](https://huggingface.co/sarvamai/sarvam-m). For the full-precision weights and further details, please refer to the original model repository.

## Quantization Details

This model was quantized using llama.cpp's quantization tools; a sketch of the typical workflow is shown under "Reproducing the Quantization" below. The Q3_K format provides a reasonable balance of model size, inference speed, and output quality for most use cases.

## License

This model is released under the same license as the original model (Apache 2.0).

## Citation

If you use this model, please cite the original model authors and acknowledge the quantization.
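
## Reproducing the Quantization

For reference, this is a minimal sketch of how a quant like this one is typically produced, not the exact commands used for this repository. The input path is a placeholder for a local copy of the original sarvamai/sarvam-m checkpoint, and the script and binary names (`convert_hf_to_gguf.py`, `llama-quantize`) match current llama.cpp checkouts.

```bash
# 1. Convert the original Hugging Face checkpoint to an unquantized GGUF file
python convert_hf_to_gguf.py /path/to/sarvam-m \
    --outtype f16 --outfile sarvam-m-24b-F16.gguf

# 2. Quantize the F16 file down to Q3_K
#    (llama-quantize accepts Q3_K as an alias for Q3_K_M)
./llama-quantize sarvam-m-24b-F16.gguf sarvam-m-24b-Q3_K.gguf Q3_K
```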