CosmosGemma T1 GGUFs

Objective

Real-time applications need quantized models, so we are releasing our models in GGUF format. GGUF is the model file format of the GGML project, which hopes to democratize the use of large models. More than 20 models are available, differing in quantization type.

Features

Hugging Face lists the details of every quantization in the panel on the right of the model page.
All models have been tested in llama.cpp environments, with both llama-cli and llama-server.
A YouTube video also introduces the basics of using LM Studio with these models.
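
For the llama-server route, a minimal client sketch is given here. It assumes the server was started locally with one of these GGUF files and listens on its default port 8080, exposing the OpenAI-compatible `/v1/chat/completions` endpoint. The helper names `build_chat_request` and `ask` are hypothetical, and the sampling defaults simply reuse the temperature and top_p values from this card's code example.

```python
import json
import urllib.request

# Assumption: llama-server is running locally on its default port.
SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_chat_request(user_message, temperature=0.6, top_p=0.95):
    """Build an OpenAI-style chat completion payload for one user turn."""
    return {
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
        "top_p": top_p,
    }


def ask(user_message, url=SERVER_URL):
    """POST the payload to the server and return the assistant's reply text."""
    body = json.dumps(build_chat_request(user_message)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With a server running, `ask("Türkiye'nin başkenti neresidir?")` should return the model's answer as a string. The server applies the chat template itself, so no Gemma turn markers are needed in the message.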

Code Example

Usage example with llama-cpp-python

from llama_cpp import Llama

# Define the inference parameters and the Gemma prompt markers
inference_params = {
    "n_threads": 4,
    "n_predict": -1,
    "top_k": 20,
    "min_p": 0.0,
    "top_p": 0.95,
    "temp": 0.6,
    "repeat_penalty": 1.05,
    "input_prefix": "<start_of_turn>user\n",
    "input_suffix": "<end_of_turn>\n<start_of_turn>model\n",
    "antiprompt": [],
    "pre_prompt": "",
    "pre_prompt_suffix": "",
    "pre_prompt_prefix": "<bos>",
    "seed": -1,
    "tfs_z": 1,
    "typical_p": 1,
    "repeat_last_n": 64,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "n_keep": 0,
    "logit_bias": {},
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.1,
    "memory_f16": True,
    "multiline_input": False,
    "penalize_nl": True
}

# Download the Q4_K file from the Hugging Face Hub and load the model
# (from_pretrained requires the huggingface-hub package)
gemma = Llama.from_pretrained(
    repo_id="ytu-ce-cosmos/Turkish-Gemma-9b-T1-GGUF",
    filename="*Q4_K.gguf",
    verbose=False
)

# Example input (Turkish for "What is the capital of Türkiye?")
user_input = "Türkiye'nin başkenti neresidir?"

# Construct the prompt
prompt = f"{inference_params['pre_prompt_prefix']}{inference_params['pre_prompt']}{inference_params['pre_prompt_suffix']}{inference_params['input_prefix']}{user_input}{inference_params['input_suffix']}"

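# Generate the response. The sampling settings must be passed explicitly;
# otherwise llama-cpp-python uses its own defaults, including a small max_tokens.
response = gemma(
    prompt,
    max_tokens=inference_params["n_predict"],  # -1: generate until the context is full
    temperature=inference_params["temp"],
    top_k=inference_params["top_k"],
    top_p=inference_params["top_p"],
    min_p=inference_params["min_p"],
    repeat_penalty=inference_params["repeat_penalty"],
    stop=["<end_of_turn>"],  # stop at the end of the model's turn
)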

# Output the response
print(response['choices'][0]['text'])
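
The single-turn prompt built above can be generalized to a conversation. The following is a minimal sketch of that idea, not part of the original card: the function name `build_gemma_prompt` and the `{"role", "content"}` message shape are assumptions chosen for illustration, while the `<start_of_turn>` and `<end_of_turn>` markers are the ones used in the example.

```python
def build_gemma_prompt(messages, add_bos=True):
    """Render a list of {"role", "content"} dicts into Gemma's turn format."""
    parts = ["<bos>"] if add_bos else []
    for message in messages:
        # Gemma labels the assistant's turns "model"
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    # Leave the final model turn open so generation continues from it
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


history = [
    {"role": "user", "content": "Merhaba!"},
    {"role": "assistant", "content": "Merhaba, nasıl yardımcı olabilirim?"},
    {"role": "user", "content": "Türkiye'nin başkenti neresidir?"},
]
print(build_gemma_prompt(history))
```

For a single user message this produces the same layout as the prompt in the example. The `add_bos` flag is there because some runtimes insert the BOS token themselves during tokenization, in which case it can be left out of the string.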

The models were quantized with llama.cpp. In our experience, this method tends to give the most stable results.

As expected, the models with the highest bit widths gave the best inference quality. Inference time, however, tends to be similar across the low-bit models.

Each model's memory footprint can be estimated from the quantization docs on either Hugging Face or llama.cpp.
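
As a rough first check before consulting those docs, the size of the weights can be approximated as parameters × bits per weight ÷ 8. The sketch below assumes the rounded figure of 9 billion parameters and uses nominal bit widths. Actual GGUF files are somewhat larger, because quantization types also store scaling data and keep some tensors at higher precision, and inference needs additional memory for the context. The function name is hypothetical.

```python
def estimate_weight_size_gb(n_params, bits_per_weight):
    """Rough size of the quantized weights in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 9e9  # assumption: "9B params", rounded

for bits in (2, 3, 4, 5, 6, 8, 16):
    size = estimate_weight_size_gb(N_PARAMS, bits)
    print(f"{bits:>2}-bit: ~{size:.1f} GB")
```

Treat these numbers as lower bounds; the file sizes listed on the model page are the reliable reference.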

Acknowledgments

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). Thanks for providing access to the TFRC ❤️
Thanks to the generous support from the Hugging Face team, it is possible to download models from their S3 storage 🤗

Contact

COSMOS AI Research Group, Yildiz Technical University Computer Engineering Department
https://cosmos.yildiz.edu.tr/
cosmos@yildiz.edu.tr

Model details

Format: GGUF
Model size: 9B params
Architecture: gemma2
Available quantizations: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit, 16-bit

Model tree for ytu-ce-cosmos/Turkish-Gemma-9b-T1-GGUF

Base model: ytu-ce-cosmos/Turkish-Gemma-9b-T1
Quantized (2): this model