GGUF Quantization Collection
A collection of GGUF quantization models created by a Korean high school student.
This model had issues with its original Chat Template that prevented it from functioning properly.
In this GGUF quantization, all tool-use-related sections of the Chat Template have been removed, so the model can still run in basic chat mode.
If you know how to improve the Chat Template, please open a new Discussion and share your insights.
This model was converted to GGUF format from LGAI-EXAONE/EXAONE-4.0.1-32B using llama.cpp release b6795.
Original model card: LGAI-EXAONE/EXAONE-4.0.1-32B
(I wanted to make other versions besides Q4_K_M, but I didn't have time because I'm a high school student...)
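If you would like to build other quantization levels yourself, the conversion can be reproduced with the tools that ship with llama.cpp. The commands below are only a sketch; the local paths, output file names, and the f16 intermediate step are my assumptions, not the exact commands used for this upload.

# Hypothetical reproduction of the conversion; paths and file names are placeholders.
python convert_hf_to_gguf.py ./EXAONE-4.0.1-32B --outtype f16 --outfile EXAONE-4.0.1-32B-F16.gguf
./llama-quantize EXAONE-4.0.1-32B-F16.gguf EXAONE-4.0.1-Q4_K_M.gguf Q4_K_M

The last argument to llama-quantize selects the quantization type, so other variants (for example Q5_K_M or Q8_0) only require changing that value and the output name.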
Please make sure that the environment in which you plan to run this model has at least 24 GB of VRAM.
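If you are not sure how much VRAM your machine has, a quick check like the one below works on NVIDIA GPUs (this snippet is my own sketch built on nvidia-smi, not part of the original instructions). If you have less than 24 GB, you can still try a partial offload by lowering n_gpu_layers in the loading code further down.

# Sketch: query total and free VRAM per GPU via nvidia-smi (assumes an NVIDIA driver is installed).
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.total,memory.free", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout
for i, line in enumerate(out.strip().splitlines()):
    total_mib, free_mib = (int(x) for x in line.split(","))
    print(f"GPU {i}: {total_mib} MiB total, {free_mib} MiB free")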
First, install llama-cpp-python. The command below pulls the prebuilt CUDA 12.4 wheels; adjust or drop the extra index URL for other CUDA versions or a CPU-only build.
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
from llama_cpp import Llama

# Download the quantized file from the Hugging Face Hub (if needed) and load it.
llm = Llama.from_pretrained(
    repo_id="Lumia101/EXAONE-4.0.1-32B-GGUF-Q4_K_M",
    filename="EXAONE-4.0.1-Q4_K_M-ctemplate-removed.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU; lower this value if you run out of VRAM
    n_ctx=8192,       # context window size in tokens
    verbose=False
)
prompt = "Tell me the reason why I need GPU to run a language model." # If you would like to ask this model another question, please edit it here.
output_stream = llm.create_chat_completion(
messages = [
{
"role": "user",
"content": prompt
}
],
temperature=0.6,
top_p=0.95,
presence_penalty=1.5,
stream=True
)
for chunk in output_stream:
content = chunk.get('choices', [{}])[0].get('delta', {}).get('content', '')
if content:
print(content, end='', flush=True)
print()
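If you want to look into the Chat Template issue described above, you can inspect the template that is actually embedded in this GGUF once the model is loaded. The snippet below is a sketch that assumes a recent llama-cpp-python release, where the GGUF key/value metadata is exposed through Llama.metadata.

# Sketch: print the chat template string stored in the GGUF metadata (standard GGUF key name).
template = llm.metadata.get("tokenizer.chat_template")
print(template if template else "No chat template found in the GGUF metadata.")

If the embedded template does not behave the way you need, llama-cpp-python also accepts a chat_format argument (or a custom chat_handler) when the Llama object is constructed, which overrides the embedded template; I have not verified whether any of the built-in formats matches what EXAONE expects.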