GGUF Quantization Collection
A collection of GGUF quantization models created by a Korean high school student.
This model had issues with its original Chat Template that prevented it from functioning properly.
In this GGUF quantization, all tool-use-related sections of the Chat Template have been removed, so the model can still run in basic chat mode.
If you know how to improve the Chat Template, please open a new Discussion and share your insights.
This model was converted to GGUF format from LGAI-EXAONE/EXAONE-4.0.1-32B using llama.cpp release b6795.
Original model card: LGAI-EXAONE/EXAONE-4.0.1-32B
(I wanted to make other versions besides Q4_K_M, but I didn't have time because I'm a high school student...)
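If you would like to build other quantization levels yourself, the conversion can be reproduced with the tools that ship with llama.cpp. The commands below are only a sketch; the local paths, output file names, and the f16 intermediate step are my assumptions, not the exact commands used for this upload.

# Hypothetical reproduction of the conversion; paths and file names are placeholders.
python convert_hf_to_gguf.py ./EXAONE-4.0.1-32B --outtype f16 --outfile EXAONE-4.0.1-32B-F16.gguf
./llama-quantize EXAONE-4.0.1-32B-F16.gguf EXAONE-4.0.1-Q4_K_M.gguf Q4_K_M

The last argument to llama-quantize selects the quantization type, so other variants (for example Q5_K_M or Q8_0) only require changing that value and the output name.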
Please make sure that the environment in which you plan to run this model has at least 24 GB of VRAM.
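If you are not sure how much VRAM your machine has, a quick check like the one below works on NVIDIA GPUs (this snippet is my own sketch built on nvidia-smi, not part of the original instructions). If you have less than 24 GB, you can still try a partial offload by lowering n_gpu_layers in the loading code further down.

# Sketch: query total and free VRAM per GPU via nvidia-smi (assumes an NVIDIA driver is installed).
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.total,memory.free", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout
for i, line in enumerate(out.strip().splitlines()):
    total_mib, free_mib = (int(x) for x in line.split(","))
    print(f"GPU {i}: {total_mib} MiB total, {free_mib} MiB free")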
First, install llama-cpp-python. The command below pulls the prebuilt CUDA 12.4 wheels; adjust or drop the extra index URL for other CUDA versions or a CPU-only build.
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
from llama_cpp import Llama

# Download the quantized file from the Hugging Face Hub (if needed) and load it.
llm = Llama.from_pretrained(
    repo_id="Lumia101/EXAONE-4.0.1-32B-GGUF-Q4_K_M",
    filename="EXAONE-4.0.1-Q4_K_M-ctemplate-removed.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU; lower this value if you run out of VRAM
    n_ctx=8192,       # context window size in tokens
    verbose=False
)
prompt = "Tell me the reason why I need GPU to run a language model." # If you would like to ask this model another question, please edit it here.
output_stream = llm.create_chat_completion(
messages = [
{
"role": "user",
"content": prompt
}
],
temperature=0.6,
top_p=0.95,
presence_penalty=1.5,
stream=True
)
for chunk in output_stream:
content = chunk.get('choices', [{}])[0].get('delta', {}).get('content', '')
if content:
print(content, end='', flush=True)
print()
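If you want to look into the Chat Template issue described above, you can inspect the template that is actually embedded in this GGUF once the model is loaded. The snippet below is a sketch that assumes a recent llama-cpp-python release, where the GGUF key/value metadata is exposed through Llama.metadata.

# Sketch: print the chat template string stored in the GGUF metadata (standard GGUF key name).
template = llm.metadata.get("tokenizer.chat_template")
print(template if template else "No chat template found in the GGUF metadata.")

If the embedded template does not behave the way you need, llama-cpp-python also accepts a chat_format argument (or a custom chat_handler) when the Llama object is constructed, which overrides the embedded template; I have not verified whether any of the built-in formats matches what EXAONE expects.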