Actual tests show it works well: the Q4K quantized model maintains a decoding speed of around 27 tokens/s after multiple turns of casual conversation

#1 opened by goodgame

My configuration is as follows:
EPYC 9654 QS
288 GB RAM
2x RTX 5060 Ti 16 GB
Decoding speed is around 27 tokens/s after 6 rounds of chat conversation.

How do you run this?

I tested with the same Q4K:

2x Xeon E5-2680 v4
256 GB RAM
RTX 3680 12 GB

10 t/s

Clone the repository with a custom directory name:

git clone https://github.com/cturan/llama.cpp.git llama2.cpp

Navigate into the directory:

cd llama2.cpp

Check out the minimax branch:

git checkout minimax

Then build it.
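
Assuming the fork builds the same way as upstream llama.cpp, a CUDA build would look something like this (the CMake flags below are the standard upstream ones, not something stated in the thread):

# configure with CUDA support
cmake -B build -DGGML_CUDA=ON

# compile; the llama-cli and llama-server binaries land under build/bin
cmake --build build --config Release -j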

btw guys, I kind of feel the responses are more similar to gpt-oss 120B

Did you test with a specific task or just natural conversation?

Just a conversation.

Q3K

i5-13400F
128 GB DDR5

6 tok/sec versus 6 tok/sec on Qwen3-235B-A22B. =)
I expected it twice as fast.

Sorry, my fault.
9.89 tok/sec vs 5.11 tok/sec: +93% versus the expected 120%. Not bad at all! Thanks!
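
For anyone checking the numbers: the +93% is just the measured throughput ratio, while the expected +120% lines up with the active-parameter ratio of the two MoE models, assuming this MiniMax model runs roughly 10B active parameters against Qwen3-235B-A22B's 22B (the 10B figure is an assumption, not something stated in the thread).

\[
\frac{9.89}{5.11} \approx 1.94 \;\;(\approx +93\%), \qquad \frac{22\,\text{B}}{10\,\text{B}} = 2.2 \;\;(+120\%)
\]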

I can't get this to output anything but miles of thinking/metadata/Chinese characters.

GPT swears up and down that the template being used has problems; no template is supplied in this repo, and the template from the original repo doesn't help. Any ideas?

"The model is responding with template metadata instead of code, which points to either a prompt/template misconfiguration or the model itself—not the build"

For this code you need to use the GGUFs from this repository. Any chance you downloaded different GGUFs and tried them with this code, or vice versa? I tested it with both general chat and Roo Code, including tool calls, and there should be no problem; it even returns the think blocks to the context, close to what the model's maker recommends.
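
As a rough sketch of a matching launch, assuming the fork keeps upstream llama.cpp's server flags (the GGUF filename and context size below are placeholders, not taken from the thread):

# -m: placeholder filename, substitute the GGUF from this repository
# -ngl 99 offloads as many layers as the GPUs can hold; -c sets the context size
# --jinja applies the GGUF's embedded chat template, which matters for tool calls and think blocks
./build/bin/llama-server -m MiniMax-M2-Q4_K_M.gguf -ngl 99 -c 32768 --jinja --port 8080

If the embedded template misbehaves, upstream llama.cpp also accepts --chat-template-file to point at an explicit template.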

Maybe I'm just expecting it to think less. Thanks for the work!
