Actual tests show it works well: the Q4K quantized model maintains a decoding speed of around 27 tokens/s after multiple turns of casual conversation

#1 opened by goodgame

My configuration is as follows:
EPYC 9654 QS
288 GB RAM
2x RTX 5060 Ti 16 GB
Decoding speed is around 27 tokens/s after 6 rounds of chat conversation.

How do you run this?

I tested with the same Q4K:

2x Xeon E5-2680 v4
256 GB RAM
RTX 3680 12 GB

10 t/s

Clone the repository with a custom directory name:

git clone https://github.com/cturan/llama.cpp.git llama2.cpp

Navigate into the directory:

cd llama2.cpp

Check out the minimax branch:

git checkout minimax

Then build it.
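
Assuming the fork builds the same way as upstream llama.cpp, a CUDA build would look something like this (the CMake flags below are the standard upstream ones, not something stated in the thread):

# configure with CUDA support
cmake -B build -DGGML_CUDA=ON

# compile; the llama-cli and llama-server binaries land under build/bin
cmake --build build --config Release -j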

btw guys, I kind of feel the responses are more similar to gpt-oss 120B

Did you test with a specific task or just natural conversation?

Just a conversation.

Q3K

i5-13400F
128 GB DDR5

6 tok/sec versus 6 tok/sec on Qwen3-235B-A22B. =)
I expected it twice as fast.

Sorry, my fault.
9.89 tok/sec vs 5.11 tok/sec: +93% versus the expected 120%. Not bad at all! Thanks!
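
For anyone checking the numbers: the +93% is just the measured throughput ratio, while the expected +120% lines up with the active-parameter ratio of the two MoE models, assuming this MiniMax model runs roughly 10B active parameters against Qwen3-235B-A22B's 22B (the 10B figure is an assumption, not something stated in the thread).

\[
\frac{9.89}{5.11} \approx 1.94 \;\;(\approx +93\%), \qquad \frac{22\,\text{B}}{10\,\text{B}} = 2.2 \;\;(+120\%)
\]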

I can't get this to output anything but miles of thinking/metadata/Chinese characters.

GPT swears up and down that the template being used has problems; no template is supplied in this repo, and the template from the original repo doesn't help. Any ideas?

"The model is responding with template metadata instead of code, which points to either a prompt/template misconfiguration or the model itself—not the build"

For this code you need to use the GGUFs from this repository. Any chance you downloaded different GGUFs and tried them with this code, or vice versa? I tested it with both general chat and Roo Code, including tool calls, and there should be no problem; it even returns the think blocks to the context, close to what the model's maker recommends.
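
As a rough sketch of a matching launch, assuming the fork keeps upstream llama.cpp's server flags (the GGUF filename and context size below are placeholders, not taken from the thread):

# -m: placeholder filename, substitute the GGUF from this repository
# -ngl 99 offloads as many layers as the GPUs can hold; -c sets the context size
# --jinja applies the GGUF's embedded chat template, which matters for tool calls and think blocks
./build/bin/llama-server -m MiniMax-M2-Q4_K_M.gguf -ngl 99 -c 32768 --jinja --port 8080

If the embedded template misbehaves, upstream llama.cpp also accepts --chat-template-file to point at an explicit template.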

Maybe I'm just expecting it to think less. Thanks for the work!
