Repetition on long contexts
Thanks for making this, when its working correctly, produces some really great logic and perspectives.
Unfortunately the model is extremely prone to repetition especially when given significant amounts of code to review (which is somewhat of a primary use case). I've tried q8_0, q6k, q5_1, q4_1 and q4k ISQs and various penalty and topk/topp settings unfortunately to no avail. Whether running on two GPUs or 4, it always ends up with the same behavior.
Sometimes the loops are your usual 1-line reptitions and they're not always loops - it sort of breaks out of those on occasion but the longer chunks of code it keeps reproducing and the "fall back into thinking" without a <thinking> tag (but throwing out multiple </thinking> ones) behaviors are pathological. End up having to restart the inferencing engine. An interesting note is that KV cache pollution and RAG contexts seem to make this behavior worse.
Running with candle-vllm on V100s (for now). Getting 20-30 T/S out of it when its thinking clearly so its quite usable aside from the ... pathology. Hopefully that can be identified and refined in the safetensors format without specialized gguf gaming.