Not bad, but...

by SerialKicked - opened

It's an interesting model and probably (one of?) the first QwQ models that can do multi-turn creative writing / roleplay. It keeps track of who is talking and of previous events a lot better than other QwQ or R1 fine-tunes. In that respect, it's impressive. The others I tested, being mostly trained on single-round Q&A (because the thinking part not being kept across turns makes fine-tuning very difficult), struggle a LOT more with continuity in the story than yours; not sure what you did there.

It has the atrocious formatting of QwQ/R1, of course (bold within quotes, with italics too in the same sentence for good measure), but I wrote code to normalize it to the point where it's more or less readable. Not your fault, of course, it's a quirk of the base model.
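For what it's worth, the normalization boils down to a few regex passes that strip the emphasis markers while keeping the words. A simplified sketch (not my exact script, and the patterns are intentionally naive):

```python
import re

def normalize_emphasis(text: str) -> str:
    """Strip markdown bold/italic markers from model output, keeping the words."""
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)  # **bold**  -> bold
    text = re.sub(r"\*(.+?)\*", r"\1", text)      # *italic*  -> italic
    text = re.sub(r"_(.+?)_", r"\1", text)        # _italic_  -> italic
    return text

print(normalize_emphasis('"I *really* **mean** it," she said.'))
# -> "I really mean it," she said.
```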

However, using your recommendations (even that weird system prompt, which, let's be honest, is more wishful thinking than actually doing anything), and even more conservative inference settings, it's kinda all over the place. Some sentences don't make any sense (or have any relevance), and it gets weirdly fixated on very anecdotal details of the plot line. Like, a mug fell down in the tavern, and 30 messages later, in a completely different context, it still insists on giving news about that broken mug. :D

That said, when you re-roll enough times, yes, you can get incredibly good results, but it's very inconsistent in that regard.

Also, on a technical level, what are those added 3B params? In my experience, most attempts to add layers to a model only made it worse; it needs a full retrain to work, so I'm kinda curious about what you added there.

Thank you for your feedback.

RE: Formatting - noticed this as well, including the new Gemma 3s (all of them, even 1B!). This seems to be showing up in newer models, reasoning or not.
Writing-wise... it is overkill... but it also "looks cool", maybe??

RE: Fixation
You might want to try the regular "Cubed" version; the abliteration/decensoring might be having a negative effect here.
NOTE: The regular "Cubed" version uses "standard" QwQ plus 2 donors (also standard), whereas in the abliterated/decensored version ALL the models are abliterated/decensored.

NOTE: You may also want to try the Imatrix version of this model, and/or the "normal" version in Imatrix; it has a bit more kick.

RE: 3b:
Conclusion layers from other reasoning models, used in series. Careful balancing is used here to preserve core function, with the core model "overriding" the donors (stronger control).

This is done in a very limited fashion here, to minimize issues and to avoid inflating the parameter count/size of the model.
IE: This is a "5x" method using multiple cores, whereas Darkest Planet 16.5B is a 40x version built upon Dark Planet 8B.

Blowing QwQ up from roughly 32B to close to 50B was not the plan... although the 40x version is a powerhouse for creative uses and RP.
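In rough pseudo-recipe form, appending donor "conclusion" layers looks something like the sketch below. This is only an illustration of the general idea, not the exact Brainstorm recipe: the donor name, layer count, and scaling factor are made up.

```python
# Illustrative sketch: append a donor model's final decoder layers to a base model.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/QwQ-32B", torch_dtype=torch.bfloat16)
donor = AutoModelForCausalLM.from_pretrained(
    "some-org/donor-reasoning-32b",  # hypothetical donor, same architecture as the base
    torch_dtype=torch.bfloat16,
)

# Take the last few decoder layers from the donor (the "conclusion" layers).
donor_tail = donor.model.layers[-4:]

# Down-weight the donor layers so the core model keeps stronger control ("overrides" them).
for layer in donor_tail:
    for p in layer.parameters():
        p.data.mul_(0.5)  # illustrative balancing factor only

# Append them after the base model's final decoder layer and update the config.
base.model.layers.extend(donor_tail)
base.config.num_hidden_layers = len(base.model.layers)

# In practice, per-layer bookkeeping (e.g. the layer_idx used by the KV cache) and a lot
# of balancing/testing are needed before the result is coherent.
base.save_pretrained("qwq-plus-donor-layers")
```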

You might also want to try:
https://huggingface.co/DavidAU/Qwen2.5-The-Wisemen-QwQ-Deep-Tiny-Sherlock-32B-GGUF

As this is a DARE-TIES merge - more stable - and contains QwQ plus 3 more reasoning models.
There is no "uncensored" version of this one yet; just waiting on the "Sherlock" model to be decensored/abliterated.

Thanks for the response.

Formatting
Oh, I totally agree that there's a novelty aspect to a model using italics or bold to emphasize a single word. And, to the base model's credit, it picks the word correctly. And I guess that enthusiasm for that new writing style made people (fine-tuners and users alike) ignore more blatant issues with CoT models when it comes to RP/storytelling.
It's also very front-end dependent. On the one I built for myself, it gets very painful to read, so I filter it out. Technically, in the long run, I could probably normalize datasets to make "distills" behave normally, but it'll take me some time, and it's not exactly making me money.

Anyway, I agree that it's a matter of personal preference.

Fixation
That might be an abliterated (sp?) thing. I mean, I can't rule it out, but I really think it's more a side effect of the one-shot training thing that makes CoT models bad at multi-turn creative writing / RP. And to be fair, that mug, its birth, its life, and its near-death experience, was funny on its own. Just, you know, unintentionally :D

But I agree regarding that operation (this is addressed more to people reading us than to you): some people think it's just removing artificial safeguards by pinpointing "that magic node" and reversing it. That's literal bullshit; it makes a model hallucinate like crazy, and its ability to say "no" (independently of context) becomes gravely impaired, when it was already a losing battle to begin with.
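For anyone reading along: what "abliteration" usually amounts to is directional ablation, i.e. estimating a "refusal direction" in activation space and projecting it out of the weights. A toy sketch of just the projection step (the refusal_direction vector is assumed to have been estimated separately, from prompts the model refuses vs. prompts it answers):

```python
import torch

def ablate_direction(weight: torch.Tensor, refusal_direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of a layer's output along refusal_direction.

    weight is (out_features, in_features) as in nn.Linear; the direction lives
    in the output (residual-stream) space.
    """
    d = refusal_direction / refusal_direction.norm()
    return weight - torch.outer(d, d) @ weight  # W' = (I - d d^T) W
```

Once that projection is applied to every attention-output and MLP-down projection, the model can no longer write along that direction at all, which is part of why the side effects bleed into unrelated behavior instead of neatly removing one "safeguard node".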

RE: 3b:

I understand language models and neural networks; I wrote NNs in the '90s. Your response was an advertisement, not an explanation of what was added and how. If you don't want to explain, that's perfectly fine, but don't take me for a ride.

Owner

RE: 3b;
That is exactly what was done. It is not an advertisement.
More layers, more processing, more "cooks in the kitchen". Many of my repo cards (for models with Brainstorm embedded) discuss the method used, and point to papers that discuss it.

Roughly, this method is in part how MOE ("random setup") models process tokens, just at a smaller scale - but also more direct and more focused (end-layer processing), and likewise more difficult to control.
It took a LOT of experiments (hundreds) to make it work, and then to tune it so it adds value to the model; each arch type has its own quirks too.
And likewise this method was rooted in / sprang from a lot of model merges, experiments, and testing.
