Fine-tuned / uncensored versions of Qwen 0.6B
I decided to test a few alternative uncensored/abliterated versions of Qwen 0.6B, and it actually makes a lot of difference.
So far I have tried:
DavidAU/Qwen3-0.6B-heretic-abliterated-uncensored
Goekdeniz-Guelmez/Josiefied-Qwen3-0.6B-abliterated-v1
Goekdeniz-Guelmez/Qwen3-0.6B-gabliterated
dnotitia/Smoothie-Qwen3-0.6B
All four of them showed a much better adherence to prompt, and overall better image quality than base Qwen.
Some prompts which resulted in messy images on base Qwen (weird anatomy, fused bodies, etc.) work just fine on the tuned versions.
I don't really have a testing method, but based on "feeling" I would rank them:
- Qwen3-0.6B-heretic-abliterated-uncensored
- Smoothie-Qwen3-0.6B
- Josiefied-Qwen3-0.6B-abliterated-v1
- Qwen3-0.6B-gabliterated
Maybe someone can do a proper test if they have the time and compute to do so.
I've tried the 4B version for Z-Image Turbo and I got great results as well.
Yes, I've been doing this with Zimage, Klein and other models.... using a purely 'ablitted' model (removing the guardrail rejections) helps, but the additional training adds much better results (IMHO), like Josie.
For Zimage, ZEngineer is a great model for example.
https://huggingface.co/BennyDaBall/Qwen3-4b-Z-Image-Engineer-V4/
that's placebo
Any improvement you see is placebo or just the result of differences introduced by breaking the TE. You're simply shifting the prompt representation away from what the model was trained to understand. When a model's representation of a concept is broken, a broken TE might push it away from that broken representation, but in general results should not be better (more faithful to the prompt) than when using the model it was trained with.
@Kellenok
and
@HDiffusion
you are both entirely and completely WRONG.
A few pictures say more than all of your denials:
That's Engineer V2.5, v2, v1 (which had noise issues) and stock Qwen3 4b
(V2 was similar but not identical to stock; it varied widely depending on many things... V2.5 was/is definitely better at composition and positions.)
I've generated literally dozens of comparisons.
That was stock Qwen3 4B, 2 versions of Josie, and Engineer v1 (yes, I did these a while ago); more can be found here
That's not 'placebo', not 'broken TE', and in SOME cases, it's 'better' than the stock Qwen 4b it was trained with.
Over at Banodoco, I did an informal 'pick the best' test with 50 (yes, 50) pairs of images using Stock and Josie.
I even gave the prompt (after requests) so people could eval using what was asked for...
While the results weren't scientific, by a rough estimate the Josie image was preferred 25% of the time and was tied with the Stock Qwen TE 25% of the time.
So in 50% of the cases, it was at least as good, if not better. In 50%, people preferred the Stock... so the true answer: not better, not worse. At least as faithful to the prompt.
And that was with older models. I'd use different models today, like the V4 Engineer, and a merge of that and a Josie model...
BTW, this is NOT using the model as a prompt generator (which Engineer can also do), all of the above are using it AS a TE. And the results show it's not a 'placebo'
Training the model DOES affect the tokenization of the prompt, and then the image model is using that 'slightly different' token stream and the results are different.
Added: I previously wrote up a long explanation of 'why' I think the TE changes are reflected due to the small changes in training, using the original Chinese versus the many English translations of the Tao Te Ching to sub in, by analogy, for Prompt into Tokenization.
your examples just show that this is a placebo and/or worse than stock
i am failing to see an "improvement" from these comparisons.
So in 50% of the cases, it was at least as good, if not better. In 50%, people preferred the Stock... so the true answer: not better, not worse. At least as faithful to the prompt.
not better, not worse. doesn't that just mean it's the same?
your examples just show that this is a placebo and/or worse than stock
Your opinion, and my testing showed otherwise. If you are so 'certain', conduct your own tests and be rigorous and do blind studies.
@lowchannel1503 I gave just a few. Same challenge: do it yourself. You'll see it's 'different', not 'improved' (sometimes it is, sometimes not; ZEngineer's later versions almost ALWAYS give me far better compositions and body positioning).
someone else did this compare recently:
Used z-image-turbo Q8 GGUF with: 1) default clip (qwen3_4b); 2) heretic ("z-image"); 3) zImageTrainedText_bf16.safetensors (don't know where I got this from lol); 4) z-engineer v2; 5) z-engineer v2.5; 6) z-engineer v4. (9 steps, er_sde, sgm_uniform, 1536x1024)
prompt :
Photograph of a street scene in front of a traditional Korean text-rendering shop. Eye-level shot with a medium depth of field. Natural lighting, overcast sky. Likely taken with a DSLR camera, settings: aperture f/2.8, shutter speed 1/250s, ISO 400.
Foreground: Young Asian couple, mid-20s, standing to the left. Woman with long black hair, wearing a beige coat and black pants, smiling slightly. Man with short black hair, wearing a dark gray coat and black pants, looking at the camera. Both have light skin tones.
Background: Traditional Korean building with a red wooden facade and gray tiled roof. Shop sign reads "Text rendering" in English and Korean. Another sign reads "Reaistic texture." Display window shows white and black text paper, small snowman decoration with a blackboard saying "Our text is actually interesting!" Floral arrangement with pink and red flowers on the right. Police officer in black uniform with white text on the back, riding a motorcycle on the right.
Left background: Partial view of another traditional building, gray with a tiled roof.
Lighting: Natural, diffused by overcast sky. No harsh shadows.
Camera angle: Eye-level, slightly low angle.
again, various levels of 'prompt adherence' (v4 Engineer is one of the best).
difference is less than different seeds, but in addition to this, everyone failed the prompt, but stock is still closer
difference is less than different seeds, but in addition to this, everyone failed the prompt, but stock is still closer
actually, if you tally it all up, V4 Engineer wins the prompt adherence. And it's all the same seed, size, etc; the ONLY change is the TE used.
anyway, you can take your nay-saying and go away. I've shown your 'placebo' claim is bogus.
actually, if you tally it all up, V4 Engineer wins the prompt adherence.
Can you point out where I should be looking exactly for the better prompt adherence compared to base?
At first I thought the base didn't have "Our text is actually interesting!" but upon closer inspection it's just on a different sign. z-engineer's shop sign doesn't have the Korean text, though ig you can say it's the sign next to "Text rendering"
Both failed to create a sign showing "Reaistic texture."
Both failed to really spell correctly (e.g. 'rendoring' for base; 'intersting' for z-engineer)
Additionally, all 3 z-engineers made shops that had no doors and seem unenterable
difference is less than different seeds, but in addition to this, everyone failed the prompt, but stock is still closer
scruffynerf showed you that there is a visible difference when using different TE models.
I don't know why you find it so hard to accept, and why you have to double down. Please stop it. This isn't Reddit.
On an unrelated note, it seems like base Qwen is better at using model built-in artist styles. It's a lot more pronounced than when I'm using abliterated versions. But sometimes it is a bit too pronounced, resulting in poor quality.
I have to agree. I think that using the base text encoder which the model was probably trained on will give you better results. The only one that I would say to give a try is a new type called Gabliteration by Gökdeniz Gülmez: "With this model series, I introduce the first Gabliteration, a novel neural weight modification technique that advances beyond traditional abliteration methods through adaptive multi-directional projections with regularized layer selection. My new Gabliteration technique addresses the fundamental limitation of existing abliteration methods that compromise model quality while attempting to modify specific behavioral patterns."
When you take the safety parameters off, it can actually take off some of the knowledge as well. So something like this would actually be pretty good. I'm yet to test it properly, but I have used it in Z Image Turbo and it has given me slightly better results. However, sometimes it misses stuff which the base text encoder retains.
The only one that I would say to give a try is a new type called Gabliteration by Gökdeniz Gülmez
Josie is by the same guy, and some Josie models use Gabliteration. It's not 'special', it's just his own term for this sort of retraining, even if he uses a different method.
actually, if you tally it all up, V4 Engineer wins the prompt adherence.
Can you point out where I should be looking exactly for the better prompt adherence compared to base?
The door doesn't have to be visible, the prompt said "in front of a store", not "in front of the door". Storefront windows are a thing.
Please don't nitpick. None of them are perfect, but if you break down into each element asked for and score it for each, then V4 wins. It's a VERY long prompt, including lots of small details, like "white text on the back of the policeman", multiple signs with text, etc.
and that doesn't even account for the improved positioning. V4 (actually the entire Engineer series) does 'eye level but lower' correctly, which is a subtle perspective. Look at the man's relative size/height to the roof opposite, and where that implies the camera is.... plus "looking at the camera" (v4, but not v2.5), AND the changed poses (which can be argued about, but it's different, and somewhat more 'photographic')
Again, 'different' is not equal to 'better', sometimes it is better though.
Yes, there is 100% a difference, but that difference is just random noise at best, a degradation at worst. Using a mismatched Text Encoder will degrade generation quality SLIGHTLY, for various reasons.
A model is trained with the embeddings of a specific text encoder, if you change the TE to a different finetune, or even just an abliterated one, the semantic meaning might be the same but the embedding geometry will be different from what the U-Net/DiT was trained on. The vector space might be vastly different since all vectors will have to move to accommodate new training data.
There is no point in using an abliterated model as a Text Encoder, it has zero benefit, only minor drawbacks. TEs cannot output refusals, no matter what heinous shit you're trying to generate. They're only providing embeddings, and the Heretic abliteration process only removes safety layers afaik, it does not strengthen NSFW or violent embeddings, that's why there is no real difference when using it.
That's why in scruffynerf's examples Qwen3, Heretic and Engineer v2 did "the best", with the other 3 being markedly worse. Qwen3 because it was used to train the model, Heretic changes very little for TE purposes, Engineer v2 was probably undertrained so the distortion was weaker. Although I'm sure with enough seed variation you'll get all 6 to eventually generate as poorly or as well as the next. For example I'm sure any of the text encoders will manage to gen a missing door eventually.
If you want your TE choice to actually matter, you have to retrain the embeddings of the diffuser, or maybe finetune the model using a different TE, I don't know, I doubt anybody does, I don't think there's been a whole lot of experimentation in that space.
In short: yes there are real differences, but any perceived improvements are indeed, sadly, placebo. I wish it were not so, I've tried many different models as TEs with Z-Image, I actually stuck with Fiction-on-fire Qwen3 because in my mind a fantasy RP finetune should do better at fantasy images, right? So I don't even practice what I preach. And placebo or not, if you like the results better, just use whatever model you want.
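To make the tokenizer-vs-weights point concrete, here is a minimal sketch (assuming the Hugging Face transformers library; the repo IDs are just ones mentioned in this thread, and the conditioning a UI actually extracts may come from a specific layer rather than the last hidden state): a finetune produces the exact same token IDs as stock Qwen3, but its hidden states, which are what the diffusion model consumes, shift.

```python
# Sketch only: compare stock Qwen3-0.6B against one of the finetunes listed above.
import torch
from transformers import AutoTokenizer, AutoModel

BASE = "Qwen/Qwen3-0.6B"
TUNE = "Goekdeniz-Guelmez/Josiefied-Qwen3-0.6B-abliterated-v1"  # any finetune from the list

prompt = "Photograph of a street scene in front of a traditional Korean shop."

tok_base = AutoTokenizer.from_pretrained(BASE)
tok_tune = AutoTokenizer.from_pretrained(TUNE)

ids_base = tok_base(prompt, return_tensors="pt")
ids_tune = tok_tune(prompt, return_tensors="pt")
# Finetunes keep the tokenizer, so the token IDs should be identical.
print("same token IDs:", torch.equal(ids_base.input_ids, ids_tune.input_ids))

with torch.no_grad():
    h_base = AutoModel.from_pretrained(BASE).eval()(**ids_base).last_hidden_state
    h_tune = AutoModel.from_pretrained(TUNE).eval()(**ids_tune).last_hidden_state

# The weights differ, so the per-token hidden states the diffuser sees drift
# away from the distribution it was trained on.
cos = torch.nn.functional.cosine_similarity(h_base.flatten(), h_tune.flatten(), dim=0)
print(f"cosine similarity of hidden states: {cos.item():.4f}")
```

Whether any particular shift is an improvement still has to be judged from generations; the shift itself is measurable and has nothing to do with the tokenizer.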
Please don't nitpick. None of them are perfect, but if you break down into each element asked for and score it for each, then V4 wins. It's a VERY long prompt, including lots of small details, like "white text on the back of the policeman", multiple signs with text, etc.
I'm not trying to nitpick, it's simply that you claim "X achieves much better results than base," so I imagine I could hold it to a higher standard; but sure, the door is somewhere off camera ig
in both images, the policeman has white text on his back (Base's a little less visible due to angle, but it's there), and there are multiple signs with text in both as well. is z-engineer supposed to be scoring higher than base for the policeman and the multiple signs? because if I were to rate it then they'd be equal, they both followed the prompt
and that doesn't even account for the improved positioning. V4 (actually the entire Engineer series) does 'eye level but lower' correctly, which is a subtle perspective. Look at the man's relative size/height to the roof opposite, and where that implies the camera is.... plus "looking at the camera" (v4, but not v2.5), AND the changed poses (which can be argued about, but it's different, and somewhat more 'photographic')
I'm not skilled in photography, but to my brain the man and woman simply stood closer to the camera, as everything else's sizes and positions stayed relatively similar
in both images, the man and the woman are looking at the camera
whether the changed poses are good or bad is subjective, as the prompt doesn't specify what pose to put them in
Again, 'different' is not equal to 'better', sometimes it is better though.
no one's arguing about the images being different - it's clear that there is an effect when you switch the TE
point is, and even you seem to agree here, that at best it just looks 'different' not 'better'
"sometimes better" and sometimes worse - in other words, it's different
there's no clear 'improvements' to be gained from swapping TEs
If you want your TE choice to actually matter, you have to retrain the embeddings of the diffuser, or maybe finetune the model using a different TE, I don't know, I doubt anybody does, I don't think there's been a whole lot of experimentation in that space.
I know of 2 instances where the TE was tuned for a model, though granted they're a bit old - Pony, and NoobAI
for Pony, the poor tuning nuked a lot of knowledge out of the TE, making it unable to draw lawnmowers for example
for NoobAI, CLIP-L basically died and can no longer perform tasks that a healthy CLIP model should be able to do (retrieval); additionally, the embeddings for color in the non-dead CLIP-G became clustered and not well separated
so, possible? yeah.
easy? no.
should you tune a TE for a diffusion model? probably not, especially for modern TEs that are LLMs trained on humongous datasets. spending that money on further tuning the model itself is probably more beneficial
Yes, there is 100% a difference, but that difference is just random noise at best, a degradation at worst. Using a mismatched Text Encoder will degrade generation quality SLIGHTLY, for various reasons.
Nope, you are just wrong. I've tested this extensively, and it's not just random noise. But again, prove me wrong: add random noise and compare. Randomly hack on a TE, 'break it', and show me it makes similarly coherent changes.
There is no point in using an abliterated model as a Text Encoder, it has zero benefit, only minor drawbacks. TEs cannot output refusals
Incorrect. See below.
Heretic abliteration process only removes safety layers afaik, it does not strengthen NSFW or violent embeddings, that's why there is no real difference when using it.
I 100% agree with you here. I found almost NO changes to using purely ablitted models. At best, there would be a slight change of hand or something. In NSFW, it would slightly improve where a hand went or what a body part would do... but not much... but it did show that it was doing something.
In short: yes there are real differences, but any perceived improvements are indeed, sadly, placebo. I wish it were not so, I've tried many different models as TEs with Z-Image, I actually stuck with Fiction-on-fire Qwen3 because in my mind a fantasy RP finetune should do better at fantasy images, right? So I don't even practice what I preach. And placebo or not, if you like the results better, just use whatever model you want.
'Placebo' implies 'it's all in the mind', which is just not accurate. It's making changes. Again, I linked above to my analogy with translating Chinese into English, and how a dozen different translations can all be different, despite being based on the same source. Tokenization by one model will always yield the same token stream. But a retrained model WILL generate a different token stream, EVEN when not eval-ing that token stream itself.
This seems to be the big thing people don't grasp: when you query an LLM, it encodes the query, and then it CONTINUES and 'adds the next bits'... But usually we don't worry about the encoded query, just the reply part. Older TE models like ClipL were not LLMs, they were more like dictionaries. T5 was a bit more LLM, but still not quite there. Using newer LLMs AS TEs? That's a different ballgame and it's what we're discussing.
it's simply that you claim "X achieves much better results than base,"
despite the fact that I also said the exact opposite repeatedly: I said "'different' is not always equal to 'better', but sometimes it is"
there's no clear 'improvements' to be gained from swapping TEs
This is where we disagree: In SOME places, there is benefit. I listed one: I find Engineer TEs to do BETTER photographic composition and body posing. It just does. It was literally trained on that idea, for that idea, and it SOMEHOW is reflected in the way it tokenizes. Perhaps it's adding tokens reflecting this, perhaps it's structuring the token order differently, but whatever it is doing, it works, and it makes 'better' composed images, IMHO. And again, that's over hundreds of tests, multiple models.
The naysayers all think "it's just random, it's not actually doing anything coherent", without ACTUAL EVIDENCE of this, just assumptions. If you want to prove me wrong, go do what I did: generate a few hundred examples with Stock and a given different model, across a variety of prompts (same settings/seed/size/samplers/etc), then get many people to give you opinions and look at the results. Use Engineer v4 and ask for 'preferred photographic composition' and see if it's all 'placebo'. (hint: it's not)
should you tune a TE for a diffusion model? probably not, especially for modern TEs that are LLMs trained on humongous datasets. spending that money on further tuning the model itself is probably more beneficial
You clearly have NOT looked at Engineer, have you? He trained it multiple times on a Strix Halo 128GB machine and a Mac Mini working together (not a cluster of big GPUs... so cheap and easy, relatively), and he iterated and improved the process repeatedly from v1 (flawed) to v2 to v2.5 to v4...

He also was training it as a Prompt Generator, not to enhance the TE quality. I came along and tried it as a TE, and demonstrated it worked in V1 (though it had strange noise issues... GEE, like it was making random incoherent noise, oh hey, that is what people claim these all are... why don't the other models do this? because it's not random noise). He then retrained to v2, and in the process the tokenizing aspect improved, DESPITE THAT HE WAS STILL TRAINING IT FOR PROMPT GENERATING IMPROVEMENTS. In other words, any tokenizing changes were a side effect of the retraining of the main LLM learning.

And his later improvements, FOCUSED on improved prompt generating for Zimage to produce better results (keep that in mind), have caused V2.5 and V4, used as TEs, to make better composed images, not merely 'different', but with better posing and photographic composition as a coherent repeated result. I don't know his dataset, but based on his comments, it's clearly a dataset focused on teaching the model how to craft more detailed, better composed image prompts. The tokenizer changes have learned this as well... somehow. It's not random.
'Placebo' implies 'it's all in the mind', which is just not accurate. It's making changes. Again, I linked above to my analogy with translating Chinese into English, and how a dozen different translations can all be different, despite being based on the same source. Tokenization by one model will always yield the same token stream. But a retrained model WILL generate a different token stream, EVEN when not eval-ing that token stream itself.
That's why I specifically said the improvements are placebo, and the changes are real.
Unless we're talking about different things here, this is exactly not the case, unless Engineer has changed the Tokenizer OR Vocab OR Pre-Tokenization rules. None of which Engineer did according to its model card; it's a simple weight fine-tune in V4 and a simple LoRA merge in V2.5 and lower. It was never designed to be used as a Text Encoder for Z-Image, just a prompt enhancer; the fact that it also happens to work as a drop-in replacement is just due to how little actual change happened to the parts that matter. There is no "continuation" if you use the LLM as a TE.
Are we talking about the same thing here? Otherwise none of this is relevant to using Qwen3 or its finetunes as a TE.
And no, CLIP and T5 and Qwen are all transformers, doesn't matter that the former two are not LARGE Language Models or autoregressive, their function here remains largely the same, just their size and training is different.
In order to get meaningful, directed, intentional improvements you need to train a text encoder alongside its respective diffuser, and even then you will only get very subtle adherence improvements over a long enough time. To get any real benefit you need to train your TE WITH your Diffuser (so you get a new TE and Diffusion model at the end). Both of which will have matching embedding geometry and be able to benefit from the new training data. But you'll need hundreds of GB of VRAM for that. (300+)
But if you say that you see an actual improvement, why not showcase it. Post two sets of 4 images, same everything except the TE, sequential seeds, with workflow, and point out exactly where the improvements are. You can cherrypick the starting seed even. Otherwise why should people take your word for it? So far all the examples were just "different".
Unless we're talking about different things here, this is exactly not the case, unless Engineer has changed the Tokenizer OR Vocab OR Pre-Tokenization rules. None of which Engineer did according to its model card; it's a simple weight fine-tune in V4 and a simple LoRA merge in V2.5 and lower. It was never designed to be used as a Text Encoder for Z-Image, just a prompt enhancer; the fact that it also happens to work as a drop-in replacement is just due to how little actual change happened to the parts that matter. There is no "continuation" if you use the LLM as a TE.
Bzzt, thanks for playing. If you WERE correct, then replacing one model with another model with the same 'pedigree' (ie V2, v2.5, v4) should yield identical results, after all, there are no changes to the Tokenizer/Vocab/etc, right? The only change is the weights, and gee, it performs differently AS a TE. So..... And again, using the LLM as a TE, no, it's not continuing, but it does render the existing prompt into tokens, and that result differs from model to model. [And btw, the Heretic/Abliterated models that don't train anything 'new', just remove the guardrails (yes, it's still training, but usually it's more of a lobotomy), despite NO changes at all to the Tokenizer/Vocab, also make slightly different images. So... nothing matches your misguided understanding of the way it works used as a TE. LLMs as TEs don't work the way you think they do.]
But if you say that you see an actual improvement, why not showcase it. Post two sets of 4 images, same everything except the TE, sequential seeds, with workflow, and point out exactly where the improvements are. You can cherrypick the starting seed even. Otherwise why should people take your word for it? So far all the examples were just "different".
I literally posted images, showing there are differences. You moving the goal line to 'where the improvements are' is the problem here. If you cannot see that the V4 composition is different from the stock Qwen 3 composition on the Korean images, you need your eyes checked. If you want to argue about which is 'better', that's purely opinion, but there is no question when you look at dozens of examples (as I have) that it's 'better' at photographic composition and body positioning. (In other words, it'll take the same subject and make a 'nicer' image out of it.)
Moving on... the naysayers are literally unable to do anything other than bleat they know better.
See, now I know you're just trolling, I was constantly giving you benefit of the doubt, didn't want to point out that you don't know what you're talking about, but now you've self-reported.
I'm going to use your own words against you here real quick:
Your claim that V4 Engineer is better at prompt adherence, which from the images you posted it clearly wasn't:
difference is less than different seeds, but in addition to this, everyone failed the prompt, but stock is still closer
actually, if you tally it all up, V4 Engineer wins the prompt adherence. And it's all the same seed, size, etc; the ONLY change is the TE used.
anyway, you can take your nay-saying and go away. I've shown your 'placebo' claim is bogus.
Your claim that abliterating helps - abliteration does absolutely nothing to improve Text Encoder performance. Fundamental misunderstanding of what a Text Encoder does:
Yes, I've been doing this with Zimage, Klein and other models.... using a purely 'ablitted' model (removing the guardrail rejections) helps, but the additional training adds much better results (IMHO), like Josie.
For Zimage, ZEngineer is a great model for example.
https://huggingface.co/BennyDaBall/Qwen3-4b-Z-Image-Engineer-V4/
Finally you said yourself that it's neither better nor worse, so it's just different, so why argue with people who say that it's "just different":
So in 50% of the cases, it was at least as good, if not better. In 50%, people preferred the Stock... so the true answer: not better, not worse. At least as faithful to the prompt.
Holy hell man we're training models here, not breeding dogs. Engineer V2, 2.5 and 4 are similar in name only. 2.5 was a MERGE, it wasn't even a finetune, and v4 had a different dataset, in your world that's like comparing a pregnant chihuahua to a fully grown male mastiff:
Bzzt, thanks for playing. If you WERE correct, then replacing one model with another model with the same 'pedigree' (ie V2, v2.5, v4) should yield identical results, after all, there are no changes to the Tokenizer/Vocab/etc, right?
Do you even know what a tokenizer is? It turns the words from your prompt into tokens. All Qwen3 based models use the same tokenizer. Just like all Llama3 models use their own tokenizer. It's meaningless to modify it for a finetune. I don't even know why someone would do that as it'll break compatibility in llamacpp in so many ways. As a rule of thumb you can just expect all Qwen3_4b based models to produce the same Token IDs for any given input. It was the first hint that you were talking out of your behind.
That being said the weights determine the resulting hidden states the model outputs after the forward-pass. A Model like Z-Image is trained with TE embeddings to expect a very specific distribution of those hidden states and diffusers in general are quite sensitive to these changes. That's why on average you'll see some loss in performance in some aspect of the process. I hope that makes things more clear for you. Sorry if I come off as harsh.
It's fine to encourage people to experiment, I'd do so myself. But don't spread misinformation about how things work please, and give people unmoderated expectations. That's why others called your claims out as "placebo".
Your claim that V4 Engineer is better at prompt adherence, which from the images you posted it clearly wasn't:
"From the images you posted it clearly wasn't".
Um, oh really, did you analyze each prompt item and score it against each image?
Nope, you didn't. You just keep claiming "stock is better"
Your claim that abliterating helps - abliteration does absolutely nothing to improve Text Encoder performance.
No, I said and I'll repeat myself again for the slow person who keeps insisting otherwise:
Abliting BARELY changes things, it makes very small differences, DESPITE ZERO changes in the tokenizer/vocab, which you insist are the key piece, and I keep saying are not.
It doesn't 'improve', but it doesn't hurt; it does change things slightly. It doesn't magically do NSFW better, but it does suddenly do things that the guardrails would not do, like putting part A touching part B, where I found that the stock clip model would avoid that and 'get close but never touch' (and A and B are anything from hands to mouth to other parts)
Finally you said yourself that it's neither better nor worse, so it's just different, so why argue with people who say that it's "just different":
Sigh, let's repeat this again: In general, IN GENERAL, it's 'just different'. If you take ANY retrained Qwen 3 4B LLM, you can use it as a clip model... and you will get different results. I make ZERO claims that ANY random model is always better than stock... but in my testing, Josie model was:
So in 50% of the cases, it was at least as good, if not better. In 50%, people preferred the Stock... so the true answer: not better, not worse. At least as faithful to the prompt.
You clearly missed the '25% of the time, it was better'. That's not a guarantee, it is an observation, and confirmed by blind opinion survey by people who didn't know which was which. That was on a model that wasn't trained AT ALL for image prompting. It was just trained to be a 'better more cooperative model'... it also learned to use emojis and make comments, and all sorts of extra bits... but it also loosened the model up and it, WHEN USED AS A CLIP MODEL (TE), caused the results to sometimes be better than Stock TE.
Holy hell man we're training models here, not breeding dogs. Engineer V2, 2.5 and 4 are similar in name only. 2.5 was a MERGE, it wasn't even a finetune, and v4 had a different dataset, in your world that's like comparing a pregnant chihuahua to a fully grown male mastiff:
Bzzt, thanks for playing. If you WERE correct, then replacing one model with another model with the same 'pedigree' (ie V2, v2.5, v4) should yield identical results, after all, there are no changes to the Tokenizer/Vocab/etc, right?
You seem to ignore that you lost: You claimed the only thing that mattered was the Tokenizer/Vocab, which did NOT change from model to model... so that's the 'pedigree' I meant. Not direct Model A -> Model B -> Model C. If you WERE right, then the same tokenizer in all 3 models would produce the same result; it DOES NOT. So explain HOW those models do different things WITH THE SAME TOKENIZER/VOCAB, unless the Weights are involved? (Which you deny)
Do you even know what a tokenizer is? It turns the words from your prompt into tokens. All Qwen3 based models use the same tokenizer. Just like all Llama3 models use their own tokenizer. It's meaningless to modify it for a finetune.
Nobody SAID the 'tokenizer code' was being trained; in fact, I am specifically saying the opposite: it's NOT the tokenizer that is at issue here, but the weights, as I insisted all along.
I don't even know why someone would do that as it'll break compatibility in llamacpp in so many ways. As a rule of thumb you can just expect all Qwen3_4b based models to produce the same Token IDs for any given input. It was the first hint that you were talking out of your behind.
No, it's the first clue you aren't actually reading and understanding.
Yes, all of the Qwen3_4b models use the same token IDs.
OK, so explain HOW:
If you use V2, V2.5 and V4, all of which use the SAME Tokenizer, how does passing the same prompt to all 3 result in 3 different token strings (aka conditioning)?
(btw, please don't argue that the conditioning is not a 'token string'; it is, you can decode it backwards, it's a token string. I've even passed LLM output (which I also translated back to English) directly to Zimage as tokenized conditioning, and it generated the expected picture based on the tokens, i.e. it matched the English translation.)
(added: documented here, which provides the screenshot of the custom node (basically an LLM node which both passes responses along directly as conditioning and returns the response to be detokenized into human-readable text))
It's the weights, and I said the weights, and you seemingly deny over and over that they matter to how the tokens are PROCESSED into a conditioning string.
That being said the weights determine the resulting hidden states the model outputs after the forward-pass. A Model like Z-Image is trained with TE embeddings to expect a very specific distribution of those hidden states and diffusers in general are quite sensitive to these changes. That's why on average you'll see some loss in performance in some aspect of the process. I hope that makes things more clear for you. Sorry if I come off as harsh.
AKA There are conditioning differences, which are NOT always 'losses'
This is where the Prompt is converted into Tokens which are then INTERPRETED by THE CLIP MODEL.
IN OTHER WORDS, IT'S THE TRANSLATOR, WHICH I SAID FROM THE START:
Chinese Characters to English Translations is similar to the Prompt->Tokens->Conditioning changes.
And before you say "But the prompt is always translated to the exact same tokens": YES, in the first part, but the use of the model as a TE? That's the second part, that's where it IS being 'translated' into a different set of tokens aka the conditioning string.
I'm gonna be honest I feel like a lot of this arguing is pretty pointless. You would be much better off focusing your efforts on arguing for a better TE in general that isn't a ridiculously small 0.6B in size, rather than spending your time swapping it out with a bunch of random TEs that barely affect outputs in the first place. No matter how much it does or doesn't improve the output, your main limitation here is still that it's just 0.6B, not that the TE isn't abliterated. It's more important than whatever this all is honestly
This is why it's so difficult to argue with people who don't understand the subject matter.
When I asked if we're talking about the same thing you could have just said "no".
I've never seen anyone use terms like "translation" or "token stream" in the way you did, hence the misunderstanding.
This is why it's so difficult to argue with people who don't understand the subject matter.
When I asked if we're talking about the same thing you could have just said "no".
I've never seen anyone use terms like "translation" or "token stream" in the way you did, hence the misunderstanding.
Because you didn't actually read what I wrote. You assumed you knew better.
And you even said "Unless we're talking about different things here, this is exactly not the case:"
You said this, and I immediately said: Um, No, that's not correct. Not sure how much clearer "No" can be.
You would be much better off focusing your efforts on arguing for a better TE in general that isn't a ridiculously small 0.6B in size
That I can agree with. 4B would probably be overkill for a 2B image model, but why not 1.7B?
Maybe they targeted 6GB VRAM or something?
I'm not sure how much quantization affects TE, but using gguf or FP8 is an option.
But sadly, I really doubt that they will just swap TE model from preview to release, so we will have to work with what we have.
Idk why one side is trying to prove things with evidence and the other keeps ignoring it with some arrogance. Someone should also explain what placebo means here, since the idea is open and anyone can try this.
- Qwen3-0.6B-heretic-abliterated-uncensored
- Smoothie-Qwen3-0.6B
- Josiefied-Qwen3-0.6B-abliterated-v1
- Qwen3-0.6B-gabliterated
I was looking for alternative clips to try, do these work in ComfyUI?
even a normal person would show different reactions if you removed part of their brain or fried it with electricity
You would be much better off focusing your efforts on arguing for a better TE in general that isn't a ridiculously small 0.6B in size
But sadly, I really doubt that they will just swap TE model from preview to release, so we will have to work with what we have.
I addressed this in the now closed thread about it, and basically there is no good way to 'change out the TE' (given the Qwen3 0.6b to T5 bridge).
But yes, there are retrained 0.6b models, like the Josiefied one mentioned. Almost any model can be made to work with ComfyUI as a TE; you just need to be sure it's in the right spot and has the support files which are often left out (like vocab.json). You can NOT change the size/etc; those are different models, they won't work. And right now, there is no way to use MLX models as Clip models in Comfy, I believe, but GGUFs work fine (use the City96 repo nodes).
That said: it's clear that they ARE training the bridge to 'do better', so you might want to consider using an ablit/heretic script ON the stock Anima TE model (and keep it from lobotomizing it) and see if that helps a little bit (it won't help a LOT...), and THEN you could train more NSFW concepts into the Qwen3 0.6b model (for example, could be SFW anime, or whatever content...), and THEN train Anima.
BUT I'm pretty sure that T5 itself will still be the true bottleneck here: See https://github.com/Kaoru8/T5XXL-Unchained where they added NSFW vocab to T5...
The original tokenizer had a vocabulary size of 32,100, while the newly uncensored one has been extended to a vocabulary size of 69,300.
Aside from effectively uncensoring the model, this results in significantly more direct 1:1 word -> token/concept mappings
(and therefore convergence speed during training and prompt adherence during inference) for the following:
NSFW terms and anatomy
Danbooru tags
First and last names of people and characters
Ethnicities and nationalities
So stock T5, used with Anima means it doesn't know those... so I suspect this is where Qwen3 0.6b is learning and helping... it's doing the same as this 'enhanced vocab T5'... but it still won't fix that bottleneck.
I'm MORE curious why they didn't use Kaoru's unchained T5XXL and start Anima there.
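For anyone wondering what "extending the vocab" looks like mechanically, here is a rough sketch of the general idea (this is NOT the actual T5XXL-Unchained recipe; it's the standard transformers pattern, and the token list is made up): you add new whole-word tokens to the tokenizer and resize the encoder's embedding matrix, after which the new rows still have to be trained before they mean anything.

```python
# Sketch only: the generic "add tokens + resize embeddings" pattern for a T5 encoder.
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
model = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")

# Hypothetical new vocabulary entries; the real project reportedly grew the
# vocab from 32,100 to 69,300 entries (NSFW terms, booru tags, names, etc.).
new_tokens = ["1girl", "absurdres", "lawnmower"]
num_added = tokenizer.add_tokens(new_tokens)

# New embedding rows start out randomly initialized; they only become useful
# after further training of the encoder (and of the diffusion model using it).
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens, new vocab size: {len(tokenizer)}")
```

That retraining step is the expensive part, which is presumably why most people just swap in Qwen3 finetunes instead of touching T5.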
scruffynerf is just trolling, they will continue to defend placebo and even bad models no matter what, even if it's crystal clear that a model is very bad.
The OP's topic is about finetuned Qwen3 0.6B, not flaming this stupid war about 4B LLMs. And it's clear that troll knows nothing and it is useless to entertain him. Now, it's about t5xxl...
I'm MORE curious why they didn't use Kaoru's unchained T5XXL and start Anima there.
Nobody asked for your opinion, nobody will train yet another model using t5xxl's encoder just to comply with your low-effort trolling, it is 2026 already, no way they'd do that; do it yourself if you are "smart" enough
Do not talk back.
scruffynerf is just trolling
https://huggingface.co/19999999dog/activity/all
anonymous account with no history who popped in just to name call. Who's a troll? Reporting.
scruffynerf is just trolling
https://huggingface.co/19999999dog/activity/all
anonymous account with no history who popped in just to name call. Who's a troll? Reporting.
report these no lifes
scruffynerf is just trolling
https://huggingface.co/19999999dog/activity/all
anonymous account with no history who popped in just to name call. Who's a troll? Reporting.
report these no lifes they make accounts to disagree on things
I reported him too
I decided to compare default qwen 0.6b and the "Qwen3-0.6B-heretic-abliterated-uncensored" OP mentions, generating 2x250 images of the train brats on a beach and comparing how many errors I saw. I have no doubts that just averaging the loss for 200, 400, 600 and 800 timesteps on a bunch of dan images would also show consistently worse performance, but oh well, I didn't bother to validate like that and just tested like this.
Prompt/settings
Positive: masterpiece, best quality, 2girls, hikari \(blue archive\), nozomi \(blue archive\), blue archive, blue armband, green hair, long hair, halo, blue jacket, white gloves, blue skirt, long skirt, hikari on left v fingers, nozomi on right waving, bare legs, barefoot, feet, toes, beach, outdoors, ocean
Negative: low quality, worst quality, white pantyhose, rabbit pose
28 steps, 4.5 cfg, er_sde, 1152x896
Image with no notable errors, not necessarily representative of everything generated. E.g. I did not specify eye color -> might have closed eyes or be viewed from behind.
Stats
Default qwen:
175 without victory finger pose
89 with bad halos
31 white legs (white pantyhose artifact)
16 catastrophic anatomical failures (these were only leg-related)
3 images that have an extra brat (3girls)
0 images have both gakis be the same one
2 images with the waving shupogaki on the wrong side (left/right) (0 are both waving)
0 images with the v-ing choochoobrats on the wrong side (left/right)
~6 images with one or more feet that are flipped (e.g. two left feet)
qwen3_06b_heretic_ablit:
200 without victory finger pose
129 with bad halos
32 white legs (white pantyhose artifact)
24 catastrophic anatomical failures (21 leg, 2 arm, 1 very broken image)
3 images that have an extra brat (3girls)
2 images have both gakis be the same one (at least, I may have missed some...)
21 images with the waving shupogaki on the wrong side (left/right) (5 are both waving)
2 images with the v-ing choochoobrats on the wrong side (left/right)
~14 images with one or more feet that are flipped (e.g. two left feet)
Link to an archive of the 500 images and my tracked errors for them, as captions
I'm not willing to sum up how many times I noted down all the variations of poor/er fingers or feet as I originally intended to, and I probably did not track that very consistently beyond a general feel that the abliterated model mostly made things worse. I dunno if I should count toes when there's 5 feet, or I couldn't decide what to count some broken finger poses as and so on. As I went through it, I found it more and more confusing and tedious thinking of how to classify all the wacky anatomical errors.
I did not track if the correct brat is on the left/right, since to be quite honest I doubt the model actually knows which is which given the vast majority of images of them include both.
I forgot to track skirt length. Whoops...
100-200 images in I started noticing that the abliterated model would also cause more frequent cases where one of the train brats is "standing" but her body is sunk too deep into the sand/shallow water, as if badly pasted on top of the image. Sadly I saw this too late and I did not intend to do a second pass just for this.
Also bad halo in the top left.
Or, in short: Using the abliterated TE with this single prompt very consistently breaks the brats' halos (36% -> 52%), worsens understanding of left/right, worsens understanding of "v" and causes more frequent failures of all kinds, including anatomical failures big and small and some pretty weird rarer results like the above. It's not placebo-tier, it's worse.
If you only look at a small sample, it is very possible to land on images where the ablit does better. Someone above mentions 4 pairs. Not enough. As an example, there are a few cases where the original qwen does not make the left choochoobrat do a v, and the ablit does.
I highly doubt these results are much different for other prompts. If someone wants to critique me for just testing a single prompt, I'm all eyes for a chart of the losses of default qwen and the ablit over some large amount of dan images.
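If anyone wants to sanity-check whether a gap like the halo one above (89/250 for stock vs 129/250 for the abliterated TE) could just be seed luck, here is a quick sketch (assuming SciPy is installed) of a 2x2 chi-square test on the tallied counts:

```python
# Sketch only: test whether the observed halo-failure gap is plausibly chance.
from scipy.stats import chi2_contingency

stock_bad, stock_ok = 89, 250 - 89    # default qwen
ablit_bad, ablit_ok = 129, 250 - 129  # heretic-abliterated

chi2, p, dof, expected = chi2_contingency([[stock_bad, stock_ok],
                                           [ablit_bad, ablit_ok]])
print(f"chi2={chi2:.2f}, p={p:.4g}")
```

For counts that far apart over 250 images per side, the p-value lands well below 0.05, so for this prompt the halo degradation looks like a real effect rather than seed noise.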
I decided to compare default qwen 0.6b and the "Qwen3-0.6B-heretic-abliterated-uncensored"
"I tried this new fast food place I heard about, and the food is just awful, so all other restaurants must be this bad, and things like reviews or recommendations don't really matter"
(like the guy [me] who said that merely abliterated models do NOT perform 'better', could perform worse, and have no real benefit, but that well trained replacement models do exist. aka not shocked at your results.)
Now, try it with the very next one on the list that IS additionally trained: Goekdeniz-Guelmez/Josiefied-Qwen3-0.6B-abliterated-v1
However, Anima might STILL not work well because you aren't merely talking to Qwen3, you are talking to Qwen 3 who is then talking to T5, who then talks to Anima model.
And it sounds like they are training Qwen3 0.6b to talk better to T5 (ie the Anima model is not using 'pure stock Qwen3 0.6b' but 'Anima trained Qwen3 0.6b who knows more about what T5 likes to hear')
Your results are entirely valid, and unsurprising, but don't reflect any of the other factors discussed, including just how that model was "heretic-ed/abliterated/uncensored". There is literally a benchmark for that, and it's well known that the 'lower the score', the 'fewer the rejections', but also that often the 'matrix' was broken to achieve that lower score. (Yes, it's possible to call it a lobotomy, digital, but still..... And there are other methods that focus on avoiding damage but settle for 'higher scores', meaning some rejections remain, but the tradeoff is acceptable.)