Qwen 3.5
So I follow AI news quite a lot, and it looks like Qwen has released the smaller variants of its Qwen 3.5 series. I'm curious whether they have a 0.6B version, and if they do, it should be about the same size as the current great Qwen3 0.6B text encoder.
I know I said previously that it might be better with a bigger model. However, I saw videos of the original Qwen3 0.6B by Bijanbowen, and I have to say size definitely doesn't matter. It's about how you train the model, and the current model they have as the text encoder is very good and trained very well, so hats off to the devs of this great anime model.
I'm also curious if there's a big difference between Qwen 3.5 and Qwen 3.
I love the open source models at the moment. I think they are incredible: video, image, and LLMs alike. It's definitely exciting times for the open source community and AI in general.
There is no 0.6B model in Qwen 3.5, but there is a 0.8B model available. I saw someone in an overseas community experiment with it. The architecture is slightly different, and some unnecessary parts were trimmed down. Fortunately, the hidden_size is the same, so they managed to integrate it into the existing DiT structure and get it running. However, the generated images did not come out properly—the shapes seemed to collapse and blend together.
It might be an issue with the adapter not being properly aligned, but it is difficult to determine exactly what the problem is. For now, it can be considered “possible,” but whether it will lead to “better” results remains to be seen.
I haven't tried the 0.8B yet, but judging from the 9B and 27B versions, there are obvious improvements over the previous generation. And the entire series comes with multimodality, so it might even be possible to pass in reference images, similar to CLIP?
(Regarding the TE's multimodality: it runs smoothly in the Newbie (test version), even though no special training has been done for this.)
However, if the existing structure is maintained (keeping T5's tokenizer), simply swapping Qwen3 0.6B for Qwen 3.5 0.8B may result in limited improvement.
I expect some behavior alignment can be achieved by fine-tuning the adapter part. I'll post a comment if there are meaningful results.
Alright — the person from the overseas community I mentioned earlier has finally produced some promising results. For clarity: I won’t post a direct link because the post contains NSFW images and comes from a non-English community, so quoting it directly would be inappropriate. I’ll only report the results here.
In short: about what we expected. They ported the Qwen 3.5 0.8B model and then fine-tuned the LLM adapter on a small dataset. A plain port had previously produced collapsed and distorted images; the additional adapter training, however, produced outputs that better preserve structure and align more closely with the prompts.
The overall conclusion remains the same: it’s still “possible,” but whether switching will actually yield broadly better results is uncertain.
Comparing the two models, Qwen3 0.6B vs. Qwen 3.5 0.8B, you might notice some differences in basic UX, but within Anima there's no clear, encouraging performance gap yet; the model mostly acts as a text encoder. Fully retraining the adapter on a much larger dataset to utilize the 0.8B model could mean discarding roughly half the existing progress and rebuilding the adapter from scratch.
I hope the developer sees this and posts more details so people can be reassured — I’m waiting for any additional information.
Yeah, to be honest the prompt adherence is very good even with the current Qwen3 0.6B model, especially when you have multiple characters. Make sure you use names (Jess, John...) rather than just 1girl or 1boy; I've noticed it gives better results.
Also, if you use pure natural language, make sure it's descriptive. Personally I use Dan tags for certain poses/anatomy stuff 😏 and mostly natural language for everything else, as it works very well. I have also seen templates that people are using for Z Image Turbo, which uses Qwen3 4B as its text encoder. This is one that has been working well for more complex images:
[Style & Aesthetic]
(Your quality tags)
[Composition & Camera]
(What camera angle you want for the photo: full body shot, Dutch angle...)
[Subjects & Anatomy]
(How many subjects do you want? Use names like "Jess, a 21-year-old blonde woman" or "John, a 48-year-old African man" instead of just 1girl and 1boy, as I found this works better sometimes, but your mileage may vary.)
[Action]
(What your subject/subjects are doing...)
[Environment & Atmosphere]
(The more description of your environment and background the better, because if you don't, it tends to give you the same background; the model is very good at following the prompt, but the variety isn't the best at the moment.)
[Lighting & Contrast]
(Be careful with this: if you go too heavy on the lighting, it can make the characters look a bit washed out. I tend to just say it's a sunny day, or something simple like that.)
I've also experimented with BREAK like in SDXL and Illustrious, and it has surprisingly worked pretty well, especially when my prompt has been very long.
These may or may not work for you but these are some of the things I've been trying. I hope it helps.
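The sectioned template above can be wrapped in a small helper if you build prompts programmatically. The section headers follow the template exactly; the example values below are made up:

```python
def build_prompt(style, composition, subjects, action, environment, lighting):
    """Assemble a sectioned prompt in the [Header] style described above."""
    sections = [
        ("Style & Aesthetic", style),
        ("Composition & Camera", composition),
        ("Subjects & Anatomy", subjects),
        ("Action", action),
        ("Environment & Atmosphere", environment),
        ("Lighting & Contrast", lighting),
    ]
    # Skip empty sections so short prompts stay short.
    return "\n".join(f"[{name}]\n{text}" for name, text in sections if text)

# Made-up example values:
prompt = build_prompt(
    style="masterpiece, best quality",
    composition="full body shot, slight Dutch angle",
    subjects="Jess, a 21-year-old blonde woman; John, a 48-year-old African man",
    action="sharing an umbrella while crossing the street",
    environment="rainy neon-lit city at night, reflective puddles",
    lighting="soft overcast light",
)
print(prompt)
```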
I've also sometimes been using a prompt enhancer, which has been helping a lot by expanding my prompt, although it goes through OpenRouter, so the latency can be a bit slow depending on the LLM model.
I am actually running a parallel experiment with changing the text encoder to Qwen3.5-2B-Base.
It is fairly straightforward to train a new LLM adapter from scratch to align to the existing text embeddings and produce coherent images. I've already done this and it works.
What takes (potentially) much longer, is fully recovering the character details and artist knowledge after switching to the new text encoder. A surprising amount of knowledge, especially for styles, is contained in the LLM adapter, not the DiT. It has to relearn all this knowledge from the full dataset. I will see how fast it is able to recover the knowledge.
It's entirely possible, and even likely, that it would just take way too much time to fully adapt to a new text encoder, and so I wouldn't go with this option. There are also some reasons to believe that the model is bottlenecked in other ways and that Qwen3-0.6b isn't actually hurting quality or prompt comprehension that much. But I am investigating whether it's feasible to switch to Qwen3.5-2B.
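For the "align to the existing text embeddings" step, one cheap warm start, assuming (unrealistically) a purely linear adapter, is a closed-form least-squares fit of the new encoder's outputs onto the old embedding space. All shapes and data here are toy placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dimensions (real hidden sizes of the new TE and the DiT differ).
d_new, d_old, n = 80, 48, 400
new_emb = rng.normal(size=(n, d_new))   # new TE outputs for n tokens
old_emb = rng.normal(size=(n, d_old))   # existing (target) text embeddings

# Closed-form least-squares fit: find W so that new_emb @ W ≈ old_emb.
W, residuals, rank, _ = np.linalg.lstsq(new_emb, old_emb, rcond=None)
print(W.shape, rank)
```

A fit like this would only initialize the adapter; the long tail described above (recovering character and style knowledge) would still require full training on the dataset.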
What about qwen 3.5 0.8B? The 2B model isn’t huge, but some people could still experience offloading or OOM because of it. Since the size difference compared to the existing 0.6B model is relatively small, wouldn’t it also allow you to see results more quickly?
I tried some experiments, including fine-tuning the adapter part (the one that belongs to anima-preview.safetensors), and realized that:
- nouns/verbs adapt easily
- pronouns are quite difficult to adapt
What I find interesting, though, is the vision capability of Qwen 3.5 0.8B. It can understand images quite decently, so I guess there may be ways to incorporate that.
A surprising amount of knowledge, especially for styles, is contained in the LLM adapter, not the DiT. It has to relearn all this knowledge from the full dataset. I will see how fast it is able to recover the knowledge.
Is this why LoRAs don't work well?
I agree, I think Qwen 3.5 0.8B might be better as it's smaller and lighter, plus a decent upgrade from Qwen3 0.6B based on the tests people have been doing online and on YouTube. To be honest, the current Qwen3 0.6B is actually very decent if you prompt it correctly, and it works well with Dan tags and natural language. Plus it's pretty fast.
Also take your time with this, tdrussell, as I'm sure it takes a lot of time and work testing and training... I hope it goes well, and thanks again for all the incredible work you've done on this great anime model.
I spent 3 days doing an experiment to train the model to work with Qwen3.5 2b. I decided not to use this option, and will continue using the original Qwen3 0.6b.
The experiment "worked", in the sense that it got about 95% of the way to the original model's quality. However:
- Artist and character knowledge lagged behind by a small but noticeable amount. Estimated 2 weeks of continuous training to fully recover and match the original model.
- There was no obvious improvement in the rate of decrease for stabilized loss curves.
- Even manual testing indicated no immediate observable improvement in prompt comprehension.
Both in terms of objective metrics (loss values) and subjective model testing, I couldn't see any clear win from switching to the larger text encoder, but there was a small loss in character and artist knowledge that would have taken too much time to recover. Since there is no clear improvement, there's no sense trying to go with a larger TE and I will stick to the original architecture.
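For readers unfamiliar with "stabilized loss curves": raw diffusion training loss is noisy, so it is typically smoothed (e.g. with an exponential moving average) before comparing the rate of decrease between two runs. A toy sketch with entirely made-up curves:

```python
import random

def ema(values, beta=0.98):
    """Exponential moving average: the usual way noisy loss curves are stabilized."""
    out, avg = [], values[0]
    for v in values:
        avg = beta * avg + (1 - beta) * v
        out.append(avg)
    return out

# Made-up noisy "loss curves" for two hypothetical runs:
random.seed(0)
curve_a = [1.0 / (1 + 0.01 * i) + random.gauss(0, 0.05) for i in range(1000)]
curve_b = [1.0 / (1 + 0.01 * i) + random.gauss(0, 0.05) for i in range(1000)]

smooth_a, smooth_b = ema(curve_a), ema(curve_b)
# Compare the slope over the last stretch rather than raw noisy values.
slope_a = smooth_a[-1] - smooth_a[-200]
slope_b = smooth_b[-1] - smooth_b[-200]
print(round(slope_a, 4), round(slope_b, 4))
```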
Can you upload the model trained on Qwen 3.5? Maybe it works better for some people in specific scenarios.
I'm not really opposed to uploading it, but 1) it is definitely overall worse, 2) it is a medium resolution trained model that doesn't work at even 1024 res, and 3) the only way to run it is hacked together custom node code that isn't public.
I see, some people are trying to adapt Qwen 3.5 to Anima, so maybe this would help them. I don't know if it works, but there is a node that is supposed to run Anima with Qwen 3.5 here: https://github.com/GumGum10/comfyui-qwen35-anima
Hi! I am the person trying to adapt Qwen 3.5 4B. The native Qwen embeddings yielded worse results, but with some alignment of the embedding space to match T5, it's working a lot better. I will upload the model and code shortly.
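One simple form that "alignment of the embedding space to match T5" could take, if the dimensions happened to match, is orthogonal Procrustes: find the rotation that best maps one embedding space onto the other via an SVD. This is purely illustrative (the actual method in the linked repo is unknown, the real hidden sizes differ and would need a rectangular projection, and all data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup with matching dimensions.
d, n = 32, 200
src = rng.normal(size=(n, d))                    # e.g. Qwen-side embeddings
R_true, _ = np.linalg.qr(rng.normal(size=(d, d)))
tgt = src @ R_true                               # e.g. T5-space embeddings

# Orthogonal Procrustes: the best rotation R with src @ R ≈ tgt
# is R = U @ Vt, where U, S, Vt is the SVD of src.T @ tgt.
U, _, Vt = np.linalg.svd(src.T @ tgt)
R = U @ Vt

err = float(np.linalg.norm(src @ R - tgt))
print(f"alignment error: {err:.6f}")
```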
Hi, please see below. This is more of an adapter rather than a direct integration of Qwen 3.5 4B:
https://huggingface.co/lylogummy/anima2b-qwen-3.5-4b
https://civitai.com/models/2455272?modelVersionId=2760745
https://github.com/GumGum10/comfyui-qwen35-anima
I sincerely believe that upgrading the TE has potential. Yes, initially there may be diminishing results, and if upgrading the TE would delay the 1.0 release I wouldn't be doing it either, but maybe in the future it's a worthy endeavor to pursue.
I'm not trying to belittle your efforts or be overly critical, but to be honest, it's not clear that using a relatively large model like 4B offers any definite benefits.
When the sample images are anonymized, it's difficult to tell which is clearly better. If the 4B model truly produces superior results, it would be more convincing to demonstrate that it yields better outputs on average under controlled conditions — for example, by varying only a few settings such as seed or a single prompt.
Beyond the quality concerns, there are also practical costs to consider. Using 4B means you're now relying on something larger than the previous 0.6B — effectively using auxiliary components that exceed the size of the model itself. There's a certain irony in that.
On top of that, a larger model will inevitably increase the likelihood of requiring offloading during generation, or causing OOMs depending on the setup. Similar issues can arise during training as well, not just for end users. Over time, this becomes a small but real barrier for the many people who use and develop local models.
To summarize: given these risks and long-term costs, it's worth carefully reconsidering whether adopting 4B remains a justified decision once all of these factors are weighed — though if the results do prove out under controlled testing, that would certainly change the picture.
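The rough arithmetic behind the offloading/OOM concern: fp16 weights cost 2 bytes per parameter, so weight memory alone (ignoring activations, context, and any quantization) scales directly with model size:

```python
def fp16_weight_gib(params_billions):
    """Approximate weight memory at 2 bytes per parameter, fp16, in GiB."""
    return params_billions * 1e9 * 2 / 2**30

# Approximate parameter counts for the models discussed in this thread:
for name, size in [("Qwen3 0.6B", 0.6), ("Qwen3.5 0.8B", 0.8),
                   ("Qwen3.5 2B", 2.0), ("Qwen3.5 4B", 4.0)]:
    print(f"{name}: ~{fp16_weight_gib(size):.1f} GiB of weights in fp16")
```

So the jump from 0.6B to 4B adds several GiB of VRAM for the text encoder alone, which is the concern raised above for lower-end setups.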