Mixture of States: Routing Token-Level Dynamics for Multimodal Generation
Abstract
We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities through flexible, state-based interactions. The core of MoS is a learnable, token-wise router that creates denoising timestep- and input-dependent interactions between modalities' hidden states, precisely aligning token-level features with the diffusion trajectory. This router sparsely selects the top-k hidden states and is trained with an ε-greedy strategy, efficiently selecting contextual features with minimal learnable parameters and negligible computational overhead. We validate our design on text-to-image generation (MoS-Image) and editing (MoS-Editing), both of which achieve state-of-the-art results. With only 3B to 5B parameters, our models match or surpass counterparts up to 4× larger. These findings establish MoS as a flexible and compute-efficient paradigm for scaling multimodal diffusion models.
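To make the routing idea concrete, below is a minimal, hypothetical PyTorch sketch of a token-wise, top-k router with ε-greedy exploration. It assumes the conditioning input is a stack of text-encoder hidden states from every layer plus a timestep embedding; the class name `TokenwiseRouter`, the single-linear scoring head, and the `top_k`/`epsilon` hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a MoS-style token-wise router (not the authors' code).
# Assumed inputs: `text_states` stacks the hidden states of every text-encoder
# layer, shaped [num_layers, num_tokens, dim]; `t_emb` is a denoising-timestep
# embedding of shape [dim].
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenwiseRouter(nn.Module):
    """Scores every (token, layer) pair and mixes the top-k layer states per token."""

    def __init__(self, dim: int, num_layers: int, top_k: int = 2, epsilon: float = 0.1):
        super().__init__()
        self.top_k = top_k
        self.epsilon = epsilon        # ε-greedy exploration rate used during training
        # A tiny scoring head keeps the parameter overhead negligible.
        self.score = nn.Linear(dim, 1)

    def forward(self, text_states: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # text_states: [L, N, D]; t_emb: [D]
        L, N, D = text_states.shape
        # Routing decisions depend on both token content and the denoising
        # timestep (input- and timestep-dependent interaction).
        logits = self.score(text_states + t_emb).squeeze(-1)      # [L, N]
        logits = logits.transpose(0, 1)                           # [N, L]

        if self.training and torch.rand(()) < self.epsilon:
            # ε-greedy: occasionally explore random layers instead of the top-k.
            idx = torch.stack(
                [torch.randperm(L, device=logits.device)[: self.top_k] for _ in range(N)]
            )                                                      # [N, k]
        else:
            idx = logits.topk(self.top_k, dim=-1).indices          # [N, k]

        # Gather the selected per-token hidden states and mix them with
        # softmax weights renormalized over the chosen layers.
        gathered = text_states.transpose(0, 1).gather(             # [N, k, D]
            1, idx.unsqueeze(-1).expand(-1, -1, D)
        )
        weights = F.softmax(logits.gather(1, idx), dim=-1)         # [N, k]
        return (weights.unsqueeze(-1) * gathered).sum(dim=1)       # [N, D]
```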
Community
Scaling text-to-image generation isn't just about stacking layers—it's about smarter moves. The MoS team brings fresh tricks to the transformer table. Prior approaches predominantly rely on cross-attention, self-attention, or unified architectures such as Mixture-of-Transformers (MoT) to align textual and visual representations, enabling the model to follow text guidance effectively. In contrast, our design adopts an orthogonal perspective, guided by the following key principles:
Adaptive Layer Selection. We find that using a single fixed layer—typically the final-layer feature—or enforcing rigid one-to-one layer alignment is suboptimal. This indicates that diffusion models do not consume language features in a strictly sequential or layer-aligned manner, making a flexible selection mechanism essential.
Input-Dependent Conditional Signals. Modern text-to-image systems often encode the text once and keep it static throughout denoising. We find that this creates an information mismatch with the evolving nature of the diffusion trajectory. Conditional signals should adapt to the noise level and denoising step rather than remain fixed.
Token-Specific Conditioning. Our findings indicate that it is more effective to allow each token to source its representation adaptively from different layers, rather than using a single, shared layer embedding to represent all tokens uniformly. This supports a *more granular, token-level view* of context conditioning.
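To illustrate these three principles together, here is a toy usage of the hypothetical `TokenwiseRouter` sketched above; the layer count, token count, and random timestep embeddings are made-up placeholders. The same prompt tokens routed at two denoising steps yield different conditioning signals per token, unlike a fixed final-layer embedding.

```python
# Toy usage of the hypothetical TokenwiseRouter defined in the sketch above.
import torch

num_layers, num_tokens, dim = 12, 8, 64
router = TokenwiseRouter(dim, num_layers, top_k=2).eval()

text_states = torch.randn(num_layers, num_tokens, dim)   # stacked encoder layers
t_early = torch.randn(dim)                               # e.g. high-noise step
t_late = torch.randn(dim)                                # e.g. low-noise step

with torch.no_grad():
    cond_early = router(text_states, t_early)            # [num_tokens, dim]
    cond_late = router(text_states, t_late)              # [num_tokens, dim]

# Unlike a fixed final-layer embedding (text_states[-1]), the routed context
# differs across tokens and across denoising steps.
print(torch.allclose(cond_early, cond_late))             # typically False
```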
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Growing Visual Generative Capacity for Pre-Trained MLLMs (2025)
- JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation (2025)
- LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation (2025)
- MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation (2025)
- ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation (2025)
- UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning (2025)
- Query-Kontext: An Unified Multimodal Model for Image Generation and Editing (2025)