arxiv:2511.12207

Mixture of States: Routing Token-Level Dynamics for Multimodal Generation

Published on Nov 15
· Submitted by L on Nov 20
AI-generated summary

MoS, a novel multimodal diffusion model fusion paradigm, achieves state-of-the-art results in text-to-image generation and editing with minimal parameters and computational overhead by using a learnable, token-wise router for modality interaction.

Abstract

We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities using flexible, state-based interactions. The core of MoS is a learnable, token-wise router that creates denoising timestep- and input-dependent interactions between modalities' hidden states, precisely aligning token-level features with the diffusion trajectory. This router sparsely selects the top-k hidden states and is trained with an ε-greedy strategy, efficiently selecting contextual features with minimal learnable parameters and negligible computational overhead. We validate our design with text-to-image generation (MoS-Image) and editing (MoS-Editing), which achieve state-of-the-art results. With only 3B to 5B parameters, our models match or surpass counterparts up to 4× larger. These findings establish MoS as a flexible and compute-efficient paradigm for scaling multimodal diffusion models.

Community

Paper submitter

Scaling text-to-image generation isn't just about stacking layers—it's about smarter moves. The MoS team brings fresh tricks to the transformer table. Prior approaches predominantly rely on cross-attention, self-attention, or unified architectures such as Mixture-of-Transformers (MoT) to align textual and visual representations, enabling the model to follow text guidance effectively. In contrast, our design adopts an orthogonal perspective, guided by the following key principles:

  • Adaptive Layer Selection. We find that using a single fixed layer—typically the final-layer feature—or enforcing rigid one-to-one layer alignment is suboptimal. This indicates that diffusion models do not consume language features in a strictly sequential or layer-aligned manner, making a flexible selection mechanism essential.

  • Input-Dependent Conditional Signals. Modern text-to-image systems often encode the text once and keep it static throughout denoising. We find that this creates an information mismatch with the evolving nature of the diffusion trajectory. Conditional signals should adapt to the noise level and denoising step rather than remain fixed.

  • Token-Specific Conditioning. Our findings indicate that it is more effective to allow each token to source its representation adaptively from different layers, rather than using a single, shared layer embedding to represent all tokens uniformly. This supports a *more granular, token-level view* of context conditioning.
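The three principles above can be sketched together as a token-wise top-k router over encoder layer states. This is a hypothetical NumPy illustration, not the paper's implementation: the linear scorer `router_w`, scoring from the final-layer state, and the omission of timestep conditioning are all simplifying assumptions of this sketch (the paper's router is also denoising-timestep-dependent).

```python
import numpy as np

def token_wise_router(layer_states, router_w, k=2, epsilon=0.0, rng=None):
    """Hypothetical sketch of MoS-style token-wise top-k layer routing.

    layer_states: (L, T, D) hidden states from L text-encoder layers.
    router_w:     (D, L) assumed linear router projection.
    epsilon/rng:  optional ε-greedy exploration (training-time only).

    Returns a (T, D) conditioning signal: per token, a softmax-weighted
    mix of that token's top-k layer states.
    """
    L, T, D = layer_states.shape
    # Score every layer for every token (here: from the final-layer state).
    logits = layer_states[-1] @ router_w                 # (T, L)
    if rng is not None and epsilon > 0.0:
        # ε-greedy: with prob. epsilon, replace a token's scores with noise
        # so rarely-picked layers still receive gradient signal.
        explore = rng.random(T) < epsilon
        logits = np.where(explore[:, None],
                          rng.standard_normal((T, L)), logits)
    # Sparse top-k selection per token.
    topk = np.argsort(logits, axis=-1)[:, -k:]           # (T, k) layer ids
    sel_logits = np.take_along_axis(logits, topk, axis=-1)
    weights = np.exp(sel_logits - sel_logits.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)            # (T, k) softmax
    # Gather each token's selected layer states: (T, k, D).
    sel_states = layer_states[topk, np.arange(T)[:, None], :]
    return (weights[..., None] * sel_states).sum(axis=1)  # (T, D)
```

With `k=1` and no exploration this degenerates to picking a single best layer per token; the fixed-final-layer baseline criticized above corresponds to skipping the router entirely and always returning `layer_states[-1]`.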

