We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities using flexible, state-based interactions. The core of MoS is a learnable, token-wise router that creates denoising timestep- and input-dependent interactions between modalities' hidden states, precisely aligning token-level features with the diffusion trajectory. This router sparsely selects the top-k hidden states and is trained with an epsilon-greedy strategy, efficiently selecting contextual features with minimal learnable parameters and negligible computational overhead. We validate our design with text-to-image generation (MoS-Image) and editing (MoS-Editing), which achieve state-of-the-art results. With only 3B to 5B parameters, our models match or surpass counterparts up to 4x larger. These findings establish MoS as a flexible and compute-efficient paradigm for scaling multimodal diffusion models.
Accepted by CVPR 2026 | arXiv:2511.12207v1
Multimodal generation is typically built on a fixed attention regime: either causal decoding or fully bi-directional interaction. However, real multimodal reasoning is asymmetric and context-dependent. Different tokens may require different visibility patterns at different layers, especially when text instructions, visual context, and generation targets interact over a long denoising trajectory.
We introduce Mixture of States (MoS), a token-level routing framework that allows each token to dynamically switch between causal and bi-directional states. Instead of hand-crafted fusion rules, MoS learns these routing decisions end-to-end with a lightweight router, enabling the model to adapt token interactions to input content and generation stage.
Following this principle, the key characteristics of MoS can be summarized as follows:
Adaptive layer selection, enabled by token-wise routing. Rather than always consuming one fixed text layer (or enforcing rigid one-to-one layer matching), MoS lets each token query a pool of cross-modal hidden states and select useful ones through a learnable router. In practice, the router performs sparse top-k state selection, so layer usage is decided by content and context, not by a hard-coded layer index.
Dynamic, timestep-dependent conditioning through state routing. Diffusion denoising is non-stationary, but static one-shot text conditioning cannot reflect this evolution. MoS addresses this by making routing decisions conditioned on the current denoising state, so the selected conditional features vary across timesteps and noise levels. This keeps guidance aligned with what the model needs at each stage of generation.
Token-specific conditional signals at fine granularity. Different tokens require different evidence (semantic, structural, or style cues), so sharing one uniform layer embedding across all tokens is restrictive. MoS routes at token level: each token can draw from different states and layers, producing a more precise conditional interface between modalities and improving alignment in both generation and editing.
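The three properties above can be illustrated with a minimal sketch of token-wise sparse routing. This is not the paper's implementation: the function name `mos_route`, the dot-product scoring, and the tensor shapes are illustrative assumptions; only the top-k selection, the softmax over selected states, and the epsilon-greedy exploration mirror the described design.

```python
import numpy as np

def mos_route(query_tokens, layer_states, k=2, epsilon=0.0, rng=None):
    """Token-wise sparse top-k routing over a pool of cross-modal hidden states.

    query_tokens: (T, d) generation-tower token features
    layer_states: (L, T, d) hidden states from L understanding-tower layers
    Returns the routed states (T, d) and the (T, L) routing weights.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    T, d = query_tokens.shape
    L = layer_states.shape[0]
    # Router logits: one score per (token, layer) pair, here a scaled dot product
    # (a stand-in for the paper's learnable router).
    logits = np.einsum('td,ltd->tl', query_tokens, layer_states) / np.sqrt(d)
    weights = np.zeros((T, L))
    for t in range(T):
        if rng.random() < epsilon:           # epsilon-greedy exploration (training only)
            top = rng.choice(L, size=k, replace=False)
        else:                                # greedy sparse top-k selection
            top = np.argsort(logits[t])[-k:]
        w = np.exp(logits[t, top] - logits[t, top].max())
        weights[t, top] = w / w.sum()        # softmax over the selected states only
    routed = np.einsum('tl,ltd->td', weights, layer_states)
    return routed, weights
```

Because the query features change with the denoising state, the same router yields different selections at different timesteps, which is the timestep-dependent behavior described above.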
Figure 1. MoS Design Details
Figure 2. Image Generation: Teaser and Prompt-conditioned Comparison.
Prompt-conditioned comparisons
MoS-Image is highlighted on the right.
A colorful poster with the title at the top in large letters: “Author Meet and Greet on Saturday.” Below the title is a portrait of the author in the center. At the bottom, smaller text reads “Book Signing and Q&A.”
On a large wooden table, a variety of foods are arranged in a vibrant display. In the center sits a pepperoni pizza, cut into eight slices, the golden crust slightly charred at the edges, melted cheese stretching between slices, and glossy red pepperoni discs glistening with oil. To the right, a hamburger is stacked high on a white plate: a sesame seed bun with a juicy beef patty, melted cheddar cheese dripping down the sides, layers of green lettuce, red tomato slices, and pickles visible in between, with golden French fries scattered beside it. On the left, a grilled fish is presented on a rectangular platter, its skin crispy and golden-brown with hints of char, garnished with lemon slices placed along its body and fresh parsley sprinkled across. Near the top of the table, a bowl of fruit overflows with color: shiny red apples, bright yellow bananas curving upward, deep purple grapes spilling over the edge, and a cut-open orange revealing its juicy segments. At the front of the scene, a small dessert plate holds a slice of chocolate cake, dark and rich with glossy frosting, topped with a bright red strawberry. The entire table is lit with soft natural light, creating highlights on the glossy fruit skins, reflections on the melted cheese, and warm shadows under the plates, giving the display a fresh and appetizing look.
A Chinese restaurant menu poster with a solid black background and golden decorative borders. At the top, in large bold letters, the heading says “Today’s Specials.” The appetizers section lists: “Spring Rolls - ¥18,” “Dumplings - ¥22,” “Hot and Sour Soup - ¥20.” The main dishes section displays in larger text: “Kung Pao Chicken - ¥45,” “Braised Beef - ¥55,” “Eggplant in Garlic Sauce - ¥38.” At the bottom, the desserts section reads: “Sesame Balls - ¥25,” “Mango Pudding - ¥28.” All menu items are written in clear white letters against the black background, with the prices shown directly beside each dish.
The image is an advertisement for a GPS tracking device designed for dogs, featuring a brown dog running in the woods and a close-up of the device. In the foreground, a brown dog and a yellow collar is prominently displayed, running on a dirt path surrounded by trees. To the right of the dog, a text overlay reads "LIVE GPS TRACKING" in black font within a yellow rectangle, followed by "NEVER HAVING TO HOPE SOMEONE SCANS THEIR MICROCHIP" in white font. This text highlights the key benefit of the product. In the bottom center of the image, a close-up view of the GPS tracking device is shown. The device is black with a yellow strap and features the letters "MoS" in white on its front. The strap is made of a textured material and has a black plastic buckle. The overall design of the device appears sleek and modern. The background of the image is a blurred forest scene, with trees and foliage visible behind the dog. The atmosphere is one of freedom and adventure, as the dog runs through the woods with ease.
The image is divided into two sections. The left side features a dark teal background with a grid pattern, accompanied by large white text that reads "Owner makes $131,150 IN 10 MONTHS WHEN PARTNERING WITH MoS." The word "MoS" is displayed in teal and yellow font below the main text. On the right side of the image, there is a photograph of a house situated in a wooded area. The house has a dark green exterior with white trim around the windows and doors. A small porch is visible at the front entrance, which is flanked by two lanterns on either side. The roof appears to be made of metal, and the surrounding landscape includes trees and bushes. A wooden walkway leads up to the front door, adding to the overall aesthetic appeal of the property.
Prompt-conditioned one-row generation comparison across five prompts, with MoS-Image highlighted on the right.
Figure 3. Instruction-based Editing: Teaser and Prompt-conditioned Comparison.
Prompt-conditioned editing
MoS-Editing is highlighted on the right.
remove 'Lover' text from the image, change the diffusion's color to light blue, add a word 'MoS' with purple under hand, and change the color of hand to gray
Make it a smiling face, add glasses, change to blue fur and green clothes.
The image depicts a person's hand holding a ring between their thumb and index finger. The hand has fair skin and is well-manicured. The ring is gold with pearls and diamonds, and is positioned in the center of the image. The background is a blurred beige or brown color, which helps to emphasize the ring and the hand. The overall tone of the image is elegant and sophisticated, with a focus on showcasing the ring in a luxurious and refined manner. The camera is positioned at a slightly elevated angle, capturing a close-up shot with a shallow depth of field, making the background blur and the ring sharp. The lighting is soft and gentle, with a subtle sheen on the hand and the ring. The image is of exceptionally high quality, with a clear and sharp focus on the ring and the hand.
Split the image into two sections, in the right section change the season to spring, in the left section change the season to winter, keep the original position of the tree
Prompt-conditioned one-row editing comparison across four cases, with MoS-Editing highlighted on the right.
To make the router interpretable, we visualize its behavior on the caption “A dog holding a sign that says ‘MoS in 2025’”. The first row shows the denoising trajectory, illustrating how the output evolves from pure noise to a coherent image as sampling proceeds.
We then summarize the router from two complementary views. First, we average routing weights across generation blocks and tokens to measure layer-wise importance at different denoising steps. Second, we fix a denoising step and inspect token-conditioned routing maps, which reveal how individual words trigger different connections between the understanding and generation towers.
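The two aggregation views can be sketched as follows, assuming routing weights are logged as a (steps, blocks, tokens, layers) array; the array name, sizes, and the Dirichlet-sampled stand-in data are hypothetical.

```python
import numpy as np

# Hypothetical logged routing weights: (denoising steps, generation blocks, tokens, layers).
# Each (step, block, token) slice is a distribution over understanding-tower layers.
rng = np.random.default_rng(0)
weights = rng.dirichlet(np.ones(12), size=(50, 24, 77))

# View 1: layer-wise importance per denoising step, averaged over blocks and tokens.
layer_importance = weights.mean(axis=(1, 2))   # (steps, layers)

# View 2: token-conditioned routing at a fixed step, averaged over blocks only.
step = 10
token_routing = weights[step].mean(axis=0)     # (tokens, layers)
```

View 1 produces the phase-dependence plot (how layer importance drifts over the trajectory); View 2 produces the per-word routing maps.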
Routing is phase-dependent: early steps are sparse and selective, while later steps shift toward smoother, more stable importance patterns.
Routing is token-specific: words such as “dog”, “holding”, and “sign” activate distinct layer combinations rather than sharing a single global pattern.
These qualitative patterns support the main design claim of MoS: useful multimodal fusion is neither uniform nor static, but depends on both denoising stage and token semantics.
Figure 4. Router Visualization.
Top: generation outputs at different denoising steps. Middle: layer-wise importance aggregated across generation blocks and tokens. Bottom: token-conditioned routing patterns showing that different words induce different preferences over understanding-tower layers.
To quantify the practical cost of MoS routing, we decompose end-to-end inference into three stages: input encoding, iterative generation, and output decoding. As shown below, the understanding tower takes about 0.094s, the generation tower takes about 0.121s, and decoding takes about 0.016s, while the MoS router contributes only around 0.007s. This shows that token-wise routing introduces a small latency overhead relative to the dominant generation path.
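Taking the reported per-stage timings at face value (and assuming they are measured on a common unit of work), the router's share of end-to-end latency works out to roughly 3%:

```python
# Stage timings reported above, in seconds.
stages = {"understanding": 0.094, "generation": 0.121, "decoding": 0.016, "router": 0.007}
total = sum(stages.values())
router_share = stages["router"] / total
print(f"total: {total:.3f}s, router share: {router_share:.1%}")
# → total: 0.238s, router share: 2.9%
```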
These results support our design goal: enabling dynamic, token-specific conditioning without sacrificing deployment efficiency under realistic sampling settings.
Figure 5. Router Efficiency. Runtime breakdown across input encoding, iterative generation, and output decoding. The router overhead is small compared with the main generation computation.
MoS replaces fixed multimodal fusion with a lightweight, token-wise router that dynamically selects hidden states across towers and denoising steps. This simple change makes the model substantially more adaptive without introducing heavy computational overhead.
Across image generation, instruction-based editing, router visualization, and efficiency analysis, the same picture emerges: effective multimodal interaction should be dynamic, phase-aware, and token-specific. MoS turns that principle into a practical design that improves quality and controllability while remaining efficient to deploy.
@inproceedings{liu2025mixture,
title={Mixture of States: Routing Token-Level Dynamics for Multimodal Generation},
author={Liu, Haozhe and Liu, Ding and Zhuge, Mingchen and Zhou, Zijian and Xie, Tian and He, Sen and Yang, Yukang and Liu, Shuming and Cong, Yuren and Guo, Jiadong and others},
booktitle={CVPR},
year={2026}
}