Mixture of States: Routing Token-Level Dynamics for Multimodal Generation

Haozhe Liu1,2,*, Ding Liu2,*, Mingchen Zhuge1,2,†, Zijian Zhou2,†, Tian Xie2,†, Sen He2, Yukang Yang2, Shuming Liu1,2, Yuren Cong2, Jiadong Guo2, Hongyu Xu2, Ke Xu2, Kam-Woh Ng2, Juan C. Pérez2, Juan-Manuel Pérez-Rúa2, Tao Xiang2, Wei Liu2, Shikun Liu2,†, and Jürgen Schmidhuber1,†

1 KAUST    2 Meta AI
* Equal contribution (Joint First Authors)    † Core Contributors

We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities using flexible, state-based interactions. The core of MoS is a learnable, token-wise router that creates denoising timestep- and input-dependent interactions between modalities' hidden states, precisely aligning token-level features with the diffusion trajectory. This router sparsely selects the top-k hidden states and is trained with an epsilon-greedy strategy, efficiently selecting contextual features with minimal learnable parameters and negligible computational overhead. We validate our design with text-to-image generation (MoS-Image) and editing (MoS-Editing), which achieve state-of-the-art results. With only 3B to 5B parameters, our models match or surpass counterparts up to 4x larger. These findings establish MoS as a flexible and compute-efficient paradigm for scaling multimodal diffusion models.

Accepted by CVPR 2026 | arXiv:2511.12207v1

Introduction

Multimodal generation is typically built on a fixed attention regime: either causal decoding or fully bi-directional interaction. However, real multimodal reasoning is asymmetric and context-dependent. Different tokens may require different visibility patterns at different layers, especially when text instructions, visual context, and generation targets interact over a long denoising trajectory.

We introduce Mixture of States (MoS), a token-level routing framework that allows each token to dynamically switch between causal and bi-directional states. Instead of hand-crafted fusion rules, MoS learns these routing decisions end-to-end with a lightweight router, enabling the model to adapt token interactions to input content and generation stage.

Following this principle, the key characteristics of MoS can be summarized as:

MoS Design Details

Figure 1. MoS Design Details
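
The routing idea can be illustrated with a minimal sketch, under several assumptions that are not stated on this page: the router scores (`logits`) are taken as given rather than produced by a learned, timestep-conditioned network; the candidates are per-layer hidden states for a single token; and the selected scores are normalized with a softmax. The names `route_token` and `mix_states` are hypothetical. This is not the authors' implementation; it only shows sparse top-k selection with epsilon-greedy exploration.

```python
import math
import random


def route_token(logits, k, epsilon=0.0, rng=random):
    """Return sparse mixing weights over candidate hidden states for one token.

    logits  : one router score per candidate state (assumed given here).
    k       : number of candidates to keep.
    epsilon : probability of exploring with a random choice (training only).
    """
    n = len(logits)
    if rng.random() < epsilon:
        # Exploration: pick k distinct candidates uniformly at random.
        chosen = rng.sample(range(n), k)
    else:
        # Exploitation: pick the k highest-scoring candidates.
        chosen = sorted(range(n), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the chosen candidates; all others get weight 0.
    top = max(logits[i] for i in chosen)
    exps = {i: math.exp(logits[i] - top) for i in chosen}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(n)]


def mix_states(states, weights):
    """Weighted sum of candidate state vectors (one vector per candidate)."""
    dim = len(states[0])
    return [sum(w * s[d] for w, s in zip(weights, states)) for d in range(dim)]


# Three candidate states for one token; keep the top 2 (no exploration).
weights = route_token([0.0, 2.0, 1.0], k=2, epsilon=0.0)
mixed = mix_states([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], weights)
```

In this sketch, setting `epsilon=0.0` corresponds to deterministic top-k selection at inference time, while a positive `epsilon` occasionally routes to random candidates so that rarely selected states still receive training signal.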

MoS In Action

1. Image Generation

Figure 2. Image Generation: Teaser and Prompt-conditioned Comparison.

Teaser 1 for image generation.

Prompt-conditioned comparisons

Compared models: Qwen-Image, Flux, Bagel, SANA, and MoS-Image.
Prompt 1

A colorful poster with the title at the top in large letters: “Author Meet and Greet on Saturday.” Below the title is a portrait of the author in the center. At the bottom, smaller text reads “Book Signing and Q&A.”

[Image row for prompt 1: Qwen-Image, Flux, Bagel, SANA, MoS-Image]

Prompt 2

On a large wooden table, a variety of foods are arranged in a vibrant display. In the center sits a pepperoni pizza, cut into eight slices, the golden crust slightly charred at the edges, melted cheese stretching between slices, and glossy red pepperoni discs glistening with oil. To the right, a hamburger is stacked high on a white plate: a sesame seed bun with a juicy beef patty, melted cheddar cheese dripping down the sides, layers of green lettuce, red tomato slices, and pickles visible in between, with golden French fries scattered beside it. On the left, a grilled fish is presented on a rectangular platter, its skin crispy and golden-brown with hints of char, garnished with lemon slices placed along its body and fresh parsley sprinkled across. Near the top of the table, a bowl of fruit overflows with color: shiny red apples, bright yellow bananas curving upward, deep purple grapes spilling over the edge, and a cut-open orange revealing its juicy segments. At the front of the scene, a small dessert plate holds a slice of chocolate cake, dark and rich with glossy frosting, topped with a bright red strawberry. The entire table is lit with soft natural light, creating highlights on the glossy fruit skins, reflections on the melted cheese, and warm shadows under the plates, giving the display a fresh and appetizing look.

[Image row for prompt 2: Qwen-Image, Flux, Bagel, SANA, MoS-Image]

Prompt 3

A Chinese restaurant menu poster with a solid black background and golden decorative borders. At the top, in large bold letters, the heading says “Today’s Specials.” The appetizers section lists: “Spring Rolls - ¥18,” “Dumplings - ¥22,” “Hot and Sour Soup - ¥20.” The main dishes section displays in larger text: “Kung Pao Chicken - ¥45,” “Braised Beef - ¥55,” “Eggplant in Garlic Sauce - ¥38.” At the bottom, the desserts section reads: “Sesame Balls - ¥25,” “Mango Pudding - ¥28.” All menu items are written in clear white letters against the black background, with the prices shown directly beside each dish.

[Image row for prompt 3: Qwen-Image, Flux, Bagel, SANA, MoS-Image]

Prompt 4

The image is an advertisement for a GPS tracking device designed for dogs, featuring a brown dog running in the woods and a close-up of the device. In the foreground, a brown dog and a yellow collar is prominently displayed, running on a dirt path surrounded by trees. To the right of the dog, a text overlay reads "LIVE GPS TRACKING" in black font within a yellow rectangle, followed by "NEVER HAVING TO HOPE SOMEONE SCANS THEIR MICROCHIP" in white font. This text highlights the key benefit of the product. In the bottom center of the image, a close-up view of the GPS tracking device is shown. The device is black with a yellow strap and features the letters "MoS" in white on its front. The strap is made of a textured material and has a black plastic buckle. The overall design of the device appears sleek and modern. The background of the image is a blurred forest scene, with trees and foliage visible behind the dog. The atmosphere is one of freedom and adventure, as the dog runs through the woods with ease.

[Image row for prompt 4: Qwen-Image, Flux, Bagel, SANA, MoS-Image]

Prompt 5

The image is divided into two sections. The left side features a dark teal background with a grid pattern, accompanied by large white text that reads "Owner makes $131,150 IN 10 MONTHS WHEN PARTNERING WITH MoS." The word "MoS" is displayed in teal and yellow font below the main text. On the right side of the image, there is a photograph of a house situated in a wooded area. The house has a dark green exterior with white trim around the windows and doors. A small porch is visible at the front entrance, which is flanked by two lanterns on either side. The roof appears to be made of metal, and the surrounding landscape includes trees and bushes. A wooden walkway leads up to the front door, adding to the overall aesthetic appeal of the property.

[Image row for prompt 5: Qwen-Image, Flux, Bagel, SANA, MoS-Image]

Prompt-conditioned one-row generation comparison across five prompts, with MoS-Image highlighted on the right.

2. Instruction-based Editing

Figure 3. Instruction-based Editing: Teaser and Prompt-conditioned Comparison.

Teaser 2 for instruction-based image editing.

Prompt-conditioned editing

Compared columns: Original Image, Qwen-Image, Flux-Kontext, Bagel, and MoS-Image.
Prompt 1

remove 'Lover' text from the image, change the diffusion's color to light blue, add a word 'MoS' with purple under hand, and change the color of hand to gray

[Image row for editing prompt 1: Original Image, Qwen-Image, Flux-Kontext, Bagel, MoS-Image]

Prompt 2

Make it a smiling face, add glasses, change to blue fur and green clothes.

[Image row for editing prompt 2: Original Image, Qwen-Image, Flux-Kontext, Bagel, MoS-Image]

Prompt 3

The image depicts a person's hand holding a ring between their thumb and index finger. The hand has fair skin and is well-manicured. The ring is gold with pearls and diamonds, and is positioned in the center of the image. The background is a blurred beige or brown color, which helps to emphasize the ring and the hand. The overall tone of the image is elegant and sophisticated, with a focus on showcasing the ring in a luxurious and refined manner. The camera is positioned at a slightly elevated angle, capturing a close-up shot with a shallow depth of field, making the background blur and the ring sharp. The lighting is soft and gentle, with a subtle sheen on the hand and the ring. The image is of exceptionally high quality, with a clear and sharp focus on the ring and the hand.

[Image row for editing prompt 3: Original Image, Qwen-Image, Flux-Kontext, Bagel, MoS-Image]

Prompt 4

Split the image into two sections, in the right section change the season to spring, in the left section change the season to winter, keep the original position of the tree

[Image row for editing prompt 4: Original Image, Qwen-Image, Flux-Kontext, Bagel, MoS-Image]

Prompt-conditioned one-row editing comparison across four cases, with MoS-Image highlighted on the right.

3. Router Behavior Visualization

To make the router interpretable, we visualize its behavior on the caption “A dog holding a sign that says ‘MoS in 2025’”. The first row shows the denoising trajectory, illustrating how the output evolves from pure noise to a coherent image as sampling proceeds.

We then summarize the router from two complementary views. First, we average routing weights across generation blocks and tokens to measure layer-wise importance at different denoising steps. Second, we fix a denoising step and inspect token-conditioned routing maps, which reveal how individual words trigger different connections between the understanding and generation towers.

Router Visualization

Figure 4. Router Visualization.

Top: generation outputs at different denoising steps. Middle: layer-wise importance aggregated across generation blocks and tokens. Bottom: token-conditioned routing patterns showing that different words induce different preferences over understanding-tower layers.
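
The layer-wise importance summary described above (averaging routing weights across generation blocks and tokens) can be sketched as follows. The nested-list layout `weights[block][token][layer]` and the function name `layer_importance` are assumptions made for illustration; the page does not specify how the routing weights are stored or normalized.

```python
def layer_importance(weights):
    """Average routing weight per understanding-tower layer.

    weights[b][t][l] is the (assumed) routing weight that generation block b
    assigns, for token t, to understanding-tower layer l at one denoising step.
    """
    num_layers = len(weights[0][0])
    totals = [0.0] * num_layers
    count = 0
    for block in weights:
        for token in block:
            for layer, w in enumerate(token):
                totals[layer] += w
            count += 1
    # Mean over all (block, token) pairs gives one score per layer.
    return [t / count for t in totals]


# Two generation blocks, two tokens, three candidate layers (toy numbers).
toy = [
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]],
    [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5]],
]
importance = layer_importance(toy)
```

Repeating this computation at each denoising step would yield one importance vector per step, which is the kind of quantity the middle panel of the figure plots.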

4. Router Efficiency Analysis

To quantify the practical cost of MoS routing, we decompose end-to-end inference into three stages (input encoding, iterative generation, and output decoding) and time the router separately. As shown in Figure 5, the understanding tower takes about 0.094 s, the generation tower about 0.121 s, and decoding about 0.016 s, while the MoS router contributes only around 0.007 s. Token-wise routing therefore adds only a small latency overhead relative to the dominant generation path.

These results support our design goal: enabling dynamic, token-specific conditioning without sacrificing deployment efficiency under realistic sampling settings.
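
To put the reported timings in proportion, the short calculation below computes the router's share of the summed runtime. It assumes the four reported figures are additive and measured at the same granularity, which the page does not state explicitly; the dictionary keys are illustrative labels.

```python
# Reported approximate timings in seconds (from the runtime breakdown).
timings_s = {
    "understanding_tower": 0.094,
    "generation_tower": 0.121,
    "decoding": 0.016,
    "router": 0.007,
}

# Assumption: the four components simply add up to the end-to-end time.
total_s = sum(timings_s.values())
router_share = timings_s["router"] / total_s
router_vs_generation = timings_s["router"] / timings_s["generation_tower"]

print(f"total: {total_s:.3f} s")
print(f"router share of total: {router_share:.1%}")
print(f"router relative to generation tower: {router_vs_generation:.1%}")
```

Under that additive assumption, the router accounts for roughly 3% of the summed time and about 6% of the generation tower's time.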

Router Efficiency

Figure 5. Router Efficiency. Runtime breakdown across input encoding, iterative generation, and output decoding. The router overhead is small compared with the main generation computation.

5. Conclusion

MoS replaces fixed multimodal fusion with a lightweight, token-wise router that dynamically selects hidden states across towers and denoising steps. This simple change makes the model substantially more adaptive without introducing heavy computational overhead.

Across image generation, instruction-based editing, router visualization, and efficiency analysis, the same picture emerges: effective multimodal interaction should be dynamic, phase-aware, and token-specific. MoS turns that principle into a practical design that improves quality and controllability while remaining efficient to deploy.

Citation

@article{liu2025mixture,
  title={Mixture of States: Routing Token-Level Dynamics for Multimodal Generation},
  author={Liu, Haozhe and Liu, Ding and Zhuge, Mingchen and Zhou, Zijian and Xie, Tian and He, Sen and Yang, Yukang and Liu, Shuming and Cong, Yuren and Guo, Jiadong and others},
  journal={CVPR},
  year={2026}
}