StreamMultiDiffusion: Real-Time Interactive Generation with
Region-Based Semantic Control

Jaerin Lee, Daniel Sungho Jung, Kanggeon Lee, and Kyoung Mu Lee

Computer Vision Lab, Seoul National University

arXiv 2024

StreamMultiDiffusion is a real-time, interactive framework that generates images from multiple user-assigned regional text prompts.

Concept of StreamMultiDiffusion


The enormous success of diffusion models in text-to-image synthesis has made them promising candidates for the next generation of end-user applications for image generation and editing. Previous works have focused on improving the usability of diffusion models either by reducing inference time or by increasing user interactivity through new, fine-grained controls such as region-based text prompts. However, we empirically find that integrating the two lines of work is nontrivial, which limits the potential of diffusion models. To resolve this incompatibility, we present StreamMultiDiffusion, the first real-time region-based text-to-image generation framework. By stabilizing fast inference techniques and restructuring the model into a newly proposed multi-prompt stream batch architecture, we achieve 10× faster panorama generation than existing solutions and a generation speed of 1.57 FPS in region-based text-to-image synthesis on a single RTX 2080 Ti GPU. Our solution opens up a new paradigm for interactive image generation named semantic palette, where high-quality images are generated in real time from multiple hand-drawn regions, each encoding a prescribed semantic meaning (e.g., eagle, girl).

Stable Acceleration of Region-Based Image Generation

Bootstrapping mechanism of StreamMultiDiffusion

We establish the compatibility between region-based control [1] and acceleration [2] techniques for diffusion models.
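To make the region-based side of this concrete, the following is a minimal sketch of the mask-weighted averaging used in MultiDiffusion-style fusion [1]: each prompt produces its own latent, and the latents are blended per pixel according to the prompt masks. The function name `fuse_regions`, the nested-list latent format, and the `eps` guard are assumptions made for illustration; this is not the StreamMultiDiffusion implementation, and it leaves out the stabilization steps needed to combine the fusion with few-step samplers [2].

```python
def fuse_regions(latents, masks, eps=1e-8):
    """Blend per-prompt latents with a per-pixel, mask-weighted average.

    latents and masks are lists of equally sized 2D lists (one per prompt).
    Hypothetical helper for illustration only.
    """
    height, width = len(masks[0]), len(masks[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total_weight = sum(mask[y][x] for mask in masks)
            weighted_sum = sum(
                mask[y][x] * latent[y][x] for mask, latent in zip(masks, latents)
            )
            # eps avoids division by zero where no mask covers the pixel.
            fused[y][x] = weighted_sum / max(total_weight, eps)
    return fused


# Two prompts on a 2 x 2 grid: the masks overlap softly in the right column.
latent_a = [[1.0, 1.0], [1.0, 1.0]]
latent_b = [[3.0, 3.0], [3.0, 3.0]]
mask_a = [[1.0, 0.5], [1.0, 0.5]]
mask_b = [[0.0, 0.5], [0.0, 0.5]]
fused = fuse_regions([latent_a, latent_b], [mask_a, mask_b])
print(fused)
```

Where only one mask is active the fused value equals that prompt's latent, and where masks overlap the result is their weighted mean.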

Semantic Palette

Our stable acceleration technique enables practical large-size image generation with fine-grained regional prompt control. In this first demo, we demonstrate arbitrary-sized image generation from an arbitrary number of prompt-mask pairs. Try it now from our code or at the official Hugging Face Space demo. Note that our generation results also obey strict prompt separation.

Multi-Prompt Stream Batch Architecture

Multi-Prompt Stream Batch architecture

We extend the Stream Batch architecture of StreamDiffusion [3] to allow streamed generation from multiple region-based text prompts.
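The scheduling idea behind a stream batch can be illustrated with a toy simulation: frames at different denoising steps share one batched model call, so that after a short warm-up one frame finishes per call. The sketch below contains no diffusion model at all. The function `simulate_stream_batch`, its parameters, and the assumption that each in-flight frame contributes one batch row per regional prompt are simplifications introduced here, not the authors' implementation.

```python
from collections import deque


def simulate_stream_batch(num_steps, num_frames, num_prompts):
    """Simulate a staggered denoising pipeline without any real model.

    Returns (finished_frame_ids, model_calls, max_batch_rows).
    """
    in_flight = deque()  # entries are [frame_id, steps_done], oldest first
    finished = []
    model_calls = 0
    max_batch_rows = 0
    next_frame = 0

    while len(finished) < num_frames:
        # Admit one fresh (pure-noise) frame per iteration while any remain.
        if next_frame < num_frames:
            in_flight.append([next_frame, 0])
            next_frame += 1

        # One batched call advances every in-flight frame by one step.
        # Assumption: each frame contributes one row per regional prompt.
        model_calls += 1
        max_batch_rows = max(max_batch_rows, len(in_flight) * num_prompts)
        for entry in in_flight:
            entry[1] += 1

        # Frames that completed all steps leave the batch in arrival order.
        while in_flight and in_flight[0][1] == num_steps:
            finished.append(in_flight.popleft()[0])

    return finished, model_calls, max_batch_rows


frames, calls, rows = simulate_stream_batch(num_steps=4, num_frames=6, num_prompts=3)
print(frames, calls, rows)
```

In this toy setting, six frames with four denoising steps each need 9 batched calls instead of the 24 calls a strictly sequential loop would make, at the cost of a larger batch per call.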

Real-Time Semantic Palette

Our multi-prompt stream batch architecture allows fast, interactive generation with region-based controls. Try now from our code or wait until the demo is published online.

More Examples

Accelerated Text-to-Panorama Generation

Our acceleration technique speeds up the generation of 512 × 3072 images by up to 13 times compared with the previous solution [1]. Inference time is measured on a single RTX 2080 Ti GPU.
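For context on why panoramas are costly, MultiDiffusion-style methods [1] denoise a wide latent through overlapping windows and average the overlaps. The sketch below only enumerates such windows. It assumes a latent downsampling factor of 8 (the usual Stable Diffusion convention), a 64 × 64 latent window, and a stride of 16; these values and the helper names `window_starts` and `tile_views` are illustrative assumptions, not settings taken from the paper.

```python
def window_starts(size, window, stride):
    """Start offsets of sliding windows that fully cover one axis."""
    if size <= window:
        return [0]
    starts = list(range(0, size - window + 1, stride))
    # Add a final window flush with the border if the stride left a gap.
    if starts[-1] != size - window:
        starts.append(size - window)
    return starts


def tile_views(height, width, window=64, stride=16):
    """Return (top, bottom, left, right) windows over a latent grid."""
    return [
        (top, top + window, left, left + window)
        for top in window_starts(height, window, stride)
        for left in window_starts(width, window, stride)
    ]


# A 512 x 3072 image maps to a 64 x 384 latent under the assumed factor of 8.
latent_height, latent_width = 512 // 8, 3072 // 8
views = tile_views(latent_height, latent_width)
print(len(views), views[0], views[-1])
```

Each window costs one model evaluation per denoising step, so reducing the number of steps multiplies through every window of the panorama.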

Accelerated Region-Based Text-to-Image Generation

Text-to-Image generation result

StreamMultiDiffusion can synthesize high-resolution images in seconds while strictly obeying the regional text prompts. The generated image is 768 × 1920 pixels and uses nine prompts, including the background prompt. The time is measured on a single RTX 2080 Ti GPU.


@article{lee2024streammultidiffusion,
    title="{StreamMultiDiffusion:} Real-Time Interactive Generation with Region-Based Semantic Control",
    author={Lee, Jaerin and Jung, Daniel Sungho and Lee, Kanggeon and Lee, Kyoung Mu},
    journal={arXiv preprint arXiv:2403.09055},
    year={2024}
}


[1] Bar-Tal, O., Yariv, L., Lipman, Y., Dekel, T.: MultiDiffusion: Fusing diffusion paths for controlled image generation. In ICML 2023.

[2] Luo, S., Tan, Y., Huang, L., Li, J., Zhao, H.: Latent Consistency Models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023.

[3] Kodaira, A., Xu, C., Hazama, T., Yoshimoto, T., Ohno, K., Mitsuhori, S., Sugano, S., Cho, H., Liu, Z., Keutzer, K.: StreamDiffusion: A pipeline-level solution for real-time interactive generation. arXiv preprint arXiv:2312.12491, 2023.