We present LayerFlow, a unified solution for layer-aware video generation. Given per-layer prompts, LayerFlow generates videos for the transparent foreground, the clean background, and the blended scene. It also supports versatile variants, such as decomposing a blended video into layers or generating the background for a given foreground and vice versa. Starting from a text-to-video diffusion transformer, we organize the videos of different layers as sub-clips and leverage layer embeddings to distinguish each clip and its corresponding layer-wise prompt. In this way, we seamlessly support all the aforementioned variants in one unified framework. Given the lack of high-quality layer-wise training videos, we design a multi-stage training strategy that accommodates static images with high-quality layer annotations. Specifically, we first train the model on low-quality video data. We then tune a motion LoRA to make the model compatible with static frames. Afterward, we train a content LoRA on a mixture of high-quality layered images and copy-pasted video data. During inference, we remove the motion LoRA, thus generating smooth videos with the desired layers.
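The sub-clip organization and layer-embedding scheme described above can be sketched as follows. This is a minimal illustrative sketch with numpy, not the authors' implementation: all shapes, token counts, and variable names are assumptions chosen for clarity.

```python
import numpy as np

# Hypothetical sizes: 3 layers (foreground, background, blend),
# T latent frames per sub-clip, D-dim patch features (all illustrative).
num_layers, T, D = 3, 4, 8
rng = np.random.default_rng(0)

# Videos of different layers are organized as sub-clips and
# concatenated into one long sequence for the transformer.
sub_clips = [rng.standard_normal((T, D)) for _ in range(num_layers)]
visual_seq = np.concatenate(sub_clips, axis=0)      # shape (3*T, D)

# One learnable embedding per layer is added to that layer's
# text embeddings so the model can tell which prompt governs
# which sub-clip (5 text tokens per prompt is an assumption).
layer_emb = rng.standard_normal((num_layers, D))    # (3, D)
text_emb = rng.standard_normal((num_layers, 5, D))  # (3, 5, D)
text_with_layer = text_emb + layer_emb[:, None, :]  # broadcast add

# All visual patches and text embeddings enter the transformer
# blocks as a single long tensor.
full_seq = np.concatenate(
    [visual_seq, text_with_layer.reshape(-1, D)], axis=0
)
print(full_seq.shape)  # (3*4 + 3*5, 8) = (27, 8)
```

The key design choice sketched here is that layer identity is injected purely through additive embeddings on the text side, so the same backbone handles generation, decomposition, and conditional variants without architectural changes.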
Overall pipeline of LayerFlow, which produces multi-layer videos comprising the transparent foreground, the undisturbed background, and the blended sequence. We organize the videos of different layers as sub-clips and concatenate them into a single sequence, which is encoded by the VAE encoder. In parallel, index modification is applied to the prompts before they are processed by the T5 encoder, and layer embeddings are added to the text embeddings to impart layer awareness. All visual patches and text embeddings are fed into the transformer blocks as one long tensor. During training, a base model is first trained on crudely constructed multi-layer video data to obtain an initial layered-generation ability. Motion LoRA tuning then prepares the model to accommodate frozen videos, so that the content LoRA can borrow knowledge from both high-quality duplicated multi-layer images and copy-pasted video data, improving layer-aware synthesis quality while maintaining motion dynamics.
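The multi-stage LoRA strategy amounts to composing low-rank weight deltas during training and dropping one of them at inference. A minimal sketch of that composition, assuming standard LoRA merging (W + B·A); the rank, scale, and factor names are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
D, r = 8, 2  # hypothetical hidden size and LoRA rank
W = rng.standard_normal((D, D))  # frozen base weight

# Hypothetical low-rank factors for the two adapters.
A_motion = rng.standard_normal((r, D))
B_motion = rng.standard_normal((D, r))
A_content = rng.standard_normal((r, D))
B_content = rng.standard_normal((D, r))
scale = 1.0

# Training on static images: the motion LoRA absorbs the
# "frozen frame" bias while the content LoRA learns layer quality.
W_train = W + scale * (B_motion @ A_motion) + scale * (B_content @ A_content)

# Inference: the motion LoRA is removed, keeping only the
# content LoRA, so outputs stay dynamic yet layer-aware.
W_infer = W + scale * (B_content @ A_content)
```

Because LoRA deltas are purely additive, removing the motion adapter at inference is a simple subtraction of its low-rank term and requires no retraining.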
@article{ji2025layerflow,
  title={LayerFlow: A Unified Model for Layer-aware Video Generation},
  author={Ji, Sihui and Luo, Hao and Chen, Xi and Tu, Yuanpeng and Wang, Yiyang and Zhao, Hengshuang},
  journal={arXiv preprint arXiv:2506.04228},
  year={2025},
}