
Lumiere is a new text-to-video diffusion model from Google AI. It uses a Space-Time U-Net architecture and, unlike most other models, generates a video’s full temporal duration in one go, which improves temporal coherence.


Lumiere builds on a pre-trained text-to-image model and, by processing videos at multiple space-time scales, generates a full-frame-rate video directly. Most other methods generate keyframes, fill in the gaps with a cascade of temporal super-resolution models, and then apply spatial enhancements in isolation. In contrast, Lumiere introduces a more integrated and efficient pipeline: because it processes all frames simultaneously, without sequential temporal super-resolution models, it can learn globally coherent motion. Here’s an overview of the architecture:

Lumiere architecture vs common approach
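To make "multiple space-time scales" a bit more concrete, here is a minimal sketch of the shape arithmetic only. It is not the authors' implementation: the clip size, the number of levels, and the factor of 2 per level are all assumptions picked for illustration, and `pyramid_shapes` and `cells` are hypothetical helper names. It compares a toy encoder that downsamples only in space with one that also downsamples in time.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (frames, height, width)


def pyramid_shapes(shape: Shape, levels: int, downsample_time: bool) -> List[Shape]:
    """Return the activation shape at each encoder level of a toy U-Net.

    Every level halves height and width. When downsample_time is True the
    frame axis is halved as well (the space-time variant). The factor of 2
    is an assumption made for this sketch.
    """
    frames, height, width = shape
    shapes = [(frames, height, width)]
    for _ in range(levels):
        if downsample_time:
            frames = max(1, frames // 2)
        height = max(1, height // 2)
        width = max(1, width // 2)
        shapes.append((frames, height, width))
    return shapes


def cells(shape: Shape) -> int:
    """Number of space-time positions a layer at this shape has to process."""
    frames, height, width = shape
    return frames * height * width


if __name__ == "__main__":
    clip = (80, 128, 128)  # hypothetical clip size, chosen only for illustration
    spatial_only = pyramid_shapes(clip, levels=3, downsample_time=False)
    space_time = pyramid_shapes(clip, levels=3, downsample_time=True)
    for a, b in zip(spatial_only, space_time):
        print(f"spatial-only {a} ({cells(a)} cells) | space-time {b} ({cells(b)} cells)")
```

In this toy setting, the coarsest space-time level holds far fewer positions than the spatial-only one, which hints at why compressing the time axis makes it more practical to process the whole clip at once.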

Moreover, it can handle multiple tasks such as image-to-video conversion, video inpainting and stylized video generation.


Qualitatively, the results look very good, as you can see below:

Lumiere also performs well in the quantitative evaluation, surpassing most of the models it is compared against:

Zero-shot text-to-video generation on the UCF101 dataset

In addition, the authors conduct a human preference study in which users were shown two videos, one generated by a baseline model and one generated by Lumiere, and asked to choose the one they deemed better in terms of visual quality and motion. Here are the results:

Human study
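A two-way preference study like this one boils down to counting votes. The snippet below is a minimal sketch of that aggregation, not the authors' evaluation code: `preference_rates` is a hypothetical helper, and the votes are made up purely for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def preference_rates(votes: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of votes each option received."""
    counts = Counter(votes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no votes to aggregate")
    return {option: n / total for option, n in counts.items()}


if __name__ == "__main__":
    # Made-up votes for illustration only; these are not the study's data.
    votes = ["lumiere", "baseline", "lumiere", "lumiere"]
    print(preference_rates(votes))
```

A rate above 0.5 for one option means that participants picked it more often than the alternative in that pairing.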


Lumiere is a promising new text-to-video diffusion model that generates a video’s full temporal duration in one pass, which improves temporal coherence. It achieves good results and is a notable advancement in text-to-video generation. For more details, please consult the paper: https://arxiv.org/abs/2401.12945 or the project page: https://lumiere-video.github.io.

Congrats to the authors for their great work!