Lumiere - a new text-to-video diffusion model
Lumiere is a new text-to-video diffusion model from Google AI. It uses a Space-Time U-Net architecture and, unlike other models, generates a video's full temporal duration in a single pass, which improves temporal coherence.
Architecture
Lumiere builds on pre-trained text-to-image models and generates full-frame-rate video by processing it at multiple space-time scales. The most common approach used by other methods is to generate distant keyframes and fill in the gaps with a cascade of temporal super-resolution models, followed by spatial enhancements applied in isolation. In contrast, Lumiere introduces a more integrated and efficient pipeline: by processing all frames simultaneously, without a cascade of temporal super-resolution models, it can learn globally coherent motion. Here’s an overview of the architecture:
Lumiere architecture vs common approach
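To make the idea of "processing at multiple space-time scales" more concrete, here is a minimal sketch (not the authors' code) of a downsampling block in the spirit of a Space-Time U-Net: the video is compressed in both space and time so the network can reason over the entire clip at a compact resolution before upsampling back. All module names, kernel sizes, and hyperparameters below are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch of a space-time downsampling block (assumed design,
# not the official Lumiere implementation).
import torch
import torch.nn as nn

class SpaceTimeDownBlock(nn.Module):
    """Downsamples a video tensor of shape (B, C, T, H, W) in space and time."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Factorized convolutions: a spatial 1x3x3 conv followed by a temporal
        # 3x1x1 conv -- a common way to "inflate" a pre-trained 2D
        # text-to-image backbone so it can also mix information over time.
        self.spatial = nn.Conv3d(in_channels, out_channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(out_channels, out_channels,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))
        # Strided pooling halves the temporal AND spatial resolution.
        self.down = nn.AvgPool3d(kernel_size=2, stride=2)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.spatial(x))
        x = self.act(self.temporal(x))
        return self.down(x)

# Example: an 80-frame, 128x128 clip is compressed to 40 frames at 64x64,
# letting later layers attend over the whole duration at once.
video = torch.randn(1, 64, 80, 128, 128)          # (batch, channels, T, H, W)
print(SpaceTimeDownBlock(64, 128)(video).shape)    # torch.Size([1, 128, 40, 64, 64])
```

The key point of this toy block is that time is treated like another spatial axis inside the U-Net, which is what removes the need for a separate cascade of temporal super-resolution models.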
Moreover, it can handle multiple tasks such as image-to-video conversion, video inpainting and stylized video generation.
Results
Qualitatively, the results look very good, as you can see below:
Lumiere also performs well in quantitative evaluations, surpassing most competing models:
Zero-shot text-to-video generation on the UCF101 dataset
In addition, the authors conducted a human preference study in which participants were shown two videos, one generated by a baseline model and one by Lumiere, and asked to choose the one they deemed better in terms of visual quality and motion. Here are the results:
Human study
Conclusion
Lumiere is a promising new text-to-video diffusion model that generates a video’s full temporal duration in one pass, ensuring better coherence. It achieves strong results and marks a new advance in text-to-video generation. For more details, please consult the paper: https://arxiv.org/abs/2401.12945 or the project page: https://lumiere-video.github.io.
Congrats to the authors for their great work!