1 minute read

The paper introduces AtomoVideo, a novel framework for high-fidelity image-to-video (I2V) generation. AtomoVideo can generate high-quality videos from an input image, achieving superior motion intensity and consistency compared to existing works. It can also combine with advanced text-to-image models for text-to-video generation.

Method Overview

AtomoVideo uses a pre-trained text-to-image model and adds temporal convolution and attention layers. It injects the input image information in two ways: 1) Concatenating the image latent and mask with noise latent as input, and 2) Injecting high-level image semantics through cross-attention. During training, zero terminal SNR and v-prediction strategies are employed to improve stability. The model can also be extended to video frame prediction by iteratively generating subsequent frames. An overview can be seen below:

Multi-Granularity Image Injection

AtomoVideo injects the input image information at two levels:

  1. Low-level Image Injection: The image is encoded through a VAE encoder to obtain a low-level representation (Fi), which is concatenated with the noise latent (Xt) and a binary frame mask (Fm) as the input to the UNet. This helps recover fine-grained image details for better fidelity.

  2. High-level Semantic Injection: The input image is also encoded using a CLIP image encoder to obtain a high-level semantic representation, which is injected through cross-attention layers. This improves the semantic consistency of the generated video with the given image.

Video Frame Prediction

For long video generation, AtomoVideo can perform video frame prediction by iteratively generating subsequent frames given the preceding frames. The input image conditions and masks can be flexibly replaced with any sequence of frames from a given video, without requiring any other changes to the model architecture.


Quantitative evaluations on various metrics like image consistency, temporal consistency, video-text alignment, motion intensity, and video quality show that AtomoVideo outperforms other methods, achieving good results.

Qualitative samples demonstrate that AtomoVideo generates more coherent and stable videos with greater motion intensity compared to other methods. Additionally, AtomoVideo can be combined with personalized text-to-image models for customized video generation.


AtomoVideo is a new image-to-video generation framework that achieves excellent performance by maintaining superior temporal consistency and stability while generating videos with greater motion intensity. For more details please consult the full paper or the project page.

Congrats to the authors for their work!

Gong, Litong, et al. “AtomoVideo: High Fidelity Image-to-Video Generation.” arXiv preprint arXiv:2403.01800, 2024.