3 minute read

Genie, a new method from Google DeepMind, enables the creation of interactive, action-controllable virtual worlds from unlabelled Internet videos. This unsupervised learning approach generates diverse environments that users can explore and interact with, paving the way for advanced agent training and novel applications in gaming and simulation.

Method Overview

Genie combines a spatiotemporal video tokenizer, an autoregressive dynamics model, and a latent action model to generate controllable video environments. It is trained on video data alone, without requiring action labels: an unsupervised objective infers latent actions between frames, which enables frame-by-frame control over the generated video sequences. All components use a memory-efficient ST-transformer, which mitigates the quadratic memory cost that makes the standard Vision Transformer challenging for video. The model consists of three components, the video tokenizer, the latent action model, and the dynamics model, as shown below:
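The memory saving of the ST-transformer comes from factorizing attention: instead of attending over all T·S tokens at once (cost ~(T·S)²), each block attends spatially within a frame and then temporally across frames at each spatial position (cost ~T·S² + S·T²). Here is a minimal single-head numpy sketch of that factorization; it omits projections, feed-forward layers, and normalization, which the real model would have:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the second-to-last axis.
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def st_block(x):
    """One toy ST-transformer block. x has shape (T, S, d):
    T frames, S spatial tokens per frame, d channels."""
    # Spatial attention: tokens attend only within their own frame.
    x = x + attention(x, x, x)
    # Temporal attention: each spatial position attends across frames.
    xt = x.swapaxes(0, 1)          # (S, T, d)
    xt = xt + attention(xt, xt, xt)
    return xt.swapaxes(0, 1)       # back to (T, S, d)

T, S, d = 4, 16, 8
x = np.random.randn(T, S, d)
y = st_block(x)
print(y.shape)  # (4, 16, 8)
```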

The Latent Action Model infers the latent action between each pair of consecutive frames, while the Video Tokenizer converts raw video frames into discrete tokens. Given a latent action and the tokens of past frames, the Dynamics Model then predicts the next frame's tokens. The model is trained in two phases following a standard autoregressive pipeline: the Video Tokenizer is trained first, then the Latent Action Model and the Dynamics Model are co-trained.

The Latent Action Model enables controllable video generation, where each future frame is predicted conditioned on the action taken in the previous frame. Since action labels for videos, especially those sourced from the Internet, are expensive to annotate, the model learns latent (hidden) actions in an unsupervised way. In an Encoding Stage, the encoder takes as input all previous frames as well as the next frame, and outputs a set of continuous latent actions. In a Decoding Stage, the decoder receives the previous frames and the latent actions as input and predicts the next frame. A VQ-VAE-based objective restricts the actions to a small discrete set of codes: limiting the action vocabulary to 8 keeps the number of possible latent actions small, so each action should encode the most meaningful changes between past and future. This module serves only as a training signal; at inference it is discarded and replaced by actions supplied by the user.
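The VQ bottleneck amounts to snapping each continuous action embedding to its nearest entry in a tiny codebook. A minimal sketch of that lookup, with a hypothetical random codebook standing in for the learnt one:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical codebook: 8 latent actions, each a 16-dim embedding.
codebook = rng.normal(size=(8, 16))

def quantize(z):
    """Map a continuous action embedding z to its nearest codebook entry,
    as in a VQ-VAE bottleneck. Returns (action index, quantized vector)."""
    dists = np.linalg.norm(codebook - z, axis=1)  # distance to each code
    idx = int(dists.argmin())
    return idx, codebook[idx]

# A continuous latent action, standing in for the encoder's output:
z = rng.normal(size=16)
idx, zq = quantize(z)
print(idx)  # an integer in [0, 8)
```

Because only 8 codes exist, the decoder is forced to reconstruct the next frame from a very coarse action signal, which is what pushes the codes toward meaningful controls.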

A similar encoder-decoder architecture is used to train the Video Tokenizer, which learns to encode video data into a latent space that can be manipulated efficiently. The process is highlighted below, where z are the learnt tokens:

The Dynamics Model is responsible for predicting the temporal evolution of the environment given actions. It is a decoder-only MaskGIT transformer that receives the past frame tokens together with the latent action and predicts the tokens of the next frame.
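MaskGIT decodes all of a frame's tokens in parallel over a few refinement steps rather than one token at a time: start fully masked, predict everything, commit the most confident predictions, and repeat. The sketch below uses a simple linear unmasking schedule (an assumption; the actual schedule may differ) and a dummy predictor in place of the dynamics transformer:

```python
import numpy as np

rng = np.random.default_rng(0)

def maskgit_decode(logits_fn, num_tokens, vocab, steps=4):
    """MaskGIT-style parallel decoding sketch. logits_fn maps the current
    (partially filled) token array to logits of shape (num_tokens, vocab);
    -1 marks a masked position. Each step commits the most confident
    predictions among the still-masked positions."""
    tokens = np.full(num_tokens, -1)
    for step in range(steps):
        logits = logits_fn(tokens)
        probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
        conf = probs.max(axis=-1)
        pred = probs.argmax(axis=-1)
        masked = tokens == -1
        # Unmask a growing fraction of the most confident masked positions.
        n_keep = int(np.ceil(masked.sum() * (step + 1) / steps))
        order = np.argsort(-np.where(masked, conf, -np.inf))
        tokens[order[:n_keep]] = pred[order[:n_keep]]
    return tokens

# Dummy predictor standing in for the dynamics transformer.
fake_logits = lambda toks: rng.normal(size=(toks.size, 32))
out = maskgit_decode(fake_logits, num_tokens=16, vocab=32)
print((out >= 0).all())  # True: every token is filled after the final step
```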

For inference, the user provides a starting frame x1 and chooses an integer that represents the latent action to take. Since the model was trained with 8 actions, this is a number between 0 and 7. The frame and the action are fed to the Dynamics Model to generate the next frame's tokens, and the process continues iteratively, with the user supplying a new action at each step. This process is highlighted below:
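The interactive loop above can be sketched as follows, with lambdas standing in for the real tokenizer, dynamics model, and decoder (all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the trained networks, purely for illustration.
tokenize = lambda frame: rng.integers(0, 1024, size=16)    # frame -> tokens
dynamics = lambda tokens, action: rng.integers(0, 1024, size=16)
detokenize = lambda tokens: tokens.reshape(4, 4)           # tokens -> "frame"

def play(start_frame, user_actions):
    """Genie-style inference loop: tokenize the starting frame, then
    repeatedly feed (past tokens, user action) to the dynamics model
    and decode the predicted tokens into the next frame."""
    history = [tokenize(start_frame)]
    frames = []
    for a in user_actions:
        assert 0 <= a < 8, "with 8 latent actions, valid inputs are 0..7"
        next_tokens = dynamics(np.concatenate(history), a)
        history.append(next_tokens)
        frames.append(detokenize(next_tokens))
    return frames

frames = play(start_frame=np.zeros((64, 64)), user_actions=[3, 0, 7])
print(len(frames))  # 3
```

Note that the Latent Action Model never appears here: its only role is to supply action labels during training, and at inference the user's integer takes its place.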


The training dataset is built by filtering publicly available Internet videos with criteria targeting 2D platformer games, keywords like “speedrun” or “playthrough,” while excluding non-relevant content like “movie” or “unboxing.” Videos are split into 16-second clips at 10 FPS, giving 160 frames per clip; the resulting dataset amounts to approximately 244k hours from 55 million videos. Since this step still leaves many low-quality videos, a selection process is applied using a learnt classifier: the team labels 10k videos on a quality scale of 1 (poor) to 5 (best) and uses these labels to train a binary classifier that filters the data. This reduces the dataset to around 30k hours from 6.8 million videos and improves overall performance.
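The curation step boils down to scoring each clip with the learnt quality classifier and keeping those above a threshold. A toy sketch, where `quality_score` and the 0.5 threshold are stand-ins for the trained binary classifier and its decision rule:

```python
# Sketch of the curation step: score clips with a learnt quality
# classifier and keep only those predicted high-quality.
def filter_clips(clips, quality_score, threshold=0.5):
    return [c for c in clips if quality_score(c) >= threshold]

# Toy example: each clip carries a precomputed score standing in
# for the classifier's output probability.
clips = [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.2}, {"id": 3, "score": 0.7}]
kept = filter_clips(clips, quality_score=lambda c: c["score"])
print([c["id"] for c in kept])  # [1, 3]
```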


The model demonstrates very promising results, generating high-quality, diverse, and controllable environments. Here are some highlights:


Genie represents a significant advance in generative AI, offering a method for interactive environment generation. I think it has the potential to open up numerous possibilities for creative content generation, immersive simulation experiences, and agent training. More details in the paper and on the project page.

Congrats to the authors for their work!

Bruce, Jake, et al. “Genie: Generative Interactive Environments.” (2024).