3 minute read

Google has just released Gemini 1.5 Pro with a 1M-token context size. Curious about how they might have achieved this? We may have some hints from a new paper from UC Berkeley, in which the authors train a multimodal language model called LWM (Large World Model) with, you guessed it, a 1M-token context size. Learning with such a huge context poses several challenges: memory constraints, computational complexity, and limited long-sequence datasets. The authors also open-source several 7B models designed to process lengthy text and video content. Now, let's see how they achieve this!

Method overview

The goal of the paper is to train a large autoregressive transformer model with a very large context window. Below you can see an overview of the training process:

LWM training overview

Stage 1 - LLM context extension

To effectively learn long-range dependencies over sequences of millions of tokens, two key challenges must be addressed: scalable training on lengthy documents, and reliably extending the context of the base language model.

  1. Scalable Training on Long Documents: Training on lengthy documents becomes prohibitively expensive due to memory constraints stemming from the quadratic complexity of computing attention weights. To tackle this, the authors use RingAttention, a technique that employs blockwise computation with sequence parallelism. In theory, this extends to an infinite context, constrained only by the number of available devices.

  2. Progressive Training on Increasing Context Length: While the approach above makes training on documents with millions of tokens feasible, it remains costly due to the quadratic computational complexity of attention, with gradient step time scaling roughly linearly with context size; this limits the number of steps achievable within a given timeframe. To mitigate this, the model is trained on progressively longer sequences, starting from 32K tokens and ending at 1M tokens, increasing in powers of two. This strategy lets the model first learn shorter-range dependencies before gradually incorporating longer sequences, and it allows training on significantly more tokens than directly training at the maximum target sequence length.
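The memory saving behind RingAttention can be illustrated with plain blockwise attention: keys and values are processed one block at a time with a running (online) softmax, so the full n x n score matrix is never materialized. The NumPy sketch below shows single-head, non-causal attention only; the actual RingAttention additionally shards the blocks across devices arranged in a ring, which is not modeled here.

```python
import numpy as np

def full_attention(q, k, v):
    # Reference implementation: standard softmax attention, O(n^2) memory.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def blockwise_attention(q, k, v, block=64):
    # Process key/value blocks one at a time, maintaining a numerically
    # stable running softmax, so only (n, block) scores exist at once.
    n, d = q.shape
    out = np.zeros_like(q)
    m = np.full((n, 1), -np.inf)   # running max of scores per query
    l = np.zeros((n, 1))           # running softmax denominator
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)  # partial scores for this block
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        p = np.exp(s - m_new)
        correction = np.exp(m - m_new)  # rescale previous accumulators
        l = l * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ vb
        m = m_new
    return out / l
```

Both functions compute the same result; the blockwise version simply never holds more than one block of scores in memory at a time.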
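The progressive curriculum can be sketched as a simple doubling schedule; the stage lengths below follow the 32K-to-1M range described above, but the exact stages and per-stage token budgets used by the authors may differ.

```python
def context_schedule(start=32_768, end=1_048_576):
    """Yield training-stage context lengths, doubling from `start` up to `end`."""
    length = start
    while length < end:
        yield length
        length *= 2
    yield end

# Each stage trains on sequences of the given length before moving on,
# so cheap short-context steps dominate the early part of training.
for stage, length in enumerate(context_schedule(), 1):
    print(f"stage {stage}: context length {length}")
```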

Stage 2 - Vision-Language Training

The LWM (illustrated below) is an autoregressive transformer designed to handle sequences containing millions of tokens. Each frame of a video is first tokenized into 256 tokens using a VQGAN. These tokens are concatenated with text tokens and fed into the transformer, which predicts the next token autoregressively. The model accommodates various training data formats, such as image-text, text-image, video, text-video, and pure text, by organizing input and output tokens accordingly. Special delimiters, <vision> and </vision>, distinguish image and video tokens from text and guide decoding, while additional <eof> and <eov> tokens mark the end of intermediate frames and of the final frame in images and videos (these delimiters are omitted from the figure for simplicity). Ultimately, the model is trained to handle multimodal inputs and outputs in a flexible "any-to-any" manner across different modalities.
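As a rough illustration, a mixed video-text example might be flattened into a single token stream as sketched below. The VQGAN tokenization is stubbed out with placeholder strings, the helper name is invented for this example, and the delimiter handling is a simplification of the paper's actual format.

```python
# Illustrative delimiter tokens (the real model uses vocabulary ids, not strings).
VISION_START, VISION_END = "<vision>", "</vision>"
END_OF_FRAME, END_OF_VIDEO = "<eof>", "<eov>"

def build_sequence(frames, text_tokens):
    """Concatenate tokenized video frames and text into one flat sequence.

    `frames` is a list of per-frame token lists (256 VQGAN tokens each in
    the paper); `text_tokens` is the accompanying text.
    """
    seq = [VISION_START]
    for i, frame_tokens in enumerate(frames):
        seq.extend(frame_tokens)
        # <eof> after intermediate frames, <eov> after the final frame.
        seq.append(END_OF_VIDEO if i == len(frames) - 1 else END_OF_FRAME)
    seq.append(VISION_END)
    seq.extend(text_tokens)
    return seq
```

A pure-text example would simply omit the vision span, and a text-to-video example would place the text tokens first; the model sees all of these as one flat autoregressive sequence.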

LWM overview

The training process is again progressive, this time on combined text-image and text-video data.


LWM achieves very strong performance on retrieval tasks such as needle retrieval (the "needle in a haystack" test): retrieving a specific piece of information, such as a short sentence, from a very large document.
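A toy sketch of how such a needle-retrieval test can be constructed: plant one distinctive sentence inside a long run of filler text, then check whether the model can quote it back. The helper below is illustrative and is not the paper's actual benchmark code.

```python
def make_needle_haystack(needle, filler, n_filler, position):
    """Bury one 'needle' sentence among n_filler copies of a filler sentence.

    Returns the assembled document and the needle's character offset, so an
    evaluation harness can verify the model's answer against the ground truth.
    """
    sentences = [filler] * n_filler
    sentences.insert(position, needle)
    doc = " ".join(sentences)
    return doc, doc.index(needle)
```

In the real evaluation the document is hundreds of thousands of tokens long and the needle position is swept across the full context to measure retrieval at every depth.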

It can also answer questions about videos over an hour long, as depicted below:

QA over 1 hour YouTube video

LWM is also compared to current models on image understanding and short-video understanding. Here its performance is average, and in some cases it underperforms other models.


The development of LWM, capable of handling a 1M-token context, marks a significant advancement in multimodal language models: it greatly increases the usable context size while showing strong performance on very difficult tasks such as question answering over hour-long videos. With models available online (see below), this work has great potential to enable future research in this field. For more details, please consult the full paper.

Other resources

Models: https://huggingface.co/LargeWorldModel

Code: https://github.com/LargeWorldModel/LWM

Project page: https://largeworldmodel.github.io

Congrats to the authors for their work!

Liu, Hao, et al. "World Model on Million-Length Video and Language with RingAttention." (2024).