
The paper proposes Panda-70M, a large-scale video dataset consisting of 70 million high-quality video clips paired with textual captions. The key motivation is to create a dataset that can be used for pre-training and can facilitate advanced video understanding. High-quality video-text pairs are valuable for pretraining but difficult and expensive to collect manually at scale. Hence, the authors develop an automatic pipeline that generates captions by leveraging multiple teacher models and multiple input modalities. The new dataset, Panda-70M, is then used for pre-training, and the resulting models are tested on three downstream tasks: video captioning, text-video retrieval, and text-driven video generation.

Method Overview

The process starts with 3.8M videos taken from the publicly available HD-VILA-100M dataset. These are split into semantically consistent video clips using a procedure that balances semantic coherence within each clip against keeping clips sufficiently long. The authors then use eight cross-modality teacher models to generate candidate captions for each clip, so each clip ends up with multiple candidates. A fine-tuned retrieval model then selects the best caption per clip; it is fine-tuned on a subset of clips for which the best caption was annotated manually.
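To make these two selection steps concrete, here is a minimal sketch in Python. It assumes precomputed clip and caption embeddings (e.g., from a video-text encoder); the greedy merging strategy and the 0.8 similarity threshold are illustrative assumptions, not the paper's exact splitting and retrieval procedure:

```python
# Hedged sketch of two pipeline pieces, not the authors' released code:
# (1) merging adjacent shots into semantically consistent clips, and
# (2) keeping the best of several teacher captions via a retrieval-style score.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def merge_shots(shot_embs: list, sim_threshold: float = 0.8) -> list:
    """Greedily merge adjacent shots whose embeddings are similar, so each
    resulting clip stays semantically coherent; a break starts a new clip."""
    clips, current = [], [0]
    for i in range(1, len(shot_embs)):
        if cosine(shot_embs[i - 1], shot_embs[i]) >= sim_threshold:
            current.append(i)        # same scene: extend the current clip
        else:
            clips.append(current)    # semantic break: close the clip
            current = [i]
    clips.append(current)
    return clips

def select_best_caption(clip_emb: np.ndarray, caption_embs: dict) -> str:
    """Score each teacher's candidate caption against the clip embedding and
    keep the top-scoring one (the role of the fine-tuned retrieval model)."""
    return max(caption_embs, key=lambda cap: cosine(clip_emb, caption_embs[cap]))

# Toy usage: random vectors stand in for real video/text embeddings.
rng = np.random.default_rng(0)
shots = [rng.normal(size=512) for _ in range(5)]
print(merge_shots(shots))
```

In the paper, the retrieval model is fine-tuned so that its similarity scores agree with the human annotations of the best caption; the sketch above only shows the inference-time selection.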

Finally, because running eight different captioning models is computationally intensive, and to make further scaling of the dataset feasible, the authors train a student captioning model on Panda-70M (which was itself obtained using the eight teachers) to distill the collective knowledge of those teachers into a single model. The architecture of this student is illustrated in the paper.
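As a rough illustration of the distillation setup, the toy loop below trains a stand-in student with plain cross-entropy on the (video, selected caption) pairs produced by the teacher pipeline. The architecture, feature dimensions, and hyperparameters here are all placeholder assumptions and do not reflect the paper's actual student model:

```python
# Hedged sketch: the student never queries the eight teachers at training time;
# it simply learns from the captions they already produced for Panda-70M.
import torch
import torch.nn as nn

VOCAB, DIM, MAX_LEN = 10_000, 512, 16  # illustrative sizes, not the paper's

class ToyStudentCaptioner(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(768, DIM)                    # pooled clip features -> model dim
        self.decoder = nn.GRU(DIM, DIM, batch_first=True)  # stand-in for a real caption decoder
        self.head = nn.Linear(DIM, VOCAB)                  # per-step token logits

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        h = self.proj(video_feats).unsqueeze(1)            # (B, 1, DIM)
        x = h.expand(-1, MAX_LEN, -1).contiguous()         # repeat as decoder input
        out, _ = self.decoder(x)                           # (B, MAX_LEN, DIM)
        return self.head(out)                              # (B, MAX_LEN, VOCAB)

model = ToyStudentCaptioner()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One toy step: random features stand in for encoded clips; random token ids
# stand in for the tokenized captions selected by the teacher pipeline.
feats = torch.randn(4, 768)
pseudo_captions = torch.randint(0, VOCAB, (4, MAX_LEN))
logits = model(feats)
loss = loss_fn(logits.reshape(-1, VOCAB), pseudo_captions.reshape(-1))
loss.backward()
opt.step()
opt.zero_grad()
```

Once trained, such a student can caption new clips in a single forward pass, which is what makes scaling the dataset beyond 70M clips affordable.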

Results

The constructed Panda-70M dataset contains 70.8 million 720p video clips with an average duration of 8.5 seconds, each paired with a caption averaging 13.2 words. Models pretrained on this dataset show significant gains over baselines on downstream tasks like:

  • video captioning

  • text-video retrieval

  • text-to-video generation

The qualitative results shown in the paper also look very promising.

Conclusion

In this paper, the authors introduce Panda-70M, a large-scale, high-quality video dataset. Pretraining on this dataset is shown to benefit several video and language tasks. More details in the full paper.

Congrats to the authors for their work!

Project page: https://snap-research.github.io/Panda-70M/

Code: https://github.com/snap-research/Panda-70M

Chen, Tsai-Shien, et al. “Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers.” (2024).
