
The paper proposes VisionLLaMA, a vision transformer that closely follows the architecture of the LLaMA language model. The idea is to carry LLaMA's design over to vision tasks such as image generation, classification, semantic segmentation, and object detection, aiming for better performance and faster convergence than existing vision transformers.

Method Overview

VisionLLaMA follows the same overall pipeline as Vision Transformers (ViT) but swaps in components from the LLaMA architecture. The basic VisionLLaMA block pairs a multi-head self-attention (MHSA) layer equipped with 2D rotary positional embeddings (2D RoPE) with a SwiGLU feed-forward layer, each sub-layer preceded by LayerNorm and wrapped in a residual connection.
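
Below is a minimal PyTorch-style sketch of how such a block could be wired, assuming pre-normalization and queries/keys rotated by precomputed rotary tables. The names (`VisionLLaMABlock`, `SwiGLU`, `apply_rotary`) and the feed-forward expansion ratio are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def rotate_half(x):
    # Split the last dimension in half and rotate the pairs: (a, b) -> (-b, a).
    a, b = x.chunk(2, dim=-1)
    return torch.cat((-b, a), dim=-1)


def apply_rotary(x, cos, sin):
    # Standard rotary update applied to queries/keys: x * cos + rotate_half(x) * sin.
    return x * cos + rotate_half(x) * sin


class SwiGLU(nn.Module):
    """SwiGLU feed-forward layer, as in LLaMA: W2(SiLU(W1 x) * W3 x)."""

    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


class VisionLLaMABlock(nn.Module):
    """Pre-norm block: MHSA with rotary q/k, then a SwiGLU FFN, each with a residual."""

    def __init__(self, dim, num_heads, ffn_ratio=8 / 3):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.norm1 = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = SwiGLU(dim, int(dim * ffn_ratio))

    def forward(self, x, cos, sin):
        # x: (B, N, dim); cos/sin: (N, head_dim) rotary tables for the N tokens.
        B, N, _ = x.shape
        h = self.norm1(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        q = apply_rotary(q, cos, sin)  # positional information enters via q and k only
        k = apply_rotary(k, cos, sin)
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(B, N, -1)
        x = x + self.proj(out)
        x = x + self.ffn(self.norm2(x))
        return x
```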

The authors explore two main variants of VisionLLaMA:

  1. Plain Transformer: This directly adapts the LLaMA block design for vision tasks. The input image is split into N non-overlapping patches, which are flattened and projected into token embeddings. A class token is prepended to this sequence, and the whole sequence is then processed by L VisionLLaMA blocks (sketched after this list).

  2. Pyramid Transformer: Inspired by architectures like Twins, this variant incorporates local self-attention (LSA) and global self-attention (GSA) mechanisms.
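
For the plain variant, the pipeline can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the class name `PlainVisionLLaMA` and the default hyperparameters are placeholders, it reuses `VisionLLaMABlock` from the block sketch above, and the rotary tables (`cos`, `sin`) can be built with the AS2DRoPE sketch further below.

```python
import torch
import torch.nn as nn


class PlainVisionLLaMA(nn.Module):
    """Plain variant: patchify -> prepend class token -> L VisionLLaMA blocks."""

    def __init__(self, patch_size=16, dim=768, depth=12, num_heads=12):
        super().__init__()
        # Non-overlapping patch embedding (a strided convolution, as in ViT).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # VisionLLaMABlock is defined in the block sketch above.
        self.blocks = nn.ModuleList(
            [VisionLLaMABlock(dim, num_heads) for _ in range(depth)]
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, images, cos, sin):
        # images: (B, 3, H, W) -> N = (H / patch_size) * (W / patch_size) tokens.
        x = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)                           # (B, N + 1, dim)
        # cos/sin need one row per token, class token included; how the class
        # token is positioned is a detail glossed over in this sketch.
        for blk in self.blocks:
            x = blk(x, cos, sin)
        return self.norm(x)[:, 0]  # class-token representation
```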

To handle variable input resolutions, a requirement for many vision tasks, the authors introduce Auto-Scaled 2D RoPE (AS2DRoPE). This extends the 1D RoPE used in language models to two dimensions and rescales positions relative to the training resolution, so that arbitrary resolutions can be handled at inference.
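
A minimal sketch of that idea: the row and column coordinates each drive half of the rotary channels, and coordinates are rescaled by the ratio of the training (anchor) grid size to the current grid size so they always fall inside the trained range. The function name `build_as2d_rope`, the frequency base, and the default anchor size of 14 (a 224-pixel image with 16-pixel patches) are assumptions for illustration, not the authors' exact code.

```python
import torch


def build_as2d_rope(grid_h, grid_w, head_dim, anchor_size=14, base=10000.0):
    """Rotary (cos, sin) tables for a grid_h x grid_w patch grid.

    Half of the rotary channels encode the row coordinate, the other half the
    column coordinate. Coordinates are rescaled by anchor_size / grid size, so
    a larger inference grid is interpolated back into the range of positions
    seen during training (the "auto-scaled" part).
    """
    assert head_dim % 4 == 0
    # Per-axis frequencies; each axis gets head_dim // 4 frequency pairs.
    freqs = 1.0 / (base ** (torch.arange(0, head_dim // 2, 2).float() / (head_dim // 2)))

    ys = torch.arange(grid_h).float() * (anchor_size / grid_h)  # auto-scaled rows
    xs = torch.arange(grid_w).float() * (anchor_size / grid_w)  # auto-scaled cols
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")              # (grid_h, grid_w)

    ang_y = yy.reshape(-1, 1) * freqs                           # (N, head_dim // 4)
    ang_x = xx.reshape(-1, 1) * freqs
    angles = torch.cat([ang_y, ang_x], dim=-1)                  # (N, head_dim // 2)
    angles = torch.cat([angles, angles], dim=-1)                # (N, head_dim)
    return angles.cos(), angles.sin()
```

These tables plug directly into the `cos`/`sin` arguments of the block and plain-model sketches above.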

Results

The authors extensively evaluate VisionLLaMA across various vision tasks and benchmarks.

For image generation, VisionLLaMA outperforms previous state-of-the-art methods like DiT and SiT in terms of Precision, Recall and other metrics.

In classification tasks on ImageNet, VisionLLaMA achieves comparable or better performance than existing vision transformers, such as DeiT3 and Swin, in both supervised and self-supervised settings.

For semantic segmentation on ADE20K and object detection on COCO, VisionLLaMA also demonstrates superior performance compared to strong baselines like Twins and Swin.

Conclusion

VisionLLaMA is a unified and generic modeling framework that shows promising gains over previous state-of-the-art vision transformers across various tasks, including image generation, classification, semantic segmentation, and object detection. For more details, please consult the full paper.

Repo (code will be available soon according to the paper): https://github.com/Meituan-AutoML/VisionLLaMA

Congrats to the authors for their work!
