Lumos: Empowering Multimodal LLMs with Scene Text Recognition


Meta Reality Labs has developed a new system called Lumos. It combines Multimodal Large Language Models (MM-LLMs) with Scene Text Recognition (STR) to improve performance on tasks such as multimodal question answering and text summarization. STR extracts text from first-person point-of-view images, and that text is used to augment the input to a multimodal large language model. The approach faces several challenges, ranging from overall latency and compute resources to performing STR on “in-the-wild” images.

The goal of Lumos is to run on edge devices. One may argue that these tasks can be solved by an MM-LLM alone. However, since the text can be small relative to the whole image, text recognition may be sub-optimal when using a single multimodal LLM. This is where STR comes into play. Now, let’s talk about the overall architecture.

Overview

To reduce latency, Lumos combines on-device and cloud-side computing. Additionally, Lumos supports voice queries and voice responses.

Lumos overview

The overall pipeline is quite straightforward: the input is a voice query and an image. These are processed on device, and the results are sent to the cloud-based MM-LLM, where a prompt is constructed from the STR output and the query and then fed to the model. The final result is converted to speech using text-to-speech (TTS) and sent back to the user. Lumos is designed to tackle challenges related to STR quality, overall latency, and model inference. The core components of STR are:

Region of Interest (ROI) Detection: Identifies and focuses on the text-rich areas within images, optimizing the process for text extraction.
Text Detection and Recognition: Accurately detects and interprets text within the designated ROI, ensuring high-quality text recognition.
Reading Order Reconstruction: Arranges the recognized text in a logical sequence, mimicking the natural reading order, which is crucial for understanding the context.
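To make the last component more concrete, here is a minimal sketch of reading order reconstruction. It is not the method used in Lumos; it is a simple heuristic chosen for illustration, assuming each recognized word comes with a bounding box. The `Word` class, the `reading_order` function, and the `line_tol` parameter are all hypothetical names. Words are grouped into lines by the vertical position of their box centers, and each line is then read left to right.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the bounding box (pixels)
    y: float  # top edge of the bounding box (pixels)
    h: float  # height of the bounding box (pixels)


def reading_order(words, line_tol=0.5):
    """Group words into lines by vertical center, then sort each line by x.

    line_tol is an illustrative threshold: a word joins the current line if
    its center is within line_tol * height of that line's first word.
    """
    lines = []
    for w in sorted(words, key=lambda w: w.y + w.h / 2):
        center = w.y + w.h / 2
        if lines and abs(center - lines[-1]["center"]) <= line_tol * w.h:
            lines[-1]["words"].append(w)
        else:
            lines.append({"center": center, "words": [w]})
    return [
        " ".join(w.text for w in sorted(line["words"], key=lambda w: w.x))
        for line in lines
    ]


# Detections arrive in arbitrary order; the heuristic restores two lines.
words = [
    Word("SALE", 120, 12, 20),
    Word("BIG", 40, 10, 20),
    Word("today", 40, 50, 20),
    Word("only", 110, 52, 20),
]
print(reading_order(words))  # ['BIG SALE', 'today only']
```

A heuristic like this breaks down on multi-column layouts or rotated text, which is one reason reading order is treated as a separate component rather than a simple sort.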

The flow between these components can be seen below:

Lumos STR flow
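Once STR has produced ordered text, the overview above says it is combined with the user's query into a prompt for the MM-LLM. The post does not give the actual prompt format, so the template below is purely an assumption for illustration, and `build_prompt` is a hypothetical helper rather than part of Lumos.

```python
def build_prompt(query: str, str_lines: list[str]) -> str:
    """Combine recognized scene text and the user's question into one prompt.

    The wording of the template is illustrative only.
    """
    scene_text = "\n".join(str_lines) if str_lines else "(no text detected)"
    return (
        "Text recognized in the image:\n"
        f"{scene_text}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


prompt = build_prompt(
    "What time does the store close?",
    ["OPEN DAILY", "9 AM - 6 PM"],
)
print(prompt)
```

Here the query stands in for the transcribed voice input, and the list of lines stands in for the output of the STR components.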

Results

The performance of Lumos is evaluated on several benchmarks. On question answering, it achieves 80% accuracy, and the results show that adding STR significantly improves performance.

Question-Answering (QA) results

Furthermore, Lumos stands out for its low word error rate, showcasing its ability to accurately interpret text from images:

Word error rate (WER) results
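For readers unfamiliar with the metric, word error rate is commonly defined as the number of word-level substitutions, deletions, and insertions needed to turn the recognized text into the reference, divided by the number of reference words. The sketch below computes that standard definition with a word-level edit distance. The exact evaluation protocol used for Lumos (for example, text normalization) is not described in this post, so treat this as a generic illustration; `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One misrecognized word out of four reference words.
print(word_error_rate("open daily 9 am", "open daly 9 am"))  # 0.25
```

Lower is better: a WER of 0 means the recognized text matches the reference word for word.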

Conclusion

Lumos is one of the first multimodal systems that incorporates STR and is able to run on edge devices. For more details, please consult the full paper.

Congratulations to the team at Meta Reality Labs for their work!

Shenoy, Ashish et al. “Lumos : Empowering Multimodal LLMs with Scene Text Recognition.” (2024).


Vlad Bogolin
