VideoMamba: State Space Model for Efficient Video Understanding

The paper proposes VideoMamba, a state space model (SSM)-based approach for efficient video understanding. It aims to address the challenges of local redundancy and global dependencies in video data, leveraging the linear complexity of the Mamba operator for long-term modeling.

Multi-LoRA Composition for Image Generation

The paper introduces two new methods, LoRA Switch and LoRA Composite, for composing multiple Low-Rank Adaptations (LoRAs) in text-to-image generation. The main idea is to take a decoding-centric perspective: the LoRA weights are kept intact, and composition is achieved by selectively activating LoRAs during the denoising process of diffusion models. The paper also proposes ComposLoRA, a testbed of 480 composition sets used to measure performance.
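
To make "selectively activating LoRAs during the denoising process" more concrete, here is a minimal sketch of an activation schedule. It assumes a round-robin rotation in which exactly one LoRA is active per denoising step; the function name, the `switch_every` parameter, and the LoRA labels are hypothetical and not taken from the paper.

```python
from typing import List


def lora_switch_schedule(loras: List[str], num_steps: int, switch_every: int) -> List[str]:
    """Return the name of the single active LoRA for each denoising step.

    Assumption: LoRAs are rotated round-robin, changing every `switch_every` steps.
    The LoRA weights themselves are never merged; only the active one changes.
    """
    if not loras:
        raise ValueError("at least one LoRA is required")
    if switch_every < 1:
        raise ValueError("switch_every must be >= 1")
    return [loras[(step // switch_every) % len(loras)] for step in range(num_steps)]


if __name__ == "__main__":
    # Hypothetical LoRA labels, for illustration only.
    for step, name in enumerate(lora_switch_schedule(["character", "style"], 6, 2)):
        print(step, name)
```

In a real pipeline, the schedule would decide which adapter to enable before each denoising step; that integration is left out here because it depends on the diffusion library in use.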

Design2Code: How Far Are We From Automating Front-End Engineering?

The paper proposes the Design2Code task and benchmark, which measures the ability of multimodal LLMs to automate front-end engineering by generating code implementations that directly render into the given reference webpages, given the screenshots as input.

AtomoVideo: High Fidelity Image-to-Video Generation

The paper introduces AtomoVideo, a novel framework for high-fidelity image-to-video (I2V) generation. AtomoVideo can generate high-quality videos from an input image, achieving superior motion intensity and consistency compared to existing works. It can also combine with advanced text-to-image models for text-to-video generation.

VisionLLaMA: A Unified LLaMA Interface for Vision Tasks

The paper proposes VisionLLaMA, a vision transformer architecture that is similar to the LLaMA language model. The main idea is to leverage the successful LLaMA architecture for vision tasks, such as image generation, classification, semantic segmentation, and object detection. By adapting the LLaMA architecture to the vision domain, the authors aim to achieve better performance and faster convergence compared to existing vision transformers.

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

The paper proposes Panda-70M, a large-scale video dataset consisting of 70 million high-quality video clips paired with textual caption annotations. The key motivation is to create a dataset that can be used for pre-training and can facilitate advanced video understanding. High-quality video-text pairs are valuable for pre-training but difficult and expensive to collect manually at scale. Hence, the authors develop an automatic pipeline that generates captions by leveraging multiple teacher models and multiple input modalities. The new dataset, Panda-70M, is then used for pre-training, and the resulting models are tested on three downstream tasks: video captioning, text-video retrieval, and text-driven video generation.


This paper provides a systematic study on how different scaling factors affect the performance of large language model (LLM) finetuning. Specifically, it explores the impact of factors like LLM model size, pretraining data size, finetuning data size, and tunable parameter sizes for parameter efficient tuning (PET) methods such as prompt tuning and LoRA.

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

The paper introduces BitNet b1.58, a 1.58-bit large language model (LLM) variant that matches the performance of full-precision models while being much more cost-effective for inference.
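
The "1.58" can be read as log2(3) ≈ 1.58, the number of bits needed to encode a weight restricted to the three values -1, 0, and 1. The sketch below checks that arithmetic and shows one way to round weights to ternary values using a mean-absolute-value scale. The function name and the example weights are assumptions for illustration, not the authors' implementation.

```python
import math
from typing import List, Tuple


def ternary_quantize(weights: List[float], eps: float = 1e-8) -> Tuple[List[int], float]:
    """Round weights to {-1, 0, 1} after scaling by their mean absolute value.

    Returns the ternary weights and the scale, so that w is roughly q * scale.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale


# Information content of a three-valued weight, in bits.
bits_per_weight = math.log2(3)

if __name__ == "__main__":
    q, scale = ternary_quantize([0.9, -0.05, -1.2, 0.4])
    print(q, scale, bits_per_weight)
```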

Instruction-tuned Language Models are Better Knowledge Learners

Today’s paper studies how to improve the ability of large language models (LLMs) to acquire and access knowledge from new documents through continued training. Specifically, it proposes a pre-instruction-tuning (PIT) approach.

EMO: Emote Portrait Alive

Today’s paper introduces EMO, a new framework for generating realistic talking head videos from just a single portrait photo and an audio clip. The key innovation lies in EMO’s ability to directly translate audio cues like tone and pronunciation into corresponding facial expressions and head movements. By capturing the dynamic link between vocals and motions, EMO can animate portrait photos with a diverse range of lively and natural motions synchronized to the audio.

What Evidence Do Language Models Find Convincing?

Today’s paper “What Evidence Do Language Models Find Convincing?” explores what types of evidence and argumentation techniques language models find convincing when presented with ambiguous, open-domain questions that have conflicting answers online. For example, for the question “does aspartame cause cancer?” there are arguments and evidence supporting both yes and no answers. The goal is to study why models might be convinced by certain real-world evidence paragraphs over others.

Genie: Generative Interactive Environments

Genie, a new method from Google DeepMind, enables the creation of interactive, action-controllable virtual worlds from unlabelled Internet videos. This unsupervised learning approach generates diverse environments that users can explore and interact with, paving the way for advanced agent training and novel applications in gaming and simulation.

A Closer Look at the Limitations of Instruction Tuning

In this paper, the authors investigate the efficacy of Instruction Tuning (IT) in Large Language Models (LLMs) for conversational agents. Instruction tuning is the process of training LLMs on instruction-response pairs, and it is a widely used method for transforming base pre-trained LLMs into conversational agents. The goal of this work is to uncover the limitations of instruction tuning through a series of experiments, focusing on how these limitations affect the performance and capabilities of LLMs.

LLM Agents can Autonomously Hack Websites

The paper explores the capability of Large Language Models (LLMs) to autonomously hack websites. It demonstrates that some models (especially GPT-4) can perform complex cybersecurity attacks, such as blind database schema extraction and SQL injections, without human intervention or prior knowledge of the vulnerabilities.

VideoPrism: A Foundational Visual Encoder for Video Understanding

VideoPrism is a versatile video encoder that enhances video understanding across various tasks with a single model. It is trained on a large dataset of over 36 million high-quality video-caption pairs and 582 million video clips with noisy associated texts (extracted with Automatic Speech Recognition, ASR). During pretraining, it uses global-local distillation of semantic video embeddings and a token shuffling scheme, allowing the model to focus on the video modality while leveraging the text. VideoPrism achieves state-of-the-art performance on a wide range of video understanding benchmarks (30 out of 33 tested benchmarks).

Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

In this paper, the authors introduce Web Rephrase Augmented Pre-training (WRAP), aimed at enhancing language model training efficiency by rephrasing web documents into styles like Wikipedia or question-answer formats. This approach addresses the challenges of learning from noisy, unstructured web data, which typically requires significant compute and data resources.

Reformatted Alignment

The paper introduces REALIGN, a novel approach aimed at enhancing the quality of instruction data for large language models (LLMs) to better align with human values. This method reformats instruction data responses into a format that aligns with pre-established criteria and evidence, reducing errors and scaling issues associated with manual annotation and LLM hallucinations. The method is orthogonal to existing techniques and shows significant improvements in LLM performance without requiring new data or advanced training techniques.

How to Train Data-Efficient LLMs

The evolution of large language models (LLMs) has reached a point where the balance between data consumption and model quality is one of the key challenges. With the increasing computational costs and diminishing returns from simply scaling up, the focus has shifted towards more data-efficient pre-training techniques. Today’s paper aims to address these challenges by proposing a method of training data-efficient LLMs.

Recovering the Pre-Fine-Tuning Weights of Generative Models

Spectral DeTuning is a new method that successfully recovers the original weights of generative models before they were fine-tuned with human feedback or other customization. It shows that pre-fine-tuning weights are recoverable and that models fine-tuned with LoRA can be susceptible to a new type of weight recovery attack.

World Model on Million-Length Video and Language with RingAttention

Google has just released Gemini 1.5 Pro with a 1M-token context size. Curious about how they might have achieved this? We may have some hints from this new paper from UC Berkeley, where the authors train a multimodal language model called LWM (Large World Model) with, you guessed it, a 1M-token context size. Learning with such a huge context raises several challenges: memory constraints, computational complexity, and limited datasets. The authors also open-source several 7B models designed to process lengthy text and video content. Now, let’s see how they achieve this!

Chain-of-Thought Reasoning Without Prompting

In this paper from Google DeepMind, the authors explore a new way for large language models (LLMs) to reason without the usual need for crafted prompts. More precisely, they study how LLMs can generate reasoning paths on their own, just by changing how they decode. Until now, getting LLMs to reason involved guiding them with carefully designed prompts. In contrast, this method introduces CoT-decoding and shows that by changing how models consider different possible next words (tokens) while they decode a question, the models can start reasoning through problems without any prompts. This finding suggests that LLMs might have an in-built ability to reason more deeply than previously thought, as they can display this capability when allowed to consider alternatives during their decoding process.

Premise Order Matters in Reasoning with Large Language Models

In this new paper from Google DeepMind, the authors study the significance of premise ordering in different reasoning tasks when using large language models (LLMs). Despite the logical axiom that the sequence of premises should not affect the validity of a conclusion, this study uncovers a sensitivity of LLMs to the arrangement of these premises. Let’s consider the following example to make things more clear:

Lumos: Empowering Multimodal LLMs with Scene Text Recognition

Meta Reality Labs has developed a new system called Lumos. This system combines Multimodal Large Language Models (MM-LLMs) with Scene Text Recognition (STR) in order to improve the performance on various tasks such as multimodal question-answering, text summarization, etc. It uses STR to extract text from first-person point-of-view images which is used to augment the input to a multimodal large language model. This approach has a lot of challenges ranging from overall latency to compute resources and performing STR on “in-the-wild” images.

OS-Copilot: Towards Generalist Computer Agents with Self-Improvement

OS-Copilot is a new framework that introduces a method for creating general-purpose computer agents with enhanced capabilities and adaptability. It aims to facilitate the development of agents that can effectively interact with various operating system components, performing a wide range of tasks, from navigating files and executing code to handling multimedia content. The framework’s effectiveness is demonstrated through the performance of its agent, FRIDAY, on the GAIA benchmark, surpassing existing methods and showcasing its ability to learn and accumulate skills in unfamiliar environments.

SELF-DISCOVER: Large Language Models Self-Compose Reasoning Structures

SELF-DISCOVER is a new method that allows LLMs such as GPT-4 and PaLM 2 to autonomously construct their internal reasoning structures. This approach enhances LLMs’ abilities to address complex reasoning tasks while achieving significant gains in performance.

In-Context Principle Learning from Mistakes

In-context learning (ICL, or few-shot prompting) is one of the most widely used methods for adapting LLMs to downstream tasks by learning from just a few input-output examples. In this paper, the authors introduce LEAP (Learning Principles). The goal is to learn more from these few examples. To achieve this, the model learns task-specific principles that lead to better performance on new, unseen questions. Below you can see some examples of learned principles:

Adapting LLMs to Specialized Domains

Tag-LLM is a new method from Microsoft Research designed to enhance the performance of Large Language Models (LLMs) in specialized domains. This approach addresses the challenge of applying general-purpose LLMs to tasks in areas such as the physical and biomedical sciences, where specific domain knowledge is crucial and often not well-represented in the models’ pretraining data.

Harnessing the Power of Multiple Agents in Large Language Models

A new paper titled “More Agents Is All You Need” from Tencent presents an approach to significantly enhance the performance of large language models (LLMs) through a simple yet effective method: use a sampling-and-voting mechanism to combine multiple agents and boost performance. Below you can see a teaser of the main results:
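
Separately from the results, the sampling-and-voting mechanism itself can be sketched in a few lines. This is a minimal illustration that assumes answers can be compared by exact string match and that each "agent" is just another call to the same model; `ask` is a hypothetical callable standing in for an LLM query, and the scripted answers are made up.

```python
from collections import Counter
from typing import Callable, List


def sample_and_vote(ask: Callable[[str], str], question: str, n_agents: int) -> str:
    """Query the model n_agents times and return the most frequent answer.

    Assumption: answers are short strings that can be compared exactly.
    Ties are resolved in favor of the answer that was seen first.
    """
    if n_agents < 1:
        raise ValueError("n_agents must be >= 1")
    answers: List[str] = [ask(question) for _ in range(n_agents)]
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    # Scripted stand-in for a stochastic LLM: five samples, three of which agree.
    scripted = iter(["4", "5", "4", "3", "4"])
    print(sample_and_vote(lambda q: next(scripted), "What is 2 + 2?", 5))
```

Exact-match voting only works for tasks with short, canonical answers; for open-ended outputs, some similarity measure between samples would be needed instead.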

Depth Anything

Depth Anything offers a practical approach to depth estimation, specifically monocular depth estimation where depth is estimated from a single image. This has far-reaching applications in fields like self-driving cars and virtual reality. Instead of relying on hard-to-obtain labeled images, Depth Anything leverages a large dataset of 62 million regular images for training. This allows it to predict depths accurately across a wider range of situations.

PALP: Prompt Aligned Personalization

PALP: Prompt Aligned Personalization is a new paper from Google that allows the creation of images with complex and intricate prompts. Traditional text-to-image AI struggles with complex prompts and personal subjects. The trade-off between personalization and prompt fidelity has been a significant hurdle. PALP aims to address this.

Lumiere - a new text-to-video diffusion model

Lumiere is a new text-to-video diffusion model from Google AI. It uses a Space-Time U-Net architecture and, unlike other models, generates a video’s full temporal duration in one go, ensuring better coherence.

TrustLLM: Trustworthiness in Large Language Models

In the rapidly evolving field of LLMs, the introduction of the TrustLLM framework offers a new approach to evaluating the ethical dimensions of large language models. This comprehensive analysis not only shows the current state of AI trustworthiness but also sets the stage for a more responsible and ethical future in AI development.

Semantic Evaluation of LLMs

In the ever-evolving landscape of Large Language Models (LLMs), the need for scalable and accurate evaluation metrics is more pressing than ever. In this blog, we will talk about SemScore, a new evaluation metric that leverages Semantic Textual Similarity (STS) to offer a direct comparison between model outputs and gold standard responses. This approach not only promises to streamline the assessment of instruction-tuned LLMs but also aligns closely with human judgment, marking a significant step forward in automated LLM evaluation.
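
As a rough sketch of what a similarity-based metric of this kind computes, the example below scores a model output against a gold response with cosine similarity between their embeddings. The bag-of-words `toy_embed` function is a hypothetical stand-in for a real sentence-embedding model and is used only so the example is self-contained; it does not capture meaning the way a learned embedding would.

```python
import math
from collections import Counter
from typing import Callable, Dict


def toy_embed(text: str) -> Dict[str, float]:
    """Hypothetical stand-in for a sentence encoder: lowercase word counts."""
    return {word: float(count) for word, count in Counter(text.lower().split()).items()}


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_score(
    output: str,
    reference: str,
    embed: Callable[[str], Dict[str, float]] = toy_embed,
) -> float:
    """Score a model output against a gold response by embedding similarity."""
    return cosine_similarity(embed(output), embed(reference))


if __name__ == "__main__":
    print(semantic_score("Paris is the capital of France", "The capital of France is Paris"))
```

Swapping `toy_embed` for a learned sentence encoder is what would make the score semantic rather than lexical.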

Self-rewarding language models

Learning from human feedback is crucial for developing effective Large Language Models. However, gathering such feedback on a large scale can be challenging. In this paper, the authors suggest that to create superhuman AI agents, we need feedback that goes beyond what humans can provide due to the sheer volume of data involved. To address this, they propose the use of self-reward mechanisms, where the model rewards itself during training through LLM-as-a-Judge prompting.

Can Large Language Models Understand Context?

Understanding context is one of the key requirements for a Large Language Model. Although there’s been a noticeable improvement in performance, with recent large language models (LLMs) demonstrating remarkable capabilities, there’s a growing emphasis on evaluating their problem-solving quality rather than exploring their ability to comprehend context. This new study from Georgetown University and Apple introduces a new context understanding benchmark. The benchmark consists of four tasks and nine datasets:

OLMo - an Open Language Model

OLMo is a new large language model introduced by the Allen Institute for Artificial Intelligence. It achieves state-of-the-art performance, surpassing LLaMA 2, and is truly open source. The authors have released the weights, the training and inference code, and the training data.