The paper explores the capability of Large Language Models (LLMs) to autonomously hack websites. It demonstrates that some models (especially GPT-4) can perform complex cybersecurity attacks, such as blind database schema extraction and SQL injections, without human intervention or prior knowledge of the vulnerabilities.
VideoPrism is a versatile video encoder that enhances video understanding across various tasks with a single model. It is trained on a large dataset of over 36 million high-quality video-caption pairs and 582 million video clips with noisy associated text (extracted via automatic speech recognition, ASR). During pretraining, it uses global-local distillation of semantic video embeddings and a token shuffling scheme, allowing the model to focus on the video modality while still leveraging the text. VideoPrism achieves state-of-the-art performance on a wide range of video understanding benchmarks (30 out of 33 tested).
In this paper, the authors introduce Web Rephrase Augmented Pre-training (WRAP), aimed at enhancing language model training efficiency by rephrasing web documents into styles like Wikipedia or question-answer formats. This approach addresses the challenges of learning from noisy, unstructured web data, which typically requires significant compute and data resources.
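The rephrasing step can be sketched in a few lines. Note that `call_llm`, the style instructions, and the function names below are illustrative stand-ins, not WRAP's actual prompts or code:

```python
# Sketch of WRAP's rephrasing idea (illustrative; the style instructions
# paraphrase the concept, not the paper's exact prompts). `call_llm` is a
# hypothetical stand-in for a call to an instruction-tuned rephrasing model.

STYLES = {
    "wikipedia": "Rewrite the following text in a clear, encyclopedic, Wikipedia-like style:",
    "qa": "Convert the following text into a question-and-answer format:",
}

def call_llm(prompt):
    # Toy stub; a real pipeline would query an instruction-tuned LLM here.
    return f"[rephrased] {prompt.splitlines()[-1]}"

def rephrase(document, style):
    prompt = f"{STYLES[style]}\n{document}"
    return call_llm(prompt)

def augment_corpus(documents, style="wikipedia"):
    # WRAP trains on a mix of the raw web text and its rephrased version.
    return [(doc, rephrase(doc, style)) for doc in documents]
```

The key design point is that the rephrased text is used alongside, not instead of, the original noisy web data.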
The paper introduces REALIGN, a novel approach aimed at enhancing the quality of instruction data for large language models (LLMs) to better align with human values. This method reformats instruction data responses into a format that aligns with pre-established criteria and evidence, reducing errors and scaling issues associated with manual annotation and LLM hallucinations. The method is orthogonal to existing techniques and shows significant improvements in LLM performance without requiring new data or advanced training techniques.
The evolution of large language models (LLMs) has reached a point where the balance between data consumption and model quality is one of the key challenges. With the increasing computational costs and diminishing returns from simply scaling up, the focus has shifted towards more data-efficient pre-training techniques. Today’s paper aims to address these challenges by proposing a method of training data-efficient LLMs.
Spectral DeTuning is a new method that recovers the original weights of generative models from before they were fine-tuned with human feedback or otherwise customized. It shows that pre-fine-tuning weights are recoverable and that models fine-tuned with LoRA can be susceptible to a new type of weight-recovery attack.
Google has just released Gemini 1.5 Pro with a 1M-token context size. Curious about how they might have achieved this? We may have some hints from this new paper from UC Berkeley, where the authors train a multimodal language model, called LWM (Large World Model), with, you guessed it, a 1M-token context size. Learning with such a huge context raises several challenges that need to be addressed: memory constraints, computational complexity, and limited datasets. The authors also open-source several 7B models designed to process lengthy text and video content, enhancing capabilities in text and video processing tasks. Now, let’s see how they achieve this!
In this paper from Google DeepMind, the authors explore a new way for large language models (LLMs) to reason without the usual need for crafted prompts. More precisely, they study how LLMs can generate reasoning paths on their own, just by tweaking how they decode. Until now, getting LLMs to reason involved guiding them with carefully designed prompts. In contrast, this method, called CoT-decoding, shows that by changing how models consider different possible next words (tokens) while decoding a question, they can start reasoning through problems without any prompts. This finding suggests that LLMs might have an in-built ability to reason more deeply than previously thought, as they display this capability when allowed to consider alternatives during decoding.
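A minimal sketch of the CoT-decoding idea: instead of committing to the single most likely first token, branch on the top-k first tokens, decode each branch greedily, and keep the branch whose continuation the model is most confident about. Here `next_token_dist` is a toy, hand-written stand-in for a real language model's next-token distribution, and the confidence measure (mean gap between the top-1 and top-2 probabilities) only approximates the paper's answer-token confidence:

```python
# Illustrative sketch of CoT-decoding (not the authors' code). A toy
# "model" is hard-coded so the example is self-contained and runnable.

def next_token_dist(prefix):
    # Toy next-token distributions keyed on the decoded prefix.
    table = {
        (): {"8": 0.4, "I": 0.35, "The": 0.25},
        ("8",): {"<eos>": 0.55, "?": 0.45},
        ("I",): {"have": 1.0},
        ("I", "have"): {"3+5=8": 0.9, "8": 0.1},
        ("The",): {"answer": 1.0},
        ("The", "answer"): {"is": 1.0},
        ("The", "answer", "is"): {"8": 0.6, "5": 0.4},
    }
    return table.get(tuple(prefix), {"<eos>": 1.0})

def greedy_continue(prefix, max_steps=6):
    """Greedy decoding; returns the path and its mean confidence margin."""
    path, margins = list(prefix), []
    for _ in range(max_steps):
        ranked = sorted(next_token_dist(path).items(), key=lambda kv: -kv[1])
        tok, prob = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
        margins.append(prob - runner_up)  # confidence gap at this step
        if tok == "<eos>":
            break
        path.append(tok)
    return path, sum(margins) / len(margins)

def cot_decode(k=3):
    # Branch on the top-k first tokens, then keep the most confident path.
    first = sorted(next_token_dist([]).items(), key=lambda kv: -kv[1])[:k]
    branches = [greedy_continue([tok]) for tok, _ in first]
    return max(branches, key=lambda b: b[1])[0]
```

With k=1 this reduces to plain greedy decoding, which jumps straight to a low-confidence answer; with k=3 the branch containing the intermediate reasoning step wins because the model is more confident along it.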
In this new paper from Google DeepMind, the authors study the significance of premise ordering in different reasoning tasks when using large language models (LLMs). Despite the logical axiom that the sequence of premises should not affect the validity of a conclusion, this study uncovers a sensitivity of LLMs to the arrangement of these premises. Let’s consider the following example to make things more clear:
Meta Reality Labs has developed a new system called Lumos. This system combines Multimodal Large Language Models (MM-LLMs) with Scene Text Recognition (STR) in order to improve performance on various tasks such as multimodal question-answering, text summarization, etc. It uses STR to extract text from first-person point-of-view images, which is then used to augment the input to a multimodal large language model. This approach faces many challenges, ranging from overall latency to compute resources to performing STR on “in-the-wild” images.
OS-Copilot is a new framework that introduces a method for creating general-purpose computer agents with enhanced capabilities and adaptability. It aims to facilitate the development of agents that can effectively interact with various operating system components, performing a wide range of tasks, from navigating files and executing code to handling multimedia content. The framework’s effectiveness is demonstrated through the performance of its agent, FRIDAY, on the GAIA benchmark, surpassing existing methods and showcasing its ability to learn and accumulate skills in unfamiliar environments.
SELF-DISCOVER is a new method that allows LLMs such as GPT-4 and PaLM 2 to autonomously construct their internal reasoning structures. This approach enhances LLMs’ abilities to address complex reasoning tasks while achieving significant gains in performance.
In-context learning (ICL, or few-shot prompting) is one of the most widely used methods for adapting LLMs to downstream tasks, by learning from just a few input-output examples. In this paper, the authors introduce LEAP (Learning Principles), whose goal is to learn more from these few examples. To achieve this, the model learns task-specific principles that improve performance on new, unseen questions. Below you can see some examples of learned principles:
Tag-LLM is a new method from Microsoft Research designed to enhance the performance of Large Language Models (LLMs) in specialized domains. This approach addresses the challenge of applying general-purpose LLMs to tasks in areas such as the physical and biomedical sciences, where specific domain knowledge is crucial and often not well-represented in the models’ pretraining data.
A new paper titled “More Agents Is All You Need” from Tencent presents an approach to significantly enhance the performance of large language models (LLMs) through a simple yet effective method: use a sampling-and-voting mechanism to combine multiple agents and boost performance. Below you can see a teaser of the main results:
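The sampling-and-voting idea itself can be sketched in a few lines; `ask_llm` below is a hypothetical, deterministic stand-in for a stochastic call to the model, not an API from the paper:

```python
from collections import Counter

# Minimal sketch of sampling-and-voting (not Tencent's implementation):
# query the same model several times and take the majority answer.

def ask_llm(question, sample_id):
    # Toy stub for one sampled model response; here the "model" answers
    # correctly three times out of every five samples.
    canned = ["42", "41", "42", "42", "43"]
    return canned[sample_id % len(canned)]

def majority_vote(question, n_agents=15):
    answers = [ask_llm(question, i) for i in range(n_agents)]
    return Counter(answers).most_common(1)[0][0]
```

Even with an unreliable underlying model, the majority vote over enough samples converges on the most frequent (here, correct) answer, which is the intuition behind "more agents".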
Depth Anything offers a practical approach to depth estimation, specifically monocular depth estimation where depth is estimated from a single image. This has far-reaching applications in fields like self-driving cars and virtual reality. Instead of relying on hard-to-obtain labeled images, Depth Anything leverages a large dataset of 62 million regular images for training. This allows it to predict depths accurately across a wider range of situations.
PALP: Prompt Aligned Personalization is a new paper from Google that allows the creation of images with complex and intricate prompts. Traditional text-to-image AI struggles with complex prompts and personal subjects. The trade-off between personalization and prompt fidelity has been a significant hurdle. PALP aims to address this.
Lumiere is a new text-to-video diffusion model from Google AI. It uses a Space-Time U-Net architecture and unlike other models, it generates a video’s full temporal duration in one go, ensuring better coherence.
In the rapidly evolving field of LLMs, the introduction of the TrustLLM framework offers a new approach to evaluating the ethical dimensions of large language models. This comprehensive analysis not only shows the current state of AI trustworthiness but also sets the stage for a more responsible and ethical future in AI development.
In the ever-evolving landscape of Large Language Models (LLMs), the need for scalable and accurate evaluation metrics is more pressing than ever. In this blog, we will talk about SemScore, a new evaluation metric that leverages Semantic Textual Similarity (STS) to offer a direct comparison between model outputs and gold standard responses. This approach not only promises to streamline the assessment of instruction-tuned LLMs but also aligns closely with human judgment, marking a significant step forward in automated LLM evaluation.
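At its core, SemScore embeds the model output and the gold reference and scores them by cosine similarity. The paper uses a sentence-transformer encoder; in this self-contained sketch a toy bag-of-words embedding stands in for it:

```python
import math
from collections import Counter

# Sketch of the SemScore idea: embed both texts, then compare embeddings
# with cosine similarity. The toy bag-of-words `embed` below is only a
# stand-in for the sentence-transformer encoder used in the paper.

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semscore(model_output, gold_reference):
    return cosine(embed(model_output), embed(gold_reference))
```

Unlike n-gram metrics such as BLEU, a real semantic encoder would also reward paraphrases that share no surface tokens with the reference.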
Learning from human feedback is crucial for developing effective Large Language Models. However, gathering such feedback on a large scale can be challenging. In this paper, the authors suggest that to create superhuman AI agents, we need feedback that goes beyond what humans can provide due to the sheer volume of data involved. To address this, they propose the use of self-reward mechanisms, where the model rewards itself during training through LLM-as-a-Judge prompting.
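One iteration of the self-rewarding loop can be sketched as follows. The judge prompt, the stub functions, and the scoring rule are all hypothetical placeholders, not the paper's actual prompts or training code:

```python
# Sketch of one self-rewarding iteration: the model samples candidate
# responses, scores them itself via an LLM-as-a-Judge prompt, and the
# best/worst pair becomes preference data for the next training round
# (e.g. with DPO). `generate` and `judge` are toy stubs.

JUDGE_PROMPT = (
    "Review the response below and score it from 0 to 5 for helpfulness "
    "and accuracy.\n\nResponse:\n{response}\n\nScore:"
)

def generate(prompt, n_samples):
    # Toy stub for sampling n candidate responses from the model.
    return [f"candidate {i} for: {prompt}" for i in range(n_samples)]

def judge(response):
    # Toy stub for the model scoring its own output via JUDGE_PROMPT;
    # a real system would parse the score out of the model's completion.
    _ = JUDGE_PROMPT.format(response=response)
    return sum(ord(c) for c in response) % 6  # placeholder score in [0, 5]

def build_preference_pair(prompt, n_samples=4):
    candidates = generate(prompt, n_samples)
    scored = sorted(candidates, key=judge, reverse=True)
    return {"prompt": prompt, "chosen": scored[0], "rejected": scored[-1]}
```

The important property is that both the responses and the rewards come from the same model, so no external human labels are needed inside the loop.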
Understanding context is one of the key requirements for a Large Language Model. Although there’s been a noticeable improvement in performance, with recent large language models (LLMs) demonstrating remarkable capabilities, there’s a growing emphasis on evaluating their problem-solving quality rather than exploring their ability to comprehend context. This new study from Georgetown University and Apple introduces a new context understanding benchmark. The benchmark consists of four tasks and nine datasets:
OLMo is a new large language model introduced by the Allen Institute for Artificial Intelligence. It achieves state-of-the-art performance, surpassing LLaMA 2, and is truly open source: the authors have released the weights, the training and inference code, as well as the training data.