
In this paper, the authors introduce Web Rephrase Augmented Pre-training (WRAP), which aims to make language model training more efficient by rephrasing web documents into styles such as Wikipedia-like prose or question-answer format. This addresses the challenge of learning from noisy, unstructured web data, which typically requires large amounts of compute and data.

Method Overview

WRAP uses an instruction-tuned model to rephrase web documents into various styles, creating synthetic data. Here’s an overview of the method:

WRAP overview

This allows efficient learning from a blend of real and synthetic data, reducing the amount of raw web data needed. The process involves prompting an off-the-shelf instruction-tuned LLM to generate paraphrases of web documents, then combining these with the real data for model training.

Building on the observation that high-quality data, like Wikipedia, improves language modeling, WRAP employs a strategy to rephrase web documents into four distinct styles:

  • Easy - understandable even by a toddler

  • Medium - similar to Wikipedia articles

  • Hard - in terse and abstruse language

  • Q/A - in question-answering format

The prompts for each style are shown below:

Prompt templates for the 4 styles
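
The snippet below is a minimal sketch of what this style-conditioned rephrasing step could look like with a Hugging Face instruction-tuned model. The prompt wordings in `STYLE_PROMPTS` are rough paraphrases of the templates in the figure above (not the paper's exact prompts), and the model loading and generation code is a standard `transformers` pattern rather than the authors' actual pipeline.

```python
# Sketch: rephrase a web document into one of the four WRAP styles using an
# instruction-tuned model. Prompt texts are illustrative, not the paper's exact templates.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.1"  # instruction-tuned rephraser
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

# Approximate style instructions (see the prompt-template figure for the originals).
STYLE_PROMPTS = {
    "easy": "Rephrase the following text so that even a toddler could understand it:",
    "medium": "Rephrase the following text in the style of a Wikipedia article:",
    "hard": "Rephrase the following text in terse and abstruse language:",
    "qa": "Convert the following text into a question-and-answer format:",
}

def rephrase(document: str, style: str, max_new_tokens: int = 512) -> str:
    """Generate a synthetic paraphrase of `document` in the requested style."""
    messages = [{"role": "user", "content": f"{STYLE_PROMPTS[style]}\n\n{document}"}]
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode only the newly generated continuation, dropping the prompt tokens.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
```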

WRAP uses an instruction-tuned model, specifically Mistral-7B, to generate this synthetic data, and then combines it with real web data in a 1:1 ratio. The blend retains the diversity and realistic messiness of internet text while adding the structure and quality of the rephrased versions, so the model learns from a dataset that balances both.
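
As a concrete illustration of the 1:1 blend, the sketch below samples training documents from the real and synthetic pools with equal probability. The per-example 50/50 draw is an assumption for illustration; the paper simply specifies that real and rephrased data are combined in equal parts.

```python
# Minimal sketch of a 1:1 real/synthetic training stream (sampling scheme assumed).
import random
from typing import Iterator, Sequence

def blended_stream(real_docs: Sequence[str],
                   synthetic_docs: Sequence[str],
                   seed: int = 0) -> Iterator[str]:
    """Yield documents drawn from the real and synthetic pools with equal probability."""
    rng = random.Random(seed)
    while True:
        pool = real_docs if rng.random() < 0.5 else synthetic_docs
        yield rng.choice(pool)

# Usage: the pre-training loop would tokenize and pack documents from this stream.
# stream = blended_stream(c4_documents, wrap_rephrasings)
# batch = [next(stream) for _ in range(batch_size)]
```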

Results

Applying WRAP to the C4 dataset sped up pre-training by roughly 3x and improved perplexity by more than 10% on average across different subsets of the Pile.