OLMo is a new large language model introduced by the Allen Institute for Artificial Intelligence. It achieves state-of-the-art performance, surpassing LLaMA 2, and is truly open source: the authors have released the weights, the training and inference code, and the training data.
The OLMo Framework
OLMo comes in three sizes: 1B and 7B parameter models that are already released, and a 65B version that is coming soon. Like other LLMs, it uses a decoder-only transformer architecture and is trained on around 2T tokens. However, it includes a series of changes over the vanilla transformer: no bias terms, non-parametric layer norm, the SwiGLU activation function, rotary positional embeddings, and a vocabulary size of 50,280.
OLMo model sizes. The 65B version is currently still training
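Two of these tweaks are easy to show concretely. Below is a minimal NumPy sketch (not OLMo's actual implementation, and with toy dimensions rather than the model's real sizes) of a SwiGLU feed-forward gate and a layer norm with no learned gain or bias:

```python
import numpy as np

def silu(x):
    # SiLU / Swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu(x, w_gate, w_up):
    # SwiGLU: a gated feed-forward unit, silu(x @ W_gate) * (x @ W_up)
    return silu(x @ w_gate) * (x @ w_up)

def nonparametric_layer_norm(x, eps=1e-5):
    # Layer norm without learned scale/shift parameters
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Toy dimensions for illustration only
d_model, d_ff = 8, 16
rng = np.random.default_rng(0)
x = rng.normal(size=(2, d_model))
h = swiglu(nonparametric_layer_norm(x),
           rng.normal(size=(d_model, d_ff)),
           rng.normal(size=(d_model, d_ff)))
print(h.shape)  # (2, 16)
```

Dropping the learned layer-norm parameters and bias terms slightly reduces parameter count and, per the paper's motivation, improves training stability.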
As part of the release, the authors also include the dataset used for pre-training: Dolma. Dolma is a diverse, multi-source corpus of 3T tokens across 5B documents drawn from 7 different data sources, as detailed below:
Dolma dataset details
OLMo is evaluated on a series of downstream tasks and achieves state-of-the-art results when compared to other models, as detailed below:
OLMo-7B zero-shot evaluation
OLMo is also evaluated on intrinsic language modeling using Paloma, a newly introduced perplexity benchmark. Paloma covers 585 different text domains, ranging from nytimes.com to r/depression on Reddit.
Paloma evaluation results
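Perplexity, the metric Paloma reports, is just the exponential of the average per-token negative log-likelihood. A tiny sketch with made-up per-token losses (real values would come from running the model over benchmark text):

```python
import math

# Toy per-token negative log-likelihoods in nats (illustrative values only)
nlls = [2.1, 3.0, 1.7, 2.4]

# Perplexity = exp(mean NLL); lower is better
ppl = math.exp(sum(nlls) / len(nlls))
print(round(ppl, 2))  # 9.97
```

Intuitively, a perplexity of ~10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 tokens at each step.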
OLMo introduces a new 7B model, with a larger 65B version planned for release soon. Unlike other models, OLMo's developers have shared the entire training and evaluation code, as well as the pre-training data. This approach not only supports further research but also enables a wide range of new applications! Kudos to the authors for their great work! For more details about OLMo, you can read the full paper at https://allenai.org/olmo/olmo-paper.pdf.