The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
The paper introduces BitNet b1.58, a 1.58-bit large language model (LLM) variant that matches the performance of full-precision models while being significantly more cost-effective at inference in terms of latency, memory, throughput, and energy consumption.
Method Overview
BitNet b1.58 constrains every weight to one of three values, {-1, 0, +1} (hence 1.58 bits, since log2(3) ≈ 1.58). As a result, the matrix multiplications that dominate inference reduce to additions and subtractions, with no multiplications, as shown below:
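Here is a toy sketch (ours, not the paper's kernel) illustrating why ternary weights remove multiplications: each output element is just a signed sum of activations, adding where the weight is +1, subtracting where it is -1, and skipping where it is 0.

```python
import torch

# Toy illustration: ternary weights turn a matmul into additions/subtractions.
W = torch.tensor([[1., 0., -1.],
                  [0., 1.,  1.]])      # ternary weights in {-1, 0, +1}
x = torch.tensor([0.3, -1.2, 0.7])     # input activations

y_matmul = W @ x                       # standard matmul (uses multiplications)

# Equivalent addition-only computation: add where w == +1, subtract where w == -1.
y_add_only = torch.stack([x[row == 1].sum() - x[row == -1].sum() for row in W])

assert torch.allclose(y_matmul, y_add_only)
print(y_add_only)                      # tensor([-0.4000, -0.5000])
```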
The weights are quantized with an absmean quantization function: the weight matrix is first scaled by its average absolute value, and each entry is then rounded to the nearest integer in {-1, 0, +1}.
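A minimal sketch of this absmean scheme, roughly W_q = RoundClip(W / (mean(|W|) + eps), -1, 1); the function name and tensor shapes below are ours, not from any library:

```python
import torch

def absmean_quantize(W: torch.Tensor, eps: float = 1e-5):
    """Absmean quantization sketch: scale W by its mean absolute value,
    then round every entry to the nearest value in {-1, 0, +1}."""
    gamma = W.abs().mean()                           # average absolute value
    W_ternary = (W / (gamma + eps)).round().clamp(-1, 1)
    return W_ternary, gamma                          # gamma kept for rescaling

# Usage sketch:
W = torch.randn(4, 8)
W_ternary, gamma = absmean_quantize(W)
print(W_ternary.unique())  # tensor([-1., 0., 1.])
```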
The final model adopts the LLaMA architecture with its linear layers replaced by BitLinear layers (as described in BitNet), and is trained from scratch with 1.58-bit weights and 8-bit activations. Because of this, the model requires “almost no multiplications”.
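For intuition, here is an illustrative training-time sketch of what a BitLinear-style layer could look like, combining absmean ternary weights with 8-bit absmax activations and straight-through estimators. This is our simplification under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BitLinearSketch(nn.Module):
    """Illustrative BitLinear-style layer (not the authors' code): absmean
    ternary weights + 8-bit absmax activations, with straight-through
    estimators so the full-precision latent weights keep receiving gradients."""

    def __init__(self, in_features: int, out_features: int, eps: float = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 8-bit absmax activation quantization (per tensor here, for simplicity).
        scale_x = 127.0 / x.abs().max().clamp(min=self.eps)
        x_q = (x * scale_x).round().clamp(-128, 127) / scale_x
        x_q = x + (x_q - x).detach()                  # straight-through estimator

        # Absmean ternary weight quantization to {-1, 0, +1}, rescaled by gamma.
        gamma = self.weight.abs().mean()
        w_q = (self.weight / (gamma + self.eps)).round().clamp(-1, 1) * gamma
        w_q = self.weight + (w_q - self.weight).detach()  # straight-through estimator

        return nn.functional.linear(x_q, w_q)

# Usage sketch:
layer = BitLinearSketch(16, 4)
y = layer(torch.randn(2, 16))
print(y.shape)  # torch.Size([2, 4])
```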
Results
The key results of BitNet b1.58 are:
- Starting at 3B parameters, BitNet b1.58 matches the perplexity and end-task performance of full-precision LLaMA LM baselines.
- At 3.9B parameters, BitNet b1.58 outperforms the 3B LLaMA LM while using 3.3x less memory and being 2.4x faster.
- BitNet b1.58 is significantly more energy-efficient, with up to 41x lower energy consumption than full-precision models.
- At 70B parameters, it supports up to an 11x larger batch size and achieves 8.9x higher throughput than the LLaMA LM baseline.
- It paves the way for new hardware specifically designed to exploit this 1.58-bit computation paradigm.
Conclusion
The paper introduces BitNet b1.58, which uses a ternary representation {-1, 0, +1} for all weights of the LLM. It shows very promising results at the 3B scale, where it matches full-precision performance while being roughly two times faster and requiring over three times less memory. More details are in the paper.
Congrats to the authors for their work!
Ma, Shuming, et al. “The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits.” arXiv preprint arXiv:2402.17764 (2024).