
The paper introduces BitNet b1.58, a 1.58-bit large language model (LLM) variant that matches the performance of full-precision models while being much more cost-effective for inference.

Method Overview

BitNet b1.58 constrains every weight to one of three values: -1, 0, or +1 (1.58 bits per weight, since log2(3) ≈ 1.58). As a result, the matrix multiplications that dominate inference reduce to additions and subtractions, with no multiplications, as sketched below:
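To make that concrete, here is a minimal NumPy sketch (my own illustration, not the paper’s kernel) of how a matrix-vector product with ternary weights needs only additions, subtractions, and skips:

```python
import numpy as np

def ternary_matvec(W_ternary, x):
    """Matrix-vector product where W_ternary has entries in {-1, 0, +1}.

    No multiplication is needed: a +1 weight adds the activation,
    a -1 weight subtracts it, and a 0 weight skips it entirely.
    """
    y = np.zeros(W_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(W_ternary):
        acc = 0.0
        for w, x_j in zip(row, x):
            if w == 1:
                acc += x_j      # +1 -> add
            elif w == -1:
                acc -= x_j      # -1 -> subtract
            # 0 -> skip
        y[i] = acc
    return y
```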

It uses an absmean quantization function: the weight matrix is first scaled by its average absolute value, and each entry is then rounded to the nearest integer among {-1, 0, +1}.
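A minimal NumPy sketch of that quantization step (the function name is mine; the small ε guards against division by zero, matching the paper’s RoundClip formulation):

```python
import numpy as np

def absmean_quantize(W, eps=1e-5):
    """Quantize a weight matrix to {-1, 0, +1} via absmean scaling.

    gamma is the average absolute value of W; each scaled entry is then
    rounded and clipped to the nearest value in {-1, 0, +1}.
    """
    gamma = np.abs(W).mean() + eps
    W_ternary = np.clip(np.round(W / gamma), -1, 1)
    return W_ternary, gamma  # gamma is kept to rescale outputs later

# Example usage
W = np.random.randn(4, 8)
W_q, gamma = absmean_quantize(W)
print(np.unique(W_q))  # subset of [-1., 0., 1.]
```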

The final model adopts the LLaMA architecture with its linear layers replaced by BitLinear layers (as described in the original BitNet paper). It is trained from scratch with 1.58-bit weights and 8-bit activations, which is why the model requires “almost no multiplications”.
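Putting the two pieces together, here is a simplified sketch of what a BitLinear-style forward pass could look like with ternary (absmean) weights and 8-bit (absmax) activations; the function name and per-tensor scaling are my simplifications, and normalization and training details (e.g. the straight-through estimator) are omitted:

```python
import numpy as np

def bitlinear_forward(x, W, eps=1e-5):
    """Simplified BitLinear-style pass: ternary weights, 8-bit activations."""
    # 1.58-bit weights via absmean quantization
    gamma = np.abs(W).mean() + eps
    W_q = np.clip(np.round(W / gamma), -1, 1)

    # 8-bit activations via absmax quantization (Q_b = 2**(8-1) = 128)
    Qb = 128
    scale = Qb / (np.abs(x).max() + eps)
    x_q = np.clip(np.round(x * scale), -Qb, Qb - 1)

    # The core matmul now only mixes values in {-1, 0, +1} with 8-bit
    # integers, so it can be implemented with additions on suitable hardware.
    y = x_q @ W_q.T
    return y * gamma / scale  # rescale back to the original range
```

This inner product over ternary weights and 8-bit integers is exactly the kind of computation the paper argues dedicated 1-bit hardware could exploit.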

Results

The key results of BitNet b1.58 are:

  • BitNet b1.58 matches the perplexity and end-task performance of full-precision (FP16) LLaMA LM baselines starting from the 3B-parameter scale.

  • At 3.9B parameters, BitNet b1.58 outperforms the 3B LLaMA LM while using 3.3x less memory and being 2.4x faster.

  • BitNet b1.58 is significantly more energy-efficient, with up to 41x lower energy consumption than full-precision models.

  • It also supports up to an 11x larger batch size and 8.9x higher throughput than the LLaMA LM baseline.

  • It paves the way for new hardware specifically designed for 1-bit (ternary) LLMs.

Conclusion

The paper introduces BitNet b1.58, which uses a ternary representation {-1, 0, +1} for all weights of the LLM. It shows very promising results at the 3B scale, where it matches the performance of the full-precision baseline while being roughly two times faster and requiring about three times less memory. More details are in the paper.

Congrats to the authors for their work!

Ma, Shuming, et al. “The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits.” arXiv preprint arXiv:2402.17764 (2024).
