The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
The paper introduces BitNet b1.58, a 1.58-bit large language model (LLM) variant that matches the performance of full-precision models while being significantly more cost-effective at inference in terms of latency, memory, throughput, and energy consumption.
Method Overview
BitNet b1.58 constrains every weight to one of three values, {-1, 0, +1} (hence 1.58 bits, since log2(3) ≈ 1.58). As a result, the matrix multiplications that dominate inference reduce to additions and subtractions, with no multiplications, as shown below:
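Here is a toy sketch (ours, not the paper's kernel) illustrating why ternary weights remove multiplications: each output element is just a signed sum of activations, adding where the weight is +1, subtracting where it is -1, and skipping where it is 0.

```python
import torch

# Toy illustration: ternary weights turn a matmul into additions/subtractions.
W = torch.tensor([[1., 0., -1.],
                  [0., 1.,  1.]])      # ternary weights in {-1, 0, +1}
x = torch.tensor([0.3, -1.2, 0.7])     # input activations

y_matmul = W @ x                       # standard matmul (uses multiplications)

# Equivalent addition-only computation: add where w == +1, subtract where w == -1.
y_add_only = torch.stack([x[row == 1].sum() - x[row == -1].sum() for row in W])

assert torch.allclose(y_matmul, y_add_only)
print(y_add_only)                      # tensor([-0.4000, -0.5000])
```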
The weights are quantized with an absmean quantization function: the weight matrix is first scaled by its average absolute value, and each entry is then rounded to the nearest integer in {-1, 0, +1}.
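A minimal sketch of this absmean scheme, roughly W_q = RoundClip(W / (mean(|W|) + eps), -1, 1); the function name and tensor shapes below are ours, not from any library:

```python
import torch

def absmean_quantize(W: torch.Tensor, eps: float = 1e-5):
    """Absmean quantization sketch: scale W by its mean absolute value,
    then round every entry to the nearest value in {-1, 0, +1}."""
    gamma = W.abs().mean()                           # average absolute value
    W_ternary = (W / (gamma + eps)).round().clamp(-1, 1)
    return W_ternary, gamma                          # gamma kept for rescaling

# Usage sketch:
W = torch.randn(4, 8)
W_ternary, gamma = absmean_quantize(W)
print(W_ternary.unique())  # tensor([-1., 0., 1.])
```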
The final model adopts the LLaMA architecture with its linear layers replaced by BitLinear layers (as described in BitNet), and is trained from scratch with 1.58-bit weights and 8-bit activations. Because of this, the model requires “almost no multiplications”.
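For intuition, here is an illustrative training-time sketch of what a BitLinear-style layer could look like, combining absmean ternary weights with 8-bit absmax activations and straight-through estimators. This is our simplification under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BitLinearSketch(nn.Module):
    """Illustrative BitLinear-style layer (not the authors' code): absmean
    ternary weights + 8-bit absmax activations, with straight-through
    estimators so the full-precision latent weights keep receiving gradients."""

    def __init__(self, in_features: int, out_features: int, eps: float = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 8-bit absmax activation quantization (per tensor here, for simplicity).
        scale_x = 127.0 / x.abs().max().clamp(min=self.eps)
        x_q = (x * scale_x).round().clamp(-128, 127) / scale_x
        x_q = x + (x_q - x).detach()                  # straight-through estimator

        # Absmean ternary weight quantization to {-1, 0, +1}, rescaled by gamma.
        gamma = self.weight.abs().mean()
        w_q = (self.weight / (gamma + self.eps)).round().clamp(-1, 1) * gamma
        w_q = self.weight + (w_q - self.weight).detach()  # straight-through estimator

        return nn.functional.linear(x_q, w_q)

# Usage sketch:
layer = BitLinearSketch(16, 4)
y = layer(torch.randn(2, 16))
print(y.shape)  # torch.Size([2, 4])
```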
Results
The key results of BitNet b1.58 are:
- Starting at 3B parameters, BitNet b1.58 matches the perplexity and end-task performance of full-precision LLaMA LM baselines.
- At 3.9B parameters, BitNet b1.58 outperforms the 3B LLaMA LM while using 3.3x less memory and being 2.4x faster.
- BitNet b1.58 is significantly more energy-efficient, with up to 41x lower energy consumption than full-precision models.
- At 70B parameters, it supports up to an 11x larger batch size and achieves 8.9x higher throughput than the LLaMA LM baseline.
- It paves the way for new hardware specifically designed to exploit this 1.58-bit computation paradigm.
Conclusion
The paper introduces BitNet b1.58, which uses a ternary representation {-1, 0, +1} for all weights of the LLM. It shows very promising results at the 3B scale, where it matches full-precision performance while being roughly two times faster and requiring over three times less memory. More details are in the paper.
Congrats to the authors for their work!
Ma, Shuming, et al. “The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits.” arXiv preprint arXiv:2402.17764 (2024).