BitNet b1.58: The Beginning of Sustainable AI


The recent developments in large language models (LLMs) have significantly transformed natural language processing. With their exceptional ability to comprehend, generate, and engage with human language, LLMs have revolutionized how we interact with AI, enabling advanced chatbots, virtual assistants, and sophisticated tools for content creation, translation, and summarization.

However, despite their unprecedented potential, LLMs pose several challenges, particularly regarding energy consumption and computational resources. LLMs’ training and deployment require substantial energy, mainly due to the extensive use of GPUs and other high-performance computing resources. This raises environmental concerns and limits the accessibility of state-of-the-art models, as the cost and availability of the necessary hardware become significant barriers for smaller organizations and researchers.

Therefore, while LLMs continue to push the boundaries of AI, their impact on energy consumption and the need for specialized hardware highlight the importance of pursuing more efficient and sustainable approaches to model development and deployment.

That’s where BitNet b1.58, a variant of the 1-bit LLM architecture recently introduced by Microsoft Research, plays a key role. The work has drawn considerable interest from the AI community, and we will unpack the research paper in this blog post.

Quantization

Before diving into its details, let’s first understand quantization and its importance.

The neural networks that make up large language models rely on the weights and activations of their internal nodes. Quantization is a technique used to reduce the precision of the numerical values representing these model parameters (weights) and activations (outputs of layers). During quantization, weights and other parameters are converted from high-precision formats, such as 32-bit floating point (FP32), to lower-precision formats, like 8-bit integers (INT8) or even binary and ternary formats. Quantization decreases the model’s memory footprint, speeds up inference and training, and reduces energy consumption without significantly impacting accuracy.

To illustrate this, consider a neural network node with a precise weight of 8.6256 (no quantization). Because this value has many decimal places, storing it and performing floating-point addition and multiplication on it takes memory and processing power. Rounding it to 9 (quantization), on the other hand, saves space and compute without significantly affecting the model’s performance.
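To make this concrete, here is a minimal, illustrative sketch (not from the paper) of symmetric scale-and-round quantization from FP32 to INT8; the weight values are made up:

```python
import numpy as np

# Hypothetical FP32 weights of a layer (illustrative values only).
weights_fp32 = np.array([8.6256, -3.1417, 0.0042, 5.5], dtype=np.float32)

# Symmetric INT8 quantization: map the largest magnitude to 127.
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -128, 127).astype(np.int8)

# At inference time the INT8 values are rescaled back (dequantized).
weights_dequant = weights_int8.astype(np.float32) * scale

print(weights_int8)      # e.g. [127 -46   0  81]
print(weights_dequant)   # close to the original values, stored in a quarter of the memory
```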

What is BitNet b1.58?

Quantization into a lower-precision data type is the key element of the BitNet b1.58 model introduced in this paper. The research builds on an earlier paper, BitNet: Scaling 1-bit Transformers for Large Language Models, which uses a linear layer called BitLinear to train 1-bit weights (-1 or 1) rather than traditional floating-point weights.

The researchers added a third value, 0, to the original 1-bit BitNet’s {-1, 1} weights. With three possible values per weight, each weight carries log2(3) ≈ 1.58 bits of information, hence the name. BitNet b1.58 retains all the benefits of the original 1-bit BitNet: its matrix multiplications require almost no multiplication operations and can be highly optimized.

Like the 1-bit BitNet, this model uses significantly less energy, requires less memory, and matches the performance of traditional LLMs that use 16-bit floating-point (FP16) weights.

Key Features of BitNet b1.58:

Figure: Decoding latency (left) and memory consumption (right) of BitNet b1.58 across model sizes. (Courtesy: BitNet b1.58 paper)
Figure: Energy consumption of BitNet b1.58 compared to LLaMA LLM at 7nm process nodes; the left panel shows the components of arithmetic-operation energy, the right shows the end-to-end energy cost across model sizes. (Courtesy: BitNet b1.58 paper)
  1. 1.58-bit weights: Using just three values of -1, 0, and 1 for the model weights drastically reduces memory requirements compared to 16-bit (FP16) LLMs.
  2. Matches FP16 performance: Despite the lower precision, BitNet b1.58 can match the perplexity and accuracy of full-precision FP16 LLMs at model sizes ≥ 3B.
  3. Faster inference: The ternary weights enable highly optimized matrix multiplication without floating-point operations, providing up to 4.1x faster inference than FP16 baselines.
  4. Lower memory & energy: The compressed model size leads to 3.55x lower GPU memory usage at 3B scale and up to 41x lower energy consumption as the model size increases.

How does it work? 

BitNet b1.58 models are trained from scratch with an “absmean” quantization function, which scales the model’s weights and rounds them to -1, 0, or +1, while traditional linear layers are replaced with BitLinear operations designed for 1.58-bit computation. This makes the model highly efficient, and incorporating LLaMA architecture components such as RMSNorm, SwiGLU, and rotary embeddings lets BitNet b1.58 match or even exceed FP16 LLaMA LLMs, achieving similar perplexity and improved zero-shot task accuracy for models of 3B parameters and above.
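Below is a minimal NumPy sketch of the absmean weight-quantization step described in the paper: scale the weight matrix by its mean absolute value, then round and clip each entry to {-1, 0, +1}. During training, this quantization is applied in the forward pass while high-precision latent weights are kept for gradient updates; the example values here are made up.

```python
import numpy as np

def absmean_quantize(W, eps=1e-5):
    """Quantize a weight matrix to ternary values {-1, 0, +1} using the
    absmean scheme: scale by the mean absolute value of the weights,
    then round and clip to [-1, 1]."""
    gamma = np.abs(W).mean()                         # average magnitude of all weights
    W_scaled = W / (gamma + eps)                     # scale so "typical" weights land near +/-1
    W_ternary = np.clip(np.round(W_scaled), -1, 1)   # RoundClip to {-1, 0, +1}
    return W_ternary.astype(np.int8), gamma

# Illustrative example with made-up weights.
W = np.array([[ 0.42, -0.07,  1.30],
              [-0.95,  0.01, -0.38]], dtype=np.float32)
W_q, gamma = absmean_quantize(W)
print(W_q)   # [[ 1  0  1]
             #  [-1  0 -1]]
```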

The model adopts a modified Transformer architecture with 1.58-bit weights and 8-bit activations, significantly reducing multiplication operations. This design points toward a computation paradigm that nearly eliminates multiplication in matrix computations.
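To see why ternary weights nearly eliminate multiplication, note that each weight in {-1, 0, +1} either adds the activation, subtracts it, or skips it. The following toy sketch (our own illustration, not the paper’s kernel) makes the point explicit:

```python
import numpy as np

def ternary_matvec(W_ternary, x):
    """Multiply a ternary weight matrix {-1, 0, +1} by an activation vector
    using only additions and subtractions (no floating-point multiplies)."""
    out = np.zeros(W_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(W_ternary):
        acc = 0.0
        for w, a in zip(row, x):
            if w == 1:
                acc += a        # +1 weight: add the activation
            elif w == -1:
                acc -= a        # -1 weight: subtract the activation
            # w == 0: skip, contributes nothing
        out[i] = acc
    return out

W_q = np.array([[1, 0, -1], [-1, 1, 0]], dtype=np.int8)
x = np.array([0.5, -1.2, 2.0], dtype=np.float32)
print(ternary_matvec(W_q, x))          # [-1.5 -1.7]
print(W_q.astype(np.float32) @ x)      # same result via an ordinary matmul
```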

Why is this Significant?

The development of BitNet b1.58 is a groundbreaking refinement for several reasons:

  • Cost and Energy Efficiency: By reducing the precision of weights to 1.58 bits, BitNet b1.58 drastically cuts down the energy and computational costs associated with running LLMs, making it a more sustainable option.
  • Model Performance: Despite its reduced bit representation, BitNet b1.58 matches or even surpasses the performance of full-precision LLMs in terms of perplexity and task-specific metrics, starting from a 3B model size.
  • Scalability and Future Applications: The model demonstrates excellent scalability and potential for future applications. Due to its reduced computational requirements, it enables more sophisticated AI models on edge and mobile devices.

Potential Future of Large Language Models:

The BitNet b1.58 model demonstrates significant cost and energy efficiency improvements while performing as well as a traditional Transformer model, which opens up several promising directions. Let’s take a look at a few.

1-bit Mixture-of-Experts (MoE) LLMs

Mixture-of-Experts (MoE) is a cost-effective approach for LLMs, but it has certain limitations. While it reduces computation FLOPs, it has issues with high memory consumption and inter-chip communication overhead, which restrict its application and deployment. However, these limitations can be addressed by using 1.58-bit LLMs. The reduced memory footprint of 1.58-bit LLMs reduces the number of devices required to deploy MoE models and significantly minimizes the overhead of transferring activations across networks. In fact, if the entire MoE model can be placed on a single chip, it can eliminate inter-chip communication overhead and dramatically streamline the deployment and execution of these powerful models.

Long Sequences in LLMs (Long Text Processing)

Current LLMs use a lot of memory to process long text because they rely on key-value (KV) caches that store intermediate computations for quick access during sequence processing. BitNet b1.58 addresses this issue head-on by reducing the data format of activations (the outputs of neural network layers) from the conventional 16 bits to 8 bits. This effectively halves the memory required to store these activations, enabling the model to handle sequences twice as long with the same amount of memory. The researchers anticipate that activations can be further compressed to 4 bits or even less without losing information (lossless compression).
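As a rough back-of-the-envelope sketch (the model dimensions below are illustrative assumptions, not the paper’s), halving activation precision from 16 bits to 8 bits halves the KV-cache footprint, which means roughly twice the sequence length fits in the same memory budget:

```python
# Rough KV-cache sizing (batch size 1):
# 2 (K and V) * layers * heads * head_dim * seq_len * bytes_per_value.
# The dimensions below are illustrative assumptions, not the paper's.
layers, heads, head_dim = 32, 32, 128
seq_len = 4096

def kv_cache_bytes(bytes_per_value):
    return 2 * layers * heads * head_dim * seq_len * bytes_per_value

fp16_gb = kv_cache_bytes(2) / 1e9   # 16-bit activations
int8_gb = kv_cache_bytes(1) / 1e9   # 8-bit activations, as in BitNet b1.58

print(f"16-bit KV cache: {fp16_gb:.2f} GB")
print(f" 8-bit KV cache: {int8_gb:.2f} GB  (same memory now fits ~2x the sequence length)")
```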

LLMs on Edge and Mobile

Edge and mobile devices are constrained by memory and processing power and typically rely on CPUs rather than GPUs. 1.58-bit LLMs can perform well on these less powerful CPUs, which opens up the possibility of building new applications and use cases: such devices could run tasks like conversation or translation locally with these models.

New Hardware for 1-bit LLMs

Today’s large language models depend heavily on the computational power of GPUs, which are expensive and require significant energy to run.

Recent developments, such as Groq, have produced promising results in creating specialized hardware known as Language Processing Units (LPUs). These units are designed to meet the computational requirements of LLMs and to improve the performance and efficiency of these increasingly complex and resource-intensive models.

The researchers suggest creating new hardware and systems optimized for 1-bit LLMs, lowering the computational load and energy consumption. This would involve designing processing units that can efficiently handle the simplified yet highly efficient computations of 1-bit models, improving their performance and making them more practical for a broader range of applications.

As artificial intelligence advances, large language models like BitNet b1.58 are making AI more accessible and sustainable. By innovating beyond traditional computational methods, BitNet b1.58 reduces the cost and energy required for AI technology. This is a big step towards a more environmentally responsible tech industry. The integration of these efficient models has the potential to accelerate AI innovation, making sophisticated language processing tools universally available. This will help to bridge the digital divide and promote a greener future for our planet. BitNet b1.58 is an example of how technological advancement and ecological stewardship can converge, ensuring that the future of AI is both inclusive and sustainable.

Key Links

BitNet b1.58 Research Paper
Authors: Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei
