Microsoft Research has introduced BitNet b1.58 2B4T, a 2-billion-parameter language model that stores roughly 1.58 bits per weight rather than the usual 16 or 32 bits. The figure comes from the weights being ternary, restricted to the values -1, 0, and +1, which carry log2(3) ≈ 1.58 bits of information each. Despite the reduced precision, the model matches the performance of comparable full-precision models and runs efficiently on both GPUs and CPUs.
The model was trained on a corpus of 4 trillion tokens and performs strongly across tasks such as language understanding, mathematics, coding, and conversation. Microsoft has published the model weights on Hugging Face, along with open-source code for running it.
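For readers who want to try the release, a minimal loading sketch along the following lines should work. It assumes the checkpoint lives under the microsoft/bitnet-b1.58-2B-4T repo id and is compatible with the standard transformers API; the exact repo name is an assumption here, and the published open-source runtime may be required to realise the full CPU/GPU efficiency gains.

```python
# Minimal sketch: loading the released weights from Hugging Face.
# The repo id and transformers compatibility are assumptions; Microsoft's
# dedicated inference code may be needed for the advertised efficiency.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/bitnet-b1.58-2B-4T"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "Explain what a 1.58-bit language model is in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```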
According to the technical report, BitNet b1.58 2B4T achieves comparable performance to leading full-precision LLMs while offering significant advantages in computational efficiency, including reduced memory footprint, energy consumption, and decoding latency.
The model’s architecture follows the standard Transformer, modified according to the BitNet framework. The central innovation is replacing the conventional linear layers with custom BitLinear layers, which quantise the weights to ternary values on the fly during the forward pass.
BitNet b1.58 2B4T uses an absolute mean (absmean) quantisation scheme for weights and an absolute maximum (absmax) scheme for activations. It also incorporates subln normalisation, squared ReLU activation, rotary position embeddings, and a byte-level Byte-Pair Encoding tokeniser with a vocabulary of 128,256 tokens.
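To make the mechanism concrete, here is a minimal PyTorch sketch of a BitLinear-style layer that combines absmean weight quantisation with absmax activation quantisation, using a straight-through estimator so the latent full-precision weights can still be trained. It is illustrative only: the class name, per-tensor scaling choices, and the omission of subln normalisation and fused low-bit kernels are simplifications, not Microsoft's released implementation.

```python
# Illustrative sketch of a BitLinear-style layer: ternary (absmean) weight
# quantisation and 8-bit (absmax) activation quantisation in the forward pass.
import torch
import torch.nn as nn


def absmean_quantize_weights(w: torch.Tensor) -> torch.Tensor:
    # Scale by the mean absolute value, then round each weight to {-1, 0, +1}.
    scale = w.abs().mean().clamp(min=1e-5)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q * scale  # dequantised ternary weights


def absmax_quantize_activations(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
    # Per-token scaling by the maximum absolute value, mapped to the int8 range.
    q_max = 2 ** (bits - 1) - 1
    scale = q_max / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    return (x * scale).round().clamp(-q_max, q_max) / scale


class BitLinear(nn.Linear):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_q = absmax_quantize_activations(x)
        w_q = absmean_quantize_weights(self.weight)
        # Straight-through estimator: use quantised values in the forward pass
        # while letting gradients flow to the full-precision parameters.
        x_q = x + (x_q - x).detach()
        w_q = self.weight + (w_q - self.weight).detach()
        return nn.functional.linear(x_q, w_q, self.bias)


# Example: drop-in replacement for a standard linear layer.
layer = BitLinear(256, 256)
out = layer(torch.randn(4, 256))
```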
The training process consists of three phases: pre-training, supervised fine-tuning (SFT), and direct preference optimisation (DPO). BitNet b1.58 2B4T demonstrates that it’s possible to achieve dramatic reductions in computational requirements without compromising performance, making it a significant step forward in developing more efficient AI models.
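The article only names the DPO phase; for context, the standard direct preference optimisation objective it refers to can be sketched as below. The beta value, function signature, and log-probability inputs are illustrative conventions from the general DPO literature, not BitNet-specific details from the report.

```python
# Sketch of the standard DPO objective used in a final alignment phase.
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_chosen | x)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_rejected | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_chosen | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_rejected | x)
    beta: float = 0.1,                    # illustrative value, not from the report
) -> torch.Tensor:
    # Implicit rewards are log-ratios against a frozen reference model.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Maximise the probability that the chosen response outranks the rejected one.
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()
```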
Source: https://analyticsindiamag.com/ai-news-updates/microsoft-unveils-1-bit-compact-llm-that-runs-on-cpus