There's a new quantization algorithm in town! The **Additive Quantization of Language Models (AQLM)** [1] quantization procedure was released in early February 2024 and has already been integrated into HuggingFace Transformers (as of version 4.38.0, 21/02/2024) and HuggingFace PEFT (as of version 0.9.0, 28/02/2024). This means that checkpoints quantized using AQLM can be loaded using these libraries, and HuggingFace Transformers can be used to quantize compatible checkpoints using AQLM.

In this blog post, we will examine the key results presented in the AQLM paper [1] and provide a detailed overview of the key concepts behind this new quantization technique.

We will first review the key results presented in the AQLM paper. Next, we will examine the motivations for quantizing large language models for inference. We will then dive into the details of Multi-Codebook Quantization (MCQ), a technique uniquely leveraged by AQLM for weight quantization. After breaking down the memory footprint of AQLM models and examining key quantization parameters, we will explain the AQLM quantization procedure step by step. Finally, we will discuss the concept of Pareto efficiency as it relates to model quantization, providing perspective on how AQLM pushes the boundaries of Pareto-optimal quantization.

Existing weight-only quantization algorithms could technically quantize model weights down to the 2-bit range. However, they failed to preserve model accuracy effectively. AQLM is a new weight-only post-training quantization (PTQ) algorithm that sets a new state of the art for the 2-bit-per-parameter range. It also provides smaller benchmark improvements over existing methods for the 3-bit and 4-bit ranges (Table 1). Specifically, AQLM outperforms popular algorithms like GPTQ [2] as well as more recent but lesser-known methods such as QuIP [3] and QuIP# [4]. The AQLM authors also claim that their quantization algorithm pushes the Pareto frontier of the tradeoff between model accuracy and memory footprint below 3 bits per parameter for the first time.

The table below summarizes the performance of AQLM when compressing the Llama-2-70B model to 4, 3, and 2 bits per parameter. Performance is measured by perplexity on the WikiText2 [5] and C4 [6] datasets (lower is better) as well as zero-shot accuracy on the WinoGrande [7] and HellaSwag [8] benchmarks (higher is better). For comparison, the performance of QuIP#, the top competing method, is shown for 4-bit and 2-bit compression. Since the available QuIP# implementation does not support 3-bit compression, SpQR [9] is included as the comparison method for AQLM at 3 bits.

While quantization can sometimes reduce inference latency compared to FP16, this is not guaranteed. In benchmarks, AQLM-quantized models showed moderate latency improvements, with speedups ranging from 1.2x to 2x in most cases, and up to 3.05x in the best case. However, latency reduction was not the focus of AQLM's designers. Their priority was maximizing accuracy within a target model size, rather than optimizing for speed. Consequently, the latency gains from AQLM quantization are noticeable but not as dramatic as the improvements from other existing quantization algorithms.

Nevertheless, AQLM marks an important step towards making large language models more accessible on consumer hardware and mobile devices. For example, when quantizing a 7B model from a 16-bit half-precision format like FP16 (16 bits or 2 bytes per parameter) down to just 2 bits per parameter (0.25 bytes per parameter), the memory footprint is reduced by a factor of 8, decreasing from 14GB down to only 1.75GB.
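The arithmetic behind these figures can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the helper name `weight_memory_gb` is made up for illustration, 1 GB is taken as 10⁹ bytes, and codebook or scale overhead is ignored.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), ignoring any overhead."""
    return n_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(7e9, 16)    # 7B parameters at 16 bits each
two_bit_gb = weight_memory_gb(7e9, 2)  # same model at 2 bits per parameter

print(f"FP16: {fp16_gb:.2f} GB, 2-bit: {two_bit_gb:.2f} GB, "
      f"reduction: {fp16_gb / two_bit_gb:.0f}x")
```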

PTQ methods fall into two categories: those that quantize just the model weights, and those that quantize both weights and activations. AQLM falls into the first category, only quantizing weights. Model weights are static by definition, so they can be quantized offline before deployment and even distributed on platforms such as the HuggingFace Model Hub. Activations encompass everything else, including the key-value (KV) cache, and are only known at runtime during inference.

The first checkpoints quantized (mostly to 2 bits) using AQLM have started to appear on the HF Hub. However, TheBloke, a popular model quantizer, has not yet included this quantization technique in his set of quantization methods.

When quantizing LLM weights, not all the weights are actually quantized. Only the parameters that make up the bulk of the parameter count, like the large projection matrices of both the attention and feed-forward layers, are typically quantized. Other parameters are usually kept in native precision.

When opting for weight-only quantization, efficient mixed-precision kernels for matrix multiplications are usually not available. As a result, quantized weights are dequantized at runtime after being fetched from memory. Depending on the overhead of dequantization, the latency reductions from lower data transfer can be partially preserved or completely offset.

By reducing the memory footprint of the weights, quantizing large language model weights for inference provides four main benefits:

- Reduced hardware requirements for model serving: a quantized model can be served using less expensive GPUs or even made accessible on consumer devices or mobile platforms.
- Increased space for the KV cache to enable larger batch sizes and/or sequence lengths.
- Faster decoding latency. As the decoding process is memory-bandwidth bound, less data movement from reduced weight sizes directly improves this, unless offset by dequantization overhead.
- A higher compute-to-memory-access ratio (through reduced data movement), known as arithmetic intensity. This allows for fuller utilization of available compute resources during decoding.
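To make the last point concrete, here is a rough sketch of how arithmetic intensity changes with weight precision for a single matrix-vector product during decoding. It rests on simplifying assumptions that are not from the paper: only weight traffic is counted, dequantization cost is ignored, and the function name and the 4096 x 4096 layer shape are purely illustrative.

```python
def decode_arithmetic_intensity(d_out: int, d_in: int, bits_per_param: float,
                                batch_size: int = 1) -> float:
    """FLOPs per byte of weight traffic for a (d_out, d_in) linear layer."""
    flops = 2 * d_out * d_in * batch_size             # one multiply and one add per weight
    weight_bytes = d_out * d_in * bits_per_param / 8  # bytes fetched for the weights
    return flops / weight_bytes


fp16_intensity = decode_arithmetic_intensity(4096, 4096, bits_per_param=16)
two_bit_intensity = decode_arithmetic_intensity(4096, 4096, bits_per_param=2)

print(f"FP16: {fp16_intensity:.1f} FLOPs/byte, 2-bit: {two_bit_intensity:.1f} FLOPs/byte")
```

Under these assumptions the ratio depends only on the number of bits per parameter, which is why shrinking the weights raises the amount of compute done per byte moved.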

**AQLM applies Multi-Codebook Quantization (MCQ) to compress the weights of LLMs.** Originally, MCQ was developed to enable efficient nearest neighbor search on vector databases. It works by splitting each vector of the database into **subgroups** (sub-vectors), which are in turn approximated using learned vectors named **codewords**. A **codebook** is a set of such codewords. This allows similarity computations to be performed efficiently using the finite set of codewords instead of the full vector database.

In AQLM, the vectors that are quantized correspond to the rows of the weight matrices. That is, AQLM quantizes the output channels of each weight matrix using MCQ.

**Note:** AQLM uses the *W.X* notation convention (*W* and *X* are the weight and activation matrices respectively), whereas some other quantization papers use the reverse *X.W* convention. This means the output channels that AQLM quantizes correspond to the rows of the weight matrix, whereas in *X.W* notation, they would be the columns.

**Each row of the weight matrix** of shape *(d_out, d_in)* is divided into sub-vectors called **groups** of size *(1, g)*. **Assuming the codebooks have already been learned**, AQLM approximates each group as **the sum of M same-size codewords** that are stored at native precision. Each codeword belongs to a different codebook, each codebook containing *2^B* codewords. To reconstruct a group using the learned codebooks, we actually only need to store the index of each constituent codeword in its codebook. This index can be represented as a *2^B*-dimensional one-hot vector called a **code**. So each group is represented by *M* one-hot code vectors of size *2^B*. Storing such a one-hot vector requires only *B* bits, since it is fully determined by the position of its single non-zero entry. Therefore, the total memory footprint to store the compressed representation of each group is *M* x *B* bits.
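The following NumPy sketch illustrates this representation for a single group. It assumes the codebooks are already available (random values stand in for learned ones here), and the values of `M`, `B`, and `g` are arbitrary illustrative choices rather than settings from the paper. It shows that summing codewords selected by integer indices is equivalent to multiplying one-hot code vectors with the codebooks.

```python
import numpy as np

M, B, g = 2, 8, 8  # illustrative values only
rng = np.random.default_rng(0)

# M codebooks of 2^B codewords each, every codeword of size g (random stand-ins).
codebooks = rng.normal(size=(M, 2 ** B, g)).astype(np.float32)

# One integer index per codebook: this is all that needs to be stored for the group.
codes = rng.integers(0, 2 ** B, size=M)

# Reconstruction by direct lookup: sum of the M selected codewords.
group_from_indices = sum(codebooks[m, codes[m]] for m in range(M))

# Equivalent reconstruction with explicit one-hot code vectors.
one_hot = np.zeros((M, 2 ** B), dtype=np.float32)
one_hot[np.arange(M), codes] = 1.0
group_from_one_hot = np.einsum("mk,mkg->g", one_hot, codebooks)

bits_per_group = M * B  # each index needs B bits
print(group_from_indices.shape, bits_per_group)
```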

The process of building the quantized representation in AQLM is summarized in Figure 1. It should be noted that before splitting each output channel into groups, the output channels are scaled by a learned scaling factor.

As mentioned previously, at inference time, the matrix multiplication with activations *X* uses **dequantized**, native-precision parameters rather than the quantized code vectors. As shown in Figure 2, the dequantization process works by decompressing the code vectors back into one-hot index vectors to retrieve the corresponding codewords from each codebook. These codewords are summed together, then scaled to approximately reproduce the original half-precision weight values for computation.
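As a rough illustration of this dequantization step, the sketch below rebuilds a full weight matrix from codes, codebooks, and per-channel scales using NumPy. The tensor layout (`codes` of shape `(d_out, d_in / g, M)`) and the function name `dequantize` are assumptions made for this example; they are not claimed to match the storage format or kernels of the AQLM implementation.

```python
import numpy as np


def dequantize(codes: np.ndarray, codebooks: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """codes: (d_out, n_groups, M) ints; codebooks: (M, 2**B, g); scales: (d_out,)."""
    d_out, n_groups, M = codes.shape
    g = codebooks.shape[-1]
    groups = np.zeros((d_out, n_groups, g), dtype=np.float32)
    for m in range(M):
        # Look up the codeword chosen in codebook m for every group, then accumulate.
        groups += codebooks[m][codes[:, :, m]]
    weights = groups.reshape(d_out, n_groups * g)  # concatenate groups along each row
    return weights * scales[:, None]               # apply one scale per output channel


rng = np.random.default_rng(0)
d_out, d_in, g, M, B = 4, 16, 8, 2, 4  # tiny illustrative sizes
codebooks = rng.normal(size=(M, 2 ** B, g)).astype(np.float32)
codes = rng.integers(0, 2 ** B, size=(d_out, d_in // g, M))
scales = rng.uniform(0.5, 1.5, size=d_out).astype(np.float32)

W_hat = dequantize(codes, codebooks, scales)
print(W_hat.shape)
```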

Most importantly, what is the achieved average number of bits per parameter using AQLM? To store an AQLM-quantized weight matrix, the following information needs to be stored:

- *M* codebooks, each containing *2^B* codewords stored at native 16-bit precision. Each codeword has size *(1, g)*.
- *d_out* scaling factors, each stored as a 16-bit float.
- *M* code vectors of *B* bits each to encode each group, of which there are *d_out* x *d_in/g* in total.

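Therefore, the average number of bits per parameter can be calculated with the following formula, whose three terms correspond to the codebooks, the scaling factors, and the codes respectively:

$$
\text{bits per parameter} = \frac{16 \cdot M \cdot 2^{B} \cdot g \;+\; 16 \cdot d_{out} \;+\; M \cdot B \cdot d_{out} \cdot d_{in} / g}{d_{out} \cdot d_{in}}
$$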

It should be noted that the formula above calculates the average bits per parameter for a single weight matrix, i.e. a single layer, not the entire model.

To understand how each term contributes for different configurations, let's examine a specific example: the feed-forward layer of the Llama-2-70B model (*d_in = 8,192* and *d_out = 28,672*). Table 2 shows the breakdown of each term's contribution across different configurations for this layer.

The contribution of the scaling factor terms is always negligible. The average number of bits per parameter is primarily dictated by the codes encoding each group. The codebook terms generally have a small contribution, unless both *B* and *g* are set to relatively high values (as in Scenario D).
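The per-term breakdown can be reproduced with a short calculation derived from the storage list above (codebooks, scaling factors, codes). The function below is a sketch with a made-up name, it assumes *d_in* is divisible by *g*, and the configuration shown (*M=1*, *B=16*, *g=8*) is just one illustrative choice for the Llama-2-70B feed-forward shape, not a row copied from Table 2.

```python
def aqlm_bits_per_param(d_out: int, d_in: int, M: int, B: int, g: int,
                        native_bits: int = 16) -> dict:
    """Average bits per parameter of one AQLM-quantized matrix, split by term."""
    n_params = d_out * d_in
    codebook_bits = native_bits * M * (2 ** B) * g  # M codebooks of 2^B codewords of size g
    scale_bits = native_bits * d_out                # one scale per output channel
    code_bits = M * B * d_out * (d_in // g)         # M codes of B bits per group
    return {
        "codebooks": codebook_bits / n_params,
        "scales": scale_bits / n_params,
        "codes": code_bits / n_params,
        "total": (codebook_bits + scale_bits + code_bits) / n_params,
    }


breakdown = aqlm_bits_per_param(d_out=28672, d_in=8192, M=1, B=16, g=8)
for term, bits in breakdown.items():
    print(f"{term:>9}: {bits:.4f} bits per parameter")
```

For this configuration the codes account for exactly 2 bits per parameter, while the codebooks and scales add only a few hundredths of a bit on top.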

The group size *g*, number of codebooks *M*, and codebook size *B* are hyperparameters in AQLM's quantization process. Assuming the code terms dominate the average bits per parameter, we can approximate the total as *B.M/g*. This means multiple combinations of *g*, *M*, and *B* can satisfy the same overall bit budget. To select the optimal configuration, we need to examine how these parameters impact model performance.

**Note:** The names of AQLM-quantized models follow a `XBit-MxB` naming scheme, such as `ISTA-DASLab/gemma-2b-AQLM-2Bit-1x16-hf` for the 2-bit quantized version of Gemma-2B using one codebook with 65,536 (2¹⁶) codewords. Knowing the total bit budget, *M* and *B*, we can easily derive *g*.
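As a small illustration, the group size can be recovered from a model name with a regular expression and the *B.M/g* approximation. The helper `parse_aqlm_name` is hypothetical (it is not part of any library), and it ignores the small codebook and scale overhead.

```python
import re


def parse_aqlm_name(model_name: str) -> dict:
    """Extract the bit budget, M and B from an 'XBit-MxB' name and derive g."""
    match = re.search(r"(\d+)Bit-(\d+)x(\d+)", model_name)
    if match is None:
        raise ValueError(f"no XBit-MxB pattern found in {model_name!r}")
    bits, M, B = (int(value) for value in match.groups())
    g = M * B / bits  # from bits ~ B * M / g
    return {"bits": bits, "M": M, "B": B, "g": g}


config = parse_aqlm_name("ISTA-DASLab/gemma-2b-AQLM-2Bit-1x16-hf")
print(config)
```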

Regarding latency, the higher the number of codewords, the lower the latency speedup. For example, matrix-vector multiplication for the 2-bit 1×16 (65,536 codewords in total) Llama-7B model on GPU (Nvidia RTX 3090) shows a 1.31x speedup compared to the FP16 model, while the same-size 2×8 (512 codewords in total) model achieves a 1.57x speedup.

However, decreasing the number of codewords negatively impacts model accuracy. For example, the paper demonstrates that the 1×16 Llama-7B model (2-bit range) achieves a perplexity score of 6.29 on WikiText2 [5], while the 2×8 variant of the same model scores 7.98 on the same dataset. In comparison, the FP16 version scores 5.12.

Now, considering a fixed total bit budget (e.g. 2 bits) and codebook size *B* (e.g. *B=8*), there are multiple valid (*M, g*) pairs that satisfy the budget constraint. For instance, with *B=8*, the pairs (1, 4), (2, 8), …, (8, 32), etc. are valid configurations. The paper demonstrates that within a given budget, larger (*M, g*) values correlate with lower perplexity, i.e. reduced quantization errors, although with diminishing returns. This reveals a latency-accuracy tradeoff: higher *M* improves accuracy but also increases latency.
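These valid pairs can be enumerated directly from the *B.M/g* approximation. The sketch below uses a hypothetical helper and an arbitrary upper bound on *M*; it keeps only configurations with an integer group size.

```python
def valid_configs(bit_budget: float, B: int, max_M: int = 8) -> list:
    """List (M, g) pairs such that B * M / g equals the bit budget."""
    configs = []
    for M in range(1, max_M + 1):
        g = B * M / bit_budget
        if g == int(g):  # keep only integer group sizes
            configs.append((M, int(g)))
    return configs


pairs = valid_configs(bit_budget=2, B=8)
print(pairs)
```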

**Note:** For many quantization methods, the average number of bits per parameter is dictated by the precision used to store parameters, such as INT8, INT4, INT3, etc. This only allows a few discrete average bit widths. In contrast, AQLM provides much more flexibility: by adjusting the *g*, *M*, and *B* hyperparameters, a wider range of average bits can be achieved with finer granularity (as shown in Table 3).

**Note:** Leaving model accuracy aside, it is likely that not all configurations are equally efficient. For instance, if the value of *B* is not a multiple of 8, then each stored code does not use all the bits within the bytes needed to represent it.

In the previous section, we assumed the codebooks and codes were already learned in order to demonstrate how AQLM builds a compressed representation. **In practice, quantizing a model with AQLM involves learning these codebooks.** Once the codebooks have been learned, compressing a weight matrix using the process described above is straightforward.

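For an input half-precision weight matrix *W*, the AQLM quantization process learns: *M* codebooks *C*, *d_out* scaling factors *s*, and for each group, *M* code vectors *b*. These are learned by minimizing the following loss function, where *Ŵ* denotes the weight matrix reconstructed from the codebooks, scaling factors, and codes, and *X* denotes the layer's input activations:

$$
\underset{C,\, s,\, b}{\arg\min} \; \left\lVert W X - \widehat{W} X \right\rVert_2^2
$$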

To learn the codebooks and the codes, **calibration data** (i.e. training data) is required. The authors use a few hundred 4096-length sequences from the RedPajama-v1 dataset [10] as calibration data. Performance is measured by evaluating perplexity on the WikiText2 [5] and C4 [6] datasets, which serve as validation sets.
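To give a feel for what is being minimized, the sketch below computes a layer-wise output reconstruction error on calibration activations, assuming the loss is the squared norm of the difference between the original and quantized layer outputs in the *W.X* convention. The rounded matrix used as `W_hat` is only a crude stand-in for an AQLM reconstruction, and all shapes are arbitrary.

```python
import numpy as np


def layer_output_error(W: np.ndarray, W_hat: np.ndarray, X: np.ndarray) -> float:
    """Squared Frobenius norm of W X - W_hat X, with X of shape (d_in, n_tokens)."""
    diff = W @ X - W_hat @ X
    return float(np.sum(diff ** 2))


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))    # original weights, shape (d_out, d_in)
X = rng.normal(size=(8, 32))   # calibration activations, shape (d_in, n_tokens)
W_hat = np.round(W, 1)         # crude stand-in for a quantized reconstruction

print(layer_output_error(W, W_hat, X))
```

Measuring the error on layer outputs rather than on the weights themselves means that weight errors matter only to the extent that they change what the layer computes on realistic inputs.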

The technicalities of this particular training would take us too far into the peculiarities of codebook learning. We will just cover the main steps of the AQLM training (and therefore quantization) procedure.

The AQLM algorithm actually applies to each Transformer decoder block. For a given decoder block, quantization is a two-step process:

- Codebooks, scaling factors, and codes are learned for each linear layer in the block. In each case, the loss function minimization occurs in two stages:
  1. The codes are learned first using the initialized codebooks and scaling factors. The codebooks are kept fixed at this stage, after being initialized with a residual k-means approach.
  2. With the codes learned from the first stage remaining fixed, the codebooks and scaling factors are then updated starting from their initialized values.
- After quantizing each linear layer in a decoder block, the block's codebooks, scaling factors, and non-quantized parameters (like normalization layer scales/biases) undergo further fine-tuning. The codes remain frozen at this stage. This fine-tuning uses input and output activations recorded before quantization and allows joint optimization of the parameters across layers. Optimizing jointly accounts for interactions between quantization errors across layers, which is important at very low bitrates where quantization errors are relatively larger.
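The residual k-means initialization mentioned in the first step can be sketched as follows: a first codebook is fit with k-means on the groups, each later codebook is fit on whatever residual the previous ones leave behind. This is a generic NumPy illustration of residual k-means under simple assumptions (plain Lloyd iterations, random initial centroids, tiny sizes); it is not the authors' implementation, and the function names are made up.

```python
import numpy as np


def nearest(points: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Index of the closest centroid for every point."""
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def kmeans(points: np.ndarray, k: int, n_iters: int = 10, seed: int = 0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iters):
        assign = nearest(points, centroids)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:  # leave empty clusters unchanged
                centroids[j] = members.mean(axis=0)
    return centroids, nearest(points, centroids)


def residual_kmeans_init(groups: np.ndarray, M: int, B: int, seed: int = 0):
    """Fit M codebooks of 2^B codewords, each on the residual left by the previous ones."""
    residual = groups.astype(np.float64).copy()
    codebooks, codes = [], []
    for m in range(M):
        codebook, idx = kmeans(residual, 2 ** B, seed=seed + m)
        codebooks.append(codebook)
        codes.append(idx)
        residual = residual - codebook[idx]  # what the next codebook must explain
    return np.stack(codebooks), np.stack(codes, axis=1)


rng = np.random.default_rng(0)
groups = rng.normal(size=(512, 8))  # 512 groups of size g = 8 (toy data)
codebooks, codes = residual_kmeans_init(groups, M=2, B=4)

recon_1 = codebooks[0][codes[:, 0]]
recon_2 = recon_1 + codebooks[1][codes[:, 1]]
err_0 = float((groups ** 2).sum())
err_1 = float(((groups - recon_1) ** 2).sum())
err_2 = float(((groups - recon_2) ** 2).sum())
print(err_0, err_1, err_2)
```

Each additional codebook can only reduce the reconstruction error in this scheme, which makes it a reasonable starting point before the codes, codebooks, and scales are refined against the actual loss.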

The AQLM authors claim to have pushed the Pareto frontier for the tradeoff between model accuracy (measured by perplexity, for example) and memory footprint below 3 bits per weight for the first time. While this is an important achievement, what does this milestone represent?

Pareto optimality refers to an efficient state where one metric cannot be improved without negatively impacting another metric. For example, consider a system described by two desirable characteristics. A Pareto-optimal state is one where there exists no modification that could improve one characteristic without worsening the other. Conversely, if a change could positively affect one characteristic at no cost to the other, the current state would be considered Pareto-inefficient, as a more optimal state is possible. The Pareto frontier plots all such Pareto-optimal states.

When applied to model quantization, each model variant (quantized or full-precision) represents a state described by its accuracy and memory footprint. The Pareto frontier comprises the set of (usually quantized) models with the optimal tradeoff between accuracy and size. On this frontier, there exists no way to further compress model size without losing accuracy, or to improve accuracy without increasing memory requirements.
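This definition translates into a simple dominance check. The sketch below computes the Pareto frontier for a handful of model variants described by size and perplexity (lower is better for both). The model names and numbers are hypothetical toy values chosen for illustration, not results from the paper.

```python
def pareto_frontier(models: dict) -> list:
    """models maps a name to (size_gb, perplexity); returns the non-dominated names."""
    frontier = []
    for name, (size, ppl) in models.items():
        dominated = any(
            (s <= size and p <= ppl) and (s < size or p < ppl)
            for other, (s, p) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: models[n][0])


# Hypothetical toy values: (memory footprint in GB, perplexity).
toy_models = {
    "A": (1.0, 7.0),
    "B": (2.0, 6.0),
    "C": (2.0, 6.5),  # same size as B but worse perplexity: dominated
    "D": (4.0, 5.5),
    "E": (5.0, 5.8),  # larger and worse than D: dominated
}

print(pareto_frontier(toy_models))
```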

For example, the paper shows Llama-2-13B quantized using AQLM to 2 bits per weight achieves 5.65 perplexity, while 4-bit AQLM quantization of Llama-2-7B achieves 5.21 perplexity. Both occupy a comparable amount of memory (roughly 3.25GB and 3.5GB respectively, at 0.25 and 0.5 bytes per parameter), but the 2-bit model has worse accuracy. Therefore, at this footprint, the 4-bit model is more efficient: it reaches higher accuracy for a similar size.

How is that possible? These Pareto efficiency limitations stem from the difficulty quantization techniques face in avoiding substantial accuracy losses at extremely low bit-per-parameter values.

If we assume all quantization techniques could perfectly preserve model accuracy, then each time a new technique achieves higher compression, the Pareto frontier would simply shift to include only models quantized using that latest technique (Figure 3).

However, because quantization leads to losses in model accuracy, achieving higher compression does not necessarily mean reaching the Pareto frontier if the accuracy loss is too great compared to other existing techniques (Figure 4).

Pushing the Pareto frontier below 3 bits per weight means that existing sub-3-bit quantized models were not Pareto optimal: for a given model memory footprint, accuracy was not maximized. The authors determine 2.5 bits to be the optimal rate for the Llama-2 family with AQLM. In other words, Llama-2 models that are quantized to use an average of 2.5 bits per parameter using AQLM sit on the Pareto frontier.

In this post, we introduced AQLM, a new quantization algorithm that applies Multi-Codebook Quantization (MCQ) to large language models for the first time. AQLM sets a new state of the art for model compression in the 2-bit-per-parameter range and achieves Pareto optimality with sub-3-bit models for the first time.

With its groundbreaking compression rates and preservation of accuracy, AQLM represents a major step forward in deploying large language models efficiently and making them more accessible on consumer hardware and mobile devices.

AQLM is already supported by the HuggingFace Transformers and PEFT libraries, making it easy for developers to leverage AQLM's advantages!