The goal of quantization is to make neural networks more efficient by simplifying their computations: floating point operations are replaced with operations on smaller number types (the parameters, and possibly the activations, are quantized). The challenge is to preserve the accuracy of the model while doing this conversion.
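As a concrete illustration of the idea (a minimal sketch, not from any particular library), affine quantization maps a floating point range onto an integer range using a scale and a zero point; dequantizing recovers the original values up to rounding error:

```python
def quantize(x, num_bits=8):
    # Affine (asymmetric) quantization: map [min(x), max(x)] onto [0, 2^bits - 1].
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(x), max(x)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against a constant input
    zero_point = round(qmin - lo / scale)     # integer that represents 0.0
    q = [max(qmin, min(qmax, round(v / scale + zero_point))) for v in x]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Recover approximate floats; error is at most one quantization step.
    return [(v - zero_point) * scale for v in q]
```

A round trip `dequantize(*quantize(x))` reproduces each value of `x` to within one quantization step (`scale`), which is the accuracy cost the methods below try to manage.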
Quantization of large language models
The LLM.int8() paper (Dettmers et al. 2022) explains some interesting issues and solutions for quantization of transformer-based large language models. Notably, some emergent properties (large-magnitude outlier features) arise in these language models at scale. More details can be found in the author's blog post.
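The core idea of LLM.int8() can be sketched as a mixed-precision matrix multiplication: most feature dimensions are multiplied in int8 with vector-wise (per-row of the input, per-column of the weights) scaling, while the few outlier dimensions whose magnitude exceeds a threshold (6.0 in the paper) stay in floating point. A rough NumPy sketch (the function names and simplified scaling are our own, not the paper's implementation):

```python
import numpy as np

def quantize_rowwise(X):
    # Per-row absmax scaling to int8 (vector-wise quantization).
    s = np.abs(X).max(axis=1, keepdims=True) / 127.0
    s[s == 0] = 1.0
    return np.round(X / s).astype(np.int8), s

def llm_int8_matmul(X, W, threshold=6.0):
    # Feature columns of X with any |value| >= threshold are treated as outliers.
    outlier = (np.abs(X) >= threshold).any(axis=0)
    # int8 path: quantize the non-outlier part of X per row, of W per output column.
    Xq, sx = quantize_rowwise(X[:, ~outlier])
    Wq, sw = quantize_rowwise(W[~outlier].T)
    int8_part = (Xq.astype(np.int32) @ Wq.T.astype(np.int32)) * (sx * sw.T)
    # Floating point path for the outlier feature dimensions.
    fp_part = X[:, outlier] @ W[outlier]
    return int8_part + fp_part
```

On typical (e.g. Gaussian) inputs this stays close to the exact float product; the outlier decomposition is what prevents a few extreme feature values from destroying the int8 scaling for everything else.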
Quantization with PyTorch
PyTorch allows you to perform three kinds of quantization for your model, with various tradeoffs:

Dynamic quantization (weights quantized ahead of time; activations read/stored in floating point and quantized on the fly for compute)
To quantize all Linear layers in a model to use int8 weights and activations, use the following code:

```python
import torch.quantization

quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```
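End to end, dynamic quantization replaces each Linear module with a dynamically quantized version while inputs and outputs remain ordinary float tensors. A small self-contained sketch (the toy model and its sizes are illustrative assumptions):

```python
import torch

# A small illustrative model (not from the text).
model = torch.nn.Sequential(
    torch.nn.Linear(16, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 4),
)

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# The Linear layers are now dynamically quantized modules;
# the model is still called with (and returns) float tensors.
out = quantized(torch.randn(2, 16))
```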

Static quantization (weights quantized, activations quantized, calibration required post-training). This option requires calibration with a representative dataset to determine optimal quantization parameters for the activations.
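The static workflow in PyTorch's eager-mode API has three steps: attach observers (prepare), run representative data through the model (calibrate), then convert to quantized modules. A sketch, where the tiny model, the 'fbgemm' backend choice (x86), and the random calibration data are illustrative assumptions:

```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub/DeQuantStub mark where tensors enter/leave the quantized region.
        self.quant = torch.quantization.QuantStub()
        self.fc = torch.nn.Linear(4, 2)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = M().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")

# 1. Insert observers.
prepared = torch.quantization.prepare(model)
# 2. Calibrate: real code would use a representative dataset here.
for _ in range(10):
    prepared(torch.randn(8, 4))
# 3. Convert to int8 modules using the observed ranges.
quantized = torch.quantization.convert(prepared)

out = quantized(torch.randn(2, 4))
```

The calibration pass is what distinguishes this from dynamic quantization: activation scales are fixed once, from the observed data, instead of being recomputed per batch.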

Static quantization aware training (weights quantized, activations quantized, quantization numerics modeled during training). This is usually the method that yields the highest accuracy. All computations are still done with floating point numbers, while being aware that they will be rounded to the nearest int8 value during quantization.
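This "aware" training is implemented with fake-quantization modules that round values in the forward pass while gradients flow through in floating point. A minimal eager-mode sketch (the toy model, the 'fbgemm' backend, and the dummy loss are illustrative assumptions):

```python
import torch

model = torch.nn.Sequential(
    torch.quantization.QuantStub(),
    torch.nn.Linear(4, 2),
    torch.quantization.DeQuantStub(),
)
model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")

# Insert fake-quantization modules that simulate int8 rounding during training.
prepared = torch.quantization.prepare_qat(model)

# Short illustrative training loop with a dummy loss.
opt = torch.optim.SGD(prepared.parameters(), lr=0.01)
for _ in range(5):
    out = prepared(torch.randn(8, 4))
    loss = out.pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, convert to a genuinely int8 model.
quantized = torch.quantization.convert(prepared.eval())
q_out = quantized(torch.randn(2, 4))
```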
Bibliography
Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer. 2022. "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale". arXiv.