The goal of quantization in neural network training is to make neural networks more efficient by simplifying their computations. This is done by replacing floating point operations with operations on smaller number types (quantization of the parameters), while preserving the accuracy of the model as much as possible.
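As a small illustration of what this mapping looks like in practice, here is a minimal sketch of affine quantization of a single tensor with PyTorch's `torch.quantize_per_tensor` (the tensor values are made up for illustration):

```python
import torch

# A toy float tensor standing in for a layer's weights (values are made up).
x = torch.tensor([-1.0, -0.5, 0.0, 0.5, 1.0])

# Affine quantization: map the float range onto int8 with a scale and zero point.
scale = (x.max() - x.min()) / 255.0
zero_point = 0  # symmetric mapping is fine here since the range is centred on 0

xq = torch.quantize_per_tensor(x, float(scale), zero_point, torch.qint8)
print(xq.int_repr())    # the stored int8 values
print(xq.dequantize())  # the approximate float values recovered from them
```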
Quantization of large language models
The LLM.int8() paper (Dettmers et al. 2022) describes some interesting issues and solutions for the quantization of transformer-based large language models. Notably, emergent outlier features appear in the activations of these models at scale, which breaks naive 8-bit quantization; LLM.int8() handles these outlier dimensions in 16-bit precision. More details are given in the author's blog post.
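As a hedged illustration of how this is typically used in practice: assuming the `transformers`, `accelerate` and `bitsandbytes` libraries are installed, LLM.int8() can be enabled when loading a model through Hugging Face transformers roughly as follows (the model name is only an example):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Enable the 8-bit (LLM.int8()) loading path provided by bitsandbytes.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",           # example model name, not prescriptive
    quantization_config=bnb_config,
    device_map="auto",
)
```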
Quantization with PyTorch
PyTorch allows you to perform three kinds of quantization on your model, with different tradeoffs:
- Dynamic quantization (weights quantized, with activations read/stored in floating point and quantized on the fly for compute). To quantize all `Linear` layers in a model to use `int8` weights and activations, use the following code:

  ```python
  import torch.quantization

  quantized_model = torch.quantization.quantize_dynamic(
      model, {torch.nn.Linear}, dtype=torch.qint8
  )
  ```
- Static quantization (weights quantized, activations quantized, calibration required post training). This option requires calibration with a representative dataset to determine optimal quantization parameters for the activations (a sketch is given after this list).
- Static quantization aware training (weights quantized, activations quantized, quantization numerics modeled during training). This is usually the method that yields the highest accuracy. All computations are still done with floating point numbers during training, but they model the fact that values will be rounded to the nearest `int8` value during quantization (a sketch is given after this list).
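For the static quantization option above, here is a minimal sketch of the eager-mode workflow; the tiny model and the calibration data are made up for illustration:

```python
import torch

class SmallModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # marks where inputs are quantized
        self.fc = torch.nn.Linear(16, 4)
        self.relu = torch.nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()  # marks where outputs go back to float

    def forward(self, x):
        return self.dequant(self.relu(self.fc(self.quant(x))))

model_fp32 = SmallModel().eval()
# "fbgemm" targets x86 servers; use "qnnpack" on ARM.
model_fp32.qconfig = torch.quantization.get_default_qconfig("fbgemm")

# Insert observers that record activation ranges.
prepared = torch.quantization.prepare(model_fp32)

# Calibration: run representative data through the model so the observers
# can pick scales and zero points for the activations.
calibration_data = [torch.randn(8, 16) for _ in range(10)]  # stand-in for real inputs
with torch.no_grad():
    for batch in calibration_data:
        prepared(batch)

# Replace the observed modules with quantized ones.
model_int8 = torch.quantization.convert(prepared)
```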
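And for static quantization aware training, a similarly minimal, self-contained sketch; the model, data, and training loop are all stand-ins:

```python
import torch

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = torch.nn.Linear(16, 4)
        self.relu = torch.nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.relu(self.fc(self.quant(x))))

model_fp32 = TinyModel().train()
model_fp32.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")

# Insert fake-quantization modules that simulate int8 rounding in the forward pass
# while gradients still flow in floating point.
prepared = torch.quantization.prepare_qat(model_fp32)

optimizer = torch.optim.SGD(prepared.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

# Stand-in training data; a real use case would iterate over an actual dataset.
train_loader = [(torch.randn(8, 16), torch.randn(8, 4)) for _ in range(10)]
for inputs, targets in train_loader:
    optimizer.zero_grad()
    loss = loss_fn(prepared(inputs), targets)
    loss.backward()
    optimizer.step()

# After training, convert the fake-quantized model into a real int8 model.
model_int8 = torch.quantization.convert(prepared.eval())
```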
Bibliography
- Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer. 2022. "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale". arXiv:2208.07339.