Overview of Post-training Quantization and Examples of Algorithms and Implementations

Overview of Post-training Quantization

Post-training quantization is a method of quantizing a model after the neural network has been trained: the model's weights and activations, which are usually expressed as floating-point numbers, are converted to a low-bit representation such as integers. This reduces the model's memory usage and improves inference speed. The following is an overview of post-training quantization:

1. Applied after training:

Post-training quantization is applied after the model has been trained. Because the model has already been trained, this technique is simpler than quantization performed during training (quantization-aware training) and can usually be applied more quickly.

2. quantization to integers or low-bit values:

Model parameters (weights and biases) and activations, which are usually expressed as floating-point numbers, are converted to an integer or other low-bit representation. For example, 8-bit signed integers are commonly used, which reduces memory usage and increases inference speed.
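
As a rough illustration of the memory saving, the following sketch quantizes a hypothetical float32 weight tensor to int8 with NumPy, using simple symmetric per-tensor scaling (the tensor shape and values are made up for illustration):

import numpy as np

# Hypothetical float32 weight tensor (shape and values are made up for illustration).
weights_fp32 = np.random.randn(256, 256).astype(np.float32)

# Symmetric per-tensor scaling: map the largest absolute weight to 127.
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -128, 127).astype(np.int8)

print(f"float32 size: {weights_fp32.nbytes} bytes")  # 256 * 256 * 4 = 262144 bytes
print(f"int8 size:    {weights_int8.nbytes} bytes")  # 256 * 256 * 1 = 65536 bytes (4x smaller)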

3. adjustment of quantization parameters:

During quantization, parameters such as the range and scale of the transformed data need to be adjusted. These parameters are used to convert the model weights and activations to integers and are adjusted so that the converted values are as close as possible to the original floating point numbers.
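
The following is a minimal sketch, in plain NumPy rather than any particular framework, of how a scale and zero point can be derived from a tensor's range and how the quantized values map back to approximate floating-point values (the helper functions are illustrative, not a library API):

import numpy as np

def quantize_affine(x, num_bits=8):
    """Asymmetric (affine) quantization: map the [min, max] range of x onto the integer grid."""
    qmin, qmax = 0, 2 ** num_bits - 1                # e.g. 0..255 for unsigned 8-bit
    scale = (x.max() - x.min()) / (qmax - qmin)      # real-valued step between integer levels
    zero_point = int(round(qmin - x.min() / scale))  # integer that represents the real value 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_affine(q, scale, zero_point):
    """Recover an approximation of the original floating-point values."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(1000).astype(np.float32)
q, scale, zero_point = quantize_affine(x)
x_hat = dequantize_affine(q, scale, zero_point)
print("max reconstruction error:", np.abs(x - x_hat).max())  # roughly bounded by the scale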

4. loss of precision:

Because quantization restricts the values that the model's parameters can take, some loss of accuracy generally occurs. However, with well-tuned quantization parameters this loss can be kept small.

5. fine-tuning:

After applying post-training quantization, fine-tuning may additionally be performed. This reduces the loss of accuracy and yields a quantized model whose performance is close to that of the original model.

Post-training quantization is a widely used method for easily optimizing existing models, and many deep learning frameworks and libraries (TensorFlow, PyTorch, TensorRT, etc.) support post-training quantization, making it easy to reduce memory and computational resources during inference.

Algorithms related to Post-training Quantization

The specific algorithm for post-training quantization may vary depending on the framework or tool, but the general procedure is as follows:

1. model preparation:

First, a neural network model that has been trained is prepared. Usually, this model has weights and activations expressed in floating point numbers.

2. selection of quantization targets:

The parameters to which quantization is applied (usually weights and activations) are selected. Typically, either the whole model or a part of it is targeted.

3. quantization to integers or low-bit values:

The selected parameters are converted to an integer or other low-bit representation. For example, 8-bit signed integers are often used; reduced-precision floating-point formats such as 16-bit floats are another option.

4. quantization parameter adjustment:

The range, scale, and other quantization parameters are adjusted, typically by observing representative data (calibration), so that the quantized values stay as close as possible to the original floating-point values.
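
As a rough sketch of this adjustment step, the following illustrates min/max calibration: a hypothetical observer records the range of activations seen on representative data and derives a scale and zero point from it (the MinMaxObserver class is made up for illustration; real frameworks ship their own observer implementations):

import numpy as np

class MinMaxObserver:
    """Tracks the running min/max of the tensors it sees (illustrative calibration helper)."""
    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, x):
        self.min_val = min(self.min_val, float(x.min()))
        self.max_val = max(self.max_val, float(x.max()))

    def quant_params(self, num_bits=8):
        qmin, qmax = 0, 2 ** num_bits - 1
        scale = (self.max_val - self.min_val) / (qmax - qmin)
        zero_point = int(round(qmin - self.min_val / scale))
        return scale, zero_point

# Feed a few batches of representative data through the observer, then derive the
# scale and zero point that will be used for quantized inference.
observer = MinMaxObserver()
for _ in range(10):
    activations = np.random.rand(32, 128).astype(np.float32)  # stand-in for real batches
    observer.observe(activations)

scale, zero_point = observer.quant_params()
print(f"scale={scale:.6f}, zero_point={zero_point}")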

5. fine-tuning:

Fine-tuning may be performed after quantization as needed. This reduces the loss of accuracy and yields a quantized model whose performance is close to that of the original model.

The specific algorithm and implementation depends on the framework and tools used. For example, deep learning frameworks such as TensorFlow and PyTorch provide dedicated tools and APIs for post-training quantization. Using these tools, it is possible to easily quantize the model after training.

Application Examples of Post-training Quantization

The following are examples of common applications of post-training quantization.

1. deployment in edge devices:

Edge devices and embedded systems have limited computing resources. Post-training quantization of trained models reduces memory usage and improves inference speed, making it an important technique for deployment on edge devices.

2. mobile applications:

Mobile applications require reductions in application size and communication costs. Post-training quantization of trained models reduces model size and improves the efficiency of mobile applications.

3. low-power and IoT devices:

Low-power and IoT devices have limited computational resources and battery life; post-training quantization allows for efficient model inference and improved utilization on these devices.

4. inference on web services and the cloud:

Inference on web services and in the cloud also requires reduced memory usage and increased inference speed, and optimizing trained models with post-training quantization can satisfy these requirements.

5. model deployment efficiency improvement:

This is useful when existing large-scale models need to be made lighter to improve resource efficiency during deployment, allowing models to be deployed on a wider range of platforms and environments.

In these cases, post-training quantization plays a role in optimizing existing models to save resources and improve efficiency during deployment. In general, post-training quantization is applied after training, so there is no need to retrain the model, and the advantage is that the model can be easily optimized.

Example implementation of Post-training Quantization

The specific steps to implement post-training quantization vary depending on the deep learning framework or tool being used. The following are basic examples of post-training quantization implementation in TensorFlow and PyTorch.

When using TensorFlow:

In TensorFlow, post-training quantization is performed using the TensorFlow Lite Converter. The following is an example of converting a model trained in Keras to a TensorFlow Lite model.

import tensorflow as tf
from tensorflow.keras.models import load_model

# Load a Keras model (use MNIST model as an example)
model = load_model('mnist_model.h5')

# Apply post-training quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_tflite_model = converter.convert()

# Save quantized model
with open('quantized_model.tflite', 'wb') as f:
    f.write(quantized_tflite_model)
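
If full integer quantization of both weights and activations is desired, the TensorFlow Lite converter can additionally be given a representative dataset for calibration. The sketch below assumes the same MNIST model as above; the (1, 28, 28) input shape and the random calibration samples are placeholders that would be replaced with real data:

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import load_model

model = load_model('mnist_model.h5')

# Generator yielding a few batches of representative input data for calibration
# (the shape assumes the MNIST model above and is only illustrative).
def representative_dataset():
    for _ in range(100):
        yield [np.random.rand(1, 28, 28).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict the converter to int8 kernels so that activations are quantized as well.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

quantized_int8_model = converter.convert()
with open('quantized_int8_model.tflite', 'wb') as f:
    f.write(quantized_int8_model)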

When using PyTorch:

In PyTorch, post-training quantization is performed with the torch.quantization API, optionally combined with TorchScript for saving the model. The following is an example of dynamic quantization of a trained PyTorch model.

import torch
import torchvision.models as models

# Use ResNet18 as an example
model = models.resnet18(pretrained=True)
model.eval()

# Dynamic quantization: weights of the listed module types are converted to 8-bit
# integers, and activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model,               # Original model
    {torch.nn.Linear},   # Module types to quantize (dynamic quantization targets Linear/RNN layers)
    dtype=torch.qint8    # Quantize weights to 8-bit integers
)

# Save quantized model via TorchScript
torch.jit.save(torch.jit.script(quantized_model), 'quantized_model.pt')

In these examples, quantization is performed after training and the resulting model is saved. The quantization method and bit width can also be selected.
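
Besides dynamic quantization, PyTorch also supports static post-training quantization, in which activation ranges are calibrated on representative data before conversion. The following is a minimal sketch of the eager-mode torch.quantization workflow using a small made-up model (SmallNet and the random calibration batches are purely illustrative; a real model would also benefit from module fusion):

import torch
import torch.nn as nn

# A minimal model wrapped with QuantStub/DeQuantStub, as required by eager-mode
# static quantization (the architecture is made up for illustration).
class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # converts float input to int8
        self.fc1 = nn.Linear(16, 32)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(32, 4)
        self.dequant = torch.quantization.DeQuantStub()  # converts int8 output back to float

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = SmallNet()
model.eval()

# Attach a quantization configuration ('fbgemm' targets x86 CPUs).
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')

# Insert observers that record activation ranges.
prepared = torch.quantization.prepare(model)

# Calibration: run a few batches of representative data through the prepared model.
with torch.no_grad():
    for _ in range(10):
        prepared(torch.randn(8, 16))

# Convert the observed modules to their quantized counterparts.
quantized = torch.quantization.convert(prepared)
print(quantized)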

Challenges and measures for Post-training Quantization

Post-training quantization is a useful method, but several challenges exist. These issues and their solutions are described below.

1. loss of accuracy:

Challenge: Quantization reduces accuracy by converting floating-point numbers to integers or other low-bit values.

Solution: The loss of accuracy can be reduced by adjusting the quantization precision, for example by using 16-bit floats instead of 8-bit integers for sensitive layers, as in the sketch below. Combining quantization with other measures, such as fine-tuning and avoiding bias in the training and calibration data, also helps.
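
As one example of adjusting the quantization precision, the TensorFlow Lite converter can target 16-bit floats instead of 8-bit integers, which usually loses less accuracy at the cost of a smaller size reduction. The sketch below reuses the hypothetical MNIST model from the implementation section:

import tensorflow as tf
from tensorflow.keras.models import load_model

model = load_model('mnist_model.h5')

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]  # quantize weights to 16-bit floats

fp16_model = converter.convert()
with open('quantized_fp16_model.tflite', 'wb') as f:
    f.write(fp16_model)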

2. limited reduction in model size:

Challenge: For some models, post-training quantization may not reduce the model size enough.

Solution: More advanced quantization methods and a review of the model architecture can enable a larger reduction in model size.

3. the need for fine-tuning:

Challenge: Recovering the accuracy lost through quantization often requires fine-tuning, and skipping it can leave noticeable performance degradation.

Solution: Fine-tuning after quantization can improve performance; fine-tuning is an important factor in adapting the model to new data sets and inference environments.

4. handling of custom operations:

Challenge: Some models have difficulty handling quantization for custom or specific operations.

Solution: Implement quantization support for the custom operations, or exclude the affected operations from quantization so that they continue to run in floating point; a sketch of the latter approach is shown below.
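
The following sketch keeps a quantization-sensitive part of a model in floating point by restricting which module types are handed to PyTorch's dynamic quantizer (the MixedModel architecture is made up purely for illustration):

import torch
import torch.nn as nn

# A hypothetical model mixing module types.
class MixedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, 10)

    def forward(self, x):
        out, _ = self.encoder(x)
        return self.head(out[:, -1, :])

model = MixedModel()
model.eval()

# Passing only nn.Linear in the module set quantizes the classification head while
# leaving the LSTM encoder in float32, e.g. if it turned out to be quantization-sensitive.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)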

5. application to different domains:

Challenge: Post-training quantization is usually most effective in the domain in which the model was trained. Application to different domains may degrade performance.

Solution: Fine-tuning in the target domain or retraining the quantized model on a new data set can improve performance.

Reference Information and Reference Books

For related information, see “General Machine Learning and Data Analysis,” “Small Data Learning, Combining Logic and Machine Learning, Local/Group Learning,” and “Machine Learning with Sparsity.”

Reference books include “Advice for machine learning part 1: Overfitting and High error rate,” “Machine Learning Design Patterns,” “Machine Learning Solutions: Expert techniques to tackle complex machine learning problems using Python,” and “Machine Learning with R.”

1. “Efficient Processing of Deep Neural Networks”
Authors: Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer
– Description: Provides a comprehensive overview of methods for efficiently processing deep learning models, in particular model compression and quantization techniques.

2. “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow”
Author: Aurélien Géron
– Description: Describes how to implement quantization (post-training quantization and quantization-aware training) using TensorFlow.

3. “Neural Network with Model Compression”
– Description: Covers deep learning model compression methods such as network quantization, pruning, and knowledge distillation.

4. “TinyML: Machine Learning with TensorFlow Lite on Arduino and Ultra-Low-Power Microcontrollers”
Authors: Pete Warden and Daniel Situnayake
– Description: Presents practical ways to deploy lightweight models and utilize TensorFlow Lite, including post-training quantization.

5. “Model Compression and Acceleration for Deep Neural Networks”
Authors: Wenming Zhang, Xiangyu Zhang, and Jian Sun
– Description: Systematically explains techniques for making models lighter and faster, including model quantization, pruning, and knowledge distillation.

Related online resources:
– TensorFlow Official Guide on Post-training Quantization
[TensorFlow Lite Model Optimization] – Specific procedures for post-training quantization using TensorFlow.

– PyTorch Documentation
[Quantization in PyTorch] – How to implement post-training quantization in PyTorch.
