What is Quantization in deep learning?
Many industries are keen to apply the label “AI” to their products and services, and real-time applications increasingly require AI models to run directly on devices. Deploying AI/ML models at the edge, however, is challenging. Most models are trained in 32-bit floating-point arithmetic, and running them unchanged on edge devices can hurt performance: laggy real-time responses, long prediction times, and ultimately a poor user experience. A common remedy is to use reduced precision, typically 8-bit integers. But naively rounding the weights after training can lower accuracy, especially when the weights have a wide dynamic range. This is where quantization comes in.
Quantization in deep learning is a technique for performing computations and storing tensors at lower bit-widths than floating-point precision. This allows the use of high-performance vectorized operations on many hardware platforms.
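To make the idea concrete, here is a minimal NumPy sketch of affine (asymmetric) quantization, mapping a float32 tensor onto 8-bit integers with a scale and a zero point. The function names are my own for illustration; real frameworks handle this step for you.

```python
import numpy as np

def quantize_to_int8(x: np.ndarray):
    """Affine (asymmetric) quantization of a float32 tensor to int8 (illustrative sketch)."""
    qmin, qmax = -128, 127
    x_min, x_max = float(x.min()), float(x.max())
    # The scale maps the observed float range onto the 256 available integer levels.
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Map int8 values back to approximate float32 values."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(4).astype(np.float32)
q, scale, zp = quantize_to_int8(x)
print(x)
print(dequantize(q, scale, zp))  # close to x, but not identical: rounding error
```

The small gap between the original and the dequantized values is exactly the rounding error that becomes problematic when weights span a wide dynamic range.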
There are two main forms of quantization: post-training quantization and quantization-aware training. Post-training quantization is a conversion technique that can reduce model size and improve CPU and hardware-accelerator latency with little degradation in model accuracy, and it reduces the computational resources required for inference. Quantization-aware training, in contrast, models the quantization effect during training itself.
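As a quick illustration, here is a minimal PyTorch sketch of post-training dynamic quantization. The tiny model is just a stand-in for a real network, and the exact API location may vary between PyTorch versions.

```python
import torch
import torch.nn as nn

# A tiny float32 model standing in for a real network.
model_fp32 = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

# Post-training dynamic quantization: weights of the listed layer types are
# converted to int8; activations are quantized on the fly at inference time.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(model_int8(x).shape)  # same interface, smaller weights, faster CPU inference
```

Note that no retraining is involved here, which is what makes post-training quantization so convenient; quantization-aware training would instead simulate the quantization error inside the training loop.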
Although quantization has many benefits, it also comes with several challenges. Some of them are as follows.
- Significant accuracy loss in some models: Even after careful quantization, accuracy may drop by around 1%, which is generally acceptable by industry standards. However, some models, such as BERT, can see an accuracy drop of more than 1%.
- Backpropagation becomes infeasible: When weights are quantized, backpropagation becomes difficult because gradients cannot flow through discrete, rounded values. In that case, approximation methods such as the straight-through estimator can be used to estimate the gradients of the loss function (see the sketch after this list).
- Quantized weights make models hard to converge: When weights are quantized, models find it harder to converge during training. A lower learning rate is usually needed to get good performance.
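The straight-through estimator mentioned above can be sketched with a custom autograd function that rounds in the forward pass but lets gradients pass through unchanged in the backward pass. This is an illustrative toy, not a production quantization-aware-training implementation.

```python
import torch

class FakeQuantizeSTE(torch.autograd.Function):
    """Round values in the forward pass, pass gradients straight through in the backward pass."""

    @staticmethod
    def forward(ctx, x, scale):
        # Simulated int8 quantize/dequantize ("fake quantization").
        return torch.clamp(torch.round(x / scale), -128, 127) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: treat rounding as if it were the identity,
        # so the gradient of the loss can still reach the float weights.
        return grad_output, None

w = torch.randn(5, requires_grad=True)
w_q = FakeQuantizeSTE.apply(w, 0.05)
loss = (w_q ** 2).sum()
loss.backward()
print(w.grad)  # non-zero despite the non-differentiable rounding step
```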
To understand more about this, let’s take a deep dive into the mathematical concepts in my next article. Please hit the clap icon below if you enjoyed this article. :)