For background on quantization, please read the linked INT8 quantization proposal.
This thread focuses only on the implementation of quantized layers in TVM.
High-level overview
Hardware vendors are adding support for optimized INT8 operations in hardware (Intel (https://software.intel.com/en-us/articles/lower-numerical-precision-deep-learning-inference-and-training), Nvidia (https://devblogs.nvidia.com/mixed-precision-programming-cuda-8/)). To take full advantage of this hardware, we need to generate code that emits these new instructions. In addition, since time-consuming layers like convolution have high data reuse, we also have to find new schedules that efficiently utilize the hardware.
Proposal
My current proposal is to focus on Intel Skylake and resnet-18 for now and complete an end-to-end implementation. We can start with the current optimized TVM convolution schedules and explore how the new instructions change them. Similarly, we can generate quantized implementations for the other layers in resnet-18.
Once the end-to-end implementation is fleshed out, we can add more backends (Nvidia, ARM).
Action Items
There will likely be many design decisions within each step, but this list only covers the high-level action items.
- TOPI - Generate the optimized quantized convolution schedule using the new hardware instructions.
  - Understand how it affects data layout within and across kernels.
  - Intermediate outputs need higher precision (INT32) to avoid overflow. This will require adding support for mixed-precision arithmetic in TVM (see the sketch after this list item).
  - Code generation will rely on LLVM to pattern match to INT8 operations. The Intel LLVM team is currently working on that. We can also look at inline assembly if need be (https://github.com/dmlc/tvm/pull/1486).
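To make the mixed-precision point concrete, here is a minimal sketch in TVM's tensor expression language of an INT8 matrix multiply that accumulates into INT32. The shapes and names are illustrative assumptions, and the exact API calls (tvm.placeholder/tvm.compute) may differ slightly across TVM versions; the real convolution schedule would of course be more involved.

```python
import tvm

# Illustrative shapes: int8 operands, int32 accumulator.
M, K, N = 64, 64, 64
A = tvm.placeholder((M, K), dtype='int8', name='A')
W = tvm.placeholder((K, N), dtype='int8', name='W')
k = tvm.reduce_axis((0, K), name='k')

# Cast the int8 inputs to int32 before the multiply-accumulate so the
# reduction does not overflow; this is the mixed-precision support TVM needs.
C = tvm.compute(
    (M, N),
    lambda i, j: tvm.sum(A[i, k].astype('int32') * W[k, j].astype('int32'), axis=k),
    name='C')

s = tvm.create_schedule(C.op)
print(tvm.lower(s, [A, W, C], simple_mode=True))
```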
- TOPI - Generate the optimized quantized schedules for fully connected, pooling, and relu layers. The goal is to enable quantization on resnet-18.
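As a small example of what one of these layers looks like, below is a sketch of an INT8 ReLU in the same tensor expression style; pooling and fully connected layers would follow the same pattern, with INT32 accumulation wherever sums are involved. Again, the names and API details are assumptions rather than a final design.

```python
import tvm

n = tvm.var('n')
X = tvm.placeholder((n,), dtype='int8', name='X')
# Elementwise max against an int8 zero; no accumulation, so no widening needed.
R = tvm.compute((n,), lambda i: tvm.max(X[i], tvm.const(0, 'int8')), name='relu')
s = tvm.create_schedule(R.op)
```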
- NNVM - Modify the input graph to support quantization - e.g., add input/output quantization layers and use the quantized models instead of the precise ones.
def deploy_quantized_model(sym, quantized_params):
    # Runs the quantized model
    # Inputs:
    #   sym - input network; NNVM modifies the network to support quantized inference
    #   quantized_params - input params that will be quantized
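The input/output quantization layers mentioned above essentially perform a scale-based conversion between FP32 and INT8. The snippet below is a rough NumPy sketch of symmetric quantization to illustrate the arithmetic; the helper names are hypothetical, and the actual scheme (symmetric vs. asymmetric, how scales are calibrated) is a design decision covered in the quantization proposal.

```python
import numpy as np

def quantize(x_fp32):
    # Symmetric quantization: map [-max|x|, max|x|] onto [-127, 127].
    scale = np.abs(x_fp32).max() / 127.0
    x_int8 = np.clip(np.round(x_fp32 / scale), -127, 127).astype(np.int8)
    return x_int8, scale

def dequantize(x_int8, scale):
    return x_int8.astype(np.float32) * scale

x = np.random.randn(4, 4).astype(np.float32)
q, s = quantize(x)
print(np.abs(dequantize(q, s) - x).max())  # small quantization error
```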
Comments/suggestions are welcome.