There have been a few discussions around automatic FP16 downcasting, but there is no implementation yet. We would like to propose an RFC for this topic.
The idea is pretty straightforward: for an fp32 model, we downcast the input and all the parameters to fp16. The output then comes out as fp16, so we upcast it back to fp32 at the end.
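To make the idea concrete, here is a minimal Python sketch of such a pass on Relay (this is not the draft PR itself; the helper name `downcast_fp16` and the reliance on type inference to propagate fp16 through the body are our assumptions):

```python
import tvm
from tvm import relay


def downcast_fp16(func, params):
    """Cast fp32 inputs/params of a Relay function to fp16, upcast the output to fp32."""
    # Recreate every fp32 input variable as an fp16 variable (assumes the
    # variables carry type annotations, as frontend-imported models do).
    new_vars = {}
    for var in func.params:
        ty = var.type_annotation
        dtype = "float16" if ty.dtype == "float32" else ty.dtype
        new_vars[var] = relay.var(var.name_hint, shape=ty.shape, dtype=dtype)

    # Rebind the body to the fp16 variables; type inference then propagates
    # fp16 through the rest of the graph when the function is compiled.
    body = relay.bind(func.body, new_vars)
    # Upcast the final output back to fp32.
    body = relay.cast(body, "float32")
    new_func = relay.Function(list(new_vars.values()), body)

    # Downcast the parameter tensors themselves.
    new_params = {k: tvm.nd.array(v.asnumpy().astype("float16"))
                  for k, v in params.items()}
    return new_func, new_params
```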
I already have a draft PR for this here. Currently the pass is written in Python. We have tested it with several models in the Gluon model zoo and the accuracy looks good so far. The next steps are to expand the model coverage by handling some edge cases and to re-implement the pass in C++.
We are planning to reuse the existing auto-quantization pass to downcast to FP16 (or bfloat16, etc.). We will investigate whether we need target-aware transformation infra, or whether this can be represented using the QConfig; see the sketch below.
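Purely as an illustration of the QConfig option: the float16 dtypes below are not accepted by the current `relay.quantize.qconfig`, so this only shows the shape a reused interface might take.

```python
from tvm import relay


def downcast_with_qconfig(mod, params):
    # Hypothetical: float16 dtypes are NOT supported by today's quantize pass;
    # every value below is an assumption about a possible extended interface.
    with relay.quantize.qconfig(nbit_input=16,                 # assumption
                                nbit_weight=16,                # assumption
                                dtype_input="float16",         # assumption
                                dtype_weight="float16",        # assumption
                                dtype_activation="float16"):   # assumption
        return relay.quantize.quantize(mod, params=params)
```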
This proposal has been discussed with @janimesh.
The work will be separated into two stages: the first stage will focus on accuracy and the second on performance.
Stage 1: Framework Coverage + Model Coverage + Accuracy
This stage focuses on the accuracy of the downcasted fp16 graph. We need to expand the model coverage across multiple model zoos. Currently we have tested with ResNet, VGG, and MobileNet from the Gluon model zoo. We are planning to expand the coverage to the GluonCV object detection models and the image classification models in the TensorFlow model zoo. We will use Intel machines to get the accuracy numbers. The goal is to check the robustness of the downcast pass.
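For reference, the kind of accuracy check we have in mind looks roughly like the following, reusing the `downcast_fp16` sketch above; the model, input shape, and `llvm` target are just examples.

```python
import numpy as np
import mxnet as mx
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

# Import a Gluon model zoo model into Relay.
block = mx.gluon.model_zoo.vision.get_model("resnet18_v1", pretrained=True)
mod, params = relay.frontend.from_mxnet(block, shape={"data": (1, 3, 224, 224)})
data = np.random.uniform(size=(1, 3, 224, 224)).astype("float32")


def run(func, params, data, dtype):
    graph, lib, new_params = relay.build(func, target="llvm", params=params)
    m = graph_runtime.create(graph, lib, tvm.cpu())
    m.set_input("data", data.astype(dtype))
    m.set_input(**new_params)
    m.run()
    return m.get_output(0).asnumpy().astype("float32")


# Compare the fp32 output against the downcasted fp16 output.
out_fp32 = run(mod["main"], params, data, "float32")
fp16_func, fp16_params = downcast_fp16(mod["main"], params)
out_fp16 = run(fp16_func, fp16_params, data, "float16")
print("max abs diff:", np.abs(out_fp32 - out_fp16).max())
```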
Stage 2: Performance Improvement for NVIDIA GPUs
This stage focuses on performance. We are targeting CUDA GPUs and ARM CPUs with float16 support. Currently we are hitting some errors during CUDA codegen, as in [ERROR]CUDA compilation error.
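Concretely, targeting CUDA only changes the build target relative to the accuracy sketch above (illustrative only):

```python
# Illustrative: build the downcasted graph from the sketch above for CUDA.
graph, lib, new_params = relay.build(fp16_func, target="cuda", params=fp16_params)
```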
In the future, we can also target ARM devices that have native FP16 support.
Any comments or thoughts are welcome!