Recently, when I was running an int8-quantized DeepLab v3 model (imported from ONNX) with the OpenCL target, I noticed that the inference time was quite long (~7 s/image on an NVIDIA T4). I was using the FakeQuantizationToInteger pass to convert as many Relay ops to QNN ops as possible.
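For reference, this is roughly how I import and convert the model (the file name and input shape below are placeholders for my actual setup):

```python
import onnx
import tvm
from tvm import relay

# Placeholder file name and input shape; substitute your own.
onnx_model = onnx.load("deeplabv3_int8.onnx")
mod, params = relay.frontend.from_onnx(onnx_model, shape={"input": (1, 3, 513, 513)})

# Rewrite the quantize/dequantize (QDQ) subgraphs from the ONNX model
# into QNN ops wherever the pass can match them.
mod = tvm.transform.Sequential(
    [
        relay.transform.InferType(),
        relay.transform.FakeQuantizationToInteger(),
    ]
)(mod)
```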
I changed the ops specified for tuning from ‘nn.conv2d’ to ‘qnn.conv2d’, but autotvm.task.extract_from_program(…) then returned no tasks.
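This is a minimal sketch of the extraction call I am using (the target string matches my OpenCL setup; `mod` and `params` come from the import step above):

```python
from tvm import autotvm

target = tvm.target.Target("opencl")

# With ops=(relay.op.get("nn.conv2d"),) this returned tasks as expected;
# with qnn.conv2d the returned task list is empty.
tasks = autotvm.task.extract_from_program(
    mod["main"],
    target=target,
    params=params,
    ops=(relay.op.get("qnn.conv2d"),),
)
print(len(tasks))  # prints 0 for me
```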
I am trying to use AutoTVM to tune my resnet50 model, which was int8-quantized with TFLite. I specified qnn.conv2d in the tuning ops, but none of the conv layers are picked up during tuning. The tuning progress simply jumps to 50% and then to 100% completion, with only 2 tasks extracted in total, against the usual 22 for resnet.
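For context, a minimal sketch of my setup (the file name, input names, shapes, and the llvm target are placeholders for my actual configuration):

```python
import tflite
import tvm
from tvm import autotvm, relay

# Placeholder file/input names; my model is an int8-quantized resnet50.
buf = open("resnet50_int8.tflite", "rb").read()
tflite_model = tflite.Model.GetRootAsModel(buf, 0)
mod, params = relay.frontend.from_tflite(
    tflite_model,
    shape_dict={"input": (1, 224, 224, 3)},
    dtype_dict={"input": "int8"},
)

tasks = autotvm.task.extract_from_program(
    mod["main"],
    target="llvm",  # placeholder target
    params=params,
    ops=(relay.op.get("qnn.conv2d"),),
)
print(len(tasks))  # only 2 tasks here, versus the usual 22 with nn.conv2d
```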
Have you found any resource with a fix for this? I have searched, but without success.