Hi, TVM community!
Recently I was running an int8-quantized DeepLab v3 model (imported from ONNX) with the OpenCL target, and I noticed that the inference time was quite long (~7 s/image on an NVIDIA T4). I used the FakeQuantizationToInteger pass to convert as many Relay ops to QNN ops as possible.
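For reference, this is roughly what I did (simplified; the model path, input name, and shape below are placeholders, not my exact setup):

```python
import onnx
import tvm
from tvm import relay

# Load the fake-quantized (QDQ-style) DeepLab v3 model exported to ONNX
onnx_model = onnx.load("deeplabv3_int8.onnx")          # placeholder path
shape_dict = {"input": (1, 3, 512, 512)}                # placeholder name/shape
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Rewrite fake-quantized subgraphs into real QNN (int8) ops
mod = relay.transform.InferType()(mod)
mod = relay.transform.FakeQuantizationToInteger()(mod)

# Build for OpenCL and run on the GPU
target = tvm.target.Target("opencl", host="llvm")
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)
```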
I tried tuning with autotvm (following Auto-tuning a Convolutional Network for NVIDIA GPU — tvm 0.14.dev0 documentation), but it didn’t improve performance.
I then changed the ops specified for tuning from ‘nn.conv2d’ to ‘qnn.conv2d’, but autotvm.task.extract_from_program(…) returned no tasks.
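This is the kind of call I made (continuing from the snippet above, paraphrased from the tutorial; only the ops tuple was changed):

```python
from tvm import autotvm

# With ops=(relay.op.get("nn.conv2d"),) tasks are extracted as in the tutorial,
# but with qnn.conv2d the returned list is empty.
tasks = autotvm.task.extract_from_program(
    mod["main"],
    target=target,
    params=params,
    ops=(relay.op.get("qnn.conv2d"),),
)
print(len(tasks))  # prints 0
```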
Judging from a past question (Autotvm.task_extract_from_program in TFLite - #18 by anijain2305), it seems autotvm did not support tuning for int8 at that time.
Are qnn ops still not supported by autotvm?