I understand it now. I looked into the TF and MXNet frameworks earlier; both like to keep a separate requantize operator. I think the reason is that requantize is used very often, and TF's quantized_conv2d does not have a requantize inside.
In your case, I think the NPU quantized_conv2d requires “qnn.conv2d + requantize” wrapped up together; maybe we want to call it a fused operator. I think there are multiple ways to do this:
- Thinking from a HW accelerator viewpoint, the accelerator will have requirements that certain ops be fused. This fused operator then goes through the 3rd-party compiler, which might use TVM or its own codegen. So, we can write a pattern detector that finds the sequence of Relay operators and replaces it with an accelerator-friendly fused operator (see the sketch below the list). The con of this approach is that pattern detection can be difficult, because the IR has already become too low-level.
- The other option is to create another dialect for your NPU, with a new operator called NPU.conv2d. You can hand this operator to the 3rd-party codegen directly. If you want TVM codegen instead, it can lower to “qnn.conv2d + qnn.requantize”, which is further lowered to pure-Relay ops (see the sketch at the end).
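To make the first option concrete, here is a minimal sketch using TVM's dataflow pattern matcher and the MergeComposite pass (both available in recent TVM). It wraps each qnn.conv2d + qnn.requantize sequence into a "Composite" function that a 3rd-party codegen can claim; the composite name npu.quantized_conv2d is hypothetical:

```python
from tvm import relay
from tvm.relay.dataflow_pattern import is_op, wildcard

def qnn_conv2d_requantize_pattern():
    # qnn.conv2d takes data, weight, and the input/kernel zero points and scales
    conv = is_op("qnn.conv2d")(wildcard(), wildcard(), wildcard(),
                               wildcard(), wildcard(), wildcard())
    # qnn.requantize takes the conv output plus in/out scales and zero points
    return is_op("qnn.requantize")(conv, wildcard(), wildcard(),
                                   wildcard(), wildcard())

def merge_npu_patterns(mod):
    # Each match is wrapped in a function tagged with the composite name,
    # which the accelerator's codegen can later offload as one fused op.
    pattern_table = [("npu.quantized_conv2d", qnn_conv2d_requantize_pattern())]
    return relay.transform.MergeComposite(pattern_table)(mod)
```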
If we are prototyping, then the 2nd option might be faster. The first option requires serious changes in the Graph Fusion Relay pass.
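For the second option, here is a sketch of what the TVM-codegen lowering could look like. The helper name and the plain-Python-number quantization parameters are assumptions for illustration; a real dialect op would register something like this as its legalization/canonicalization hook:

```python
from tvm import relay

# Hypothetical lowering for an NPU.conv2d dialect op: when targeting TVM's own
# codegen, expand it into qnn.conv2d + qnn.requantize. QNN's canonicalization
# then lowers both into pure-Relay integer ops.
def lower_npu_conv2d(data, weight, in_scale, in_zp, w_scale, w_zp,
                     out_scale, out_zp, kernel_size, channels):
    conv = relay.qnn.op.conv2d(
        data, weight,
        input_zero_point=relay.const(in_zp, "int32"),
        kernel_zero_point=relay.const(w_zp, "int32"),
        input_scale=relay.const(in_scale, "float32"),
        kernel_scale=relay.const(w_scale, "float32"),
        kernel_size=kernel_size,
        channels=channels)
    # qnn.conv2d accumulates in int32 with scale in_scale * w_scale and zero
    # point 0 (per-tensor case); requantize maps that accumulator back to the
    # op's output quantization parameters.
    return relay.qnn.op.requantize(
        conv,
        input_scale=relay.const(in_scale * w_scale, "float32"),
        input_zero_point=relay.const(0, "int32"),
        output_scale=relay.const(out_scale, "float32"),
        output_zero_point=relay.const(out_zp, "int32"),
        out_dtype="int8")
```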