This looks like great work, thanks for the RFC!
I agree that it’s very valuable for there to be a stage in the Relay lowering where the ‘QNN-ness’ is explicit - we’ve got both hardware and performance libraries which accelerate quantized operators specifically.
One thing that complicates matching QNN operators, though, is that they’re represented inconsistently. For instance, for QNN convolution we must match qnn.conv2d -> bias_add -> qnn.requantize, whereas for something like sigmoid we must instead match qnn.dequantize -> sigmoid -> qnn.quantize. This broadly corresponds to the difference between QNN ops that have ‘native int8’ support and those which are faked through fp32.
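To make the mismatch concrete, here is a toy sketch in plain Python. It is not TVM's pattern-matching API: a subgraph is modelled as a flat list of operator names, and `find_pattern` is a hypothetical helper written just for this illustration. The point is only that a backend needs two differently shaped patterns to cover the two kinds of QNN op.

```python
# Toy model: a linear chain of operator names stands in for a Relay subgraph.
# These tuples are illustrative only; they are not TVM pattern objects.
NATIVE_INT8_CONV = ("qnn.conv2d", "bias_add", "qnn.requantize")
FAKED_SIGMOID = ("qnn.dequantize", "sigmoid", "qnn.quantize")


def find_pattern(chain, pattern):
    """Return the start index of the first occurrence of pattern in chain, or -1."""
    n = len(pattern)
    for i in range(len(chain) - n + 1):
        if tuple(chain[i:i + n]) == pattern:
            return i
    return -1


graph = [
    "qnn.quantize",
    "qnn.conv2d", "bias_add", "qnn.requantize",   # 'native int8' form
    "qnn.dequantize", "sigmoid", "qnn.quantize",  # faked-through-fp32 form
]

conv_at = find_pattern(graph, NATIVE_INT8_CONV)
sigmoid_at = find_pattern(graph, FAKED_SIGMOID)
```

Each quantized operator ends up needing its own pattern in one of these two shapes, which is the inconsistency I mean.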
So with regard to your suggestions about how we can do pattern-based rewriting, I wonder if we could consider a 2-stage rewrite: a first pass which would turn convolution into ‘faked int8’ convolution (qnn.dequantize -> nn.conv2d -> qnn.quantize), and then a second pass which rewrites that into the proper int8 quantized convolution (skipping qnn.conv2d). The first form would be a good target for hardware off-loading and the second might avoid some of the repetition you’ve described.
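A rough sketch of the 2-stage idea, again as a toy in plain Python rather than real Relay passes. The `rewrite` helper, the two stage functions, and the `lowered_int8_conv2d` name are all hypothetical placeholders I made up for illustration; in particular, what the second pass actually emits is left open here.

```python
def rewrite(chain, pattern, replacement):
    """Replace every non-overlapping occurrence of pattern in chain (single scan)."""
    out, i, n = [], 0, len(pattern)
    while i < len(chain):
        if tuple(chain[i:i + n]) == tuple(pattern):
            out.extend(replacement)
            i += n
        else:
            out.append(chain[i])
            i += 1
    return out


FAKED_INT8_CONV = ("qnn.dequantize", "nn.conv2d", "qnn.quantize")


def to_faked_int8(chain):
    # Stage 1: wrap the fp32 convolution in dequantize/quantize.
    return rewrite(chain, ("nn.conv2d",), FAKED_INT8_CONV)


def to_native_int8(chain):
    # Stage 2: collapse the faked form into a real int8 convolution.
    # "lowered_int8_conv2d" is a made-up placeholder, not a TVM operator.
    return rewrite(chain, FAKED_INT8_CONV, ("lowered_int8_conv2d",))


fp32_graph = ["nn.conv2d", "nn.relu"]
stage1 = to_faked_int8(fp32_graph)
stage2 = to_native_int8(stage1)
```

In this toy, `stage1` is the form a hardware back end would pattern-match for off-loading, and `stage2` is what everything else would compile.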