Lots of interesting thoughts here. Overall, it seems the main pain point is that it's really hard to match quantized operations for BYOC or similar flows. I do think a more "unified" Relay representation is the way to address this, and this work can certainly lay the foundation for it. Here are my thoughts:
I think a major issue with quantized operations vs. non-quantized operations in general is how much rounding matters. If you lose 1 out of 4 bits of information, it can be really significant. Therefore, implementation details matter a lot more than in the FP32 case, because they can change how rounding is done and therefore affect the semantics of the operation in a more meaningful way. As an example, we can imagine doing a quantized convolution and bias-add either by taking the 32-bit accumulation buffer of the convolution and using that for the bias-add, or by downsampling the accumulation buffer to 8 bits first and using that for the bias-add. Obviously the first one is preferable, but maybe you have hardware which can only do the second. We therefore have to be able to support both in QNN.
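To make the rounding point concrete, here's a minimal NumPy sketch (not TVM code, and the numbers are made up) of the two bias-add strategies above. The only difference is where the downcast to 8 bits happens, yet the results can differ in the low bits:

```python
import numpy as np

def requantize(acc_int32, scale_in, scale_out):
    # Round-to-nearest requantization from int32 to int8 (one of several
    # rounding conventions a backend might use).
    return np.clip(np.round(acc_int32 * (scale_in / scale_out)), -128, 127).astype(np.int8)

acc = np.array([1010, -517, 42], dtype=np.int32)   # conv2d accumulation buffer
bias = np.array([10, -7, 3], dtype=np.int32)       # bias, in the accumulator's scale
scale_acc, scale_out = 0.02, 0.5

# Option 1: add the bias in the 32-bit accumulator, requantize once at the end.
out_wide = requantize(acc + bias, scale_acc, scale_out)

# Option 2: requantize the accumulator to 8 bits first, then add a requantized bias.
out_narrow = (requantize(acc, scale_acc, scale_out)
              + requantize(bias, scale_acc, scale_out)).astype(np.int8)

print(out_wide, out_narrow)   # [41 -21 2] vs. [40 -21 2]: the low bits differ
```

Both are reasonable implementations of "quantized conv2d + bias-add", but they are not the same mathematical operation.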
The main point is that while conv2d represents a well-defined mathematical operation, qnn.conv2d really needs to represent a family of mathematical operations, each of which approximates conv2d in a different way. I believe what we're running into right now is that qnn.conv2d is very specific and doesn't provide enough knobs to change the semantics of the operation.
Keep in mind that I'm not familiar with a lot of the examples that @mbaret gives, but it seems to me that many of these problematic patterns have to do with getting things to the correct input types for QNN ops. We can easily imagine a world where these QNN ops accept essentially any input type, and the casts to the correct type happen internally when things are lowered. In the case of conv2d, we might imagine a conv2d-bias-add block with knobs exposed that specify how the add after the conv2d is performed. We then wouldn't have these scattered requantize, cast, etc. nodes, which might make the pattern matching for BYOC easier.
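To illustrate why the scattered form is painful for BYOC, here's a rough sketch of what matching today's qnn.conv2d → bias_add → requantize chain looks like, modeled loosely on the pattern tables under python/tvm/relay/op/contrib/ (exact argument counts depend on the QNN op signatures in your TVM version), versus what a single composite op would need. The op name "qnn.conv2d_bias" in the comment is hypothetical:

```python
from tvm.relay.dataflow_pattern import is_constant, is_op, wildcard

def qnn_conv2d_bias_pattern():
    # Every implementation detail that leaks into the graph (bias add,
    # requantize, possible casts) becomes another node the pattern must cover.
    conv = is_op("qnn.conv2d")(
        wildcard(), is_constant(),      # data, weights
        is_constant(), is_constant(),   # input / kernel zero points
        is_constant(), is_constant(),   # input / kernel scales
    )
    bias = is_op("nn.bias_add")(conv, is_constant())
    return is_op("qnn.requantize")(
        bias, is_constant(), is_constant(), is_constant(), is_constant()
    )

# With a composite op along the lines suggested above, say a hypothetical
# "qnn.conv2d_bias" whose attributes specify how the bias add and downcast
# are performed, the same pattern collapses to a single node:
#
#   is_op("qnn.conv2d_bias")(wildcard(), is_constant(), is_constant(), ...)
#
# ("qnn.conv2d_bias" is not an existing TVM op; it just names the idea.)
```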
I know fused operator nodes aren't really very relay-y, but then again QNN isn't normal Relay since, as mentioned before, qnn.conv2d really needs to represent a lot of different operations. The potential downside is an explosion of fused operator nodes. However, I'd argue that every fused operator node is just a case with special implementation details that we would have to deal with anyway.
Basically, it seems that if we want nice pattern matching on the QNN graph, we have to avoid leaking implementation details into the QNN Relay graph. We still need to specify these implementation details somewhere, so we do it by adding new parameters to QNN ops or by creating new QNN symbolic ops.
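As a concrete example of the "new parameters" route that already exists today: qnn.requantize carries its rounding convention as an attribute, so two targets that round differently produce structurally identical graphs and match the same BYOC pattern. A minimal sketch, assuming the current relay.qnn.op.requantize signature:

```python
from tvm import relay

data = relay.var("data", shape=(1, 8), dtype="int32")

def make_requantize(rounding):
    # The rounding convention is an attribute on the op, not extra graph structure.
    return relay.qnn.op.requantize(
        data,
        input_scale=relay.const(0.02, "float32"),
        input_zero_point=relay.const(0, "int32"),
        output_scale=relay.const(0.5, "float32"),
        output_zero_point=relay.const(0, "int32"),
        rounding=rounding,
        out_dtype="int8",
    )

# Same graph shape, different implementation detail: a pattern that matches one
# of these matches the other, and each backend reads the attribute it cares about.
upward = relay.Function([data], make_requantize("UPWARD"))
nearest = relay.Function([data], make_requantize("TONEAREST"))
```

A knob like the accumulator-width choice in the conv2d-bias-add example could be expressed the same way, either as an attribute on a composite op or as a new symbolic QNN op.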