Right, my point is that the challenge you're running into comes from the quantization framework translating normal Relay ops into affine space (sometimes as multiple Relay ops). You then have to match the affine-space version of the Relay op that the framework created, which is tricky. What you really want is to know what the QParams are and just offload the original Relay op, without worrying about what its affine-space version looks like.
I’m not sure what the best way to solve this is, though.
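To make the matching problem concrete, here is a minimal plain-Python sketch of what "translating an op into affine space" can look like for a single sigmoid. It is not what the framework actually emits; the helper names (`dequantize`, `quantize`, `sigmoid_affine`) and the QParams values are assumptions chosen for illustration. The point is only that one original op turns into several steps, and a backend would have to recognize that whole sequence.

```python
import math


def dequantize(q, scale, zero_point):
    # Affine mapping from the integer domain back to real values.
    return scale * (q - zero_point)


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    # Inverse mapping, rounded and clipped to the int8 range.
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def sigmoid_affine(q, in_scale, in_zp, out_scale, out_zp):
    # One "sigmoid" becomes three steps once it is expressed in affine space.
    x = dequantize(q, in_scale, in_zp)
    y = 1.0 / (1.0 + math.exp(-x))
    return quantize(y, out_scale, out_zp)


# Hypothetical QParams: input scale 0.05 / zero point 0,
# output scale 1/256 / zero point -128.
result = sigmoid_affine(0, 0.05, 0, 1 / 256, -128)
```

A backend with a native quantized sigmoid only needs the four QParams and the original op, but with this kind of lowering it has to pattern-match the dequantize, sigmoid, quantize chain to recover them.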
You could make a ton more symbolic QNN ops that store the QParams directly, but then you end up in a situation where you need to make a QNN op corresponding to most Relay ops, which doesn’t make a ton of sense.
Or we could do something like insert `qnn.requantize` ops, change the dtype of all the intermediate ops to `int8`, and annotate all the intermediate ops with their QParams, so you could match those ops directly and offload them. This graph wouldn’t be correct on its own, because Relay ops like `sigmoid` wouldn’t take QParams into account, but it wouldn’t matter because you’d just replace them with your kernel, which does take the QParams into account.
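As a rough sketch of that second idea, the toy code below models an annotated node and an offload step in plain Python. None of this is TVM API: `Node`, `KERNELS`, `offload`, and `qsigmoid_kernel` are hypothetical names, and the QParams are made up. It only shows that once the original op carries its QParams, matching is a simple lookup on the op name, and the replacement kernel is what makes the result numerically correct.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

QParams = Tuple[float, int]  # (scale, zero_point)


@dataclass
class Node:
    # Hypothetical annotated graph node: the original op plus its QParams.
    op: str
    dtype: str
    in_qparams: Optional[QParams] = None
    out_qparams: Optional[QParams] = None


def qsigmoid_kernel(q, in_qparams, out_qparams):
    # Stand-in for a backend kernel that does take the QParams into account.
    in_scale, in_zp = in_qparams
    out_scale, out_zp = out_qparams
    y = 1.0 / (1.0 + math.exp(-in_scale * (q - in_zp)))
    return max(-128, min(127, round(y / out_scale) + out_zp))


# Hypothetical table of ops the backend can run natively in int8.
KERNELS = {"sigmoid": qsigmoid_kernel}


def offload(node):
    # Match directly on the original op; no affine-space pattern is needed.
    annotated = node.in_qparams is not None and node.out_qparams is not None
    if node.dtype == "int8" and node.op in KERNELS and annotated:
        kernel = KERNELS[node.op]
        return lambda q: kernel(q, node.in_qparams, node.out_qparams)
    return None  # leave the node for the default lowering


node = Node("sigmoid", "int8", in_qparams=(0.05, 0), out_qparams=(1 / 256, -128))
run = offload(node)
```

Here `run(0)` goes through the QParams-aware kernel, whereas evaluating a plain `sigmoid` on the raw int8 value would ignore the scales and zero points entirely, which is why the annotated graph is only meaningful after the offload has happened.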