Dear community: I load my ONNX model using `tvm.relay.frontend.from_onnx`, then convert it to int8 with the following code:
```python
with relay.quantize.qconfig():
    mod = relay.quantize.quantize(mod, params)
```
The execution time is 25 ms, whereas the fp32 model takes 40 ms. But if I quantize the model with the following code instead:
```python
mod = relay.quantize.quantize(mod, params)
```
the execution time becomes 40 ms, slower than quantizing inside the `with relay.quantize.qconfig():` context.
I then tried loading a pre-quantized model from ONNX Runtime, and its performance is the same as the quantized model produced without the qconfig context.
So my questions are:
1. Why do the performances differ? Both paths should be using the default quantize config.
2. What does `with relay.quantize.qconfig():` actually do? And if I want to load a pre-quantized model, how can I get better performance?