Dear community: I load my ONNX model using `tvm.relay.frontend.from_onnx`, then convert it to int8 with the following code:
```python
with relay.quantize.qconfig():
    mod = relay.quantize.quantize(mod, params)
```
The execution time is 25 ms, whereas the fp32 model takes 40 ms. But if I quantize the model with the following code instead:
```python
mod = relay.quantize.quantize(mod, params)
```
the execution time becomes 40 ms, slower than quantizing inside the `with relay.quantize.qconfig():` context.
I then tried loading a pre-quantized model from ONNX Runtime, and its performance is the same as the quantized model produced without the qconfig context.
So my questions are:
1. Why do the performances differ? Both paths should be using the default quantize config.
2. What does `with relay.quantize.qconfig():` actually do? And if I want to load a pre-quantized model, how can I get better performance?