Hello everyone, when I tried to run inference with a PyTorch quantized model on the GPU (RTX 3060), I found that the quantized model was much slower than the float model. How can I set things up to accelerate inference for the quantized model?
- quantized model

model

```python
import torch
import torchvision
from tvm import relay

model = torchvision.models.quantization.resnet18(pretrained=True).eval()
pt_inp = torch.rand([32, 3, 224, 224])
quantize_model(model, pt_inp)
script_model = torch.jit.trace(model, pt_inp).eval()
mod, params = relay.frontend.from_pytorch(script_model, [("input", (32, 3, 224, 224))])
```
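(`quantize_model` is not a library function; a minimal sketch of the helper, following the eager-mode post-training static quantization recipe from the TVM PyTorch-frontend tutorial:)

```python
import torch

# Minimal sketch of the quantize_model helper: eager-mode post-training
# static quantization, assuming the torchvision quantizable-model API
# (fuse_model) as in the TVM "compile a PyTorch quantized model" tutorial.
def quantize_model(model, inp):
    model.fuse_model()  # fuse conv + bn + relu blocks
    model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
    torch.quantization.prepare(model, inplace=True)
    model(inp)  # one calibration pass with the sample input
    torch.quantization.convert(model, inplace=True)
```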
inference

```python
import tvm
from tvm.contrib import graph_executor

target = tvm.target.cuda(model="3060", arch="sm_86")
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)
dev = tvm.device(str(target))
module = graph_executor.GraphModule(lib["default"](dev))
print(module.benchmark(dev, number=1, repeat=100))
```
```
Execution time summary:
 mean (ms)   median (ms)   max (ms)    min (ms)    std (ms)
 4259.5160   4265.2159     4279.3836   4232.3964   13.6820
```
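(The `log_file` referenced below was produced with the standard auto-scheduler flow, reusing the `mod`, `params`, and `target` from above; the file name and trial budget here are illustrative, not the exact values:)

```python
from tvm import auto_scheduler

# Extract tuning tasks from the quantized network and tune them;
# records are written to log_file (name and trial count illustrative).
log_file = "resnet18_int8_cuda.json"
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
measure_ctx = auto_scheduler.LocalRPCMeasureContext(repeat=1, min_repeat_ms=300)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=2000,
    runner=measure_ctx.runner,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)
tuner.tune(tune_option)
```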
after tuning

```python
target = tvm.target.cuda(model="3060", arch="sm_86")
with auto_scheduler.ApplyHistoryBest(log_file):
    with tvm.transform.PassContext(opt_level=3, config={"relay.backend.use_auto_scheduler": True}):
        lib = relay.build(mod, target=target, params=params)
dev = tvm.device(str(target))
module = graph_executor.GraphModule(lib["default"](dev))
print(module.benchmark(dev, number=1, repeat=100))
```
```
Execution time summary:
 mean (ms)   median (ms)   max (ms)    min (ms)    std (ms)
 4272.4422   4265.6385     4299.3868   4242.7064   14.4868
```
- float model

model

```python
model = torchvision.models.resnet18(pretrained=True).eval()
pt_inp = torch.rand([32, 3, 224, 224])
script_model = torch.jit.trace(model, pt_inp).eval()
mod, params = relay.frontend.from_pytorch(script_model, [("input", (32, 3, 224, 224))])
```
inference

```python
target = tvm.target.cuda(model="3060", arch="sm_86")
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)
dev = tvm.device(str(target))
module = graph_executor.GraphModule(lib["default"](dev))
print(module.benchmark(dev, number=1, repeat=100))
```
```
Execution time summary:
 mean (ms)   median (ms)   max (ms)    min (ms)   std (ms)
 37.5285     37.5492       40.6122     37.0720    0.4684
```
Moreover, I benchmarked a single convolution layer separately, and the float convolution was still faster than the integer convolution, even after tuning the integer one. A sketch of that comparison follows.
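(A minimal sketch of such a single-conv comparison; the shapes, padding, and layout here are hypothetical, not the exact ones I used. It builds one fp32 and one int8 `conv2d` through `relay.build` and benchmarks them the same way as above:)

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

def bench_conv(dtype):
    # Hypothetical single 3x3 convolution, batch 32, NCHW layout.
    dshape, wshape = (32, 64, 56, 56), (64, 64, 3, 3)
    data = relay.var("data", shape=dshape, dtype=dtype)
    weight = relay.var("weight", shape=wshape, dtype=dtype)
    out_dtype = "int32" if dtype == "int8" else dtype  # int8 conv accumulates in int32
    conv = relay.nn.conv2d(data, weight, padding=(1, 1), out_dtype=out_dtype)
    mod = tvm.IRModule.from_expr(relay.Function([data, weight], conv))
    target = tvm.target.cuda(model="3060", arch="sm_86")
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target)
    dev = tvm.device(str(target))
    module = graph_executor.GraphModule(lib["default"](dev))
    # Feed random inputs of the matching dtype.
    if dtype == "int8":
        module.set_input("data", np.random.randint(-64, 64, dshape).astype(dtype))
        module.set_input("weight", np.random.randint(-64, 64, wshape).astype(dtype))
    else:
        module.set_input("data", np.random.uniform(-1, 1, dshape).astype(dtype))
        module.set_input("weight", np.random.uniform(-1, 1, wshape).astype(dtype))
    print(dtype, module.benchmark(dev, number=1, repeat=100))

bench_conv("float32")  # fp32 baseline
bench_conv("int8")     # int8 convolution
```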