Quantized models are slower than float models on GPUs

Hello everyone, when I ran inference with a PyTorch quantized model on a GPU (RTX 3060), I found that the quantized model was much slower than the float model. How can I configure the build to speed up inference for the quantized model?

  • quantized model

model

import torch
import torchvision
from tvm import relay

model = torchvision.models.quantization.resnet18(pretrained=True).eval()
pt_inp = torch.rand([32, 3, 224, 224])
quantize_model(model, pt_inp)  # quantization helper, sketched below
script_model = torch.jit.trace(model, pt_inp).eval()
tvm_model, params = relay.frontend.from_pytorch(script_model, [("input", (32, 3, 224, 224))])
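
For completeness, here is a minimal sketch of a quantize_model helper along these lines (eager-mode post-training static quantization; the fbgemm qconfig is just one possible choice):

def quantize_model(model, inp):
    # Post-training static quantization in eager mode:
    # fuse conv/bn/relu, attach a qconfig, calibrate, then convert to int8.
    model.fuse_model()
    model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
    torch.quantization.prepare(model, inplace=True)
    model(inp)  # calibration pass with a representative input
    torch.quantization.convert(model, inplace=True)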

inference

import tvm
from tvm.contrib import graph_executor

target = tvm.target.cuda(model="3060", arch="sm_86")
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(tvm_model, target=target, params=params)  # build the imported Relay module, not the PyTorch model
dev = tvm.device(str(target))
module = graph_executor.GraphModule(lib["default"](dev))
print(module.benchmark(dev, number=1, repeat=100))

Execution time summary:
 mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
 4259.5160    4265.2159    4279.3836    4232.3964     13.6820   

after tuning

from tvm import auto_scheduler

target = tvm.target.cuda(model="3060", arch="sm_86")
with auto_scheduler.ApplyHistoryBest(log_file):
    with tvm.transform.PassContext(opt_level=3, config={"relay.backend.use_auto_scheduler": True}):
        lib = relay.build(tvm_model, target=target, params=params)
dev = tvm.device(str(target))
module = graph_executor.GraphModule(lib["default"](dev))
print(module.benchmark(dev, number=1, repeat=100))

Execution time summary:
 mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
 4272.4422    4265.6385    4299.3868    4242.7064     14.4868
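
(The log_file used above was produced by an earlier auto_scheduler tuning run; a minimal sketch of that step, with the file name and trial budget as placeholders:)

log_file = "resnet18_int8_cuda.json"  # placeholder name
tasks, task_weights = auto_scheduler.extract_tasks(tvm_model["main"], params, target)
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=2000,  # placeholder trial budget
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)
tuner.tune(tune_option)
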
  • float model

model

model = torchvision.models.resnet18(pretrained=True).eval()
pt_inp = torch.rand([32, 3, 224, 224])
script_model = torch.jit.trace(model, pt_inp).eval()
tvm_model, params = relay.frontend.from_pytorch(script_model, [("input", (32, 3, 224, 224))])

inference

target = tvm.target.cuda(model="3060", arch="sm_86")
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(tvm_model, target=target, params=params)
dev = tvm.device(str(target))
module = graph_executor.GraphModule(lib["default"](dev))
print(module.benchmark(dev, number=1, repeat=100))

Execution time summary:
 mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
  37.5285      37.5492      40.6122      37.0720       0.4684

Moreover, I tested a single convolution layer on its own, and the float convolution was still faster than the integer convolution, even after tuning the integer version (a sketch of how such a single-layer comparison can be set up is below).
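
A minimal sketch of a single-layer float vs. int8 comparison in Relay (the layer shape here is a placeholder, not the exact layer that was tested):

import numpy as np

def bench_conv(dtype, out_dtype):
    # Build a standalone conv2d workload in the given dtype and benchmark it.
    data = relay.var("data", shape=(32, 64, 56, 56), dtype=dtype)
    weight = relay.var("weight", shape=(64, 64, 3, 3), dtype=dtype)
    conv = relay.nn.conv2d(data, weight, padding=(1, 1), out_dtype=out_dtype)
    mod = tvm.IRModule.from_expr(relay.Function([data, weight], conv))
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=tvm.target.cuda(model="3060", arch="sm_86"))
    dev = tvm.cuda(0)
    m = graph_executor.GraphModule(lib["default"](dev))
    if dtype == "int8":
        m.set_input("data", np.random.randint(-128, 128, size=(32, 64, 56, 56)).astype("int8"))
        m.set_input("weight", np.random.randint(-128, 128, size=(64, 64, 3, 3)).astype("int8"))
    else:
        m.set_input("data", np.random.uniform(-1, 1, size=(32, 64, 56, 56)).astype(dtype))
        m.set_input("weight", np.random.uniform(-1, 1, size=(64, 64, 3, 3)).astype(dtype))
    return m.benchmark(dev, number=1, repeat=100)

print(bench_conv("float32", "float32"))  # float convolution
print(bench_conv("int8", "int32"))       # integer convolution, accumulating in int32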

I think it depends. First, to get good INT8 performance you need to explicitly use the DP4A SIMD instruction on CUDA cores, or the S8 Tensor Cores, for acceleration. If you don't explicitly invoke those instructions, you won't see the expected 2x speedup over float16 (or 4x over float32); at best performance stays roughly on par with float32. Unfortunately, the current auto-scheduler doesn't support DP4A. You might want to consider using AutoTVM, or writing a schedule for your computation by hand.
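
For example, a minimal AutoTVM sketch (task extraction plus an XGBoost tuner; tvm_model, params, and target come from the snippets above, while the trial budget and log file name are placeholders):

from tvm import autotvm
from tvm.autotvm.tuner import XGBTuner

# Extract tunable tasks (e.g. the quantized conv2d ops) from the Relay module.
tasks = autotvm.task.extract_from_program(tvm_model["main"], target=target, params=params)
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=3),
)
for task in tasks:
    tuner = XGBTuner(task)
    tuner.tune(
        n_trial=min(1000, len(task.config_space)),  # placeholder trial budget
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file("autotvm_int8.log")],
    )

# Rebuild using the best configurations found during tuning.
with autotvm.apply_history_best("autotvm_int8.log"):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(tvm_model, target=target, params=params)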

One potential reason the performance even decreased is that some NVIDIA GPU generations have distinct kinds of CUDA cores: some can only run FP32, while others also support FMA, HFMA2, and DP4A. If that's the case, FP16 performance will be the same as FP32, and if you don't use DP4A you may only achieve half the throughput of FP32. Based on the specifications at https://www.techpowerup.com/gpu-specs/geforce-rtx-3060-12-gb.c3682, I suspect this might be the case for the 3060. However, this is purely speculative. I recommend using Nsight Compute for more detailed profiling.

Thanks, I will try that!

May I ask how to specify DP4A when building for a target?

Please check out sota_nn, and I think you can give MetaSchedule a try. I believe it offers auto-tensorization support for DP4A via T.block_attr({"meta_schedule.auto_tensorize": "dp4a"}).
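
A minimal sketch of tuning and compiling the Relay module with MetaSchedule (tvm_model, params, and target come from the snippets above; the work directory and trial budget are placeholders):

from tvm import meta_schedule as ms

database = ms.relay_integration.tune_relay(
    mod=tvm_model,
    params=params,
    target=target,
    work_dir="./ms_work_dir",   # placeholder working directory for tuning records
    max_trials_global=2000,     # placeholder trial budget
)
lib = ms.relay_integration.compile_relay(
    database=database,
    mod=tvm_model,
    params=params,
    target=target,
)

Whether DP4A actually gets tensorized can then be checked by inspecting the tuning records written to the work directory.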