[PyTorch] Batch AutoTVM results only marginally faster than vanilla GPU

Hello,

I ran auto-tuning following the tutorial on the official website (https://tvm.apache.org/docs/tutorials/autotvm/tune_relay_cuda.html#sphx-glr-tutorials-autotvm-tune-relay-cuda-py) on a torchvision GoogLeNet model, which I first converted to ONNX (the conversion itself went through without problems).
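For reference, the conversion path was roughly the following (a sketch, not the exact script; file names are illustrative):

    import onnx
    import torch
    import torchvision
    from tvm import relay

    # export torchvision GoogLeNet to ONNX with the input name used below
    # (newer torchvision versions use weights=... instead of pretrained=True)
    model = torchvision.models.googlenet(pretrained=True).eval()
    dummy = torch.randn(64, 3, 224, 224)
    torch.onnx.export(model, dummy, "googlenet.onnx",
                      input_names=["inputImage"], opset_version=11)

    # import the ONNX graph into Relay with a fixed batch-64 input shape
    onnx_model = onnx.load("googlenet.onnx")
    mod, params = relay.frontend.from_onnx(
        onnx_model, shape={"inputImage": (64, 3, 224, 224)})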

67 tasks were extracted, and tuning took around 24 hours on a single NVIDIA RTX 2080 Ti GPU.

Input tensor shape: (64, 3, 224, 224)

When I ran a benchmark, these are the results I got:

GoogLeNet

  • Torch Vanilla CPU Runtime = 1357.69 [ms]

  • Torch Vanilla CUDA Runtime = 24.97 [ms]

  • AutoTVM CUDA Runtime = 22.00 [ms]

  • Vanilla TVM CUDA Runtime = 54.22 [ms]

When I ran the same benchmark without AutoTVM at a batch size of 1, I saw about a 1.5x improvement with TVM over vanilla PyTorch.
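For completeness, the timings above were collected along these lines (a minimal sketch; `m` is the compiled TVM graph module and `torch_model` the eager PyTorch model, both assumed from the surrounding script):

    import numpy as np
    import torch
    import tvm

    dev = tvm.cuda(0)  # tvm.gpu(0) on older TVM versions

    # TVM side: time_evaluator runs the compiled graph repeatedly and averages
    ftimer = m.module.time_evaluator("run", dev, number=10, repeat=3)
    print("TVM CUDA: %.2f ms" % (np.mean(ftimer().results) * 1e3))

    # PyTorch side: CUDA events with explicit synchronization for fair timing
    x = torch.randn(64, 3, 224, 224, device="cuda")
    with torch.no_grad():
        torch_model(x)  # warm-up
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        torch_model(x)
        end.record()
        torch.cuda.synchronize()
    print("Torch CUDA: %.2f ms" % start.elapsed_time(end))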

Is there a problem with large batch sizes? Or am I doing something wrong with the auto-tuning?

    import tvm
    from tvm import autotvm, relay

    if AutoTVM:  # flag set elsewhere; enables the tuning path
        dtype = "float32"
        log_file = f"{MODEL_NAME}_NEW.log"
        input_name = "inputImage"
        tuning_opt = {
            "log_filename": log_file,
            "tuner": "xgb",
            "n_trial": 600,
            "early_stopping": 200,
            "measure_option": autotvm.measure_option(
                builder=autotvm.LocalBuilder(),
                runner=autotvm.LocalRunner(repeat=3, min_repeat_ms=150, timeout=4),
            ),
        }

        if tune_kernels_flag:
            print("Extract tasks...")
            # only nn.conv2d tasks are extracted and tuned here
            tasks = autotvm.task.extract_from_program(
                mod["main"],
                target=target,
                params=params,
                ops=(relay.op.get("nn.conv2d"),),
            )

            # run tuning tasks (tune_tasks is the helper from the tutorial)
            tune_tasks(tasks, **tuning_opt)
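After tuning, I apply the log at compile time, roughly as in the tutorial (a sketch; `mod`, `params`, and `target` are the same objects as above, and on older TVM versions `graph_executor` is `tvm.contrib.graph_runtime`):

    import tvm
    from tvm import autotvm, relay
    from tvm.contrib import graph_executor

    # pick the best measured config for each task from the tuning log,
    # then compile the whole model under that context
    with autotvm.apply_history_best(log_file):
        with tvm.transform.PassContext(opt_level=3):
            lib = relay.build(mod, target=target, params=params)

    dev = tvm.cuda(0)
    module = graph_executor.GraphModule(lib["default"](dev))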

AFAIK, the TOPI template for CUDA is not optimized for large batch sizes, so the latency of a batch-N conv2d will be close to N × the latency of a batch-1 conv2d.
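A quick way to sanity-check that claim is to time a single untuned conv2d at batch 1 and batch 64 and compare (a sketch; the shapes are illustrative, not taken from GoogLeNet):

    import numpy as np
    import tvm
    from tvm import relay
    from tvm.contrib import graph_executor  # graph_runtime on older TVM

    def conv2d_ms(batch):
        # build a one-op Relay module containing just the conv2d
        data = relay.var("data", shape=(batch, 64, 56, 56), dtype="float32")
        weight = relay.var("weight", shape=(64, 64, 3, 3), dtype="float32")
        func = relay.Function([data, weight],
                              relay.nn.conv2d(data, weight, padding=(1, 1)))
        mod = tvm.IRModule.from_expr(func)
        w = np.random.rand(64, 64, 3, 3).astype("float32")
        with tvm.transform.PassContext(opt_level=3):
            lib = relay.build(mod, target="cuda", params={"weight": w})
        dev = tvm.cuda(0)
        m = graph_executor.GraphModule(lib["default"](dev))
        m.set_input("data", np.random.rand(batch, 64, 56, 56).astype("float32"))
        return m.module.time_evaluator("run", dev, number=20)().mean * 1e3

    # if the two numbers are ~64x apart, the schedule is not exploiting
    # the batch dimension
    print(conv2d_ms(1), conv2d_ms(64))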