Hello,
I ran auto-tuning following the official tutorial (https://tvm.apache.org/docs/tutorials/autotvm/tune_relay_cuda.html#sphx-glr-tutorials-autotvm-tune-relay-cuda-py) on a torchvision GoogleNet model that I first converted to ONNX with no problems (export/import path sketched below).
There were 67 tasks extracted, and tuning took around 24 hours on a single NVIDIA RTX 2080 Ti GPU.
Input tensor shape: (64, 3, 224, 224)
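For context, the conversion and import follow the standard torchvision → ONNX → Relay path, roughly like this (a minimal sketch, not my exact script; the file name and opset version are illustrative):

```python
import onnx
import torch
import torchvision
from tvm import relay

# Export torchvision GoogLeNet to ONNX at the benchmark batch size.
model = torchvision.models.googlenet(pretrained=True).eval()
dummy = torch.randn(64, 3, 224, 224)
torch.onnx.export(model, dummy, "googlenet.onnx",
                  input_names=["inputImage"], opset_version=11)

# Import the ONNX graph into Relay with a fixed input shape.
onnx_model = onnx.load("googlenet.onnx")
mod, params = relay.frontend.from_onnx(
    onnx_model, shape={"inputImage": (64, 3, 224, 224)})
```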
When I benchmarked the compiled models, these are the results I got (the measurement loop is sketched right after the numbers):
GoogleNet (batch size 64):
- Torch Vanilla CPU Runtime = 1357.69 [ms]
- Torch Vanilla CUDA Runtime = 24.97 [ms]
- AutoTVM CUDA Runtime = 22.00 [ms]
- Vanilla TVM CUDA Runtime = 54.22 [ms]
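The TVM timings come from the usual `time_evaluator` loop, roughly like this (a sketch assuming a recent TVM; on older versions `graph_executor` is `graph_runtime` and `tvm.cuda` is `tvm.gpu`):

```python
import numpy as np
import tvm
from tvm.contrib import graph_executor

# `lib` is the module produced by relay.build(...) for target "cuda".
dev = tvm.cuda(0)
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("inputImage",
                 np.random.uniform(size=(64, 3, 224, 224)).astype("float32"))

# Average wall-clock time over repeated runs, converted to milliseconds.
ftimer = module.module.time_evaluator("run", dev, number=10, repeat=3)
prof_res = np.array(ftimer().results) * 1000
print("Mean inference time: %.2f ms (std %.2f)" % (prof_res.mean(), prof_res.std()))
```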
When I ran the same benchmark without AutoTVM at batch size 1, TVM was about 1.5x faster than vanilla.
Is there a problem with large batch sizes? Or am I doing something wrong with the auto-tuning?
Here is my tuning setup:

```python
from tvm import relay, autotvm

# `MODEL_NAME`, `target`, `mod`, `params`, and `tune_tasks` are defined
# earlier in the script (`tune_tasks` as in the tutorial).
if AutoTVM:  # boolean flag set elsewhere in the script
    dtype = "float32"
    log_file = f"{MODEL_NAME}_NEW.log"
    input_name = "inputImage"

    tuning_opt = {
        "log_filename": log_file,
        "tuner": "xgb",
        "n_trial": 600,
        "early_stopping": 200,
        "measure_option": autotvm.measure_option(
            builder=autotvm.LocalBuilder(),
            runner=autotvm.LocalRunner(repeat=3, min_repeat_ms=150, timeout=4),
        ),
    }

    if tune_kernels_flag:
        print("Extract tasks...")
        tasks = autotvm.task.extract_from_program(
            mod["main"],
            target=target,
            params=params,
            ops=(relay.op.get("nn.conv2d"),),
        )
        # run tuning tasks
        tune_tasks(tasks, **tuning_opt)
```
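After tuning, I apply the log at compile time so the tuned schedules are actually picked up, following the tutorial's pattern (on older TVM versions, `tvm.transform.PassContext` is `relay.build_config`):

```python
# Build the model with the best schedules recorded during tuning.
with autotvm.apply_history_best(log_file):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
```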