Tips for troubleshooting tuning slowdowns?

Does anyone have tips, or a sequence they use, to troubleshoot performance slowdowns that come up during tuning?

This weekend I was playing around with the TVMC driver. This makes it pretty convenient to take an existing ONNX file and see what happens when you tune with either autotvm or the new autoscheduler and then compare performance with untuned models.

The command lines I was using were typically of the following form, with fairly basic options:

tvmc tune --target rocm --output tunedbfile.json --enable-autoscheduler onnxfile
tvmc compile --target rocm --tuning-records tunedbfile.json --output compilefile.tar onnxfile
tvmc run --device rocm --fill-mode random --print-time --repeat 100 compilefile.tar
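
The untuned numbers I compare against come from the same compile and run steps, just without the tuning records (the untuned.tar name here is only illustrative):

tvmc compile --target rocm --output untuned.tar onnxfile
tvmc run --device rocm --fill-mode random --print-time --repeat 100 untuned.tar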

Sometimes tuning improved performance significantly, but there were also times when it seemed tuning fell off a cliff and produced a much slower program.

What are the best ways to troubleshoot and figure out what is going on when performance significantly degrades? Any tips or tricks? I’ve found that I can use the --profile option on the tvmc run command to at least see which kernels become much slower. However, I’m not quite sure where or how to look next.
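
For reference, the per-kernel numbers come from a run along these lines, which prints a per-operator breakdown rather than just the end-to-end time:

tvmc run --device rocm --fill-mode random --profile compilefile.tar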

As a practical example, the inception model exported from the torchvision library works pretty well, while resnet50 exported from torchvision is an example that slows down, with a few convolution kernels becoming much slower.

I’m not quite sure where or how to dig deeper into what might be going on.

Additional data. With the exact same ONNX models, I can also sometimes see slowdowns for other CPU/GPU combinations. For example, I see the following:

inceptionv3:

  • autoscheduler is faster on Radeon VII (0.57x elapsed time) and slower on RTX 3070m (1.09x elapsed time)
  • autotvm is slower on Radeon VII (1.12x) and RTX 3070m (1.37x)

resnet50:

  • autoscheduler is slower on Radeon VII (2.26x) and RTX 3070m (1.04x)
  • autotvm is slower on Radeon VII (3.51x) and RTX 3070m (1.25x)

vgg16:

  • autoscheduler is slower on Radeon VII (4.19x) and RTX 3070m (1.92x)
  • autotvm is slower on Radeon VII (1.47x) and RTX 3070m (1.08x)

In 5 of the 6 cases, tuning these ONNX files results in slower code than just using the untuned models. How best to investigate further?

Do you mean it takes longer to tune, or that the kernels after tuning are slower? Note that we recently refactored the auto-scheduler with PopenPool (https://github.com/apache/tvm/pull/8492), but we do not expect the tuning speed to be affected.

I mean the kernels after tuning are slower.

More precisely, the measurements above are for end-to-end execution of models with many kernels. Some kernels get faster and some get slower, but the net effect is that the models run slower after autoscheduling than the same models without tuning.

I can look at the profile data to find the kernels that get slower. However, I’m not quite sure what steps are typically done after that.
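
One rough idea I had for a next step (not sure it is the right approach) is to read the auto-scheduler tuning log directly and check, per task, how many measurements failed and what the best measured latency is, on the theory that a kernel whose trials mostly failed may end up worse than the untuned default. A minimal sketch, assuming the log was written by tvmc tune --enable-autoscheduler (autotvm logs use a different format):

# Rough sketch: summarize an auto-scheduler tuning log.
# For each tuning task, count failed vs. valid trials and record the best
# mean latency found, so tasks where tuning mostly failed stand out.
from collections import defaultdict

import tvm.auto_scheduler as auto_scheduler

log_file = "tunedbfile.json"  # log produced by `tvmc tune --enable-autoscheduler`

best = defaultdict(lambda: float("inf"))  # workload_key -> best mean latency (seconds)
valid = defaultdict(int)                  # workload_key -> successful trials
failed = defaultdict(int)                 # workload_key -> failed trials

for inp, res in auto_scheduler.load_records(log_file):
    key = inp.task.workload_key
    if res.error_no == 0:
        valid[key] += 1
        mean_cost = sum(c.value for c in res.costs) / len(res.costs)
        best[key] = min(best[key], mean_cost)
    else:
        failed[key] += 1

# Print the slowest (or entirely failed) tasks first.
for key in sorted(set(valid) | set(failed), key=lambda k: best[k], reverse=True):
    print("%-60s valid=%4d failed=%4d best=%.3f ms"
          % (key[:60], valid[key], failed[key], best[key] * 1e3))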