TOPI autotuning integration

I’m trying to integrate an autotunable schedule I’ve written for a special case of convolution into TOPI.

However, I’m having difficulty getting it integrated correctly with autotvm.

When running my compute description and schedule standalone, I can autotune successfully, following the decorator style described in the tune_simple_template.py tutorial. By default my schedule uses standard techniques like SIMD and gets good times; autotuning squeezes out extra performance.
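For context, this is a minimal sketch of the decorator style I’m using (v0.6-era API; the operator, shapes, and knob names here are illustrative stand-ins, not my actual kernel — a pointwise 1x1 convolution plays the role of my special case):

```python
import tvm
from tvm import autotvm

@autotvm.template  # newer TVM versions take a name argument here
def my_conv2d_template(N, CI, H, W, CO):
    data = tvm.placeholder((N, CI, H, W), name="data")
    weight = tvm.placeholder((CO, CI), name="weight")
    rc = tvm.reduce_axis((0, CI), name="rc")
    out = tvm.compute(
        (N, CO, H, W),
        lambda n, co, h, w: tvm.sum(data[n, rc, h, w] * weight[co, rc], axis=rc),
        name="out")
    s = tvm.create_schedule(out.op)

    # tuning knobs, as in tune_simple_template.py
    cfg = autotvm.get_config()
    n, co, h, w = s[out].op.axis
    cfg.define_split("tile_co", co, num_outputs=2)
    cfg.define_knob("vectorise", [0, 1])

    # apply the knobs to the schedule
    coo, coi = cfg["tile_co"].apply(s, out, co)
    if cfg["vectorise"].val:
        s[out].vectorize(w)
    return s, [data, weight, out]
```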

I have successfully integrated it into TOPI to run without autotuning. Using the decorators, it picks up my compute definition and schedule. My benchmark takes X ms; if I manually disable vectorisation, it takes 4X ms. The default version in v0.6 takes Y ms.

However, I am confused by what happens when I try to use autotvm. If I run my benchmark with autotuning, I can confirm that my compute definition and schedule are called during tuning (using the old reliable printf debugging).

However, at the end of autotuning it seems to fall back to the default implementation, with the time being around Y ms rather than X ms. To test my suspicion that the default is being used, I disabled SIMD in my schedule, and this had no effect on the time.
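One thing I’m double-checking on my side (this snippet is my assumption of the expected usage, with a hypothetical log path and a toy workload): the tuned entries only take effect if the build runs under `apply_history_best`; outside that context TVM silently falls back to the default configuration.

```python
from tvm import autotvm, relay

# a tiny conv2d workload purely for illustration
data = relay.var("data", shape=(1, 16, 32, 32))
weight = relay.var("weight", shape=(32, 16, 3, 3))
net = relay.nn.conv2d(data, weight, kernel_size=(3, 3), channels=32)
mod = relay.Module.from_expr(relay.Function([data, weight], net))

# without this context manager the tuning log is ignored and the
# default schedule (Y ms in my case) is used
with autotvm.apply_history_best("conv2d_tuning.log"):  # hypothetical log file
    with relay.build_config(opt_level=3):  # v0.6-era API
        graph, lib, params = relay.build(mod, target="llvm -device=arm_cpu")
```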

To debug this, I feel I need to better understand the flow of execution when integrated in TOPI. Taking ARM CPU conv2d as an example, this is where my traces and reasoning have brought me.

I would appreciate it if anyone could point out any holes, or resources I could look at to improve my understanding.

I’m targeting an ARM CPU, so a lot of the things I’m using are in the topi/python/topi/arm_cpu/conv2d.py file.

When we autotune, here’s what I think happens:

Define the computation

  1. legalise conv2d from Relay
  2. access the callback registered to autotvm as a compute definition conv2d
  3. build the compute definition for the chosen layout, and define tuning knobs
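If I’ve read topi/python/topi/arm_cpu/conv2d.py correctly, the compute side in v0.6 looks roughly like this (heavily simplified; the real function handles more layouts and the spatial-pack helper’s arguments are elided):

```python
from tvm import autotvm
from topi.nn import conv2d

# register this as the autotvm compute for conv2d on arm_cpu
@autotvm.register_topi_compute(conv2d, "arm_cpu", ["direct"])
def conv2d_arm_cpu(cfg, data, kernel, strides, padding, dilation, layout, out_dtype):
    if layout == "NCHW":
        # builds the compute and defines the tuning knobs on cfg
        return _decl_spatial_pack(cfg, data, kernel, strides, padding,
                                  dilation, layout, out_dtype, num_tile=2)
    raise ValueError("unsupported layout: " + layout)
```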

Call the schedule

  1. call schedule_conv2d_nchw, registered as an autotvm schedule. Call the appropriate schedule function for the compute used (e.g. Winograd, Spatial pack, etc.)
  2. our normal schedule is applied, e.g. loop unrolling, vectorisation etc
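And the schedule side, again paraphrased from the v0.6 source (the tag string and helper arguments are simplified from what the file actually contains):

```python
import tvm
from tvm import autotvm
from topi.generic import schedule_conv2d_nchw
from topi.util import traverse_inline

# register this as the autotvm schedule matching the compute above
@autotvm.register_topi_schedule(schedule_conv2d_nchw, "arm_cpu", ["direct", "winograd"])
def schedule_conv2d_nchw_arm_cpu(cfg, outs):
    s = tvm.create_schedule([x.op for x in outs])

    def _callback(op):
        # dispatch on the tag set by the compute definition
        if "spatial_conv2d_output" in op.tag:
            # helper applies tiling, unrolling, vectorisation using cfg knobs
            _schedule_spatial_pack(cfg, s, op)  # arguments simplified

    traverse_inline(s, outs[0].op, _callback)
    return s
```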

Autotuning occurs

I’m unsure of all the steps that happen here, though I am familiar with what happens in the context of a standalone version.

Did you use the latest TVM master version? In the latest version, we have moved to using Relay Op Strategy to choose which implementation to compile for each op. You need to add your implementation to the strategy in order for it to be used during compilation.
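For reference, the registration on master looks roughly like this (a sketch paraphrased from python/tvm/relay/op/strategy/arm_cpu.py; the spatial-pack names are just the existing in-tree example, so substitute your own compute and schedule):

```python
from tvm import topi
from tvm.relay.op import op as _op
from tvm.relay.op.strategy.generic import (
    conv2d_strategy, wrap_compute_conv2d, wrap_topi_schedule)

@conv2d_strategy.register("arm_cpu")
def conv2d_strategy_arm_cpu(attrs, inputs, out_type, target):
    strategy = _op.OpStrategy()
    # add each implementation; the compiler selects among them per op
    strategy.add_implementation(
        wrap_compute_conv2d(topi.arm_cpu.conv2d_nchw_spatial_pack),
        wrap_topi_schedule(topi.arm_cpu.schedule_conv2d_nchw_spatial_pack),
        name="conv2d_nchw_spatial_pack.arm_cpu")
    return strategy
```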

Thanks, I’ve been using the v0.6 release rather than the development branch. This Relay Op Strategy design seems to bring a lot more clarity to the process, and hopefully I’ll get an MWE off the ground soon.