AutoTVM autotuning a network

Hi.

I ran the snippet from https://tvm.apache.org/docs/tutorials/autotvm/tune_relay_x86.html#sphx-glr-tutorials-autotvm-tune-relay-x86-py with these options: network: resnet18 / mobilenet; n_trial: 20000; tuner: ga / random (neither worked); target: llvm.
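For reference, this is roughly what the tuning configuration looked like (a sketch; the `tuning_option` dict and its field names follow the tutorial, and exact values may differ from my run):

```python
# Sketch of the options passed into the tutorial's tuning flow.
# Field names follow the tune_relay_x86 tutorial.
tuning_option = {
    "log_filename": "tuning.log",
    "tuner": "ga",            # also tried "random"; neither worked
    "n_trial": 20000,
    "early_stopping": None,
}
```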

It always shows "WARNING:autotvm:Too many errors happen in the tuning. Now is in debug mode".

The TVM version is 0.8.dev0. As for the LLVM version, I tried 12.0 and 7.1, but neither worked.

Any ideas?

Can you try running with a smaller network? E.g. import a single layer network from PyTorch/whatever, and see if the errors still occur.

Is this for a local, or RPC auto-tune?

Tuning a single operator is fine.

I used a local runner for the LLVM target.
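Concretely, the measurement setup was local-only, along these lines (a sketch using the standard `autotvm.measure_option` API; the parameter values here are assumptions, not necessarily what I ran with):

```python
from tvm import autotvm

# Local build + local run on the same machine; no RPC tracker involved.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=1, min_repeat_ms=0, timeout=10),
)
```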

The server is a 48-core Xeon Gold 5118 @ 2.3GHz.

Hmm. I have had similar problems with this before, and I don’t think I ever discovered a root cause. Perhaps changing TVM version to the more stable 0.7 could help, unless there’s a key feature in 0.8 that you need.

You mentioned local runner, so it won’t be network issues.

Otherwise, another approach could be a binary search: comment out half of the network and see if it still crashes. If it doesn't, uncomment the 3rd quarter; if it does, comment out the 2nd quarter, and so on. Maybe it's a single layer that is breaking TVM, which could serve as a bug report.
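The bisection above can be sketched in plain Python. Here `crashes` is a hypothetical callback that tunes a sub-network and reports whether the error appears, and we assume exactly one layer is responsible:

```python
def find_bad_layer(layers, crashes):
    """Binary-search for the single layer that triggers the failure.

    Assumes exactly one bad layer, and that tuning any sub-network
    containing it reproduces the crash (crashes(sublist) -> bool).
    """
    lo, hi = 0, len(layers)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if crashes(layers[lo:mid]):
            hi = mid          # bad layer is in the first half
        else:
            lo = mid          # bad layer is in the second half
    return layers[lo]
```

Each step halves the search space, so isolating the culprit in a resnet18-sized network takes only a handful of tuning runs.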

Thank you. I will check the network next… It's so weird because I know some CUDA schedules can be invalid, but such cases are rare with the LLVM target, which does not require launch parameters, etc.

But I will check the model to see if something is wrong. Thank you again…

Yeah, my experience was also with an LLVM target. I don’t recall there being an obvious error cause. Good luck, sorry I couldn’t bring more illumination.