[RFC] Ansor: An Auto-scheduler for TVM (AutoTVM v2.0)

Adding “-libs=cblas” to the target and building TVM with MKLDNN will use the MKL GEMM for dense.


OpenCL Mali’s analog to “shared” is “local”, IIRC. Nvidia GPUs call a configurable part of L2 “shared”, and I believe Mali GPUs call the configurable part of L2 “local”, if I’m not mistaken.

Welcome to the TVM community :slight_smile:

Mali doesn’t really have an equivalent to Nvidia’s shared memory; it uses system RAM backed by an unconfigurable cache. Local is just OpenCL’s term for CUDA’s shared. This means that explicit cache reads/writes to shared/local aren’t advised when optimising for Mali.

As for explicitly generating vectorized instructions, that will depend on the architecture in question. Post-Midgard GPUs should not require it (other than perhaps vectorizing loads/stores).

@yangjunpro AKG has been integrated into MindSpore now, as a submodule. You can check it out here: https://github.com/mindspore-ai/mindspore.


I have two questions:

  • How much time and how many machines did it take to optimize the networks in the paper? (The Halide auto-scheduler reported “a few hours per app”.)
  • Can Ansor handle networks with dynamic shapes?

  1. A few hours per network on a single machine.
  2. Not yet.

Thanks to everyone’s collaboration, we’ve merged our first PR, a minimal Ansor system. I’m updating the plan for the remaining upstream work here.

We may split each part into several small PRs.


Thanks for the contribution. I am not trying to block progress, but I do have one question. As @MarisaKirisame pointed out, there are comments that were not addressed in the Ansor PR. Should we fix those first, or move forward with more and more code and come back at the end to fix them?

Hi all,

First of all, congratulations for the excellent work!

I have a question: how are Ansor and graph-tuner related?

The nice thing about graph-tuner is that we can determine the cost of altering the layout of the tensors within the network. Is there a way to graph-tune the network through Ansor?

@zhiics I summarized all unresolved comments in that PR. (https://github.com/apache/incubator-tvm/pull/5962#issuecomment-657068540). If someone finds there is something missing, he/she can append the item to that list.

We merged the PR first because it seems everyone agrees with the overall architecture. There were a lot of suggestions on details (e.g., names of internal functions). We considered all of them, but rejected those we did not think were an improvement.

As @jroesch mentioned, we will do a final review of the API when we fully land Ansor (i.e., when we add the auto-tuning tutorials for most users).

@zhiics Yes. Since the current code is only a small part of our whole system, we thought it would be better to move forward once the overall architecture was accepted. Otherwise we might spend several more months focusing on small details and blocking the later features (e.g., our formal search policy, cost model, Relay integration, which are in fact the more important parts of Ansor).

Thanks for everyone’s comments. You’re welcome to start a discussion on any part at any time if you have opinions. :smiley:

@giuseros To be honest, Ansor currently still has problems cooperating with graph-tuner, but we figured out another approach that gets better overall performance than graph-tuner (the TVM baseline in our paper used graph-tuner).

Ansor and AutoTVM will continue to coexist until we find a way to cover all the use cases of AutoTVM.

Thanks for your answer @jcf94! I have another question: does Ansor introduce optimization levels as well?

Like -O0, -O2, and -O3 for gcc/clang: each lower level could expose fewer tuning knobs, allowing for quicker tuning (but worse performance).

Thanks, @giuseros. That’s an interesting idea.

At the beginning, we just thought about how to automatically generate schedules with the best performance for different kinds of ops. As for efficiency, we count on a pre-trained cost model and evolutionary search to find a good schedule quickly.

@merrymercy Maybe we can try some approaches like this? E.g., set different levels to control the granularity of the split factors.
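To make the "levels control split-factor granularity" idea concrete, here is a minimal sketch in plain Python. It is not Ansor code: the `level` knob, the function names, and the mapping from level to candidate set are all hypothetical, chosen only to show how a lower level shrinks the search space for one loop.

```python
def divisors(n):
    """Return all positive divisors of n in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def split_candidates(extent, level):
    """Candidate split factors for a loop of length `extent`.

    `level` is a hypothetical knob (not an Ansor option):
      0  -> only the trivial splits (1 and the full extent)
      1  -> power-of-two divisors plus the full extent
      2+ -> every divisor of the extent
    """
    if extent < 1:
        raise ValueError("extent must be positive")
    all_factors = divisors(extent)
    if level <= 0:
        return sorted({1, extent})
    if level == 1:
        # d & (d - 1) == 0 is true exactly when d is a power of two.
        return [d for d in all_factors if d & (d - 1) == 0 or d == extent]
    return all_factors
```

For a loop of extent 1280, level 0 leaves 2 candidates, level 1 leaves 10, and level 2 leaves all 18 divisors, so the per-loop search space grows with the level, which is the trade-off being proposed.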

IMO, for quicker tuning time,

  1. I think setting n_trial smaller could solve most problems. For example, with n_trial set to 100 we could maybe achieve 70~80% of the performance, while taking much less time than 20000 trials.

  2. A pre-trained cost model could also help reduce tuning time.

  3. Ansor’s task scheduler / micro-to-macro strategy could also help, as could the recent clflush utility (with it we no longer need min_repeat_ms and can just run several times; for example, when tuning one resnet18 model on x86 we could get a 4.2x speedup).

  4. Limiting the tuning knobs could also help, but I think the approaches above already go a long way toward solving this problem.

In my view, if Ansor does intend to replace all the schedules in TOPI it will need to be runnable in a no-tuning mode. In an ahead-of-time flow, I think it’s too strong a restriction to require that the user has access to the hardware as many targets will be cross compiled. I think it’s also important more generally because most developers are not used to the auto-tuning process and should be able to have a decent out-of-the-box experience without having to set up the tuning.

I can see two main ways of doing this via Ansor. One is to train the cost model sufficiently well so that it alone can be queried. Then we can distribute cost models in a TopHub like arrangement.

The other option I see is to write sufficiently many good and general ‘rules’ that the initial schedules that come out of Ansor untuned perform reasonably well.


Yes, we are working on an offline cost model that could be used to produce optimal schedules. This matches your idea.


Does auto_scheduler support automatic Tensor Core use in fp16 and int8? I didn’t see it in the examples.

It is not supported currently. We have some experimental examples here. cc @jcf94

I’m wondering if there are any TopHub-like workflows that have been created for Ansor yet? I’m looking at tuning reuse.

E.g. if I look at the tuning config I produce for MobileNetV2 mobilenet-NCHW-B1-llvm.json, I can see entries such as:

{"i": [["[\"0abeaf8a7df88ad015de76910b5779a1\"]", "llvm -keys=cpu -link-params=0 -mcpu=core-avx2", [8, 64, 64, 0, 0, 0, 0, 0], "", 2], [[], [["SP", 2, 0, 1, [1, 1, 1], 1], ["SP", 2, 4, 1000, [25, 2, 4], 1], ["SP", 2, 8, 1280, [16], 1], ["RE", 2, [0, 4, 1, 5, 8, 2, 6, 9, 3, 7]], ["FSP", 4, 0, 0, 2], ["FSP", 4, 3, 1, 2], ["RE", 4, [0, 3, 1, 4, 2, 5]], ["CA", 2, 4, 3], ["FU", 4, [0, 1, 2, 3]], ["AN", 4, 0, 3], ["PR", 2, 0, "auto_unroll_max_step$64"]]]], "r": [[0.000296812, 0.000293744, 0.000262426, 0.000270827, 0.000182757, 0.000185249, 0.000183439, 0.000183809, 0.000186335, 0.000268723], 0, 0.881661, 1611833413], "v": "v0.5"}

Which is somewhat similar to an equivalent entry from the autoTVM logfile for MobileNetV2:

{"input": ["llvm -keys=cpu -link-params=0 -mcpu=core-avx2", "conv2d_NCHWc.x86", [["TENSOR", [1, 320, 7, 7], "float32"], ["TENSOR", [1280, 320, 1, 1], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "NCHW", "NCHW", "float32"], {}], "config": {"index": 1996, "code_hash": null, "entity": [["tile_ic", "sp", [-1, 32]], ["tile_oc", "sp", [-1, 640]], ["tile_ow", "sp", [-1, 7]], ["tile_oh", "ot", 2]]}, "result": [[1000000000.0], 6, 10, 1611932834.9044788], "version": 0.2, "tvm_version": "0.8.dev0"}

I’m wondering for Ansor, if I had a network which had some layers identical to MobileNetV2, how I would parse the MobileNetV2 Ansor log file to use the tuning parameters for those shared layers.

The Ansor log format doesn’t seem to be canonicalised yet.

Unfortunately, we don’t have TopHub-like pre-tuned schedules for Ansor yet. One reason is that Ansor is still evolving and everything could change. For example, the log you posted is already out of date.
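In the meantime, the shared-layer lookup asked about above can be sketched in plain Python against the record layout of the Ansor entry quoted earlier. Everything here is an assumption read off that one entry: that "i"[0][0] is the workload key, "i"[0][1] is the target string, "r"[0] is the list of measured costs, and "r"[1] is an error code where 0 means success. Since the format may change, treat this as an illustration rather than a stable parser. The function names are hypothetical.

```python
import json
from statistics import mean


def best_records(lines):
    """Map (workload_key, target) -> (mean_cost, record) for the fastest valid record."""
    best = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the log
        rec = json.loads(line)
        task = rec["i"][0]                      # assumed: [workload_key, target, ...]
        workload_key, target = task[0], task[1]
        costs, error_no = rec["r"][0], rec["r"][1]
        if error_no != 0:                       # assumed: non-zero means a failed measurement
            continue
        cost = mean(costs)
        key = (workload_key, target)
        if key not in best or cost < best[key][0]:
            best[key] = (cost, rec)
    return best


def reusable_records(best, wanted_keys, target):
    """Pick the best record for each wanted workload key that the log covers."""
    return {k: best[(k, target)][1] for k in wanted_keys if (k, target) in best}
```

A typical use would be `best = best_records(open("mobilenet-NCHW-B1-llvm.json"))`, then passing the workload keys of the new network's tasks to `reusable_records`. This only works if identical layers really do produce identical workload keys for the same target, which I believe is the intent of the key but have not verified across versions. If I recall correctly, TVM's `auto_scheduler.ApplyHistoryBest` performs a similar lookup from a log file at compile time, so a hand-written parser like this is mainly useful for inspecting or merging logs.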