The first PR of Ansor has been submitted at https://github.com/apache/incubator-tvm/pull/5883
We do support Ascend 310 op codegen on the AKG side, but not in MindSpore for now.
Are there any details about the TVM + MKLDNN BERT integration work? I would like to take a look to see its potential connection with Ansor.
Hi @xqdan, when you say "not in MindSpore for now", do you mean AKG is still a standalone codegen toolkit? Or has it already been integrated into your internal TensorFlow/PyTorch versions?
Adding `-libs=cblas` to the target and building TVM with MKLDNN will use MKL GEMM for dense.
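As a sketch, the target string would look something like the following. The `-mcpu` value here is only an illustrative assumption (pick one matching your CPU); the relevant part for MKL dispatch is the `-libs=cblas` option, and it only takes effect if TVM was built with the BLAS/MKLDNN contrib enabled.

```python
# Hedged sketch: a TVM target string that routes dense/GEMM ops to the
# cblas (MKL) contrib library. "-mcpu=skylake-avx512" is an illustrative
# choice, not a requirement; the key option is "-libs=cblas".
target = "llvm -mcpu=skylake-avx512 -libs=cblas"

# This string would later be passed as the target to relay.build / tvm.build.
```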
OpenCL Mali's analog to "shared" is "local", IIRC. Nvidia GPUs expose "shared" as a configurable part of L2, and I believe Mali GPUs call their configurable part of L2 "local", if I'm not mistaken.
Welcome to the TVM community
Mali doesn't really have an equivalent to Nvidia's shared memory; it uses the system RAM backed by a non-configurable cache. "Local" is just OpenCL's term for CUDA's "shared". This means that using explicit cache reads/writes to shared/local isn't advised when optimising for Mali.
As to explicitly generating vectorized instructions, that will depend on the architecture in question. Post-Midgard GPUs should not require it (other than perhaps vectorizing loads/stores).
@yangjunpro AKG has been integrated into MindSpore now, as a submodule. You can check it out here: https://github.com/mindspore-ai/mindspore.
I got two questions:
- How much time and how many machines were needed to optimize the networks in the paper? (The Halide auto-scheduler paper said "a few hours per app".)
- Can Ansor handle networks with dynamic shapes?
- A few hours per network on a single machine.
- Not yet at this moment.
Thanks to everyone's co-work, we've merged the first PR of the minimum Ansor system. Here is an update on the later upstream plan.
In each phase, we may split the work into several small PRs.
- Phase 0: Ansor Minimum System (apache/incubator-tvm#5962)
- Phase 1: Ansor Components: Cost model, other transform steps
- Namespace renaming (apache/incubator-tvm#6059)
  - RPC Runner (apache/incubator-tvm#6077)
  - Annotation/ComputeAt/ComputeRoot/ComputeInline steps (apache/incubator-tvm#6073)
- Phase 2: SketchSearchPolicy: As proposed in the paper
- Phase 3: Relay Integration: End to end network support
- Phase 4: API refine, Fully custom sketch support, Documents refine
Thanks for the contribution. I am not trying to block the progress, but I do have one question. As @MarisaKirisame pointed out, there are comments not addressed in the Ansor PR; should we fix those first, or move forward with more and more code and come back at the end to fix them?
Hi all,
First of all, congratulations on the excellent work!
I have a question: how are Ansor and graph-tuner related?
The nice thing about graph-tuner is that we can determine the cost of altering the layout of the tensors within the network. Is there a way to graph-tune the network through Ansor?
@zhiics I summarized all unresolved comments in that PR (https://github.com/apache/incubator-tvm/pull/5962#issuecomment-657068540). If someone finds something missing, they can append the item to that list.
We merged the PR first because it seems everyone agrees with the overall architecture. There were a lot of suggestions on details (e.g., names of internal functions). We considered all of them, but rejected some when we did not think the suggestion was better.
As @jroesch mentioned, we will do a final review of the API when we fully land Ansor (i.e., when we add the auto-tuning tutorials for most users).
@zhiics Yes, since the current code is only a small part of our whole system, we thought it would be better to carry forward once the overall architecture is accepted. Otherwise it might take several further months for us to focus on small details and block the later features (e.g., our formal search policy, cost model, Relay integration … which indeed are the more important parts of Ansor).
Thanks for everyone's comments; you're quite welcome to initiate a discussion on any part at any time if you have opinions.
@giuseros To be honest, currently Ansor still has problems cooperating with graph-tuner, but we figured out another approach to get better overall performance than graph-tuner (the TVM baseline in our paper used graph-tuner).
Ansor and AutoTVM will continue to coexist until we can find a way to cover all the use cases of AutoTVM.
Thanks for your answer @jcf94! I have another question: does Ansor introduce optimization levels as well?
Like `-O0`, `-O2`, and `-O3` for gcc/clang, every level could provide fewer tuning knobs, allowing for quicker tuning times (but worse performance).
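A minimal sketch of how such levels might map to tuning budgets. Everything here is hypothetical (Ansor does not currently expose optimization levels; the level names, trial counts, and flags are invented purely to illustrate the suggested time/performance trade-off):

```python
# Hypothetical mapping from an optimization level to a tuning budget.
# None of these levels or numbers exist in Ansor; they only illustrate
# trading tuning time for performance, as suggested above.
OPT_LEVELS = {
    "O0": {"n_trial": 0,    "use_cost_model": False},  # no tuning: fallback schedules
    "O2": {"n_trial": 100,  "use_cost_model": True},   # quick tuning pass
    "O3": {"n_trial": 2000, "use_cost_model": True},   # thorough search
}

def tuning_budget(level: str) -> int:
    """Return the number of measurement trials for a given (hypothetical) level."""
    return OPT_LEVELS[level]["n_trial"]
```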
Thanks, @giuseros. That's an interesting idea.
At the beginning, we just thought about how to automatically generate schedules with the best performance for different kinds of ops. As for efficiency, we count on a pre-trained cost model and evolutionary search to find a good schedule quickly.
@merrymercy Maybe we can try some approaches like this? E.g., set different levels to control the granularity of split factors.
IMO, for quicker tuning time:

- I think setting `n_trial` to be smaller could solve most problems. For example, with `n_trial` set to 100 we could maybe achieve 70~80% of the performance, at much lower cost compared with 20000 trials.
- A pre-trained cost model could also help us reduce tuning time.
- Ansor's task scheduler / micro-to-macro strategy could also help with this, as could the recent use of `clflush` (which means we don't need `min_repeat_ms`; just running a few times is enough. For example, when tuning one resnet18 model on x86, we could get a 4.2x speedup).
- Limiting the tuning knobs could also help, but I think the previous approaches already solve this problem to a large extent.
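The `n_trial` trade-off can be illustrated with a generic random search over a toy tuning space. Everything below is invented for illustration (the cost function and the candidate split factors are not Ansor's search space or measurement pipeline): a capped budget still keeps the best configuration seen so far, and a larger budget can only do as well or better.

```python
import random

def toy_cost(split_factor: int) -> float:
    """A made-up cost function standing in for measured kernel latency."""
    return abs(split_factor - 17) + 1.0

def random_search(candidates, n_trial, seed=0):
    """Sample n_trial configs and keep the best, like a capped tuning run."""
    rng = random.Random(seed)
    best_cfg, best_cost = None, float("inf")
    for _ in range(n_trial):
        cfg = rng.choice(candidates)
        cost = toy_cost(cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost

space = list(range(1, 65))  # e.g. candidate split factors for one axis
cfg_small, cost_small = random_search(space, n_trial=10)
cfg_big, cost_big = random_search(space, n_trial=1000)
```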
In my view, if Ansor does intend to replace all the schedules in TOPI, it will need to be runnable in a no-tuning mode. In an ahead-of-time flow, I think it's too strong a restriction to require that the user has access to the hardware, as many targets will be cross-compiled. I think it's also important more generally because most developers are not used to the auto-tuning process and should be able to have a decent out-of-the-box experience without having to set up tuning.
I can see two main ways of doing this via Ansor. One is to train the cost model sufficiently well that it alone can be queried; then we can distribute cost models in a TopHub-like arrangement.
The other option I see is to write sufficiently many good and general "rules" that the initial schedules that come out of Ansor untuned perform reasonably well.
Yes, we are working to produce an offline cost model that can be used to produce optimal schedules. It is the same as your thought.