The first PR of Ansor has been submitted at https://github.com/apache/incubator-tvm/pull/5883
We do support Ascend 310 op codegen on the AKG side, but not in MindSpore for now.
Are there any details about the TVM + MKLDNN BERT integration work? I would like to take a look to see its potential connection with Ansor.
Hi @xqdan, when you say "not in MindSpore for now", do you mean AKG is still a standalone codegen toolkit? Or has it already been integrated into your internal TensorFlow/PyTorch versions?
Adding `-libs=cblas` to the target and building TVM with MKLDNN will use MKL GEMM for dense.
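As a sketch, the target string would look something like the following. The `-mcpu` value here is only an illustrative assumption (pick one matching your CPU); the relevant part for MKL dispatch is the `-libs=cblas` option, and it only takes effect if TVM was built with the BLAS/MKLDNN contrib enabled.

```python
# Hedged sketch: a TVM target string that routes dense/GEMM ops to the
# cblas (MKL) contrib library. "-mcpu=skylake-avx512" is an illustrative
# choice, not a requirement; the key option is "-libs=cblas".
target = "llvm -mcpu=skylake-avx512 -libs=cblas"

# This string would later be passed as the target to relay.build / tvm.build.
```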
OpenCL Mali's analog to "shared" is "local", IIRC. Nvidia GPUs expose "shared" as a configurable part of L2, and I believe Mali GPUs call their configurable part of L2 "local", if I'm not mistaken.
Welcome to the TVM community
Mali doesn't really have an equivalent to Nvidia's shared memory; it uses the system RAM backed by a non-configurable cache. "Local" is just OpenCL's term for CUDA's "shared". This means that using explicit cache reads/writes to shared/local isn't advised when optimising for Mali.
As to explicitly generating vectorized instructions, that will depend on the architecture in question. Post-Midgard GPUs should not require it (other than perhaps vectorizing loads/stores).
@yangjunpro AKG has been integrated into MindSpore now, as a submodule. You can check it out here: https://github.com/mindspore-ai/mindspore.
I got two questions:
- How much time and how many machines were needed to optimize the networks in the paper? (The Halide auto-scheduler paper said "a few hours per app".)
- Can Ansor handle networks with dynamic shapes?
- A few hours per network on a single machine.
- Not yet at this moment.
Thanks to everyone's co-work, we've merged the first PR of the minimum Ansor system. Here is an update on the later upstream plan.
In each phase, we may split the work into several small PRs.
- Phase 0: Ansor Minimum System (apache/incubator-tvm#5962)
- Phase 1: Ansor Components: Cost model, other transform steps
- Namespace renaming (apache/incubator-tvm#6059)
  - RPC Runner (apache/incubator-tvm#6077)
  - Annotation/ComputeAt/ComputeRoot/ComputeInline steps (apache/incubator-tvm#6073)
- Phase 2: SketchSearchPolicy: As proposed in the paper
- Phase 3: Relay Integration: End to end network support
- Phase 4: API refine, Fully custom sketch support, Documents refine
Thanks for the contribution. I am not trying to block the progress, but I do have one question. As @MarisaKirisame pointed out, there are comments not addressed in the Ansor PR; should we fix those first, or move forward with more and more code and come back at the end to fix them?
Hi all,
First of all, congratulations on the excellent work!
I have a question: how are Ansor and graph-tuner related?
The nice thing about graph-tuner is that we can determine the cost of altering the layout of the tensors within the network. Is there a way to graph-tune the network through Ansor?
@zhiics I summarized all unresolved comments in that PR (https://github.com/apache/incubator-tvm/pull/5962#issuecomment-657068540). If someone finds something missing, they can append the item to that list.
We merged the PR first because it seems everyone agrees with the overall architecture. There were a lot of suggestions on details (e.g., names of internal functions). We considered all of them, but rejected some when we did not think the suggestion was better.
As @jroesch mentioned, we will do a final review of the API when we fully land Ansor (i.e., when we add the auto-tuning tutorials for most users).
@zhiics Yes, since the current code is only a small part of our whole system, we thought it would be better to carry forward once the overall architecture is accepted. Otherwise it might take several further months for us to focus on small details and block the later features (e.g., our formal search policy, cost model, Relay integration … which indeed are the more important parts of Ansor).
Thanks for everyone's comments; you're quite welcome to initiate a discussion on any part at any time if you have opinions.
@giuseros To be honest, currently Ansor still has problems cooperating with graph-tuner, but we figured out another approach to get better overall performance than graph-tuner (the TVM baseline in our paper used graph-tuner).
Ansor and AutoTVM will continue to coexist until we can find a way to cover all the use cases of AutoTVM.
Thanks for your answer @jcf94! I have another question: does Ansor introduce optimization levels as well?
Like `-O0`, `-O2`, and `-O3` for gcc/clang, every level could provide fewer tuning knobs, allowing for quicker tuning times (but worse performance).
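A minimal sketch of how such levels might map to tuning budgets. Everything here is hypothetical (Ansor does not currently expose optimization levels; the level names, trial counts, and flags are invented purely to illustrate the suggested time/performance trade-off):

```python
# Hypothetical mapping from an optimization level to a tuning budget.
# None of these levels or numbers exist in Ansor; they only illustrate
# trading tuning time for performance, as suggested above.
OPT_LEVELS = {
    "O0": {"n_trial": 0,    "use_cost_model": False},  # no tuning: fallback schedules
    "O2": {"n_trial": 100,  "use_cost_model": True},   # quick tuning pass
    "O3": {"n_trial": 2000, "use_cost_model": True},   # thorough search
}

def tuning_budget(level: str) -> int:
    """Return the number of measurement trials for a given (hypothetical) level."""
    return OPT_LEVELS[level]["n_trial"]
```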
Thanks, @giuseros. That's an interesting idea.
At the beginning, we just thought about how to automatically generate schedules with the best performance for different kinds of ops. As for efficiency, we count on a pre-trained cost model and evolutionary search to find a good schedule quickly.
@merrymercy Maybe we can try some approaches like this? E.g., set different levels to control the granularity of split factors.
IMO, for quicker tuning time:

- I think setting `n_trial` to be smaller could solve most problems. For example, with `n_trial` set to 100 we could maybe achieve 70~80% of the performance, at much lower cost compared with 20000 trials.
- A pre-trained cost model could also help us reduce tuning time.
- Ansor's task scheduler / micro-to-macro strategy could also help with this, as could the recent use of `clflush` (which means we don't need `min_repeat_ms`; just running a few times is enough. For example, when tuning one resnet18 model on x86, we could get a 4.2x speedup).
- Limiting the tuning knobs could also help, but I think the previous approaches already solve this problem to a large extent.
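The `n_trial` trade-off can be illustrated with a generic random search over a toy tuning space. Everything below is invented for illustration (the cost function and the candidate split factors are not Ansor's search space or measurement pipeline): a capped budget still keeps the best configuration seen so far, and a larger budget can only do as well or better.

```python
import random

def toy_cost(split_factor: int) -> float:
    """A made-up cost function standing in for measured kernel latency."""
    return abs(split_factor - 17) + 1.0

def random_search(candidates, n_trial, seed=0):
    """Sample n_trial configs and keep the best, like a capped tuning run."""
    rng = random.Random(seed)
    best_cfg, best_cost = None, float("inf")
    for _ in range(n_trial):
        cfg = rng.choice(candidates)
        cost = toy_cost(cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost

space = list(range(1, 65))  # e.g. candidate split factors for one axis
cfg_small, cost_small = random_search(space, n_trial=10)
cfg_big, cost_big = random_search(space, n_trial=1000)
```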
In my view, if Ansor does intend to replace all the schedules in TOPI, it will need to be runnable in a no-tuning mode. In an ahead-of-time flow, I think it's too strong a restriction to require that the user has access to the hardware, as many targets will be cross-compiled. I think it's also important more generally because most developers are not used to the auto-tuning process and should be able to have a decent out-of-the-box experience without having to set up tuning.
I can see two main ways of doing this via Ansor. One is to train the cost model sufficiently well that it alone can be queried; then we can distribute cost models in a TopHub-like arrangement.
The other option I see is to write sufficiently many good and general "rules" that the initial schedules that come out of Ansor untuned perform reasonably well.
Yes, we are working to produce an offline cost model that can be used to produce optimal schedules. It is the same as your thought.