I'm confused about some design decisions in topi/arm_cpu/conv2d.py:
'Direct' and 'Winograd' methods are provided for convolution, but not the popular 'im2col'. Why?
Is there any theoretical or experimental conclusion on which is more efficient?
If I want to try a new method for convolution (e.g. im2col) and let AutoTVM make the best choice, what should I do besides providing a schedule template for the method?
I am not good at Python, and the registration mechanism in topi/autotvm seems hard for me to follow. It would be nice if someone could clarify the process or provide some samples.
In the 'winograd' method, why do we choose tile_size=4 instead of making it tunable? It seems that some other frameworks choose tile_size=6 or 2 for different shapes.
im2col can have some benefits for certain layouts. We would welcome a PR that adds an im2col template to AutoTVM.
To support another algorithm strategy, such as im2col, a few steps are needed in addition to providing the schedule template.
First, you must register the compute declaration (you can borrow this from the old im2col code), which describes the computation as well as any data layout transformations. The example for the direct case is here (it does not have a data layout transformation step).
Then, hook in your schedule template function here.
Basically, these steps add im2col so that the correct compute declaration and schedule function fire when that strategy is chosen.
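As a rough sketch (the function names here are made up, and the decorator signatures follow the old TOPI registration API, so they may differ in your TVM version), the two registrations would look something like:

```python
# Hypothetical skeleton; bodies elided. 'conv2d' and 'schedule_conv2d_nchw'
# are the generic TOPI symbols being specialized for 'arm_cpu'.
@autotvm.register_topi_compute(conv2d, 'arm_cpu', ['im2col'])
def conv2d_arm_cpu_im2col(cfg, data, kernel, strides, padding, dilation,
                          layout, out_dtype):
    # declare the im2col compute: unfold input patches into a matrix,
    # then express the convolution as a matrix multiplication
    ...

@autotvm.register_topi_schedule(schedule_conv2d_nchw, 'arm_cpu', ['im2col'])
def schedule_conv2d_arm_cpu_im2col(cfg, outs):
    # build the schedule, declaring tunable knobs on cfg
    ...
```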
So what I should do is just complete the compute/schedule registration for the new strategy using @autotvm.register_topi_xxx, and then AutoTVM will list it as a candidate while tuning. Is that correct?
Yes. You can specify a particular template to tune, as the tutorial does. I think the default behavior when tuning from a pre-specified graph (e.g., a model defined in NNVM) is to use only the direct template, since that is what task extraction produces.
For now, if you just want to experiment with templates, you can look at a standalone example to avoid the declaration and schedule boilerplate.
Direct with tuning is better than im2col in most cases. The benefit of im2col is that it makes it easy to use BLAS libraries, but we don't use those libraries in TVM.
The tile_size was chosen based on benchmarks of common networks.
We don't make it tunable because, in the current implementation, the tile size affects the size of the tuning space; i.e., tile size 2 and tile size 4 produce different tuning spaces. But you could try to eliminate this effect and make it tunable.
However, we also observed some measurement problems related to Winograd; see the thread 'Improved Direct + Winograd NCHWc CPU implementation, with ResNet-50 results'.
I think some fixes are required.
I have implemented an im2col AutoTVM version on ARM CPU, but I don't observe better performance than SpatialPack on MobileNet. So I don't think we need to add it, but you can try it yourself.
Thanks for such a clear explanation. I will check the details you mentioned.
The reason I ran into this question is that I thought Winograd F(6x6, 3x3) would theoretically perform better than F(4x4, 3x3), since the former eliminates more multiplications. But the test data told me a different story.
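My back-of-the-envelope count, considering only the element-wise products: an F(m x m, 3x3) tile takes (m + 2)^2 multiplications to produce m^2 outputs, versus 9 per output for direct convolution.

```python
# Multiplications per output pixel for Winograd F(m x m, 3 x 3):
# one tile performs (m + 3 - 1)^2 element-wise products and yields m^2 outputs.
# Direct convolution needs 3 * 3 = 9 multiplications per output.
def mults_per_output(m, r=3):
    return (m + r - 1) ** 2 / (m * m)

for m in (2, 4, 6):
    print(f"F({m}x{m}, 3x3): {mults_per_output(m):.2f} mults/output")
# F(2x2): 4.00, F(4x4): 2.25, F(6x6): 1.78
```

So on paper the larger tile saves the most multiplications, which is why I expected F(6x6) to win.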
Are there any other factors I should take into account when analyzing this?
I've noticed that you have done a lot of work on optimization for arm_cpu, and now I am on the same road.
I would appreciate it if you could provide some guidance on how to optimize convolution.
I followed your suggestion here (by the way, it helped a lot, thanks!). Now I can roughly understand how the compute/schedule process works for each operator.
But it's hard for me to go deeper (e.g., into the lowering process) since I lack compiler knowledge, which makes the subsequent work difficult.
Try to understand the concepts in the previous link, which are fundamental to convolution optimization. The link also covers the lowered IR you want to learn about.
After that, try to understand the existing schedule, spatial pack. Then try to modify it and implement your own schedule.