When I use auto-tvm to tune on ARM CPU, I find that our depthwise conv2d does not perform very well. Why don't we use winograd to schedule depthwise conv2d like we do for conv2d? Or do we have other optimizations for depthwise conv2d? @merrymercy
What are your workloads? I can try to optimize some if I have spare cycles
When you say it does not perform well, do you have any baseline?
@merrymercy I previously compared with ncnn; our depthwise conv2d is not faster. The model is MobileNet v2. I also ported auto-tvm to x86, where depthwise conv2d is also slower than OpenVINO (MKL-DNN), by almost 4x. I noticed that depthwise conv2d doesn't have a spatial_pack or winograd template like conv2d does, so I just want to confirm whether that is the reason.
Yes. I can try some other templates for depthwise conv2d.
Do you mean spatial_pack, winograd, or something else? I also plan to try, but I want to sync with you first and understand why we don't use winograd / spatial_pack for depthwise conv2d the way we do for conv2d.
Spatial pack is fine. One quick thing you can try is to pre-compute the kernel packing by registering alter_op_layout.
I think our old baseline was too weak, so we didn't spend time on depthwise conv2d.
By pre-computing in alter_op_layout, do you mean something specific to depthwise conv2d? If I remember correctly, we run auto-tvm under O2 without the alter_op_layout pass. Or do you mean we should tune with this pass enabled?
Yes, when we compare against good frameworks we don't have an advantage. One more thing: when compared with MKL-DNN, conv2d is also not faster; my numbers are about 1.5x slower on average on the MobileNet model.
Do we have any plan to support GPUs in auto-tvm, like Mali or NVIDIA?
There is a kernel packing stage in depthwise conv2d which can be computed in advance, but I haven't written an alter_op_layout for it. You can add it by following the logic in spatial pack (you will need to rewrite the compute declaration).
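For anyone following along, here is a rough sketch of what "pre-compute the kernel packing" could look like as a rewritten compute declaration. The shapes, the VC blocking factor, and the packed layout are only illustrative (the real spatial_pack template packs differently and handles padding/strides); the point is just that the packing becomes its own stage over the constant weight, which alter_op_layout can then expose so it gets folded at compile time:

```python
import tvm
from tvm import te

# Illustrative shapes; VC (the blocking factor on the channel axis) would be a
# tunable knob in a real template. Channel multiplier is fixed to 1 here.
N, C, H, W = 1, 32, 56, 56
KH = KW = 3
VC = 4

data = te.placeholder((N, C, H, W), name="data")
kernel = te.placeholder((C, 1, KH, KW), name="kernel")

# Separate packing stage. If this is exposed as its own operator on the
# constant weight (what alter_op_layout enables), it can be folded at compile
# time instead of being re-executed on every inference call.
kernel_packed = te.compute(
    (C // VC, KH, KW, VC),
    lambda co, kh, kw, vc: kernel[co * VC + vc, 0, kh, kw],
    name="kernel_packed",
)

rkh = te.reduce_axis((0, KH), name="rkh")
rkw = te.reduce_axis((0, KW), name="rkw")
# Minimal depthwise conv (stride 1, no padding) that reads the packed kernel.
out = te.compute(
    (N, C, H - KH + 1, W - KW + 1),
    lambda n, c, h, w: te.sum(
        data[n, c, h + rkh, w + rkw] * kernel_packed[c // VC, rkh, rkw, c % VC],
        axis=[rkh, rkw],
    ),
    name="depthwise_out",
)

s = te.create_schedule(out.op)
print(tvm.lower(s, [data, kernel, out], simple_mode=True))
```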
I will send auto-tvm support for Mali and NVIDIA this week.
On a 1080 Ti we can match MXNet + TensorRT on ResNet/VGG, and we can be faster on MobileNet v1.
Did you try this: https://github.com/dmlc/tvm/issues/1585 ?
I think conv2d should be good there, but they don't optimize for depthwise conv2d.
I overrode conv2d on x86 using auto-tvm, so I no longer use conv2d_common / conv2d_1x1. But it is worth investigating. Thanks!
I will try it tomorrow and report back on performance if we get a better result.
I took a look at the schedule and found that it didn't fuse the bias/relu. I have sent a fix: https://github.com/dmlc/tvm/pull/1631
Since it doesn’t change the config space, you can use your tuned configs directly. In old schedule the bias stage is even not computed in parallel. So the more cpu cores you have, the more speed you can get by this fix.
I also tried kernel pre-packing and found it doesn't help (speedup of less than 1 ms).
@merrymercy Do you have the kernel pre-packing code in some branch? I can try it in my environment.
You can change this line
to
s[B1].pragma(c, 'debug_skip_region')
Then TVM will skip the kernel packing stage. You can use this trick to see the performance impact, but the output will be wrong.
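In case it helps, here is a self-contained toy version of the trick (the stage names are made up; "pack" just stands in for the kernel-packing stage of the depthwise template):

```python
import tvm
from tvm import te

# "pack" stands in for the kernel-packing stage; "out" for the conv output.
A = te.placeholder((16, 16), name="A")
pack = te.compute((16, 16), lambda i, j: A[i, j] * 2.0, name="pack")
out = te.compute((16, 16), lambda i, j: pack[i, j] + 1.0, name="out")

s = te.create_schedule(out.op)
# Emit no code for the packing stage. The result becomes numerically wrong,
# but timing the kernel now shows how much the packing stage was costing.
s[pack].pragma(pack.op.axis[0], "debug_skip_region")
print(tvm.lower(s, [A, out], simple_mode=True))
```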
Thanks. It only speeds up one depthwise layer by 0.2 ms; the other layers don't get any speedup.
It just occurred to me that my test model is MobileNet, where many convolutions have 1x1 kernels. So maybe im2col is better than our current spatial_pack? Or should we add special handling for 1x1 kernels in spatial_pack? I noticed that Caffe2 uses im2col for 1x1 convolutions.
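For reference, with a 1x1 kernel and stride 1 the im2col lowering degenerates into a plain GEMM, since the "patch matrix" is just the input reshaped. A quick NumPy check with illustrative shapes:

```python
import numpy as np

# Illustrative shapes for a MobileNet-like 1x1 layer.
N, C_in, H, W = 1, 32, 56, 56
C_out = 64
data = np.random.randn(N, C_in, H, W).astype("float32")
weight = np.random.randn(C_out, C_in, 1, 1).astype("float32")

# With a 1x1 kernel and stride 1, "im2col" is just a reshape of the input,
# and the convolution is a single GEMM per image.
col = data.reshape(N, C_in, H * W)
w2d = weight.reshape(C_out, C_in)
out = np.einsum("oc,ncp->nop", w2d, col).reshape(N, C_out, H, W)

# Direct 1x1 convolution as a reference check.
ref = np.einsum("ocij,nchw->nohw", weight, data)
assert np.allclose(out, ref, atol=1e-3)
```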
There are layers in MobileNet where im2col should be better; we just don't have those templates checked in yet.
I implemented im2col quickly; it performs better than spatial_pack.
I think you will be interested in https://github.com/dmlc/tvm/issues/1596#issuecomment-415894262