Hello there. I am looking at grouped convolution, and am incurring a massive performance penalty when using it.
In the Relay interface, there are different implementations for conv layers when groups==1, and when groups>1.
Ideally, we would hope for a 2x speedup when switching from a normal convolutional layer to a grouped convolution with groups==2. However, on several platforms I have tried, there is instead a ~4x slowdown.
I have been looking into the tvm implementation, but am not yet familiar enough with the design.
Any insights into why the penalty is happening?
You can see this notebook, which demonstrates the slowdown.
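For reference, the core of the comparison can be reproduced in a few lines of Relay. This is a minimal sketch assuming the Relay API of TVM ~v0.6; the shapes, target, and run count are arbitrary choices of mine rather than the ones in the notebook:

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

def build_conv(groups):
    # groups must divide both input and output channels (64 here).
    data = relay.var("data", shape=(1, 64, 56, 56))
    weight = relay.var("weight", shape=(64, 64 // groups, 3, 3))
    out = relay.nn.conv2d(data, weight, kernel_size=(3, 3),
                          padding=(1, 1), groups=groups)
    func = relay.Function([data, weight], out)
    graph, lib, params = relay.build(relay.Module.from_expr(func),
                                     target="llvm")
    mod = graph_runtime.create(graph, lib, tvm.cpu())
    mod.set_input("data", np.random.rand(1, 64, 56, 56).astype("float32"))
    mod.set_input("weight",
                  np.random.rand(64, 64 // groups, 3, 3).astype("float32"))
    return mod

for groups in (1, 2):
    timer = build_conv(groups).module.time_evaluator("run", tvm.cpu(),
                                                     number=100)
    print("groups=%d: %.3f ms" % (groups, timer().mean * 1e3))
```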
As I understand it, schedules are stored in topi/python/topi/, and the schedule for conv2d on x86 is in topi/python/topi/x86/conv2d.py.
I’ve been looking at the tutorial Introduction to TOPI, but I’m still trying to understand how to use the various tvm Python decorators so that schedules I write actually get picked up.
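For concreteness, the tutorial’s pattern applied to this op looks roughly like the sketch below (again assuming the ~v0.6 API): declare the compute with topi.nn.group_conv2d_nchw, then ask the generic dispatcher for a schedule under a concrete target. With no x86 schedule registered, the dispatcher falls back to a default, largely unoptimized schedule, which as far as I can tell is part of why grouped conv is slow here:

```python
import tvm
import topi

data = tvm.placeholder((1, 32, 56, 56), name="data")
kernel = tvm.placeholder((32, 16, 3, 3), name="kernel")  # groups == 2

# The generic dispatcher picks a schedule based on the current target.
with tvm.target.create("llvm"):
    out = topi.nn.group_conv2d_nchw(data, kernel, stride=1, padding=1,
                                    dilation=1, groups=2,
                                    out_dtype="float32")
    s = topi.generic.schedule_group_conv2d_nchw([out])
    func = tvm.build(s, [data, kernel, out], name="group_conv2d")
```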
I’m looking to see if I can add x86 and arm_cpu schedules for group_conv2d_nchw.
I imagine I should add a group_conv2d.py to both topi/python/topi/x86 and topi/python/topi/arm_cpu. However, is there anything else that is essential, or are there docs that might be helpful?
What should the decorators in those group_conv2d.py files be?
In my explorations, I have tried to force usage of topi.nn.group_conv2d_nchw when groups==1 (by commenting out the first if statement case in python/tvm/relay/op/nn/_nn.py:compute_conv2d).
However, the creation of the tvm.compute definition in topi/python/topi/nn/conv2d.py:group_conv2d_nchw fails with RuntimeError("Cannot find workload in attribute of this schedule").
There’s nothing in this function that I can see that would cause this, unless the tag="group_conv2d_nchw" argument in the tvm.compute definition is getting picked up somewhere.
For the compute part, add autotvm.register_topi_compute(nn.group_conv2d_nchw, ['cpu'], 'direct', nn.group_conv2d_nchw.fdefault), or use @autotvm.register_topi_compute(nn.group_conv2d_nchw, 'cpu', 'direct') as a decorator if you have a custom compute function.
For the schedule part, add @autotvm.register_topi_schedule(generic.schedule_group_conv2d_nchw, ['cpu'], ['direct']) to your schedule function. A sketch of both registrations together is below.
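To make that concrete, here is a skeleton of what topi/python/topi/x86/group_conv2d.py could look like, assuming the autotvm API of TVM ~v0.6; the function names are placeholders of mine, and the schedule body is deliberately naive. The arm_cpu variant would be the same file with 'arm_cpu' as the target key:

```python
import tvm
from tvm import autotvm
from .. import nn, generic

@autotvm.register_topi_compute(nn.group_conv2d_nchw, 'cpu', 'direct')
def group_conv2d_nchw_x86(cfg, data, kernel, stride, padding, dilation,
                          groups, out_dtype):
    # Reuse the default compute; a real template would declare tuning
    # knobs on cfg here.
    return nn.group_conv2d_nchw.fdefault(data, kernel, stride, padding,
                                         dilation, groups, out_dtype)

@autotvm.register_topi_schedule(generic.schedule_group_conv2d_nchw,
                                'cpu', 'direct')
def schedule_group_conv2d_nchw_x86(cfg, outs):
    # Deliberately naive: a real schedule would split/reorder/vectorize.
    return tvm.create_schedule([x.op for x in outs])
```

As far as I can tell, registering the compute this way is also what attaches the 'workload' attribute to the resulting op, which is what the RuntimeError above was complaining about.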
How can I improve grouped convolution performance? Right now it is much slower than MXNet inference.
TVM takes 570 ms where MXNet takes 45 ms, for a ResNet-50 that uses group conv, on x86.
You can copy the schedule for conv2d; this will at least bring some improvement. You will need to write a specific schedule template for group_conv2d if you want further improvement.
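As a starting point, a direct schedule in that spirit might look like the sketch below (my own simplification, not the actual topi conv2d schedule): split and vectorize the innermost spatial axis, and parallelize the fused batch and output-channel loops. It assumes the output width is divisible by the vector factor:

```python
import tvm

def schedule_group_conv2d_nchw_direct(outs):
    s = tvm.create_schedule([x.op for x in outs])
    conv = outs[0].op          # assumes the group conv is the final op
    n, co, h, w = conv.axis    # batch, out channel, height, width
    rc, ry, rx = conv.reduce_axis
    wo, wi = s[conv].split(w, factor=8)   # 8 = assumed vector width
    # Keep the vectorized spatial block innermost, inside the reduction
    # loops, the same loop shape the x86 conv2d schedules use.
    s[conv].reorder(n, co, h, wo, rc, ry, rx, wi)
    s[conv].vectorize(wi)
    s[conv].parallel(s[conv].fuse(n, co))
    return s
```

A real template would expose the split factors as tunable knobs (e.g. via cfg.define_split in autotvm) and add data reuse, the way the existing x86 conv2d templates do.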