I am trying to convert a custom architecture from PyTorch to TVM. It's a straightforward architecture that simply reduces the dimensionality using 4 layers of MBConv to a total stride of 16. Inference on a single 512x512 image takes ~0.029 s on my GTX 1660 when running through PyTorch. When converted to TVM (targeting cuda) without autotuning, inference on the same image takes ~0.7 s.
I autotuned the model on all the conv2d and dense ops (10 hrs) using the default recommended parameters from the example given here: https://tvm.apache.org/docs/tutorials/autotvm/tune_relay_cuda.html. However, the improvement in inference time is negligible: it still takes ~0.55 s per image, roughly 19x slower than the original PyTorch inference.
In my experience, TVM causes a slowdown for any custom architecture other than those that (1) are fully convolutional, or (2) are the most commonly reported (not the most recent or most advanced) architectures like ResNet50/121. What causes this significant slowdown? Are only a limited set of ops supported for autotuning?
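For reference, the conversion and inference path I used looks roughly like this (a sketch; the model and input names are placeholders, and the exact return signature of relay.build varies slightly between TVM versions):

```python
import torch
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

# Trace the PyTorch model (MyMBConvNet is a placeholder for the 4-layer MBConv network).
model = MyMBConvNet().eval()
example = torch.randn(1, 3, 512, 512)
scripted = torch.jit.trace(model, example)

# Import into Relay and build for CUDA.
input_name = "input0"
mod, params = relay.frontend.from_pytorch(scripted, [(input_name, (1, 3, 512, 512))])
target = "cuda"
with tvm.transform.PassContext(opt_level=3):
    graph, lib, params = relay.build(mod, target=target, params=params)

# Run inference on the GPU.
ctx = tvm.gpu(0)
module = graph_runtime.create(graph, lib, ctx)
module.set_input(input_name, tvm.nd.array(example.numpy()), **params)
module.run()
out = module.get_output(0)
```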
Have you tried using the graph runtime debugger to break down op execution time?
It's hard to identify the problem without information about the model architecture or the tasks you have tuned.
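Something along these lines will print a per-op time table after run() (a sketch; it assumes the graph, lib, and params produced by relay.build, and the dump path is just an example):

```python
import numpy as np
import tvm
from tvm.contrib.debugger import debug_runtime

# Create the debug variant of the graph runtime; it profiles each node on run().
ctx = tvm.gpu(0)
m = debug_runtime.create(graph, lib, ctx, dump_root="/tmp/tvmdbg")
m.set_input("input0", tvm.nd.array(np.random.rand(1, 3, 512, 512).astype("float32")), **params)
m.run()  # prints a per-op breakdown (node name, time, shape), so the slow ops stand out
```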
I actually took a rather naive approach: I converted every single op used individually and compared it to PyTorch. The slowdown comes primarily from grouped convolutions, and then from depthwise separable convolutions (a sketch of that isolation test is shown after the block definition below).
class MobileResidual2D(Module):
    def __init__(self, in_channels, expansion_factor=1.5, activation=Relu()):
        super(MobileResidual2D, self).__init__()
        self.repr = f'MobileResidual2D({in_channels}, expansion_factor={expansion_factor})'
        self.activation = activation
        self.pointwise_0 = GroupedPointwiseConv2D(in_channels, int(in_channels * expansion_factor))
        self.norm0 = AutoNorm(int(in_channels * expansion_factor))
        self.depthwise = DepthwiseSeperableConv2D(int(in_channels * expansion_factor),
                                                  int(in_channels * expansion_factor),
                                                  padding=1)
        self.norm1 = AutoNorm(int(in_channels * expansion_factor))
        self.pointwise_1 = GroupedPointwiseConv2D(int(in_channels * expansion_factor), in_channels)
        self.norm2 = AutoNorm(in_channels)

    def forward(self, x):
        z = self.activation(self.norm0(self.pointwise_0(x)))
        z = self.activation(self.norm1(self.depthwise(z)))
        return x + self.norm2(self.pointwise_1(z))

    def __repr__(self):
        return self.repr
where AutoNorm is just GroupNorm with the number of groups set to the largest power of 2. I use this as the main block of the architecture, with AvgPool2D to reduce the dimensionality after each MBConv block.
The tasks I have tuned are nn.conv2d and nn.dense (for the end of the architecture).
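The per-op isolation test mentioned above looked roughly like this for one of the grouped convolutions (a sketch; the shapes and group count here are illustrative, not my exact layer configuration):

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

# Build a single grouped conv2d in Relay (illustrative shapes: 96 channels, 4 groups).
data = relay.var("data", shape=(1, 96, 128, 128), dtype="float32")
weight = relay.var("weight", shape=(96, 24, 3, 3), dtype="float32")  # OIHW, in_channels // groups = 24
out = relay.nn.conv2d(data, weight, kernel_size=(3, 3), padding=(1, 1), groups=4)
mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))

with tvm.transform.PassContext(opt_level=3):
    graph, lib, params = relay.build(mod, target="cuda")

ctx = tvm.gpu(0)
m = graph_runtime.create(graph, lib, ctx)
m.set_input("data", tvm.nd.array(np.random.rand(1, 96, 128, 128).astype("float32")))
m.set_input("weight", tvm.nd.array(np.random.rand(96, 24, 3, 3).astype("float32")))

# Time the op and compare against the equivalent torch.nn.Conv2d(96, 96, 3, padding=1, groups=4).
ftimer = m.module.time_evaluator("run", ctx, number=10, repeat=3)
print("mean runtime: %.4f s" % np.mean(ftimer().results))
```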
From your architecture, the tunable ops are indeed depthwise conv2d. Could you also post the tuning tasks extracted from the model (i.e., print(tasks)), which should include more useful information such as data layout and dtype?
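For example, something like this (following the tutorial, and assuming the mod and params from relay.frontend.from_pytorch) prints each task with its workload, layout, and dtype:

```python
from tvm import autotvm, relay

# Extract the tunable tasks for conv2d and dense and print their workloads.
tasks = autotvm.task.extract_from_program(
    mod["main"],
    target="cuda",
    params=params,
    ops=(relay.op.get("nn.conv2d"), relay.op.get("nn.dense")),
)
for i, task in enumerate(tasks):
    print(i, task)  # each entry shows the op, shapes/strides/groups, layout, and dtype
```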
Here is a PR to improve group conv2d for CPU: https://github.com/apache/incubator-tvm/pull/6137. I think GPU might need a similar improvement.