Possible issue with conv transpose (very slow)

Hello. I am pushing a U-Net-like model through TVM after seeing the impressive benchmarks on the TVM webpage.

I think I am experiencing something similar to [NNVM] conv2d_transpose is particularly slow, though I’m not sure. This is the network I create in PyTorch and export to ONNX:

import torch.nn as nn

net = nn.Sequential(nn.ConvTranspose2d(in_channels=128, out_channels=128, kernel_size=2, stride=2),
                    nn.ReLU(),
                    nn.ConvTranspose2d(in_channels=128, out_channels=128, kernel_size=2, stride=2),
                    nn.ReLU(),
                    nn.ConvTranspose2d(in_channels=128, out_channels=64, kernel_size=2, stride=2))

This takes 0.44 seconds on average to compute in PyTorch.
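For completeness, the export step looks roughly like this (a sketch, not my exact script; the dummy input shape and the input name are chosen to match the shape_dict used below):

import torch

# Export the model to ONNX; the dummy input fixes the graph's input
# shape, and input_names must match the key used in shape_dict below.
dummy_input = torch.randn(1, 128, 128, 128)
torch.onnx.export(net, dummy_input, 'nodilation.onnx', input_names=['input'])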

When I want to optimize this model in TVM with

import onnx
from tvm import relay, autotvm

target = 'llvm -mcpu=core-avx2'

onnx_model = onnx.load('nodilation.onnx')

input_name = 'input'
input_shape = (1, 128, 128, 128)
shape_dict = {input_name: input_shape}

net, params = relay.frontend.from_onnx(onnx_model, shape_dict)

tasks = autotvm.task.extract_from_program(net, target=target,
                                          params=params,
                                          ops=(relay.op.nn.conv2d, relay.op.nn.conv2d_transpose))

I see that tasks[0].workload is None, and I am not able to optimize the up-convolutions, which results in a very slow model (50 seconds per call).
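For reference, this is how I am checking the extracted tasks (a small sketch; as far as I understand, a workload of None means no tunable template is registered for that op on this target):

# Print each extracted task; tasks whose workload is None fall back
# to the untuned default schedule.
for i, task in enumerate(tasks):
    print(i, task.name, task.workload)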

I am new to TVM, so my questions are:

  1. Am I doing something wrong, and if so how can I fix it?
  2. If it has to do with TVM’s support for transposed convs, are you planning to support this soon? How can I help?

Thanks!
Carlos

It looks like our x86 backend doesn’t have a schedule for conv transpose, so it is likely that you are using the very slow default schedule (single thread, no vectorization).

Unless you have a good reason to use conv transpose, I recommend using the upsampling op instead.

Thanks, that would explain it. We will check whether we can replace the conv transpose operation with upsampling and convolution.
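Something like this sketch is what we have in mind, assuming a plain upsample-then-convolve block is acceptable (it is not mathematically equivalent to conv transpose, and the kernel size, padding, and nearest-neighbor mode here are our own choices to preserve the output shape):

import torch.nn as nn

# Upsample by 2x, then apply a regular convolution that the x86
# backend can schedule and tune; kernel_size=3 with padding=1 keeps
# the upsampled spatial dimensions unchanged.
up_block = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),
    nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, padding=1))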

I also realized that I cannot optimize dilated convolutions for CPU. Namely:

net = nn.Sequential(nn.Conv2d(in_channels=128, out_channels=128, kernel_size=2, dilation=(2, 2)),
                    nn.ReLU(),
                    nn.Conv2d(in_channels=128, out_channels=128, kernel_size=2, dilation=(2, 2)),
                    nn.ReLU(),
                    nn.Conv2d(in_channels=128, out_channels=64, kernel_size=2, dilation=(2, 2)))

I am getting the following error:

Traceback (most recent call last):
  File "tuning.py", line 134, in <module>
    tune_and_evaluate(tuning_option)
  File "tuning.py", line 108, in tune_and_evaluate
    tune_kernels(tasks, **tuning_opt)
  File "tuning.py", line 76, in tune_kernels
    target=target, template_key='direct')
  File "/usr/tvm/python/tvm/autotvm/task/task.py", line 175, in create
    sch, _ = func(*args)
  File "/usr/tvm/topi/python/topi/x86/conv2d.py", line 279, in _topi_nn_conv2d_NCHWc
    data_layout, out_layout, dtype)
  File "/usr/tvm/topi/python/topi/x86/conv2d.py", line 372, in _declaration_conv_NCHWc
    assert (dh, dw) == (1, 1), "Does not support dilation"
AssertionError: Does not support dilation

Is this also a known issue with no workaround for now? We are using dilated filters to achieve larger receptive fields.

If you are compiling with opt_level = 3, which turns on NCHWc convolution, then yes, it seems the NCHWc conv op doesn’t support dilation. With opt_level = 2, dilated convolution should work, but I don’t expect it to be fast.
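Something along these lines selects the optimization level at build time (a sketch against the Relay build API; adapt it to your build script):

from tvm import relay

# opt_level=2 skips the NCHWc layout transform, so the dilated
# convolutions compile, at the cost of the faster NCHWc schedules.
with relay.build_config(opt_level=2):
    graph, lib, params = relay.build(net, target=target, params=params)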

Please understand that people are focusing on ImageNet models, so anything that falls outside of them has limited support at the moment. But it is your chance to contribute :wink: That’s how I ended up adding support for the upsampling op, etc.

Thanks, that explains what I’m observing.
I’d like to contribute, though right now it may not be the best time.

Can you recommend a PR to look at to find out how to implement support for operations such as dilated conv or conv transpose? I’m still very new to the internals of TVM.

I think you could refer to the arm_cpu implementation: https://github.com/dmlc/tvm/tree/master/topi/python/topi/arm_cpu

This seems like a good item for the call for better community support: https://github.com/dmlc/tvm/issues/2658

Yeah, @yzhliu and I have also seen this before. For a configuration like in_channel=256, out_channel=64, in_height=256, in_width=256, it takes several minutes.