[NNVM] conv2d_transpose is particularly slow

Although conv2d_transpose is intrinsically slower than conv2d, the gap between the two is much larger in NNVM than expected. This is possibly because conv2d_transpose doesn't have a custom schedule (one that looks into the output padding of the underlying conv and the input padding).
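
The main thing such a custom schedule would do is inline the intermediate dilate/pad stages so they aren't materialized as separate (mostly zero-filled) buffers. A minimal sketch, assuming the NNVM-era TOPI/TVM Python API; this is an illustration, not the schedule from the PR linked at the bottom:

```python
import tvm
import topi

def schedule_conv2d_transpose(outs):
    """Sketch: compute the elementwise dilate/pad stages inline with the
    conv loop instead of writing them out to temporary buffers."""
    s = tvm.create_schedule([x.op for x in outs])
    out_ops = set(x.op for x in outs)
    visited = set()

    def traverse(op):
        if op in visited:
            return
        visited.add(op)
        # dilate and pad are tagged as injective in TOPI; inlining them
        # removes the extra passes over memory.
        if topi.tag.is_injective(op.tag) and op not in out_ops:
            s[op].compute_inline()
        for t in op.input_tensors:
            if isinstance(t.op, tvm.tensor.ComputeOp):
                traverse(t.op)

    for op in out_ops:
        traverse(op)
    return s
```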

Here’s some benchmarking code.
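
The original script isn't reproduced here; the following is a minimal sketch of the TOPI half (default schedules, on CPU). The shapes, strides, and the exact topi.nn.conv2d_nchw / conv2d_transpose_nchw signatures are assumptions against the NNVM-era API and may need adjusting per TVM version:

```python
import numpy as np
import tvm
import topi
from topi.util import get_const_tuple

# Illustrative workload; these are NOT the shapes behind the numbers below.
N, CI, H, W, CO, K = 1, 8, 32, 32, 8, 3
ctx = tvm.cpu(0)

def usec_per_call(out, placeholders, shapes):
    """Build `out` with a default schedule and report mean usec/call."""
    s = tvm.create_schedule(out.op)
    f = tvm.build(s, placeholders + [out])
    args = [tvm.nd.array(np.random.uniform(size=sh).astype("float32"), ctx)
            for sh in shapes]
    args.append(tvm.nd.array(
        np.zeros(get_const_tuple(out.shape), dtype="float32"), ctx))
    mean_sec = f.time_evaluator(f.entry_name, ctx, number=100)(*args).mean
    return mean_sec * 1e6

data = tvm.placeholder((N, CI, H, W), name="data")
w = tvm.placeholder((CO, CI, K, K), name="w")    # conv2d weight
wt = tvm.placeholder((CI, CO, K, K), name="wt")  # conv2d_transpose weight

conv = topi.nn.conv2d_nchw(data, w, 1, 1)
tconv = topi.nn.conv2d_transpose_nchw(data, wt, (1, 1), (1, 1), "float32")

print("usec/call")
print("conv2d:           %f" % usec_per_call(
    conv, [data, w], [(N, CI, H, W), (CO, CI, K, K)]))
print("conv2d_transpose: %f" % usec_per_call(
    tconv, [data, wt], [(N, CI, H, W), (CI, CO, K, K)]))
```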

sample output:

```
usec/call

TOPI:
conv2d:           0.002746
conv2d_transpose: 0.043117
15x slowdown
--------------------------
NNVM:
conv2d:           0.002280
conv2d_transpose: 0.538102
conv2d_transpose: 0.062680 (with custom schedule)
236x slowdown (27x with custom schedule)
--------------------------
PyTorch:
conv2d:           0.005772
conv2d_transpose: 0.054538
conv2d_dx:        0.022895
9x slowdown
```

https://github.com/dmlc/tvm/pull/1075