How to fuse conv2d and following elemwise op?

lixiaoquan · July 30, 2018, 10:45am

# feature format: NCHW
feature = tvm.placeholder((16, 3, 244, 244), name="feature")

# kernel format : output_channel, input_channel, kernel_height, kernel_width
kernel = tvm.placeholder((10, 3, 16, 16), name="kernel")

output_data = topi.nn.conv2d(feature, kernel, 1, 0)

relu_result = topi.nn.elemwise.relu(output_data)

s = tvm.create_schedule(output_data.op)

print(tvm.lower(s, [feature, kernel], simple_mode=True))

Code like this will generate two ‘produce’ blocks, is it possible to fuse them through primitive API?

masahi · July 30, 2018, 12:06pm

Yes, the key is to use compute_at(…). For example, x86 schedule uses it here to fuse convolution and the following operation (bias add, batch norm, relu).

lixiaoquan · July 31, 2018, 3:17am

Thanks for the tip, I have figured out how to use compute_at() to fuse conv2d and elemwise op.

Now I am trying to fuse next conv2d, but tvm reports ‘Invalid schedule’

# feature format: NCHW
feature = tvm.placeholder((16, 3, 244, 244), name="feature")

# kernel format : output_channel, input_channel, kernel_height, kernel_width
kernel = tvm.placeholder((10, 3, 16, 16), name="kernel")

output_data = topi.nn.conv2d(feature, kernel, 1, 0)

relu_result = topi.nn.elemwise.relu(output_data)

conv2_result = topi.nn.conv2d(relu_result, kernel, 1, 0)

ss = tvm.create_schedule(conv2_result.op)

ss[output_data].compute_at(ss[relu_result], relu_result.op.axis[3])

ss[relu_result].compute_at(ss[conv2_result], conv2_result.op.axis[0])

print(tvm.lower(ss, [feature, kernel], simple_mode=True))

Is it feasible? Or I have to split input feature manually?

Thanks a lot

masahi · July 31, 2018, 4:39am

Fusing multiple convolutions is not possible.

Cherry.Han · May 13, 2019, 7:08am

Why fuse multiple convolution is impossible?

masahi · May 13, 2019, 7:50am

Imagine how you would implement fused convolution. Let’s say we target GPU. Before you can start the second convolution on a single pixel, you have to wait neighbor pixels to finish their first convolution. This requires global sync at shared memory boundary. Since we need to store the output of the first convolution to the global memory, we don’t have any benefit from fusing.

For other architectures it might be doable, but at least in TVM we don’t fuse consecutive convolutions.

aca88 · May 13, 2019, 8:09am

I would probably differentiate between

NNVM (at least v1) had fusion rules which prevented an automatic fusioning (at NNVM level) of two neighbouring convolution layers, therefore all automatic generated TVM “tasks” (i.e. composition of stages) had only one conv layer
It is not possible to generate TVM tasks which describe two neighbouring convs
It is not possible to use TVM scheduling primitives to fuse (i.e. tvm.compute_at) two convs

AFAIK

Is true but is a limitation posed from how NNVM (v1?) was used during operator fusion.
Is false. You can check by defining two tvm.compute which describe two conv2ds and use tvm.lower go get a printout

produce conv1_res{
//code which implements conv2d goes here
}
produce conv2_res{
//code which implements conv2d goes here
}

Is undefined (I havent tried it). Conceptually, I think it is possible since there is an obvious produce consumer relation and the tensors shape relations are also known.

lixiuhong · May 14, 2019, 12:41pm

You mention that it requires a global sync, however, the sync is not necessary when we have redundant computations.
There are many examples in the Halide papers.
In fact, where and when to compute the pixels can bring different trade-off between producer-consumer locality, input locality, and redundant computation.
In my opinion, it can also bring a larger exploration space for performance tuning, when fusing two convs.
I think TVM is capable of generating the code that fuses two convs, but it cannot now.