Optimisation pass errors with sparse computation on the GPU

This is a follow-up to, and complete rewrite of, my earlier post, since I can no longer edit it. I’ve rewritten it to make the problem clearer, to provide simple working examples of the issue, and to describe some of the investigation I’ve done since.

I’m developing sparse versions of conv2d ops for TVM.

I’ve encountered an issue with optimisation levels on the GPU backend. When a network has a ReLU layer after a sparse convolutional layer and the optimisation level is set greater than 0, invalid code is generated, and I get the following error:

Did you forget to bind?
    Variable `T_relu` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.

The issue is, I can’t identify which optimisation pass is actually responsible for this. If I run at opt_level=0 but manually enable every documented optimisation pass (as listed for build_config in the docs), the code works fine. Even the pass "OpFusion", which I would have assumed to be responsible for the issue, is okay.

Conversely, even if I run at opt_level=3 and pass every optimisation to the disabled_pass argument of build_config, I get the same error.

This suggests that either 1) there is an undocumented optimisation pass responsible, or 2) the disabled_pass and enabled_pass arguments to build_config are being ignored.
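
For reference, the kind of configuration I’m describing looks roughly like this (a sketch using a stock conv2d + ReLU rather than my sparse op, so it won’t reproduce the error on its own; the real test is in the script below):

    import numpy as np
    import tvm
    from tvm import relay

    # Tiny conv2d + ReLU network, analogous to the test script described below.
    data = relay.var("data", shape=(1, 3, 32, 32), dtype="float32")
    weight = relay.var("weight", shape=(8, 3, 3, 3), dtype="float32")
    out = relay.nn.relu(relay.nn.conv2d(data, weight, kernel_size=(3, 3), padding=(1, 1)))
    mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))
    params = {"weight": tvm.nd.array(np.random.rand(8, 3, 3, 3).astype("float32"))}

    # opt_level=3 with fusion supposedly disabled -- with the sparse op this
    # configuration still produces the "Did you forget to bind?" error.
    with tvm.transform.PassContext(opt_level=3, disabled_pass=["OpFusion"]):
        lib = relay.build(mod, target="cuda", params=params)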

I made a simple script (a single conv2d layer + ReLU) that demonstrates the sparse code is correct on the CPU (even at opt_level=3), and on the GPU at opt_level=0. However, it reproduces the error on the GPU at opt_level=3:

python3 tvm_sparse_test.py --backend cpu --opt_level 3
python3 tvm_sparse_test.py --backend gpu --opt_level 0
python3 tvm_sparse_test.py --backend gpu --opt_level 3 # error 

All that is needed to run this code is to build my simplified version of the TVM v0.8 code that includes an implementation of sparse direct convolution. This is available as the v0.8-sparse-opt-issue branch of my fork.

The code for that can be found at python/tvm/topi/nn/conv2d_sparse.py. The actual sparse convolution is in the function csr_direct_convolution, and the code is the same for both CPU and GPU versions.

Does anyone have any suggestions as to why I’m experiencing this issue, and how I might determine whether hypothesis 1) or 2) is correct?

Hi @Wheest, the problem is that relu is being fused into your computation, but it is not being scheduled. If you mark your function as opaque (reg.register_pattern("nn.conv2d_sparse", OpPattern.OPAQUE)) it should fix the issue. If you still want relu to be fused into the conv2d, you’ll have to update your schedule to handle the case of relu being fused.
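
In code, the registration is just (assuming it lives alongside your other nn.conv2d_sparse registrations in the fork):

    from tvm.relay.op import op as reg
    from tvm.relay.op import OpPattern

    # OPAQUE tells the fusion pass not to fuse elementwise ops (such as relu)
    # into nn.conv2d_sparse, so the existing schedule remains valid.
    reg.register_pattern("nn.conv2d_sparse", OpPattern.OPAQUE)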

Thanks a lot @tkonolige, that works; pattern registration isn’t an area of the stack I was familiar with. Do you have an intuition for why this fusion was happening even when the "OpFusion" optimisation was explicitly disabled?

Regarding handling the fusion in my schedule, I think I understand how I would schedule the fused op; however, I’m not sure how I would identify when this fusion has happened.

Is there an IsFused boolean that would be available to my schedules?

Is there an example elsewhere in the codebase that has similar behaviour?

I’m not sure why fusion happens even with OpFusion disabled. Maybe disabled_pass doesn’t work?
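
One way to check would be to attach a pass-timing instrument to the PassContext and see whether the fusion pass appears in the report despite disabled_pass (assuming your TVM build has the pass instrumentation API):

    import tvm
    from tvm import relay
    from tvm.ir.instrument import PassTimingInstrument

    # Tiny stand-in module; the point is only to see which passes actually run.
    x = relay.var("x", shape=(1, 8), dtype="float32")
    mod = tvm.IRModule.from_expr(relay.Function([x], relay.nn.relu(x)))

    timing = PassTimingInstrument()
    with tvm.transform.PassContext(opt_level=3, disabled_pass=["OpFusion"],
                                   instruments=[timing]):
        relay.build(mod, target="llvm")
        report = timing.render()
    print(report)  # if disabling worked, the fusion pass should be absent here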

Here is an example of detecting fusion: https://github.com/apache/tvm/blob/main/python/tvm/topi/cuda/dense_tensorcore.py#L116. It’s a bit confusing though. My understanding is that you have to inspect the Tensors being passed to the scheduling function and see if they are part of a fused op.
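
Roughly, the pattern looks like this (just a sketch, not your fork’s actual schedule; the op tag and function name are placeholders):

    from tvm import te
    from tvm.topi.utils import traverse_inline

    def schedule_csr_conv2d_cuda(outs):
        # Hypothetical GPU schedule that handles a possibly-fused elementwise op.
        outs = [outs] if isinstance(outs, te.tensor.Tensor) else outs
        s = te.create_schedule([x.op for x in outs])

        def _callback(op):
            # The tag is a placeholder for whatever csr_direct_convolution sets.
            if op.tag == "csr_direct_convolution":
                conv = op.output(0)
                if conv.op in s.outputs:
                    # Nothing was fused: the conv is the final output.
                    output = conv
                else:
                    # An elementwise op (e.g. relu) was fused after the conv:
                    # schedule the real output and keep the conv in local scope.
                    output = s.outputs[0].output(0)
                    s[conv].set_scope("local")
                # ... bind blocks/threads on `output`, compute conv at its loops ...

        traverse_inline(s, outs[0].op, _callback)
        return s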

Thanks, this was a good suggestion, though unfortunately I couldn’t get it to work.

I think there are too many issues stemming from the IRBuilder expressions being an external op, and therefore not being interoperable with the normal TVM schedule language.

I’m not sure whether Ansor will make this irrelevant down the line, but for now it’s a barrier.