Questions about Op Fusion in TVM

Hi all, I’m trying to learn about the kernel fusion done by TVM and how it differs from the hand-written approach.

To my knowledge, if I want to fuse two neighboring ops into one by hand, I first have to replace the two ops with a custom op and then provide an implementation for that new custom op. So I want to understand how TVM does this automatically.

Based on my current understanding, this mechanism relies on two pieces:

  1. The Relay pass that fuses Relay functions, whose source I found in FuseOps.cc. I can see that it builds new functions composed of several small functions, like %2 in the following example.
def @main(%x {virtual_device=VirtualDevice(device_type=2, virtual_device_id=0, target=Target(id=718a4b0, kind='cuda', keys={'cuda', 'gpu'}, attrs={'thread_warp_size': 20, 'arch': "sm_75", 'max_num_threads': 400, 'libs': ["cudnn"]}, host=Target(id=715ce80, kind='llvm', keys={'cpu'})))}: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %y {virtual_device=VirtualDevice(device_type=2, virtual_device_id=0, target=Target(id=718a4b0, kind='cuda', keys={'cuda', 'gpu'}, attrs={'thread_warp_size': 20, 'arch': "sm_75", 'max_num_threads': 400, 'libs': ["cudnn"]}, host=Target(id=715ce80, kind='llvm', keys={'cpu'})))}: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %z {virtual_device=VirtualDevice(device_type=2, virtual_device_id=0, target=Target(id=718a4b0, kind='cuda', keys={'cuda', 'gpu'}, attrs={'thread_warp_size': 20, 'arch': "sm_75", 'max_num_threads': 400, 'libs': ["cudnn"]}, host=Target(id=715ce80, kind='llvm', keys={'cpu'})))}: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, hash="acd8a0974305fc0a", virtual_device=VirtualDevice(device_type=2, virtual_device_id=0, target=Target(id=718a4b0, kind='cuda', keys={'cuda', 'gpu'}, attrs={'thread_warp_size': 20, 'arch': "sm_75", 'max_num_threads': 400, 'libs': ["cudnn"]}, host=Target(id=715ce80, kind='llvm', keys={'cpu'})))) -> Tensor[(1, 2), float32] {
  %2 = fn (%p0: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %p1: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %p2: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, Primitive=1, hash="67c547bbbeab2d50") -> Tensor[(1, 2), float32] {
    %0 = add(%p0, %p1) /* ty=Tensor[(1, 2), float32] */;
    %1 = add(%p1, %p2) /* ty=Tensor[(1, 2), float32] */;
    add(%0, %1) /* ty=Tensor[(1, 2), float32] */
  } /* ty=fn (Tensor[(1, 2), float32], Tensor[(1, 2), float32], Tensor[(1, 2), float32]) -> Tensor[(1, 2), float32] */;
  %2(%x, %y, %z) /* ty=Tensor[(1, 2), float32] */
}
  2. The second part is how code is generated for the fused function. In the above example we have already defined the compute and schedule of topi.add, but how are the compute and schedule for the new fused function %2 generated? Is it just a matter of nesting the lambda expressions that define the compute of topi.add, as in the sketch below? I’m trying to find examples that make this process easy to understand. If you have any suggestions, pointers to the source code, or blog posts about this, I would really appreciate the help.
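
To make the question concrete, here is my rough guess in TE form (this is only my mental model, not necessarily the actual codegen path):

import tvm
from tvm import te

# My guess at what the fused %2 boils down to: out = (x + y) + (y + z),
# written by chaining te.compute stages the same way the ops' compute
# definitions would be chained.
x = te.placeholder((1, 2), name="x")
y = te.placeholder((1, 2), name="y")
z = te.placeholder((1, 2), name="z")

t0 = te.compute(x.shape, lambda i, j: x[i, j] + y[i, j], name="t0")
t1 = te.compute(x.shape, lambda i, j: y[i, j] + z[i, j], name="t1")
out = te.compute(x.shape, lambda i, j: t0[i, j] + t1[i, j], name="out")

s = te.create_schedule(out.op)
# Inlining the intermediate stages leaves a single fused loop nest.
s[t0].compute_inline()
s[t1].compute_inline()
print(tvm.lower(s, [x, y, z, out], simple_mode=True))

Is this nesting/inlining roughly what happens internally, or is the real mechanism different?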

@masahi I noticed you have discussed similar topics; may I bother you for an answer? Thank you!

See the FuseOps Relay pass.
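
For example, you can run the pass standalone and print what it produces (a minimal sketch; it should give you a fused Primitive function like the %2 in your dump):

import tvm
from tvm import relay

x = relay.var("x", shape=(1, 2))
y = relay.var("y", shape=(1, 2))
z = relay.var("z", shape=(1, 2))
f = relay.Function([x, y, z], relay.add(relay.add(x, y), relay.add(y, z)))
mod = tvm.IRModule.from_expr(f)

mod = relay.transform.InferType()(mod)
mod = relay.transform.FuseOps(fuse_opt_level=2)(mod)
print(mod)  # the fused Primitive function shows up here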

  1. The Relay pass that fuses Relay functions, whose source I found in FuseOps.cc. I can see that it builds new functions composed of several small functions, like %2 in the following example.

I have seen the code that does the graph-level transformation, which generates relay.Functions at a high level. But I think the more important part is how code is generated for the fused functions, so I’m asking for suggestions on how to understand that.

Often the performance bottleneck is being memory-bound. Fusing more operators can save h2d/d2h time and make the kernel more compute-intensive.

I think ops with patterns like elemwise, broadcast, and injective always follow the schedule strategy of ops with patterns like comm_reduce and out_elewise_fusable, because the latter are complex to schedule.

You could try another example: conv2d followed by relu, and compare it with a single conv2d to see the difference between them.

One fused kernel has only one schedule strategy, so the most complex op provides the main strategy and the rest follow it.
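
Roughly, this is the usual TOPI schedule pattern (a simplified sketch, not the full conv2d schedule):

from tvm import te
from tvm.topi.utils import traverse_inline

# One fused kernel, one schedule: traverse_inline() walks from the output
# towards the "master" op, compute_inline()-ing the injective
# (elemwise/broadcast) stages it passes, then the callback applies the
# master op's own strategy (tiling, thread binding, ...).
def schedule_conv2d_fused(outs):
    # outs: output tensors of the fused conv2d + elemwise compute
    s = te.create_schedule([x.op for x in outs])

    def _callback(op):
        if "conv2d" in op.tag:
            # a real schedule would tile this stage and bind GPU threads here
            pass

    traverse_inline(s, outs[0].op, _callback)
    return s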

You could dump the generated code to see more details:

# func is the runtime module returned by tvm.build (or lib.get_lib() from relay.build)
dev_module = func.imported_modules[0]
print(dev_module.get_source("cu"))
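
For instance, continuing the conv2d + relu suggestion (a rough sketch; it needs a CUDA toolchain, and the shapes/names here are just for illustration):

import tvm
from tvm import relay

data = relay.var("data", shape=(1, 3, 224, 224))
weight = relay.var("weight", shape=(16, 3, 3, 3))
conv = relay.nn.conv2d(data, weight, padding=(1, 1))
net = relay.Function([data, weight], relay.nn.relu(conv))
mod = tvm.IRModule.from_expr(net)

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="cuda")

# conv2d and relu end up in one fused CUDA kernel; drop the relu and
# rebuild to compare the generated source.
dev_module = lib.get_lib().imported_modules[0]
print(dev_module.get_source())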

Thank you. But it seems op fusion mainly reduces memory loads and stores. Also, op fusion follows certain pattern rules; I see TVM describes four types of fusion patterns.

Are you inferring this, or can you point to the code that does it? I want to confirm exactly what happens to the compute and schedule, i.e. how the template for the fused function is generated.

You could look at LowerTEPass (in src/relay/backend/te_compiler.cc).

Got it. I’ll try to read the code there. Thank you!