Hi all, I’m trying to understand the kernel fusion done by TVM, and how it differs from the hand-written approach.
To my knowledge, if I want to fuse two neighbouring ops into one, I first need to replace the two ops with a custom op, and then provide an implementation for that new custom op. So I want to know how TVM does this automatically.
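To make the question concrete, here is a minimal pure-Python sketch of what I mean by the hand-written way (this is just an illustration, not TVM code): instead of materializing the intermediate result of the first op, the fused custom op computes both ops in a single loop.

```python
# Hand-written fusion sketch in plain Python (illustration only, not TVM code).

def add(a, b):
    # A standalone elementwise op: materializes its whole output.
    return [x + y for x, y in zip(a, b)]

def unfused(a, b, c):
    # Two neighbouring ops: the intermediate list `t` is fully materialized.
    t = add(a, b)
    return add(t, c)

def fused_add_add(a, b, c):
    # The hand-written "custom op": one loop, no intermediate buffer.
    return [x + y + z for x, y, z in zip(a, b, c)]

a, b, c = [1, 2], [3, 4], [5, 6]
assert unfused(a, b, c) == fused_add_add(a, b, c)  # both give [9, 12]
```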
Based on my current understanding, this mechanism rests on two parts:
- The Relay pass that fuses Relay functions, whose source I found in fuse_ops.cc. I can see that it builds new functions composed of several small functions, like %2 in the following example.
def @main(%x {virtual_device=VirtualDevice(device_type=2, virtual_device_id=0, target=Target(id=718a4b0, kind='cuda', keys={'cuda', 'gpu'}, attrs={'thread_warp_size': 20, 'arch': "sm_75", 'max_num_threads': 400, 'libs': ["cudnn"]}, host=Target(id=715ce80, kind='llvm', keys={'cpu'})))}: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %y {virtual_device=VirtualDevice(device_type=2, virtual_device_id=0, target=Target(id=718a4b0, kind='cuda', keys={'cuda', 'gpu'}, attrs={'thread_warp_size': 20, 'arch': "sm_75", 'max_num_threads': 400, 'libs': ["cudnn"]}, host=Target(id=715ce80, kind='llvm', keys={'cpu'})))}: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %z {virtual_device=VirtualDevice(device_type=2, virtual_device_id=0, target=Target(id=718a4b0, kind='cuda', keys={'cuda', 'gpu'}, attrs={'thread_warp_size': 20, 'arch': "sm_75", 'max_num_threads': 400, 'libs': ["cudnn"]}, host=Target(id=715ce80, kind='llvm', keys={'cpu'})))}: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, hash="acd8a0974305fc0a", virtual_device=VirtualDevice(device_type=2, virtual_device_id=0, target=Target(id=718a4b0, kind='cuda', keys={'cuda', 'gpu'}, attrs={'thread_warp_size': 20, 'arch': "sm_75", 'max_num_threads': 400, 'libs': ["cudnn"]}, host=Target(id=715ce80, kind='llvm', keys={'cpu'})))) -> Tensor[(1, 2), float32] {
%2 = fn (%p0: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %p1: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %p2: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, Primitive=1, hash="67c547bbbeab2d50") -> Tensor[(1, 2), float32] {
%0 = add(%p0, %p1) /* ty=Tensor[(1, 2), float32] */;
%1 = add(%p1, %p2) /* ty=Tensor[(1, 2), float32] */;
add(%0, %1) /* ty=Tensor[(1, 2), float32] */
} /* ty=fn (Tensor[(1, 2), float32], Tensor[(1, 2), float32], Tensor[(1, 2), float32]) -> Tensor[(1, 2), float32] */;
%2(%x, %y, %z) /* ty=Tensor[(1, 2), float32] */
}
- The second part is how code is generated for the fused function. In the example above, the compute and schedule of topi.add are already defined; how are the compute and schedule for the new fused function %2 generated? Is it just a matter of nesting the lambda expressions that define the compute of topi.add? I’m trying to find examples that make this process easy to understand. If you have any suggestions, pointers to the relevant source code, or blog posts about this, I’d appreciate the help.
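To show what I mean by "nesting the lambda expressions", here is my current mental model as a pure-Python sketch (again only an illustration; the real lowering happens inside TVM's TE/TIR, not like this): each producer's compute expression is inlined into its consumer, so the three adds in %2 collapse into one per-element expression with no buffers for the intermediates %0 and %1.

```python
# Sketch of inlining compute lambdas (illustration of the idea only;
# the actual fused-kernel generation is done by TVM, not like this).

# Compute rule of an elementwise add, written in the style of a TE lambda.
def add_compute(lhs, rhs):
    return lambda i: lhs(i) + rhs(i)

# Placeholders: functions from index -> value, standing in for input tensors.
x = [1.0, 2.0]
y = [3.0, 4.0]
z = [5.0, 6.0]
px = lambda i: x[i]
py = lambda i: y[i]
pz = lambda i: z[i]

# Mirror the body of %2: %0 = add(x, y); %1 = add(y, z); add(%0, %1).
t0 = add_compute(px, py)
t1 = add_compute(py, pz)
out = add_compute(t0, t1)   # producers are nested/inlined, no %0/%1 buffers

result = [out(i) for i in range(2)]
assert result == [12.0, 16.0]
```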