[unity] Confused about CUDA kernel codegen

The module is doing a linear transform, and I dispatch it to a pre-scheduled op in MLC-LLM dolly-v2-3b.

I profiled it by running the script above with Nsight Compute. In the resulting report, there are 5 kernels named fused_NT_matmul1_add3. But in the script I run inference only once, and the prim_func fused_NT_matmul1_add3 is called only once in the relax_func smallLernels. Intuitively, there should be only one CUDA kernel launch. It is quite confusing; could anyone help explain why?

I don’t know exactly why it occurs 5 times, but you can indeed inspect kernel launches using Relax’s VM profiler.
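For example, a minimal sketch, assuming `ex` is the built Relax executable and `inp` is the input tensor (both hypothetical names):

```python
import tvm
from tvm import relax

dev = tvm.cuda(0)
# Enable the VM's built-in profiler; vm.profile returns a report
# with per-kernel call counts and timings.
vm = relax.VirtualMachine(ex, dev, profile=True)
report = vm.profile("main", inp)
print(report)
```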

Maybe there’s something wrong with Nsight Compute; I will check it. Thank you!

The code snippet above is part of the lowered module that TVM generated. I think the reason is the loop `for i1_0 in range(5):`. The generated CUDA kernel has launch configuration <<<40, 128>>>, so it would be called 5 times on the CUDA stream.

That is right. In this case it seems the for loop ends up outside of the kernel launch, causing the kernel to be called five times.
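Roughly, the lowered TIR follows this pattern. A minimal sketch with hypothetical shapes and names (not the actual fused_NT_matmul1_add3): the serial i1_0 loop encloses the thread bindings, so it stays on the host and wraps a single <<<40, 128>>> launch.

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def add_one(A: T.Buffer((5, 40, 128), "float32"),
            B: T.Buffer((5, 40, 128), "float32")):
    for i1_0 in T.serial(5):  # serial loop, not bound to any GPU axis: stays on the host
        for bx in T.thread_binding(40, thread="blockIdx.x"):        # grid dim
            for tx in T.thread_binding(128, thread="threadIdx.x"):  # block dim
                with T.block("compute"):
                    v0, v1, v2 = T.axis.remap("SSS", [i1_0, bx, tx])
                    B[v0, v1, v2] = A[v0, v1, v2] + T.float32(1)
```

After SplitHostDevice, the device side contains only the <<<40, 128>>> kernel body, while the host side loops five times launching it, which is exactly what shows up as five entries in the profile.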

Above is the linear transformation I defined in Relax.
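Since the snippet is an image, here is a minimal sketch of what such a definition might look like, with hypothetical shapes and names (not the exact module from my script):

```python
import tvm
from tvm.script import relax as R

@tvm.script.ir_module
class LinearModule:
    @R.function
    def main(x: R.Tensor((1, 2560), "float32"),
             w: R.Tensor((2560, 2560), "float32"),
             b: R.Tensor((2560,), "float32")) -> R.Tensor((1, 2560), "float32"):
        with R.dataflow():
            lv = R.matmul(x, w)   # linear transform
            out = R.add(lv, b)    # bias add
            R.output(out)
        return out
```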

Above is the corresponding IRModule after some passes. There we can see that matmul and add are fused into one kernel (prim_func) by TVM. Above is also the corresponding CUDA source code. What confuses me is that there are actually two CUDA kernels generated: fused_matmul_add_kernel0 does the matmul and fused_matmul_add_kernel1 does the add. Wouldn't that be contrary to the original intention of op fusion? Maybe there will be some performance loss?
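For reference, one way to double-check how many device kernels a module lowers to is to dump the generated CUDA source and count the __global__ definitions. A hedged sketch, reusing the hypothetical LinearModule above with a simplified fuse-and-default-schedule pipeline (not the exact passes from my script):

```python
import tvm
from tvm import dlight as dl
from tvm import relax

target = tvm.target.Target("cuda")
# Lower Relax ops to TIR and fuse matmul + add into one prim_func.
seq = tvm.transform.Sequential([
    relax.transform.LegalizeOps(),
    relax.transform.AnnotateTIROpPattern(),
    relax.transform.FuseOps(),
    relax.transform.FuseTIR(),
])
mod = seq(LinearModule)
# Apply a generic GPU schedule so the module can be built for CUDA.
with target:
    mod = dl.ApplyDefaultSchedule(dl.gpu.Matmul(), dl.gpu.Fallback())(mod)
ex = relax.build(mod, target)
# The imported module holds the generated CUDA source.
cuda_src = ex.mod.imported_modules[0].get_source()
print(cuda_src.count("__global__"), "device kernels generated")
```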