The module performs a linear transform, and I dispatch it to a pre-scheduled op from MLC-LLM's dolly-v2-3b.
I profiled it by running the script above under Nsight Compute.
According to Nsight Compute, the profile report contains 5 kernels named
fused_NT_matmul1_add3.
But in the script, I run inference only once, and the prim_func
fused_NT_matmul1_add3 is called only once in the relax_func
smallLernels. Intuitively, there should be only one CUDA kernel.
This is quite confusing. Could anyone help explain why?