The module performs a linear transform, and I dispatch it to a pre-scheduled op from MLC-LLM's dolly-v2-3b.
I profiled it by running the script above under Nsight Compute.
The resulting Nsight Compute profile report contains 5 kernels named fused_NT_matmul1_add3.
However, the script runs inference only once, and the prim_func fused_NT_matmul1_add3 is called only once in the relax_func smallLernels.
Intuitively, there should be only one CUDA kernel.
This is quite confusing. Could anyone help explain why?