How do you measure the percentage of time spent in several CUDA kernels

Hello! I wrote an op composed of four CUDA kernels, and now I want to optimize it, so I need to know how the runtime is split among the four kernels. I tried nvprof but couldn't use it due to permission issues. Is there a similar profiling function in TVM? My current test code is as follows:

    import numpy as np
    import tvm
    from tvm.contrib import graph_runtime

    module = graph_runtime.create(graph, lib, ctx)
    data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype("float16"))
    module.set_input('data', data_tvm)
    module.set_input(**params)
    module.run()

The debug graph runtime can provide per-operator time breakdowns.
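For reference, here is a minimal sketch of how the debug graph runtime can be used in place of `graph_runtime`, reusing the objects from the snippet above (the `dump_root` path is just an example). Calling `run()` prints a table with the time spent in each operator:

    # Minimal sketch: swap graph_runtime for the debug graph runtime.
    # graph, lib, ctx, params, data_tvm are the same objects as in the snippet above.
    from tvm.contrib.debugger import debug_runtime

    module = debug_runtime.create(graph, lib, ctx, dump_root="/tmp/tvmdbg")
    module.set_input('data', data_tvm)
    module.set_input(**params)
    module.run()  # prints per-operator execution times and dumps a trace under dump_root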

However, I’m not sure it will give you the granularity you need. For example, I looked into the Conv2D op and wanted a breakdown of how much time was spent in padding, packing, convolution, and unpacking, but it only reported the total Conv2D time.

I suspect this is complicated by the fact that they share the same schedule. But if there is a way to get a finer-grained time breakdown, it would be good to know.
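One workaround I can think of, assuming you can build each of the four kernels as its own TVM function, is to time them separately with `time_evaluator` and compute the percentages yourself. A rough sketch (the schedules `s_pad`, `s_pack`, `s_conv`, `s_unpack` and their tensor argument lists are placeholders for your own kernels):

    # Rough sketch: build and time each kernel on its own, then compute its share.
    # s_pad / s_pack / s_conv / s_unpack and their tensor argument lists are
    # placeholders for the schedules and tensors of your own four kernels.
    import numpy as np
    import tvm

    ctx = tvm.gpu(0)
    kernels = {
        "pad":    (s_pad,    args_pad),
        "pack":   (s_pack,   args_pack),
        "conv":   (s_conv,   args_conv),
        "unpack": (s_unpack, args_unpack),
    }

    times = {}
    for name, (sched, args) in kernels.items():
        func = tvm.build(sched, args, target="cuda")
        # Allocate zero-filled device buffers matching each tensor's shape and dtype.
        bufs = [tvm.nd.array(np.zeros([int(d) for d in t.shape], dtype=t.dtype), ctx)
                for t in args]
        evaluator = func.time_evaluator(func.entry_name, ctx, number=100)
        times[name] = evaluator(*bufs).mean

    total = sum(times.values())
    for name, t in times.items():
        print("%s: %.3f ms (%.1f%%)" % (name, t * 1e3, 100.0 * t / total))

This only measures each kernel in isolation rather than inside the fused op, so the ratios are approximate, but it avoids needing nvprof permissions.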