Hello! I wrote an op composed of four CUDA kernels and now want to optimize it, so I need to know what fraction of the total run time each kernel accounts for. I tried nvprof, but could not use it because of permission issues. Does TVM provide a comparable profiling facility? My current test code is as follows:
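import numpy as np
import tvm
from tvm.contrib import graph_runtime

# graph, lib, params, ctx and input_shape come from the build step (not shown here)
module = graph_runtime.create(graph, lib, ctx)
data_tvm = tvm.nd.array(np.random.uniform(size=input_shape).astype("float16"))
module.set_input('data', data_tvm)
module.set_input(**params)
module.run()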
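To make the goal concrete, here is a minimal sketch of the breakdown I am after. It assumes I already have one timing per kernel from some profiler. The helper `time_ratios`, the kernel names, and the millisecond values are all hypothetical placeholders, not measurements and not part of TVM.

```python
def time_ratios(kernel_times_ms):
    """Return each kernel's share of the total time as a fraction of 1."""
    total = sum(kernel_times_ms.values())
    if total <= 0:
        raise ValueError("total kernel time must be positive")
    return {name: t / total for name, t in kernel_times_ms.items()}


# Placeholder timings in milliseconds (hypothetical, not measured).
example_times = {"kernel0": 1.0, "kernel1": 2.0, "kernel2": 3.0, "kernel3": 4.0}

ratios = time_ratios(example_times)
for name, share in ratios.items():
    # Print each kernel's share as a percentage of the whole op.
    print(f"{name}: {share:.1%}")
```

What I am missing is a way to obtain the per-kernel timings that would go into that dictionary.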