You can use the graph runtime's debug mode to dump a per-op time breakdown of the CUDA functions generated for each op, and use that to find the bottleneck. For how to enable the graph runtime debugger, see this earlier thread (it is about CPU, but the profiling approach is the same on all platforms): Profiling a TVM run on CPU
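For reference, here is a minimal sketch of what swapping in the debug runtime might look like. It assumes you already have `graph_json`, `lib`, and `params` from your build step and a GPU context `ctx` for your TVM version. The helper name `run_with_debug_runtime` and the `dump_root` path are only illustrative. The debugger module is `debug_runtime` in older releases and, as far as I know, `debug_executor` in newer ones, so the import tries both.

```python
def run_with_debug_runtime(graph_json, lib, params, inputs, ctx,
                           dump_root="/tmp/tvmdbg"):
    """Run a compiled graph once under the TVM debug runtime.

    graph_json, lib, params: outputs of your build step (assumed to exist).
    inputs: dict mapping input names to arrays.
    ctx: the device context for your TVM version (a GPU context for CUDA).
    dump_root: illustrative directory where the debugger writes its dumps.
    """
    # Imported lazily so this file can be loaded without TVM installed.
    try:
        from tvm.contrib.debugger import debug_runtime as debugger
    except ImportError:
        # Assumption: newer TVM releases renamed the module.
        from tvm.contrib.debugger import debug_executor as debugger

    # Drop-in replacement for the regular graph runtime's create().
    module = debugger.create(graph_json, lib, ctx, dump_root=dump_root)
    module.set_input(**params)
    module.set_input(**inputs)

    # In debug mode, run() times each op individually and reports the breakdown.
    module.run()
    return module
```

You would call it the same way you call the normal graph runtime, then look at the per-op times it reports (and the files under `dump_root`) to see which generated kernels dominate.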