How to improve the auto-tune performance?

comaniac · January 9, 2020, 7:52am

You can use the graph runtime's debug mode to dump a per-op time breakdown of the CUDA functions generated for each op, and then look for the bottleneck in that breakdown. For how to enable the graph runtime debugger, see this earlier response (that topic is about CPU, but the profiling approach is the same on all platforms): Profiling a TVM run on CPU
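
For reference, here is a minimal sketch of what that could look like, assuming a TVM build from around this time where the debugger lives in `tvm.contrib.debugger.debug_runtime` (later releases renamed it to `debug_executor`, so adjust the import if needed). The names `graph_json`, `lib`, and `params` stand for the outputs of your own `relay.build` call, and the `dump_root` path and the `rank_ops_by_time` helper are hypothetical, just for illustration:

```python
def build_debug_module(graph_json, lib, params, dump_root="/tmp/tvmdbg"):
    """Create a debug graph runtime on GPU 0 in place of the normal one.

    graph_json, lib, params are assumed to come from your own relay.build();
    dump_root is an arbitrary example directory for the debugger's dump files.
    """
    # Imported lazily so the helper below stays usable without TVM installed.
    import tvm
    from tvm.contrib.debugger import debug_runtime

    ctx = tvm.gpu(0)
    module = debug_runtime.create(graph_json, lib, ctx, dump_root=dump_root)
    module.set_input(**params)
    # Calling module.run() afterwards prints a per-op time breakdown.
    return module


def rank_ops_by_time(op_times_us, top_k=5):
    """Hypothetical helper: rank (op_name, time_us) pairs by time spent.

    Returns up to top_k tuples of (op_name, time_us, percent_of_total),
    slowest first, so the dominant kernels are easy to spot.
    """
    total = sum(t for _, t in op_times_us)
    if total == 0:
        return []
    ranked = sorted(op_times_us, key=lambda item: item[1], reverse=True)
    return [(name, t, 100.0 * t / total) for name, t in ranked[:top_k]]
```

After `module.run()`, you would copy the op names and times from the printed breakdown into `rank_ops_by_time` (or just read the table directly) to see which ops dominate and are worth tuning further.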