Launching CUDA graphs or CUDA kernels concurrently can definitely increase latency and therefore lead to inaccurate measurements. From my observation, the Ansor tuning process is mostly CPU-bound (please correct me if I'm wrong): the GPU measurement span is only a small portion, and most of the time is spent sampling valid schedules from sketches. That work is done with Python multiprocessing, which by default uses all CPU cores, so I doubt there is much benefit to running multiple Ansor processes concurrently.
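If you still want to experiment with sharing a machine between several tuning jobs, one knob you can turn is the builder parallelism, which defaults to the full core count. Below is a minimal sketch using the public `tvm.auto_scheduler` API; `my_task`, the trial budget, and the log file name are placeholders, and the exact split that works best will depend on your machine:

```python
# Sketch: cap the CPU parallelism used to compile measurement candidates,
# so two tuning jobs don't both try to grab every core.
# Assumes `my_task` is an existing auto_scheduler.SearchTask.
import multiprocessing
from tvm import auto_scheduler

# LocalBuilder's n_parallel defaults to multiprocessing.cpu_count();
# here we limit it to half the cores (hypothetical split).
builder = auto_scheduler.LocalBuilder(
    timeout=15,
    n_parallel=max(1, multiprocessing.cpu_count() // 2),
)
runner = auto_scheduler.LocalRunner(repeat=3, min_repeat_ms=100)

tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=200,  # placeholder trial budget
    builder=builder,
    runner=runner,
    measure_callbacks=[auto_scheduler.RecordToFile("tuning.json")],
)
my_task.tune(tune_option)
```

Note this only limits the parallel compilation of measured candidates; the schedule sampling inside the search policy has its own parallelism, so the GPU measurement contention point above still applies.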