I use the “get_source()” API to get the generated CUDA source code, and compile it with nvcc.
The kernel compiled this way takes 640 ms to run, but time_evaluator reports a running time of 20 ms.
I wonder why this happens.
I compile with -O3. Am I missing any critical compile options?
Try setting the target architecture in nvcc (the -arch option); other than that, I cannot tell what the difference is. Note that when you measure time, always skip the first run, because it includes the JIT loading cost (time_evaluator already does this).
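
To illustrate the "skip the first run" advice, here is a minimal Python sketch of a timing loop that discards warm-up calls before measuring. This is not how time_evaluator is implemented; `time_skipping_warmup` and `FakeKernel` are hypothetical names made up for this example, and `FakeKernel` only simulates a launch whose first call pays a one-time loading cost. With a real GPU kernel, the timed function would also need to wait for the device to finish (for example with cudaDeviceSynchronize), since kernel launches are asynchronous.

```python
import time


def time_skipping_warmup(fn, warmup=1, repeat=10):
    """Return the mean seconds per call of fn, excluding warm-up calls."""
    for _ in range(warmup):
        fn()  # discarded: absorbs one-time costs such as module/JIT loading
    start = time.perf_counter()
    for _ in range(repeat):
        fn()
    return (time.perf_counter() - start) / repeat


class FakeKernel:
    """Hypothetical stand-in for a kernel launch: slow first call, fast afterwards."""

    def __init__(self):
        self.calls = 0

    def __call__(self):
        self.calls += 1
        # Assumption for illustration only: the first call is about 50x slower.
        time.sleep(0.05 if self.calls == 1 else 0.001)


if __name__ == "__main__":
    with_warmup = time_skipping_warmup(FakeKernel(), warmup=1, repeat=5)
    without_warmup = time_skipping_warmup(FakeKernel(), warmup=0, repeat=5)
    print(f"first run skipped:  {with_warmup * 1e3:.2f} ms per call")
    print(f"first run included: {without_warmup * 1e3:.2f} ms per call")
```

If your own nvcc-based measurement times a single launch, comparing it against an average taken after a warm-up run like this may explain part of the gap.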