Measure CUDA kernel time

I observed that the time reported by time_evaluator does not match the nvprof kernel timings. Is there a way to measure the actual kernel execution time instead of wall-clock time?

I tested some kernels, and the time_evaluator and nvprof results were very close (error < 2%).
Could you try time_evaluator with a large number argument (e.g. number=1000)?
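
To illustrate why a larger run count might close the gap, here is a minimal sketch under a simple assumed model: the wall-clock measurement covers the back-to-back kernel runs plus one fixed cost (launch and synchronization) that is paid once per measurement. The function name, the 100 us kernel time, and the 20 us overhead are all hypothetical numbers chosen for illustration, not measurements from TVM or nvprof.

```python
def mean_time_per_run(kernel_s, fixed_overhead_s, number):
    """Mean wall-clock time per run under the assumed model:
    `number` back-to-back runs share a single fixed overhead."""
    if number < 1:
        raise ValueError("number must be at least 1")
    total = fixed_overhead_s + number * kernel_s
    return total / number


def relative_error(kernel_s, fixed_overhead_s, number):
    """How far the averaged wall-clock time is from the pure kernel time."""
    mean = mean_time_per_run(kernel_s, fixed_overhead_s, number)
    return (mean - kernel_s) / kernel_s


if __name__ == "__main__":
    kernel_s = 100e-6    # hypothetical kernel time: 100 microseconds
    overhead_s = 20e-6   # hypothetical one-off launch/sync cost: 20 microseconds

    for number in (1, 10, 100, 1000):
        mean = mean_time_per_run(kernel_s, overhead_s, number)
        err = relative_error(kernel_s, overhead_s, number)
        print(f"number={number:5d}  mean={mean * 1e6:8.3f} us  error={err:.3%}")
```

Under this model the fixed cost is divided by the run count, so the averaged wall-clock time approaches the kernel time as the count grows. In TVM the corresponding knob is the number argument, e.g. f.time_evaluator(f.entry_name, ctx, number=1000), where f and ctx stand for your own built module and device context; whether this fully explains a given mismatch with nvprof depends on the workload.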