Is the time that TVM measures with time_evaluator only the kernel time, without any overhead such as cudaMalloc?
Yes, the intent is to measure only the execution time plus the time to copy data back to host memory. By default, I believe time_evaluator also discards the first run, since that run may include additional overheads such as JIT compilation and allocation.
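To illustrate the "discard the first run" idea, here is a minimal sketch in plain Python. It is not TVM's implementation: `timed_mean` and `fake_kernel` are hypothetical names made up for this example, and the `number`/`repeat` parameters only mirror the general pattern of averaging several calls per measurement. It assumes the one-time setup cost shows up only on the first call.

```python
import time

def timed_mean(func, number=10, repeat=3):
    """Hypothetical helper (not a TVM API): mean time per call, warm-up excluded."""
    func()  # warm-up call: absorbs one-time costs and is not timed
    means = []
    for _ in range(repeat):
        start = time.perf_counter()
        for _ in range(number):
            func()
        elapsed = time.perf_counter() - start
        means.append(elapsed / number)  # average over `number` calls
    return means

calls = []

def fake_kernel():
    # Stand-in for a compiled kernel: only the first call pays a setup cost.
    if not calls:
        time.sleep(0.05)  # simulated JIT compilation / allocation
    calls.append(1)

means = timed_mean(fake_kernel, number=5, repeat=2)
print(means)
```

Because the warm-up call happens before the timer starts, the simulated setup cost does not appear in any of the reported means.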