I am trying to benchmark the host-to-device memory transfer time in TVM code. I use PyTorch's synchronization method (torch.cuda.synchronize()) to bracket the transfer with a Python timer (example code attached), and at the same time I profile the code with nvprof. However, the time reported by the Python timer is consistently longer than nvprof's result, by roughly 5%-15%.
I don't observe this difference when the same transfer is written in PyTorch. So I would like to know: what is the best way to benchmark data transfer time between the CPU and GPU?
Thanks!
import time

import numpy as np
import torch
import tvm

a_np = np.random.uniform(size=(1000, 1024)).astype(np.float32)
dev = tvm.cuda()

torch.cuda.synchronize()  # make sure the GPU is idle before starting the timer
start_A_transfer = time.perf_counter()  # monotonic, high-resolution timer
a_tvm = tvm.nd.array(a_np, device=dev)  # host-to-device copy
torch.cuda.synchronize()  # wait for the copy to finish before stopping the timer
end_A_transfer = time.perf_counter()
matrix_A_HtoD_time = (end_A_transfer - start_A_transfer) * 1000
print("matrix A HtoD time:", matrix_A_HtoD_time, "ms")
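As a cross-check, here is a sketch that times the same copy with CUDA events instead of a host-side timer (reusing a_np and dev from the snippet above). It assumes the TVM copy is issued on the default CUDA stream, which is also the stream PyTorch records events on by default; the event variable names are mine.

start_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()  # start from an idle device
start_evt.record()        # mark the start on the current (default) stream
a_tvm = tvm.nd.array(a_np, device=dev)  # host-to-device copy
end_evt.record()          # mark the end on the same stream
torch.cuda.synchronize()  # wait until both events have completed
print("matrix A HtoD time (events):", start_evt.elapsed_time(end_evt), "ms")

Since elapsed_time() measures on the GPU timeline, it should land closer to nvprof's [CUDA memcpy HtoD] figure; comparing it against the host-side timer above should show how much of the gap is host-side overhead.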