API for measuring memory transfer time from CPU to GPU

I am trying to benchmark the memory transfer time between the host and device in TVM code. I time the host-to-device transfer with a Python timer, using PyTorch's `torch.cuda.synchronize()` before and after the copy (example code below). I also profile the same code with nvprof. However, the time reported by the Python timer is consistently longer than nvprof’s result, by at least 5% to 15%.

I don’t observe such a difference when the same transfer is written in PyTorch. What is the best way to benchmark the data transfer time between the CPU and GPU?

Thanks!

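        import time
        import numpy as np
        import torch
        import tvm

        a_np = np.random.uniform(size=(1000, 1024)).astype(np.float32)
        dev = tvm.cuda()
        # Drain any pending GPU work before starting the timer
        torch.cuda.synchronize()
        start_A_transfer = time.time()
        a_tvm = tvm.nd.array(a_np, device=dev)  # host-to-device copy
        # Wait for the copy to finish before stopping the timer
        torch.cuda.synchronize()
        end_A_transfer = time.time()
        matrix_A_HtoD_time = (end_A_transfer - start_A_transfer) * 1000  # seconds -> ms
        print("matrix A HtoD time:", matrix_A_HtoD_time, "ms")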
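For reference, a slightly more careful version of the host-side timer could look like the sketch below. It is not a TVM or PyTorch API: `time_transfer_ms` and `summarize` are hypothetical helper names, and the warm-up and repeat counts are arbitrary. It uses `time.perf_counter()` instead of `time.time()`, discards a few warm-up runs, and reports the spread over several runs. Note that a host-side timer like this still includes Python call overhead and whatever else happens inside the timed call (possibly the device buffer allocation), so I would not expect it to match nvprof's memcpy time exactly.

```python
import statistics
import time


def time_transfer_ms(transfer, sync, warmup=3, repeat=20):
    """Time transfer() on the host, bracketed by sync(); returns times in ms.

    transfer: zero-argument callable that performs the copy.
    sync:     zero-argument callable that blocks until the device is idle.
    """
    # Warm-up runs absorb one-time costs (context creation, allocator growth).
    for _ in range(warmup):
        transfer()
        sync()

    samples = []
    for _ in range(repeat):
        sync()  # make sure nothing is still in flight
        start = time.perf_counter()
        transfer()
        sync()  # wait for the copy to complete
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples):
    """Min, median, and max of the collected samples, in ms."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


# Usage with the snippet above (needs a CUDA-enabled TVM build, not run here):
#   samples = time_transfer_ms(
#       lambda: tvm.nd.array(a_np, device=dev),
#       torch.cuda.synchronize,
#   )
#   print(summarize(samples))
```

Comparing the minimum or median against nvprof, rather than a single run, should at least show how much of the gap is run-to-run noise.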