asnumpy() takes a large share of the total inference time. Why is it slower than calling cudaMemcpy() directly?

Hello~

My tuned module's output shape is (125, 20, 6600), so the output holds 16,500,000 float values in total.

I found that calling asnumpy() after mod.run() takes about 24 ms. It calls cudaMemcpy (DeviceToHost) inside libtvm_runtime.so.
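One way to check whether the 24 ms is really spent in the copy is to time the run, an explicit device synchronization, and the copy separately. The sketch below is only an assumption about how this could be measured: `time_phases` is a hypothetical helper, and the stand-in callables replace the real TVM calls so that it runs without a GPU. If kernel launches are asynchronous, any work still pending after mod.run() would show up in the sync phase instead of being counted as part of asnumpy().

```python
import time


def time_phases(run, sync, copy):
    """Time one inference in three phases and return (result, (run_ms, sync_ms, copy_ms)).

    run, sync and copy are zero-argument callables supplied by the caller.
    """
    t0 = time.perf_counter()
    run()                      # launch the inference
    t1 = time.perf_counter()
    sync()                     # wait for any pending device work to finish
    t2 = time.perf_counter()
    result = copy()            # device-to-host copy only
    t3 = time.perf_counter()
    timings_ms = tuple((b - a) * 1e3 for a, b in ((t0, t1), (t1, t2), (t2, t3)))
    return result, timings_ms


# Stand-ins so the sketch runs anywhere (hypothetical, not TVM code).
# With TVM, one would presumably pass something like:
#   run=mod.run, sync=dev.sync, copy=lambda: mod.get_output(0).asnumpy()
# where `mod` is the graph module and `dev` is the CUDA device/context object.
def fake_run():
    pass


def fake_sync():
    time.sleep(0.02)


def fake_copy():
    time.sleep(0.004)
    return [0.0]


output, (run_ms, sync_ms, copy_ms) = time_phases(fake_run, fake_sync, fake_copy)
print(f"run: {run_ms:.2f} ms, sync: {sync_ms:.2f} ms, copy: {copy_ms:.2f} ms")
```

If the sync phase turns out to be close to 20 ms and the copy phase close to 4 ms, the extra time would be kernel execution rather than the memcpy itself.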

I think this is unacceptable! I also tested a standalone cudaMemcpy (DeviceToHost) with the same data size, 125x20x6600, and it takes only about 4 ms.
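To put the two timings in perspective, here is a small calculation of the effective transfer bandwidth each one implies. It assumes the output is float32 (4 bytes per element), which the "float" wording suggests but does not confirm; `bandwidth_gb_s` is a helper name made up for this sketch.

```python
# Effective device-to-host bandwidth implied by the two measurements.
num_elements = 125 * 20 * 6600          # 16,500,000 values
bytes_per_element = 4                   # assumption: float32
total_bytes = num_elements * bytes_per_element  # 66,000,000 bytes (66 MB)


def bandwidth_gb_s(nbytes, elapsed_ms):
    """Return bandwidth in GB/s (1 GB = 1e9 bytes) for nbytes moved in elapsed_ms."""
    return nbytes / (elapsed_ms * 1e-3) / 1e9


direct_gb_s = bandwidth_gb_s(total_bytes, 4.0)    # standalone cudaMemcpy, about 4 ms
tvm_gb_s = bandwidth_gb_s(total_bytes, 24.0)      # asnumpy() in TVM, about 24 ms

print(f"standalone cudaMemcpy: {direct_gb_s:.2f} GB/s")  # about 16.5 GB/s
print(f"asnumpy():             {tvm_gb_s:.2f} GB/s")     # about 2.75 GB/s
```

So the same 66 MB moves at roughly 16.5 GB/s in the standalone test but only about 2.75 GB/s when measured through asnumpy(), a factor of six.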

So why does the cudaMemcpy (DeviceToHost) in TVM take about 20 ms, which is so much slower? Is there any way to reduce this unnecessary overhead?