Nothing is wrong. When you call module.run(), you only enqueue your CUDA kernels on the default CUDA stream, and the call returns without waiting for them to finish. When you then call module.get_output(0).asnumpy(), it invokes a CUDA memory copy, which is a synchronous function, so you wait until all the computation in the default CUDA stream is done.
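
To make the timing effect concrete, here is a rough pure-Python analogy, not TVM or CUDA code. FakeStream, launch, synchronize and measure are hypothetical names invented for this sketch: launch plays the role of module.run() (enqueue and return), and synchronize plays the role of the blocking copy behind asnumpy().

```python
import queue
import threading
import time


class FakeStream:
    """Toy stand-in for a default CUDA stream: queued work runs in order on a worker thread."""

    def __init__(self):
        self._tasks = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def _drain(self):
        # Execute queued tasks one at a time, in submission order.
        while True:
            task = self._tasks.get()
            task()
            self._tasks.task_done()

    def launch(self, task):
        # Analogous to a kernel launch: enqueue the work and return immediately.
        self._tasks.put(task)

    def synchronize(self):
        # Analogous to a blocking memory copy: wait until all queued work is done.
        self._tasks.join()


def measure(kernel_seconds=0.2):
    """Return (time spent in launch, time until the result is ready, result)."""
    stream = FakeStream()
    result = []

    def fake_kernel():
        time.sleep(kernel_seconds)  # pretend this is the GPU computation
        result.append(42)

    t0 = time.perf_counter()
    stream.launch(fake_kernel)
    t_launch = time.perf_counter() - t0  # tiny: nothing has been waited on
    stream.synchronize()
    t_total = time.perf_counter() - t0   # includes the actual computation
    return t_launch, t_total, result


if __name__ == "__main__":
    t_launch, t_total, _ = measure()
    print(f"launch returned after {t_launch:.4f}s, result ready after {t_total:.4f}s")
```

In this analogy the time measured around launch alone is close to zero, while the wait shows up in the synchronizing call. That mirrors why timing module.run() by itself looks very fast and the cost appears to move into asnumpy().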