Very slow under Linux CUDA

Nothing is wrong. When you call module.run(), the CUDA kernels are only enqueued on the default CUDA stream, and the call returns before they have finished. When you then call module.get_output(0).asnumpy(), it calls a CUDA memory copy function, which is synchronous, so it blocks until all the computation queued on the default CUDA stream is done. The time you measure around asnumpy() therefore includes the kernel execution, not just the copy.
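To make the timing effect concrete, here is a toy model in plain Python. It is only an analogy, not TVM or CUDA code: ToyStream, launch, synchronize, copy_to_host and measure are hypothetical names made up for this sketch, and a single worker thread stands in for the default stream. The "kernel" is just a sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class ToyStream:
    """Hypothetical stand-in for a default stream: queued work runs in order on one worker."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._last = None

    def launch(self, seconds):
        # Returns right away, like an asynchronous kernel launch.
        self._last = self._pool.submit(time.sleep, seconds)

    def synchronize(self):
        # Blocks until everything queued so far has finished.
        if self._last is not None:
            self._last.result()

    def copy_to_host(self):
        # A synchronous copy has to wait for the queued work first.
        self.synchronize()
        return "output"

    def close(self):
        self._pool.shutdown()


def timed(fn):
    """Return the wall-clock seconds spent inside fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def measure(kernel_seconds=0.2):
    """Time the launch and the copy separately, as a naive benchmark would."""
    stream = ToyStream()
    t_run = timed(lambda: stream.launch(kernel_seconds))
    t_copy = timed(stream.copy_to_host)
    stream.close()
    return t_run, t_copy


if __name__ == "__main__":
    t_run, t_copy = measure()
    print(f"launch: {t_run:.3f}s, copy: {t_copy:.3f}s")
```

In this model the launch looks almost free and the copy looks slow, even though the copy itself does no work. The same reading applies to the real case: to time only the computation, synchronize the device before stopping the timer, and time the copy separately.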