[OpenCL] async memory transfer and double buffering

Hello, I am working with OpenCL and I am trying to write a new schedule optimized for a custom device. Since my device has a large shared memory, I tried enabling double buffering in the conv2d_direct schedule for CUDA.

When I checked the source code of the generated kernels, I noticed that the memory cost is indeed doubled, but there is no asynchronous memory transfer to actually leverage the double buffering. That led me to wonder: is double buffering fully supported in TVM for OpenCL?

I have dug into this further and now understand why there is no asynchronous memory access: TVM's OpenCL backend was designed with GPUs in mind, and a GPU simply switches warps when the active warp stalls on a memory access, so explicit asynchronous copies are unnecessary there.
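To make concrete what I mean by asynchronous memory transfer, here is a hypothetical sketch (not TVM-generated code) of how an OpenCL kernel could double-buffer input tiles in local memory using the standard built-ins `async_work_group_copy` and `wait_group_events`, so the copy of tile `t+1` overlaps with the computation on tile `t`. The kernel name, buffer sizes, and the trivial "multiply by 2" computation are all made up for illustration:

```opencl
// Hypothetical example: double buffering with asynchronous local-memory copies.
__kernel void dbuf_example(__global const float *in, __global float *out,
                           int num_tiles, int tile_size) {
    __local float buf[2][256];           // two tiles; assumes tile_size <= 256
    event_t evt[2];

    // Prefetch the first tile into buffer 0.
    evt[0] = async_work_group_copy(buf[0], in, tile_size, 0);

    for (int t = 0; t < num_tiles; ++t) {
        int cur = t & 1, nxt = cur ^ 1;
        // Start fetching the next tile before waiting on the current one,
        // so the transfer overlaps with the compute below.
        if (t + 1 < num_tiles)
            evt[nxt] = async_work_group_copy(buf[nxt],
                                             in + (t + 1) * tile_size,
                                             tile_size, 0);
        wait_group_events(1, &evt[cur]); // current tile is now resident
        // Compute on the current tile (placeholder computation).
        int i = get_local_id(0);
        if (i < tile_size)
            out[t * tile_size + i] = buf[cur][i] * 2.0f;
        // Make sure everyone is done reading buf[cur] before the next
        // iteration issues a copy that overwrites it.
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}
```

On a GPU the warp scheduler hides this latency anyway, but on an in-order accelerator without hardware multithreading, generating copies in this overlapped style is what would make the doubled memory cost pay off.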

While this is completely justified for CUDA, I think the OpenCL backend should emit asynchronous memory transfers, since OpenCL is meant to target generic devices, and people who want to use the OpenCL backend for a custom device will likely run into performance issues because of this.

Besides, I noticed that TVM doesn't detect accelerator or custom OpenCL devices. Would you be interested in a pull request to fix this?