Hello, I am working with OpenCL and trying to write a new schedule optimized for a custom device. Since the device has a large shared memory, I tried enabling double buffering in the conv2d_direct schedule for CUDA.
When I checked the source code of the generated kernels, I noticed that the memory footprint is indeed doubled, but there is no asynchronous memory transfer to actually take advantage of the double buffering. That led me to wonder: is double buffering fully supported in TVM for OpenCL?