I'm trying to generate a block-wise sync for GPUs, i.e. "__syncthreads()" for CUDA and "barrier(CLK_GLOBAL_MEM_FENCE)" for OpenCL.
See https://github.com/dmlc/tvm/blob/master/include/tvm/ir.h#L425. We will need to emit tvm_storage_sync("shared"), which each backend's code generator then lowers to the target's own barrier.
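
To make it concrete what the block-wise sync guarantees, here is a minimal sketch in plain Python. It is only an analogy and does not use TVM or a GPU: ordinary threads stand in for the threads of one block, a list stands in for shared memory, and `threading.Barrier` stands in for the block-level barrier. The names `run_block`, `shared`, and `block_barrier` are made up for this illustration.

```python
import threading


def run_block(num_threads):
    """Simulate one thread block: write to shared memory, sync, read a neighbour."""
    shared = [None] * num_threads   # stands in for GPU shared memory
    result = [None] * num_threads   # one output slot per simulated thread
    # Stands in for the block-wide barrier (an assumption of this analogy).
    block_barrier = threading.Barrier(num_threads)

    def worker(tid):
        shared[tid] = tid * 10      # phase 1: each thread writes its own slot
        block_barrier.wait()        # no thread continues until all have written
        # phase 2: reading another thread's slot is safe only after the barrier
        result[tid] = shared[(tid + 1) % num_threads]

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


if __name__ == "__main__":
    print(run_block(4))  # [10, 20, 30, 0]
```

Without the `block_barrier.wait()` call, a thread could read its neighbour's slot before it has been written, which is the same hazard the emitted storage sync is there to prevent between a shared-memory write and a later read.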