Global Sync across different blocks in IR builder

@were @tqchen Is there a way to do global synchronization in the IR builder or the hybrid IR builder on the GPU? If I were writing GPU code by hand, I would separate the two parts into two kernel functions, but I don't know how to do that in TVM. Currently I'm hitting a bug after transposing data: thread 0 reads different values from, say, temp[1000] than thread 1000 does when the block size is 256.
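The race described above can be modeled in plain Python (no TVM or GPU required; the transposed index pattern is made up for illustration): a block-level sync such as `__syncthreads()` only orders threads within one block, so a read of `temp[1000]` from block 0 is not ordered against the write from block 3 (`1000 // 256 == 3`).

```python
# Hypothetical plain-Python model of the race: each "block" writes its own
# slice of temp, block-syncs, then reads a scrambled index that may belong
# to a block that has not executed yet.
BLOCK_SIZE = 256
NUM_BLOCKS = 4
N = BLOCK_SIZE * NUM_BLOCKS

def run_single_kernel(block_order):
    """Execute blocks in the given order inside ONE kernel; the sync
    between the write and read phases is only block-local."""
    temp = [None] * N   # None models uninitialized device memory
    out = [None] * N
    for b in block_order:
        # phase 1: this block writes its slice of temp
        for t in range(BLOCK_SIZE):
            i = b * BLOCK_SIZE + t
            temp[i] = i
        # __syncthreads(): orders only this block's threads, nothing else
        # phase 2: this block reads indices that cross block boundaries
        for t in range(BLOCK_SIZE):
            i = b * BLOCK_SIZE + t
            out[i] = temp[(i * 31) % N]
    return out

# Block 0 runs before blocks 1-3, so its cross-block reads see
# uninitialized memory -- the bug described in the question.
bad = run_single_kernel([0, 1, 2, 3])
print(None in bad)  # True: some reads saw uninitialized data
```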


You can launch multiple kernels in TVM. Just make sure you build two blocks of code, each bound to a thread block (blockIdx) as the outermost axis.

I see. What should I do to call barrier(CLK_GLOBAL_MEM_FENCE) in the IR builder?

In theory you do not need to call a global fence, because two separate kernels will be launched.
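A plain-Python sketch of why two launches need no explicit fence (sizes are illustrative, not from the thread): the boundary between two kernel launches is itself a device-wide synchronization point, so every write from kernel 1 is visible to every thread of kernel 2.

```python
# Model two kernel launches: all blocks of "kernel 1" complete before any
# block of "kernel 2" starts, so cross-block reads are safe.
BLOCK_SIZE = 256
NUM_BLOCKS = 4
N = BLOCK_SIZE * NUM_BLOCKS

temp = [0] * N
out = [0] * N

# "kernel 1": every block writes its slice of temp
for b in range(NUM_BLOCKS):
    for t in range(BLOCK_SIZE):
        i = b * BLOCK_SIZE + t
        temp[i] = i * 2

# <- implicit device-wide barrier between the two kernel launches

# "kernel 2": any thread may now read any element of temp safely
for b in range(NUM_BLOCKS):
    for t in range(BLOCK_SIZE):
        i = b * BLOCK_SIZE + t
        out[i] = temp[(N - 1) - i]  # reversed read crosses all blocks

print(out[0])  # 2 * (N - 1) = 2046
```

This is exactly what splitting the computation into two TVM kernels achieves: the launch boundary replaces the missing global barrier.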

I meant: what if I don't want to separate the kernel into two kernel functions?

@were does the hybrid interface already expose barrier/syncthreads APIs to Python users?

The global barrier does not work well in OpenCL due to language limitations, so the best approach would indeed be to use two kernels.

For one of my use cases, the results seem acceptable. I experimented first using manually modified OpenCL code.

CUDA 9 cooperative groups allow global sync across work-groups. Does TVM support this feature?

@milindn TVM hasn't yet taken advantage of the global sync in CUDA 9; contributions are welcome.
