Global Sync across different blocks in IR builder

@were @tqchen Is there a way to do global synchronization in the IR builder or the hybrid IR builder on the GPU? If I were writing GPU code by hand, I would separate the two parts into two kernel functions, but I don't know how to do that in TVM. Currently I'm hitting a bug after transposing data: thread 0 reads different values from, say, temp[1000] than thread 1000 does when the block size is 256.
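The race described above can be modeled in plain Python (no TVM or GPU required; the transposed index pattern is made up for illustration): a block-level sync such as `__syncthreads()` only orders threads within one block, so a read of `temp[1000]` from block 0 is not ordered against the write from block 3 (`1000 // 256 == 3`).

```python
# Hypothetical plain-Python model of the race: each "block" writes its own
# slice of temp, block-syncs, then reads a scrambled index that may belong
# to a block that has not executed yet.
BLOCK_SIZE = 256
NUM_BLOCKS = 4
N = BLOCK_SIZE * NUM_BLOCKS

def run_single_kernel(block_order):
    """Execute blocks in the given order inside ONE kernel; the sync
    between the write and read phases is only block-local."""
    temp = [None] * N   # None models uninitialized device memory
    out = [None] * N
    for b in block_order:
        # phase 1: this block writes its slice of temp
        for t in range(BLOCK_SIZE):
            i = b * BLOCK_SIZE + t
            temp[i] = i
        # __syncthreads(): orders only this block's threads, nothing else
        # phase 2: this block reads indices that cross block boundaries
        for t in range(BLOCK_SIZE):
            i = b * BLOCK_SIZE + t
            out[i] = temp[(i * 31) % N]
    return out

# Block 0 runs before blocks 1-3, so its cross-block reads see
# uninitialized memory -- the bug described in the question.
bad = run_single_kernel([0, 1, 2, 3])
print(None in bad)  # True: some reads saw uninitialized data
```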


You can launch multiple kernels in TVM. Just make sure you build two blocks of code, each bound to a thread block (blockIdx) as the outermost axis.

I see. What should I do to call barrier(CLK_GLOBAL_MEM_FENCE) in the IR builder?

In theory you do not need to call a global fence, because two separate kernels will be launched.
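A plain-Python sketch of why two launches need no explicit fence (sizes are illustrative, not from the thread): the boundary between two kernel launches is itself a device-wide synchronization point, so every write from kernel 1 is visible to every thread of kernel 2.

```python
# Model two kernel launches: all blocks of "kernel 1" complete before any
# block of "kernel 2" starts, so cross-block reads are safe.
BLOCK_SIZE = 256
NUM_BLOCKS = 4
N = BLOCK_SIZE * NUM_BLOCKS

temp = [0] * N
out = [0] * N

# "kernel 1": every block writes its slice of temp
for b in range(NUM_BLOCKS):
    for t in range(BLOCK_SIZE):
        i = b * BLOCK_SIZE + t
        temp[i] = i * 2

# <- implicit device-wide barrier between the two kernel launches

# "kernel 2": any thread may now read any element of temp safely
for b in range(NUM_BLOCKS):
    for t in range(BLOCK_SIZE):
        i = b * BLOCK_SIZE + t
        out[i] = temp[(N - 1) - i]  # reversed read crosses all blocks

print(out[0])  # 2 * (N - 1) = 2046
```

This is exactly what splitting the computation into two TVM kernels achieves: the launch boundary replaces the missing global barrier.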

I meant: what if I don't want to separate the kernel into two kernel functions?

@were does the hybrid interface already expose barrier/syncthreads APIs to Python users?

The global barrier does not work well in OpenCL due to language limitations, so the best approach would indeed be to use two kernels.

For one of my use cases, the results seem acceptable. I experimented first using manually modified OpenCL code.

CUDA 9 cooperative groups allow global sync across work-groups. Does TVM support this feature?

@milindn TVM hasn't yet taken advantage of the global sync in CUDA 9; contributions are welcome.
