The global barrier on CUDA is implemented using shared memory and an atomic add.
On this line, each thread adds num_blocks to vid_global_barrier_expect_, a variable in shared memory. As a result, vid_global_barrier_expect_ == num_blocks * num_threads_per_block after all threads finish this line. Since the variable lives in shared memory and is updated concurrently, shouldn't this use atomicAdd or atomicAdd_block?
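To illustrate the concern, here is a minimal sketch (variable names mirror the ones above, but the kernel itself is hypothetical, not the actual code in this patch). A plain `+=` on a shared-memory location compiles to a read-modify-write, so concurrent threads can lose updates; `atomicAdd_block` should suffice here since only threads within the block touch the variable:

```cuda
__global__ void barrier_expect_sketch(int num_blocks) {
  __shared__ int vid_global_barrier_expect_;
  if (threadIdx.x == 0) vid_global_barrier_expect_ = 0;
  __syncthreads();

  // Racy: a plain read-modify-write; concurrent threads can lose updates.
  // vid_global_barrier_expect_ += num_blocks;

  // Safe: block-scoped atomic is enough for a shared-memory location,
  // because no thread outside this block can access it.
  atomicAdd_block(&vid_global_barrier_expect_, num_blocks);
  __syncthreads();

  // After the barrier, all threads observe
  // vid_global_barrier_expect_ == num_blocks * blockDim.x.
}
```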
Can you give an example test case that generates code through this path? I am not familiar with it, since most kernels I have seen do not use this synchronization mechanism.
As an aside, I am confused about the meaning of "global" in the global barrier here. Is it supposed to be a true global barrier, i.e. one that synchronizes all threads in all blocks?
If I understand correctly, the implementation strategy is to spin on a counter until all threads are accounted for. I thought CUDA provides no guarantee that this style of implementation does not deadlock, since nothing prevents the execution of blocks from being serialized. My knowledge of CUDA may be stale, though, so if someone has updated info that would be great.
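For concreteness, here is a hypothetical sketch of the pattern I am worried about, plus what I understand to be the supported alternative (the kernel names and the `g_arrived` counter are illustrative, not from this patch). If the grid has more blocks than the device can keep resident at once, the blocks that have not launched can never increment the counter, so the resident blocks spin forever:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hand-rolled counter barrier: thread 0 of each block increments a
// global counter, then every thread spins until all blocks have arrived.
__device__ volatile int g_arrived = 0;

__global__ void hand_rolled_barrier() {
  __syncthreads();
  if (threadIdx.x == 0) atomicAdd((int*)&g_arrived, 1);
  // Deadlocks if some blocks are never co-resident: the counter can
  // never reach gridDim.x while earlier blocks occupy the SMs.
  while (g_arrived < gridDim.x) { /* spin */ }
  __syncthreads();
}

// Supported alternative: a cooperative launch guarantees all blocks are
// co-resident, and grid.sync() is a true grid-wide barrier.
__global__ void cooperative_barrier() {
  cg::this_grid().sync();  // requires cudaLaunchCooperativeKernel
}
```

My understanding is that only the cooperative-launch path carries a forward-progress guarantee; the hand-rolled version happens to work only when all blocks fit on the device simultaneously.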