Bind reduce axis to blocks

I’m following the reduction tutorial and trying to bind a reduce axis to GPU blocks, but it seems to be impossible; we can only bind it to threads. Do you have any idea how we can do this in TVM?

Do you get correct results when you bind reduction axes to threads? Do you apply the necessary synchronization across threads?

Synchronization will also be the issue across blocks; binding a reduce axis to blocks is likely impossible because you cannot synchronize across blocks on many GPU (CUDA) platforms.

Yes, I get correct results when I bind it to threads. I got this code from the reduction tutorial:

import tvm

n = 32
m = 64
A = tvm.placeholder((n, m), name="A")
k = tvm.reduce_axis((0, m), "k")
B = tvm.compute((n,), lambda i: tvm.sum(A[i, k], axis=k), name="B")
s = tvm.create_schedule(B.op)
ko, ki = s[B].split(B.op.reduce_axis[0], factor=16)
BF = s.rfactor(B, ko)  # BF[ko, i] holds the partial sums
xo, xi = s[B].split(s[B].op.axis[0], factor=32)
s[B].bind(xo, tvm.thread_axis("blockIdx.x"))
s[B].bind(xi, tvm.thread_axis("threadIdx.y"))

# It works when I bind the reduce_axis to a thread, but it won't work for blocks.
# tx = tvm.thread_axis("blockIdx.x")
tx = tvm.thread_axis("threadIdx.x")
s[B].bind(s[B].op.reduce_axis[0], tx)

s[BF].compute_at(s[B], s[B].op.reduce_axis[0])
s[B].set_store_predicate(tx.var.equal(0))  # only thread 0 stores the final result

I’m not sure about block synchronization. We can split a normal (spatial) axis across several blocks, and the result matrix computed by those blocks gets stitched together correctly. When we split a reduce axis, as I understand it, this creates n (number of blocks) intermediate matrices Br[] in the n blocks, and we then sum the Br[] to get the final matrix. I’m not sure why the sum would cause a synchronization problem while the stitching does not.
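Concretely, here is a rough sketch of what I mean, using the same old-style API as above (I haven’t verified that this actually builds for CUDA): the rfactor stage is bound to blocks on its own, so the partials (BF, playing the role of Br[]) would be produced by one kernel and summed by a second one.

import tvm

n = 32
m = 64
A = tvm.placeholder((n, m), name="A")
k = tvm.reduce_axis((0, m), "k")
B = tvm.compute((n,), lambda i: tvm.sum(A[i, k], axis=k), name="B")

s = tvm.create_schedule(B.op)
ko, ki = s[B].split(B.op.reduce_axis[0], factor=16)
BF = s.rfactor(B, ko)  # BF[ko, i]: partial sums, one ko slice per block

# Stage 1: each block computes one slice of the partial sums.
s[BF].bind(s[BF].op.axis[0], tvm.thread_axis("blockIdx.x"))
s[BF].bind(s[BF].op.axis[1], tvm.thread_axis("threadIdx.x"))

# Stage 2 (no compute_at, so it should become a separate kernel):
# thread i serially sums BF[:, i] into B[i].
s[B].bind(s[B].op.axis[0], tvm.thread_axis("threadIdx.x"))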

If TVM doesn’t support this, do you have an idea where we would need to change TVM to achieve this feature?

Take a look at the generated CUDA code for the reduction when the axis is bound to threads. There are calls to __syncthreads() so that the accumulation of the temporary array (B.rf) is not a race between different threads. However, there is no corresponding __syncblocks primitive in CUDA, as many architectures (if not all of them) do not support synchronization across blocks.
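For example, with the schedule from the post above (reduce axis bound to threadIdx.x), you can dump the device source and look for the barriers:

fcuda = tvm.build(s, [A, B], "cuda")
dev_module = fcuda.imported_modules[0]  # the CUDA device module
print(dev_module.get_source())          # look for __syncthreads() around the B.rf accumulation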

I got the same answer from another person, so I think that’s the reason. Thank you very much, eqy!

CUDA does provide various atomic functions, such as atomicAdd and atomicCAS. In principle, it is possible to implement a general reduction with atomicCAS and common reductions with the atomic* functions.
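Just to illustrate the idea (this is not TVM-generated code; the sketch uses Numba’s CUDA bindings so that it stays in Python), a cross-block row reduction based on atomicAdd could look roughly like this:

import numpy as np
from numba import cuda

n, m, factor = 32, 64, 16
nblocks = m // factor  # the reduce axis is split across this many blocks

@cuda.jit
def row_sum_atomic(a, b):
    # blockIdx.x picks a chunk of the reduction axis, threadIdx.x picks the row.
    ko = cuda.blockIdx.x
    i = cuda.threadIdx.x
    acc = 0.0
    for ki in range(factor):
        acc += a[i, ko * factor + ki]
    cuda.atomic.add(b, i, acc)  # atomic accumulation keeps the cross-block sum race-free

a = np.random.rand(n, m)
d_a = cuda.to_device(a)
d_b = cuda.to_device(np.zeros(n))
row_sum_atomic[nblocks, n](d_a, d_b)  # grid of nblocks blocks with n threads each
np.testing.assert_allclose(d_b.copy_to_host(), a.sum(axis=1))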

Does TVM have a plan to leverage atomic functions for cross-block reduction?

Is synchronizing blocks via a second kernel launch (as recommended here: Synchronize all blocks in CUDA - #8 by njuffa - CUDA Programming and Performance - NVIDIA Developer Forums) not an option for TVM?
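To make that pattern concrete, here is a minimal sketch of the two-launch approach (again using Numba purely for illustration, not TVM output). The kernel boundary is the cross-block synchronization point: the second launch only starts once every block of the first one has written its partial sums.

import numpy as np
from numba import cuda

n, m, factor = 32, 64, 16
nblocks = m // factor

@cuda.jit
def partial_sums(a, bf):
    # Kernel 1: block ko reduces its chunk of row i into bf[ko, i].
    ko = cuda.blockIdx.x
    i = cuda.threadIdx.x
    acc = 0.0
    for ki in range(factor):
        acc += a[i, ko * factor + ki]
    bf[ko, i] = acc

@cuda.jit
def combine_partials(bf, b):
    # Kernel 2: all blocks of kernel 1 have finished by now (same stream),
    # so thread i can safely sum bf[:, i].
    i = cuda.threadIdx.x
    acc = 0.0
    for ko in range(bf.shape[0]):
        acc += bf[ko, i]
    b[i] = acc

a = np.random.rand(n, m)
d_a = cuda.to_device(a)
d_bf = cuda.device_array((nblocks, n), dtype=np.float64)
d_b = cuda.device_array(n, dtype=np.float64)

partial_sums[nblocks, n](d_a, d_bf)  # first launch: per-block partial sums
combine_partials[1, n](d_bf, d_b)    # second launch: combine across blocks
np.testing.assert_allclose(d_b.copy_to_host(), a.sum(axis=1))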