Bind reduce axis to blocks

I’m following the reduction tutorial and trying to bind a reduce axis to GPU blocks, but it seems to be impossible; we can only bind it to threads. Do you have any idea how we can do this in TVM?

Do you get correct results when you bind reduction axes to threads? Do you apply the necessary synchronization across threads?

Synchronization will also be the issue across blocks; binding a reduce axis to blocks is likely impossible because you cannot synchronize across blocks on many GPU (CUDA) platforms.

Yes, I get correct results when I bind it to threads. I got this code from the reduction tutorial:

import tvm

n = 32
m = 64
A = tvm.placeholder((n, m), name="A")
k = tvm.reduce_axis((0, m), "k")
B = tvm.compute((n,), lambda i: tvm.sum(A[i, k], axis=k), name="B")
s = tvm.create_schedule(B.op)
ko, ki = s[B].split(B.op.reduce_axis[0], factor=16)
BF = s.rfactor(B, ko)  # BF[ko, i] holds the partial sums
xo, xi = s[B].split(s[B].op.axis[0], factor=32)
s[B].bind(xo, tvm.thread_axis("blockIdx.x"))
s[B].bind(xi, tvm.thread_axis("threadIdx.y"))

# It works when I bind the reduce_axis to a thread, but it won't work for blocks.
# tx = tvm.thread_axis("blockIdx.x")
tx = tvm.thread_axis("threadIdx.x")
s[B].bind(s[B].op.reduce_axis[0], tx)

s[BF].compute_at(s[B], s[B].op.reduce_axis[0])
s[B].set_store_predicate(tx.var.equal(0))  # only thread 0 stores the final result

I’m not sure about block synchronization. We can split a normal (spatial) axis across several blocks, and the result matrix computed by those blocks gets stitched together correctly. When we split a reduce axis, as I understand it, this creates n (number of blocks) intermediate matrices Br[] in the n blocks, and we then sum the Br[] to get the final matrix. I’m not sure why the sum would cause a synchronization problem while the stitching does not.
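Concretely, here is a rough sketch of what I mean, using the same old-style API as above (I haven’t verified that this actually builds for CUDA): the rfactor stage is bound to blocks on its own, so the partials (BF, playing the role of Br[]) would be produced by one kernel and summed by a second one.

import tvm

n = 32
m = 64
A = tvm.placeholder((n, m), name="A")
k = tvm.reduce_axis((0, m), "k")
B = tvm.compute((n,), lambda i: tvm.sum(A[i, k], axis=k), name="B")

s = tvm.create_schedule(B.op)
ko, ki = s[B].split(B.op.reduce_axis[0], factor=16)
BF = s.rfactor(B, ko)  # BF[ko, i]: partial sums, one ko slice per block

# Stage 1: each block computes one slice of the partial sums.
s[BF].bind(s[BF].op.axis[0], tvm.thread_axis("blockIdx.x"))
s[BF].bind(s[BF].op.axis[1], tvm.thread_axis("threadIdx.x"))

# Stage 2 (no compute_at, so it should become a separate kernel):
# thread i serially sums BF[:, i] into B[i].
s[B].bind(s[B].op.axis[0], tvm.thread_axis("threadIdx.x"))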

If TVM doesn’t support this, do you have an idea where we would need to change TVM to achieve this feature?

Take a look at the generated CUDA code for the reduction when the axis is bound to threads. There are calls to __syncthreads() so that the accumulation of the temporary array (B.rf) is not a race between different threads. However, there is no corresponding __syncblocks primitive in CUDA, as many architectures (if not all of them) do not support synchronization across blocks.
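For example, with the schedule from the post above (reduce axis bound to threadIdx.x), you can dump the device source and look for the barriers:

fcuda = tvm.build(s, [A, B], "cuda")
dev_module = fcuda.imported_modules[0]  # the CUDA device module
print(dev_module.get_source())          # look for __syncthreads() around the B.rf accumulation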

I got the same answer from another person, so I think that’s the reason. Thank you very much, eqy!

CUDA does provide various atomic functions, such as atomicAdd and atomicCAS. In principle, it is possible to implement a general reduction with atomicCAS and common reductions with the atomic* functions.
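Just to illustrate the idea (this is not TVM-generated code; the sketch uses Numba’s CUDA bindings so that it stays in Python), a cross-block row reduction based on atomicAdd could look roughly like this:

import numpy as np
from numba import cuda

n, m, factor = 32, 64, 16
nblocks = m // factor  # the reduce axis is split across this many blocks

@cuda.jit
def row_sum_atomic(a, b):
    # blockIdx.x picks a chunk of the reduction axis, threadIdx.x picks the row.
    ko = cuda.blockIdx.x
    i = cuda.threadIdx.x
    acc = 0.0
    for ki in range(factor):
        acc += a[i, ko * factor + ki]
    cuda.atomic.add(b, i, acc)  # atomic accumulation keeps the cross-block sum race-free

a = np.random.rand(n, m)
d_a = cuda.to_device(a)
d_b = cuda.to_device(np.zeros(n))
row_sum_atomic[nblocks, n](d_a, d_b)  # grid of nblocks blocks with n threads each
np.testing.assert_allclose(d_b.copy_to_host(), a.sum(axis=1))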

Does TVM have a plan to leverage atomic functions for cross-block reduction?

Is synchronizing blocks via a second kernel launch (as recommended here: Synchronize all blocks in CUDA - #8 by njuffa - CUDA Programming and Performance - NVIDIA Developer Forums) not an option for TVM?
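To make that pattern concrete, here is a minimal sketch of the two-launch approach (again using Numba purely for illustration, not TVM output). The kernel boundary is the cross-block synchronization point: the second launch only starts once every block of the first one has written its partial sums.

import numpy as np
from numba import cuda

n, m, factor = 32, 64, 16
nblocks = m // factor

@cuda.jit
def partial_sums(a, bf):
    # Kernel 1: block ko reduces its chunk of row i into bf[ko, i].
    ko = cuda.blockIdx.x
    i = cuda.threadIdx.x
    acc = 0.0
    for ki in range(factor):
        acc += a[i, ko * factor + ki]
    bf[ko, i] = acc

@cuda.jit
def combine_partials(bf, b):
    # Kernel 2: all blocks of kernel 1 have finished by now (same stream),
    # so thread i can safely sum bf[:, i].
    i = cuda.threadIdx.x
    acc = 0.0
    for ko in range(bf.shape[0]):
        acc += bf[ko, i]
    b[i] = acc

a = np.random.rand(n, m)
d_a = cuda.to_device(a)
d_bf = cuda.device_array((nblocks, n), dtype=np.float64)
d_b = cuda.device_array(n, dtype=np.float64)

partial_sums[nblocks, n](d_a, d_bf)  # first launch: per-block partial sums
combine_partials[1, n](d_bf, d_b)    # second launch: combine across blocks
np.testing.assert_allclose(d_b.copy_to_host(), a.sum(axis=1))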