There are schedule primitives like rfactor() in Halide/TVM. For the CUDA backend, we can use rfactor() to map a reduction onto GPU threads, but I don't know how to implement the reduction within a thread block using shared memory. Could you give me some advice?
Are you talking about this?
https://docs.tvm.ai/tutorials/language/reduction.html#cross-thread-reduction
Thank you, that is what I need.