[SOLVED] Can auto-tuner frameworks like Halide/TVM generate high-performance reduction algorithms?

Halide/TVM provide schedule primitives such as rfactor(). For the CUDA backend, we can use rfactor() to map a reduction onto GPU threads, but I don't know how to realize the reduction within a thread block using shared memory. Could you give me some advice?
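
For reference, a minimal sketch of the rfactor() step described above, written against TVM's te API; the tensor names, shapes, and split factor are illustrative and not taken from the original post:

```python
import tvm
from tvm import te

# Row-wise sum: B[i] = sum over k of A[i, k]
n = te.var("n")
m = te.var("m")
A = te.placeholder((n, m), name="A")
k = te.reduce_axis((0, m), name="k")
B = te.compute((n,), lambda i: te.sum(A[i, k], axis=k), name="B")

s = te.create_schedule(B.op)
# Split the reduction axis and rfactor it, so each of the 16 lanes
# accumulates a partial sum that can later be mapped to a GPU thread.
ko, ki = s[B].split(B.op.reduce_axis[0], factor=16)
BF = s.rfactor(B, ki)
```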

Are you talking about this?
https://docs.tvm.ai/tutorials/language/reduction.html#cross-thread-reduction
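
For completeness, here is a condensed sketch of the cross-thread reduction schedule that the linked tutorial walks through (names and split factors are illustrative). Binding the factored reduction axis to threadIdx.x makes TVM lower it to an intra-block reduction for the CUDA backend:

```python
import tvm
from tvm import te

# Same row-wise sum and rfactor as sketched in the question above.
n, m = te.var("n"), te.var("m")
A = te.placeholder((n, m), name="A")
k = te.reduce_axis((0, m), name="k")
B = te.compute((n,), lambda i: te.sum(A[i, k], axis=k), name="B")

s = te.create_schedule(B.op)
ko, ki = s[B].split(B.op.reduce_axis[0], factor=16)
BF = s.rfactor(B, ki)

# Map rows to blocks/threads and bind the remaining reduction axis
# to threadIdx.x; TVM then emits an intra-block reduction
# (tvm_thread_allreduce) instead of a serial loop.
xo, xi = s[B].split(s[B].op.axis[0], factor=32)
s[B].bind(xo, te.thread_axis("blockIdx.x"))
s[B].bind(xi, te.thread_axis("threadIdx.y"))
tx = te.thread_axis("threadIdx.x")
s[B].bind(s[B].op.reduce_axis[0], tx)
s[BF].compute_at(s[B], s[B].op.reduce_axis[0])
# Only one thread per group stores the final result.
s[B].set_store_predicate(tx.var.equal(0))

fcuda = tvm.build(s, [A, B], "cuda")
print(fcuda.imported_modules[0].get_source())  # inspect the generated CUDA kernel
```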

Thank you, that is what I need.