[RFC][Tensorize] Add "reduce_last" property for TensorIntrin to support activation fusion

Yeah, but that’s because TE has a restriction that a reduction must appear at the top level of a compute; otherwise compilation fails.

I understand that this is indeed a limitation of reduction. The use case in this example is one we previously ignored: we assumed the activation would be performed in a separate loop (or, in the CUDA case, fused into the shared -> global phase).

Exactly, IR builder could help represent the reduction loop, if-else clauses, and activations. However, TIR generated by IR builder cannot currently be auto-tuned.

While a whole kernel written with IR builder is not tunable, we can write only the tensor intrin part with IR builder and keep the outer loops tunable. You can use the autotvm cfg to get the current split factor, and then use it to declare the tensor intrin on the fly (taking the reduction loop length as an argument).
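To make the idea concrete, here is a minimal, library-free sketch in plain Python. It deliberately does not use the TVM API: `make_reduce_relu_intrin`, `dot_relu`, and `split_factor` are hypothetical names, `split_factor` stands in for the value you would read from the autotvm cfg, and ReLU is just an assumed example activation. It only shows the shape of the approach: the intrin is built for a given reduction length, the outer loop stays outside it, and the activation is applied on the last reduction step only.

```python
def make_reduce_relu_intrin(reduce_len):
    """Build a stand-in 'intrin' specialized for one reduction length.

    Hypothetical helper: it mimics declaring a tensor intrin on the fly,
    with the reduction loop length passed in as an argument.
    """
    def intrin(a, b, offset, acc, is_last):
        # Inner reduction loop: fixed length, owned by the intrin.
        for k in range(reduce_len):
            acc += a[offset + k] * b[offset + k]
        # Assumed activation (ReLU), applied only after the final chunk
        # so the partial sums of earlier chunks are left untouched.
        if is_last:
            acc = max(acc, 0.0)
        return acc
    return intrin


def dot_relu(a, b, split_factor):
    """Dot product followed by ReLU, with the reduction axis split.

    split_factor plays the role of the tunable split factor; the outer
    loop over chunks is the part that would remain tunable.
    """
    assert len(a) == len(b) and len(a) % split_factor == 0
    intrin = make_reduce_relu_intrin(split_factor)
    n_chunks = len(a) // split_factor
    acc = 0.0
    for ko in range(n_chunks):
        acc = intrin(a, b, ko * split_factor, acc, ko == n_chunks - 1)
    return acc
```

In this sketch the result does not depend on the chosen split factor, which is the property you would want to preserve when the factor comes from the tuner.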

Indeed, this would add complexity to the schedule. I’m suggesting an alternative that minimizes changes and avoids breaking the reduction semantics.