[Tensorize] Support "reduce_last" for TensorIntrin

Hi, All.

The existing TensorIntrin supports “reduce_init” and “reduce_update”, which cover most cases. However, when I tried to implement a tensor intrinsic like “matmul_with_relu”, I found that the current TensorIntrin is not sufficient to describe it.

The TIR I’m looking for is something like:

if (k == K - 1) {
    # call "matmul_with_relu" kernel; currently this part is MISSING.
} else if (k == 0) {
    # call "matmul_beta_0" kernel, which is exactly what "reduce_init" does.
} else {
    # call "matmul_beta_1" kernel, which is exactly what "reduce_update" does.
}
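
To make the intended semantics concrete, here is a minimal pure-Python sketch of the three-way dispatch over the reduction axis. It is only an illustration, not TVM code: the three kernel functions are hypothetical stand-ins named after the kernels above, `tiled_matmul_relu` is a made-up driver, and the handling of K == 1 (starting from a zero accumulator) is my assumption, since the pseudocode above does not say what should happen in that case.

```python
# Hypothetical stand-ins for the three kernels; none of these are TVM APIs.

def matmul_beta_0(a, b):
    """C = A @ B, overwriting the accumulator (the role of "reduce_init")."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def matmul_beta_1(c, a, b):
    """C = C + A @ B (the role of "reduce_update")."""
    p = matmul_beta_0(a, b)
    return [[cv + pv for cv, pv in zip(crow, prow)]
            for crow, prow in zip(c, p)]


def matmul_with_relu(c, a, b):
    """C = relu(C + A @ B) (the role of the proposed "reduce_last")."""
    return [[max(v, 0) for v in row] for row in matmul_beta_1(c, a, b)]


def tiled_matmul_relu(a_tiles, b_tiles):
    """Reduce over K tiles, picking a kernel per step as in the pseudocode."""
    num_steps = len(a_tiles)  # K in the pseudocode
    acc = None
    for k in range(num_steps):
        a, b = a_tiles[k], b_tiles[k]
        if k == num_steps - 1:
            if acc is None:
                # Assumption: when K == 1 there is no prior partial sum,
                # so the last-step kernel starts from zeros.
                acc = [[0] * len(b[0]) for _ in a]
            acc = matmul_with_relu(acc, a, b)
        elif k == 0:
            acc = matmul_beta_0(a, b)
        else:
            acc = matmul_beta_1(acc, a, b)
    return acc


# A = [[1, -2], [3, 4]] and B = [[1, 0], [1, -1]], split along the K axis.
a_tiles = [[[1], [3]], [[-2], [4]]]
b_tiles = [[[1, 0]], [[1, -1]]]
result = tiled_matmul_relu(a_tiles, b_tiles)
```

Note that ReLU is applied only once, after the full sum is available. Applying it at every step would clip negative partial sums and give a different result, which is why the last step needs its own kernel.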

Do we have a plan to support a “reduce_last” attribute for TensorIntrin?

@tqchen since this feature would change the tensorize APIs, I suppose I shouldn’t send a PR directly. Could you connect me with someone who’s interested and could help review the proposal?

@jwfromm @Huyuwei @yzhliu @FrozenGene Comments are welcome!