From the example provided,
c = te.compute((m, n), lambda i, j: te.sum(a[i, k_axis] * b[k_axis, j], axis=k_axis), name="c")
this doesn’t match the intrin body that contains relu in the last reduction step.
Yeah, but that’s because TE has a restriction that the reduction must appear at the top level of the compute body; otherwise compilation fails.
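For example, a fused form like the following is rejected, because the te.sum is no longer the top-level expression of the compute body (a sketch reusing the names from the snippet above; tvm.tir.max here is the element-wise max):

# Rejected by TE: te.sum is nested inside tir.max, so the reduction
# is no longer the top-level expression of the compute body.
c_fused = te.compute(
    (m, n),
    lambda i, j: tvm.tir.max(te.sum(a[i, k_axis] * b[k_axis, j], axis=k_axis), 0.0),
    name="c_fused",
)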
We can use the IR builder to build the reduction loop, then add the activation in the last iteration.
Exactly, the IR builder can represent the reduction loop, if-else clauses, and activations. However, TIR generated by the IR builder currently cannot be auto-tuned. From this point of view, I agree we should expect TensorIR and MetaScheduler to solve the problem fundamentally.
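For reference, here is a minimal sketch of that IR-builder approach, assuming a recent TVM, float32 operands, and static shapes (all names here are illustrative, not the exact intrinsic from the example above):

import tvm
from tvm import te

m, n, k = 128, 128, 128  # static shapes assumed for this sketch

def gemm_relu_ir(A, B, C):
    # Manually build the reduction loop and fuse the activation into
    # the last reduction iteration with an if-clause.
    ib = tvm.tir.ir_builder.create()
    Ab = ib.buffer_ptr(A)
    Bb = ib.buffer_ptr(B)
    Cb = ib.buffer_ptr(C)
    with ib.for_range(0, m, name="i") as i:
        with ib.for_range(0, n, name="j") as j:
            Cb[i, j] = 0.0
            with ib.for_range(0, k, name="kk") as kk:
                Cb[i, j] += Ab[i, kk] * Bb[kk, j]
                with ib.if_scope(kk == k - 1):
                    # relu applied only on the last reduction iteration
                    Cb[i, j] = tvm.tir.max(Cb[i, j], 0.0)
    return ib.get()

a = te.placeholder((m, k), name="a")
b = te.placeholder((k, n), name="b")
c = te.extern((m, n), [a, b],
              lambda ins, outs: gemm_relu_ir(ins[0], ins[1], outs[0]),
              name="c", dtype="float32")
s = te.create_schedule(c.op)
mod = tvm.build(s, [a, b, c], target="llvm")

The if_scope on kk == k - 1 is exactly what plain te.compute cannot express, since it places the activation inside the reduction loop.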
Makes sense to me. Thank you, @vinx13.