[Tensorize] Support "reduce_last" for TensorIntrin

@tqchen Since this feature would change the tensorize APIs, I suppose I shouldn’t send a PR directly. Could you connect me with someone who’s interested and can help review the proposal?