Any plan for rewriting topi with TensorIR?

Since TensorIR will replace TE in Relax, I want to know the follow-up plan for the TOPI operators. Will they all be rewritten in TensorIR, or just the ones with bad performance?

Hi @NortromChiang, TOPI is a good abstraction for operator declarations. I don’t think it has “bad performance”, and we don’t have a plan to rewrite it.

TOPI’s operator definition (“compute”) can be considered a nice DSL that generates TensorIR, which we believe is worth preserving. TOPI schedules are not used in TensorIR or MetaSchedule anyway.
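For example, a TOPI compute can be turned into a TensorIR PrimFunc directly. A minimal sketch using te.create_prim_func:

import tvm
from tvm import te, topi

# Declare a workload with the TOPI/TE "compute" DSL...
A = te.placeholder((128, 128), name="A", dtype="float32")
B = topi.nn.relu(A)

# ...and generate a TensorIR PrimFunc from it.
func = te.create_prim_func([A, B])
print(func.script())  # prints the generated TensorIR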

Thanks for your quick reply @Hzfengsy @junrushao. My question was a little vague, sorry about that. What I really want to ask/claim is:

1) How to use both TensorIR and TE in a single op?

For the case where an op is composed of sequential sub-ops:

B = topi.op(A)
C = script.op(B)
D = topi.op(C)

TVMScript (TensorIR) is compatible with TE, since I can use te.extern_primfunc to convert a PrimFunc written in TVMScript into TE (an ExternOp) and then construct the target op in the standard TE way.
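A minimal sketch of this sequential mixing, assuming te.extern_primfunc is available (scripted_add_one is just an illustrative TensorIR op):

import tvm
from tvm import te, topi
from tvm.script import tir as T

@T.prim_func
def scripted_add_one(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (16, 16), "float32")
    B = T.match_buffer(b, (16, 16), "float32")
    for i, j in T.grid(16, 16):
        with T.block("add_one"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] + T.float32(1)

A = te.placeholder((16, 16), name="A", dtype="float32")
B = topi.nn.relu(A)                            # B = topi.op(A)
C = te.extern_primfunc([B], scripted_add_one)  # C = script.op(B), wrapped as an ExternOp
D = topi.nn.relu(C)                            # D = topi.op(C)

s = te.create_schedule(D.op)
mod = tvm.lower(s, [A, D])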

But for the case where an op is composed of nested sub-ops:

@T.prim_func
def composed_op(...):  # pseudocode: topi computes cannot be called inside a prim_func
    ...
    if cond:
        B = topi.op1(A)
    else:
        B = topi.op2(A)

I can’t reuse topi.op directly in this case unless I build each topi.op separately and wrap it as a call_extern function (sketched below). But then performance may be compromised, for example when A is a subregion of another buffer in composed_op.
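For reference, a hypothetical sketch of that call_extern workaround; op1_packed and op2_packed stand in for separately built and linked topi kernels and are purely illustrative:

from tvm.script import tir as T

@T.prim_func
def composed_op(a: T.handle, b: T.handle, cond: T.int32) -> None:
    A = T.match_buffer(a, (16,), "float32")
    B = T.match_buffer(b, (16,), "float32")
    # Each branch calls a separately compiled kernel by symbol name, so the
    # callee only sees opaque pointers and cannot be fused with its caller
    # or specialized to a subregion of A.
    if cond != 0:
        T.evaluate(T.call_extern("int32", "op1_packed", A.data, B.data))
    else:
        T.evaluate(T.call_extern("int32", "op2_packed", A.data, B.data))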

So I’m wondering: is writing TOPI in TensorIR a solution to this problem? Or is there another solution?

2) An op written in TE may not be the best choice in certain cases

What I mean by “bad performance” is that some implementations of an op are hard to write in TE because of its limited expressiveness, so it may not always generate the best-performing code.

To be more concrete, a customized op may contain many slice/view/permute sub-ops. Each of these can have an SSA implementation or a non-SSA (TensorView-style) one, and the best-performing choice depends both on high-level information such as the workload and on the implementation of the previous sub-ops.

For example, the best implementation of the following slice -> slice -> slice op should be TensorIR-style rather than TE-style.

  • TE/TOPI impl, which generates code with useless buffer copies:

    import tvm
    from tvm import te, topi

    A = te.placeholder(shape=(16, 16, 16), name="A")
    # Each strided_slice is its own compute, so B and C become real
    # intermediate buffers that are copied element by element.
    B = topi.strided_slice(A, [8, 0, 0], [16, 16, 16], strides=[1, 1, 1], axes=[0, 1, 2])
    C = topi.strided_slice(B, [0, 4, 0], [8, 12, 16], strides=[1, 1, 1], axes=[0, 1, 2])
    D = topi.strided_slice(C, [0, 0, 0], [8, 8, 8], strides=[1, 1, 1], axes=[0, 1, 2])
    func = te.create_prim_func([A, D])
    
  • TensorIR impl, which reindexes the axes via match_buffer:

    @tvm.script.ir_module
    class SliceModule:
        @T.prim_func
        def main(a: T.handle, d: T.handle) -> None:  # pylint: disable=no-self-argument
            T.func_attr({"global_symbol": "main", "tir.noalias": True})
            A = T.match_buffer(a, (16, 16, 16), "float32")
            D = T.match_buffer(d, (8, 8, 8), "float32")
            with T.block("getbuf"):
                # Re-view subregions of A in place instead of copying them.
                B = T.match_buffer(A[8:16, 0:16, 0:16], (8, 16, 16), "float32", strides=(256, 16, 1), elem_offset=8 * 256)
                C = T.match_buffer(B[0:8, 4:12, 0:16], (8, 8, 16), "float32", strides=(256, 16, 1), elem_offset=8 * 256 + 4 * 16)
                for i, j, k in T.grid(8, 8, 8):
                    with T.block("slice"):
                        vi, vj, vk = T.axis.remap("SSS", [i, j, k])
                        D[vi, vj, vk] = C[vi, vj, vk]
    

And it is easier to auto-generate the above code with the match_buffer API than to auto-generate a TE ExternOp with the old IRBuilder.