Support Einsum in Frontend

Hi!

I was trying to convert my PyTorch model to Relay. The model is from performer_torch. However, I got "NotImplementedError: The following operators are not implemented: ['aten::einsum']".

I also tried to work around this by first converting the PyTorch model to ONNX and then parsing the ONNX model to get Relay. However, the same issue happens: "tvm.error.OpNotImplemented: The following operators are not supported for frontend ONNX: Einsum".
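For reference, here is a minimal sketch of the conversion path that hits the error. The toy module below is only a stand-in for the performer_torch model (any module that calls torch.einsum reproduces the same failure), and the input names and shapes are placeholders:

import torch
from tvm import relay

# Stand-in for the real model: a single attention-style einsum.
class EinsumScores(torch.nn.Module):
    def forward(self, q, k):
        return torch.einsum('bik,bjk->bij', q, k)

q = torch.randn(1, 64, 32)
k = torch.randn(1, 64, 32)
scripted = torch.jit.trace(EinsumScores(), (q, k))

# Raises: NotImplementedError: The following operators are not implemented: ['aten::einsum']
mod, params = relay.frontend.from_pytorch(scripted, [('q', q.shape), ('k', k.shape)])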

Einsum has become an important operator in the transformer family, as it is convenient for expressing various kinds of attention mechanisms. Is there any way to work around this? Or is there any plan to support it in the frontend?

Best Regards

topi supports einsum, so I think all you need to do is add Relay/frontend support.

@apuaaChen Thanks, I think this is a good idea. I’ll put relay/frontend support for einsum in my backlog.

Thank you for your suggestion!

I have tried to support einsum with topi.einsum and the auto-scheduler. It works perfectly on CPU.

However, the story is different on GPU. On one hand, I could not find a schedule for einsum in topi. On the other hand, the computational DAG generated by topi.einsum does not seem very friendly to auto-scheduling.

from tvm import te, topi, auto_scheduler

@auto_scheduler.register_workload
def einsum(M, N, K):
    x = te.placeholder((M, K), name='x')
    y = te.placeholder((K, N), name='y')
    out = topi.einsum('ij,jk->ik', x, y)
    # out = topi.matmul(x, y)  # equivalent matmul for comparison
    return [x, y, out]

For instance, the code above is transformed to

Computational DAG:
x = PLACEHOLDER [1024, 256]
y = PLACEHOLDER [256, 1024]
T_einsum(ax0, ax1) = ((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((( ..(OMITTED).. 56), 1024), floormod(((ax0*256) + 255), 256)]*y[floormod(floordiv((ax1 + 261120), 1024), 256), floormod((ax1 + 261120), 1024)]))

Whereas the equivalent matmul is transformed to

Computational DAG:
x = PLACEHOLDER [1024, 256]
y = PLACEHOLDER [256, 1024]
T_matmul(ax0, ax1) += (x[ax0, k]*y[k, ax1])

Under the same auto-scheduling configuration, the einsum workload yields fewer eligible programs, and its final throughput is also much lower than the matmul's.
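For completeness, this is roughly how both workloads were tuned; the target, trial count, and log file name below are placeholders rather than my exact configuration:

import tvm
from tvm import auto_scheduler

M, N, K = 1024, 1024, 256
target = tvm.target.Target('llvm')  # 'cuda' is where the problems described above show up

task = auto_scheduler.SearchTask(func=einsum, args=(M, N, K), target=target)
print(task.compute_dag)  # the compute DAG expressions shown above

tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=200,
    measure_callbacks=[auto_scheduler.RecordToFile('einsum.json')],
)
task.tune(tune_option)
sch, args = task.apply_best('einsum.json')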

Thanks!

By the way, I found that the compute provided by topi.einsum unrolls the whole reduction dimension: as the einsum DAG above shows, the inner-product axis is expanded into one flat expression instead of being kept as a reduction axis. Since methods like Ansor's auto-scheduling construct the search space from the loop axes, this compute greatly limits the size of the valid search space.
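For comparison, a contraction written with an explicit reduction axis gives the auto-scheduler an axis it can split, reorder, and unroll. This is just a minimal hand-written sketch of the 'ij,jk->ik' case, not the actual rewritten topi.einsum compute:

from tvm import te

def einsum_ij_jk_ik(M, N, K):
    x = te.placeholder((M, K), name='x')
    y = te.placeholder((K, N), name='y')
    k = te.reduce_axis((0, K), name='k')
    # ij,jk->ik expressed as te.sum over an explicit reduce axis
    out = te.compute((M, N), lambda i, j: te.sum(x[i, k] * y[k, j], axis=k), name='T_einsum')
    return [x, y, out]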

I tried to add einsum support. What I have done so far:

  • Rewrote the compute with te instead of tir, so that it supports GPU under auto-scheduling
  • Added einsum to the Relay ops
  • Added the einsum op to the PyTorch frontend (a rough sketch is below).
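Roughly, the frontend hook just maps aten::einsum onto the new Relay einsum op. This is only an illustrative sketch of the shape of that converter (assuming the Relay op takes the list of tensors plus the equation string), not the exact code in my branch:

# inside the PyTorch frontend's converter class
def einsum(self, inputs, input_types):
    # aten::einsum carries the equation string first, then the list of tensors
    equation = inputs[0]
    data = inputs[1]
    return _op.einsum(data, equation)

# registered in the convert map as:
#     'aten::einsum': self.einsum,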

Great! You are welcome to send a PR when you think it’s ready.