Good work!
There’s a known issue where TVM’s `dense` and `batch_matmul` ops, which compute Y = X * W^T, perform poorly in some models.
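To make the convention concrete, here is a minimal numpy sketch of what `dense` computes (shapes are illustrative, not taken from any particular model):

```python
import numpy as np

# dense convention: Y = X * W^T, with
# X: [batch, in_dim] and W: [out_dim, in_dim].
X = np.random.rand(4, 8).astype("float32")
W = np.random.rand(16, 8).astype("float32")

Y = X @ W.T  # the weight is transposed inside the op
assert Y.shape == (4, 16)
```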
There are several `matmul` and `batch_matmul` ops in BERT that take data tensors as both inputs and weights (e.g. those in multi-head attention) rather than using constant parameters as weights. In such situations, we see explicit transposes inserted when the model is imported from TensorFlow or PyTorch (they use Y = X * W for `matmul` by default). MXNet, as far as I know, uses Y = X * W^T by default.
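A small numpy sketch of why the importer has to insert that transpose (the shapes here are just for illustration):

```python
import numpy as np

# A framework op that computes Y = X * W, with W: [in_dim, out_dim].
X = np.random.rand(4, 8).astype("float32")
W = np.random.rand(8, 16).astype("float32")
Y_framework = X @ W

# To express the same computation under the Y = X * W2^T convention,
# the importer inserts an explicit transpose: W2 = W^T.
W2 = W.T
Y_dense = X @ W2.T
assert np.allclose(Y_framework, Y_dense)
```

When W is a data tensor instead of a constant, that transpose stays in the graph as a real op at runtime instead of being folded away.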
The PR you found looks like it creates a special schedule for `dense` + `transpose`. I’m not sure that’s the key to the performance improvement you got, because it is written for AutoTVM, and AutoScheduler never uses these manual schedules. You could do a more detailed analysis of those `dense`/`batch_matmul` ops’ layouts and shapes.
I agree with @comaniac that the missed conversion from `dense` to `batch_matmul` caused some wasted computation before.
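One way to see where the waste can come from, sketched in numpy with made-up shapes: the same batched computation can be expressed either as one large `dense` or as a `batch_matmul` with the weight replicated across the batch, and picking the wrong form does redundant work:

```python
import numpy as np

# dense: one [M, K] x [N, K]^T product.
# batch_matmul: B independent [M, K] x [N, K]^T products.
B, M, N, K = 8, 4, 16, 8
X = np.random.rand(B, M, K).astype("float32")
W = np.random.rand(N, K).astype("float32")

# Flattening the batch into a single dense call gives the same
# result as a batch_matmul with W broadcast to all B batches,
# but as one large matmul instead of B small ones.
Y_dense = (X.reshape(B * M, K) @ W.T).reshape(B, M, N)
Y_batch = np.einsum("bmk,nk->bmn", X, W)
assert np.allclose(Y_dense, Y_batch, atol=1e-5)
```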