[Performance] TVM - pytorch BERT on CPU

Good work!

There’s a known issue that TVM’s dense and batch_matmul ops, which compute Y = X * W^T (the second operand is treated as transposed), can have poor performance in some models.
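For reference, a small sketch (standard Relay API; on older TVM versions `.numpy()` is `.asnumpy()`) verifying that nn.dense computes Y = X * W^T:

```python
import numpy as np
import tvm
from tvm import relay

x = relay.var("x", shape=(8, 64), dtype="float32")
w = relay.var("w", shape=(32, 64), dtype="float32")  # weight stored as (N, K)
mod = tvm.IRModule.from_expr(relay.Function([x, w], relay.nn.dense(x, w)))

xd = np.random.rand(8, 64).astype("float32")
wd = np.random.rand(32, 64).astype("float32")
out = relay.create_executor("graph", mod=mod).evaluate()(xd, wd)

# nn.dense multiplies by the transposed weight: Y = X * W^T
np.testing.assert_allclose(out.numpy(), xd @ wd.T, rtol=1e-5)
```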

There are several matmul & batch_matmul ops in BERT that take data tensors as both inputs (e.g., those in the multi-head attention) rather than using a constant parameter as the weight. In such cases, we see explicit transposes inserted when the model is imported from TensorFlow or PyTorch (both use Y = X * W for matmul by default). MXNet, as far as I know, uses Y = X * W^T by default.
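You can see this directly by importing a tiny two-operand matmul; a minimal sketch (assuming a recent TVM build with the PyTorch frontend):

```python
import torch
import tvm
from tvm import relay

class MatMul(torch.nn.Module):
    def forward(self, x, y):
        return torch.matmul(x, y)

# both operands are data tensors, like Q * K^T in multi-head attention
x = torch.randn(12, 128, 64)
y = torch.randn(12, 64, 128)
scripted = torch.jit.trace(MatMul(), (x, y))

mod, params = relay.frontend.from_pytorch(
    scripted, [("x", (12, 128, 64)), ("y", (12, 64, 128))]
)

# The printed IR should show a transpose feeding nn.batch_matmul,
# since batch_matmul expects its second operand pre-transposed.
print(mod)
```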

The PR you found looks like it creates a special schedule for dense + transpose. I’m not sure that’s the key to the performance improvement you got, because it is written for AutoTVM, and AutoScheduler never uses these manual schedules. It would be worth doing a more detailed analysis of the layouts and shapes of those dense/batch_matmul ops.
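One way to gather that data is to walk the Relay graph and record the input shapes of every dense/batch_matmul call; a rough sketch (the helper name `list_matmul_shapes` is just illustrative):

```python
import tvm
from tvm import relay

def list_matmul_shapes(mod):
    """Collect (op name, input shapes) for every dense/batch_matmul call."""
    mod = relay.transform.InferType()(mod)  # populate checked_type first
    found = []

    def visit(expr):
        if isinstance(expr, relay.Call) and isinstance(expr.op, tvm.ir.Op):
            if expr.op.name in ("nn.dense", "nn.batch_matmul"):
                shapes = [tuple(arg.checked_type.shape) for arg in expr.args]
                found.append((expr.op.name, shapes))

    relay.analysis.post_order_visit(mod["main"], visit)
    return found

# for name, shapes in list_matmul_shapes(mod):
#     print(name, shapes)
```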

I agree with @comaniac that the mis-conversion from dense to batch_matmul caused some wasted computation before.