Why is TVM's bmm slower than PyTorch's bmm?

I used the bmm from the blog post (the link title only shows "Redirecting…"), but TVM's bmm is slower than PyTorch's. The profiling results from nvvp are shown below: the left screenshot is PyTorch and the right is TVM.

Is there a better TVM bmm schedule for the attention used in Transformers?
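For reference, here is the semantics of the batched matmul in question (as in `torch.bmm`), written as a pure-Python sketch rather than TVM or PyTorch code: in attention, the score matrix is a batch matmul of Q against K transposed, one independent matmul per batch (and head). The function name `bmm` and the shapes `(B, M, K) x (B, K, N) -> (B, M, N)` follow the PyTorch convention; this is only an illustration of what any schedule must compute, not the implementation from the blog.

```python
# Pure-Python reference for batch matrix multiply (torch.bmm semantics).
# a has shape (B, M, K), b has shape (B, K, N); result has shape (B, M, N).
def bmm(a, b):
    out = []
    for ab, bb in zip(a, b):  # one independent matmul per batch element
        rows = []
        for i in range(len(ab)):          # M
            row = []
            for j in range(len(bb[0])):   # N
                # dot product over the reduction axis K
                row.append(sum(ab[i][k] * bb[k][j] for k in range(len(bb))))
            rows.append(row)
        out.append(rows)
    return out

# Tiny check: a batch of one 2x2 matmul.
a = [[[1, 2], [3, 4]]]
b = [[[5, 6], [7, 8]]]
print(bmm(a, b))  # [[[19, 22], [43, 50]]]
```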