@vinx13 @masahi Thanks a lot, that's quite interesting. I think I need to roll back TVM and test again. Also, I implemented the CUTLASS dp4a permutation in TVM, and it can even be about 3~4% faster than CUTLASS in the NN layout, so I believe we can do the same thing with TensorCore.