Regarding the imperfect tiling: I think the cause of that problem is that tensorization happens as part of ScheduleOps, before the LoopPartition pass.
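To make the ordering issue concrete, here is a rough sketch in plain Python (not TVM code; `split_with_guard` and `split_partitioned` are hypothetical names I made up for illustration). It assumes a loop of extent 10 split by a factor of 4, so the last tile is only partly full. The first function is roughly what the loop nest looks like before partitioning: every tile carries a bounds check. The second is roughly what it looks like after partitioning: full tiles have no check, and the tail is handled separately. A fixed-size intrinsic could only be matched against the unguarded full tiles, which, as far as I understand it, is why doing tensorization before partitioning runs into trouble.

```python
def split_with_guard(extent, factor):
    """Split a loop by `factor`; every iteration carries a bounds check."""
    visited = []
    outer_extent = (extent + factor - 1) // factor  # ceil division
    for outer in range(outer_extent):
        for inner in range(factor):
            i = outer * factor + inner
            if i < extent:  # guard needed only because the last tile is partial
                visited.append(i)
    return visited


def split_partitioned(extent, factor):
    """Same iteration space, with full tiles separated from the tail."""
    visited = []
    full_tiles = extent // factor
    # Full tiles: no guard, so the inner loop has a fixed, known extent.
    for outer in range(full_tiles):
        for inner in range(factor):
            visited.append(outer * factor + inner)
    # Tail: the leftover iterations that do not fill a whole tile.
    for i in range(full_tiles * factor, extent):
        visited.append(i)
    return visited
```

Both functions visit the same indices in the same order; only the loop structure differs.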
There has been a good discussion about this problem, and the solutions suggested were:
1) Auto-tensorization
2) Having a separate pass that runs much later, after ScheduleOps and all the necessary IR transformations.
You can find the discussion here: