Yes; however, it introduces branching inside the loop, which prevents tensorization from working. Maybe the following step should be decompose_padding, but I am still wondering how I could generate tensorized instructions for the cases at the boundaries.
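To make the boundary issue concrete, here is a minimal plain-Python sketch (not TVM code). It assumes a hypothetical fixed-width intrinsic, `vadd_tile`, that always processes exactly `TILE` lanes; the names `TILE`, `vadd_tile`, `add_guarded` and `add_padded` are made up for illustration. The guarded version keeps the boundary check inside the loop, so its body never matches the intrinsic. The padded version moves the boundary handling into a separate padding step, so the compute loop is branch-free, which is roughly the effect I would hope to get from separating the padding out.

```python
TILE = 4  # hypothetical width of the tensorized instruction


def vadd_tile(a, b, out, start):
    """Stand-in for a tensor intrinsic: always handles exactly TILE lanes."""
    for lane in range(TILE):
        out[start + lane] = a[start + lane] + b[start + lane]


def add_guarded(a, b):
    """Tiled loop with a boundary predicate inside the loop body."""
    n = len(a)
    out = [0] * n
    num_tiles = (n + TILE - 1) // TILE  # ceiling division
    for t in range(num_tiles):
        for lane in range(TILE):
            i = t * TILE + lane
            if i < n:  # this branch is what blocks matching the intrinsic
                out[i] = a[i] + b[i]
    return out


def add_padded(a, b, pad_value=0):
    """Pad the inputs to a multiple of TILE, then run a branch-free loop."""
    n = len(a)
    padded_n = ((n + TILE - 1) // TILE) * TILE
    # Padding step: the boundary is handled here, outside the compute loop.
    a_pad = list(a) + [pad_value] * (padded_n - n)
    b_pad = list(b) + [pad_value] * (padded_n - n)
    out_pad = [0] * padded_n
    # Compute step: every iteration is a full tile, so the intrinsic applies.
    for t in range(padded_n // TILE):
        vadd_tile(a_pad, b_pad, out_pad, t * TILE)
    return out_pad[:n]  # crop the padded lanes away
```

Both functions return the same result for any input length; the difference is only where the boundary handling lives. Whether the extra padded copy is acceptable depends on the workload.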
Edit: I am marking this one as solved and have opened a new post dedicated to the new question.