Glad to see a proposal with such functionality.
I have to admit that I had also done something similar to what you are proposing (at least at the TE level).
One problem I had was the enlargement of the tensor iteration domains ([TVM Scheduling] Split factors missmatch to original extent - #2 by aca88). This meant that many inner loops were mostly "out of the original domain", which was either really not performant or caused a huge explosion of "program code" when statically resolving all those ifs.
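For illustration, here is a minimal TE sketch (assuming the classic `te.create_schedule` API) of what I mean: splitting an axis by a factor that does not divide its extent pads the iteration domain, and the out-of-domain points get guarded by conditionals.

```python
import tvm
from tvm import te

n = 10  # extent not divisible by the split factor below
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")

s = te.create_schedule(B.op)
# Splitting by factor 4 enlarges the domain to ceil(10/4) * 4 = 12
# points; the 2 extra points lie outside the original extent and are
# guarded with likely()/if conditions in the lowered code.
xo, xi = s[B].split(B.op.axis[0], factor=4)
print(tvm.lower(s, [A, B], simple_mode=True))
```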
Another problem you would face is the limitations of FuseOps. Without any change there, I don't know how you would "automatically" get the composed stages you will need in order to schedule them (see the sketch below).
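As a concrete (hedged) illustration of that limitation: FuseOps only merges ops according to its fixed op-pattern rules, e.g. elementwise ops into a preceding conv2d, and will not compose two conv2d stages into one function, so you would not get multi-conv composites "for free".

```python
import tvm
from tvm import relay

# Two back-to-back conv2d stages plus an elementwise op.
data = relay.var("data", shape=(1, 16, 56, 56))
w1 = relay.var("w1", shape=(16, 16, 3, 3))
w2 = relay.var("w2", shape=(16, 16, 3, 3))
conv1 = relay.nn.conv2d(data, w1, padding=(1, 1))
conv2 = relay.nn.conv2d(conv1, w2, padding=(1, 1))
out = relay.nn.relu(conv2)

mod = tvm.IRModule.from_expr(
    relay.Function(relay.analysis.free_vars(out), out))
mod = relay.transform.InferType()(mod)
# FuseOps merges relu into the second conv2d, but the two conv2d
# stages (both "fusion masters") stay in separate functions.
mod = relay.transform.FuseOps(fuse_opt_level=2)(mod)
print(mod)
```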
Nonetheless, for specific configurations of layers, I think what you propose is a reasonable way of proceeding. Rolling buffers would make it even sweeter.
Side note:
This kind of reminds me of the "graph tuner" infrastructure.
Maybe there should be a more general infrastructure for doing "graph-level tuning" (I think the current one only tunes with respect to different layout transformations).
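For reference, this is roughly what the current autotvm graph tuner flow looks like (a sketch based on the x86 tuning tutorial; `mod`, `dshape`, `records`, and `target_ops` are assumed to come from a prior kernel-tuning step). Its search space is limited to per-op layout choices, which is why I think a more general graph-level hook would be useful.

```python
from tvm.autotvm.graph_tuner import DPTuner

# Assumed inputs from an earlier autotvm kernel-tuning run:
#   mod        - the Relay module being tuned
#   dshape     - input shape of the "data" variable
#   records    - per-kernel tuning log file
#   target_ops / target - ops and target used during tuning
executor = DPTuner(mod["main"], {"data": dshape}, records,
                   target_ops, target)
# Benchmarks the cost of layout transforms between candidate per-op
# layouts, then picks the globally cheapest layout assignment.
executor.benchmark_layout_transform(min_exec_num=2000)
executor.run()
executor.write_opt_sch2record_file("graph_opt.log")
```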