[RFC] 'Cascade' Scheduling

Glad to see a proposal with such functionality.

I have to admit that I have also done something similar to what you are proposing (at least at the TE level).

One problem I had was the enlargement of the tensor iteration domains ([TVM Scheduling] Split factors missmatch to original extent - #2 by aca88). This led either to many inner loops running mostly “out of the original domain”, which was really not performant, or to a huge explosion of “program code” needed to statically resolve all those ifs.
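To make the domain enlargement concrete, here is a minimal sketch in plain Python (not TVM code; `split_stats` and `guarded_sum` are hypothetical names I made up for illustration). It assumes a loop of a given extent is split by a factor that does not divide it, so the outer loop is rounded up and the inner loop needs an `if` guard:

```python
def split_stats(extent, factor):
    """Return (outer_trips, padded_extent, wasted_iterations) for a split."""
    outer = -(-extent // factor)      # integer ceil(extent / factor)
    padded = outer * factor           # enlarged iteration domain
    return outer, padded, padded - extent


def guarded_sum(data, factor):
    """Sum `data` with a split loop nest, counting guarded-out iterations."""
    extent = len(data)
    outer, _, _ = split_stats(extent, factor)
    total = 0
    skipped = 0
    for o in range(outer):
        for i in range(factor):
            idx = o * factor + i
            if idx < extent:          # the "if" that guards the original domain
                total += data[idx]
            else:
                skipped += 1          # iteration outside the original domain
    return total, skipped


print(split_stats(10, 4))                # (3, 12, 2)
print(guarded_sum(list(range(10)), 4))   # (45, 2)
```

With a single split the waste is small, but when several stages are composed and each one rounds its region up, the guarded iterations add up, and removing the guards statically means emitting separate code for the boundary cases.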

Another problem you would face is the limitations of FuseOps. Without any change there, I don't know how you would “automatically” get the composed stages you need in order to schedule them.

Nonetheless, for specific configurations of layers, I think what you propose is a reasonable way of processing. Rolling buffers would make it even sweeter.

side note

This kind of reminds me of the “graph tuner” infrastructure.

Maybe there should be a more general infrastructure for doing “graph-level tuning” (I think the current one only tunes w.r.t. different layout transformations).