Thanks for the feedback! Tiling the output computations + `compute_at` is actually exactly what I’ve been doing to prototype this, and you’re right that for a sufficiently large tile the recompute isn’t particularly bad. I think the rolling buffers aren’t immediately essential, but they would be a very beneficial future optimization.
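For reference, a back-of-the-envelope halo model of why the recompute stays acceptable at larger tiles (this is my own simplification assuming a cascade of k x k convolutions with no striding, not code from the prototype):

```python
def recompute_overhead(tile, depth, k=3):
    """Ratio of elements actually computed per output tile, relative to
    the tile itself, when a cascade of `depth` k x k convolutions is
    evaluated tile-by-tile. Each stage grows the required input region
    (the halo) by (k - 1) in each spatial dimension."""
    halo = depth * (k - 1)
    return (tile + halo) ** 2 / tile ** 2
```

With a depth-5 cascade of 3x3 convolutions, a 32x32 tile recomputes roughly 1.7x the work, while an 8x8 tile recomputes over 5x, which is why the tile size has to be large enough before cascading pays off.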
In our testing/prototyping we have found profitable cascades of 5+ ops, particularly in mobilenet-style architectures and super-resolution networks. Determining whether continuing a cascade is profitable would be one of the jobs of the cascading algorithm.
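One toy version of that profitability test, comparing the halo recompute cost of adding a stage against the DRAM traffic a fused intermediate avoids; the function, its parameters, and the roofline-style `machine_balance` argument are all illustrative and not the actual algorithm:

```python
def extend_cascade_is_profitable(tile, depth, k, flops_per_elem,
                                 elem_bytes, machine_balance):
    """Heuristic: is fusing one more k x k stage into a depth-`depth`
    cascade a win for a given output tile size?

    Extra cost: the halo grows by (k - 1), so more elements are
    recomputed per tile. Saving: the new intermediate's tile stays
    on-chip instead of round-tripping DRAM. `machine_balance` is the
    assumed sustained FLOPs per byte of DRAM bandwidth."""
    halo_now = depth * (k - 1)
    halo_next = (depth + 1) * (k - 1)
    extra_elems = (tile + halo_next) ** 2 - (tile + halo_now) ** 2
    extra_flops = extra_elems * flops_per_elem
    bytes_saved = tile * tile * elem_bytes
    # Profitable if the extra compute time is cheaper than the
    # memory time saved.
    return extra_flops <= bytes_saved * machine_balance
```

A real cost model would also have to account for on-chip buffer capacity and stride/pooling stages that shrink the halo, but even this crude check captures why small tiles or compute-bound targets cut a cascade short.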
My major concern with integrating this is that convolution-type operations always end up on their own in primitive functions. For my experiments I’m currently lowering the whole graph to a single TE, but this will not work alongside the current TOPI integration, which expects ‘master ops’ to determine the schedule. In essence, I would like to do hierarchical scheduling: first of the cascades, and second of the ops themselves.