Halide provides the scheduling primitive store_at (as well as store_root), which lets you choose where a tensor's storage is allocated independently of where it is computed. This is very useful when we want to take advantage of the sliding window optimization and create rolling buffers, both of which can be critical for reducing memory usage on memory-constrained devices. For examples of how this is used in Halide, see the tutorial on multi-stage pipelines: https://halide-lang.org/tutorials/tutorial_lesson_08_scheduling_2.html.
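To make the effect concrete, here is a minimal plain-Python sketch of what a sliding window with a rolling buffer achieves. It uses neither the Halide nor the TVM API; the `producer` stage, the two-tap stencil, and the function names are all hypothetical and chosen only for illustration. In Halide, a schedule along the lines of `producer.store_root().compute_at(consumer, x)` is what asks for this behaviour.

```python
def producer(x):
    # Hypothetical stage standing in for an expensive computation.
    return x * x


def consumer_full_buffer(n):
    # Baseline: compute and store the whole producer up front,
    # so the intermediate buffer holds n + 1 values.
    buf = [producer(x) for x in range(n + 1)]
    return [buf[x] + buf[x + 1] for x in range(n)], len(buf)


def consumer_rolling_buffer(n):
    # Sliding window: storage lives outside the loop, but each producer
    # value is computed just before it is first needed and kept in a
    # 2-element circular buffer indexed by position modulo 2.
    ring = [0, 0]
    calls = 0
    out = []
    for x in range(n):
        # On the first iteration both stencil inputs are missing;
        # afterwards only the newest one (x + 1) has to be computed.
        first_new = x if x == 0 else x + 1
        for px in range(first_new, x + 2):
            ring[px % 2] = producer(px)
            calls += 1
        out.append(ring[x % 2] + ring[(x + 1) % 2])
    return out, len(ring), calls


full, full_size = consumer_full_buffer(8)
rolled, ring_size, calls = consumer_rolling_buffer(8)
print(full == rolled, full_size, ring_size, calls)
```

Both versions produce the same output and evaluate the producer the same number of times, but the rolling version only ever holds two intermediate values instead of n + 1. That reduction in intermediate storage is the saving I am after.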
Is this something we can emulate in TVM with the existing scheduling primitives? If not, would the design of TE permit it? In the latter case, I’d be interested to know whether it is currently worth implementing in TE given the change of approach in TensorIR, or whether it would be better to wait.