Thanks for the clarification! I concur that such a primitive would be useful and would allow more flexible compute movement.
Regarding the full graph, I agree that Relay (along with its optimizations) is very useful. I was wondering whether there would be a benefit to lowering the full graph to TensorIR after Relay optimization, rather than lowering each primitive function separately. I guess this has to do with how AutoTVM/Ansor will allow the exploration of schedules, but I have a feeling the search could be scoped via the “blocks”, since lowering the whole graph would otherwise lead to an explosion of the search space. (I am looking at this from an AoT angle.)
Moreover, maybe that could lay a foundation for optimizations across primitive functions later.