Consider the following scenario that I ran into. Assume the PrimFunc has a single input buffer A, and two blocks in the PrimFunc consume A:
In consumer block A, we can use the cache_read primitive to cache buffer A, like:
The problem I ran into is that some computations in consumer block B cannot be done on the local copy of buffer A. For example, if consumer block A uses the tensorize primitive on the shared-to-local stage, this can sometimes change the local buffer and make it unusable for other blocks. So I think a better approach would be:
But I didn't find any primitive that implements this. Are there other ideas for solving it?
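To make the data flow I have in mind concrete, here is a toy model in plain Python. It is only a sketch and not TVM code: the lists stand in for buffers, `cache_read` here is just a copy, and `tensorized_load`, `consumer_a`, and `consumer_b` are hypothetical names. The element reversal is an assumed stand-in for whatever change tensorize makes to the local buffer.

```python
# Toy model in plain Python (not TVM code): lists stand in for buffers.

def cache_read(src):
    """Model a cache stage as a plain copy into a new buffer."""
    return list(src)

def tensorized_load(shared):
    """Hypothetical shared-to-local stage that reorders elements,
    standing in for a tensorized copy that changes the local buffer."""
    return shared[::-1]

def consumer_a(local):
    # Consumer A understands the reordered layout and undoes it first.
    return [x * 2 for x in local[::-1]]

def consumer_b(buf):
    # Consumer B expects the original element order.
    return [x + i for i, x in enumerate(buf)]

A = [10, 20, 30, 40]
A_shared = cache_read(A)              # global -> shared
A_local = tensorized_load(A_shared)   # shared -> local, layout changed

out_a = consumer_a(A_local)

# If B is also redirected to the local cache, its result is wrong.
b_from_local = consumer_b(A_local)

# What I would like instead: B reads the shared cache, A reads local.
b_from_shared = consumer_b(A_shared)

expected_b = consumer_b(A)
print(b_from_local == expected_b, b_from_shared == expected_b)
```

In this model, only consumer A depends on the local buffer, while consumer B still gets correct results from the shared copy, which is the split I am looking for a primitive to express.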