The key to using prefetch is choosing the right distance.
If the distance is too large, the fetched value will be evicted from the cache before it is actually used.
If the distance is too small, the fetch request will still be in flight, with no data returned yet, when the value is needed.
However, in most cases prefetching just slows the program down, because it has to issue one more instruction.
Some CPUs just treat it as a no-op.
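To make the role of the distance concrete, here is a minimal sketch in C. It assumes a GCC or Clang toolchain, since __builtin_prefetch is a compiler builtin and not standard C. The function name sum_with_prefetch and the distance parameter are made up for illustration and are not part of TVM or Halide.

```c
#include <stddef.h>

/* Sum an array while hinting the cache `distance` elements ahead.
 * __builtin_prefetch(addr, rw, locality) is a GCC/Clang builtin:
 *   rw = 0       -> the data will be read, not written
 *   locality = 3 -> keep the data in cache as long as possible
 * It is only a hint; the hardware is free to ignore it. */
long sum_with_prefetch(const int *data, size_t n, size_t distance)
{
    long total = 0;
    for (size_t i = 0; i < n; ++i) {
        /* Only prefetch addresses that lie inside the array. */
        if (i + distance < n) {
            __builtin_prefetch(&data[i + distance], 0, 3);
        }
        total += data[i];
    }
    return total;
}
```

The result is the same for any distance; only the timing changes. A good value has to be found by measuring on the target, since it depends on memory latency and on how much work each iteration does.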
The target processor (Hexagon) has both a DSP and a vector multiplier (HVX), both of which benefited from prefetching in Halide, for example when doing convolutions. It would be nice to use the TVM-based function to do something similar.
Yes, the distance will be critical. The folks working on Halide saw the behavior you describe, but they also found a significant speed-up with well-chosen parameters.
I have not yet looked at the timing improvements using (your?) TVM prefetch() function. I am still trying to work out how it behaves. I see, for example, that if s is a schedule built on B and I call s[B].prefetch(A, axis, offset),
where either A == B or B depends on A, then the TVM IR has the form prefetch(some address in A, 0, 3, 1).
Can you tell me how the TVM-level call is translated into this IR (depending on the relationship between A and B)? And what do the arguments 0, 3, 1 represent?