Prefetch shared memory to registers

Hi everyone,

For the CUDA target, I first fetch data from global memory into shared memory. I then want to build a software pipeline by prefetching some of that data from shared memory into registers, since a shared memory request can take tens of cycles and sometimes even longer.
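To make the intended pattern concrete, here is a minimal hand-written CUDA sketch of the kind of code I would like to end up with. It is not TVM output, and the names (`matvec_prefetch`, `run_matvec_prefetch`, `TILE`) are made up for illustration. It assumes `K` is a multiple of `TILE`. The inner loop loads the next value from shared memory into a register before the current one is consumed, so the shared memory latency can overlap with the multiply-add. The compiler may already reorder loads like this on its own, so treat it as a picture of the schedule rather than a guaranteed speedup.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

constexpr int TILE = 128;  // elements of x staged in shared memory per step (illustrative)

// Computes y[row] = sum_k A[row * K + k] * x[k]. Assumes K % TILE == 0.
__global__ void matvec_prefetch(const float* A, const float* x, float* y,
                                int rows, int K) {
    __shared__ float xs[TILE];
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;

    for (int base = 0; base < K; base += TILE) {
        // Stage 1: global -> shared, cooperatively across the block.
        for (int i = threadIdx.x; i < TILE; i += blockDim.x) {
            xs[i] = x[base + i];
        }
        __syncthreads();

        if (row < rows) {
            // Stage 2: shared -> register, one element ahead of its use.
            float cur = xs[0];                    // prologue: first load
            for (int k = 0; k < TILE - 1; ++k) {
                float next = xs[k + 1];           // prefetch the next element
                acc += A[row * K + base + k] * cur;  // compute on the current one
                cur = next;
            }
            acc += A[row * K + base + TILE - 1] * cur;  // epilogue: last element
        }
        __syncthreads();  // all reads of xs finish before the next tile overwrites it
    }
    if (row < rows) y[row] = acc;
}

// Hypothetical host helper: runs the kernel and returns the maximum absolute
// error against a CPU reference.
float run_matvec_prefetch(int rows, int K) {
    std::vector<float> hA(rows * K), hx(K), hy(rows, 0.0f);
    for (int i = 0; i < rows * K; ++i) hA[i] = 0.5f * (i % 7);
    for (int k = 0; k < K; ++k) hx[k] = static_cast<float>(k % 5) - 2.0f;

    float *dA = nullptr, *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dA, hA.size() * sizeof(float));
    cudaMalloc(&dx, hx.size() * sizeof(float));
    cudaMalloc(&dy, hy.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx.data(), hx.size() * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (rows + threads - 1) / threads;
    matvec_prefetch<<<blocks, threads>>>(dA, dx, dy, rows, K);
    cudaMemcpy(hy.data(), dy, hy.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dx);
    cudaFree(dy);

    float max_err = 0.0f;
    for (int r = 0; r < rows; ++r) {
        float ref = 0.0f;
        for (int k = 0; k < K; ++k) ref += hA[r * K + k] * hx[k];
        max_err = std::fmax(max_err, std::fabs(ref - hy[r]));
    }
    return max_err;
}
```

Calling `run_matvec_prefetch(300, 256)` exercises two tiles and a partially filled last block. The prologue/loop/epilogue split is the part I would like the schedule to generate for the shared-to-register stage.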

However, the underlying prefetch pass checks whether the prefetched buffer is in buf_map_ (storage_flatten.cc). If I understand correctly, buf_map_ only contains entries with “global” scope. So how can I prefetch from shared memory?