Adding manual scratchpad management code in custom TIR pass

I have been playing with custom TIR passes and it is really fun! Now I am wondering what their limitations are. For one, I am trying to figure out whether the following is possible. My embedded device has a fast on-chip scratchpad, and I want to DMA data into that scratchpad.

So from this code:

for (int i = 0; i < 100; ++i) {
    a[i] = b[i] + 1;
}

I want to generate optimized code like this:

float *spm = (float *)0x0000C000;  // Address of the scratchpad memory
DMA_IN(spm, b, 100);               // Copies b[0]..b[99] to spm[0]..spm[99]
for (int i = 0; i < 100; ++i) {
    a[i] = spm[i] + 1;
}

I think I can add the call to the external function (DMA_IN) with tvm.tir.call_extern. However, I am not sure about the other part, especially setting up float *spm and using it inside the loop. Is what I am trying to do even possible with a TIR pass? Since float *spm is not a tensor or anything like that, I suspect this might not be achievable without a hack. I wonder if there is a way to do it, e.g., by declaring a dummy tensor spm or by using decl_buffer or something (I am not sure what decl_buffer actually does). Is there any way to achieve what I intend with a custom TIR pass?
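
To make it concrete, the kind of pass I was imagining looks roughly like this (completely untested; the extern names DMA_IN and SPM_BASE are placeholders for my device, the parameter order (a, b) is an assumption, and I do not know whether binding the pointer with a LetStmt like this is even legal):

import tvm
from tvm import tir

@tvm.tir.transform.prim_func_pass(opt_level=0)
def inject_spm_copy(f, mod, ctx):
    # Assumes the PrimFunc parameters are (a, b); b is the buffer being read.
    b_buf = f.buffer_map[f.params[1]]
    # A handle variable standing in for "float *spm", backed by a declared buffer.
    spm_ptr = tir.Var("spm", tvm.ir.PointerType(tvm.ir.PrimType("float32")))
    spm_buf = tir.decl_buffer((100,), "float32", "spm", data=spm_ptr)

    def redirect_reads(op):
        # Rewrite loads of b so they read from the scratchpad buffer instead.
        if isinstance(op, tir.BufferLoad) and op.buffer.same_as(b_buf):
            return tir.BufferLoad(spm_buf, op.indices)
        return None

    body = tir.stmt_functor.ir_transform(f.body, None, redirect_reads, ["tir.BufferLoad"])
    # Prepend the transfer: DMA_IN(spm, b, 100).
    dma = tir.Evaluate(tir.call_extern("int32", "DMA_IN", spm_ptr, b_buf.data, 100))
    body = tir.SeqStmt([dma, body])
    # Bind spm to the fixed scratchpad address via a (made-up) extern helper.
    body = tir.LetStmt(spm_ptr, tir.call_extern("handle", "SPM_BASE"), body)
    return f.with_body(body)

(I assume I would hook such a pass in through the tir.add_lower_pass config, but I have not gotten that far.)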

Thank you!

Take a look at cache_read and cache_write.
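
Something along these lines (rough sketch, with placeholder shapes matching your example):

import tvm
from tvm import te

b = te.placeholder((100,), name="b")
a = te.compute((100,), lambda i: b[i] + 1, name="a")
s = te.create_schedule(a.op)
# Insert a stage that copies b into a "global"-scope staging buffer before a reads it.
bb = s.cache_read(b, "global", [a])
print(tvm.lower(s, [b, a], simple_mode=True))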

Thanks for the reply. I looked at cache_read and cache_write. However, this is what is generated if I use cache_read:

float b_global[100];
for (int ax = 0; ax < 100; ++ax)
    b_global[ax] = b[ax];
for (int i = 0; i < 100; ++i)
    a[i] = b_global[i] + 1;

This is close, indeed, but not exactly what I want. I want to (1) copy into a specific buffer location (0x0000C000) rather than a local array, and (2) initialize that buffer with an external DMA function call instead of the auto-generated for loop.

Is there a way to modify the cache_read/cache_write behavior to do what I intend? Could, e.g., a custom codegen help?
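
Concretely, the direction I have been imagining is roughly the following (untested; the "dma_copy" pragma name and DMA_IN are made up, and I am not sure pragma is the right mechanism for this):

import tvm
from tvm import te, tir

b = te.placeholder((100,), name="b")
a = te.compute((100,), lambda i: b[i] + 1, name="a")
s = te.create_schedule(a.op)
bb = s.cache_read(b, "global", [a])
# Tag the copy loop so a custom pass can find it later.
s[bb].pragma(s[bb].op.axis[0], "dma_copy")

@tvm.tir.transform.prim_func_pass(opt_level=0)
def lower_dma_copy(f, mod, ctx):
    def _rewrite(op):
        # Replace the tagged copy loop with an extern DMA call; recovering the
        # real src/dst pointers from op.body would be device-specific.
        if isinstance(op, tir.AttrStmt) and op.attr_key == "pragma_dma_copy":
            return tir.Evaluate(tir.call_extern("int32", "DMA_IN", 100))
        return None
    return f.with_body(
        tir.stmt_functor.ir_transform(f.body, None, _rewrite, ["tir.AttrStmt"]))

mod = tvm.lower(s, [b, a])   # IRModule containing the lowered PrimFunc
print(lower_dma_copy(mod))   # apply the custom pass and inspect the result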

Thank you!

Hello, has your problem been solved? I have run into a similar problem.