I have been playing with a custom TIR pass and it is really fun! I wonder what its limitations are. For one, I am trying to figure out whether the following is possible. My embedded device has a fast on-chip scratchpad, and I want to DMA the input data into the scratchpad before computing on it.
So from this code:
for (int i = 0; i < 100; ++i) {
    a[i] = b[i] + 1;
}
I want to generate optimized code like this:
float *spm = (float *)0x0000C000; // Address of the scratchpad memory
DMA_IN(spm, b, 100); // Copies b[0]..b[99] to spm[0]..spm[99]
for (int i = 0; i < 100; ++i) {
    a[i] = spm[i] + 1;
}
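To make the intended semantics concrete, here is a minimal host-runnable sketch of the rewritten loop. It rests on assumptions that are not part of my real setup: `DMA_IN` is stubbed with `memcpy`, a static array (`spm_storage`) stands in for the scratchpad at 0x0000C000, and the function name `add_one_via_spm` is made up for illustration. On the device, `spm` would be the fixed address and `DMA_IN` the real DMA routine.

```c
#include <stddef.h>
#include <string.h>

#define N 100

/* Assumption: stand-in for the on-chip scratchpad. On the device this
   would be the fixed address 0x0000C000 rather than a static array. */
static float spm_storage[N];

/* Assumption: host stub for the device DMA routine; copies n floats
   from src to dst. */
static void DMA_IN(float *dst, const float *src, size_t n) {
    memcpy(dst, src, n * sizeof(float));
}

/* Hypothetical name: the loop as it should look after the pass. */
void add_one_via_spm(float *a, const float *b) {
    float *spm = spm_storage;   /* device: (float *)0x0000C000 */
    DMA_IN(spm, b, N);          /* stage b[0]..b[N-1] in the scratchpad */
    for (int i = 0; i < N; ++i) {
        a[i] = spm[i] + 1;      /* read from the scratchpad, not from b */
    }
}
```

The result in `a` should be identical to the original loop; only the source of the loads changes.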
I think I can add a call to the external function (DMA_IN) with tvm.tir.call_extern. However, I am not sure whether I can do the other part, namely setting up float *spm and rewriting the loop body to read from it. Is what I am trying to do even possible with a TIR pass? Since float *spm is not a tensor that TVM knows about, I suspect this might not be achievable without a hack. Is there a way to do this, e.g., by declaring a dummy tensor spm or by using decl_buffer (I am not sure what decl_buffer does)?
Thank you!