I would like to know how TVM would manage the following scenario:
I am trying to integrate an accelerator which has instructions to load/store/compute. For example, mvin moves data from DRAM to the accelerator's internal memory, and mvout moves data from the internal memory back to DRAM.
Following the VTA tutorials, I am currently modeling this as follows. To model the movement of input data into the accelerator, I have written the following (just a minimal example):
import tvm
from tvm import te

A = te.placeholder((1, 1, 1, 8), name="A", dtype="int8")
A_buf = te.compute((1, 1, 1, 8), lambda *i: A(*i), "A_buf")
s = te.create_schedule(A_buf.op)
When I print this schedule using tvm.lower, I get the following:
@main = primfn(A_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {A: Buffer(A_2: Pointer(int8), int8, [1, 1, 1, 8], [])}
  buffer_map = {A_1: A} {
  allocate(A_buf: Pointer(global int8), int8, [8]), storage_scope = global;
  for (i3: int32, 0, 8) {
    A_buf[i3] = (int8*)A_2[i3]
  }
}
This makes sense. Now, to mark the A_buf buffer as internal to the accelerator, and to tag this copy from one buffer to the other as a mvin operation, I do the following:
s[A_buf].set_scope("local")
s[A_buf].pragma(s[A_buf].op.axis[0], "mvin")
Now, if I print again using tvm.lower:
@main = primfn(A_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {A: Buffer(A_2: Pointer(int8), int8, [1, 1, 1, 8], [])}
  buffer_map = {A_1: A} {
  allocate(A_buf: Pointer(local int8), int8, [8]), storage_scope = local;
  attr [IterVar(i0: int32, (nullptr), "DataPar", "")] "pragma_mvin" = 1;
  for (i3: int32, 0, 8) {
    A_buf[i3] = (int8*)A_2[i3]
  }
}
As you can see, the buffer now has a "local" scope, and the for loop is tagged with the mvin pragma. From here it is easy to write a transform pass that replaces the pragma-tagged loop with a call_extern, which will call the specific mvin C function and pass the addresses of the two buffers as parameters.
But here comes the problem: A_buf should NOT be allocated in host memory. It should really be a pointer to a specific address inside the accelerator's local memory. If I generate code from that last schedule using tvm.build, I get an allocation of A_buf in host memory, and of course its address lies in the DRAM address space.
I know I can declare new memory regions using @tvm.register_func("tvm.info.mem…"), but I am failing to see how this integrates with the example above, or whether there is a way to do what I actually need.
As far as I understand, this cannot be solved with the USMP, because it creates workspaces that ultimately live in host memory (DRAM). Perhaps I could create a workspace called SRAM to model the accelerator's internal memory, and then somehow assign A_buf to that workspace; at runtime, A_buf would then hold an address pointing into this SRAM workspace. Would this be the correct way to do it?