How to integrate accelerator local addresses (SRAM, scratchpads, accumulators, etc)

I would like to know how would TVM manage the following scenario:

I am trying to integrate an accelerator, which has instructions to load/store/calc. For example, mvin moves data from the dram to the internal memory of the accelerator, and mvout moves data from the internal memory to the dram.

By following the tutorials of the VTA, I am currently modeling this the following way. In order to model the input of data to the accelerator, I have written the following (this is just a minimal example):

A = te.placeholder((4, 4, 1, 8), name="A", dtype="uint8")
A_buf = te.compute((4, 4, 1, 8), lambda *i: A(*i), "A_buf")
s = te.create_schedule(A_buf.op)

When I print this schedule using tvm.lower, I get the following:

@main = primfn(A_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {A: Buffer(A_2: Pointer(int8), int8, [1, 1, 1, 8], [])}
  buffer_map = {A_1: A} {
  allocate(A_buf: Pointer(global int8), int8, [8]), storage_scope = global;
  for (i3: int32, 0, 8) {
    A_buf[i3] = (int8*)A_2[i3]
  }
}

This makes sense. Now, in order to tag that the A_buf buffer is internal to the accelerator, and in order to tag this data copy from one buffer to the other as a mvin operation, I do the following:

s[A_buf].set_scope("local")
s[A_buf].pragma(s[A_buf].op.axis[0], "mvin")

Now, if I print using tvm.lower:

@main = primfn(A_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {A: Buffer(A_2: Pointer(int8), int8, [1, 1, 1, 8], [])}
  buffer_map = {A_1: A} {
  allocate(A_buf: Pointer(local int8), int8, [8]), storage_scope = local;
  attr [IterVar(i0: int32, (nullptr), "DataPar", "")] "pragma_mvin" = 1;
  for (i3: int32, 0, 8) {
    A_buf[i3] = (int8*)A_2[i3]
  }
}

As you can see, now the buffer has a “local” scope, and the for loop is tagged with the mvin pragma. Now, it is easy to write a transform pass to replace the for loop with the mvin pragma by a call_extern, which will call the specific mvin c function, and pass the address of the 2 buffers as parameters.

But here comes the problem: the A_buf should NOT be allocated in the host memory. It actually should be a pointer to a specific address inside the local memory of the accelerator. If I generate the code based on that last schedule using tvm.build, I get an allocation of A_buf in host memory, and of course its address is in the DRAM memory space.

I know I can declare new memory regions by using @tvm.register_func(“tvm.info.mem…”), but I am failing to see how this integrates with the above example, or if there is a way to really do what I am needing.

As far as I understand, this can not be solved using the USMP, because it creates workspaces that eventually will live in the host memory (DRAM). Perhaps I could create a workspace called SRAM to model the internal accelerator memory, and then assign this A_buf to this workspace somehow. Then in runtime A_buf would have an address pointing to this SRAM workspace. Would this be the correct way to do it?

So this is what I have until now:

I have created a transform pass similar to the InjectDMAIntrin pass of the VTA to replace the for loop tagged with the mvin pragma with my intrinsic function, which does the move of data into the hardware accelerator. This is how it looks after applying this pass:

@main = primfn(A_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {A: Buffer(A_2: Pointer(int8), int8, [1, 1, 1, 8], [])}
  buffer_map = {A_1: A} {
  allocate(A_buf: Pointer(local int8), int8, [8]), storage_scope = local;
  @tir.call_extern("mvin", @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int8), A_2, 0, 8, 1, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int8), A_buf, 0, 8, 2, dtype=handle), dtype=)
}

This looks good, but now I am stuck trying to create the pass to replace the allocate with a constant value (remember, the A_buf lives actually inside the accelerator memory, and therefore the A_buf needs a specific address). I created this pass, which replaces the allocates with a new Var. The output of that pass looks like this:

@main = primfn(A_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {A: Buffer(A_2: Pointer(int8), int8, [1, 1, 1, 8], [])}
  buffer_map = {A_1: A} {
  let A_buf: uint8 = 0u8
  @tir.call_extern("mvin", @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int8), A_2, 0, 8, 1, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int8), A_buf, 0, 8, 2, dtype=handle), dtype=)
}

But now I get an error in the build saying that “the Load/Store of buffer A_buf (0x2474b50) occurred before its declaration” (which makes sense, the A_buf buffer is trying to be accessed in the call_extern statement).

Any tips on how to implement this?

cc @manupa-arm @MJKlaiber @aca88 @SebastianBoblestETAS @PhilippvK in case they can get to this before i have cycles

1 Like

Well, apparently this is achieved using the LowerDeviceStorageAccessInfo transform pass, or at least it seems that thats how the VTA achieves something similar. Although there is not enough documentation on this pass to exactly understand how to use it.