In our application, we’re using UMA to offload specific operations (e.g. conv2d) to a custom accelerator. We’re also using USMP to specify two WorkspaceMemoryPools called l2_mem and act_mem. l2_mem is accessible by Target("c"), while act_mem is accessible by both Target("c") and Target("accel"). Using a Relay pass, we add a few layout transforms and extra custom operations to ensure compatibility between operations run on the C and accel backends.
Certain operations require inputs/outputs to be in specific memory pools:
- accel_input_fetcher(): input → l2_mem, output → act_mem
- accel_conv2d(): input and output → act_mem
Currently, the codegen looks like this for a network with 1 conv layer:
// default_lib1.c
TVM_DLL int32_t tvmgen_default___tvm_main__(int8_t* data_buffer_var, int8_t* output_buffer_var, uint8_t* act_mem_0_var, uint8_t* l2_mem_1_var, uint8_t* wei_mem_2_var) {
    void* constant_0_let = (&(wei_mem_2_var[0]));
    void* sid_1_let = (&(l2_mem_1_var[0])); // >>>> L2_MEM - OK
    void* sid_3_let = (&(l2_mem_1_var[0])); // >>>> L2_MEM - NOT OK - Would like it to be in ACT_MEM
    if (tvmgen_default_fused_layout_transform(data_buffer_var, sid_1_let, ...) != 0 ) return -1;
    if (tvmgen_default_accel_main_0(sid_1_let, constant_0_let, sid_3_let, ...) != 0 ) return -1;
    if (tvmgen_default_fused_layout_transform_strided_slice(sid_3_let, output_buffer_var, ...) != 0 ) return -1;
    return 0;
}
// default_lib2.c
TVM_DLL int32_t tvmgen_default_accel_main_0(int8_t* accel_0_i0, int8_t* tvm_var_extract_const_0, int8_t* accel_conv2d, uint8_t* act_mem_6_var, uint8_t* l2_mem_7_var, uint8_t* wei_mem_8_var) {
    void* input_fetcher_let = (&(act_mem_6_var[0])); // >>>> ACT_MEM - OK
    accel_input_fetcher(accel_0_i0, input_fetcher_let, ...);
    accel_conv2d(input_fetcher_let, tvm_var_extract_const_0, accel_conv2d, ...);
    return 0;
}
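To make the mismatch concrete, here is a minimal sketch in plain Python that checks the buffer placement from the generated code above against the pool requirements listed earlier. It does not use any TVM, UMA, or USMP API: `REQUIRED_POOLS`, `find_violations`, `placement`, and `calls` are hypothetical names introduced only for illustration, and the placement table is transcribed by hand from the two C files.

```python
from typing import Dict, List, Tuple

# Pool requirements per accelerator operation, as stated in the list above.
# (Hypothetical table; not a TVM/USMP data structure.)
REQUIRED_POOLS: Dict[str, Dict[str, str]] = {
    "accel_input_fetcher": {"input": "l2_mem", "output": "act_mem"},
    "accel_conv2d": {"input": "act_mem", "output": "act_mem"},
}

# Each call is (operation, input buffer, output buffer).
Call = Tuple[str, str, str]
# Each violation is (operation, role, buffer, actual pool, required pool).
Violation = Tuple[str, str, str, str, str]


def find_violations(calls: List[Call], placement: Dict[str, str]) -> List[Violation]:
    """Return every operand whose buffer lives in a pool other than the required one."""
    violations: List[Violation] = []
    for op, input_buf, output_buf in calls:
        for role, buf in (("input", input_buf), ("output", output_buf)):
            required = REQUIRED_POOLS[op][role]
            actual = placement[buf]
            if actual != required:
                violations.append((op, role, buf, actual, required))
    return violations


# Placement transcribed by hand from default_lib1.c and default_lib2.c.
placement = {
    "sid_1_let": "l2_mem",
    "sid_3_let": "l2_mem",
    "input_fetcher_let": "act_mem",
}

# Data flow inside tvmgen_default_accel_main_0 for the one-conv network.
calls = [
    ("accel_input_fetcher", "sid_1_let", "input_fetcher_let"),
    ("accel_conv2d", "input_fetcher_let", "sid_3_let"),
]

violations = find_violations(calls, placement)
```

With the placement shown, the only entry in `violations` is the output of accel_conv2d: sid_3_let sits in l2_mem but is required to be in act_mem. Moving sid_3_let to act_mem in the table yields an empty list.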
I’ve tried adding a TIR pass that captures tir.Allocate ops and adds the “candidate_memory_pools” annotation. However, because I’m registering the pass with UMA’s register_tir_pass(), it only triggers for the offloaded function (in default_lib2.c), so I only capture the tir.Allocate for the input_fetcher_let buffer. Ideally, I would like to capture the buffer allocations for the “main” function as well.
How can I proceed? Is there a way to achieve what I need?