[USMP][UMA] Pin buffer in "main" to a specific memory pool

eibrahim · September 29, 2023, 3:27pm

In our application, we’re utilizing UMA to offload specific operations (e.g. conv2d) to a custom accelerator. We’re also utilizing USMP to specify two WorkspaceMemoryPools called l2_mem and act_mem. l2_mem is accessible by Target("c"), while act_mem is accessible by both Target("c") and Target("accel"). Using a Relay pass, we add a few layout transforms and extra custom operations to ensure compatibility between operations run on the C and accel backends.
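The accessibility rules above can be modelled in a few lines of plain Python. This is only an illustrative sketch, not TVM code: `POOL_TARGETS` and `candidate_pools` are hypothetical names, and treating a buffer's candidate pools as simply "every pool accessible to the target of the function that allocates it" is an assumption that simplifies what USMP actually does.

```python
# Plain-Python model of the pool setup described above (not the TVM API).
# Maps each pool name to the set of targets that may access it.
POOL_TARGETS = {
    "l2_mem": {"c"},
    "act_mem": {"c", "accel"},
}


def candidate_pools(target):
    """Return the pools a function built for `target` may allocate from."""
    return sorted(name for name, targets in POOL_TARGETS.items() if target in targets)


print(candidate_pools("c"))      # ['act_mem', 'l2_mem']
print(candidate_pools("accel"))  # ['act_mem']
```

Under that assumption, any buffer allocated in a Target("c") function has both pools as candidates unless it is pinned explicitly.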

Certain operations require inputs/outputs to be in specific memory pools:

accel_input_fetcher(): Input → l2_mem, output → act_mem
accel_conv2d(): Input & Output → act_mem

Currently, the codegen looks like this for a network with 1 conv layer:

// default_lib1.c
TVM_DLL int32_t tvmgen_default___tvm_main__(int8_t* data_buffer_var, int8_t* output_buffer_var, uint8_t* act_mem_0_var, uint8_t* l2_mem_1_var, uint8_t* wei_mem_2_var) {
  void* constant_0_let = (&(wei_mem_2_var[0]));
  void* sid_1_let = (&(l2_mem_1_var[0]));    // >>>> L2_MEM - OK
  void* sid_3_let = (&(l2_mem_1_var[0]));    // >>>> L2_MEM - NOT OK - Would like it to be in ACT_MEM
  if (tvmgen_default_fused_layout_transform(data_buffer_var, sid_1_let, ...) != 0 ) return -1;
  if (tvmgen_default_accel_main_0(sid_1_let, constant_0_let, sid_3_let, ...) != 0 ) return -1;
  if (tvmgen_default_fused_layout_transform_strided_slice(sid_3_let, output_buffer_var, ...) != 0 ) return -1;
  return 0;
}

// default_lib2.c
TVM_DLL int32_t tvmgen_default_accel_main_0(int8_t* accel_0_i0, int8_t* tvm_var_extract_const_0, int8_t* accel_conv2d, uint8_t* act_mem_6_var, uint8_t* l2_mem_7_var, uint8_t* wei_mem_8_var) {
  void* input_fetcher_let = (&(act_mem_6_var[0]));    // >>>> ACT_MEM - OK
  accel_input_fetcher(accel_0_i0, input_fetcher_let, ...);
  accel_conv2d(input_fetcher_let, tvm_var_extract_const_0, accel_conv2d, ...);
  return 0;
}
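The mismatch flagged in the comments above can be made explicit with a small check. This is a minimal sketch in plain Python, not part of TVM: the tables and `find_violations` are hypothetical, and the buffer-to-pool assignments are read off the generated code by hand.

```python
# Illustrative model only; these tables and names are not TVM APIs.

# Pool each operand must live in, per the requirements listed earlier.
REQUIRED_POOL = {
    ("accel_input_fetcher", "input"): "l2_mem",
    ("accel_input_fetcher", "output"): "act_mem",
    ("accel_conv2d", "input"): "act_mem",
    ("accel_conv2d", "output"): "act_mem",
}

# Pool each buffer was actually placed in, read off the generated code.
ACTUAL_POOL = {
    "sid_1_let": "l2_mem",
    "input_fetcher_let": "act_mem",
    "sid_3_let": "l2_mem",
}

# (operation, operand role, buffer) triples for the offloaded function.
USES = [
    ("accel_input_fetcher", "input", "sid_1_let"),
    ("accel_input_fetcher", "output", "input_fetcher_let"),
    ("accel_conv2d", "input", "input_fetcher_let"),
    ("accel_conv2d", "output", "sid_3_let"),
]


def find_violations(uses, actual, required):
    """Return (buffer, actual_pool, required_pool) for every mismatch."""
    violations = []
    for op, role, buf in uses:
        want = required[(op, role)]
        have = actual[buf]
        if have != want:
            violations.append((buf, have, want))
    return violations


for buf, have, want in find_violations(USES, ACTUAL_POOL, REQUIRED_POOL):
    print(f"{buf}: placed in {have}, required in {want}")
```

Only sid_3_let is reported, which matches the "NOT OK" comment in default_lib1.c.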

I’ve tried adding a tir_pass which captures tir.Allocate ops and adds the annotation “candidate_memory_pools”. However, since I’m registering the tir_pass using UMA’s register_tir_pass(), it only triggers for the offloaded function (in default_lib2.c), so I only capture the tir.Allocate for the input_fetcher_let buffer. Ideally I would like to capture the buffer allocations in the “main” function as well.

How can I proceed? Is there a way to achieve what I need?

MJKlaiber · October 4, 2023, 8:06am

Hi @eibrahim ,

This seems like a very useful use case for UMA. Unfortunately, this is not supported in UMA (yet). As a short-term workaround, you could try to modify lower.py:

apache/tvm/blob/main/python/tvm/relay/backend/contrib/uma/api/lower.py#L135


    mod : tvm.ir.IRModule
        This is the Relay module.


    Returns
    -------
    mod : tvm.ir.IRModule
        The Relay module with scheduled NPU external functions.
    """
    mod = _ffi_api.OutlineCompilerFunctions(self.target_name)(mod)
    for gvar, func in mod.functions.items():
        if "Compiler" in func.attrs and func.attrs["Compiler"] == self.target_name:
            func = self._lower_relay_to_tir(func)
            func = self._lower_stir_to_nstir(func)
            mod.update_func(gvar, func)
    return mod


def register(self) -> None:
    """Register all relevant relay-to-tir functions."""
    tvm._ffi.register_func(f"relay.ext.uma.{self.target_name}.relay_to_tir", self.relay_to_tir)
    for op, strategy, plevel in self._operator_strategies:
        register_strategy(op, strategy, plevel)

I would see this as a useful extension to UMA that could be added to the API.

Maybe @paulpb or @cgerum has an idea how this could be achieved.

eibrahim · October 5, 2023, 9:43am

Hi @MJKlaiber, Thanks for your reply!

I had already tried commenting out the if "Compiler" in .... statement, but when I do, I run into an error when lowering mod["main"] at the line func = self._lower_relay_to_tir(func):

TVMError: Primitive Functions can not contain nested functions.

I’ve also tried registering the tir pass during lowering by passing "tir.add_lower_pass" in the context.

Here I run into an assertion error in this line: tvm/python/tvm/relay/backend/contrib/uma/api/lower.py at 958c27123a45a9629e57cee20dbca28263c836bd (apache/tvm).
I then tried commenting that assertion out as well. Unfortunately, in this case the tir_pass again captured only the input_fetcher_let buffer allocation. The difference is that it also visited tvmgen_default_fused_layout_transform(...) and tvmgen_default_fused_layout_transform_strided_slice(...), which don’t have any tir.Allocate ops. The tir_pass still didn’t visit the main function.

I can still sandwich the accel_conv2d() function between operations that read from and write to l2_mem, but I would like to be able to explicitly pin the buffer locations of the accelerator’s inputs and outputs, just in case USMP does something unexpected.