The current TVM ACL BYOC integration uses the NEON backend of Arm Compute Library. I am working on using the CL backend instead. Since padding is added in most layers’ configure() functions, and since most operations become subgraphs of their own, I have to copy the inputs and outputs of each subgraph (see the sketch after this list), which leads to higher inference time. The import_memory() API of Arm Compute Library for CLTensors does not work directly because of the padding added in configure(). Is there a simple and generic way to either
(i) create a generic contiguous subgraph for all ops running on the ACL backend, so that we don’t have to copy the inputs and outputs of each layer as in the current design, or
(ii) import memory into a padded CLTensor without actually copying it, so that the copy overhead is reduced?
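For context, this is roughly what the per-subgraph boundary copy looks like with the CL runtime. It is a minimal sketch, not the actual BYOC code; `copy_into_cl_tensor` is a hypothetical helper that copies a tightly-packed FP32 host buffer into a CLTensor whose configure() step may have introduced padding:

```cpp
#include <cstddef>

#include "arm_compute/core/Helpers.h"
#include "arm_compute/core/Window.h"
#include "arm_compute/runtime/CL/CLTensor.h"

// Hypothetical helper: copy a tightly-packed FP32 host buffer into a
// (possibly padded) CLTensor, element by element.
void copy_into_cl_tensor(arm_compute::CLTensor &dst, const float *src)
{
    using namespace arm_compute;
    dst.map(true); // blocking map so the buffer is host-accessible
    Window window;
    window.use_tensor_dimensions(dst.info()->tensor_shape());
    Iterator it(&dst, window);
    std::size_t i = 0;
    // The Iterator follows the tensor's strides, so any padded bytes
    // introduced by configure() are skipped automatically.
    execute_window_loop(window, [&](const Coordinates &)
    {
        *reinterpret_cast<float *>(it.ptr()) = src[i++];
    },
    it);
    dst.unmap();
}
```

Doing this (and the mirror-image copy out) for every subgraph is where the extra inference time goes.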
At the moment, most of the OpenCL backend layers in Compute Library introduce implicit padding to achieve better performance. We are in the process of removing this restriction (it is already done for most operators), especially for native NHWC layout execution, just as we did for the NEON layers.
Delegating sub-graphs through BYOC could indeed reduce the copying from every layer to just the inputs/outputs of the subgraphs. @mbaret can provide more insight on how/if this can be done and what its complexity is.
On your other question: Compute Library does allow importing external memory, so yes, you can import memory into a padded tensor. Note, however, that if the memory does not account for the required padding, it will lead to memory-related issues. And even if it does, in the case of mixed TVM/Compute Library execution the padding needs to be expressed at the boundaries to avoid producing incorrect results.
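A minimal sketch of that import, assuming the producer of the external cl::Buffer sized it for the padded layout (`import_external` is a hypothetical wrapper; the size check mirrors the caveat above):

```cpp
#include <cstddef>

#include "arm_compute/core/Error.h"
#include "arm_compute/runtime/CL/CLTensor.h"

// Hypothetical wrapper: hand an externally allocated cl::Buffer to a CLTensor
// without copying. Fails if the allocation is too small for the padded layout.
bool import_external(arm_compute::CLTensor &tensor, const cl::Buffer &external)
{
    // total_size() includes any padding added by the layer's configure().
    const std::size_t required = tensor.info()->total_size();
    if(external.getInfo<CL_MEM_SIZE>() < required)
    {
        // Tightly-packed allocation: importing it would let the kernels
        // read/write out of bounds once padding is taken into account.
        return false;
    }
    // No allocation or copy happens here; the tensor now aliases `external`.
    const arm_compute::Status status = tensor.allocator()->import_memory(external);
    return static_cast<bool>(status);
}
```

Even when the import succeeds, the producer and consumer on the TVM side still have to agree on where the valid elements sit inside the padded buffer, which is the boundary issue mentioned above.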
We’re not currently exploring subgraph offloading for Compute Library, but I can appreciate why this would be useful for OpenCL. It would add quite a bit of additional complexity to the integration, so option (ii) would be preferred if possible.