Hello,
The current TVM ACL BYOC integration uses the NEON backend of the Arm Compute Library. I am working on using the CL backend for the same path. Since padding is added in most layers' `configure()` functions, and since most operations are subgraphs of their own, I currently have to copy the inputs and outputs of each subgraph, which leads to higher inference time. The `import_memory()` API of the Arm Compute Library for `CLTensor`s doesn't work directly because of the padding added in `configure()` (a minimal sketch of the problem is below). Is there a simple and generic way to either:

(i) create one contiguous subgraph for all ops running on the ACL backend, so that we don't have to copy the inputs and outputs of each layer as in the current design, or
(ii) import memory into a padded `CLTensor` without actually copying it, so that the copy overhead is removed?
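For reference, here is a minimal sketch of what I am running into, assuming a tightly packed external `cl::Buffer` handed over from the previous subgraph; the shapes, the activation layer, and the buffer itself are only illustrative:

```cpp
#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/CL/CLScheduler.h"
#include "arm_compute/runtime/CL/CLTensor.h"
#include "arm_compute/runtime/CL/functions/CLActivationLayer.h"

using namespace arm_compute;

int main()
{
    // Initialise the default OpenCL context/queue for ACL.
    CLScheduler::get().default_init();

    // Unpadded tensor descriptions at a subgraph boundary (shapes illustrative).
    CLTensor input, output;
    input.allocator()->init(TensorInfo(TensorShape(32U, 32U, 16U), 1, DataType::F32));
    output.allocator()->init(TensorInfo(TensorShape(32U, 32U, 16U), 1, DataType::F32));

    // configure() may extend the input's padding to suit the selected kernel,
    // so input.info()->total_size() can grow past the tightly packed size.
    CLActivationLayer act;
    act.configure(&input, &output,
                  ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU));

    // Tightly packed buffer produced by the previous subgraph (illustrative).
    const size_t packed_bytes = input.info()->tensor_shape().total_size() * sizeof(float);
    cl::Buffer external_buf(CLScheduler::get().context(), CL_MEM_READ_WRITE, packed_bytes);

    // Once padding has been added, the packed buffer no longer matches the
    // padded layout, so the import is rejected (or would mis-address elements).
    const Status s = input.allocator()->import_memory(external_buf);
    return (s.error_code() == ErrorCode::OK) ? 0 : 1;
}
```

The fallback today is to allocate the padded `CLTensor` normally and copy the external buffer into it, which is exactly the overhead I would like to avoid.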
@ramana-arm @dmitriy-arm, could you please share your thoughts on this?
Thanks in advance!