Arm Compute Library segv with inception-v1, squeezenet

I’ve started to experiment with the Arm Compute Library (thanks for adding it!). When using the inception-v1 or squeezenet models from the Google model zoo, TVM segfaults.

Looking at the core file:

Core was generated by `python3 ./squeezenet-acl-float.py'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000ffffa742bd0c in tvm::runtime::contrib::ACLRuntime::BuildEngine() () from /home/debian/tvm/build/libtvm.so
[Current thread is 1 (Thread 0xffffb1ef5010 (LWP 2013))]

(gdb) bt

#0 0x0000ffffa742bd0c in tvm::runtime::contrib::ACLRuntime::BuildEngine() () from /home/debian/tvm/build/libtvm.so
#1 0x0000ffffa742c164 in tvm::runtime::contrib::ACLRuntime::Init(tvm::runtime::Array<tvm::runtime::NDArray, void> const&) ()
from /home/debian/tvm/build/libtvm.so
#2 0x0000ffffa74253b8 in tvm::runtime::json::JSONRuntimeBase::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#4}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const () from /home/debian/tvm/build/libtvm.so
#3 0x0000ffffa74255a4 in std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::json::JSONRuntimeBase::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#4}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&) () from /home/debian/tvm/build/libtvm.so
#4 0x0000ffffa7448844 in tvm::runtime::MetadataModuleNode::InitSubModule(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/debian/tvm/build/libtvm.so
#5 0x0000ffffa7449b74 in tvm::runtime::MetadataModuleNode::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&) () from /home/debian/tvm/build/libtvm.so
#6 0x0000ffffa744b66c in tvm::runtime::ModuleNode::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) () from /home/debian/tvm/build/libtvm.so
#7 0x0000ffffa74a0228 in tvm::runtime::GraphRuntime::CreateTVMOp(tvm::runtime::TVMOpParam const&, std::vector<DLTensor, std::allocator<DLTensor> > const&, unsigned long) () from /home/debian/tvm/build/libtvm.so
#8 0x0000ffffa74a2ef8 in tvm::runtime::GraphRuntime::SetupOpExecs() () from /home/debian/tvm/build/libtvm.so
#9 0x0000ffffa74a32ec in tvm::runtime::GraphRuntime::Init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::Module, std::vector<DLContext, std::allocator<DLContext> > const&) () from /home/debian/tvm/build/libtvm.so
#10 0x0000ffffa74ade24 in tvm::runtime::GraphRuntimeFactory::RuntimeCreate(std::vector<DLContext, std::allocator<DLContext> > const&) ()
from /home/debian/tvm/build/libtvm.so
#11 0x0000ffffa74ae140 in std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::GraphRuntimeFactory::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&) () from /home/debian/tvm/build/libtvm.so
#12 0x0000ffffa742ff04 in TVMFuncCall () from /home/debian/tvm/build/libtvm.so
#13 0x0000ffffb0bffdcc in ffi_call_SYSV () from /lib/aarch64-linux-gnu/libffi.so.6
#14 0x0000ffffb0c006f4 in ffi_call () from /lib/aarch64-linux-gnu/libffi.so.6
#15 0x0000ffffb0c24fbc in _ctypes_callproc () from /usr/lib/python3.7/lib-dynload/_ctypes.cpython-37m-aarch64-linux-gnu.so
#16 0x0000ffffb0c1c070 in ?? () from /usr/lib/python3.7/lib-dynload/_ctypes.cpython-37m-aarch64-linux-gnu.so
#17 0x000000000043d240 in _PyObject_FastCallKeywords ()
#18 0x0000000000429ce0 in _PyEval_EvalFrameDefault ()
#19 0x00000000004df2bc in _PyEval_EvalCodeWithName ()
#20 0x000000000043c8d0 in _PyFunction_FastCallDict ()
#21 0x000000000043e104 in _PyObject_Call_Prepend ()
#22 0x000000000048df58 in ?? ()
#23 0x000000000043d240 in _PyObject_FastCallKeywords ()
#24 0x0000000000427858 in _PyEval_EvalFrameDefault ()
#25 0x00000000004df2bc in _PyEval_EvalCodeWithName ()
#26 0x00000000004df600 in PyEval_EvalCode ()
#27 0x0000000000512430 in PyRun_FileExFlags ()
#28 0x000000000051260c in PyRun_SimpleFileExFlags ()
#29 0x0000000000431be8 in ?? ()
#30 0x0000000000431e18 in _Py_UnixMain ()

@giuseros @dmitriy-arm @ramana-arm

Also cc @lhutton1 @matt-arm

Thanks @tgall_foo for reporting this! Would you be able to print out the relay that is partitioned for ACL? My initial thought is that we’re offloading an operator with certain attributes that aren’t supported.

The issue is confirmed. The compilation pipeline for the ACL runtime does not run the MergeCompilerRegions pass. This leads to annotated tuples being promoted to empty ACL compile units, which the ACL runtime does not understand. An example is below; note the erroneous functions @arm_compute_lib_2 and @arm_compute_lib_3:

def @main(%a: Tensor[(100, 100), float32], %b: Tensor[(100, 100), float32]) -> (Tensor[(100, 100), float32], Tensor[(100, 100), float32]) {
  %0 = maximum(%a, %b) /* ty=Tensor[(100, 100), float32] */;
  (%0, %0)
}

is translated to:

def @main(%a: Tensor[(100, 100), float32], %b: Tensor[(100, 100), float32]) -> (Tensor[(100, 100), float32], Tensor[(100, 100), float32]) {
  %0 = @arm_compute_lib_0(%a, %b) /* ty=Tensor[(100, 100), float32] */;
  %1 = @arm_compute_lib_2(%0) /* ty=Tensor[(100, 100), float32] */;
  %2 = @arm_compute_lib_3(%0) /* ty=Tensor[(100, 100), float32] */;
  (%1, %2)
}
def @arm_compute_lib_0(%arm_compute_lib_0_i0: Tensor[(100, 100), float32], %arm_compute_lib_0_i1: Tensor[(100, 100), float32], global_symbol="arm_compute_lib_0", Primitive=1, Compiler="arm_compute_lib", Inline=1) -> Tensor[(100, 100), float32] {
  maximum(%arm_compute_lib_0_i0, %arm_compute_lib_0_i1) /* ty=Tensor[(100, 100), float32] */
}
def @arm_compute_lib_2(%arm_compute_lib_2_i0: Tensor[(100, 100), float32], global_symbol="arm_compute_lib_2", Primitive=1, Compiler="arm_compute_lib", Inline=1) -> Tensor[(100, 100), float32] {
  %arm_compute_lib_2_i0
}
def @arm_compute_lib_3(%arm_compute_lib_3_i0: Tensor[(100, 100), float32], global_symbol="arm_compute_lib_3", Primitive=1, Compiler="arm_compute_lib", Inline=1) -> Tensor[(100, 100), float32] {
  %arm_compute_lib_3_i0
}
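
For comparison, with MergeCompilerRegions in the pipeline the tuple would be absorbed into the region that produces it, so the partitioned module would look roughly like this (a hand-written sketch of the expected shape, not actual compiler output):

```
def @main(%a: Tensor[(100, 100), float32], %b: Tensor[(100, 100), float32]) -> (Tensor[(100, 100), float32], Tensor[(100, 100), float32]) {
  @arm_compute_lib_0(%a, %b)
}
def @arm_compute_lib_0(%arm_compute_lib_0_i0: Tensor[(100, 100), float32], %arm_compute_lib_0_i1: Tensor[(100, 100), float32], global_symbol="arm_compute_lib_0", Primitive=1, Compiler="arm_compute_lib", Inline=1) -> (Tensor[(100, 100), float32], Tensor[(100, 100), float32]) {
  %0 = maximum(%arm_compute_lib_0_i0, %arm_compute_lib_0_i1) /* ty=Tensor[(100, 100), float32] */;
  (%0, %0)
}
```

The identity-only @arm_compute_lib_2 and @arm_compute_lib_3 functions disappear because the tuple construction lives inside the single merged region.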

@dmitriy-arm thanks for the investigation. Would you or @lhutton1 add the MergeCompilerRegions pass to the ACL partition pipeline (i.e., partition_for_arm_compute_lib) and send a PR?

The ACL integration actually doesn’t want subgraphs offloaded, but rather single operators (it doesn’t do any graph-level analysis/optimisation), and I believe this is the motivation behind not calling MergeCompilerRegions. My suggestion would be to make the annotation of tuples in AnnotateTarget optional, since annotating tuples is only valid if you subsequently intend to run MergeCompilerRegions.

Even if the underlying implementation executes the graph op-by-op, it is still beneficial to merge subgraphs to reduce kernel-launch and data-transfer overheads. In-subgraph tensors can also be managed entirely by ACL instead of by the graph runtime.
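
The overhead argument can be illustrated with a toy model (plain Python, not TVM code; the counting scheme here is a simplification I’m assuming for illustration):

```python
# Toy model: compare one-subgraph-per-op against one merged subgraph for a
# chain of N consecutive offloaded ops. Each subgraph call costs one launch
# into the backend runtime, and every tensor crossing the subgraph boundary
# costs a transfer/handoff managed by the graph runtime.

def count_overheads(ops, merged):
    """Return (subgraph_launches, boundary_tensors) for a chain of `ops` ops."""
    if merged:
        launches = 1          # one call into the backend for the whole chain
        boundaries = 2        # one input crosses in, one output crosses out
    else:
        launches = ops        # each op is its own subgraph call
        boundaries = 2 * ops  # every intermediate crosses the boundary twice
    return launches, boundaries

print(count_overheads(5, merged=False))  # (5, 10)
print(count_overheads(5, merged=True))   # (1, 2)
```

Merging leaves the intermediates entirely inside the backend’s subgraph, which is why the backend (ACL here) can own their memory instead of the graph runtime.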