Segmentation fault in relay.build()

I want to compile a model built with TensorFlow Keras, but I am getting consistent segfaults. I have narrowed my code down to this minimal example (a completely useless, untrained model):

import tensorflow.keras as keras
import tvm
import tvm.relay as relay

model = keras.Sequential()
model.add(keras.Input(shape=(2,), name='input'))
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(1))

mod, params_dict = relay.frontend.from_keras(model, {'input': (None, 2, 1, 1)})

with tvm.transform.PassContext(opt_level=3):
    factory_module = relay.build(mod, target="llvm", params=params_dict)

My TVM installation is built from source, v0.10.0 with the default CMake configuration. TensorFlow is at v2.10.0 (is that the problem?).

GDB gives the following as the faulting frame:

0x00007fff7bfcc096 in non-virtual thunk to tvm::tir::StmtExprMutator::VisitExpr(tvm::PrimExpr const&) () from /scratch/sv/buildsite/tvm/build/libtvm.so

I found the problem after changing opt_level to 0, which got me the following error message:

Traceback (most recent call last):
  File ".../minimal_segf.py", line 13, in <module>
    factory_module = relay.build(mod, target="llvm", params=params_dict)
  File ".../tvm/python/tvm/relay/build_module.py", line 364, in build
    graph_json, runtime_mod, params = bld_mod.build(
  File ".../tvm/python/tvm/relay/build_module.py", line 161, in build
    self._build(
  File ".../tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  14: TVMFuncCall
  13: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::relay::backend::RelayBuildModule::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#3}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  12: tvm::relay::backend::RelayBuildModule::BuildRelay(tvm::IRModule, tvm::runtime::String const&)
  11: tvm::relay::backend::ExecutorCodegen::Codegen(tvm::IRModule, tvm::relay::Function const&, tvm::runtime::String)
  10: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::relay::backend::GraphExecutorCodegenModule::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#2}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  9: tvm::relay::backend::GraphExecutorCodegen::Codegen(tvm::IRModule, tvm::relay::Function, tvm::runtime::String)
  8: tvm::relay::GraphPlanMemory(tvm::relay::Function const&)
  7: tvm::relay::StorageAllocator::Plan(tvm::relay::Function const&)
  6: tvm::relay::ExprVisitor::VisitExpr(tvm::RelayExpr const&)
  5: tvm::relay::transform::DeviceAwareExprVisitor::VisitExpr_(tvm::relay::FunctionNode const*)
  4: tvm::relay::StorageAllocaBaseVisitor::DeviceAwareVisitExpr_(tvm::relay::FunctionNode const*)
  3: tvm::relay::StorageAllocaBaseVisitor::CreateToken(tvm::RelayExprNode const*, bool)
  2: tvm::relay::StorageAllocator::CreateTokenOnDevice(tvm::RelayExprNode const*, tvm::VirtualDevice const&, bool)
  1: tvm::relay::StorageAllocator::TokenAllocator1D::GetMemorySize(tvm::relay::StorageToken*) [clone .isra.0]
  0: _ZN3tvm7runtime6deta
  File ".../tvm/src/relay/backend/graph_plan_memory.cc", line 411
TVMError: 
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (pval != nullptr) is false: Cannot allocate memory symbolic tensor shape [(nullptr), 2, 1, 1]

Replacing None with 1 in the input shape passed to from_keras solved the issue.
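
Concretely, this is the call that compiles cleanly for me:

mod, params_dict = relay.frontend.from_keras(model, {'input': (1, 2, 1, 1)})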

However, I presume that a segfault at opt_level=3 is unintended behaviour, so this issue might still need a fix.

relay.build(...) is only for completely static graphs. For dynamic shapes, you need to use vm.compile(...). But yes, we should never segfault… I’m looking into what is causing it.
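
For reference, a rough sketch of the VM path (untested against this model; by vm.compile I mean relay.vm.compile):

from tvm import relay

# Compile with the Relay VM, which supports dynamic shapes,
# instead of the static graph executor behind relay.build()
with tvm.transform.PassContext(opt_level=3):
    vm_exec = relay.vm.compile(mod, target="llvm", params=params_dict)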

The segfault is happening because somewhere in the pipeline we try to simplify the expression “None”. A dynamic shape is supposed to be represented by Any, but you are using None:

def @main(%input: Tensor[(None, 2, 1, 1), float32] /* ty=Tensor[(None, 2, 1, 1), float32] */, %v_param_1: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %v_param_2: Tensor[(1), float32] /* ty=Tensor[(1), float32] */) -> Tensor[(None, 1), float32] {
  %0 = transpose(%input, axes=[0, 2, 3, 1]) /* ty=Tensor[(None, 1, 1, 2), float32] */;
  %1 = nn.batch_flatten(%0) /* ty=Tensor[(None, 2), float32] */;
  %2 = nn.dense(%1, %v_param_1, units=1) /* ty=Tensor[(None, 1), float32] */;
  nn.bias_add(%2, %v_param_2) /* ty=Tensor[(None, 1), float32] */
}

I didn’t know that it is possible to have None in a shape… anyway, you need to replace None with relay.Any() to specify that the batch dimension is dynamic.
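
Something along these lines, for example (untested; I’m assuming from_keras passes Any through to the shape unchanged):

mod, params_dict = relay.frontend.from_keras(
    model,
    # relay.Any() marks the batch dimension as dynamic
    {'input': (relay.Any(), 2, 1, 1)},
)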

But our Keras frontend probably doesn’t support dynamic input shapes. It is not really actively developed or maintained anyway, so I suggest converting the model to ONNX and using our ONNX frontend.
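
A rough sketch of that route, using the third-party tf2onnx package (untested; the input name and signature below are assumptions, so check them against the exported model):

import tensorflow as tf
import tf2onnx
import tvm.relay as relay

# Export the Keras model to ONNX; the TensorSpec and the name
# "input" are assumptions for this example
spec = (tf.TensorSpec((None, 2), tf.float32, name="input"),)
onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature=spec)

# Import through the ONNX frontend; a dynamic batch dimension should
# come through as Any, after which vm.compile is the way to build it
mod, params_dict = relay.frontend.from_onnx(onnx_model)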

Thanks, I’ll try that.