Hi, I was trying to measure the execution time of the Relay add operator in ResNet-50 with a batch size of 16 using TimeEvaluator (tvm.contrib.graph_runtime.GraphModule.module.time_evaluator). While I can measure every other operator in ResNet-50 without any error, I get the following error only for the Relay add operator with a batch size of 16. The same code works fine for the add op with a batch size of 1. Could you enlighten me as to how to solve this error?
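For context, the measurement is roughly the following (a simplified sketch, not the exact code in record.py; the input names and the number/repeat settings here are illustrative):

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

# Sketch: extract the add op into its own Relay module with the shapes
# that appear in the fused function dump below.
shape = (16, 256, 56, 56)
x = relay.var("x", shape=shape, dtype="float32")
y = relay.var("y", shape=shape, dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([x, y], relay.add(x, y)))

target = "cuda"
ctx = tvm.gpu(0)
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target)

m = graph_runtime.GraphModule(lib["default"](ctx))
m.set_input("x", np.random.uniform(size=shape).astype("float32"))
m.set_input("y", np.random.uniform(size=shape).astype("float32"))

# Time the whole module (which contains only the single fused add kernel).
ftimer = m.module.time_evaluator("run", ctx, number=10, repeat=3)
print(np.mean(ftimer().results))
```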
[build_module.cc] Relaybuild module
[build_module.cc] BuildRelay [build_module.cc] Fuse Ops
[Fused Pass] Transform [fused_ops.cc] VisitExpr:FunctionNode [inline.cc] main
[build_module.cc] Before Graph CodeGen
[graph_runtime_codegen.cc] GraphCodeGen()
[graph_runtime_codegen.cc] Before VisitExpr
[graph_runtime_codegen.cc] Handle external function
[graph_runtime_codegen.cc] Relay function: Homogeneous? Hetero?
[graph_runtime_codegen.cc] lowered_funcs
[Compile_engie.cc] LowerInternal
[Compile_engie.cc] LowerInternal - Before Create Sched
[Compile_engie.cc] ScheduleGetter::Create
FunctionNode([Var(p0, ty=TensorType([16, 256, 56, 56], float32)), Var(p1, ty=TensorType([16, 256, 56, 56], float32))], TensorType([16, 256, 56, 56], float32), CallNode(Op(add), [Var(p0, ty=TensorType([16, 256, 56, 56], float32)), Var(p1, ty=TensorType([16, 256, 56, 56], float32))], (nullptr), [TensorType([16, 256, 56, 56], float32), TensorType([16, 256, 56, 56], float32)]), [], {"Primitive": 1}) ==> Possibly fused function!
[Compile_engie.cc] ScheduleGetter, Before VisitExpr
[Compile_engie.cc] VisitExpr -> call lower_call()
[Compile_engie.cc] Before lowering: add
[Compile_engine.py] lower_call add
Select Implementation: add
[Compile_engie.cc] LoweredOutput
[Compile_engie.cc] After lowering
[Compile_engie.cc] anchor_op: Op(add)
[Compile_engie.cc] ScheduleGetter, After VisitExpr
[Compile_engie.cc] LowerInternal - After Create Sched
[Compile_engie.cc] Call relay.backend.lower: fused_add
[build_module.py] lower
[Compile_engie.cc] Done relay.backend.lower: fused_add
[graph_runtime_codegen.cc] GraphAddCallNode: fused_add // fused_add
[graph_runtime_codegen.cc] After VisitExpr
[graph_runtime_codegen.cc] Before lower external function
[graph_runtime_codegen.cc] After lower external function
[build_module.cc] After Graph CodeGen
[build_module.cc] Lower func
[build_module.cc] Generate a placeholder func
[build_module.cc] tvm::build!
[driver_api.cc] Build for heterogeneous exec2
[driver_api.cc] Build for heterogeneous exec
[build_module.cc] Get External Modules
Traceback (most recent call last):
File "record.py", line 144, in <module>
log_network_backend_ops_perf_on_target(backendop_lib, Target.TVM_GPU, 'resnet-50', batch_size = 16)
File "record.py", line 43, in log_network_backend_ops_perf_on_target
relay.analysis.post_order_visit(mod['main'], lambda expr: log_backend_op_perf(b_op_lib, expr, target))
File "/home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/python/tvm/relay/analysis/analysis.py", line 59, in post_order_visi
t
return _ffi_api.post_order_visit(expr, fvisit)
File "/home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
[bt] (8) /home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/build/libtvm.so(tvm::relay::ExprVisitor::VisitExpr(tvm::RelayExpr const&)+0x8b) [0x7f7d8986642b]
[bt] (7) /home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/build/libtvm.so(tvm::relay::ExprFunctor<void (tvm::RelayExpr const&)>::VisitExpr(tvm::RelayExpr const&)+0x6f) [0x7f7d898140cf]
[bt] (6) /home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/build/libtvm.so(tvm::relay::ExprApplyVisit::VisitExpr(tvm::RelayExpr const&)+0x105) [0x7f7d8986e0a5]
[bt] (5) /home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/build/libtvm.so(tvm::relay::ExprVisitor::VisitExpr(tvm::RelayExpr const&)+0x8b) [0x7f7d8986642b]
[bt] (4) /home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/build/libtvm.so(tvm::relay::ExprFunctor<void (tvm::RelayExpr const&)>::VisitExpr(tvm::RelayExpr const&)+0x6f) [0x7f7d898140cf]
[bt] (3) /home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/build/libtvm.so(tvm::relay::ExprVisitor::VisitExpr_(tvm::relay::CallNode const*)+0x158) [0x7f7d89862b78]
[bt] (2) /home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/build/libtvm.so(tvm::relay::ExprApplyVisit::VisitExpr(tvm::RelayExpr const&)+0x11a) [0x7f7d8986e0ba]
[bt] (1) /home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/build/libtvm.so(+0x130cec0) [0x7f7d89862ec0]
[bt] (0) /home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/build/libtvm.so(+0x55d4d1) [0x7f7d88ab34d1]
File "/home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 81, in cfun
rv = local_pyfunc(*pyargs)
File "record.py", line 43, in <lambda>
relay.analysis.post_order_visit(mod['main'], lambda expr: log_backend_op_perf(b_op_lib, expr, target))
File "record.py", line 23, in log_backend_op_perf
res = get_optimal_backendop(b_op_lib, expr, pattern, [target])
File "/home/byungsoj/backend-aware-graph-opt/backend_operator/backend_op.py", line 183, in get_optimal_backendop
cost = op.get_cost(subgraph)
File "/home/byungsoj/backend-aware-graph-opt/backend_operator/backend_op.py", line 76, in get_cost
cost_info = cost_func(self._name, expr, self._target)
File "/home/byungsoj/backend-aware-graph-opt/backend_operator/target.py", line 99, in measure_cost
return measure(ftimer)
File "/home/byungsoj/backend-aware-graph-opt/backend_operator/target.py", line 27, in measure
ftimer(*args)
File "/home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/python/tvm/runtime/module.py", line 226, in evaluator
blob = feval(*args)
File "/home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
raise get_last_ffi_error()
[bt] (3) /home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/build/libtvm.so(TVMFuncCall+0x63) [0x7f7d89998413]
[bt] (2) /home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/build/libtvm.so(std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::detail::PackFuncVoidAddr_<4, tvm::runtime::CUDAWrappedFunc>(tvm::runtime::CUDAWrappedFunc, std::vector<tvm::runtime::detail::ArgConvertCode, std::allocator<tvm::runtime::detail::ArgConvertCode> > const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)+0xbf) [0x7f7d89a34c6f]
[bt] (1) /home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/build/libtvm.so(tvm::runtime::CUDAWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, void**) const+0x5c7) [0x7f7d89a34707]
[bt] (0) /home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/build/libtvm.so(+0x14d9ea8) [0x7f7d89a2fea8]
File "/home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 81, in cfun
rv = local_pyfunc(*pyargs)
File "record.py", line 43, in <lambda>
relay.analysis.post_order_visit(mod['main'], lambda expr: log_backend_op_perf(b_op_lib, expr, target))
File "record.py", line 23, in log_backend_op_perf
res = get_optimal_backendop(b_op_lib, expr, pattern, [target])
File "/home/byungsoj/backend-aware-graph-opt/backend_operator/backend_op.py", line 183, in get_optimal_backendop
cost = op.get_cost(subgraph)
File "/home/byungsoj/backend-aware-graph-opt/backend_operator/backend_op.py", line 76, in get_cost
cost_info = cost_func(self._name, expr, self._target)
File "/home/byungsoj/backend-aware-graph-opt/backend_operator/target.py", line 99, in measure_cost
return measure(ftimer)
File "/home/byungsoj/backend-aware-graph-opt/backend_operator/target.py", line 27, in measure
ftimer(*args)
File "/home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/python/tvm/runtime/module.py", line 226, in evaluator
blob = feval(*args)
File "/home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
raise get_last_ffi_error()
[bt] (3) /home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/build/libtvm.so(TVMFuncCall+0x63) [0x7f7d89998413]
[bt] (2) /home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/build/libtvm.so(std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::detail::PackFuncVoidAddr_<4, tvm::runtime::CUDAWrappedFunc>(tvm::runtime::CUDAWrappedFunc, std::vector<tvm::runtime::detail::ArgConvertCode, std::allocator<tvm::runtime::detail::ArgConvertCode> > const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)+0xbf) [0x7f7d89a34c6f]
[bt] (1) /home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/build/libtvm.so(tvm::runtime::CUDAWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, void**) const+0x5c7) [0x7f7d89a34707]
[bt] (0) /home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/build/libtvm.so(+0x14d9ea8) [0x7f7d89a2fea8]
File "/home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/src/runtime/cuda/cuda_module.cc", line 190
File "/home/byungsoj/backend-aware-graph-opt/backend_operator/tvm/src/runtime/library_module.cc", line 78
TVMError:
---------------------------------------------------------------
An internal invariant was violated during the execution of TVM.
Please read TVM's error reporting guidelines.
More details can be found here: https://discuss.tvm.ai/t/error-reporting/7793.
---------------------------------------------------------------
Check failed: ret == 0 (-1 vs. 0) : TVMError: CUDALaunch Error: CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
grid=(256,1,1), block=(1024,1,1)
// func_name=fused_add_kernel0
// CUDA Source
// -----------
extern "C" __global__ void fused_add_kernel0(float* __restrict__ T_add, float* __restrict__ placeholder, float* __restrict__ placeholde
r1) {
for (int ax0_ax1_fused_ax2_fused_ax3_fused_outer = 0; ax0_ax1_fused_ax2_fused_ax3_fused_outer < 49; ++ax0_ax1_fused_ax2_fused_ax3_fus
ed_outer) {
T_add[((((ax0_ax1_fused_ax2_fused_ax3_fused_outer * 262144) + (((int)blockIdx.x) * 1024)) + ((int)threadIdx.x)))] = (placeholder[((
((ax0_ax1_fused_ax2_fused_ax3_fused_outer * 262144) + (((int)blockIdx.x) * 1024)) + ((int)threadIdx.x)))] + placeholder1[((((ax0_ax1_fu
sed_ax2_fused_ax3_fused_outer * 262144) + (((int)blockIdx.x) * 1024)) + ((int)threadIdx.x)))]);
}
}
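For what it's worth, the launch configuration does line up with the tensor shape; here is a quick sanity check of the arithmetic (my own, not from the log), which makes me think the total amount of work is covered correctly and the problem is with the launch itself:

```python
# grid=(256,1,1), block=(1024,1,1), with an outer loop of 49 iterations,
# covers exactly one float per element of a (16, 256, 56, 56) tensor.
elements = 16 * 256 * 56 * 56   # 12,845,056 elements per input/output
covered = 256 * 1024 * 49       # gridDim.x * blockDim.x * loop trip count
assert elements == covered == 12845056
```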