We have observed that TF object detection models have very long compilation times. For example, SSD MobileNet takes around 15 minutes and Faster RCNN takes around 88 minutes on an EC2 C5.9xlarge machine (Skylake).
To pinpoint the issue, I profiled all the passes. The following is the breakdown for SSD MobileNet.
The second column is the number of invocations, the third is the total time, and the fourth is the average time per invocation.
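For reference, this is roughly how per-pass timing can be collected: apply Relay passes one at a time and wrap each call with a timer. A minimal sketch, where the pass list is illustrative and not the exact sequence behind the numbers above:

```python
import time
import tvm
from tvm import relay

# Illustrative subset of passes; the real pipeline is whatever
# relay.build / relay.vm.compile runs internally.
passes = [
    relay.transform.InferType(),
    relay.transform.SimplifyInference(),
    relay.transform.FoldConstant(),
    relay.transform.FuseOps(),
]

def profile_passes(mod):
    """Apply each pass separately and record wall-clock time per pass."""
    stats = {}
    for p in passes:
        start = time.time()
        mod = p(mod)
        stats[p.info.name] = time.time() - start
    return mod, stats
```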
As you can see, FoldConstant and EtaExpand take the majority of the time. FoldConstant is a function pass, so it is invoked once for every function in the module. Each FoldConstant invocation creates an Interpreter, which in turn runs EtaExpand, so EtaExpand is counted twice.
The main culprit is CreateInterpreter in the FoldConstant pass. CreateInterpreter makes a copy of almost the whole module. TF SSD models are pretty big, so this copy alone adds noticeable overhead. But the real slowdown comes from calling CreateInterpreter again and again, once for each function in the module.
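To make the cost pattern concrete, here is a toy illustration (plain Python, not TVM internals): when per-function work rebuilds state whose cost is proportional to the whole module, total time grows with num_functions × module_size; building that state once keeps it roughly linear.

```python
import time

class ToyModule:
    """Stand-in for an IRModule with many functions."""
    def __init__(self, num_funcs, func_size):
        self.functions = [list(range(func_size)) for _ in range(num_funcs)]

def build_interpreter(module):
    # Stand-in for CreateInterpreter: cost proportional to module size.
    return [f[:] for f in module.functions]

def fold_per_function(module):
    # The pattern described above: the interpreter is rebuilt for every function.
    for _ in module.functions:
        build_interpreter(module)

def fold_with_shared_interpreter(module):
    # One way to avoid the blow-up: build the interpreter once and reuse it.
    interp = build_interpreter(module)
    for _ in module.functions:
        pass  # fold constants in each function using the shared interpreter

mod = ToyModule(num_funcs=200, func_size=5000)
for fn in (fold_per_function, fold_with_shared_interpreter):
    start = time.time()
    fn(mod)
    print(fn.__name__, round(time.time() - start, 3), "s")
```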
Yes, I don’t know what is happening in the VM, but I wonder if we are computing the same constant or inferring the same type over and over again (similar to what is happening in the frontend).
@masahi what you are suggesting might be happening here. I think we are making a copy and performing type inference again and again.
The VM ConstantFold subgraphs are big. They might share large portions, and we might be throwing away all of that TypeInference work each time a new func is added to the mod.
Updated the numbers and the text in the original post. The passes are now measured once per module (earlier they were measured for each func in the module, causing the total invocation counts to blow up).
Thanks all. This major slowdown is addressed in the above PR.
Overall, the situation has improved considerably, but compilation is still slow from a TVM user's perspective. For example, Mask RCNN and Faster RCNN still take over 20 minutes. This time, the bottlenecks are pretty clear: the top two contributors are the TF parser and the ManifestAlloc pass (possibly because it is written in Python). Printing the stats for mask_rcnn:
Interestingly, compiling Faster RCNN and Mask RCNN from PyTorch, enabled by the PR https://github.com/apache/incubator-tvm/pull/6449, takes less than 3 minutes on my laptop. I wonder where the difference in compilation time between TF and PyTorch comes from.
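One way to narrow that down is to time the frontend import separately from the Relay/VM compilation, so the parser cost and the pass cost show up independently. A minimal sketch, assuming a frozen graph_def and a shape_dict are already prepared (both names are placeholders):

```python
import time
import tvm
from tvm import relay

start = time.time()
# TF parser: graph_def -> Relay module
mod, params = relay.frontend.from_tensorflow(graph_def, shape=shape_dict)
print("TF import: %.1f s" % (time.time() - start))

start = time.time()
# Relay passes (including ManifestAlloc) + VM codegen
with tvm.transform.PassContext(opt_level=3):
    vm_exec = relay.vm.compile(mod, target="llvm", params=params)
print("Relay VM compile: %.1f s" % (time.time() - start))
```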