We have observed that TF object detection models have very long compilation times. For example, SSD MobileNet takes around 15 minutes and Faster RCNN takes around 88 minutes on an EC2 C5.9xlarge machine (Skylake).
To pinpoint the issue, I profiled all the passes. The following is the breakdown for SSD MobileNet.
The second column is the number of invocations, the third is the total time, and the fourth is the average time per invocation.
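For reference, this is roughly how per-pass timing can be collected: apply Relay passes one at a time and wrap each call with a timer. A minimal sketch, where the pass list is illustrative and not the exact sequence behind the numbers above:

```python
import time
import tvm
from tvm import relay

# Illustrative subset of passes; the real pipeline is whatever
# relay.build / relay.vm.compile runs internally.
passes = [
    relay.transform.InferType(),
    relay.transform.SimplifyInference(),
    relay.transform.FoldConstant(),
    relay.transform.FuseOps(),
]

def profile_passes(mod):
    """Apply each pass separately and record wall-clock time per pass."""
    stats = {}
    for p in passes:
        start = time.time()
        mod = p(mod)
        stats[p.info.name] = time.time() - start
    return mod, stats
```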
As you can see, FoldConstant and EtaExpand take the majority of the time. FoldConstant is a function pass, so it is invoked once for every function in the module. Each FoldConstant invocation creates an Interpreter, which in turn runs EtaExpand, so EtaExpand is counted twice.
The main culprit is CreateInterpreter in the FoldConstant pass. CreateInterpreter makes a copy of almost the whole module. TF SSD models are pretty big, so this copy alone adds noticeable overhead. But the real slowdown comes from calling CreateInterpreter again and again, once for each function in the module.
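To make the cost pattern concrete, here is a toy illustration (plain Python, not TVM internals): when per-function work rebuilds state whose cost is proportional to the whole module, total time grows with num_functions × module_size; building that state once keeps it roughly linear.

```python
import time

class ToyModule:
    """Stand-in for an IRModule with many functions."""
    def __init__(self, num_funcs, func_size):
        self.functions = [list(range(func_size)) for _ in range(num_funcs)]

def build_interpreter(module):
    # Stand-in for CreateInterpreter: cost proportional to module size.
    return [f[:] for f in module.functions]

def fold_per_function(module):
    # The pattern described above: the interpreter is rebuilt for every function.
    for _ in module.functions:
        build_interpreter(module)

def fold_with_shared_interpreter(module):
    # One way to avoid the blow-up: build the interpreter once and reuse it.
    interp = build_interpreter(module)
    for _ in module.functions:
        pass  # fold constants in each function using the shared interpreter

mod = ToyModule(num_funcs=200, func_size=5000)
for fn in (fold_per_function, fold_with_shared_interpreter):
    start = time.time()
    fn(mod)
    print(fn.__name__, round(time.time() - start, 3), "s")
```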
Yes, I don’t know what is happening in the VM, but I wonder if we are computing the same constant or inferring the same type over and over again (similar to what is happening in the frontend).
@masahi what you are suggesting might be happening here. I think we are making a copy and performing type inference again and again.
The VM ConstantFold subgraphs are big. They might share large portions, and we might be throwing away all of that TypeInference work each time a new func is added to the mod.
Updated the numbers and the text in the original post. The passes are now measured once per module (earlier they were measured for each func in the module, causing the total invocation counts to blow up).
Thanks all. This major slowdown is addressed in the above PR.
Overall, the situation has improved considerably, but compilation is still slow from a TVM user's perspective. For example, Mask RCNN and Faster RCNN still take over 20 minutes. This time, the bottlenecks are pretty clear: the top two contributors are the TF parser and the ManifestAlloc pass (possibly because it is written in Python). Printing the stats for mask_rcnn:
Interestingly, compiling Faster RCNN and Mask RCNN from PyTorch, enabled by the PR https://github.com/apache/incubator-tvm/pull/6449, takes less than 3 minutes on my laptop. I wonder where the difference in compilation time between TF and PyTorch comes from.
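One way to narrow that down is to time the frontend import separately from the Relay/VM compilation, so the parser cost and the pass cost show up independently. A minimal sketch, assuming a frozen graph_def and a shape_dict are already prepared (both names are placeholders):

```python
import time
import tvm
from tvm import relay

start = time.time()
# TF parser: graph_def -> Relay module
mod, params = relay.frontend.from_tensorflow(graph_def, shape=shape_dict)
print("TF import: %.1f s" % (time.time() - start))

start = time.time()
# Relay passes (including ManifestAlloc) + VM codegen
with tvm.transform.PassContext(opt_level=3):
    vm_exec = relay.vm.compile(mod, target="llvm", params=params)
print("Relay VM compile: %.1f s" % (time.time() - start))
```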