I have a TensorFlow model. Its CPU inference performance is poor when serving batches of 500 online. After optimizing with AutoTVM, the batch-500 performance is still much worse than TensorFlow at batch 500. Can TVM support batch inference?
For 50 runs at batch size 1000, TensorFlow takes 8.62s on my Mac, while AutoTVM takes 15.85s.
You can try tuning with batch size 1 and running inference with batch size 500. The total time should be just around (batch size) * (single-batch inference time). The current TVM NCHW/NHWC conv2d templates do not tune over the batch size, but some work on that is ongoing.
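A minimal sketch of that workflow, assuming a TVM 0.6/0.7-era build flow; `mod`, `params`, the input name `"input"`, the feature dimension, and the log file name `dense_tuning.log` are placeholders, not values from this thread:

```python
import numpy as np
import tvm
from tvm import relay, autotvm
from tvm.contrib import graph_runtime

# Assumptions: `mod`/`params` were produced by relay.frontend.from_tensorflow
# with the input shape set to the inference batch (e.g. (500, 128)), and
# dense_tuning.log holds AutoTVM records collected at batch size 1.
target = "llvm -mcpu=haswell"

with autotvm.apply_history_best("dense_tuning.log"):
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build(mod, target=target, params=params)

ctx = tvm.cpu(0)
module = graph_runtime.create(graph, lib, ctx)
module.set_input(**params)
module.set_input("input", np.random.rand(500, 128).astype("float32"))
module.run()
out = module.get_output(0)
```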
My model does not contain conv2d; the most time-consuming op is nn.dense. Do you mean using the tuning history to build the Relay module with batch size 500 and then run inference?
Dense is a different issue, though. In this case you have to tune the model with batch size 500. Did you try the graph tuner after tuning each op? Another option is enabling cBLAS for dense ops by setting target=llvm -libs=cblas.
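For reference, a hedged sketch of the cBLAS route; `mod`/`params` are placeholders, and this assumes your TVM build was compiled with BLAS support (e.g. MKL or OpenBLAS):

```python
import tvm
from tvm import relay

# Offload nn.dense to the cBLAS-backed implementation instead of
# AutoTVM-tuned schedules for the dense layers.
target = "llvm -mcpu=haswell -libs=cblas"
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target=target, params=params)
```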
Thanks @comaniac. With batch size 500 and llvm -mcpu=haswell -libs=cblas, TVM now gets a 2~3X performance improvement over TensorFlow. However, the graph tuner still throws an exception:
https://github.com/apache/incubator-tvm/issues/5369