Can TVM now support batched inference? AutoTVM runs twice as long as TensorFlow

I have a TensorFlow model whose CPU inference performance is poor when serving online with a batch size of 500. After optimizing with AutoTVM, inference at batch size 500 is still much slower than TensorFlow at the same batch size. Can TVM support batched inference?

Over 50 runs at batch size 1000, TensorFlow took 8.62 s on my Mac, while the AutoTVM-compiled model took 15.85 s.

You can try tuning with batch size 1 and running inference with batch size 500. The total time should be roughly (batch size) * (single-batch inference time). The current TVM NCHW/NHWC conv2d templates do not tune over the batch size, but some work on this is ongoing.
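A minimal sketch of that workflow, assuming `graph_def` is the frozen TensorFlow graph, `"input"` is its input tensor name, the (500, 224, 224, 3) shape is a placeholder, and `"tune_batch1.log"` is a hypothetical AutoTVM log produced by tuning the same model at batch size 1 (newer TVM releases use `tvm.contrib.graph_executor`; older ones call it `graph_runtime`):

```python
import numpy as np
import tvm
from tvm import relay, autotvm
from tvm.contrib import graph_executor

# Build the Relay module at the serving batch size (500), but reuse the
# tuning records collected at batch size 1.
shape_dict = {"input": (500, 224, 224, 3)}
mod, params = relay.frontend.from_tensorflow(graph_def, shape=shape_dict)

with autotvm.apply_history_best("tune_batch1.log"):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target="llvm", params=params)

dev = tvm.cpu(0)
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("input", np.random.rand(500, 224, 224, 3).astype("float32"))
module.run()
```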


My model does not contain conv2d; the most time-consuming op is nn.dense. Do you mean using the tuning history to build the Relay module with batch size 500 and then run inference?

Dense is another issue, though. In that case you have to tune the model with batch size 500. Did you try the graph tuner after tuning each op? Another option is enabling cBLAS for dense ops by setting target=llvm -libs=cblas.
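A rough sketch of both options, assuming `mod` and `params` already describe the model with its input shape set to batch size 500 (the log file name and the choice of `nn.dense` as the only tuned op are illustrative):

```python
import tvm
from tvm import relay, autotvm

# Option 1: tune nn.dense at the serving batch size, then apply the log.
tasks = autotvm.task.extract_from_program(
    mod["main"], params=params, target="llvm",
    ops=(relay.op.get("nn.dense"),))
# ... run an AutoTVM tuner over `tasks` and write e.g. "dense_batch500.log" ...

# Option 2: skip dense tuning and offload dense ops to cBLAS instead.
target = "llvm -libs=cblas"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)
```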

Thank you very much. I will try what you said tonight. The graph tuner threw an exception, so I only tuned each op…

Thanks @comaniac. With batch size 500 and llvm -mcpu=haswell -libs=cblas, TVM gets a 2~3x performance improvement over TensorFlow. But the graph tuner still throws an exception.
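For reference, a small timing sketch along the lines of what I measured, assuming `mod`/`params` hold the batch-500 model and the input is named `"input"` with a hypothetical (500, 128) shape:

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

target = "llvm -mcpu=haswell -libs=cblas"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

dev = tvm.cpu(0)
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("input", np.random.rand(500, 128).astype("float32"))

# time_evaluator runs the compiled module repeatedly and reports wall-clock stats.
ftimer = module.module.time_evaluator("run", dev, number=10, repeat=3)
times_ms = np.array(ftimer().results) * 1000
print("Mean batch-500 inference time: %.2f ms" % times_ms.mean())
```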

I am not sure whether the graph tuner is still applicable when cBLAS is used. Maybe @kevinthesun could provide more details about it.

You don’t need graph tuning when using cBLAS.

Thanks, @kevinthesun, @comaniac