I have a TensorFlow model. Its CPU inference performance is poor when serving batches of 500 online. After optimizing with AutoTVM, the batch-500 performance is still much worse than TensorFlow at batch 500. Can TVM support batch inference?
For 50 runs at batch size 1000, TensorFlow takes 8.62s on my Mac, while AutoTVM takes 15.85s.
You can try tuning with batch size 1 and running inference with batch size 500. The total time should be just around (batch size) * (single-batch inference time). The current TVM NCHW/NHWC conv2d templates do not tune over the batch size, but some work on that is ongoing.
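A minimal sketch of that workflow, assuming a TVM 0.6/0.7-era build flow; `mod`, `params`, the input name `"input"`, the feature dimension, and the log file name `dense_tuning.log` are placeholders, not values from this thread:

```python
import numpy as np
import tvm
from tvm import relay, autotvm
from tvm.contrib import graph_runtime

# Assumptions: `mod`/`params` were produced by relay.frontend.from_tensorflow
# with the input shape set to the inference batch (e.g. (500, 128)), and
# dense_tuning.log holds AutoTVM records collected at batch size 1.
target = "llvm -mcpu=haswell"

with autotvm.apply_history_best("dense_tuning.log"):
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build(mod, target=target, params=params)

ctx = tvm.cpu(0)
module = graph_runtime.create(graph, lib, ctx)
module.set_input(**params)
module.set_input("input", np.random.rand(500, 128).astype("float32"))
module.run()
out = module.get_output(0)
```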
My model does not contain conv2d; the most time-consuming op is nn.dense. Do you mean using the tuning history to build the Relay module with batch size 500 and then run inference?
Dense is a different issue, though. In this case you have to tune the model with batch size 500. Did you try the graph tuner after tuning each op? Another option is enabling cBLAS for dense ops by setting target=llvm -libs=cblas.
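For reference, a hedged sketch of the cBLAS route; `mod`/`params` are placeholders, and this assumes your TVM build was compiled with BLAS support (e.g. MKL or OpenBLAS):

```python
import tvm
from tvm import relay

# Offload nn.dense to the cBLAS-backed implementation instead of
# AutoTVM-tuned schedules for the dense layers.
target = "llvm -mcpu=haswell -libs=cblas"
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target=target, params=params)
```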
Thanks @comaniac. With batch size 500 and llvm -mcpu=haswell -libs=cblas, TVM now gets a 2~3X performance improvement over TensorFlow. However, the graph tuner still throws an exception:
https://github.com/apache/incubator-tvm/issues/5369