Question
I modified tvm/app/benchmark/gpu_imagenet_bench.py and tutorial/from_mxnet.py to run inference in a loop of 1000 iterations, in order to measure the ResNet-18 speedup.
The average time, excluding the first few iterations, looks good, but the first few iterations are very slow.
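For reference, here is a minimal sketch of how the per-iteration times could be recorded so that warm-up and steady state are reported separately. It is not the code from either script: `time_iterations`, `summarize`, and `dummy_inference` are hypothetical names, and the placeholder workload stands in for the real inference call. On a GPU, the timed call is assumed to block until the result is ready (for example by copying the output back to the host); otherwise asynchronous kernel launches would skew the numbers.

```python
import time


def time_iterations(run_once, n_iters):
    """Return the wall-clock time of each call to run_once, in milliseconds."""
    times_ms = []
    for _ in range(n_iters):
        start = time.perf_counter()
        run_once()  # assumed to block until the result is available
        times_ms.append((time.perf_counter() - start) * 1000.0)
    return times_ms


def summarize(times_ms, n_warmup=2):
    """Split per-iteration times into warm-up times and a steady-state mean."""
    warmup = times_ms[:n_warmup]
    steady = times_ms[n_warmup:]
    return {
        "warmup_ms": warmup,
        "steady_mean_ms": sum(steady) / len(steady),
    }


if __name__ == "__main__":
    def dummy_inference():
        # Hypothetical placeholder; replace with the real forward pass.
        sum(range(10000))

    stats = summarize(time_iterations(dummy_inference, 1000), n_warmup=2)
    print("warm-up iterations (ms):", stats["warmup_ms"])
    print("steady-state mean (ms):", stats["steady_mean_ms"])
```

Reporting the first `n_warmup` iterations on their own makes it easy to compare the one-time startup cost between frameworks without letting it distort the average.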
Platform
i7 + 1080Ti
tvm with CUDA + cudnn + cublas
CUDA version: 8.0
Result
average
benchmark: 1.39 ms
from_mxnet: 1.4 ms
mxnet 1.4 + cudnn: 10.49 ms
first two iterations
from_mxnet: 7.53 s, 18.7 ms
mxnet 1.4 + cudnn: 0.097 s, 12 ms
Note
The first two warm-up iterations in TVM take much longer than in MXNet (7.53 s versus 0.097 s for the first iteration). Is this normal? If so, how can I reduce this warm-up cost, and is there anything in TVM that can be tuned for it? Thanks a lot!