When we run a TVM model in a newly created Docker container, the first loop takes over 3800 ms, while the average running time over 1000 loops is only 30 ms.
If I commit the Docker image after running a model and then use the new image, the first run of that model takes the normal time, but the first run of any other model still takes 3800 ms.
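For reference, here is a minimal sketch of how the first-run and steady-state numbers above can be measured separately. The helper name `time_first_and_steady` and the stand-in workload are hypothetical and not part of TVM. For a real GPU model, the callable would need to block until the device has finished, otherwise the timings only cover the kernel launch.

```python
import time


def time_first_and_steady(run, loops=1000):
    """Return (first_ms, avg_ms) for a zero-argument callable `run`.

    `run` is assumed to block until the work is finished. For a GPU model
    this means synchronizing the device inside the callable (assumption).
    """
    # Time the very first call on its own, since it includes any one-time cost.
    start = time.perf_counter()
    run()
    first_ms = (time.perf_counter() - start) * 1000.0

    # Time the following calls together and report the average per call.
    start = time.perf_counter()
    for _ in range(loops):
        run()
    avg_ms = (time.perf_counter() - start) * 1000.0 / loops
    return first_ms, avg_ms


if __name__ == "__main__":
    # Hypothetical stand-in for a model: slow on the first call only.
    state = {"warm": False}

    def fake_run():
        if not state["warm"]:
            time.sleep(0.05)  # simulated one-time initialization cost
            state["warm"] = True

    first_ms, avg_ms = time_first_and_steady(fake_run, loops=100)
    print(f"first run: {first_ms:.1f} ms, average of later runs: {avg_ms:.3f} ms")
```

Reporting the first call separately from the average makes it easier to compare a fresh container against a committed image.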
I checked the cache but didn't find anything.
We added timing to the C++ code: exec_op(i) takes 4000 ms at index 6. I am stuck here because I don't know what exec_op is doing.
My driver version is 470.63 and my CUDA version is 11.1. Is this problem caused by CUDA PTX?
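One thing that might be worth checking, assuming the delay really comes from the CUDA driver JIT-compiling PTX, is the driver's compute cache. The sketch below only inspects the cache directory. `CUDA_CACHE_PATH` and `CUDA_CACHE_DISABLE` are CUDA driver environment variables, and `~/.nv/ComputeCache` is the default location on Linux. The helper name `cuda_cache_info` is hypothetical. Whether this cache explains the 3800 ms is an assumption, not something I have confirmed.

```python
import os
from pathlib import Path


def cuda_cache_info(env=None, home=None):
    """Summarize the CUDA driver's JIT compute cache directory.

    `env` and `home` can be overridden for testing; by default the current
    process environment and the current user's home directory are used.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # CUDA_CACHE_PATH overrides the default Linux location ~/.nv/ComputeCache.
    cache_dir = Path(env.get("CUDA_CACHE_PATH") or home / ".nv" / "ComputeCache")

    files = []
    if cache_dir.is_dir():
        files = [p for p in cache_dir.rglob("*") if p.is_file()]

    return {
        "cache_dir": str(cache_dir),
        "exists": cache_dir.is_dir(),
        # CUDA_CACHE_DISABLE=1 turns the JIT cache off entirely.
        "disabled": env.get("CUDA_CACHE_DISABLE") == "1",
        "num_files": len(files),
        "total_bytes": sum(p.stat().st_size for p in files),
    }


if __name__ == "__main__":
    for key, value in cuda_cache_info().items():
        print(f"{key}: {value}")
```

Running this inside a fresh container before and after the first inference would show whether the directory is being populated. If it is, keeping that directory on a mounted volume (or setting `CUDA_CACHE_PATH` to one) might preserve it across containers, but I have not verified that this removes the delay.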
Is this a common problem, or does it only happen in a few cases? Is there anything I can do to avoid the warm-up?