In a newly created Docker container, TVM takes an extremely long time on the first loop

When we run a TVM model in a newly created Docker container, the first loop takes over 3800 ms, while the average running time over 1000 loops is only 30 ms.
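For reference, the measurement can be reproduced with a small harness like the one below. This is only a sketch: `run_once` is a hypothetical stand-in for whatever call executes the model once, and it is assumed to block until the GPU work has finished (otherwise the timings only cover the kernel launch). In this sketch the first call is kept out of the average.

```python
import time


def time_first_and_average(run_once, loops=1000):
    """Return (first_ms, avg_ms) for a callable that runs the model once.

    `run_once` is a hypothetical placeholder; it is assumed to be synchronous,
    i.e. it returns only after the device has finished the work.
    """
    # Time the very first call on its own: this is where one-time costs
    # (compilation, module loading, context creation) would show up.
    start = time.perf_counter()
    run_once()
    first_ms = (time.perf_counter() - start) * 1000.0

    # Time the remaining calls together and report their mean.
    start = time.perf_counter()
    for _ in range(loops):
        run_once()
    avg_ms = (time.perf_counter() - start) * 1000.0 / loops

    return first_ms, avg_ms
```

Calling `time_first_and_average(run_once, loops=1000)` gives the two numbers separately, which makes it easy to compare a fresh container against a committed image.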

If I commit the Docker image after running a model and use the new image, the first run of that model takes the normal time, but it still takes 3800 ms the first time I run a different model.

I checked the cache but didn’t find anything.
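In case it helps to reproduce the check, here is a sketch of how the cache can be inspected, assuming the relevant cache is the CUDA driver's JIT compute cache. On Linux this defaults to `~/.nv/ComputeCache` and is controlled by the `CUDA_CACHE_PATH`, `CUDA_CACHE_DISABLE` and `CUDA_CACHE_MAXSIZE` environment variables. The helper name `cuda_cache_info` is made up for this example, and whether this cache is actually involved here is a guess.

```python
import os
from pathlib import Path


def cuda_cache_info(env=None):
    """Summarize the CUDA JIT compute cache location and contents.

    Assumes the Linux default location (~/.nv/ComputeCache) unless
    CUDA_CACHE_PATH is set. `env` defaults to the process environment.
    """
    env = os.environ if env is None else env
    default_dir = Path.home() / ".nv" / "ComputeCache"
    cache_dir = Path(env.get("CUDA_CACHE_PATH", str(default_dir)))

    files = 0
    total_bytes = 0
    if cache_dir.is_dir():
        # Walk the cache directory and add up the size of every file in it.
        for entry in cache_dir.rglob("*"):
            if entry.is_file():
                files += 1
                total_bytes += entry.stat().st_size

    return {
        "path": str(cache_dir),
        "exists": cache_dir.is_dir(),
        "files": files,
        "bytes": total_bytes,
        # CUDA_CACHE_DISABLE=1 turns the JIT cache off entirely.
        "disabled": env.get("CUDA_CACHE_DISABLE") == "1",
        # None means the driver's default size limit is in effect.
        "max_size": env.get("CUDA_CACHE_MAXSIZE"),
    }
```

Running `print(cuda_cache_info())` inside the container before and after the first model run shows whether the directory exists and whether it grows.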

We added timing in the C++ code: exec_op(i) takes 4000 ms at index 6. I am stuck here, since I don't know what exec_op does.

My driver version is 470.63 and my CUDA version is 11.1. Is this problem caused by CUDA PTX?

Is this common, or is it a rare case? Is there anything I can do to avoid the warm-up?