Methods to obtain operator execution time

I want to obtain the execution time of each operator in a TVM model. The execution times reported by the debugger are unstable, so I measure the execution time of each operator directly in C++, as shown below.

// Requires <chrono>. Modified GraphRuntime::Run() that times each operator.
void GraphRuntime::Run() {
  for (size_t i = 0; i < op_execs_.size(); ++i) {
    if (op_execs_[i]) {
      LOG(INFO) << "Executing " << i << ": " << nodes_[i].name << "...";
      // Warm up: 10 untimed calls.
      for (size_t j = 0; j < 10; ++j) {
        op_execs_[i]();
      }

      // Time 100 calls and report the average per call.
      auto start = std::chrono::high_resolution_clock::now();
      for (size_t j = 0; j < 100; ++j) {
        op_execs_[i]();
      }
      auto stop = std::chrono::high_resolution_clock::now();
      auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
      LOG(INFO) << "Time taken by function (--" << nodes_[i].name << "--): --" << duration.count() / 100.0 << "-- microseconds";
    }
  }
}

After obtaining the average execution time of every operator, I compared the sum of these averages with the execution time of the whole model. The sum is much smaller than the whole-model time: for example, 43 ms versus 117 ms.

I measure the execution time of the whole model as follows.

import time

repeat = 100
start = time.time()
for i in range(repeat):
    model.run()
end = time.time()
# Average wall-clock time per run, in seconds.
whole_model_exec_time = (end - start) / repeat
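
One difference between the two measurements is that the C++ loop warms up before timing while the Python loop does not. Below is a minimal sketch of a helper that applies the same warm-up and repeat counts to any callable, so the two numbers are at least collected the same way. The names `time_callable` and `fake_run` are hypothetical and only for illustration; `fake_run` stands in for `model.run`, and the sketch uses `time.perf_counter` rather than `time.time` because it is monotonic and has higher resolution.

```python
import time


def time_callable(fn, warmup=10, repeat=100):
    """Return the mean wall-clock seconds per call of fn (hypothetical helper)."""
    for _ in range(warmup):
        fn()  # untimed warm-up calls, matching the C++ loop
    start = time.perf_counter()
    for _ in range(repeat):
        fn()
    return (time.perf_counter() - start) / repeat


calls = []


def fake_run():
    # Stand-in for model.run(); records the call and sleeps about 1 ms.
    calls.append(1)
    time.sleep(0.001)


mean_seconds = time_callable(fake_run, warmup=10, repeat=100)
print(f"mean per call: {mean_seconds * 1e3:.3f} ms over {len(calls)} calls")
```

With the real model, the call would presumably be `time_callable(model.run)`. Note that this still measures only the time until each call returns.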

What could explain this discrepancy? The TVM version is 0.8dev0.

Note that the model is executed heterogeneously: some operators run on the CPU and others on the GPU. If a CUDA operator is asynchronous, the method above measures only the kernel launch time, which is shorter than the real execution time. Are there asynchronous operators in TVM?
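
To illustrate the effect I suspect, here is a minimal sketch that simulates an asynchronous operator with a background thread. It does not use TVM or CUDA at all: `FakeAsyncKernel`, `launch`, `sync`, and `measure` are hypothetical names, and the 50 ms sleep stands in for GPU work. It only shows that timing a call that returns before the work finishes underestimates the real cost unless the timer waits for completion.

```python
import threading
import time


class FakeAsyncKernel:
    """Hypothetical stand-in for an asynchronous GPU operator (not a TVM API)."""

    def __init__(self, work_seconds):
        self.work_seconds = work_seconds
        self._worker = None

    def launch(self):
        # Returns right away, like a kernel launch; the work runs in the background.
        self._worker = threading.Thread(target=time.sleep, args=(self.work_seconds,))
        self._worker.start()

    def sync(self):
        # Blocks until the background work has finished, like a device sync.
        self._worker.join()


def measure(kernel, synchronize):
    """Time one launch, optionally waiting for the work to complete."""
    start = time.perf_counter()
    kernel.launch()
    if synchronize:
        kernel.sync()
    elapsed = time.perf_counter() - start
    kernel.sync()  # always drain afterwards so consecutive runs do not overlap
    return elapsed


kernel = FakeAsyncKernel(work_seconds=0.05)
launch_only = measure(kernel, synchronize=False)
with_sync = measure(kernel, synchronize=True)
print(f"launch only: {launch_only * 1e3:.2f} ms, with sync: {with_sync * 1e3:.2f} ms")
```

If TVM's CUDA operators behave like `launch` here, the per-operator timer would need a device synchronization before reading the stop time to capture the real execution time.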