OpenCL GPU running time abnormal

Part of my .cc code:


tvm::runtime::PackedFunc run = gmod.GetFunction("run");
for (int i = 0; i < run_loops; i++) {
  gettimeofday(&start, NULL);
  run();
  gettimeofday(&end, NULL);
  run_cost = ((end.tv_sec * 1000000 + end.tv_usec) - (start.tv_sec * 1000000 + start.tv_usec)) * 1.0 / 1000.f;
  LOG(INFO) << i << "th Run cost average " << run_cost << "ms.";
}


The results when running on OpenCL show that the time cost of the 0th iteration is much larger than the others:


[20:06:45] test_mobilenetV2_graph.cc:92: 0th Run cost average 2010.35ms.
[20:06:45] test_mobilenetV2_graph.cc:92: 1th Run cost average 0.741ms.
[20:06:45] test_mobilenetV2_graph.cc:92: 2th Run cost average 0.609ms.
[20:06:45] test_mobilenetV2_graph.cc:92: 3th Run cost average 0.517ms.
[20:06:45] test_mobilenetV2_graph.cc:92: 4th Run cost average 0.547ms.
[20:06:45] test_mobilenetV2_graph.cc:92: 5th Run cost average 0.543ms.


It's expected, since compilation of the OpenCL kernels happens on their first invocation.

Roger that. But now I want to do some performance testing on a mobile GPU; how can I achieve that without RPC? Thanks a lot.

There are several questions here in the context of the original question.

If you are asking how to measure performance while taking into account the fact that compilation happens on the first run: just make one inference call manually before measuring performance.
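For example, a minimal sketch of that warm-up, reusing the same gmod, run_loops, and gettimeofday-based timing as in the code above:

tvm::runtime::PackedFunc run = gmod.GetFunction("run");

// Warm-up call: triggers OpenCL kernel compilation so it is excluded from the measurements.
run();

struct timeval start, end;
for (int i = 0; i < run_loops; i++) {
  gettimeofday(&start, NULL);
  run();
  gettimeofday(&end, NULL);
  double run_cost = ((end.tv_sec * 1000000 + end.tv_usec) -
                     (start.tv_sec * 1000000 + start.tv_usec)) / 1000.0;
  LOG(INFO) << i << "th Run cost " << run_cost << "ms.";
}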

As for how to measure performance on Android: you can either use the RPC approach or create your own native (C++) application. For RPC you can use Android RPC or tvm_rpc. Both approaches will work.

Do you have a specific issue with the RPC approach? Which one?

Thanks. As you suggest, what I need to do is just exclude the first iteration from my timing loop. But I still have some doubts about the numbers above. The network I used is MobileNet, and the fact that it costs less than 1 ms on a mobile GPU seems abnormal.

What exact hardware are you measuring on? I observed a similar situation on a Kirin platform: I had to put a TVMSynchronize call after each run to get proper timings. It seems execution was postponed until I requested the output. By calling TVMSynchronize after each run, we force each run to finish before reading the clock. I do not know whether that was specific to the Huawei OpenCL software stack or to ARM Mali in general; I have not verified on other Mali devices.

And there is no such problem on Adreno.
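A rough sketch of the timing loop body with that synchronization added; kDLOpenCL and device id 0 are assumptions about the setup, and TVMSynchronize comes from tvm/runtime/c_runtime_api.h:

#include <tvm/runtime/c_runtime_api.h>  // TVMSynchronize, kDLOpenCL (via dlpack)

gettimeofday(&start, NULL);
run();
// Block until the OpenCL command queue has drained, so the measured time
// covers the actual kernel execution rather than just the launch.
TVMSynchronize(kDLOpenCL, /*device_id=*/0, /*stream=*/nullptr);
gettimeofday(&end, NULL);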

Thanks a lot. I can solve this problem with your instructions.
And I have another way to solve it, which is using TVMArrayCopyToBytes to copy data from the GPU to the CPU after each call of run().
But I have another problem: I find that the time measured this way is longer than the result of tvm.benchmark() for the same .so model.
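For reference, a sketch of that alternative using the C++ NDArray::CopyToBytes wrapper (instead of the C-level TVMArrayCopyToBytes); the 1x1000 output size is a placeholder for a MobileNet classification head:

#include <vector>

tvm::runtime::PackedFunc get_output = gmod.GetFunction("get_output");

gettimeofday(&start, NULL);
run();
// Copying the result back to the host forces the OpenCL queue to finish,
// so it acts as an implicit synchronization point.
tvm::runtime::NDArray out = get_output(0);
std::vector<float> host_buf(1000);  // placeholder: 1x1000 MobileNet output
out.CopyToBytes(host_buf.data(), host_buf.size() * sizeof(float));
gettimeofday(&end, NULL);

Note that this measurement then also includes the device-to-host copy, which is one likely reason the numbers come out higher than TVM's built-in benchmarking.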

Thanks. BTW, is there any method to save such OpenCL kernels into a binary file during the first invocation? As far as I can see, it would save the time of init and the first run.

In theory yes; at least there is such a mechanism in OpenCL. On the other hand: 1) the binary will depend on the OpenCL compiler, i.e. there is no guarantee of reusing this kernel from one device to another; 2) the OpenCL compiler works on the device, which is not the current TVM compilation flow and probably will never be extended in such a way.

On the other hand, we could invent something for the runtime part only, to cache compiled kernels. But that would require research and an RFC.
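For illustration only, the raw OpenCL mechanism mentioned above looks roughly like this (program is a hypothetical cl_program handle; this is not part of TVM's compilation or runtime flow, and the saved binary is tied to the driver/device it was built on):

#include <CL/cl.h>
#include <vector>

// Query the size of the compiled binary and fetch it (single-device program assumed).
size_t bin_size = 0;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(bin_size), &bin_size, NULL);

std::vector<unsigned char> binary(bin_size);
unsigned char* bin_ptr = binary.data();
clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(bin_ptr), &bin_ptr, NULL);

// 'binary' can now be written to a file and later reloaded with clCreateProgramWithBinary().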