I ran inference with squeezenet1.1, resnet18_v1, and inceptionv3 on a Mali GPU, measured the performance, and compared it against the CPU.
While measuring GPU performance, I found that the GPU operations are not complete when run() returns.
Instead, they seem to complete during TVMArrayCopyFromTo(gpu_y, cpu_y, ...).
Is there an API that waits until all GPU operations have completed?
BTW, I am using C++ code on the device, so I cannot use ctx.sync(). Is there a C++ sync API?
I see the TVMSynchronize function, but it seems to require creating a runtime stream. Is there an example of this?
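
For reference, here is a minimal sketch of the timing pattern in question: start the clock, launch the work, block until the device is idle, then stop the clock. TimeWithSyncMs is a hypothetical helper, not part of TVM. The run and sync callbacks are placeholders: run would wrap the call to run(), and sync would wrap whatever blocking call turns out to be correct (for example TVMSynchronize with the device type, device id, and a stream handle, which is an assumption to check against the runtime headers).

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper (not a TVM API): returns the wall-clock time in
// milliseconds for one launch, including the wait for the device to finish.
//   run  - launches the work (e.g. a lambda that invokes the module's run()).
//   sync - blocks until the device is idle (e.g. a lambda wrapping a
//          device synchronization call; which call to use is an assumption).
double TimeWithSyncMs(const std::function<void()>& run,
                      const std::function<void()>& sync) {
  assert(run && sync);  // both callbacks must be set

  sync();  // drain previously queued work so it is not counted
  const auto start = std::chrono::steady_clock::now();

  run();   // may return before the device has finished
  sync();  // wait here so the measurement covers the real execution time

  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Without the second sync() call, the timer would only capture the cost of enqueueing the work, which matches the behaviour described above where the wait shows up in the later copy instead.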