TVM's get_output function is time-consuming with Mali OpenCL on RK3399

I am running into very long processing times for the get_output function when I deploy a mobilenet_ssd model on an RK3399 board with the OpenCL target. Here is the relevant part of my script:

m = graph_runtime.create(graph, lib, ctx)
m.set_input(**params)

begin = time.time()
m.set_input('data', tvm.nd.array(x.astype(dtype)))
m.run()
end = time.time()
print('Run time: ' + str(end - begin))

_, outshape = nnvm.compiler.graph_util.infer_shape(graph, shape={"data": dshape})

begin2 = time.time()
tvm_output_0 = m.get_output(0, tvm.nd.empty(tuple(outshape[0]), 'float32'))  # shape: (1, 21, 1917)
tvm_output_1 = m.get_output(1, tvm.nd.empty(tuple(outshape[1]), 'float32'))  # shape: (1, 7668)
tvm_output_2 = m.get_output(2, tvm.nd.empty(tuple(outshape[2]), 'float32'))  # shape: (1, 1917, 4)
end4 = time.time()
print('get output time: ' + str(end4 - begin2))
begin5 = time.time()
final_output = get_multibox_detection_output_tvm(tvm_output_0, tvm_output_1, tvm_output_2, 1, num_anchor, 21)
end5 = time.time()
print('get_multibox_detection_output_tvm time: ' + str(end5 - begin5))

And the output:

  • run time: 0.02s
  • get output time: 1.95s
  • get_multibox_detection_output_tvm time: 1s

This was reported here 979. Is there any update on this issue, or is there any other way to get results from TVM?

OpenCL kernels are lazily executed, so your time measurement of run alone tells you nothing.
We sync the stream when calling get_output.

To get a correct time measurement, you have to time run and get_output together, and do some warm-up runs first.
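For example, a minimal sketch of that measurement, reusing m, x, dtype and outshape from the script above (the loop counts are arbitrary):

import time
import tvm

m.set_input('data', tvm.nd.array(x.astype(dtype)))
out = tvm.nd.empty(tuple(outshape[0]), 'float32')

# warm-up runs so one-time kernel compilation/caching does not skew the numbers
for _ in range(5):
    m.run()
    m.get_output(0, out)

# time run and get_output together: get_output waits for the queued kernels to finish
n_repeat = 20
start = time.time()
for _ in range(n_repeat):
    m.run()
    m.get_output(0, out)
print('mean run + get_output time: %.3f s' % ((time.time() - start) / n_repeat))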


Actually, I tested with multiple images, and that was the result after several loops.
I also tested this code on my PC with an NVIDIA GTX 1080; there the get_output time after several loops is only 0.0007s.

I don't believe mobilenet SSD can run in 50ms on an RK3399. You should time both run and get_output; that is the real time cost, which is 0.05s + 1.95s in your case.

BTW, if you use RPC, you should also subtract the time cost due to the network.

I checked again and the run time is actually about 0.02s. Please note that this run time doesn't include the multibox_detection part, which runs on the CPU. So the real time cost is: run time + get_output time + multibox_detection time. You can check my full script here.

0.02s is impossible. You can see my benchmark results here: https://github.com/dmlc/tvm/wiki/Benchmark#mobile-gpu. Mobilenet with highly tuned kernels takes about 80ms on the RK3399 board.
So we can be sure that run is lazily executed.

In your script, you use 'opencl' as the target, so TVM will use the kernels optimized for the CUDA backend.
For the Mali GPU, you should use "opencl -device=mali" as the target; we have different kernel implementations optimized for Mali. Finally, to get the best performance, you have to follow the auto-tuning tutorial (https://docs.tvm.ai/tutorials/autotvm/tune_nnvm_mobile_gpu.html).
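For reference, a minimal build sketch with that target; net here stands for the nnvm symbol produced by the frontend, and the target_host triple is an assumption that depends on your cross-compilation toolchain:

import nnvm.compiler
import tvm

target = tvm.target.mali()                      # equivalent to "opencl -device=mali"
target_host = "llvm -target=aarch64-linux-gnu"  # assumption: 64-bit toolchain for RK3399

with nnvm.compiler.build_config(opt_level=3):
    graph, lib, params = nnvm.compiler.build(
        net, target=target, target_host=target_host,
        shape={"data": dshape}, params=params)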


Sorry, that script is for running on my PC. I use target=tvm.target.mali, ctx = tvm.cl(0) when deploying on the RK3399. I changed the target to "opencl -device=mali" but the times are the same. You can see the log here:

model compiled.

[11:11:47] /home/firefly/tvm_openCL_support_priorBox/src/pass/vectorize_loop.cc:303: Detect vector condition in Vectorized Loop, scalarizing… (this warning is repeated 8 times)

=======================================================

=======================================================
[11:11:54] /home/firefly/tvm_openCL_support_priorBox/src/runtime/opencl/opencl_device_api.cc:242: Initialize OpenCL platform 'ARM Platform'
[11:11:54] /home/firefly/tvm_openCL_support_priorBox/src/runtime/opencl/opencl_device_api.cc:251: No OpenCL device any device matched given the options: accelerator mode
[11:11:54] /home/firefly/tvm_openCL_support_priorBox/src/runtime/opencl/opencl_device_api.cc:269: opencl(0)='Mali-T860' cl_device_id=0x7f93d40dd8
run time: 46.999, getout_time: 2.03895, detection_time: 0.687, total time: 49.725
run time: 0.018, getout_time: 1.91659, detection_time: 1.128, total time: 3.063
run time: 0.02, getout_time: 1.93148, detection_time: 1.118, total time: 3.069
run time: 0.02, getout_time: 1.91593, detection_time: 1.093, total time: 3.029
run time: 0.019, getout_time: 1.93582, detection_time: 1.14, total time: 3.095
run time: 0.018, getout_time: 1.93465, detection_time: 1.119, total time: 3.072
(more than a hundred further iterations follow with the same pattern: run time ≈ 0.02-0.04s, getout_time ≈ 1.90-2.04s, detection_time ≈ 1.03-1.16s, total ≈ 3.0s)

According to the output, the run time is mostly stable from the second loop onward, at ~0.02s. What is the input size of your mobilenet model? @merrymercy

"tvm.target.mali" is okay. It is the same as "opencl -device=mali".

As the contributor of the Mali backend, I am pretty sure you are wrong…

run only launches the kernels but does not sync the stream.
We only sync the stream when we want to get some data from the GPU to the CPU (i.e. get_output into a CPU TVM array, or calling asnumpy on a GPU TVM array).


Yes, I know you are a big contributor to TVM; I have already referenced your documents and scripts. Something went wrong here…
Could you measure it yourself with my script? I already uploaded my models here.
Anyway, what is the input size of your mobilenet model?

The output of your script is correct and reasonable.
But the time of run is not the time cost of running all the OpenCL kernels.
So your conclusion that get_output is time-consuming is not correct.

When you call run, the CPU puts the kernels into a queue but does not wait for them to finish.
Meanwhile, the GPU fetches commands from the queue and executes them.
When you want to get results back from the GPU (which is where you call get_output), you wait until all commands in that queue have finished. So the correct measurement should include run + get_output; that way we measure the running cost of all kernels.

The benchmark uses 224x224x3 ImageNet input.


Try adding ctx.sync() after run().
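A minimal sketch of what that looks like, reusing the m, ctx and outshape from the earlier script; ctx.sync() blocks until all queued OpenCL kernels have completed:

import time
import tvm

begin = time.time()
m.run()
ctx.sync()  # wait for all queued OpenCL kernels to finish
end = time.time()
print('run time (including kernel execution): ' + str(end - begin))

begin = time.time()
tvm_output_0 = m.get_output(0, tvm.nd.empty(tuple(outshape[0]), 'float32'))
end = time.time()
print('get_output time (now mostly the device-to-host copy): ' + str(end - begin))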

Thanks for your clear answer!
@masahi I added this; now the run time is about 1.95s and the get_output time is 0.005s.

Out of curiosity, is there a reason the time evaluator is not being used here?

That is how we typically do performance measurement, as it allows customizing the number of trials/repetitions and also uses high-precision timers.
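For example, a minimal sketch with the graph runtime's time evaluator (the number/repeat values are arbitrary):

import numpy as np

# time the module's "run" function; the evaluator handles repetition and device sync
ftimer = m.module.time_evaluator("run", ctx, number=10, repeat=3)
prof_res = ftimer()
print('mean inference time: %.2f ms (std %.2f ms)' %
      (np.mean(prof_res.results) * 1000, np.std(prof_res.results) * 1000))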


@eqy Yes! time_evaluator is exactly what I need. Thank you!

@merrymercy Can we sync the stream before transferring data from the GPU to the CPU? This seems necessary for heterogeneous execution. Thanks.

Can ctx.sync() work for you?

I deployed my model (mobilenet_ssd_300x300) with RPC on the RK3399, but the processing time is still very bad (1884.47ms, std dev=6.09ms) compared to the recorded result (78.39ms, std dev=1.33ms). Do you know the reason? @merrymercy

  1. First make sure you can reproduce our benchmark results. We lock the frequency of the Mali GPU and stop the GUI with:
sudo /etc/init.d/lightdm stop
sudo -i
echo performance > /sys/class/misc/mali0/device/devfreq/ff9a0000.gpu/governor
  2. Then you need auto-tuning for your own network (see the sketch after this list).
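A rough sketch of the auto-tuning step with the nnvm-era AutoTVM API, along the lines of the mobile GPU tutorial linked above; the device key "rk3399", tracker host/port, trial count, log file name, and the net/dshape/params variables are placeholders:

import nnvm
import nnvm.compiler
import tvm
from tvm import autotvm
from tvm.autotvm.tuner import XGBTuner

# extract tunable tasks (here conv2d) from the network for the Mali target
tasks = autotvm.task.extract_from_graph(
    net, shape={"data": dshape}, dtype="float32",
    symbols=(nnvm.sym.conv2d,),
    target=tvm.target.mali(), target_host="llvm -target=aarch64-linux-gnu")

# measure candidate kernels on the real board through the RPC tracker
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(timeout=10),
    runner=autotvm.RPCRunner("rk3399", host="0.0.0.0", port=9190,
                             number=10, timeout=5))

for task in tasks:
    tuner = XGBTuner(task, loss_type="rank")
    tuner.tune(n_trial=1000, measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file("mali_tuning.log")])

# rebuild the graph with the best configurations found during tuning
with autotvm.apply_history_best("mali_tuning.log"):
    with nnvm.compiler.build_config(opt_level=3):
        graph, lib, params = nnvm.compiler.build(
            net, target=tvm.target.mali(),
            target_host="llvm -target=aarch64-linux-gnu",
            shape={"data": dshape}, params=params)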

@merrymercy Yes, I successfully reproduced your benchmark results (84.47ms for mobilenet and 181.75ms for resnet-18).
The model that I deployed here doesn't include the multibox_detection part. I removed the multibox_prior part of this model and the result seemed more reasonable (230.95ms, std dev=8.69ms). So maybe the current OpenCL schedule for multibox_prior is not good?

I think this is probably not a copy issue. I would suggest you measure each of the workloads/ops and see where the timing problem is. My problem was caused by inappropriate handling of a certain convolution shape; it was largely solved by PR #1709.
