I tuned a model with the auto-scheduler for 10+ hours on an ARM CPU, but found a big gap between the estimated total latency reported by the tuner (7.501 ms) and the actual measured latency (12.75 ms), as shown below.
| ID | Latency (ms) | Speed (GFLOPS) | Trials |
|----|--------------|----------------|--------|
| 0  | 0.228 | 36.75 | 7168 |
| 1  | 0.001 | 5.64  | 64   |
| 2  | 0.047 | 0.09  | 3136 |
| 3  | 0.057 | 37.04 | 3840 |
| 4  | 0.005 | 24.47 | 192  |
| 5  | 0.114 | 36.85 | 1216 |
| 6  | 0.001 | 4.69  | 64   |
| 7  | 0.001 | 3.47  | 64   |
| 8  | 0.057 | 36.79 | 640  |
| 9  | 0.286 | 27.54 | 3008 |
| 10 | 0.019 | 34.25 | 256  |
| 11 | 0.114 | 36.77 | 320  |
Estimated total latency: 7.501 ms Trials: 19968 Used time : 25604 s Next ID: 2
Evaluate inference time cost...
Mean inference time (std dev): 12.75 ms (0.03 ms)
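For reference, my understanding is that the "Estimated total latency" printed by the task scheduler is the weighted sum of each task's best latency, where the weight is the number of times that subgraph appears in the network (the weights returned by auto_scheduler.extract_tasks); it does not include operators that were never extracted as tuning tasks, or any graph-runtime overhead. A minimal sketch of that computation, with placeholder weights since the real ones are not shown in the table:

# Conceptual sketch, not the tuner's actual code.
# best_latency_ms comes from the table above; task_weight values are placeholders.
best_latency_ms = [0.228, 0.001, 0.047, 0.057, 0.005, 0.114,
                   0.001, 0.001, 0.057, 0.286, 0.019, 0.114]
task_weight = [8, 1, 4, 8, 2, 8, 1, 1, 4, 8, 2, 2]  # placeholder weights

estimated_total_ms = sum(l * w for l, w in zip(best_latency_ms, task_weight))
print("Estimated total latency: %.3f ms" % estimated_total_ms)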
The code I used to evaluate the inference time:
print("Evaluate inference time cost...")
ftimer = module.module.time_evaluator("run", ctx, repeat=10, min_repeat_ms=500)
prof_res = np.array(ftimer().results) * 1000 # convert to millisecond
global result
result = "Mean inference time (std dev): %.2f ms (%.2f ms) " % (
np.mean(prof_res), np.std(prof_res))
print(result)
Is such a big gap acceptable?
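In case it helps narrow down where the extra time is spent, here is a sketch of per-layer profiling with TVM's debug runtime. This is only a sketch: the module is named debug_runtime on older releases (which still use ctx, as above) and debug_executor on newer ones, and graph, lib, params stand in for the artifacts returned by relay.build that were used to create `module`.

# Sketch, assuming graph, lib, params are the relay.build outputs used above.
from tvm.contrib.debugger import debug_runtime  # debug_executor on newer TVM

debug_mod = debug_runtime.create(graph, lib, ctx)
debug_mod.set_input(**params)
debug_mod.run()  # prints a per-operator time breakdown after execution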