We are deploying an SSD model with the virtual machine runtime. The model's backbone is offloaded to TensorRT while the remaining operators run on the CPU, which appears to be the fastest arrangement among the target-allocation strategies we tried.
The mean inference time is 115 ms, while TensorFlow takes less than 80 ms even without TensorRT. When we profile the compiled model with the VirtualMachineProfiler, the invoked ops that take most of the inference time are:
```
#OpName                                    #InvokeCount  #Duration(us): Sum/Mean/Min/Max
tensorrt_2                                 1             53173.5/53173.5/53173.5/53173.5
fused_vision_non_max_suppression           90            22677.2/251.968/230.512/301.215
fused_vision_get_valid_counts              90            21406.4/237.849/207.425/256.47
fused_expand_dims_concatenate_expand_dims  90            6885.56/76.5063/68.671/87.876
```
get_valid_counts is called 90 times in sequence, which matches the number of object classes.
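To make the opportunity concrete, here is a toy sketch (plain Python, not TVM code) of the pattern we are after: the 90 per-class calls are independent of each other, so in principle they could be dispatched concurrently instead of one after another. The `get_valid_counts` stand-in below is hypothetical and only mimics the shape of the real op.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_CLASSES = 90  # matches the per-class invocation count in the profile above

def get_valid_counts(scores):
    # Hypothetical stand-in for fused_vision_get_valid_counts:
    # count the detections whose score clears a threshold.
    return sum(1 for s in scores if s > 0.5)

# Synthetic per-class score lists, one list per object class.
scores_per_class = [[(c * i) % 7 / 7.0 for i in range(100)]
                    for c in range(NUM_CLASSES)]

# Sequential: how the VM currently invokes the op, once per class.
sequential = [get_valid_counts(s) for s in scores_per_class]

# Parallel: the per-class calls share no state, so they can run
# concurrently and must produce identical results.
with ThreadPoolExecutor(max_workers=8) as pool:
    parallel = list(pool.map(get_valid_counts, scores_per_class))

assert parallel == sequential
```

Of course, for this to pay off inside the VM the dispatcher itself would have to issue independent invocations concurrently; the sketch only illustrates that the per-class work is embarrassingly parallel.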
So, is there a way to invoke independent ops in parallel in the VirtualMachine to improve performance?