We are deploying an SSD model with the virtual machine runtime. The model's backbone is offloaded to TensorRT while the remaining operators run on the CPU, which appears to be the fastest arrangement among the target-allocation strategies we tried.
The mean inference time is 115 ms, while TensorFlow takes less than 80 ms even without TensorRT. When we profile the compiled model with the VirtualMachineProfiler, the invoked ops that take most of the inference time are:
```
#OpName                                    #InvokeCount  #Duration(us): Sum/Mean/Min/Max
tensorrt_2                                 1             53173.5/53173.5/53173.5/53173.5
fused_vision_non_max_suppression           90            22677.2/251.968/230.512/301.215
fused_vision_get_valid_counts              90            21406.4/237.849/207.425/256.47
fused_expand_dims_concatenate_expand_dims  90            6885.56/76.5063/68.671/87.876
```
get_valid_counts is called 90 times in sequence, which matches the number of object classes.
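To make the opportunity concrete, here is a toy sketch (plain Python, not TVM code) of the pattern we are after: the 90 per-class calls are independent of each other, so in principle they could be dispatched concurrently instead of one after another. The `get_valid_counts` stand-in below is hypothetical and only mimics the shape of the real op.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_CLASSES = 90  # matches the per-class invocation count in the profile above

def get_valid_counts(scores):
    # Hypothetical stand-in for fused_vision_get_valid_counts:
    # count the detections whose score clears a threshold.
    return sum(1 for s in scores if s > 0.5)

# Synthetic per-class score lists, one list per object class.
scores_per_class = [[(c * i) % 7 / 7.0 for i in range(100)]
                    for c in range(NUM_CLASSES)]

# Sequential: how the VM currently invokes the op, once per class.
sequential = [get_valid_counts(s) for s in scores_per_class]

# Parallel: the per-class calls share no state, so they can run
# concurrently and must produce identical results.
with ThreadPoolExecutor(max_workers=8) as pool:
    parallel = list(pool.map(get_valid_counts, scores_per_class))

assert parallel == sequential
```

Of course, for this to pay off inside the VM the dispatcher itself would have to issue independent invocations concurrently; the sketch only illustrates that the per-class work is embarrassingly parallel.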
So, is there a way to invoke independent ops in parallel in the VirtualMachine to improve performance?