lp6m
July 21, 2021, 5:11am
1
I want to compile an ONNX model with a NonMaxSuppression layer in TVM.
To compile this model without using the Relay VM, I need every layer to have a static shape.
With freeze_params=True, all layers except vision.all_class_non_max_suppression() become static-shaped.
relay.frontend.from_onnx converts the ONNX NonMaxSuppression layer to tvm.relay.vision.all_class_non_max_suppression.
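For context, this is roughly how I import the model (a minimal sketch, not my exact script; the model path, input name, and shape are placeholders):

```python
import onnx
from tvm import relay

# Placeholder model path and input shape.
onnx_model = onnx.load("model.onnx")
shape_dict = {"input": (1, 3, 512, 512)}

# freeze_params=True folds the ONNX initializers into Relay constants,
# which lets shape inference make most layers static.
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict, freeze_params=True)
```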
The op was introduced in the following PR (masahi:all-class-nms-final into main, opened 06:00AM - 05 Apr 21 UTC):
This PR adds a new variant of NMS that better supports NMS in ONNX and combined NMS in TF than the existing NMS operator in TOPI/Relay. For now, I'm calling this variant of NMS "All class NMS".
https://github.com/onnx/onnx/blob/master/docs/Operators.md#NonMaxSuppression
https://www.tensorflow.org/api_docs/python/tf/image/combined_non_max_suppression
The biggest difference between our NMS and "All class NMS" is that in our NMS, a single box is associated with a single class, while in the latter case a single box has scores for all classes, and NMS is performed for each class separately. `max_out_size` parameter is also applied per class, rather than to all boxes as in our implementation.
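(Not from the PR, just for concreteness: a shape-level sketch of the two layouts, assuming batch=1, num_boxes=4, num_classes=3.)

```python
import numpy as np

# Existing TVM NMS: one class per box, everything packed into a single tensor of
# shape (batch, num_boxes, 6), with the default layout [class_id, score, x1, y1, x2, y2].
data = np.zeros((1, 4, 6), dtype="float32")

# "All class NMS" (ONNX NonMaxSuppression / TF combined NMS): boxes and scores are
# separate tensors, every box has a score for every class, and NMS runs per class.
boxes = np.zeros((1, 4, 4), dtype="float32")   # (batch, num_boxes, 4)
scores = np.zeros((1, 3, 4), dtype="float32")  # (batch, num_classes, num_boxes)
```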
Until now, we've been supporting this variant of NMS via complicated encodings using our implementation of NMS. It kind of "works" in practice, but there are many problems with it:
* The ONNX NMS converter https://github.com/apache/tvm/pull/6839 is extremely complicated, and performance is bad because it does small NMS repeatedly inside a Relay while loop. It also easily introduces the "zero box problem", because "All class NMS" encoded via one-class NMS is more likely to result in zero detections. We needed to add an ad hoc patch like https://github.com/apache/tvm/pull/7691 to work around this problem.
* https://github.com/apache/tvm/pull/7520 has a bug in `max_out_size` handling. Since in our NMS `max_out_size` is applied to all boxes, we cannot translate "All class NMS" into a single call to our NMS.
For these reasons, I decided it is better to introduce a new variant of NMS to overcome these pains. This breaks our general "one operator to support all frameworks" philosophy, but the two variants of NMS are so different it doesn't make sense to call them the same op.
The result is significant: using the new NMS, I got a speedup of **1 second** on mlperf SSD resnet34, running on vk + amd. This is an extreme case, in that the existing approach, which calls `get_valid_counts` and `non_max_suppression` 80 times in a while loop, is extremely slow on vk + amd for some reason, taking literally 1 second. Now it is only **5.7 milliseconds**.
## Implementation details
The new NMS implementation consists of the following steps:
* Sort scores and return *both* the sorted scores and the sorted indices. The existing NMS only uses the sorted indices, while here the sorted scores are also used for the binary search in the next step.
* Do binary search on sorted scores to find the index of the box whose score is just below `score_threshold`. This gives what we call `valid_count[i]` in the existing NMS, computed by `get_valid_counts`.
* Do NMS, parallelized over batch * class. The inner loop uses the same NMS IR as the existing one.
* After the previous step, we end up with indices of size `(batch * num_class, num_boxes)` and a tensor `num_detections` of size `(batch * num_class,)` holding the number of surviving boxes per row. We need to copy `num_detections[i]` indices from each row into one linear output. This is a perfect application of exclusive scan: doing an exclusive scan on `num_detections` gives the row offset to write into for each row.
The efficiency of the new implementation comes from:
* The `get_valid_counts` call is replaced with a binary search
* Per-class independent NMS is done in parallel across different blocks on the GPU. This alone gives a `num_class`x speedup over the existing encoding in the ONNX / TF frontends.
* The ONNX NMS frontend is now trivial and there is no triple nested loop etc. It seems overhead on the host side is very large on vk + amd.
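For reference, a minimal numpy sketch (mine, not code from the PR) of the exclusive-scan compaction step described in the implementation details above:

```python
import numpy as np

# Per-row NMS results for 3 (batch * class) rows with up to 4 boxes each:
# surviving box ids, padded with -1, plus the number of survivors per row.
indices = np.array([[ 2,  0, -1, -1],
                    [ 1,  3,  0, -1],
                    [-1, -1, -1, -1]])
num_detections = np.array([2, 3, 0])

# Exclusive scan over num_detections gives each row's write offset
# in the flattened output.
row_offsets = np.concatenate(([0], np.cumsum(num_detections)[:-1]))  # [0, 2, 5]

out = np.empty(num_detections.sum(), dtype=indices.dtype)
for i, (offset, count) in enumerate(zip(row_offsets, num_detections)):
    out[offset:offset + count] = indices[i, :count]

print(out)  # [2 0 1 3 0]
```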
Currently, the output of vision.all_class_non_max_suppression has a dynamic shape because the number of bboxes is determined dynamically.
In vision.non_max_suppression, the output can be made static-shaped by setting return_indices=False.
Similarly, is it possible to extend the implementation of all_class_non_max_suppression so that its output has a static shape?
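For reference, this is the static-shape case I mean with vision.non_max_suppression (a minimal sketch with placeholder shapes): with return_indices=False, the output shape is just the input shape, regardless of how many boxes survive.

```python
from tvm import relay

data = relay.var("data", shape=(1, 100, 6), dtype="float32")
valid_count = relay.var("valid_count", shape=(1,), dtype="int32")
indices = relay.var("indices", shape=(1, 100), dtype="int32")

# Suppressed boxes are marked invalid in place rather than removed,
# so the result keeps the static (1, 100, 6) shape.
out = relay.vision.non_max_suppression(
    data, valid_count, indices, max_output_size=-1,
    iou_threshold=0.5, return_indices=False,
)
```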
lp6m
July 21, 2021, 5:32am
2
More precisely, the question is whether it is possible to make the output of the strided_slice inserted after vision.all_class_non_max_suppression() a static shape when ONNX NonMaxSuppression is converted to Relay IR. The relevant part of the ONNX frontend converter looks like this:
```python
        max_output_boxes_per_class = conditionally_squeeze_scalar(max_output_boxes_per_class)
        iou_threshold = conditionally_squeeze_scalar(iou_threshold)
        score_threshold = conditionally_squeeze_scalar(score_threshold)

        nms_out = _op.vision.all_class_non_max_suppression(
            boxes, scores, max_output_boxes_per_class, iou_threshold, score_threshold
        )

        return _op.strided_slice(nms_out[0], _op.const([0], dtype="int64"), nms_out[1])
```
fPecc
June 1, 2022, 2:53pm
3
Hi @lp6m, did you find a solution for this issue? I am facing exactly the same problem trying to use an ONNX model!
No, the NMS op in ONNX returns a dynamic shape, so we can't simply drop the strided_slice there. The output of all_class_non_max_suppression itself is of static shape.
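For reference, a sketch (placeholder shapes, not code from the frontend) of what the op itself returns with the default output_format="onnx": both outputs have fixed shapes, and it is the strided_slice added by the ONNX converter on top that makes the converted graph dynamic.

```python
from tvm import relay

# Placeholder shapes: batch=1, 100 boxes, 3 classes.
boxes = relay.var("boxes", shape=(1, 100, 4), dtype="float32")
scores = relay.var("scores", shape=(1, 3, 100), dtype="float32")

nms_out = relay.vision.all_class_non_max_suppression(
    boxes, scores,
    max_output_boxes_per_class=10,
    iou_threshold=0.5,
    score_threshold=0.05,
)

selected = nms_out[0]        # static shape (1 * 3 * 100, 3): (batch_id, class_id, box_id) rows
num_detections = nms_out[1]  # static shape (1,): how many of those rows are valid at runtime
# The ONNX converter then slices `selected` down to `num_detections` rows,
# and that slice is what has a dynamic shape.
```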
fPecc
June 2, 2022, 6:55am
5
I was trying to make it a static shape because I thought that was what was causing my relay.build to break. However, even after hardcoding the stride to get rid of the dynamic strided_slice op, I was still hitting the issue, so I guess I have another problem as well. I will create a new post about that problem.
Oh yes, for models containing dynamic shapes, you cannot use relay.build. You need to use the VM compiler and runtime. See https://github.com/apache/tvm/blob/main/gallery/how_to/deploy_models/deploy_object_detection_pytorch.py#L130-L139
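Something along these lines (a minimal sketch based on that tutorial, using the mod and params returned by relay.frontend.from_onnx; the target and input name here are just placeholders):

```python
import tvm
from tvm import relay
from tvm.runtime.vm import VirtualMachine

target = "llvm"  # placeholder target

# Compile with the Relay VM instead of relay.build so that
# dynamic-shape ops are supported.
with tvm.transform.PassContext(opt_level=3):
    vm_exec = relay.vm.compile(mod, target=target, params=params)

dev = tvm.device(target, 0)
vm = VirtualMachine(vm_exec, dev)
vm.set_input("main", **{"input": input_data})  # placeholder input name / data
result = vm.run()
```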