Grouped Mathematical Differentiable NMS For Object Detection
20220375237 · 2022-11-24
Assignee
Inventors
- Xiaoming Liu (Okemos, MI, US)
- Abhinav KUMAR (East Lansing, MI, US)
- Garrick BRAZIL (East Lansing, MI, US)
CPC classification
G06V10/255
PHYSICS
G06V10/26
PHYSICS
International classification
G06V10/26
PHYSICS
G06V10/762
PHYSICS
Abstract
An end-to-end trainable grouped mathematically differentiable non-maximal suppression (NMS) technique is presented for monocular 3D object detection. First, the NMS is formulated as a matrix operation, and the boxes are then grouped and masked in an unsupervised manner to obtain a simple closed-form expression for the NMS. This technique addresses the mismatch between training and inference pipelines and, therefore, forces the network to select the best 3D box in a differentiable manner. As a result, the proposed technique achieves state-of-the-art monocular 3D object detection results on the KITTI benchmark dataset, performing comparably to monocular video-based methods.
Claims
1. A computer-implemented method for detecting objects in an image, comprising: receiving, by a computer processor, a set of predicted bounding boxes from at least one of a two-dimensional classification head or a three-dimensional classification head; receiving, by the computer processor, scores for each of the predicted bounding boxes in the set of predicted bounding boxes, where the scores for each of the predicted bounding boxes are in the form of a vector; computing, by the computer processor, a set of intersection over union (IoU) measures for the set of predicted bounding boxes; grouping, by the computer processor, boxes in the set of predicted bounding boxes into one or more groups of predicted bounding boxes; for each group in the one or more groups of predicted bounding boxes, calculating, by the computer processor, rescores for each of the predicted bounding boxes in the set of predicted bounding boxes by performing matrix operations on the vector of scores in accordance with a closed-form expression; and selecting a subset of boxes from the set of predicted bounding boxes for each object in the image using the rescores for each of the predicted bounding boxes.
2. The method of claim 1 further comprises capturing image data for a scene using an imaging device; and determining the set of predicted bounding boxes from the image data.
3. The method of claim 1 further comprises sorting the scores for each of the predicted bounding boxes in the set of predicted bounding boxes in a descending order prior to the step of grouping boxes in the set of predicted bounding boxes.
4. The method of claim 3 further comprises sorting the IoU measures in the set of IoU measures using permutation of the scores for each of the predicted bounding boxes, where the IoU measures are arranged in an IoU matrix.
5. The method of claim 4 wherein grouping boxes in the set of predicted bounding boxes further comprises identifying a box with highest score in the set of predicted bounding boxes; grouping boxes from the set of predicted bounding boxes with highest overlap with the identified box into a given group; deleting boxes in the given group from the set of predicted bounding boxes; and repeating these steps until there are no boxes in the set of predicted bounding boxes.
6. The method of claim 5 further comprises calculating rescores for each of the predicted bounding boxes according to
r≈└(I−M⊙P)s┐. (8) where I is an identity matrix, M is a mask corresponding to a given group, and P is a prune matrix which is obtained by element-wise operation of a pruning function on the IoU matrix, and the pruning function decides whether to keep a box in a set of final predictions based on overlap of IoU measures.
7. The method of claim 6 wherein the multiplication of the prune matrix with the mask only keeps a column of the prune matrix corresponding to a box having the highest score while assigning zero to remaining columns of the prune matrix.
8. The method of claim 6 wherein the pruning function is linear.
9. The method of claim 1 wherein calculating rescores for each of the predicted bounding boxes is performed using a single layer neural network.
10. A non-transitory computer-readable medium having computer-executable instructions residing thereon and that, upon execution of the instructions by a processor of a computer, cause the computer to perform the steps of: receiving image data for a scene from an imaging device; determining a set of predicted bounding boxes from the image data; receiving scores for each of the predicted bounding boxes in the set of predicted bounding boxes, where the scores for each of the predicted bounding boxes are in the form of a vector; computing a set of intersection over union (IoU) measures for the set of predicted bounding boxes; grouping boxes in the set of predicted bounding boxes into one or more groups of predicted bounding boxes; for each group in the one or more groups of predicted bounding boxes, calculating rescores for each of the predicted bounding boxes in the set of predicted bounding boxes by performing matrix operations on the vector of scores in accordance with a closed-form expression; and selecting a subset of boxes for each object in the image from the set of predicted bounding boxes using the rescores for each of the predicted bounding boxes.
11. The non-transitory computer-readable medium of claim 10 further comprises sorting the scores for each of the predicted bounding boxes in the set of predicted bounding boxes in a descending order prior to the step of grouping boxes in the set of predicted bounding boxes.
12. The non-transitory computer-readable medium of claim 11 further comprises sorting the IoU measures in the set of IoU measures using permutation of the scores for each of the predicted bounding boxes, where the IoU measures are arranged in an IoU matrix.
13. The non-transitory computer-readable medium of claim 12 wherein grouping boxes in the set of predicted bounding boxes further comprises identifying a box with highest score in the set of predicted bounding boxes; grouping boxes from the set of predicted bounding boxes with highest overlap with the identified box into a given group; deleting boxes in the given group from the set of predicted bounding boxes; and repeating these steps until there are no boxes in the set of predicted bounding boxes.
14. The non-transitory computer-readable medium of claim 13 further comprises calculating rescores for each of the predicted bounding boxes according to
r≈└(I−M⊙P)s┐. (8) where I is an identity matrix, M is a mask corresponding to a given group, and P is a prune matrix which is obtained by element-wise operation of a pruning function on the IoU matrix, and the pruning function decides whether to keep a box in a set of final predictions based on overlap of IoU measures.
15. The non-transitory computer-readable medium of claim 14 wherein the multiplication of the prune matrix with the mask only keeps a column of the prune matrix corresponding to a box having the highest score while assigning zero to remaining columns of the prune matrix.
16. The non-transitory computer-readable medium of claim 14 wherein the pruning function is linear.
17. The non-transitory computer-readable medium of claim 10 wherein calculating rescores for each of the predicted bounding boxes is performed using a single layer neural network.
18. A computer-implemented method for detecting objects in an image, comprising: capturing, by an imaging device, image data for a scene; determining, by a computer processor, a set of predicted bounding boxes from the image data; receiving, by the computer processor, scores for each of the predicted bounding boxes in the set of predicted bounding boxes, where the scores for each of the predicted bounding boxes are in the form of a vector; sorting the scores for each of the predicted bounding boxes in the set of predicted bounding boxes in a descending order; computing, by the computer processor, a set of intersection over union (IoU) measures for the set of predicted bounding boxes; grouping, by the computer processor, boxes in the set of predicted bounding boxes into one or more groups of predicted bounding boxes; for each group in the one or more groups of predicted bounding boxes, calculating, by the computer processor, rescores for each of the predicted bounding boxes in the set of predicted bounding boxes by performing matrix operations on the vector of scores in accordance with a closed-form expression; and selecting, by the computer processor, one or more boxes for each object in the image from the set of predicted bounding boxes using the rescores for each of the predicted bounding boxes.
19. The method of claim 18 further comprises navigating a controlled object through the scene based on the selected one or more boxes.
Description
DRAWINGS
[0012] The drawings described herein are for illustrative purposes only of selected embodiments and not all possible implementations, and are not intended to limit the scope of the present disclosure.
[0013]
[0014]
[0015]
[0016]
[0017]
[0018]
[0019]
[0020] Corresponding reference numerals indicate corresponding parts throughout the several views of the drawings.
DETAILED DESCRIPTION
[0021] Example embodiments will now be described more fully with reference to the accompanying drawings.
[0022]
[0023] Let B={b.sub.i}.sup.n.sub.i=1 denote the set of boxes or proposals b.sub.i from an image. Let s={s.sub.i}.sup.n.sub.i=1 and r={r.sub.i}.sup.n.sub.i=1 denote their scores (before NMS) and rescores (after NMS) respectively such that r.sub.i, s.sub.i≥0∀i. D denotes the subset of B after the NMS. Let O=[o.sub.ij] denote the n×n matrix with o.sub.ij denoting the IoU.sub.2D of b.sub.i and b.sub.j. The pruning function p decides how to rescore a set of boxes D based on IoU.sub.2D overlaps of its neighbors, sometimes suppressing boxes entirely. In other words, p(o.sub.i)=1 denotes the box b.sub.i is suppressed while p(o.sub.i)=0 denotes b.sub.i is kept in D. N.sub.t denotes the NMS threshold while T denotes the temperature.
[0024] B is partitioned into different groups G={G.sub.k}. B.sub.Gk denotes the subset of B belonging to group k. Thus, B.sub.Gk={b.sub.i}∀b.sub.i∈G.sub.k and B.sub.Gk∩B.sub.Gl=Ø∀k≠l. G.sub.k in the subscript of a variable denotes its subset corresponding to B.sub.Gk. Thus, s.sub.Gk and r.sub.Gk denote the scores and the rescores of B.sub.Gk, respectively.
[0025] ∨ denotes the logical OR, while └x┐ denotes clipping of x to the range [0, 1]. |s| denotes the number of elements in s. Δ in the subscript denotes the lower triangular version of the matrix without the principal diagonal. ⊙ denotes the element-wise multiplication. I denotes the identity matrix.
[0026] NMS is one of the building blocks in 2D and 3D object detection whose high-level goal is to iteratively suppress boxes which have too much Intersection over Union (IoU) with a nearby high-scoring box. Classical NMS uses the idea that a box which has a high IoU.sub.2D overlap with any of the already selected boxes should be suppressed to zero. In other words, it uses a hard pruning function p without any temperature T. Soft-NMS makes this pruning soft using the temperature T. Classical and Soft-NMS thus only differ in the choice of p. Algorithm 1 below sets forth an example of classical/Soft-NMS.
TABLE-US-00001
Algorithm 1: Classical/Soft-NMS [6]
Input: s: scores, O: IoU.sub.2D matrix, N.sub.t: NMS threshold, p: pruning function, τ: temperature
Output: d: box index after NMS, r: scores after NMS
1 begin
2 |  d ← {}
3 |  t ← range(|s|)                        ▹ All box indices
4 |  r ← s
5 |  while t ≠ empty do
6 |  |  v ← argmax r[t]                    ▹ Top scored box
7 |  |  d ← d ∪ v                          ▹ Add to valid box index
8 |  |  t ← t − v                          ▹ Remove from t
9 |  |  for i ← 1 to |t| do
10 |  |  |  r.sub.i ← (1 − p.sub.τ(O[v,i]))r.sub.i     ▹ Rescore
11 |  |  end
12 |  end
13 end
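For illustration only, the greedy rescoring of Algorithm 1 may be sketched in Python as follows. The function and variable names are illustrative and not part of the claimed subject matter; axis-aligned boxes with a precomputed, symmetric IoU matrix (ones on the diagonal) are assumed.

```python
import numpy as np

def hard_prune(o, nt=0.4):
    """Classical NMS pruning: suppress fully when overlap exceeds N_t."""
    return 1.0 if o > nt else 0.0

def classical_nms(scores, iou, prune=hard_prune):
    """Greedy rescoring per Algorithm 1: iteratively pick the top-scored box
    and rescore the remaining boxes based on their overlap with it."""
    r = scores.astype(float).copy()
    t = list(range(len(scores)))        # indices still to process
    d = []                              # box index order after NMS
    while t:
        v = t[int(np.argmax(r[t]))]     # top-scored remaining box
        d.append(v)                     # add to valid box index
        t.remove(v)                     # remove from t
        for i in t:                     # rescore neighbours of v
            r[i] *= 1.0 - prune(iou[v, i])
    return d, r
```

With a soft pruning function in place of `hard_prune`, the same loop realizes Soft-NMS; only the choice of p changes, as the text notes.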
[0027] Classical NMS greedily calculates its rescores r.sub.i over the sorted set of boxes B and is thus neither parallelizable nor differentiable. This disclosure aims to find a smooth approximation of the NMS in closed-form to include it in the training pipeline.
[0028]
[0029] The rescoring process of the classical NMS is greedy set-based and only takes the overlap with unsuppressed boxes into account. One can generalize this into a matrix formulation by accounting for the effect of all (suppressed and unsuppressed) boxes as
r.sub.i=max(s.sub.i−Σ.sub.j<ip(o.sub.ij)r.sub.j, 0), (2)
using the relaxation of the logical OR ∨ operator as Σ. The presence of r.sub.j on the RHS of equation (2) prevents suppressed boxes from influencing other boxes hugely. When p outputs discretely as {0, 1} as in classical NMS, scores s.sub.i are guaranteed to be suppressed to r.sub.i=0 or left unchanged r.sub.i=s.sub.i, thereby implying r.sub.i≤s.sub.i∀i. One can write the rescores r as
r=s−Pr, if s−Pr≥0, (3)
r=0, otherwise. (4)
[0030] The above two equations are written compactly as
r=max(s−Pr,0), (5)
where P, called the prune matrix, is obtained by element-wise operation of the pruning function p on O. The maximum operation makes equation (5) non-linear and, thus, difficult to solve. However, to avoid recursion, one can use
r≈└(I+P).sup.−1s┐ (6)
as the solution to equation (5), with I being the identity matrix. Intuitively, if the matrix inversion is considered division in equation (6) and the boxes have overlaps, the rescores are the scores divided by a number greater than one and are, therefore, smaller than the scores. If the boxes do not overlap, the division is by one and the rescores equal the scores. Note that I+P in equation (6) is a lower triangular matrix with ones on the principal diagonal. Hence, I+P is always full rank and, therefore, always invertible.
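The closed-form rescoring of equation (6) may be sketched as follows, assuming boxes already sorted by descending score so that P is strictly lower triangular; names and the example pruning function are illustrative.

```python
import numpy as np

def matrix_nms_rescores(s, O, prune=lambda o: 1.0 * (o > 0.4)):
    """r ≈ └(I + P)^{-1} s┐ per equation (6), with └·┐ as clipping to [0, 1].

    P keeps only overlaps with earlier (higher-scored) boxes, so I + P is
    unit lower triangular and therefore always invertible.
    """
    n = len(s)
    P = np.tril(prune(O), k=-1)        # strictly lower triangular prune matrix
    r = np.linalg.solve(np.eye(n) + P, s)   # solve instead of forming the inverse
    return np.clip(r, 0.0, 1.0)
```

On a small sorted example this reproduces the classical NMS outcome: the second box, heavily overlapping the top box, is driven to zero while non-overlapping boxes keep their scores.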
[0031] Next, observe that the object detectors output multiple boxes for an object in an image, and a good detector outputs boxes wherever it finds some objects in the monocular image. Therefore, the boxes in an image are clustered in an unsupervised manner based on IoU.sub.2D overlaps to obtain the groups G. Grouping thus mimics the grouping of classical NMS, but it does not rescore the boxes. Since clustering limits the interactions to intra-group interactions among the boxes, one can write equation (6) as
r.sub.Gk≈└(I.sub.Gk+P.sub.Gk).sup.−1s.sub.Gk┐ (7)
Grouping hence helps in taking smaller matrix inverses in equation (7) compared to equation (6).
[0032] In one embodiment, a simplistic grouping algorithm is used where a group G.sub.k is formed with boxes which have high IoU.sub.2D overlap with the top-ranked box. As the group size is limited by α, a minimum of α and the number of boxes in G.sub.k is chosen. Next, all the boxes of this group are deleted, and these steps are iterated until one runs out of boxes. Also, grouping uses IoU.sub.2D since one can achieve meaningful clustering in 2D. This unsupervised grouping method is set forth in Algorithm 3 below.
TABLE-US-00002
Algorithm 3: Grouping of boxes
Input: O: sorted IoU.sub.2D matrix, N.sub.t: NMS threshold, α: group size
Output: G: Groups
1 begin
2 |  G ← {}
3 |  t ← range(O.shape[0])                 ▹ All box indices
4 |  while t ≠ empty do
5 |  |  u ← O[:,0] > N.sub.t               ▹ High overlap indices
6 |  |  v ← t[u]                           ▹ New group
7 |  |  n.sub.Gk ← min(|v|, α)
8 |  |  G.insert(v[: n.sub.Gk])            ▹ Insert new group
9 |  |  w ← O[:,0] ≤ N.sub.t               ▹ Low overlap indices
10 |  |  t ← t[w]                          ▹ Keep w indices in t
11 |  |  O ← O[w][:,w]                     ▹ Keep w indices in O
12 |  end
13 end
Other grouping methods are also contemplated with the broader aspects of this disclosure.
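The unsupervised grouping of Algorithm 3 may be sketched as follows, assuming a score-sorted IoU.sub.2D matrix with ones on the diagonal (so the top-ranked box always joins its own group); names are illustrative.

```python
import numpy as np

def group_boxes(O, nt=0.4, alpha=100):
    """Greedy grouping on a score-sorted IoU matrix (sketch of Algorithm 3).

    Each round collects all boxes with high overlap with the current
    top-ranked box (column 0), caps the group at alpha boxes, then deletes
    the grouped rows/columns and repeats until no boxes remain.
    """
    t = np.arange(O.shape[0])          # remaining (sorted) box indices
    groups = []
    while t.size:
        u = O[:, 0] > nt               # high overlap with the top box; O[0,0]=1 keeps it
        groups.append(t[u][:alpha].tolist())   # new group, capped at alpha
        w = ~u                         # low-overlap boxes survive to the next round
        t = t[w]
        O = O[w][:, w]
    return groups
```

Because each round removes at least the top box, the loop always terminates; groups are disjoint by construction, matching B.sub.Gk∩B.sub.Gl=Ø.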
[0033] Classical NMS considers the IoU.sub.2D of the top-scored box with other boxes. This consideration is equivalent to only keeping the column of O corresponding to the top box while assigning the rest of the columns to be zero. This is implemented through masking of P.sub.Gk. Let M.sub.Gk denote the binary mask corresponding to group G.sub.k. Then, only one of the columns in M.sub.Gk⊙P.sub.Gk is non-zero. Thus, I.sub.Gk+M.sub.Gk⊙P.sub.Gk now becomes a Frobenius matrix (Gaussian transformation) and is, therefore, inverted by simply subtracting the second term. In other words, (I.sub.Gk+M.sub.Gk⊙P.sub.Gk).sup.−1=I.sub.Gk−M.sub.Gk⊙P.sub.Gk. Hence, equation (7) is further simplified to
r.sub.Gk≈└(I.sub.Gk−M.sub.Gk⊙P.sub.Gk)s.sub.Gk┐ (8)
Thus, masking allows one to bypass the computationally expensive matrix inverse operation altogether.
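The Frobenius-matrix identity underlying this simplification can be checked numerically: when the mask keeps a single column of the strictly lower-triangular prune matrix, the inverse of I+M⊙P is exactly I−M⊙P. The snippet below is a verification sketch, not part of the claimed method.

```python
import numpy as np

# Verify (I + M⊙P)^{-1} = I − M⊙P when M keeps only one column of a
# strictly lower-triangular P (a Frobenius/Gauss transformation matrix).
n = 4
rng = np.random.default_rng(0)
P = np.tril(rng.random((n, n)), k=-1)   # strictly lower triangular prune matrix
M = np.zeros((n, n))
M[:, 0] = 1.0                           # mask keeps only the top box's column
F = np.eye(n) + M * P                   # element-wise product realizes M⊙P
assert np.allclose(np.linalg.inv(F), np.eye(n) - M * P)
```

The identity holds because M⊙P has the form v·e.sub.0.sup.T with the first entry of v equal to zero, so (I+v e.sub.0.sup.T)(I−v e.sub.0.sup.T)=I.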
[0034] Based on equation (8), an improved non-maximal suppression technique is presented in Algorithm 2.
TABLE-US-00003
Algorithm 2: GrooMeD-NMS
Input: s: scores, O: IoU.sub.2D matrix, N.sub.t: NMS threshold, p: pruning function, v: valid box threshold, α: group size
Output: d: box index after NMS, r: scores after NMS
1 begin
2 |  s, index ← sort(s, descending=True)   ▹ Sort s
3 |  O ← O[index][:,index]                 ▹ Sort O
4 |  O.sub.Δ ← lower(O)                    ▹ Lower triangular matrix
5 |  P ← p(O.sub.Δ)                        ▹ Prune matrix
6 |  I ← Identity(|s|)                     ▹ Identity matrix
7 |  G ← group(O, N.sub.t, α)              ▹ Group boxes B
8 |  for k ← 1: |G| do
9 |  |  M.sub.Gk ← zeros(|G.sub.k|, |G.sub.k|)   ▹ Prepare mask
10 |  |  M.sub.Gk[:, G.sub.k[0]] ← 1       ▹ First col of M.sub.Gk
11 |  |  r.sub.Gk ← └(I.sub.Gk − M.sub.Gk ⊙ P.sub.Gk)s.sub.Gk┐   ▹ Rescore
12 |  end
13 |  d ← index[r ≥ v]                     ▹ Valid box index
14 end
This technique is referred to herein as Grouped Mathematically Differentiable Non-Maximal Suppression or GrooMeD-NMS.
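Putting the pieces together, Algorithm 2 may be sketched end-to-end as follows. This is an illustrative NumPy sketch with hypothetical names; it assumes an IoU matrix with unit diagonal, uses the linear pruning function, and (like the simplistic grouping above) leaves any boxes beyond the group-size cap α at a rescore of zero.

```python
import numpy as np

def linear_prune(o):
    return o                                  # linear pruning function p(o) = o

def groomed_nms(s, O, nt=0.4, v_thresh=0.3, alpha=100, prune=linear_prune):
    """Sketch of GrooMeD-NMS: grouped closed-form rescoring per equation (8)."""
    order = np.argsort(-s)                    # hard descending sort of scores
    s_sorted, O_sorted = s[order], O[order][:, order]
    P = np.tril(prune(O_sorted), k=-1)        # prune matrix, strictly lower triangular
    r_sorted = np.zeros_like(s_sorted, dtype=float)

    t = np.arange(len(s))                     # grouping on the sorted IoU matrix
    Og = O_sorted.copy()
    while t.size:
        u = Og[:, 0] > nt                     # boxes overlapping the top-ranked box
        g = t[u][:alpha]                      # one group, capped at alpha boxes
        M = np.zeros((g.size, g.size))
        M[:, 0] = 1.0                         # keep only the top box's column
        Pg = P[np.ix_(g, g)]
        # r_Gk = clip((I − M⊙P) s_Gk) — no matrix inverse needed
        r_sorted[g] = np.clip((np.eye(g.size) - M * Pg) @ s_sorted[g], 0.0, 1.0)
        t, Og = t[~u], Og[~u][:, ~u]

    r = np.empty_like(r_sorted)
    r[order] = r_sorted                       # rescores in original box order
    d = order[r_sorted >= v_thresh]           # valid box indices
    return d, r
```

Every step is composed of sorting, masking, and matrix-vector products, which is what makes the rescoring differentiable with respect to the input scores.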
[0035]
[0036] As explained above, the pruning function p decides whether to keep the box in the final set of predictions D or not, based on IoU.sub.2D overlaps. In other words, p(o.sub.i)=1 denotes that the box b.sub.i is suppressed, while p(o.sub.i)=0 denotes that b.sub.i is kept in D.
[0037] Classical NMS uses the threshold as the pruning function, which does not give useful gradients. This disclosure considered three different functions for p: a linear, a temperature (T)-controlled exponential, and a sigmoidal function. The linear pruning function is p(o)=o. The exponential pruning function is p(o)=1−exp(−o.sup.2/T). The sigmoidal pruning function is p(o)=σ((o−N.sub.t)/T), with σ denoting the standard sigmoid. The sigmoidal function appears as the binary cross-entropy relaxation of the subset selection problem. A comparison of these pruning functions is shown in the drawings.
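The three pruning functions described above may be written directly from their formulas; the temperature default below is an illustrative choice, not a value from the disclosure.

```python
import numpy as np

def linear_p(o):
    """Linear pruning: p(o) = o; needs no temperature."""
    return o

def exponential_p(o, T=0.1):
    """Temperature-controlled exponential pruning: p(o) = 1 - exp(-o^2 / T)."""
    return 1.0 - np.exp(-o**2 / T)

def sigmoidal_p(o, nt=0.4, T=0.1):
    """Sigmoidal pruning: p(o) = sigmoid((o - N_t) / T)."""
    return 1.0 / (1.0 + np.exp(-(o - nt) / T))
```

All three map an overlap in [0, 1] to a suppression weight in [0, 1], but only the linear function has a constant, non-vanishing gradient across the whole range, which is consistent with the ablation results reported later.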
[0038] GrooMeD-NMS does soft pruning to get r but uses hard sorting of s and O (lines 2-3 of Algorithm 2). Permutations of o are needed to sort O. Most soft sorting methods apply the soft permutation to the same vector. Two known methods can apply the soft permutation to another vector: "Differentiable ranks and sorting using optimal transport", described by Cuturi et al. in NeurIPS, 2019; and "SoftSort: A continuous relaxation for the argsort operator", described by Prillo et al. in ICML, 2020. Both these methods use O(n.sup.2) computations for soft sorting. In this disclosure, it was found that these methods are overly dependent on the temperature T to break out the ranks, and their gradients were too unreliable to train the model. Hence, GrooMeD-NMS preferably employs hard sorting of s and O, although soft sorting may be suitable in some embodiments.
[0039] Although no NMS has been proposed for monocular 3D object detection, GrooMeD-NMS is compared with the NMS techniques proposed for 2D object detection, 2D pedestrian detection, and 2D salient object detection in Table 1. No method described in Table 1 has a matrix-based closed-form mathematical expression of the NMS. Classical and Soft-NMS are used at inference time, while GrooMeD-NMS is used during both training and inference. QUBO-NMS, Point-NMS, and MAP-NMS are not used in end-to-end training. The Structured-SVM based NMS relies on a structured SVM to obtain the rescores. The neural network based NMS (denoted by NN-NMS) uses a separate neural network containing multiple layers and/or message-passing to approximate the NMS and does not use the pruning function. Unlike these methods, GrooMeD-NMS uses a single layer and does not require multiple layers or message passing. The algorithm is parallelizable up to groups (denoted by G). However, |G| is, in general, <<|B| in the NMS.
[0040]
[0041] Next, a set of IoU measures are computed at 43 for the set of predicted bounding boxes.
[0042] Boxes in the set of predicted bounding boxes are then grouped at 45 into one or more groups of predicted bounding boxes. In an example embodiment, groups of boxes are formed with boxes having the highest overlap with a top-ranked box. That is, boxes are grouped by identifying a box with highest score in the set of predicted bounding boxes; grouping boxes from the set of predicted bounding boxes with highest overlap with the identified box into a given group; deleting boxes in the given group from the set of predicted bounding boxes; and reiterating these steps until there are no boxes in the set of predicted bounding boxes. Other grouping techniques also fall within the broader aspects of this disclosure.
[0043] Prior to grouping the boxes, scores for each of the predicted bounding boxes in the set of predicted bounding boxes are sorted at 44 in a descending order. The IoU measures in the set of IoU measures are also sorted using permutation of the scores for each of the predicted bounding boxes.
[0044] For each group in the one or more groups of predicted bounding boxes, rescores are calculated at 46 for each of the predicted bounding boxes in the set of predicted bounding boxes by performing matrix operations on the vector of scores in accordance with a closed-form expression, for example as set forth above in equation (7).
[0045] Lastly, a box is selected at 47 for each object in the image using the rescores for each of the predicted bounding boxes. In one embodiment, a subset of boxes is selected for each object. Selected boxes have a rescore which exceeds a threshold for valid boxes. The threshold is preferably derived empirically. If more than one box has a rescore which exceeds the threshold, then all of those boxes are kept.
[0046] The technique for detecting objects in image data is suitable for use in many different applications. In one example, the image data represents a scene and the selected boxes are indicative of objects in the scene. Image data for the scene may be captured using a camera or an imaging device and then serve as input to a computer processor. Based on the selected boxes, a controlled object, such as a robot, medical device or an autonomous vehicle, is navigated through the scene. Techniques for plotting a path and issuing commands to the controlled object in accordance with the path are readily known in the art.
[0047] The grouped mathematically differentiable NMS builds on M3D-RPN and uses binning and self-balancing confidence. The boxes' self-balancing confidences are used as the scores s, and these are passed through the GrooMeD-NMS layer to obtain their rescores r. The rescores are used to signal the network if the best box has not been selected for a particular object.
[0048] The notion of the best 2D box can be extended to 3D. The best box has the highest product of IoU.sub.2D and gIoU.sub.3D with the ground truth g.sub.l, and is assigned a positive label when that product is greater than a certain threshold. Mathematically, the best box for g.sub.l maximizes q(b.sub.j, g.sub.l)=IoU.sub.2D(b.sub.j, g.sub.l)(1+gIoU.sub.3D(b.sub.j, g.sub.l))/2. gIoU is known to provide a signal even for non-intersecting boxes, where the usual IoU is always zero. Therefore, one can use gIoU.sub.3D instead of the regular IoU.sub.3D for figuring out the best box in 3D, as many boxes in 3D have a zero IoU.sub.3D overlap with the ground truth. For calculating gIoU.sub.3D, first calculate the volume V and hull volume V.sub.hull of the 3D boxes. V.sub.hull is the product of the hull in the Birds Eye View (BEV), removing the rotations, and the hull of the Y dimension. gIoU.sub.3D is then given by gIoU.sub.3D=IoU.sub.3D−(V.sub.hull−V.sub.∪)/V.sub.hull, where V.sub.∪ denotes the volume of the union of the two 3D boxes.
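The gIoU construction can be illustrated in the simpler axis-aligned 2D case; the sketch below is not the patent's 3D computation (which involves rotated boxes, BEV hulls, and the Y dimension) but shows why gIoU still provides a signal for disjoint boxes.

```python
def giou_2d(a, b):
    """Generalized IoU for axis-aligned boxes (x1, y1, x2, y2).

    gIoU = IoU - (hull - union) / hull, where hull is the area of the
    smallest enclosing box; for disjoint boxes IoU is 0 but gIoU < 0,
    so it still distinguishes near misses from far misses.
    """
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))   # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))   # intersection height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull
```

For identical boxes the value is 1; as two boxes move apart the value decreases below zero toward −1, whereas plain IoU saturates at zero.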
[0049] In general, the number of best boxes is less than the number of ground truths in an image (there could be some ground truth boxes for which no box is predicted). The tiny number of best boxes introduces a far heavier skew than the foreground-background classification. Therefore, one can use the modified AP-Loss as the loss after NMS, since the AP-Loss does not suffer from class imbalance.
[0050] Vanilla AP-Loss treats boxes of all images in a mini-batch equally, and the gradients are back-propagated through all the boxes. This disclosure removes this condition and ranks boxes in an image-wise manner. In other words, if the best boxes are correctly ranked in one image and are not in a second image, then the gradients only affect the boxes of the second image. This modification is called the Imagewise AP-Loss. In other words, L.sub.after=Σ.sub.mL.sub.AP(r.sup.(m), B.sup.(m)), where r.sup.(m) and B.sup.(m) denote the rescores and the boxes of the m.sup.th image in a mini-batch, respectively. This is different from previous NMS approaches, which use classification losses. Ablation studies show that the Imagewise AP-Loss is better suited to be used after NMS than the classification loss.
[0051] The overall loss function is thus given by L=L.sub.before+λL.sub.after, where L.sub.before denotes the losses before the NMS, including classification, 2D and 3D regression, as well as confidence losses, and L.sub.after denotes the loss term after the NMS, which is the Imagewise AP-Loss, with λ being the weight.
[0052] Experiments were conducted using the most widely used KITTI autonomous driving dataset. The publicly available PyTorch code of Kinematic-3D was modified. Kinematic-3D uses DenseNet-121 trained on ImageNet as the backbone and n.sub.h=1024 using 3D-RPN settings. Kinematic-3D is a video-based method while GrooMeD-NMS is an image-based method; this disclosure uses the best image model of Kinematic-3D, henceforth called Kinematic (Image), as the baseline for a fair comparison. Kinematic (Image) is built on M3D-RPN and uses binning and self-balancing confidence.
[0053] Training images are augmented using random flipping with probability 0.5. The Adam optimizer is used with batch size 2, weight-decay 5×10.sup.−4 and gradient clipping of 1. Training is done in two stages: warmup and full. Warmup takes 80 k mini-batches and starts with a learning rate of 4×10.sup.−3, following a poly learning rate policy with power 0.9. The model is then initialized with the confidence prediction branch from the warmup weights and fine-tuned for 50 k mini-batches using the self-balancing loss and the Imagewise AP-Loss after GrooMeD-NMS. The weight λ is kept at 0.05. Unless otherwise stated, p is the linear function (which does not require T) with α=100. N.sub.t, v and β are set to 0.4, 0.3 and 0.3, respectively.
[0054] For inference, the class and predicted confidences are multiplied to get the box's overall score.
[0055] There are three commonly used data splits of the KITTI dataset. The grouped mathematically differentiable NMS approach is evaluated on all three datasets: Test Split, Val 1 Split and Val 2 Split. Test Split: Official KITTI 3D benchmark consists of 7,481 training and 7,518 testing images. Val 1 Split—partitions the 7,481 training images into 3,712 training and 3,769 validation images. Val 2 Split—partitions the 7,481 training images into 3,682 training and 3,799 validation images.
[0056] KITTI uses AP.sub.3D|R40 metric to evaluate object detection. KITTI evaluation is done on three object categories: easy, moderate and hard. Each object is assigned to a category based on its occlusion, truncation, and height in the image space. The AP.sub.3D|R40 performance on the Moderate category compares different models in the benchmark. Experiments focused on the car class.
[0057] Table 2 summarizes the results of 3D object detection and BEV evaluation on the KITTI Test Split. The results in Table 2 show that GrooMeD-NMS outperforms the baseline M3D-RPN by a significant margin, as well as several other SoTA methods, on both tasks. GrooMeD-NMS also outperforms the augmentation-based approach GAD and the depth-convolution based D4LCN. Despite being an image-based method, GrooMeD-NMS performs competitively to the video-based method Kinematic (Video), outperforming it on the most challenging Hard set.
[0058] Table 3 summarizes the results of 3D object detection and BEV evaluation on the KITTI Val 1 Split at two IoU.sub.3D thresholds of 0.7 and 0.5. The results in Table 3 show that GrooMeD-NMS outperforms the baselines of M3D-RPN and Kinematic (Image) by a significant margin. Interestingly, GrooMeD-NMS (an image-based method) also outperforms the video-based method Kinematic (Video) on most of the metrics. Thus, GrooMeD-NMS performs best on 6 out of the 12 cases (3 categories×2 tasks×2 thresholds) while performing second-best on all other cases. The performance is especially impressive because the biggest improvements are shown on the Moderate and Hard sets, where objects are more distant and occluded.
[0059] Next, the AP.sub.3D performance of GrooMeD-NMS and Kinematic (Image) was compared on linear and log scales for objects at different depths (in meters) and IoU.sub.3D matching criteria of 0.3→0.7, as shown in the drawings.
[0060] The GrooMeD-NMS technique was also compared with the other inference-based NMS techniques, classical and Soft-NMS, as shown in Table 4. The results show that NMS inclusion in the training pipeline benefits the performance. Training with GrooMeD-NMS helps because the network gets an additional signal through the GrooMeD-NMS layer whenever the best-localized box corresponding to an object is not selected. Interestingly, Table 4 also suggests that replacing GrooMeD-NMS with the classical NMS in inference does not affect the performance.
[0061] The correlation of the scores with IoU.sub.3D after NMS is further compared for the model against two baselines, M3D-RPN and Kinematic (Image), and also Kinematic (Video), in the drawings.
[0062] Lastly, the training and inference times of including GrooMeD-NMS in the pipeline were compared. Warmup takes about 13 hours to train on a single 12 GB GeForce GTX Titan-X GPU. Full training of Kinematic (Image) and GrooMeD-NMS takes about 8 and 8.5 hours, respectively. The inference time per image using classical NMS and GrooMeD-NMS is 0.12 and 0.14 ms, respectively. Table 4 suggests that changing the NMS from GrooMeD to classical during inference does not alter the performance; the inference time of the method can thus be taken as 0.12 ms.
[0063] Table 5 summarizes the results of 3D object detection and BEV evaluation on the KITTI Val 2 Split at two IoU.sub.3D thresholds of 0.7 and 0.5. Again, M3D-RPN and Kinematic (Image) are used as baselines. The released model of M3D-RPN is evaluated using the KITTI metric. The results in Table 5 show that GrooMeD-NMS performs best in all cases. This is again impressive because the improvements are shown on the Moderate and Hard sets, consistent with Tables 2 and 3.
[0064] Table 6 compares the modifications of this approach on KITTI Val 1 Cars. Using a confidence head (Conf+No NMS) proves beneficial compared to the warmup model (No Conf+No NMS). Moreover, GrooMeD-NMS on classification scores (denoted by No Conf+NMS) is detrimental, as the classification scores are not suited for localization. Training the warmup model and then fine-tuning also works better than training without warmup, since the warmup phase allows the GrooMeD-NMS to carry out meaningful grouping of the boxes. In addition to the linear function, two other functions for p were compared: exponential and sigmoidal. The exponential and sigmoidal functions do not perform as well as the linear p, possibly because they have vanishing gradients close to an overlap of zero or one. Grouping and masking both help the model to reach a better minimum. The Imagewise AP-Loss is better than the vanilla AP-Loss since it treats boxes of two images differently. The Imagewise AP-Loss also performs better than the binary cross-entropy (BCE) loss. Class confidence does not work better since it does not have the localization information, while the self-balancing confidence gives the localization information without consideration of whether the box belongs to the foreground or background.
[0065] The techniques described herein may be implemented by one or more computer programs executed by one or more processors. The computer programs include processor-executable instructions that are stored on a non-transitory tangible computer readable medium. The computer programs may also include stored data. Non-limiting examples of the non-transitory tangible computer readable medium are nonvolatile memory, magnetic storage, and optical storage.
[0066] Some portions of the above description present the techniques described herein in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. These operations, while described functionally or logically, are understood to be implemented by computer programs. Furthermore, it has also proven convenient at times to refer to these arrangements of operations as modules or by functional names, without loss of generality.
[0067] Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system memories or registers or other such information storage, transmission or display devices.
[0068] Certain aspects of the described techniques include process steps and instructions described herein in the form of an algorithm. It should be noted that the described process steps and instructions could be embodied in software, firmware or hardware, and when embodied in software, could be downloaded to reside on and be operated from different platforms used by real time network operating systems.
[0069] The present disclosure also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a computer selectively activated or reconfigured by a computer program stored on a computer readable medium that can be accessed by the computer. Such a computer program may be stored in a tangible computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, application specific integrated circuits (ASICs), or any type of media suitable for storing electronic instructions, each coupled to a computer system bus. Furthermore, the computers referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
[0070] The algorithms and operations presented herein are not inherently related to any particular computer or other apparatus. Various systems may also be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatuses to perform the required method steps. The required structure for a variety of these systems will be apparent to those of skill in the art, along with equivalent variations. In addition, the present disclosure is not described with reference to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present disclosure as described herein.
[0071] The foregoing description of the embodiments has been provided for purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure. Individual elements or features of a particular embodiment are generally not limited to that particular embodiment, but, where applicable, are interchangeable and can be used in a selected embodiment, even if not specifically shown or described. The same may also be varied in many ways. Such variations are not to be regarded as a departure from the disclosure, and all such modifications are intended to be included within the scope of the disclosure.
APPENDIX
[0072]
TABLE-US-00004 TABLE 1 Overview of NMS algorithms. [Key: Train = End-to-end Trainable, L = Number of layers, Prune = Pruning function, Par = Parallelizable]

Algorithm       Train  Rescore        L   Prune  Par
Classical       X      X              —   Hard   O(|G|)
Soft-NMS        X      X              —   Soft   O(|G|)
QUBO-NMS        X      Optimization   —   X      —
Point-NMS       X      Point Process  —   X      —
MAP-NMS         X      MAP            —   X      —
Structured-NMS  ✓      SSVM           —   X      O(|1|)
NN-NMS          ✓      Neural Net     >1  X      O(|1|)
GrooMeD-NMS     ✓      Matrix         1   Soft   O(|G|)
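The GrooMeD-NMS row of Table 1 — a single matrix "layer" with soft pruning, parallelizable over groups — can be sketched as below. The grouping rule (connected components of the thresholded IoU graph) matches the unsupervised grouping described in the abstract, while the specific closed form r = max(0, s − Ps), with P holding the masked overlaps of each box against higher-scored boxes in its group, is a plausible simplification for illustration, not the verbatim disclosed expression:

```python
def group_boxes(iou, thresh=0.4):
    """Unsupervised grouping: connected components of the IoU > thresh graph,
    via a small union-find."""
    n = len(iou)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if iou[i][j] > thresh:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def groomed_rescore(scores, iou, thresh=0.4):
    """Closed-form rescoring per group: r = max(0, s - P s), where P is the
    strictly lower-triangular matrix of masked overlaps against boxes of
    higher score within the same group."""
    rescores = [0.0] * len(scores)
    for group in group_boxes(iou, thresh):
        order = sorted(group, key=lambda i: -scores[i])  # descending score
        for rank, i in enumerate(order):
            suppression = sum(iou[i][j] * scores[j]
                              for j in order[:rank] if iou[i][j] > thresh)
            rescores[i] = max(0.0, scores[i] - suppression)
    return rescores

# Boxes 0 and 1 overlap heavily (one object); box 2 is a separate object.
iou_mat = [[1.0, 0.7, 0.0],
           [0.7, 1.0, 0.0],
           [0.0, 0.0, 1.0]]
# Box 1 is suppressed toward 0.8 - 0.7 * 0.9 = 0.17; boxes 0 and 2 survive.
print(groomed_rescore([0.9, 0.8, 0.2], iou_mat))
```

Because every step here is composed of matrix-style products, maxima, and masking, the rescores remain differentiable with respect to the input scores, which is what allows the network to be trained through the NMS.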
TABLE-US-00005 TABLE 2 AP.sub.3D|R.sub.
TABLE-US-00006 TABLE 3 AP.sub.3D|R.sub.
TABLE-US-00007 TABLE 4 AP.sub.3D|R.sub.
TABLE-US-00008 TABLE 5 AP.sub.3D|R.sub.
TABLE-US-00009 TABLE 6 Ablation studies of our method on KITTI Val 1 Cars at IoU.sub.3D ≥ 0.7 and IoU.sub.3D ≥ 0.5, reporting the change in AP.sub.3D from the GrooMeD-NMS model.