System and method for tracking occluded objects
11625905 · 2023-04-11
Assignee
Inventors
- Pavel V. Tokmakov (San Francisco, CA, US)
- Rares A. Ambrus (San Francisco, CA, US)
- Wolfram Burgard (Mountain View, CA, US)
- Adrien David Gaidon (Mountain View, CA, US)
CPC classification
- G06V10/255 (Physics)
- G06V10/469 (Physics)
International classification
- G06V10/46 (Physics)
Abstract
A method for tracking an object performed by an object tracking system includes encoding locations of visible objects in an environment captured in a current frame of a sequence of frames. The method also includes generating a representation of a current state of the environment based on an aggregation of the encoded locations and an encoded location of each object visible in one or more frames of the sequence of frames occurring prior to the current frame. The method further includes predicting a location of an object occluded in the current frame based on a comparison of object centers decoded from the representation of the current state to object centers saved from each prior representation associated with a different respective frame of the sequence of frames occurring prior to the current frame. The method still further includes adjusting a behavior of an autonomous agent in response to identifying the location of the occluded object.
Claims
1. A method for tracking occluded objects performed by an object tracking system, comprising: encoding locations of visible objects in an environment captured in a current frame of a sequence of frames; generating a representation of a current state of the environment based on an aggregation of the encoded locations and an encoded location of each object visible in one or more frames of the sequence of frames occurring prior to the current frame; predicting a location of an object occluded in the current frame based on a comparison of object centers decoded from the representation of the current state to object centers saved from each prior representation associated with a different respective frame of the sequence of frames occurring prior to the current frame; and adjusting a behavior of an autonomous agent in response to predicting the location of the object occluded in the current frame.
2. The method of claim 1, further comprising decoding, from the generated representation of the current state, a location in the environment of each object center for each visible object in the current frame, a bounding box size for each visible object in the current frame, and a displacement vector for each visible object in the current frame.
3. The method of claim 2, further comprising: dividing the current frame into a plurality of locations; and assigning a value to each location of the plurality of locations based on whether the location comprises an object center, a value of a location comprising the object center being different than a value of a location without the object center.
4. The method of claim 2, further comprising storing, for each prior representation, a location in the environment of each object center, a displacement vector, and a bounding box size corresponding to each different respective visible object in a frame associated with a respective prior representation.
5. The method of claim 4, wherein predicting the location of the object occluded in the current frame comprises: comparing the location in the environment of each object center for each visible object in the current frame with the location in the environment of each object center for each prior representation; identifying an object center of a prior representation that is not visible in the current frame based on comparing the location in the environment of each object center for each visible object in the current frame with the location in the environment of each object center for each prior representation; determining an object corresponding to the identified object center is occluded in the current frame; and predicting the location of the object occluded in the current frame based on a location in the environment of the identified object center and a velocity predicted based on a displacement vector of the object corresponding to the identified object center.
6. The method of claim 1, further comprising capturing the sequence of frames via one or more sensors of the autonomous agent, wherein the sequence of frames comprises a plurality of consecutive frames.
7. The method of claim 1, further comprising training the object tracking system with a combination of synthetic data and real data.
8. An apparatus for tracking an object at an autonomous agent via an object tracking system, comprising: a processor; a memory coupled with the processor; and instructions stored in the memory and operable, when executed by the processor, to cause the apparatus to: encode locations of visible objects in an environment captured in a current frame of a sequence of frames; generate a representation of a current state of the environment based on an aggregation of the encoded locations and an encoded location of each object visible in one or more frames of the sequence of frames occurring prior to the current frame; predict a location of an object occluded in the current frame based on a comparison of object centers decoded from the representation of the current state to object centers saved from each prior representation associated with a different respective frame of the sequence of frames occurring prior to the current frame; and adjust a behavior of the autonomous agent in response to predicting the location of the object occluded in the current frame.
9. The apparatus of claim 8, wherein execution of the instructions further causes the apparatus to decode, from the generated representation of the current state, a location in the environment of each object center for each visible object in the current frame, a bounding box size for each visible object in the current frame, and a displacement vector for each visible object in the current frame.
10. The apparatus of claim 9, wherein execution of the instructions further causes the apparatus to: divide the current frame into a plurality of locations; and assign a value to each location of the plurality of locations based on whether the location comprises an object center, a value of a location comprising the object center being different than a value of a location without the object center.
11. The apparatus of claim 9, wherein execution of the instructions further causes the apparatus to store, for each prior representation, a location in the environment of each object center, a displacement vector, and a bounding box size corresponding to each different respective visible object in a frame associated with a respective prior representation.
12. The apparatus of claim 11, wherein execution of the instructions to predict the location of the object occluded in the current frame further causes the apparatus to: compare the location in the environment of each object center for each visible object in the current frame with the location in the environment of each object center for each prior representation; identify an object center of a prior representation that is not visible in the current frame based on comparing the location in the environment of each object center for each visible object in the current frame with the location in the environment of each object center for each prior representation; determine an object corresponding to the identified object center is occluded in the current frame; and predict the location of the object occluded in the current frame based on a location in the environment of the identified object center and a velocity predicted based on a displacement vector of the object corresponding to the identified object center.
13. The apparatus of claim 8, wherein execution of the instructions further causes the apparatus to capture the sequence of frames via one or more sensors of the autonomous agent, wherein the sequence of frames comprises a plurality of consecutive frames.
14. The apparatus of claim 8, wherein execution of the instructions further causes the apparatus to train the object tracking system with a combination of synthetic data and real data.
15. A non-transitory computer-readable medium having program code recorded thereon for tracking an object, the program code executed by a processor and comprising: program code to encode locations of visible objects in an environment captured in a current frame of a sequence of frames; program code to generate a representation of a current state of the environment based on an aggregation of the encoded locations and an encoded location of each object visible in one or more frames of the sequence of frames occurring prior to the current frame; program code to predict a location of an object occluded in the current frame based on a comparison of object centers decoded from the representation of the current state to object centers saved from each prior representation associated with a different respective frame of the sequence of frames occurring prior to the current frame; and program code to adjust a behavior of an autonomous agent in response to predicting the location of the object occluded in the current frame.
16. The non-transitory computer-readable medium of claim 15, wherein the program code further comprises program code to decode, from the generated representation of the current state, a location in the environment of each object center for each visible object in the current frame, a bounding box size for each visible object in the current frame, and a displacement vector for each visible object in the current frame.
17. The non-transitory computer-readable medium of claim 16, wherein the program code further comprises: program code to divide the current frame into a plurality of locations; and program code to assign a value to each location of the plurality of locations based on whether the location comprises an object center, a value of a location comprising the object center being different than a value of a location without the object center.
18. The non-transitory computer-readable medium of claim 16, wherein the program code further comprises program code to store for each prior representation a location in the environment of each object center, a displacement vector, and a bounding box size corresponding to each different respective visible object in a frame associated with a respective prior representation.
19. The non-transitory computer-readable medium of claim 18, wherein the program code to predict the location of the object occluded in the current frame further comprises: program code to compare the location in the environment of each object center for each visible object in the current frame with the location in the environment of each object center for each prior representation; program code to identify an object center of a prior representation that is not visible in the current frame based on comparing the location in the environment of each object center for each visible object in the current frame with the location in the environment of each object center for each prior representation; program code to determine an object corresponding to the identified object center is occluded in the current frame; and program code to predict the location of the object occluded in the current frame based on a location in the environment of the identified object center and a velocity predicted based on a displacement vector of the object corresponding to the identified object center.
20. The non-transitory computer-readable medium of claim 15, wherein the program code further comprises program code to capture the sequence of frames via one or more sensors of the autonomous agent, wherein the sequence of frames comprises a plurality of consecutive frames.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The features, nature, and advantages of the present disclosure will become more apparent from the detailed description set forth below when taken in conjunction with the drawings in which like reference characters identify correspondingly throughout.
DETAILED DESCRIPTION
(7) The detailed description set forth below, in connection with the appended drawings, is intended as a description of various configurations and is not intended to represent the only configurations in which the concepts described herein may be practiced. The detailed description includes specific details for the purpose of providing a thorough understanding of the various concepts. It will be apparent to those skilled in the art, however, that these concepts may be practiced without these specific details. In some instances, well-known structures and components are shown in block diagram form in order to avoid obscuring such concepts.
(8) An agent, such as an autonomous agent, may detect and track multiple objects in an environment. Object detection and tracking may be used to perform various tasks, such as scene understanding, motion planning, and/or obstacle avoidance. That is, the agent may autonomously navigate through an environment based on the tracked objects.
(9) Conventional systems may individually localize each detected object, and the detected objects may be combined into tracks based on spatio-temporal overlap and appearance similarity. In such systems, the tracking may be fragmented if a tracked object is occluded for one or more frames. It may be desirable to improve tracking to account for object occlusions.
(10) That is, an ability to predict object locations behind occlusions may reduce collisions and improve vehicle navigation, such as autonomous or semi-autonomous navigation. As an example, a person may run behind a parked car and may no longer be visible to a driver of a vehicle. Even so, the driver remains aware of the potential danger and slows down when passing the parked car. Conventional autonomous vehicles lack this type of ability. Aspects of the present disclosure improve object tracking by training an object tracking model on real and synthetic data to track objects that are occluded in one or more frames of a sequence of frames. In the current disclosure, an autonomous vehicle may refer to an autonomous vehicle and/or a semi-autonomous vehicle.
(12) In one configuration, the first sensor 108 captures a 2D image that includes objects in the first sensor's 108 field of view 114. The second sensor 106 may generate one or more output streams. The 2D image captured by the first sensor 108 includes a 2D image of the first vehicle 104, as the first vehicle 104 is in the first sensor's 108 field of view 114.
(13) The information obtained from the sensors 106, 108 may be used to navigate the ego vehicle 100 along a route when the ego vehicle 100 is in an autonomous mode. The sensors 106, 108 may be powered by electricity provided by the vehicle's 100 battery (not shown). The battery may also power the vehicle's motor. The information obtained from the sensors 106, 108 may be used for keypoint matching.
(16) In some aspects, a trained object tracking model may estimate a location of an occluded object at each moment in time.
(17) Conventional multi-object tracking systems operate in a tracking-by-detection paradigm. That is, the conventional multi-object tracking systems use an existing object detector to localize objects of interest in each frame of a sequence of frames, and then link the localized objects into tracks, in an online or offline manner. For ease of explanation, in the current disclosure, a multi-object tracking system or model may be referred to as an object tracking system or model. In some cases, conventional object tracking systems link a detected object to an existing trajectory based on bounding box overlap, learned appearance embedding, human pose, or graph-convolutional based trajectory representations. The conventional object tracking systems may be limited due to their frame-based nature. Such conventional object tracking systems resort to heuristic-based algorithms to handle occlusions.
(18) Additionally, some conventional object tracking systems combine detection and tracking in a single model. These conventional object tracking systems receive pairs of frames as an input, and output object detections together with pairwise associations. These conventional object tracking systems may improve tracking robustness. Still, such conventional object tracking systems may only handle primitive forms of occlusions, such as occlusions that last one frame.
(19) Aspects of the present disclosure are directed to an online setting, where an object tracking model associates an object detected in a current frame with one of the previously established trajectories for the detected object. In one configuration, an end-to-end trainable object tracking model is specified to localize objects behind occlusions. In some aspects, the object tracking model applies a center-tracking model to a sequence of frames received as an input, and predicts object centers together with their displacement vectors. The displacement vectors may be used to link object detections into tracks.
(20) Aspects of the present disclosure may operate on sequences of frames (e.g., videos) having an arbitrary length. In one configuration, each frame may be processed by a center-tracking model configured to extract features from the frame. The resulting features may be provided to a convolutional gated recurrent unit (ConvGRU) to aggregate a spatio-temporal representation of the scene. The ConvGRU is an example of an extension of a conventional gated recurrent unit (GRU). In such an example, the fully connected layer of the GRU is replaced by a convolutional layer, such that the ConvGRU has the time sequence modeling capability of the GRU. Additionally, similar to a convolutional neural network (CNN), the ConvGRU may describe local features.
(21) The ConvGRU may generate a current state for a current frame t, which may be stored in a memory module. In some implementations, object centers and corresponding displacement vectors may be determined based on the current state of the frame t. In one configuration, the object tracking model may use the full context of a video from an initial frame to a current frame ({1, . . . , t}), in contrast to conventional object tracking systems that are limited to a previous frame and a current frame ({t−1, t}). As such, the object tracking model of the current disclosure may be more robust in comparison to conventional object tracking systems. Additionally, the object tracking model of the current disclosure may learn to localize and associate objects that are not visible in the current frame.
(22) As described, aspects of the present disclosure implement a center-tracking model. In one configuration, the center-tracking model generates a representation of each object (e.g., each object of interest) by a single point at a center of a bounding box of the object. This center point may be tracked through time. That is, the center-tracking model may localize object centers.
(23) A conventional center-tracking model detects object centers based on two consecutive frames {I.sup.t−1, I.sup.t}, as well as a heatmap of prior tracked objects H.sup.t−1, represented as center points p ∈ ℝ.sup.2, where ℝ.sup.2 is the 2-dimensional space of real numbers. In such conventional models, the three input tensors (e.g., the two consecutive frames {I.sup.t−1, I.sup.t} and the heatmap of prior tracked objects H.sup.t−1) may be concatenated and passed through a backbone network f to produce a feature map F.sup.t=f(H.sup.t−1, I.sup.t−1, I.sup.t). The feature map F.sup.t may be used to localize object centers in a current frame {{circumflex over (p)}.sub.0.sup.t, {circumflex over (p)}.sub.1.sup.t, . . . }, regress object bounding box sizes {ŝ.sub.0.sup.t, ŝ.sub.1.sup.t . . . }, and predict object displacement vectors with respect to a location of the object in a previous frame {{circumflex over (d)}.sub.0.sup.t, {circumflex over (d)}.sub.1.sup.t . . . }. At test time (e.g., real-world deployment), displacement vectors may be used to project each center to the previous frame via {circumflex over (p)}.sub.i.sup.t−{circumflex over (d)}.sub.i.sup.t. The projected center may be greedily matched to a closest available center {circumflex over (p)}.sub.*.sup.t−1, thus recovering the track of the object. The detector of the conventional center-tracking model is trained to output an offset vector from an object center of a current frame t to its center in the previous frame t−1. That is, for each object of interest, the object may be associated with a track of a previous object based on greedy matching a distance between the predicted offset and the detected center point in the previous frame.
(24) The outputs of the center-tracking model (e.g., centers p, bounding box dimensions s, and displacement vectors d) may be predicted and supervised on a pixel level. That is, the feature map F.sup.t may be passed through separate sub-networks f.sub.p, f.sub.s, f.sub.d to produce corresponding outputs P.sub.t ∈ [0,1].sup.H×W×C, S.sub.t ∈ ℝ.sup.H×W×C, D.sub.t ∈ ℝ.sup.H×W×2, where C represents a number of classes that may be detected by the object detector. The outputs generated by the sub-networks may be considered localizations, where each localization represents an object of a class that is centered in the localization (P.sub.t ∈ [0,1].sup.H×W×C), a size of the object's bounding box (S.sub.t ∈ ℝ.sup.H×W×C), and a displacement of the center with respect to the previous frame (D.sub.t ∈ ℝ.sup.H×W×2). The actual centers p may be recovered by extracting local peaks in each neighborhood, such as a 3×3 neighborhood, with a value in that location serving as confidence in a detection.
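The peak-extraction step described above — recovering centers p as local maxima of a 3×3 neighborhood, with the heatmap value serving as detection confidence — can be sketched in NumPy. This is an illustrative sketch: the `threshold` value and function name are assumptions, and a practical implementation would typically use a max-pooling operation rather than explicit loops.

```python
import numpy as np

def extract_centers(heatmap, threshold=0.3):
    """Recover object centers from a per-class heatmap P in [0,1]^(H x W)
    by keeping locations that are maxima of their 3x3 neighborhood and
    whose score exceeds a confidence threshold."""
    H, W = heatmap.shape
    # Pad with -inf so border pixels can still be local maxima.
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    centers = []
    for y in range(H):
        for x in range(W):
            patch = padded[y:y + 3, x:x + 3]   # 3x3 neighborhood of (y, x)
            if heatmap[y, x] >= threshold and heatmap[y, x] == patch.max():
                centers.append((y, x, float(heatmap[y, x])))
    return centers
```

Each returned tuple carries the center location and the heatmap value at that location as the detection confidence.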
(25) As described, conventional center-tracking models consider a pair of consecutive frames. Limiting the consideration to the pair of consecutive frames may limit the model to tracking objects that are visible in every frame of the video. Incorporating previous frame detections in the input may assist in tracking partial occlusions or full occlusions that are one frame long. Still, the conventional center-tracking models cannot track more complex scenarios, such as an object that is invisible in both frames t and t−1. Therefore, aspects of the present disclosure are directed to extending the center-tracking model to a global, video-level model.
(27) Aspects of the present disclosure process sequences of feature maps and aggregate a representation of the scene, which encodes the locations of all the previously seen objects, even if they become fully occluded. In one configuration, the object tracking model 300 includes a convolutional gated recurrent unit (ConvGRU) 308, which may be a type of a recurrent memory network. The ConvGRU 308 may be an extension of a gated recurrent unit (GRU). That is, the ConvGRU 308 may replace a 1D state vector of the GRU with a 2D state feature map M. In some examples, the 2D feature map represents spatial information (e.g., height and width). In contrast, the 1D state vector condenses all the spatial information into a single vector. For example, the 1D state vector may be an average of the values over all spatial locations. Additionally, the ConvGRU 308 may replace fully connected layers of the GRU, used to compute state updates, with 2D convolutions. As a result, the ConvGRU 308 may capture temporal and spatio-temporal patterns in the inputs. That is, the ConvGRU 308 aggregates information over the sequence of frames.
(28) In the example of the object tracking model 300, the ConvGRU 308 may compute an updated state M.sup.t from the feature map F.sup.t of the current frame and the previous state M.sup.t−1 as follows:
Z.sup.t=σ(F.sup.t*W.sup.FZ+M.sup.t−1*W.sup.MZ+B.sup.Z), (1)
R.sup.t=σ(F.sup.t*W.sup.FR+M.sup.t−1*W.sup.MR+B.sup.R), (2)
{tilde over (M)}.sup.t=tan h(F.sup.t*W.sup.F{tilde over (M)}+R.sup.t⊙M.sup.t−1*W.sup.M{tilde over (M)}+B.sup.{tilde over (M)}), (3)
M.sup.t=(1−Z.sup.t)⊙M.sup.t−1+Z.sup.t⊙{tilde over (M)}.sup.t, (4)
where ⊙ denotes element-wise multiplication, * represents a convolutional operation, σ is a sigmoid function, W is a learned transformation, and B is a bias term. The updated state 310 M.sup.t may be a weighted combination of the feature map F.sup.t and the previous state M.sup.t−1. The update gate Z.sup.t may determine an amount of memory that is incorporated into the updated state 310 M.sup.t. In EQUATION 3, {tilde over (M)}.sup.t represents a candidate memory. The candidate memory {tilde over (M)}.sup.t may be ignored if the update gate Z.sup.t is zero or a near-zero value. The reset gate R.sup.t controls an influence of a previous state M.sup.t−1 on the candidate memory {tilde over (M)}.sup.t. In summary, the GRU( ) function may be trained to combine appearance features of the current frame with the memorized video representation to refine motion predictions, or even fully restore them from the previous observations in case a moving object becomes stationary.
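A minimal single-channel NumPy sketch of one ConvGRU update following EQUATIONS 1-4 is shown below. The 3×3 kernels, scalar biases, and helper names are illustrative assumptions; a practical implementation would use a deep learning framework with multi-channel learned convolutions.

```python
import numpy as np

def conv2d_same(x, w):
    """Minimal single-channel 3x3 'same'-padded convolution (illustrative only)."""
    H, W = x.shape
    pad = np.pad(x, 1)
    out = np.zeros_like(x, dtype=float)
    for y in range(H):
        for c in range(W):
            out[y, c] = np.sum(pad[y:y + 3, c:c + 3] * w)
    return out

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def convgru_step(F, M_prev, weights):
    """One ConvGRU update per equations (1)-(4): the gates and candidate
    memory use convolutions in place of the fully connected layers of a
    standard GRU."""
    W_fz, W_mz, B_z, W_fr, W_mr, B_r, W_fm, W_mm, B_m = weights
    Z = sigmoid(conv2d_same(F, W_fz) + conv2d_same(M_prev, W_mz) + B_z)  # update gate, eq. (1)
    R = sigmoid(conv2d_same(F, W_fr) + conv2d_same(M_prev, W_mr) + B_r)  # reset gate, eq. (2)
    M_cand = np.tanh(conv2d_same(F, W_fm) + conv2d_same(R * M_prev, W_mm) + B_m)  # eq. (3)
    return (1.0 - Z) * M_prev + Z * M_cand                               # new state, eq. (4)
```

For the initial frame, M_prev would be initialized to zeros, as paragraph (29) notes.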
(29) In such an example, the updated state 310 is determined by a GRU function based on a previous state M.sup.t−1 and the feature map F.sup.t. For an initial frame, the previous state M.sup.t−1 may be initialized to a particular value, such as zero. The updated state 310 M.sup.t may be an example of an output feature map.
(30) In some implementations, for each location in a frame, the object location heatmap 312 provides a score indicating whether an object center is present in the location at the current frame.
(31) In some implementations, the object location heatmap 312 P.sub.t (e.g., location of object centers for each visible object in the current frame), bounding box dimensions 314 S.sub.t, and displacement vectors 316 D.sub.t may be stored in memory, such as a memory module associated with the object tracking system and/or a memory module of an agent implementing the object tracking system. The agent may be an autonomous or semi-autonomous agent, such as an autonomous vehicle or a semi-autonomous vehicle. Additionally, object location heatmaps P.sub.t (e.g., locations of object centers for each visible object in the current frame), bounding box dimensions S.sub.t, and displacement vectors D.sub.t may be stored in memory for each prior state (e.g., M.sup.t−1 to M.sup.1) corresponding to each respective previous frame (I.sup.t−1 to I.sup.1).
(32) As described, conventional center-tracking models establish correspondences between objects in a pair of frames {I.sup.t−1, I.sup.t} based on raw pixel values. Aspects of the present disclosure improve object tracking by establishing correspondences between objects over a sequence of frames based on feature representations. That is, the predictions for a current frame t, such as object location heatmaps 312, bounding box dimensions 314, and displacement vectors 316, may be based on a sequence of previous frames (e.g., frames I.sup.t−1 to I.sup.1), in contrast to only a single previous frame (I.sup.t−1). Therefore, the object tracking model 300 may predict the presence of an occluded object at a location based on stored information regarding one or more of the object's previous locations, velocity, or trajectory. That is, object information, such as the object's location, velocity, and/or trajectory, may be aggregated over the previous frames to predict an object's location at a current frame regardless of whether the object is visible in the current frame.
(33) In some implementations, a location of an object occluded in the current frame may be predicted based on a comparison of object centers P.sub.t decoded from the representation of the current state M.sup.t to object centers saved for each prior representation corresponding to each different respective prior frame (I.sup.t−1 to I.sup.1). In such implementations, the location of each object center P.sub.t for each visible object in the current frame may be compared with the stored location of each object center for each respective prior representation. The location of an object center P.sub.t may be matched to the closest object center P.sub.t−1 to recover a track (e.g., path) for a corresponding object. Additionally, an object center of a prior representation that is not visible in the current frame may be identified based on the comparison of the location of each object center P.sub.t for each visible object in the current frame with the stored object center locations.
(34) The object tracking model may then determine that an object corresponding to the identified object center is occluded in the current frame. Furthermore, the object tracking model (e.g., object tracking system) may predict the location of the object occluded in the current frame based on a stored location of the identified object center and a velocity predicted based on a stored displacement vector of the object corresponding to the identified object center. As described, the displacement vector identifies a displacement of the object from the current frame to a prior frame. Thus, the object tracking system may predict an object's velocity based on a time between frames and a length of the displacement. That is, the model identifies a location of the occluded object by using the object's previously observed velocity, the object's last observed location, and a speed of the ego vehicle. In some examples, if a person walks behind a parked car, the model can predict the person's location by propagating it with the last observed velocity of the person and accounting for the change of the relative position of the occluded object with respect to the ego vehicle. In some other examples, after training, the model may predict the location based on patterns learned during training. An accuracy of the predicted velocity may increase as a number of frames in which the object is visible increases.
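The occlusion-handling step described above — detecting that a stored track has no nearby center in the current frame and propagating its last observed location with a velocity derived from its displacement vector — might be sketched as follows. The `match_radius`, the frame interval `dt`, and the dictionary-based track layout are assumptions made for illustration; compensation for ego-vehicle motion, which the passage also mentions, is omitted for brevity.

```python
import numpy as np

def predict_occluded(prev_tracks, curr_centers, match_radius=30.0, dt=0.1):
    """For each previously tracked object, check whether any current
    detection lies near its last known center; if none does, treat the
    object as occluded and propagate its last observed location with a
    constant-velocity model derived from its displacement vector."""
    predictions = {}
    for track_id, (last_center, displacement) in prev_tracks.items():
        last_center = np.asarray(last_center, dtype=float)
        displacement = np.asarray(displacement, dtype=float)
        visible = any(
            np.linalg.norm(np.asarray(c, dtype=float) - last_center) <= match_radius
            for c in curr_centers
        )
        if not visible:
            velocity = displacement / dt          # displacement spans one frame gap
            predictions[track_id] = tuple(last_center + velocity * dt)
    return predictions
```

A longer visibility history would allow averaging the velocity over several frames, improving the prediction as the passage notes.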
(35) In some implementations, a supervised learning method may be used to train the object-tracking model. Training and evaluation on sequences that are longer than two frames may further improve the object-tracking model due to the increased robustness of a video representation M, aggregated over multiple frames.
(36) Conventional object tracking datasets do not provide labels for fully occluded objects, due to the complexity of collecting such annotations. That is, it is very difficult, if not practically impossible, to accurately label invisible objects (e.g., occluded objects) in existing videos. In some implementations, a new dataset may be generated to train the object tracking model of the current disclosure. The new dataset may be collected in a controlled environment, where objects of interest may be equipped with tracking devices to register their positions. Still, tracking behind occlusions may be prone to overfitting. Thus, it may be desirable to train an object tracking model on a large dataset, such as a dataset with at least hundreds of videos. Generating a large dataset in the controlled environment with objects of interest equipped with tracking devices may be cost-prohibitive. In one configuration, the new dataset is generated with synthetic data. The synthetic data (e.g., synthetic videos) may provide annotations for all the objects, irrespective of their visibility, at no additional cost.
(37) Despite the progress in computer graphics realism, a model trained only on synthetic videos may fail to achieve a desired level of accuracy for tracking and detecting objects. In one configuration, the object tracking model is jointly trained on synthetic data and real data. The real data may be provided for visible objects, and the synthetic data may be provided for occluded objects. Samples of real data used for training may be fewer than samples of synthetic data used for training. For example, during training, the object tracking model may be trained on real data of length R and synthetic data of length N, where N is greater than R. Joint training on synthetic and real data may allow the object tracking model to learn complex behavior, such as tracking behind occlusions, from synthetic data, while minimizing a domain gap due to the inclusion of real data.
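One simple way to realize joint training with more synthetic samples (N) than real samples (R) is to draw each training sequence from the synthetic pool with higher probability. The sampling fraction and function name below are illustrative assumptions, not values from the disclosure.

```python
import random

def build_training_schedule(synthetic_seqs, real_seqs,
                            synthetic_fraction=0.8, num_samples=1000):
    """Interleave synthetic sequences (which include occlusion labels)
    with a smaller share of real sequences, so that N synthetic samples
    outnumber R real samples while real data limits the domain gap."""
    schedule = []
    for _ in range(num_samples):
        pool = synthetic_seqs if random.random() < synthetic_fraction else real_seqs
        schedule.append(random.choice(pool))
    return schedule
```

The fraction controls the synthetic-to-real ratio N:R and would be tuned empirically.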
(38) In one configuration, to generate the training labels for a video sequence, the supervised learning method may receive a sequence of object annotations {O.sup.1, O.sup.2, . . . , O.sup.n}, with O.sup.t={o.sub.1.sup.t, o.sub.2.sup.t, . . . , o.sub.m.sup.t}, as an input. Each object o.sub.i.sup.t may be described by its center p ∈ ℝ.sup.2, bounding box size s ∈ ℝ.sup.2, identity id ∈ I, and visibility level vis ∈ [0,1], such that the object o.sub.i.sup.t=(p, s, id, vis). In some implementations, the identity id may be used together with the center p to supervise displacement vectors d.
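The annotation tuple o.sub.i.sup.t=(p, s, id, vis) described above can be sketched as a simple record type. This is an illustrative data-structure sketch, not the patent's implementation; the field and variable names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class ObjectAnnotation:
    """One object annotation o_i^t in frame t (names are illustrative)."""
    center: tuple       # p in R^2, the object center (x, y)
    box_size: tuple     # s in R^2, the bounding box (width, height)
    identity: int       # id, unique per tracked object across frames
    visibility: float   # vis in [0, 1]; 0.0 means fully occluded

# A frame's annotation set O^t is then a list of these records:
frame_annotations = [
    ObjectAnnotation(center=(120.0, 48.5), box_size=(32.0, 64.0),
                     identity=7, visibility=0.85),
    ObjectAnnotation(center=(300.2, 52.0), box_size=(28.0, 60.0),
                     identity=9, visibility=0.0),  # fully occluded
]
```

Because the identity id persists across frames, matching centers of objects with the same id in consecutive frames yields the ground-truth displacement vectors d used for supervision.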
(39) The visibility levels vis may constrain the object-tracking model to detect and track visible objects. That is, without the visibility level vis, the object-tracking model may be forced to detect and track objects before they become visible and/or produce tracks for objects that are fully occluded for the whole duration of a video. In some implementations, the object annotations {O.sup.1, O.sup.2, . . . , O.sup.n} may be pre-processed to supervise an occluded object only after the object has been visible for at least two frames. In some such implementations, a visibility threshold T.sub.vis and an occlusion threshold T.sub.occl may be specified to enforce visibility constraints. In such implementations, beginning with a first frame in a sequence O.sup.1, for every object o.sub.i.sup.1 in the frame, the object o.sub.i.sup.1 is treated as a negative if a visibility of the object vis.sub.i.sup.1 is less than the visibility threshold T.sub.vis (e.g., vis.sub.i.sup.1<T.sub.vis). Additionally, the object may be ignored if the visibility of the object vis.sub.i.sup.1 is greater than the visibility threshold T.sub.vis and less than the occlusion threshold T.sub.occl (e.g., T.sub.vis<vis.sub.i.sup.1<T.sub.occl). Furthermore, the object o.sub.i.sup.1 may be marked as visible and used to produce a label if the visibility of the object vis.sub.i.sup.1 is greater than the occlusion threshold T.sub.occl (e.g., vis.sub.i.sup.1>T.sub.occl). The same procedure may be repeated for every frame in a sequence. Beginning with a third frame of the sequence of frames, objects that were previously marked as visible for two consecutive frames are treated as positives regardless of their visibility status in the current frame.
The procedure for treating an object o.sub.i.sup.t as a negative, as a positive, or as ignored, based on a comparison of the visibility of the object vis.sub.i.sup.t to the visibility threshold T.sub.vis and the occlusion threshold T.sub.occl, may provide a soft transition between visible and invisible objects, instead of forcing the model to make a hard choice. That is, the model does not need to make a hard choice whether to treat a partially occluded object as visible or as invisible. Instead, the model may ignore such borderline cases during training. In some examples, the visibility threshold T.sub.vis may be 0.05 and the occlusion threshold T.sub.occl may be 0.15, corresponding to 5% and 15% of the object being visible, respectively. The thresholds T.sub.vis and T.sub.occl may define which levels of occlusion to treat as fully invisible, which to ignore, and which to treat as visible.
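The labeling procedure above can be sketched as a small per-object decision function. This is a minimal sketch under the stated thresholds (0.05 and 0.15); the function and parameter names are assumptions, and the `visible_streak` argument stands in for the bookkeeping of how many consecutive prior frames the object was marked visible.

```python
T_VIS, T_OCCL = 0.05, 0.15  # example thresholds from the text

def assign_label(vis, visible_streak):
    """Return 'positive', 'negative', or 'ignore' for one object.

    vis            -- visibility level in [0, 1] for the current frame
    visible_streak -- number of consecutive prior frames in which the
                      object was marked visible
    """
    # An object visible for two consecutive prior frames stays positive
    # even if now occluded, enabling tracking behind occlusions.
    if visible_streak >= 2:
        return "positive"
    if vis < T_VIS:
        return "negative"      # treated as fully invisible
    if vis < T_OCCL:
        return "ignore"        # borderline case: no loss applied
    return "positive"          # sufficiently visible
```

The "ignore" branch is what provides the soft transition: borderline objects contribute no loss, so the model is never penalized for either choice on a partially occluded object.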
(40) The training procedure described above, however, assumes the availability of a video dataset with objects labeled regardless of whether they are visible. A dataset with such labels may be difficult to obtain due to the cost and complexity of obtaining precise bounding box labels for invisible objects. Therefore, in some implementations, synthetic datasets may be used to train the object tracking model.
(41) The synthetic dataset may include a set of video clips (e.g., sequences of frames), where each video clip in the set of video clips has a same length, such as ten seconds. Alternatively, the length of the video clips may vary. Each video clip may represent a driving scenario, such as a crowded street with one or more occluded objects, such as a person and/or a vehicle. Due to the synthetic nature of each video clip, one or more objects in each video clip may be annotated with a bounding box, irrespective of the object's visibility. Additionally, or alternatively, accurate visibility estimates may be provided for one or more objects in each video clip.
(42) Each video clip may provide one or more sequences, with each sequence captured by a sensor integrated with an ego-vehicle. As an example, the sensor may be a camera, a RADAR sensor, or a LiDAR sensor. In some implementations, the supervised training method uses sequences corresponding to front and side sensors to increase data diversity and complexity, and also to minimize a domain gap with real datasets. During training, frame sequences of length N may be sampled from each video clip. The annotations of the sampled sequences may be pre-processed, as described above. The object-tracking model may learn to track behind occlusions based on the training. For additional data augmentation, the training method may use consecutive frames or randomly sampled frames. As an example, the frames may be randomly sampled based on a random temporal stride Mf and/or reversed at random.
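The stride-and-reversal augmentation above can be sketched as an index-sampling routine. This is an illustrative sketch, not the source's implementation; `max_stride` and the 50% reversal probability are assumptions introduced for the example.

```python
import random

def sample_training_sequence(clip_len, n_frames, max_stride=3, seed=None):
    """Sample n_frames frame indices from a clip of clip_len frames,
    using a random temporal stride and random reversal for augmentation.
    """
    rng = random.Random(seed)
    stride = rng.randint(1, max_stride)
    # Choose a start so the whole strided window fits inside the clip.
    start = rng.randint(0, clip_len - (n_frames - 1) * stride - 1)
    indices = [start + i * stride for i in range(n_frames)]
    if rng.random() < 0.5:
        indices.reverse()  # temporally reversed sequences are also valid
    return indices
```

Sampling with a stride exposes the model to faster apparent motion, and reversal doubles the effective number of distinct motion patterns without new data.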
(43) Training on synthetic datasets may create a domain discrepancy with real datasets. In some conventional systems, the domain discrepancy is addressed by fine-tuning the resulting model on a small real dataset. Still, in the current examples, the real datasets do not have labels for occluded objects. Therefore, such fine-tuning would result in un-learning the ability to track behind occlusions. Thus, in some implementations, the object-tracking model may be trained jointly on synthetic data and real data, where, at each iteration, a batch is sampled from one of the datasets at random. Additionally, to maintain consistency for occluded object supervision, a batch of real data may be limited to at most two frames. As a result, the supervised training method may sample synthetic video clips of length N and real video clips of length two (e.g., a pair of real frames). In such implementations, the synthetic data may be used to learn a desired behavior and real data (e.g., pairs of real frames) may be used to reduce a domain gap. Additionally, or alternatively, a pair of real frames may be simulated by randomly shifting an image from an image-based, object detection dataset.
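The joint-training sampling described above can be sketched as a batch selector. This is a minimal sketch under stated assumptions: the 50/50 dataset choice and the function name are illustrative, and clips are represented as plain frame lists.

```python
import random

def next_batch(synthetic_clips, real_pairs, n_synth_frames, rng=random):
    """Pick the next training batch from one dataset, chosen at random.

    Synthetic clips contribute sequences of length n_synth_frames (with
    occlusion labels); real data contributes at most two frames per
    batch, keeping occluded-object supervision consistent.
    """
    if rng.random() < 0.5:
        clip = rng.choice(synthetic_clips)
        return ("synthetic", clip[:n_synth_frames])
    pair = rng.choice(real_pairs)
    return ("real", pair[:2])  # real batches are capped at two frames
```

Capping real batches at two frames matters because longer real sequences would contain unlabeled occlusions, and supervising on them would teach the model to drop occluded tracks.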
(44) In some examples, an object-tracking model may ignore occluded objects, since full occlusions may constitute a small fraction of the dataset. In some such examples, to avoid ignoring objects, a weight of a localization loss for fully occluded instances may be increased. In addition, a box size loss for fully occluded instances may be ignored, because predicting a size of an invisible object may be ambiguous, and may not be needed during tracking.
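The re-weighting described above can be sketched as a localization loss that up-weights fully occluded instances and skips their box-size term. This is an illustrative sketch: the source states only that the occluded weight is increased and the size loss is ignored, so the L1 form and the weight value of 5.0 are assumptions.

```python
def localization_loss(pred_centers, gt_centers, fully_occluded,
                      occluded_weight=5.0):
    """Mean per-object L1 center loss with up-weighted occluded objects.

    fully_occluded -- per-object flags; occluded objects receive a
                      larger weight so they are not drowned out by the
                      far more numerous visible objects. No box-size
                      term is computed here, mirroring the text's
                      choice to skip size supervision when occluded.
    """
    total = 0.0
    for (px, py), (gx, gy), occ in zip(pred_centers, gt_centers,
                                       fully_occluded):
        l1 = abs(px - gx) + abs(py - gy)
        total += occluded_weight * l1 if occ else l1
    return total / max(len(gt_centers), 1)
```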
(45)
(46) The vehicle 428 may operate in one or more of an autonomous operating mode, a semi-autonomous operating mode, and a manual operating mode. Furthermore, the vehicle 428 may be an electric vehicle, a hybrid vehicle, a fuel vehicle, or another type of vehicle. The autonomous operating mode may autonomously control the vehicle without human interaction or intervention. The semi-autonomous mode may control the vehicle 428 with human interaction. Additionally, or alternatively, in the semi-autonomous mode, a human may control the vehicle 428 and one or more components, such as one or more of the object tracking module 408, processor 420, a communication module 422, a location module 418, a sensor module 402, a locomotion module 426, a navigation module 424, memory 452, and a computer-readable medium 414 may override the human control. For example, the human control may be overridden to prevent a collision.
(47) The object tracking system 400 may be implemented with a bus architecture, represented generally by a bus 440. The bus 440 may include any number of interconnecting buses and bridges depending on the specific application of the object tracking system 400 and the overall design constraints. The bus 440 links together various circuits including one or more processors and/or hardware modules, represented by a processor 420, a communication module 422, a memory 452, a location module 418, a sensor module 402, a locomotion module 426, a navigation module 424, and a computer-readable medium 414. The bus 440 may also link various other circuits such as timing sources, peripherals, voltage regulators, and power management circuits, which are well known in the art, and therefore, will not be described any further. The memory 452 may include volatile and/or non-volatile memory. For example, the memory 452 may be read only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable PROM (EEPROM), flash memory, random access memory (RAM), or other types of volatile or non-volatile memory. Additionally, the RAM may be, for example, synchronous RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), direct Rambus RAM (DRRAM), or other types of RAM.
(48) The object tracking system 400 includes a transceiver 416 coupled to the processor 420, the sensor module 402, an occluded object tracking module 408, the communication module 422, the location module 418, the locomotion module 426, the navigation module 424, and the computer-readable medium 414. The transceiver 416 is coupled to an antenna 444.
(49) The object tracking system 400 includes the processor 420 coupled to the computer-readable medium 414. The processor 420 performs processing, including the execution of software stored on the computer-readable medium 414 providing functionality according to the disclosure. The software, when executed by the processor 420, causes the object tracking system 400 to perform the various functions described for a particular device, such as the vehicle 428, or any of the modules 402, 408, 414, 416, 418, 420, 422, 424, 426. The computer-readable medium 414 may also be used for storing data that is manipulated by the processor 420 when executing the software.
(50) The sensor module 402 may be used to obtain measurements via different sensors, such as a first sensor 406 and a second sensor 404. The first sensor 406 may be a vision sensor, such as a stereoscopic camera or a red-green-blue (RGB) camera, for capturing 2D images. The second sensor 404 may be a ranging sensor, such as a light detection and ranging (LIDAR) sensor or a radio detection and ranging (RADAR) sensor. Of course, aspects of the present disclosure are not limited to the aforementioned sensors as other types of sensors, such as, for example, thermal, sonar, and/or lasers are also contemplated for either of the sensors 404, 406.
(51) The measurements of the first sensor 406 and the second sensor 404 may be processed by one or more of the processor 420, the sensor module 402, the occluded object tracking module 408, the communication module 422, the location module 418, the locomotion module 426, the navigation module 424, in conjunction with the computer-readable medium 414 to implement the functionality described herein. In one configuration, the data captured by the first sensor 406 and the second sensor 404 may be transmitted to an external device via the transceiver 416. The first sensor 406 and the second sensor 404 may be coupled to the vehicle 428 or may be in communication with the vehicle 428.
(52) The location module 418 may be used to determine a location of the vehicle 428. For example, the location module 418 may use a global positioning system (GPS) to determine the location of the vehicle 428. The communication module 422 may be used to facilitate communications via the transceiver 416. For example, the communication module 422 may be configured to provide communication capabilities via different wireless protocols, such as WiFi, long term evolution (LTE), 4G, etc. The communication module 422 may also be used to communicate with other components of the vehicle 428 that are not modules of the object tracking system 400.
(53) The locomotion module 426 may be used to facilitate locomotion of the vehicle 428. As an example, the locomotion module 426 may control a movement of the wheels. As another example, the locomotion module 426 may be in communication with one or more power sources of the vehicle 428, such as a motor and/or batteries. Of course, aspects of the present disclosure are not limited to providing locomotion via wheels and are contemplated for other types of components for providing locomotion, such as propellers, treads, fins, and/or jet engines.
(54) The object tracking system 400 also includes the navigation module 424 for planning a route or controlling the locomotion of the vehicle 428, via the locomotion module 426. The navigation module 424 may override user input when the user input is expected (e.g., predicted) to cause a collision. The modules may be software modules running in the processor 420, resident/stored in the computer-readable medium 414, one or more hardware modules coupled to the processor 420, or some combination thereof.
(55) The occluded object tracking module 408 may perform one or more elements of the process 500 described with respect to
(56) In some implementations, working in conjunction with one or more of the processor 420, sensor module 402, and computer-readable medium 414, the object tracking module 408 decodes, from the generated representation of the current frame, a location in the environment of each object center for each visible object in the current frame, a bounding box size for each visible object in the current frame, and a displacement vector for each visible object in the current frame. For example, the object tracking module 408 may use one or more neural networks, such as sub-networks 330a, 330b, 330c as described with respect to
(57) In some implementations, working in conjunction with one or more of the processor 420, sensor module 402, and computer-readable medium 414, the object tracking module 408 predicts a location of an object occluded in the current frame based on a comparison of object centers decoded from the representation of the current state to object centers saved from each prior representation associated with a different respective frame of the sequence of frames occurring prior to the current frame. In some implementations, the process 500 may predict the location by comparing the location of each object center for each visible object in the current frame with the stored location of each object center for each respective prior representation. In such implementations, working in conjunction with one or more of the processor 420, sensor module 402, and computer-readable medium 414, the object tracking module 408 may also identify an object center of a prior representation that is not visible in the current frame based on comparing the location of each object center for each visible object in the current frame with the stored location. The object tracking module 408 may further determine an object corresponding to the identified object center is occluded in the current frame. Also, in such implementations, the object tracking module 408 may predict the location of the object occluded in the current frame based on a stored location of the identified object center and a velocity predicted based on a stored displacement vector of the object corresponding to the identified object center.
(58) In some implementations, working in conjunction with one or more of the processor 420, sensor module 402, locomotion module 426, navigation module 424, communication module 422, and computer-readable medium 414, the object tracking module 408 adjusts a behavior of the vehicle 428 in response to identifying the location of the occluded object.
(59)
(60) In some implementations, the process 500 begins in block 502 with encoding locations of visible objects in an environment captured in a current frame of a sequence of frames. At block 504, the process 500 generates a representation of a current state of the environment based on an aggregation of the encoded locations and an encoded location of each object visible in one or more frames of the sequence of frames occurring prior to the current frame. The sequence of frames may be captured via one or more sensors of the autonomous agent, such as the sensors 106, 108 or 404, 406 described with respect to
(61) In some implementations, the process 500 decodes, from the generated representation of the current frame, a location in the environment of each object center for each visible object in the current frame, a bounding box size for each visible object in the current frame, and a displacement vector for each visible object in the current frame. For example, the process 500 may use one or more neural networks, such as sub-networks 330a, 330b, 330c as described with respect to
(62) At block 506, the process 500 predicts a location of an object occluded in the current frame based on a comparison of object centers decoded from the representation of the current state to object centers saved from each prior representation associated with a different respective frame of the sequence of frames occurring prior to the current frame. In some implementations, the process 500 may predict the location by comparing the location of each object center for each visible object in the current frame with the stored location of each object center for each respective prior representation. In such implementations, the process 500 may also identify an object center of a prior representation that is not visible in the current frame based on comparing the location of each object center for each visible object in the current frame with the stored location. The process 500 may further determine an object corresponding to the identified object center is occluded in the current frame. Also, in such implementations, the process 500 may predict the location of the object occluded in the current frame based on a stored location of the identified object center and a velocity predicted based on a stored displacement vector of the object corresponding to the identified object center.
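The occluded-location prediction at block 506 can be sketched as a constant-velocity extrapolation from the stored center and displacement vector. This is an illustrative sketch: the function name, the sign convention for the displacement vector, and the constant-velocity assumption are all assumptions introduced for the example.

```python
def predict_occluded_center(stored_center, stored_displacement,
                            frames_since_seen):
    """Extrapolate an occluded object's center from its last stored
    center and displacement vector.

    The stored displacement vector is treated as a per-frame velocity,
    so the predicted center is the stored center advanced by that
    velocity for each frame the object has remained occluded.
    """
    (x, y) = stored_center
    (dx, dy) = stored_displacement
    return (x + dx * frames_since_seen, y + dy * frames_since_seen)

# An object last seen at (100, 50), moving (+4, 0) per frame and
# occluded for 3 frames, is predicted at (112, 50).
```

In practice this prediction keeps the occluded track alive so that, when the object reappears, its decoded center can be matched back to the same identity rather than spawning a new track.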
(63) At block 508, the process 500 adjusts a behavior of an autonomous agent in response to identifying the location of the occluded object. Aspects of the present disclosure are not limited to implementing the object tracking system in an autonomous agent; other types of agents, such as semi-autonomous or manually operated agents, are also contemplated.
(64) Based on the teachings, one skilled in the art should appreciate that the scope of the present disclosure is intended to cover any aspect of the present disclosure, whether implemented independently of or combined with any other aspect of the present disclosure. For example, an apparatus may be implemented or a method may be practiced using any number of the aspects set forth. In addition, the scope of the present disclosure is intended to cover such an apparatus or method practiced using other structure, functionality, or structure and functionality in addition to, or other than the various aspects of the present disclosure set forth. It should be understood that any aspect of the present disclosure may be embodied by one or more elements of a claim.
(65) The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
(66) Although particular aspects are described herein, many variations and permutations of these aspects fall within the scope of the present disclosure. Although some benefits and advantages of the preferred aspects are mentioned, the scope of the present disclosure is not intended to be limited to particular benefits, uses or objectives. Rather, aspects of the present disclosure are intended to be broadly applicable to different technologies, system configurations, networks and protocols, some of which are illustrated by way of example in the figures and in the following description of the preferred aspects. The detailed description and drawings are merely illustrative of the present disclosure rather than limiting, the scope of the present disclosure being defined by the appended claims and equivalents thereof.
(67) As used herein, the term “determining” encompasses a wide variety of actions. For example, “determining” may include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure), ascertaining and the like. Additionally, “determining” may include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory) and the like. Furthermore, “determining” may include resolving, selecting, choosing, establishing, and the like.
(68) As used herein, a phrase referring to “at least one of” a list of items refers to any combination of those items, including single members. As an example, “at least one of: a, b, or c” is intended to cover: a, b, c, a-b, a-c, b-c, and a-b-c.
(69) The various illustrative logical blocks, modules and circuits described in connection with the present disclosure may be implemented or performed with a processor specially configured to perform the functions discussed in the present disclosure. The processor may be a neural network processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device (PLD), discrete gate or transistor logic, discrete hardware components or any combination thereof designed to perform the functions described herein. Alternatively, the processing system may comprise one or more neuromorphic processors for implementing the neuron models and models of neural systems described herein. The processor may be a microprocessor, controller, microcontroller, or state machine specially configured as described herein. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or such other special configuration, as described herein.
(70) The steps of a method or algorithm described in connection with the present disclosure may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in storage or machine readable medium, including random access memory (RAM), read only memory (ROM), flash memory, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable disk, a CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. A software module may comprise a single instruction, or many instructions, and may be distributed over several different code segments, among different programs, and across multiple storage media. A storage medium may be coupled to a processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor.
(71) The methods disclosed herein comprise one or more steps or actions for achieving the described method. The method steps and/or actions may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of steps or actions is specified, the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims.
(72) The functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in hardware, an example hardware configuration may comprise a processing system in a device. The processing system may be implemented with a bus architecture. The bus may include any number of interconnecting buses and bridges depending on the specific application of the processing system and the overall design constraints. The bus may link together various circuits including a processor, machine-readable media, and a bus interface. The bus interface may be used to connect a network adapter, among other things, to the processing system via the bus. The network adapter may be used to implement signal processing functions. For certain aspects, a user interface (e.g., keypad, display, mouse, joystick, etc.) may also be connected to the bus. The bus may also link various other circuits such as timing sources, peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further.
(73) The processor may be responsible for managing the bus and processing, including the execution of software stored on the machine-readable media. Software shall be construed to mean instructions, data, or any combination thereof, whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise.
(74) In a hardware implementation, the machine-readable media may be part of the processing system separate from the processor. However, as those skilled in the art will readily appreciate, the machine-readable media, or any portion thereof, may be external to the processing system. By way of example, the machine-readable media may include a transmission line, a carrier wave modulated by data, and/or a computer product separate from the device, all which may be accessed by the processor through the bus interface. Alternatively, or in addition, the machine-readable media, or any portion thereof, may be integrated into the processor, such as the case may be with cache and/or specialized register files. Although the various components discussed may be described as having a specific location, such as a local component, they may also be configured in various ways, such as certain components being configured as part of a distributed computing system.
(75) The machine-readable media may comprise a number of software modules. The software modules may include a transmission module and a receiving module. Each software module may reside in a single storage device or be distributed across multiple storage devices. By way of example, a software module may be loaded into RAM from a hard drive when a triggering event occurs. During execution of the software module, the processor may load some of the instructions into cache to increase access speed. One or more cache lines may then be loaded into a special purpose register file for execution by the processor. When referring to the functionality of a software module below, it will be understood that such functionality is implemented by the processor when executing instructions from that software module. Furthermore, it should be appreciated that aspects of the present disclosure result in improvements to the functioning of the processor, computer, machine, or other system implementing such aspects.
(76) If implemented in software, the functions may be stored or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media include both computer storage media and communication media including any storage medium that facilitates transfer of a computer program from one place to another.
(77) Further, it should be appreciated that modules and/or other appropriate means for performing the methods and techniques described herein can be downloaded and/or otherwise obtained by a user terminal and/or base station as applicable. For example, such a device can be coupled to a server to facilitate the transfer of means for performing the methods described herein. Alternatively, various methods described herein can be provided via storage means, such that a user terminal and/or base station can obtain the various methods upon coupling or providing the storage means to the device. Moreover, any other suitable technique for providing the methods and techniques described herein to a device can be utilized.
(78) It is to be understood that the claims are not limited to the precise configuration and components illustrated above. Various modifications, changes, and variations may be made in the arrangement, operation, and details of the methods and apparatus described above without departing from the scope of the claims.