Method and System for Determining Weights for an Attention Based Method for Trajectory Prediction
20220212660 · 2022-07-07
Inventors
CPC classification
B60W50/0098
PERFORMING OPERATIONS; TRANSPORTING
G06V20/58
PHYSICS
B60W2050/0022
PERFORMING OPERATIONS; TRANSPORTING
B60W30/095
PERFORMING OPERATIONS; TRANSPORTING
International classification
B60W30/095
PERFORMING OPERATIONS; TRANSPORTING
B60W50/00
PERFORMING OPERATIONS; TRANSPORTING
Abstract
A computer implemented method for determining weights for an attention based trajectory prediction comprises the following steps carried out by computer hardware components: receiving a sequence of a plurality of captures taken by a sensor; determining an unnormalized weight for a first capture of the sequence based on the first capture of the sequence; and determining a normalized weight for the first capture of the sequence based on the unnormalized weight for the first capture of the sequence and a normalized weight for a second capture of the sequence.
Claims
1. A method comprising: determining, by computer hardware components, weights for an attention based trajectory prediction, the weights for the attention based trajectory prediction determined by: receiving a sequence of a plurality of captures taken by a sensor including a first capture of the sequence and a second capture of the sequence; determining an unnormalized weight for the first capture of the sequence; determining a normalized weight for the second capture of the sequence; and determining, based on the unnormalized weight for the first capture of the sequence and the normalized weight for the second capture of the sequence, a normalized weight for the first capture of the sequence.
2. The method according to claim 1, further comprising: recursively determining a plurality of normalized weights for the sequence including the normalized weight for the first capture of the sequence, the normalized weight for the second capture of the sequence, and a normalized weight for each other capture of the sequence.
3. The method according to claim 1, wherein the sequence of captures represents a temporal sequence of captures, and wherein at least some of the captures of the sequence, including the second capture of the sequence and the first capture of the sequence, correspond to captures taken by the sensor at different time instances.
4. The method according to claim 3, wherein the first capture of the sequence corresponds to a first time instance and the second capture of the sequence corresponds to a second time instance that is before the first time instance.
5. The method according to claim 1, wherein determining the normalized weight for the first capture of the sequence comprises determining the normalized weight for the first capture of the sequence by: merging, according to a merging rule, the unnormalized weight for the first capture of the sequence and the normalized weight for the second capture of the sequence.
6. The method according to claim 5, wherein the merging rule defines that the unnormalized weight for the first capture and the normalized weight for the second capture are added using respective factors, and the method further comprising: applying a normalization rule to a resulting sum to obtain the normalized weight for the first capture.
7. The method according to claim 6, wherein applying the normalization rule to the resulting sum comprises multiplying the unnormalized weight for the first capture by a first factor, and multiplying the normalized weight for the second capture by a second factor.
8. The method according to claim 6, wherein the normalization rule comprises an exponential normalization.
9. The method according to claim 8, wherein the exponential normalization comprises a SoftMax normalization.
10. The method according to claim 1, further comprising: generating the unnormalized weight for the first capture using a neural network.
11. The method according to claim 10, wherein the neural network comprises a convolutional neural network.
12. The method according to claim 1, further comprising: using the weights for determining a dot product with a feature vector in determining the attention based trajectory prediction to determine a relevance of respective portions of the feature vector.
13. A computer system comprising a plurality of computer hardware components configured to carry out steps for determining weights for an attention based trajectory prediction, the steps including: receiving a sequence of a plurality of captures taken by a sensor including a first capture of the sequence and a second capture of the sequence; determining an unnormalized weight for the first capture of the sequence; determining a normalized weight for the second capture of the sequence; and determining, based on the unnormalized weight for the first capture of the sequence and the normalized weight for the second capture of the sequence, a normalized weight for the first capture of the sequence.
14. The computer system according to claim 13, the steps further comprising: recursively determining a plurality of normalized weights for the sequence including the normalized weight for the first capture of the sequence, the normalized weight for the second capture of the sequence, and a normalized weight for each other capture of the sequence.
15. The computer system according to claim 13, wherein the sequence of captures represents a temporal sequence of captures, and wherein at least some of the captures of the sequence, including the second capture of the sequence and the first capture of the sequence, correspond to captures taken by the sensor at different time instances.
16. The computer system according to claim 15, wherein the first capture of the sequence corresponds to a first time instance and the second capture of the sequence corresponds to a second time instance that is before the first time instance.
17. The computer system according to claim 13, wherein the steps for determining the normalized weight for the first capture of the sequence comprise: merging, according to a merging rule, the unnormalized weight for the first capture of the sequence and the normalized weight for the second capture of the sequence.
18. The computer system according to claim 17, wherein the merging rule defines that the unnormalized weight for the first capture and the normalized weight for the second capture are added using respective factors, and a normalization rule is applied to a resulting sum to obtain the normalized weight for the first capture.
19. The computer system of claim 13, wherein the computer system is part of a vehicle.
20. A non-transitory computer readable medium comprising instructions for configuring a computer system to carry out steps for determining weights for an attention based trajectory prediction by: receiving a sequence of a plurality of captures taken by a sensor including a first capture of the sequence and a second capture of the sequence; determining an unnormalized weight for the first capture of the sequence; determining a normalized weight for the second capture of the sequence; and determining, based on the unnormalized weight for the first capture of the sequence and the normalized weight for the second capture of the sequence, a normalized weight for the first capture of the sequence.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0033] Exemplary embodiments and functions of the present disclosure are described herein in conjunction with the accompanying drawings.
DETAILED DESCRIPTION
[0038] Predicting a trajectory of an object, for example a vehicle, a cyclist, or a pedestrian, is an important task in various automotive applications.
[0039] According to various embodiments, efficient and reliable trajectory prediction may be provided.
[0040] Methods of (or for) trajectory prediction have found widespread use. As one example, trajectory prediction is now used in autonomous applications such as autonomous driving of a vehicle. In such applications, not only image data but also data from other sensors, e.g. radar and lidar sensors, have to be processed and analyzed with respect to their content, and the trajectories of other vehicles, cyclists, or pedestrians may be predicted.
[0041] Since the output of the trajectory prediction, i.e. the predicted trajectory (or predicted trajectories, for example of one or more vehicles, cyclists, and/or pedestrians), forms a safety-critical basis for the autonomous application, e.g., the automatic generation of instructions for autonomously driving a car, a high reliability of the trajectory prediction may be important.
[0042] One problem which hinders a high accuracy of the trajectory prediction is that a single capture as such is not always very reliable. In order to reduce this problem, according to various embodiments, a plurality of subsequent captures, which are part of a sequence such as a video sequence, may be evaluated. This approach may allow for taking into account temporal dependencies between the captures, wherein individual outliers that do not fit a systematic trend can be excluded, or at least their impact on the trajectory prediction can be reduced.
[0043] Attention models and configurations may output two tensors. A first tensor may include a feature vector for each agent in the scene. A second tensor may provide the weights for each agent. The second tensor may correspond to a distribution over the estimated relevancy of the different environment elements. The dot product (multiplication and addition) of the vectors (in other words: of the feature vector and a vector including the weights for each agent) may then be calculated to create a constant sized feature vector in which all agents are incorporated according to their relevancy score.
[0044] According to various embodiments a classical transformers framework as described in various publications may be applied.
[0045] Distribution may refer to the relevance distribution over all neighbors. For example, if a lane change to the left is planned, the vehicle to the left may be 80% important while the vehicle ahead is only 20% relevant.
[0046] The attention vector may be the name of the vector which gives said distribution.
[0047] Neighbors may refer to agents, for example vehicles, around the ego agent.
[0048] Feature values may be understood as the outputs of a given layer of the neural network.
[0050] The respective data, functions or models used or obtained at time step i are indicated by index i.
[0051] In each time step i, a capture 108.sub.i may be provided to a first machine learning model 110.sub.i to obtain attention weights 112.sub.i and features 116.sub.i. The attention weights 112.sub.i may be referred to as unnormalized weights.
[0052] Fused weights 124.sub.i may be determined as a weighted sum of the attention weights 112.sub.i (for example multiplied with weighting factor α) and the attended vector 118.sub.i-1 of the previous time step (for example multiplied with weighting factor β). It is to be understood that according to various embodiments, the softmax weight may be used in the weighted sum instead of the attended vector.
[0053] The fused weights 124.sub.i may undergo a normalization (for example a SoftMax normalization) to obtain the SoftMax weights 114.sub.i.
[0054] It is to be understood that in time step 1, no previous time step is available, and as such, the SoftMax weight 114.sub.1 may be obtained directly based on the attention weights 112.sub.1 (without obtaining fused weights).
[0055] The SoftMax weights 114.sub.i may be multiplied by the features 116.sub.i to obtain attended vectors 118.sub.i, which may be the input to a second machine learning model 120.sub.i which yields the prediction 122.sub.i at time step i.
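The per-time-step processing of paragraphs [0051] to [0055] can be sketched as follows. This is a minimal reading, not the patented implementation: it assumes one unnormalized weight per agent, fixed weighting factors alpha and beta, and uses the previous SoftMax weights in the weighted sum (as the text notes, the attended vector may be used instead).

```python
import numpy as np

def softmax(x):
    # numerically stable SoftMax over a 1-D array
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_and_attend(unnormalized_seq, features_seq, alpha=0.7, beta=0.3):
    """Fuse unnormalized attention weights over time and attend to features.

    unnormalized_seq: list of [agents] arrays (attention weights 112 per time step)
    features_seq: list of [features, agents] arrays (features 116 per time step)
    Returns one attended vector (118) per time step.
    """
    attended_vectors = []
    prev_normalized = np.zeros_like(unnormalized_seq[0])  # zero initialisation
    for y, features in zip(unnormalized_seq, features_seq):
        fused = alpha * y + beta * prev_normalized  # fused weights 124
        normalized = softmax(fused)                 # SoftMax weights 114
        attended_vectors.append(features @ normalized)
        prev_normalized = normalized
    return attended_vectors
```

At time step 1, the zero initialisation makes the fused weights reduce to alpha times the unnormalized weights, matching paragraph [0054].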
[0056] The features may be arranged agent-wise as a tensor (i.e. the first tensor) of the form [features, agents], e.g. with dimensions [64, 7] for 64 features and 7 agents, i.e. with 64 rows and 7 columns. The attention vector (i.e. the second tensor) may then be a relevance distribution over all agents, i.e. a matrix with the dimensions [7, 1], i.e. with 7 rows and 1 column.
[0057] By computing the respective matrix multiplication, a matrix (which actually is a vector) may be obtained in which all features are a weighted sum of the original features over all 7 agents, i.e. the result of the matrix multiplication may have dimensions [64, 1], i.e. 64 rows and 1 column.
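The matrix multiplication described above can be reproduced in a short NumPy sketch; the dimensions [64, 7] and [7, 1] follow the example in the text, and the random values are placeholders for actual features and weights:

```python
import numpy as np

rng = np.random.default_rng(0)

features = rng.normal(size=(64, 7))   # first tensor: 64 features for each of 7 agents
attention = rng.random(size=(7, 1))   # second tensor: one relevance weight per agent
attention /= attention.sum()          # normalise to a distribution over agents

# each of the 64 resulting features is a weighted sum of the
# corresponding feature over all 7 agents
attended = features @ attention

print(attended.shape)  # (64, 1)
```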
[0058] In machine learning applications, predictions may be performed based on a large number of variables. According to various embodiments, multiple, (semi-)independent variables may be taken into account while understanding their interactions.
[0059] For example, traffic may be more complicated than a single agent task, with an ever-increasing number of agents driving the same roads at the same time, creating complicated interactions.
[0060] According to various embodiments, temporal integration may be provided. When processing a time series with a strong temporal consistency, e.g. agent tracks, the temporal dimension may hold information which may be considered according to various embodiments.
[0061] The methods and system provided according to various embodiments may be applied to classification and/or regression.
[0062] According to various embodiments, temporally adaptive attention for trajectory prediction may be provided.
[0063] According to various embodiments, trajectory prediction may be improved by an adaptive temporal method for handling attention.
[0064] To encourage the generation of a distribution over the neighbors while not letting the feature values explode (or vanish) as a result of the dot product, the SoftMax normalization may be applied to the weights.
[0065] According to various embodiments, the following function may be used:
fused.sub.{t=n}=softmax(α*y.sub.{t=n}+β*fused.sub.{t=n-1}).
fused.sub.{t=n} may be the normalized weight related to a first capture (for example at time t=n), fused.sub.{t=n-1} may be the normalized weight related to a second capture (for example at time t=n−1, for example one time step preceding the time step t=n related to the first capture), y.sub.{t=n} may be the unnormalized weight related to the first capture, and α and β may be weighting factors. The weighting factors α and β may be trained; for example, the weighting factors may be improved together with the network. The gradients with respect to these parameters may be calculated and used to update their values at each training iteration. Alternatively, the weighting parameters may be determined heuristically.
[0066] The system may be initialised with zeros. Since adding a zero vector leaves the argument of the SoftMax unchanged, this initialisation does not affect the result of the first iteration. This may be equivalent to setting fused.sub.{t=1}=softmax(α*y.sub.{t=1}).
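The zero initialisation can be checked numerically; a small sketch with arbitrary values and weighting factors:

```python
import numpy as np

def softmax(x):
    # numerically stable SoftMax over a 1-D array
    e = np.exp(x - x.max())
    return e / e.sum()

alpha, beta = 0.7, 0.3
y1 = np.array([5.0, 1.0, 1.0])  # unnormalized weights at the first time step

# with zero initialisation: fused_{t=1} = softmax(alpha*y1 + beta*0)
fused_prev = np.zeros_like(y1)
fused_1 = softmax(alpha * y1 + beta * fused_prev)

# identical to dropping the fusion term entirely at t=1
assert np.allclose(fused_1, softmax(alpha * y1))
```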
[0067] Traffic data is mostly sparse in terms of attention, i.e., neighboring agents which are important at timestep t.sub.0 are also very likely to be relevant at the next timestep t.sub.1. In the majority of cases, the attention may account for past time steps and have a stable attention vector. However, occasionally an event may occur which demands the immediate attention of the model, e.g. hard braking. According to various embodiments, the methods allow the model to overrule the temporal fusion and shift the focus immediately.
[0068] During training, the network may learn that a strong feature value must relate to a critical event. As an example, it may be assumed that one of the following classes is predicted for the ego vehicle: [keep at speed, brake, accelerate]. In the common case, the network may output certainty values for these classes which are roughly within the range of [0, 5], e.g. [5, 1, 1] represents a strong preference for maintaining the current velocity. It is to be noticed that these values are then SoftMax-normalised to a pseudo distribution, that is, they are normalised such that they sum up to 1. When the network recognises a situation that may indicate a need for full and immediate attention, e.g. emergency braking of the leading vehicle, it can output a vector like [0, 50, 0]. Mathematically speaking, such a strong value may suffice to overrule all other values in the system, thus forcing the winning class to a normalized value of 1 or close to 1, no matter what the other values are, since the other values, in the given example, do not exceed 5.
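The overruling effect can be illustrated numerically. The class values follow the [keep at speed, brake, accelerate] example above; the weighting factors are arbitrary choices for the sketch:

```python
import numpy as np

def softmax(x):
    # numerically stable SoftMax over a 1-D array
    e = np.exp(x - x.max())
    return e / e.sum()

alpha, beta = 0.7, 0.3

# common case: moderate values yield a stable pseudo distribution
prev = softmax(alpha * np.array([5.0, 1.0, 1.0]))

# critical event: one strong value overrules the fused history
critical = softmax(alpha * np.array([0.0, 50.0, 0.0]) + beta * prev)

print(critical.round(3))  # the "brake" class dominates, close to 1
```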
[0069] Various embodiments may be used on top of any attention model which produces SoftMax weights.
[0070] It is to be understood that instead of SoftMax, another normalization function may be used, for example normalization may be provided based on min-max feature scaling.
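As mentioned, min-max feature scaling could stand in for the SoftMax; a minimal sketch (the epsilon guard is an implementation detail added here, not part of the source):

```python
import numpy as np

def min_max_normalize(x, eps=1e-12):
    # rescale values into [0, 1]; eps guards against a constant vector
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + eps)

weights = min_max_normalize([0.0, 50.0, 5.0])
print(weights)  # approximately [0., 1., 0.1]
```

Unlike the SoftMax, the result does not sum to 1, so it is a feature scaling rather than a pseudo distribution.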
[0071] According to various embodiments, the normalization happens directly on the network activations/predictions (instead of on the unnormalized past predictions). In other words, first the fusion is carried out and only then the SoftMax is carried out. It has been found that normalizing prior to fusing can have negative effects.
[0072] It is to be understood that “weights” as used herein may mean the parameters which are used to create the activations which are, in turn, normalised (for example SoftMax-normalised).
[0073] Various embodiments may be provided for temporal integration of attention data for trajectory prediction. They may provide an adjustment of normalization and may, for example, be applied in the field of attention for trajectory prediction. This may provide more efficient and/or more reliable methods, for example for adaptive cruise control, path planning, or realistic simulations.
[0075] According to various embodiments, a plurality of normalized weights may be determined recursively with respect to the captures of the sequence.
[0076] According to various embodiments, the sequence of captures may represent a temporal sequence of captures, and wherein at least some of the captures of the sequence including the second capture and the first capture correspond to different time instances.
[0077] According to various embodiments, the first capture may correspond to a first time instance and the second capture may correspond to a second time instance. The second time instance may be before (in other words: preceding) the first time instance.
[0078] According to various embodiments, the normalized weight for the first capture may be determined by merging the unnormalized weight for the first capture and the normalized weight for the second capture according to a merging rule.
[0079] According to various embodiments, the merging rule may define that the unnormalized weight for the first capture and the normalized weight for the second capture are added using respective factors, and a normalization rule may be applied to the resulting sum to obtain the normalized weight for the first capture.
[0080] According to various embodiments, in the sum, the unnormalized weight for the first capture may be multiplied by a first factor, and the normalized weight for the second capture may be multiplied by a second factor.
[0081] According to various embodiments, the normalization rule may include or may be an exponential normalization.
[0082] According to various embodiments, the exponential normalization may include or may be a SoftMax normalization.
[0083] According to various embodiments, the unnormalized weight for the first capture may be generated by using a neural network.
[0084] According to various embodiments, the neural network may include or may be a convolutional neural network.
[0085] According to various embodiments, the weights may be used in a dot product with a feature vector in the attention based method.
[0086] According to various embodiments, the weights may be related to a relevance of respective portions of the feature vector.
[0087] Each of the steps 202, 204, 206 and the further steps described above may be performed by computer hardware components.
[0089] The receiving circuit 302 may be configured to receive a sequence of a plurality of captures taken by a sensor.
[0090] The unnormalized weight determination circuit 304 may be configured to determine an unnormalized weight for a first capture of the sequence based on the first capture of the sequence.
[0091] The normalized weight determination circuit 306 may be configured to determine a normalized weight for the first capture of the sequence based on the unnormalized weight for the first capture of the sequence and a normalized weight for a second capture of the sequence.
[0092] The receiving circuit 302, the unnormalized weight determination circuit 304, and the normalized weight determination circuit 306 may be coupled with each other, e.g. via an electrical connection 308, such as e.g. a cable or a computer bus or via any other suitable electrical connection to exchange electrical signals.
[0093] A “circuit” may be understood as any kind of a logic implementing entity, which may be special purpose circuitry or a processor executing a program stored in a memory, firmware, or any combination thereof.
[0095] The processor 402 may carry out instructions provided in the memory 404. The non-transitory data storage 406 may store a computer program, including the instructions that may be transferred to the memory 404 and then executed by the processor 402. The camera 408 and/or the distance sensor 410 may be used to determine captures.
[0096] The processor 402, the memory 404, and the non-transitory data storage 406 may be coupled with each other, e.g. via an electrical connection 412, such as e.g. a cable or a computer bus or via any other suitable electrical connection to exchange electrical signals. The camera 408 and/or the distance sensor 410 may be coupled to the computer system 400, for example via an external interface, or may be provided as parts of the computer system (in other words: internal to the computer system, for example coupled via the electrical connection 412).
[0097] The terms “coupling” or “connection” are intended to include a direct “coupling” (for example via a physical link) or direct “connection” as well as an indirect “coupling” or indirect “connection” (for example via a logical link), respectively.
[0098] It is to be understood that what has been described for one of the methods above may analogously hold true for the weights determination system 300 and/or for the computer system 400.