Pedestrian trajectory prediction method and system based on multi-interaction spatiotemporal graph network

11495055 · 2022-11-08

Abstract

The present invention discloses a pedestrian trajectory prediction method and system based on a multi-interaction spatio-temporal graph network, which belong to the field of pedestrian trajectory prediction. The method includes: extracting multi-interaction features of pedestrians in video frames; for each frame in a video sequence, abstracting each pedestrian in the frame as a vertex and connecting the pedestrian with the other pedestrians to form edges, where the vertex attribute is the multi-interaction feature of the pedestrian, so as to obtain a multi-interaction spatio-temporal graph network; for each multi-interaction spatio-temporal graph, obtaining spatial dependencies of each pedestrian with the other pedestrians in the spatio-temporal graph, and optimizing the attribute of each vertex through the spatial dependencies between pedestrians; and connecting the vertices of the same pedestrian at adjacent time points to obtain time dependencies of the pedestrians, and then inferring a trajectory at future time points.

Claims

1. A method for multi-interaction pedestrian feature extraction, wherein for each frame in an input video sequence, the following operations are performed: extracting a global context feature of the frame through scene segmentation and a convolution operation; extracting a local context feature of a pedestrian in the frame through gridding and location mapping; employing an attention mechanism to fuse the global context feature and the local context feature of the pedestrian to obtain a global-local context feature of the pedestrian in the frame; extracting a trajectory feature of the pedestrian in the frame; and concatenating the global-local context feature and the trajectory feature of the pedestrian in the frame to obtain a multi-interaction feature of the pedestrian in the frame, wherein the attention mechanism is as follows:

$$c_{t|i} = \mathrm{Sum}\left(\mathrm{Softmax}(c_{g|t})\right) + c_{l|t|i}$$

wherein $i$ represents an $i$-th pedestrian, $t$ represents a $t$-th video frame, $c_{t|i}$ represents the global-local context feature, $c_{g|t}$ represents the global context feature, $c_{l|t|i}$ represents the local context feature, $\mathrm{Softmax}(\cdot)$ represents a Softmax operation, and $\mathrm{Sum}(\cdot)$ represents adding up each element.

2. A method for pedestrian trajectory prediction based on a multi-interaction spatio-temporal graph network, wherein the method comprises: S1: employing the method set forth in claim 1 to extract a multi-interaction feature of pedestrians in video frames; S2: for each frame in a video sequence, abstracting each pedestrian in the frame as a vertex, connecting the pedestrians with other pedestrians in a scene to form edges, a vertex attribute being the multi-interaction feature corresponding to the pedestrian so as to obtain a multi-interaction spatio-temporal graph network; S3: for each multi-interaction spatio-temporal graph, obtaining spatial dependencies of each pedestrian with other pedestrians in the spatio-temporal graph, and optimizing the attribute of each vertex based on the spatial dependencies between pedestrians; and S4: connecting vertices of the same pedestrian at adjacent time points to obtain time dependencies of each pedestrian so as to infer a trajectory thereof at a future time point.

3. The method according to claim 2, wherein in step S3, a GCN is used to measure interaction weights between pedestrians, and in the GCN, a weight adjacency matrix $A_t$ of a spatial graph is as shown below:

$$a_{t|ij} = \begin{cases} 1/\lVert d_{t|i} - d_{t|j} \rVert_2, & i \neq j \\ 0, & i = j \end{cases}$$

wherein $t$ represents a time point, $i, j$ represent pedestrian serial numbers, and $\lVert d_{t|i} - d_{t|j} \rVert_2$ represents a Euclidean distance between pedestrians $i$ and $j$; and vertex features are optimized and aggregated by the GCN:

$$\mathcal{G}(M_t) = \sigma\!\left(\Lambda_t^{-\frac{1}{2}} (A_t + I)\, \Lambda_t^{-\frac{1}{2}} M_t W_{GCN}\right)$$

wherein $\Lambda_t$ represents a vertex degree matrix of $A_t + I$, $I$ represents an identity matrix, $\Lambda_t^{-\frac{1}{2}}(A_t + I)\Lambda_t^{-\frac{1}{2}}$ represents a normalized Laplacian matrix, $W_{GCN}$ represents a weight matrix of a learned linear transformation, $\sigma(\cdot)$ represents an activation function, $\mathcal{G}(\cdot)$ represents a GCN process, and $M_t$ represents multi-interaction features of all pedestrians in the $t$-th frame.

4. The method according to claim 3, wherein after the GCN, Transformers are connected in series, and a self-attention mechanism of the Transformers is as follows:

$$\mathcal{T}^n(M_{t|i}) = \mathrm{Softmax}\!\left(\frac{(q_{t|i}^n)^T\, k_{t|i}^n}{d_k^{1/2}}\right)(l_{t|j}^n)^T + M_{t|i}$$

wherein $q_{t|i}^n$ represents a query vector, $k_{t|i}^n$ represents a key vector, $d_k$ is a dimension of each query, $l_{t|j}^n$ represents a value vector, $n$ represents an attention head sequence number, $\mathcal{T}^n(\cdot)$ represents a self-attention mechanism process of the Transformers, and $M_{t|i}$ represents the multi-interaction features; and a multi-head attention mechanism is used to capture richer information from different aspects:

$$\mathcal{T}(M_{t|i}) = \mathrm{Concat}_{n=1,2,\dots,N}\left(\mathcal{T}^n(M_{t|i})\right)$$

wherein $N$ represents a number of heads of attention, $\mathcal{T}(\cdot)$ represents a multi-head attention mechanism process, and $\mathrm{Concat}_{n=1,2,\dots,N}$ represents the concatenation operation.

5. A system for pedestrian trajectory prediction based on a multi-interaction spatio-temporal graph network, the system comprising: a computer-readable storage medium and a processor, wherein the computer-readable storage medium is configured to store executable instructions, and the processor is configured to read the executable instructions stored in the computer-readable storage medium and execute the pedestrian trajectory prediction method based on the multi-interaction spatio-temporal graph network according to claim 2.

6. A non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium comprises a stored computer program; when the computer program is executed by a processor, a device where the non-transitory computer-readable storage medium is located is controlled to execute the method for pedestrian trajectory prediction based on the multi-interaction spatio-temporal graph network according to claim 2.

7. A non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium comprises a stored computer program; when the computer program is executed by a processor, a device where the non-transitory computer-readable storage medium is located is controlled to execute the method for multi-interaction pedestrian feature extraction according to claim 1.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a flowchart of a method for predicting pedestrian trajectory using a Transformer in the existing technologies;

(2) FIG. 2 is a flowchart of a pedestrian trajectory prediction method based on a multi-interaction spatio-temporal graph network provided by the present invention;

(3) FIG. 3 is a schematic diagram of a multi-interaction feature extraction process provided by the present invention;

(4) FIG. 4 is a schematic diagram of a multi-interaction optimization process provided by the present invention, where the dots represent pedestrians, and the lines between the dots represent a process of updating and optimizing vertex features.

DETAILED DESCRIPTION

(5) To make the objects, technical solutions and advantages of the present invention clear, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention, but not to limit the present invention. In addition, the technical features involved in the various embodiments of the present invention as described below can be combined with each other as long as there is no conflict with each other.

(6) The present invention provides a pedestrian trajectory prediction method based on a multi-interaction spatio-temporal graph network. As shown in FIG. 2, the method can be divided into three parts: multi-interaction feature extraction, multi-interaction graph modeling, and multi-interaction optimization, which will be described in detail below. It is assumed that there are $X$ pedestrians in an observed video sequence of length $T_{obs}$; the goal of the present invention is to predict the trajectory coordinates of these $X$ pedestrians in a future sequence of length $T_{pred}$.

(7) (1) Multi-Interaction Feature Extraction

(8) In the existing technologies, either local context information alone or global context information alone is used. Without global context, some important information far from pedestrians may be lost. For example, in the case where a pedestrian wants to take a bus that is in the distance, the location of the bus plays a key role in trajectory prediction. On the other hand, without local context information, it would be difficult to clearly model interactions with surrounding scenes and objects. For example, a pedestrian needs to go around an obstacle in front of him.

(9) The present invention proposes a new feature extraction method. As shown in FIG. 3, both local and global context information are first extracted and then combined through an attention mechanism. In this way, both the global context with rich global information and the local context with key local information around a particular pedestrian are involved in capturing the interaction with the context. In addition, trajectory features are also extracted from the videos. The trajectory features and the context information features are combined to jointly represent the multi-interaction features.

(10) To extract global context information, the present invention uses a pre-trained scene segmentation model to extract scene feature map(s) of a video frame, and then crops them through convolution operations. In this way, context features around specific pedestrians can be easily extracted. At the same time, the global context can be compressed without losing key information. After these steps, the present invention obtains the global context feature $c_{g|t}$.
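For illustration only, the following Python/PyTorch sketch shows one way this step could look. The patent does not name a specific segmentation network or channel sizes; DeepLabV3 with a ResNet-50 backbone and the 32-channel compression below are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models.segmentation import deeplabv3_resnet50

class GlobalContextExtractor(nn.Module):
    """Sketch: pre-trained scene segmentation + convolutional compression.

    The choice of DeepLabV3/ResNet-50 and the output channel count are
    assumptions; the patent only requires a pre-trained scene segmentation
    model followed by convolution operations.
    """
    def __init__(self, out_channels: int = 32):
        super().__init__()
        seg = deeplabv3_resnet50(weights="DEFAULT")
        self.backbone = seg.backbone              # pre-trained scene features
        # Convolution that compresses the scene feature map without
        # losing key information.
        self.compress = nn.Conv2d(2048, out_channels, kernel_size=3, padding=1)

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        # frame: (B, 3, H, W) video frame
        feats = self.backbone(frame)["out"]       # (B, 2048, H', W')
        return self.compress(feats)               # global context feature c_{g|t}
```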

(11) To extract the local context feature $c_{l|t|i}$ of each pedestrian in a video frame, the present invention establishes a connection between each pedestrian and a pixel grid. For example, the original video frame can be divided into P×P grids, and the global context features can also be divided into P×P grids. If a pedestrian belongs to the grid at position (1,1) after the original video frame is divided, then its local context feature is represented by the grid feature at position (1,1) of the global context feature. In this way, the present invention can easily obtain local context features, which are closely related to the global context features and contain rich context information about the surrounding scene related to the pedestrian.
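A minimal sketch of the gridding and location mapping follows, assuming the global context feature has already been pooled to a P×P grid; the grid size and tensor shapes are illustrative assumptions.

```python
import torch

def local_context_feature(global_ctx: torch.Tensor,
                          ped_xy: torch.Tensor,
                          frame_hw: tuple[int, int],
                          P: int = 8) -> torch.Tensor:
    """Sketch of grid-based location mapping (P = 8 is an assumed grid size).

    global_ctx: (C, P, P) global context feature pooled to a P x P grid
    ped_xy:     (2,) pedestrian pixel position (x, y) in the original frame
    frame_hw:   (H, W) size of the original frame
    """
    H, W = frame_hw
    # Map the pedestrian's pixel position to its (row, col) grid cell.
    row = min(int(ped_xy[1] * P / H), P - 1)
    col = min(int(ped_xy[0] * P / W), P - 1)
    # The local context c_{l|t|i} is the corresponding cell of the
    # global context feature map.
    return global_ctx[:, row, col]
```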

(12) After acquiring the global and local context features, the present invention further proposes an attention mechanism. It not only emphasizes local information, but also fully considers global information. First, the present invention utilizes the Softmax operation to optimize the weight of each grid in $c_{g|t}$. Accordingly, each grid is adaptively assigned a weight that measures its impact on a particular pedestrian. Second, the present invention further compresses the global context information by summing the optimized features of each grid. Third, the present invention adds up $c_{l|t|i}$ and the processed $c_{g|t}$ to obtain the global-local context feature $c_{t|i}$. This step combines global and local information in a compact and efficient way. In addition, the local context feature weights are further enhanced, since the global context also contains the local context. The attention mechanism is defined in formula (1) below:

$$c_{t|i} = \mathrm{Sum}\left(\mathrm{Softmax}(c_{g|t})\right) + c_{l|t|i} \tag{1}$$

(13) where $\mathrm{Softmax}(\cdot)$ represents the Softmax operation, and $\mathrm{Sum}(\cdot)$ represents adding up each element.
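A sketch of formula (1) follows. Read literally, summing the Softmax outputs alone would give a constant per channel, so the code follows the description's "summing the optimized features of each grid" and computes a Softmax-weighted sum over grid cells; this reading, and the tensor shapes, are assumptions.

```python
import torch

def global_local_attention(global_ctx: torch.Tensor,
                           local_ctx: torch.Tensor) -> torch.Tensor:
    """Sketch of formula (1): c_{t|i} = Sum(Softmax(c_{g|t})) + c_{l|t|i}.

    global_ctx: (C, P, P) global context feature c_{g|t}
    local_ctx:  (C,)      local context feature c_{l|t|i}
    """
    C = global_ctx.shape[0]
    grids = global_ctx.reshape(C, -1)         # (C, P*P) flattened grid cells
    weights = torch.softmax(grids, dim=-1)    # adaptive per-grid weights
    pooled = (weights * grids).sum(dim=-1)    # compress by summing over grids
    return pooled + local_ctx                 # global-local feature c_{t|i}
```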

(14) (2) Multi-Interaction Graph Modeling

(15) After obtaining the features containing contextual interaction information, multiple interactions can be modeled using a spatio-temporal graph(s). This module will be described in three parts: graph construction, spatial aggregation mechanism, and temporal aggregation mechanism.

(16) (i) Graph Construction

(17) The interaction of the pedestrians and the context in the present invention is embodied in the construction of a graph(s). The construction of the graph(s) can be divided into three parts: the properties of vertices and edges, the connectivity of the graph(s), and the features of vertices.

(18) First, graphs are connected in both the time and space dimensions. This can be expressed as $G = \{G_t \mid t \in \{1, \dots, T_{obs}\}\}$, where $G_t$ represents the spatial graph of the $t$-th frame. $G_t$ is defined as $G_t = (V_t, E_t)$, where $V_t$ represents the set of vertices in the $t$-th frame and $E_t$ represents the set of edges in the $t$-th frame. In addition, $V_t = \{v_{t|i} \mid i \in \{1, \dots, X\}\}$.

(19) Second, in the spatial dimension, the present invention adopts a fully connected manner, and all pedestrians are connected with other pedestrians in the scene. In addition, the vertices of the same pedestrian at adjacent time points in the time dimension are also connected.

(20) Third, in order to introduce the interactions between context information into the method of the present invention, the present invention combines the trajectory feature and the context interaction feature as a vertex feature. In this way, interactions with context can be obtained in an efficient and simple way and incorporated into a graph network structure, thereby facilitating subsequent aggregation and prediction.
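For illustration, a minimal sketch of how one frame's graph could be assembled under these conventions; the feature dimensions and the explicit edge list are assumptions used only to make the construction concrete.

```python
import torch

def build_frame_graph(traj_feats: torch.Tensor, ctx_feats: torch.Tensor):
    """Sketch of graph construction for one frame t.

    traj_feats: (X, Dt) trajectory features, one row per pedestrian vertex
    ctx_feats:  (X, Dc) global-local context features c_{t|i}
    Returns the vertex features M_t and a fully connected edge list E_t.
    """
    X = traj_feats.shape[0]
    # Vertex feature = trajectory feature combined with the context
    # interaction feature.
    M_t = torch.cat([traj_feats, ctx_feats], dim=-1)      # (X, Dt + Dc)
    # Fully connected in space: every pedestrian is linked to every other.
    E_t = [(i, j) for i in range(X) for j in range(X) if i != j]
    return M_t, E_t
```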

(21) (ii) Spatial Aggregation Mechanism

(22) In the spatial dimension, the present invention uses a GCN to measure the interaction weights between pedestrians. Specifically, in the GCN, the Euclidean distance between pedestrians $i$ and $j$ is used to compute the adjacency matrix of the spatial graph. The weight adjacency matrix $A_t$ is defined in formula (2) below:

(23) $$a_{t|ij} = \begin{cases} 1/\lVert d_{t|i} - d_{t|j} \rVert_2, & i \neq j \\ 0, & i = j \end{cases} \tag{2}$$

(24) where $t$ represents the time, $i, j$ represent the pedestrian serial numbers, and $\lVert d_{t|i} - d_{t|j} \rVert_2$ represents the Euclidean distance between the pedestrians $i$ and $j$.

(25) Next, the vertex features are optimized and aggregated by GCN, as shown in formula (3) below:

(26) $$\mathcal{G}(M_t) = \sigma\!\left(\Lambda_t^{-\frac{1}{2}} (A_t + I)\, \Lambda_t^{-\frac{1}{2}} M_t W_{GCN}\right) \tag{3}$$

(27) where $\Lambda_t$ represents the vertex degree matrix of $A_t + I$, $I$ represents an identity matrix, $\Lambda_t^{-\frac{1}{2}}(A_t + I)\Lambda_t^{-\frac{1}{2}}$ represents a normalized Laplacian matrix, $W_{GCN}$ represents a weight matrix of the learned linear transformation, $\sigma(\cdot)$ represents an activation function, $\mathcal{G}(\cdot)$ represents the GCN process, and $M_t$ represents the multi-interaction features of all pedestrians in the $t$-th frame.
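The spatial aggregation of formulas (2) and (3) can be sketched as follows; the pedestrian positions `d_t`, the ReLU activation, and the small epsilon used to avoid division by zero are assumptions not fixed by the text.

```python
import torch

def gcn_layer(M_t: torch.Tensor, d_t: torch.Tensor,
              W_gcn: torch.Tensor) -> torch.Tensor:
    """Sketch of formulas (2)-(3): inverse-distance adjacency + GCN update.

    M_t:   (X, D) multi-interaction features of all pedestrians in frame t
    d_t:   (X, 2) pedestrian positions for the Euclidean distances
    W_gcn: (D, D) weight matrix of the learned linear transformation
    """
    X = M_t.shape[0]
    dist = torch.cdist(d_t, d_t)                  # ||d_{t|i} - d_{t|j}||_2
    # a_{t|ij} = 1/dist for i != j, 0 on the diagonal (formula (2)).
    A_t = torch.where(dist > 0, 1.0 / dist.clamp(min=1e-6),
                      torch.zeros_like(dist))
    A_hat = A_t + torch.eye(X)                    # A_t + I
    deg = A_hat.sum(dim=1)                        # vertex degrees, Λ_t
    D_inv_sqrt = torch.diag(deg.clamp(min=1e-6).pow(-0.5))
    L_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt      # normalized Laplacian
    # σ(Λ^{-1/2}(A_t + I)Λ^{-1/2} M_t W_GCN), with ReLU as the assumed σ.
    return torch.relu(L_norm @ M_t @ W_gcn)
```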

(30) (iii) Time Aggregation Mechanism

(31) After obtaining the aggregated compact features, which include context interactions and interactions with other pedestrians in the spatial dimension, the temporal interaction should also be considered. This corresponds to the interaction of the pedestrian with himself or herself, since the future trajectory of a pedestrian is profoundly affected by his or her past trajectory. The graph in the time dimension connects the vertices of the same pedestrian at different time points. Next, operations such as causal convolution, weighted normalization, activation functions, dropout, and residual connections (an improved CNN) are used to update the pedestrian vertices, and the updated vertex features include interactions with the pedestrian's past trajectory. In addition, the time aggregation mechanism also produces a Gaussian distribution of future trajectories, which facilitates prediction of diverse possible future trajectories.

(32) The improved CNN includes the following layers in series: the first layer is used to reduce the dimension of each vertex to 5 dimensions, representing the X/Y mean, the X/Y standard deviation, and the correlation coefficient of the predicted trajectory, respectively; the second layer is used to change the length of the observed video frame sequence into the sequence length to be predicted; and each of the third to fifth layers includes operations such as causal convolution, weighted normalization, activation functions, and residual connections to obtain temporal interaction-dependent features of the pedestrian.
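A sketch of this five-layer stack follows; the kernel sizes, padding, PReLU activations, and the use of plain rather than strictly causal, weight-normalized convolutions are simplifying assumptions.

```python
import torch
import torch.nn as nn

class TemporalAggregator(nn.Module):
    """Sketch of the improved CNN (layer hyper-parameters are assumptions).

    Input:  (B, D, T_obs, X)  spatially aggregated vertex features
    Output: (B, 5, T_pred, X) bi-variate Gaussian parameters per vertex:
            X/Y mean, X/Y standard deviation, and correlation coefficient.
    """
    def __init__(self, in_dim: int, t_obs: int = 8, t_pred: int = 12):
        super().__init__()
        # Layer 1: reduce each vertex feature to 5 dimensions.
        self.to_gauss = nn.Conv2d(in_dim, 5, kernel_size=1)
        # Layer 2: change the sequence length T_obs -> T_pred by convolving
        # over the time axis treated as channels.
        self.time_map = nn.Conv2d(t_obs, t_pred, kernel_size=3, padding=1)
        # Layers 3-5: convolution + activation with residual connections.
        self.refine = nn.ModuleList(
            nn.Sequential(nn.Conv2d(t_pred, t_pred, kernel_size=3, padding=1),
                          nn.PReLU())
            for _ in range(3))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        g = self.to_gauss(h)                       # (B, 5, T_obs, X)
        g = self.time_map(g.permute(0, 2, 1, 3))   # (B, T_pred, 5, X)
        for layer in self.refine:
            g = g + layer(g)                       # residual update
        return g.permute(0, 2, 1, 3)               # (B, 5, T_pred, X)
```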

(33) On the basis of the X/Y mean, X/Y standard deviation and correlation coefficient of the predicted trajectory, the final predicted trajectory coordinates can be obtained.
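For illustration, one way to turn the five predicted parameters into coordinates is to sample the correlated bi-variate Gaussian; the exp/tanh transforms used to keep the standard deviations positive and the correlation in (-1, 1) are assumptions.

```python
import torch

def sample_trajectory(gauss: torch.Tensor) -> torch.Tensor:
    """Sketch: draw predicted coordinates from the bi-variate Gaussian.

    gauss: (5, T_pred, X) stacked as [mu_x, mu_y, sigma_x, sigma_y, rho]
    Returns (T_pred, X, 2) sampled trajectory coordinates.
    """
    mu_x, mu_y, sx, sy, rho = gauss.unbind(dim=0)
    sx, sy = sx.exp(), sy.exp()          # keep standard deviations positive
    rho = torch.tanh(rho)                # keep the correlation in (-1, 1)
    # Reparameterized sample of a correlated 2-D Gaussian.
    z1, z2 = torch.randn_like(mu_x), torch.randn_like(mu_y)
    x = mu_x + sx * z1
    y = mu_y + sy * (rho * z1 + (1 - rho ** 2).sqrt() * z2)
    return torch.stack([x, y], dim=-1)
```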

(34) (3) Multi-Interaction Optimization

(35) The pedestrian vertex features obtained through the above graph modeling can handle some typical and general scenes. However, the existing graph structure-based optimization methods still have room for improvement due to the low efficiency of global information transfer. First, the interactions between pedestrians are complex and cannot be measured by distance as a factor alone. Second, the graph model has limitations in transmitting global information. Third, GCNs have shortcomings in pooling multimodal features. Therefore, an attention mechanism is needed to better model these factors. In addition, it should fully adapt to the graph structure.

(36) In the present invention, the Transformer and the GCN are combined for further optimization in a new way, different from the existing technologies. The present invention also needs to consider how to adapt the Transformer to the spatial dimension and how to fit it to the graph structure. First, the self-attention mechanism is suitable for transferring information between vertices. Second, compared to CNNs with limited receptive fields, the long-distance property of the Transformer makes it possible to efficiently utilize global information from shallow layers to deep layers. FIG. 4 shows the process of multi-interaction optimization.

(37) For illustration, the present invention only describes the optimization process of one vertex. Based on the above considerations, for $M_{t|i}$, its query vector, key vector, and value vector are correspondingly marked as $q_{t|i}$, $k_{t|i}$, and $l_{t|i}$. The Transformer's self-attention mechanism is shown in formula (4) below:

(38) $$\mathcal{T}^n(M_{t|i}) = \mathrm{Softmax}\!\left(\frac{(q_{t|i}^n)^T\, k_{t|i}^n}{d_k^{1/2}}\right)(l_{t|j}^n)^T + M_{t|i} \tag{4}$$

(39) where $q_{t|i}^n$ represents the query vector, $k_{t|i}^n$ represents the key vector, $d_k$ is the dimension of each query, $l_{t|j}^n$ represents the value vector, $n$ represents an attention head sequence number, $\mathcal{T}^n(\cdot)$ represents a self-attention mechanism process of the Transformer, and $M_{t|i}$ represents the multi-interaction features; the multi-head attention mechanism is then

$$\mathcal{T}(M_{t|i}) = \mathrm{Concat}_{n=1,2,\dots,N}\left(\mathcal{T}^n(M_{t|i})\right) \tag{5}$$

(40) where $N$ represents the number of heads of attention, $\mathcal{T}(\cdot)$ represents a multi-head attention mechanism process, and $\mathrm{Concat}_{n=1,2,\dots,N}$ represents a concatenation operation.
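A sketch of formulas (4) and (5) applied to all $X$ vertices of one frame at once; the shared linear q/k/v projection and the per-head dimensioning are standard Transformer conventions that the patent does not spell out.

```python
import torch
import torch.nn as nn

class VertexSelfAttention(nn.Module):
    """Sketch of formulas (4)-(5): multi-head attention over graph vertices."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.d_k = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim)   # query/key/value projections

    def forward(self, M_t: torch.Tensor) -> torch.Tensor:
        # M_t: (X, dim) multi-interaction features of all vertices in frame t
        X, dim = M_t.shape
        q, k, v = self.qkv(M_t).chunk(3, dim=-1)
        # Split into N heads: (n_heads, X, d_k).
        q, k, v = (z.reshape(X, self.n_heads, self.d_k).transpose(0, 1)
                   for z in (q, k, v))
        # Scaled dot-product attention, as in formula (4).
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        heads = attn @ v                              # T^n(M_{t|i}) per head
        out = heads.transpose(0, 1).reshape(X, dim)   # Concat of formula (5)
        return out + M_t                              # residual term in (4)
```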

(41) Of course, in the multi-interaction optimization module, the GCN and the Transformer can also be combined in the reverse order, with the Transformer updating the weights first and the GCN updating them afterwards.

(42) After spatial aggregation and time aggregation, the attributes of the vertices are updated. Training samples are then used for training. In this embodiment, each training sample consists of 8 adjacent observed video frames, and the corresponding label is the trajectories of the pedestrians in the 12 consecutive frames after the observed video frames. After training is completed, trajectory prediction can be performed using 8 observed video frames.
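The patent does not name a training objective; a common choice for models that output bi-variate Gaussian parameters is the negative log-likelihood of the ground-truth future points, sketched here under that assumption.

```python
import torch

def bivariate_nll(gauss: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Sketch of an assumed training loss: bi-variate Gaussian NLL.

    gauss:  (5, T_pred, X) [mu_x, mu_y, sigma_x, sigma_y, rho] per vertex
    target: (T_pred, X, 2) ground-truth future coordinates (the labels)
    """
    mu_x, mu_y, sx, sy, rho = gauss.unbind(dim=0)
    sx, sy, rho = sx.exp(), sy.exp(), torch.tanh(rho)
    dx = (target[..., 0] - mu_x) / sx
    dy = (target[..., 1] - mu_y) / sy
    one_m_r2 = (1 - rho ** 2).clamp(min=1e-6)
    # Log-density of a correlated 2-D Gaussian at the ground truth.
    log_pdf = (-(dx ** 2 + dy ** 2 - 2 * rho * dx * dy) / (2 * one_m_r2)
               - torch.log(2 * torch.pi * sx * sy * one_m_r2.sqrt()))
    return -log_pdf.mean()
```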

(43) A person skilled in the art can easily understand that the above descriptions are only some preferred embodiments of the present invention, and are not intended to limit the present invention. Any modifications, equivalent replacements and improvements made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.