VIDEO PERSONNEL RE-IDENTIFICATION METHOD BASED ON TRAJECTORY FUSION IN COMPLEX UNDERGROUND SPACE
20230196586 · 2023-06-22
Inventors
- Yanjing SUN (Xuzhou, CN)
- Xiao YUN (Xuzhou, CN)
- Kaiwen DONG (Xuzhou, CN)
- Kaili SONG (Xuzhou, CN)
- Xiaozhou CHENG (Xuzhou, CN)
CPC classification
G06V10/44 · G06V20/46 · G06V10/26 · G06V10/75 · G06V20/52 · G06V20/41 · G06V10/774 · G06V10/80 (all PHYSICS)
International classification
G06V10/26 · G06V10/44 · G06V10/62 · G06V10/74 · G06V10/80 (all PHYSICS)
Abstract
Disclosed is a video personnel re-identification method based on trajectory fusion in a complex underground space. Accurate personnel trajectory prediction may be realized through the Social GAN model, and a spatio-temporal trajectory fusion model is constructed: personnel trajectory videos that are not affected by occlusion are introduced into the re-identification network to solve the problem of false extraction of apparent visual features caused by occlusion. In addition, a trajectory fusion data set MARS_traj is constructed, in which time frame numbers and space coordinate information are added to the MARS data set.
Claims
1. A video personnel re-identification method based on trajectory fusion in a complex underground space, comprising the following steps: S1, establishing a trajectory fusion data set MARS_traj comprising personnel identity data and video sequences, and adding time frame numbers and space coordinate information for each person in the MARS_traj, wherein test sets in the MARS_traj comprise a retrieval data set query and a candidate data set gallery; S2, judging whether retrieval videos in the retrieval data set query comprise occluded images; inputting sequences of the occluded images into a trajectory prediction model for a future trajectory prediction, and obtaining a prediction set query_pred comprising a predicted trajectory; and, for sequences of images without occlusion, going directly to S4 and performing a fusion feature extraction without the trajectory prediction; S3, fusing the obtained query_pred with candidate videos in the candidate data set gallery to obtain a new fused video set query_TP; and S4, extracting spatio-temporal trajectory fusion features comprising apparent visual information and motion trajectory information by using a video re-identification model for the query_TP, performing a feature distance measure and candidate video ranking, and obtaining final re-identification performance evaluation indexes mAP and Rank-k, wherein mAP represents a mean average precision, Rank-k indicates a possibility of a cumulative match characteristic (CMC) curve matching correctly in the first k videos in the ranked gallery, and the CMC curve reflects cumulative match characteristics of a retrieval precision of an algorithm; and using a Rank-1 result as a video re-identification result.
2. The video personnel re-identification method based on the trajectory fusion in the complex underground space according to claim 1, wherein in the S2, the future trajectory prediction is based on the known historical trajectory and is realized by a Social GAN model: historical trajectory coordinates of known personnel are input, and predicted trajectory coordinates are obtained.
3. The video personnel re-identification method based on the trajectory fusion in the complex underground space according to claim 1, wherein in the S3, in the spatio-temporal trajectory fusion features, a temporal trajectory fusion is to calculate a temporal fusion loss l_t^tem in a time domain considering a continuity of the predicted trajectory and the known historical trajectory, as shown in formula (1):
l_t^tem = max[ϕ(Δt − T), 0]  (1), wherein Δt is a frame difference between a last frame of the video sequences in the query and a first frame of video sequences in the gallery, and a frame constant threshold T and a large constant ϕ determine a temporal continuity of the frame difference Δt between the query and the gallery.
4. The video personnel re-identification method based on the trajectory fusion in the complex underground space according to claim 1, wherein in the S3, in the spatio-temporal trajectory fusion features, a space trajectory fusion is to calculate a space fusion loss l_i^spa considering a dislocation of the predicted trajectory and the frames of the candidate videos in the gallery:
l_i^spa = min(l_j), ∀j ∈ {1, 2, …, N}, N = 2, 3, …, 7  (2), wherein
5. The video personnel re-identification method based on the trajectory fusion in the complex underground space according to claim 1, wherein in the S3, after the temporal fusion loss and the space fusion loss are obtained, a limited fusion loss l_i is obtained according to a formula (3).
6. The video personnel re-identification method based on the trajectory fusion in the complex underground space according to claim 1, wherein in the S4, a new query set query_TP extracted after the fusion of temporal trajectory and space trajectory and the candidate set gallery are sent to a temporal complementary learning network (TCLNet), and finally, group features are aggregated by temporal average pooling to obtain a final fused video feature vector; the TCLNet takes a ResNet-50 network as a backbone network, wherein a temporal saliency boosting (TSB) module and a temporal saliency erasing (TSE) module are inserted; for a T-frame continuous video, the backbone network with the TSB inserted extracts features from each frame, labelled as F = {F_1, F_2, …, F_T}, and the features are equally divided into k groups; each group comprises N continuous frame features C_k = {F_(k−1)N+1, …, F_kN}, each group is input into the TSE, and complementary features are extracted by formula (4):
c_k = TSE(F_(k−1)N+1, …, F_kN) = TSE(C_k)  (4); the distance measure between a video feature vector A(x_1, y_1) in the query_TP and a video feature vector B(x_2, y_2) in the candidate set gallery is calculated by a cosine similarity, as shown in formula (5):
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0027] Technical schemes of the application are further explained in detail with reference to accompanying drawings of the specification.
[0028] An overall framework of an algorithm according to the application is shown in
[0029] A personnel trajectory prediction is to predict a future trajectory of personnel by observing historical trajectory information of the personnel. The application adopts a Social GAN to realize the future trajectory prediction of the personnel. Coordinates of 8 known frames of the personnel are input into the Social GAN model for the trajectory prediction, and 8 frames of predicted trajectory coordinates are obtained. From the perspective of the time domain and the space domain, these predicted trajectory sequences are fused with the candidate videos in the gallery and features are extracted.
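The 8-in / 8-out prediction step above can be sketched as follows. This is a minimal illustration of the interface only: `predict_trajectory` is a hypothetical wrapper around a trained Social GAN generator, and the constant-velocity extrapolation inside it is a stand-in, not the generator itself.

```python
# Sketch of the trajectory-prediction step: 8 observed coordinate frames
# in, 8 predicted coordinate frames out, per the description above.
# `predict_trajectory` is a hypothetical name; a trained Social GAN
# generator would replace the extrapolation used here as a stand-in.
from typing import List, Tuple

Coord = Tuple[float, float]  # (x, y) spatial coordinate of one frame


def predict_trajectory(observed: List[Coord]) -> List[Coord]:
    """Map 8 observed coordinate frames to 8 predicted coordinate frames."""
    assert len(observed) == 8, "the described embodiment observes 8 frames"
    # Constant-velocity extrapolation from the last two observed frames,
    # used only to make the interface runnable.
    (x0, y0), (x1, y1) = observed[-2], observed[-1]
    dx, dy = x1 - x0, y1 - y0
    return [(x1 + dx * k, y1 + dy * k) for k in range(1, 9)]
```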
[0030] (1) Temporal Trajectory Fusion
[0031] A temporal fusion loss l_t^tem is calculated in the time domain considering a continuity of the predicted trajectory and the known historical trajectory, as shown in formula (1):
l_t^tem = max[ϕ(Δt − T), 0]  (1),
[0032] where Δt is a frame difference between a last frame of video sequences in the query and a first frame of the video sequences in the gallery, and a frame constant threshold T and a large constant ϕ determine a temporal continuity of the frame difference Δt between the query and the gallery. By comparing values of the frame constant T, T=4 is selected in an embodiment of the application.
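Formula (1) can be transcribed directly, interpreting ϕ(Δt − T) as the product ϕ · (Δt − T). The value of the large constant ϕ is not specified in the source, so the default below is illustrative; T = 4 follows the embodiment.

```python
# Temporal fusion loss of formula (1): l_t^tem = max[phi * (delta_t - T), 0].
# T = 4 per the embodiment; phi is a large constant whose exact value is
# not given in the source, so 1000.0 here is an illustrative default.
def temporal_fusion_loss(delta_t: int, T: int = 4, phi: float = 1000.0) -> float:
    """Penalize candidates whose first gallery frame lags the last query
    frame by more than the frame constant threshold T."""
    return max(phi * (delta_t - T), 0.0)
```

A frame difference within the threshold incurs zero loss; beyond it, the large ϕ makes the candidate effectively unselectable.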
[0033] (2) Space Trajectory Fusion
[0034] In an actual scene, there are problems such as discontinuous frame sequences between adjacent video sequences, resulting in a dislocation between the frames of the predicted trajectory sequences according to the application and the candidate sequences in the gallery. Therefore, according to the application, a space fusion loss l_i^spa is calculated considering a possible frame error:
l_i^spa = min(l_j), ∀j ∈ {1, 2, …, N}, N = 2, 3, …, 7  (2),
[0035] where (n = 9 − j), p_i represents Euclidean distances between the coordinates corresponding to the predicted trajectory sequences and the candidate sequences in the gallery, and the meanings expressed by different l_N are different, as shown in
[0036] In formula (2), N indicates an allowable deviation range between the predicted trajectory sequences and the frames of the candidate videos. Because the frames are fixed, an excessively small N may reduce the flexibility of fusion matching, while an excessively large N may increase the possibility of fusion matching errors. Therefore, N = 4 is adopted in the embodiment of the application, with which a better experimental result may be obtained.
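The exact definition of l_j is not fully reproduced in the source; a plausible reading, consistent with the surviving fragment (n = 9 − j) for an 8-frame prediction, is that l_j averages the Euclidean distances p_i over the n frame pairs obtained when the candidate sequence is shifted by j − 1 frames. The sketch below encodes that assumed reading; the averaging is an assumption.

```python
# Space fusion loss of formula (2), l_i^spa = min_j l_j for j in 1..N,
# under the ASSUMED reading that each l_j compares the 8-frame predicted
# trajectory with the candidate sequence shifted by j-1 frames, so that
# n = 8 - (j - 1) = 9 - j overlapping pairs remain (matching the source
# fragment "(n = 9 - j)"). The candidate is assumed to have >= 8 frames.
import math
from typing import List, Tuple

Coord = Tuple[float, float]


def euclidean(a: Coord, b: Coord) -> float:
    return math.hypot(a[0] - b[0], a[1] - b[1])


def spatial_fusion_loss(pred: List[Coord], cand: List[Coord], N: int = 4) -> float:
    """Minimum over allowable frame shifts of the mean coordinate distance
    between predicted and candidate trajectories."""
    losses = []
    for j in range(1, N + 1):
        n = len(pred) - (j - 1)  # n = 9 - j when len(pred) == 8
        pairs = zip(pred[:n], cand[j - 1 : j - 1 + n])
        losses.append(sum(euclidean(p, c) for p, c in pairs) / n)
    return min(losses)
```

With N = 4, as in the embodiment, a candidate that is an exact copy of the prediction, or the same copy delayed by up to three frames, yields zero loss.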
[0037] After the temporal fusion loss and the space fusion loss are obtained according to the formulas (1) and (2), a limited fusion loss l_i is obtained according to a formula (3),
[0038] where N_2 is a total number of video sequences in the gallery, and the minimum j value that minimizes l_i is assigned to i_j.
[0039] A new query set query_TP, extracted after the fusion of the temporal trajectory and the space trajectory, and the candidate set gallery are sent to a temporal complementary learning network (TCLNet). This network takes a ResNet-50 network as a backbone network, in which a temporal saliency boosting (TSB) module and a temporal saliency erasing (TSE) module are inserted. For a T-frame continuous video, the backbone network with the TSB inserted extracts features from each frame, labelled as F = {F_1, F_2, …, F_T}, and the features are equally divided into k groups; each group includes N continuous frame features C_k = {F_(k−1)N+1, …, F_kN}, each group is input into the TSE, and complementary features are extracted by formula (4). Finally, group features are aggregated by temporal average pooling to obtain a final fused video feature vector. A distance measure between a video feature vector A(x_1, y_1) in the query_TP and a video feature vector B(x_2, y_2) in the candidate set gallery is calculated by a cosine similarity, as shown in formula (5); the videos in the gallery are ranked according to the distance measure, the re-identification evaluation indexes mAP and Rank-k are calculated according to the ranking result, and the Rank-1 result is taken as the video re-identification result.
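The aggregation and matching steps above can be sketched as follows. Formula (5) is not reproduced in the source, so the standard cosine similarity is assumed; the TSE module itself is not implemented here, only the pooling of its per-group outputs and the distance measure.

```python
# Aggregation and matching steps: temporal average pooling of the k group
# features produced by the TSE modules, followed by the cosine similarity
# assumed for formula (5) (the formula body is absent from the source).
import numpy as np


def temporal_average_pool(group_features: np.ndarray) -> np.ndarray:
    """Aggregate per-group features (shape [k, d]) into one final fused
    video feature vector of dimension d by averaging over the groups."""
    return group_features.mean(axis=0)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Assumed form of the formula (5) distance measure between a query_TP
    feature vector and a gallery feature vector."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Ranking the gallery by descending cosine similarity then yields the ordering from which mAP and Rank-k are computed.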
[0040] According to the application, a trajectory fusion data set MARS_traj suitable for the personnel re-identification in occluded videos based on the trajectory prediction is constructed. In order to test the ability of the model to deal with the occlusion problem, the test sets of the MARS_traj according to the application include a query test set query and a candidate test set gallery, with a total of 744 personnel identities and 9,659 video sequences. In order to verify the personnel trajectory prediction, time frame numbers and space coordinate information are added to the personnel tag of each person in the selected MARS_traj test set, as shown in
[0041] Based on the fusion data set MARS_traj, a flow of the re-identification method according to the application is as follows.
[0042] Input: data set MARS_traj; trajectory prediction model Social GAN; and video personnel re-identification model.
[0043] Output: mAP and Rank-k.
[0044] (1) Spatio-temporal information in a video ID in the query data set is input into the trajectory prediction model.
[0045] (2) A generator in the Social GAN generates a possible prediction trajectory according to the input spatio-temporal information.
[0046] (3) A discriminator in the Social GAN discriminates the generated prediction trajectory to obtain the query_pred accorded with the prediction trajectory.
[0047] (4) An initial value is set to i=1.
[0048] (5) The initial value is set to j=1.
[0049] (6) The temporal fusion loss and the space fusion loss of the jth video in the gallery and the ith video prediction trajectory pred_i in the query_pred are calculated according to the formula (1) and the formula (2).
[0050] (7) j = j + 1; the operation (6) is repeated until j = N_2 (the number of video sequences in the gallery of the MARS_traj data set).
[0051] (8) A minimum limited fusion loss is obtained according to the formula (3), and the j corresponding to the minimum limited fusion loss is assigned to i_j.
[0052] (9) The i_jth video sequence in the gallery is put into the query_TP.
[0053] (10) i = i + 1; the operations (5)-(9) are repeated until i = N_1 (the number of video sequences in the query of the MARS_traj data set).
[0054] (11) Video fusion features of the query_TP and the gallery are extracted.
[0055] (12) The feature distance measure is calculated according to the video features in the query_TP and the gallery, and the gallery is ranked.
[0056] (13) The final re-identification performance evaluation indexes mAP and Rank-k are obtained for the query, and the Rank-1 result is used as the video re-identification result. mAP represents the mean average precision, Rank-k indicates the possibility of the cumulative match characteristic (CMC) curve matching correctly in the first k videos in the ranked gallery, and the CMC curve reflects the cumulative match characteristics of the retrieval precision of the algorithm.
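The nested loop of operations (4)-(10) can be sketched as below. The combination rule of formula (3) is not reproduced in the source, so the limited fusion loss is passed in as a caller-supplied function `fused_loss` (a hypothetical name) rather than implemented.

```python
# Operations (4)-(10): for each predicted trajectory in query_pred, find
# the gallery sequence minimizing the limited fusion loss and collect it
# into query_TP. `fused_loss` stands in for formula (3), whose exact
# combination of formulas (1) and (2) is not given in the source.
def build_query_tp(query_pred, gallery, fused_loss):
    """Select, for each prediction, the gallery entry with minimum
    limited fusion loss (the index assigned to i_j in operation (8))."""
    query_tp = []
    for pred in query_pred:                       # operations (4), (10)
        best_j = min(range(len(gallery)),         # operations (5)-(8)
                     key=lambda j: fused_loss(pred, gallery[j]))
        query_tp.append(gallery[best_j])          # operation (9)
    return query_tp
```

Operations (11)-(13) then extract fused features from query_TP and the gallery, rank by the distance measure, and report mAP and Rank-k.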
[0057] The above are only preferred embodiments of the application, and the scope of protection of the application is not limited to the above embodiments. All equivalent modifications or changes made by those of ordinary skill in the art according to the disclosure of the application should fall within the scope of protection stated in the claims.