END-TO-END MULTIMODAL GAIT RECOGNITION METHOD BASED ON DEEP LEARNING
20220343686 · 2022-10-27
CPC classification
G06V10/46 (PHYSICS)
G06V10/26 (PHYSICS)
G06V40/25 (PHYSICS)
International classification
G06V10/26 (PHYSICS)
G06V10/46 (PHYSICS)
G06V10/80 (PHYSICS)
Abstract
An end-to-end multimodal gait recognition method based on deep learning includes: first extracting gait appearance features (color, texture and the like) from RGB video frames, and obtaining a mask by semantic segmentation of the RGB video frames; then extracting gait mask features (contour and the like) from the mask; and finally performing fusion and recognition on the two kinds of features. The method extracts the gait appearance feature and the mask feature with an improved GaitSet, improves semantic segmentation speed while maintaining accuracy through a simplified FCN, and fuses the gait appearance feature and the mask feature to obtain a more complete information representation.
Claims
1. An end-to-end multimodal gait recognition method based on deep learning, comprising the following steps: step 1: accessing a pedestrian gait image sequence or video and inputting the pedestrian gait image sequence or video into a gait appearance feature extraction branch based on a GaitSet network to extract an appearance feature F.sub.App comprising color and texture; step 2: performing, through a simplified fully convolutional network (FCN), namely a semantic segmentation branch, a semantic segmentation on an image to obtain a mask containing only pedestrian gait contour information; step 3: extracting a pedestrian gait mask feature F.sub.Mask comprising a contour from the mask by a gait mask feature extraction branch based on the GaitSet network; step 4: setting appropriate weights for the extracted features to perform a feature fusion, to obtain a fusion feature for a subsequent calculation; step 5: when a network is trained, for the fusion feature, calculating a triplet loss L.sub.BA+ and a cross entropy loss L.sub.Cross of the semantic segmentation branch to perform a Loss fusion, wherein the network comprises the gait appearance feature extraction branch, the semantic segmentation branch and the gait mask feature extraction branch; and step 6: when a trained network is configured for forward reasoning, calculating Euclidean distances between fusion features of a pedestrian gait sequence to be retrieved and fusion features of a pedestrian gait sequence in a retrieval database, performing ranking according to the Euclidean distances, and calculating a recognition accuracy of rank-1 according to the Euclidean distances; wherein in step 1, the gait appearance feature extraction branch is obtained by improving on the GaitSet network; the improving on the GaitSet network comprises: changing a number of input channels in an input layer from 1 to 3 to input a red-green-blue (RGB) image, replacing a global maximum pooling in a spatial pyramid pooling with a sum of the global maximum pooling and a global average pooling, and replacing a horizontal pyramid mapping in the GaitSet network.
2. (canceled)
3. The end-to-end multimodal gait recognition method based on deep learning according to claim 1, wherein an attention mechanism is configured to promote useful features, and an independent fully connected layer is configured to map the useful features.
4. The end-to-end multimodal gait recognition method based on deep learning according to claim 1, wherein in step 2, the simplified FCN comprises nine convolutional layers and one upsampling layer, wherein first six convolutional layers share weights with first six convolutional layers of the gait appearance feature extraction branch.
5. The end-to-end multimodal gait recognition method based on deep learning according to claim 1, wherein in step 3, a number of input channels of an input layer of the gait mask feature extraction branch is 1, and a rest structure is identical to the gait appearance feature extraction branch.
6. The end-to-end multimodal gait recognition method based on deep learning according to claim 1, wherein in step 4, a specific process of the feature fusion is F=p*F.sub.App+q*F.sub.Mask, wherein F represents the fusion feature, p represents a weight of the appearance feature F.sub.App, and q represents a weight of the pedestrian gait mask feature F.sub.Mask.
7. The end-to-end multimodal gait recognition method based on deep learning according to claim 1, wherein in step 5, a specific process of the Loss fusion is Loss=r*L.sub.BA++s*L.sub.Cross, wherein Loss represents a fusion loss, r represents a weight of the triplet loss L.sub.BA+, and s represents a weight of the cross entropy loss L.sub.Cross.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] In order to more clearly demonstrate the network structure and the training and forward reasoning process in the embodiment of the present invention, the drawings used in the embodiment are briefly introduced as follows.
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0018] In order to describe the present invention in detail, the technical solution of the present invention is described in detail below in combination with the drawings and a specific embodiment.
[0019] The present invention provides a general end-to-end multimodal gait recognition method based on deep learning, as shown in the accompanying drawings.
[0022] Embodiment
[0023] A multimodal gait recognition method based on deep learning includes the following steps:
[0024] Step 1. A gait sequence or video is accessed to extract the pedestrian gait appearance feature F.sub.App.
[0025] Specifically, the gait sequence or video is input into the gait appearance feature extraction branch to extract the gait appearance feature. The branch is based on the GaitSet gait recognition network, which is improved as follows: firstly, the number of input channels in the input layer is changed from 1 to 3 to accept an RGB image; secondly, the global maximum pooling in the spatial pyramid pooling (SPP) is replaced by the sum of global maximum pooling and global average pooling, and the horizontal pyramid mapping (HPM) in GaitSet is replaced; then, a squeeze-and-excitation (SE) attention mechanism is configured to promote features that are useful for gait recognition and suppress those that are not; finally, an independent fully connected layer (FC) is configured to map the features.
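The two modifications above, summing global maximum and average pooling over the frame set and applying SE channel attention, can be sketched in PyTorch as follows. Channel counts and the SE reduction ratio of 16 are illustrative assumptions, not values taken from the patent:

```python
import torch
import torch.nn as nn

class SumSetPooling(nn.Module):
    """Aggregates frame-level features over the temporal (set) dimension by
    summing global maximum pooling and global average pooling, as in the
    modified GaitSet set pooling described above."""
    def forward(self, x):
        # x: (batch, frames, channels, height, width)
        return x.max(dim=1)[0] + x.mean(dim=1)

class SEBlock(nn.Module):
    """Squeeze-and-excitation attention used to promote useful channels
    and suppress useless ones; the reduction ratio is an assumption."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, channels, height, width)
        w = self.fc(x.mean(dim=(2, 3)))           # squeeze to per-channel weights
        return x * w.unsqueeze(-1).unsqueeze(-1)  # excite: rescale each channel
```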
[0026] Step 2. Through the simplified fully convolutional network (FCN), namely the semantic segmentation branch, the semantic segmentation is performed on the image to obtain a mask that contains only pedestrian gait contour information.
[0027] Specifically, the simplified FCN of the present invention includes nine convolutional layers and one upsampling layer, where the first six convolutional layers share weights with the first six convolutional layers of the gait appearance feature extraction branch. Compared with the original FCN, the skip architecture is removed and one convolutional layer is added, which preserves the segmentation speed with little loss of accuracy.
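A minimal sketch of this simplified FCN is given below. The patent specifies only the layer counts and the weight sharing; the channel widths, downsampling positions, and upsampling factor here are illustrative assumptions:

```python
import torch
import torch.nn as nn

def make_shared_convs(in_ch=3):
    """First six convolutional layers, shared with the appearance branch.
    Channel widths and pooling positions are assumptions for illustration."""
    chans = [in_ch, 32, 32, 64, 64, 128, 128]
    layers = []
    for i in range(6):
        layers += [nn.Conv2d(chans[i], chans[i + 1], 3, padding=1),
                   nn.ReLU(inplace=True)]
        if i in (1, 3):  # downsample twice, 4x total
            layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class SimplifiedFCN(nn.Module):
    """Nine convolutional layers plus one upsampling layer; the skip
    architecture of the original FCN is removed."""
    def __init__(self, shared, num_classes=2):
        super().__init__()
        self.shared = shared  # weight sharing: the same module object as branch 1
        self.head = nn.Sequential(                       # three extra conv layers
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, num_classes, 1),
        )
        self.upsample = nn.Upsample(scale_factor=4, mode='bilinear',
                                    align_corners=False)

    def forward(self, x):
        # Returns per-pixel class scores at the input resolution.
        return self.upsample(self.head(self.shared(x)))
```

Passing the same `shared` module to both branches realizes the weight sharing of the first six convolutional layers.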
[0028] Step 3. The pedestrian gait mask feature F.sub.Mask is extracted from the mask by the gait mask feature extraction branch based on the GaitSet network. The number of input channels of the gait mask feature extraction branch is 1, since the mask is a single-channel image, and the rest of the structure is identical to the gait appearance feature extraction branch.
[0029] Step 4. Appropriate weights are set for the extracted features to perform feature fusion, namely F=p*F.sub.App+q*F.sub.Mask. The fusion feature F is the final feature extracted by the method proposed by the present invention. According to the experiments, p is set to 0.8 and q to 0.2.
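The fusion step is a simple weighted sum of the two feature tensors, for example:

```python
import torch

def fuse_features(f_app, f_mask, p=0.8, q=0.2):
    """Weighted feature fusion F = p*F_App + q*F_Mask, with the weights
    p=0.8 and q=0.2 reported in the embodiment."""
    return p * f_app + q * f_mask
```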
[0030] Step 5. When the network is trained, for the fusion feature, the triplet loss L.sub.BA+ and the cross entropy loss L.sub.Cross of the semantic segmentation branch are calculated to perform Loss fusion, and different weights are set for weighted summation, that is, Loss=r*L.sub.BA++s*L.sub.Cross. According to the experiments, r is 0.7 and s is 0.3.
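The combined training objective can be sketched as follows. A batch-all triplet loss over the fusion features stands in for L.sub.BA+ (the margin value is an assumption, as the patent does not specify one), and the segmentation branch contributes a standard cross entropy:

```python
import torch
import torch.nn.functional as F

def batch_all_triplet_loss(feat, labels, margin=0.2):
    """Batch-all triplet loss over a batch of fusion features; the margin
    is an illustrative assumption. Averages over all valid (a, p, n) triplets."""
    d = torch.cdist(feat, feat)                        # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-identity matrix
    # tri[a, p, n] = d(a, p) - d(a, n) + margin
    tri = d.unsqueeze(2) - d.unsqueeze(1) + margin
    eye = torch.eye(len(labels), dtype=torch.bool, device=feat.device)
    # anchor/positive share a label (but a != p), anchor/negative do not
    valid = same.unsqueeze(2) & (~same).unsqueeze(1) & (~eye).unsqueeze(2)
    return torch.relu(tri[valid]).mean()

def fused_loss(fusion_feat, labels, seg_logits, seg_target, r=0.7, s=0.3):
    """Loss = r*L_BA+ + s*L_Cross with the embodiment's weights r=0.7, s=0.3."""
    l_ba = batch_all_triplet_loss(fusion_feat, labels)
    l_cross = F.cross_entropy(seg_logits, seg_target)
    return r * l_ba + s * l_cross
```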
[0031] Step 6. When the trained network is configured for forward reasoning, Euclidean distances between fusion features of a pedestrian gait sequence to be retrieved and fusion features of a pedestrian gait sequence in a retrieval database are calculated, ranking is performed according to the distances, and the recognition accuracy of rank-1 is calculated, where a probe sequence is considered correctly recognized at rank-1 if the closest gallery sequence belongs to the same subject.
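The rank-1 evaluation described above can be expressed compactly; the function and argument names below are illustrative:

```python
import torch

def rank1_accuracy(probe_feats, probe_labels, gallery_feats, gallery_labels):
    """Rank-1 accuracy: for each probe fusion feature, the nearest gallery
    feature by Euclidean distance must share the probe's identity."""
    d = torch.cdist(probe_feats, gallery_feats)  # pairwise Euclidean distances
    nearest = d.argmin(dim=1)                    # index of the closest gallery sample
    correct = (gallery_labels[nearest] == probe_labels).float()
    return correct.mean().item()
```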