METHOD FOR RE-RECOGNIZING OBJECT IMAGE BASED ON MULTI-FEATURE INFORMATION CAPTURE AND CORRELATION ANALYSIS

20220415027 · 2022-12-29

Abstract

A method for re-recognizing an object image is provided based on multi-feature information capture and correlation analysis. The method weights an input feature map by using a convolutional layer with a spatial attention mechanism and a channel attention mechanism, causing channel and spatial information to be effectively combined, which not only focuses on important features and suppresses unnecessary features, but also improves the representation of a feature. A multi-head attention mechanism is used to process features after an image is divided into blocks, to capture abundant feature information and determine a correlation between features, thereby improving the performance and efficiency of object image retrieval. The convolutional layer with the channel attention mechanism and the spatial attention mechanism is combined with a transformer having the multi-head attention mechanism to focus on globally important features and capture fine-grained features, thereby improving the performance of re-recognition.

Claims

1. A method for re-recognizing an object image based on a multi-feature information capture and correlation analysis comprising: a) collecting a plurality of object images to form an object image re-recognition database, labeling identifier (ID) information of an object image in the object image re-recognition database, and dividing the object image re-recognition database into a training set and a test set; b) establishing an object image re-recognition model by using the multi-feature information capture and correlation analysis; c) optimizing an objective function of the object image re-recognition model by using a cross-entropy loss function and a triplet loss function to obtain an optimized object image re-recognition model; d) marking the object images with the ID information to obtain marked object images, inputting the marked object images into the optimized object image re-recognition model in step c) for training to obtain a trained object image re-recognition model and storing the trained object image re-recognition model; e) inputting a to-be-retrieved object image into the trained object image re-recognition model in step d) to obtain a feature of a to-be-retrieved object; and f) comparing the feature of the to-be-retrieved object with features of the object images in the test set and sorting comparison results by a similarity measurement.

2. The method for re-recognizing the object image based on the multi-feature information capture and correlation analysis according to claim 1, wherein step b) comprises the following steps: b-1) setting an image input network to two branch networks comprising a first feature branch network and a second feature branch network; b-2) inputting an object image h in the training set into the first feature branch network, wherein h∈R.sup.e×w×3, R represents a real number space, e represents a number of horizontal pixels of the object image h, w represents a number of vertical pixels of the object image h, and 3 represents a number of channels of each red, green, and blue (RGB) image; processing the object image h by using a convolutional layer to obtain a feature map f; processing the feature map f by using a channel attention mechanism; performing a global average pooling and a global maximum pooling on the feature map f to obtain two one-dimensional vectors; normalizing the two one-dimensional vectors through a convolution, a Rectified Linear Unit (ReLU) activation function, a 1*1 convolution, and sigmoid function operations in turn to weight the feature map f to obtain a weighted feature map f; performing a maximum pooling and an average pooling on all channels at each position in the weighted feature map f by using a spatial attention mechanism to obtain a maximum pooled feature map and an average pooled feature map; stitching the maximum pooled feature map and the average pooled feature map to obtain a stitched feature map; performing a 7*7 convolution on the stitched feature map, and then normalizing the stitched feature map by using a batch normalization layer and a sigmoid function to obtain a normalized stitched feature map; and multiplying the normalized stitched feature map by the feature map f to obtain a new feature; b-3) inputting the object image h in the training set into the second feature branch network, wherein h∈R.sup.e×w×3; dividing the image h into 
n two-dimensional blocks; representing embeddings of the two-dimensional blocks as a one-dimensional vector h.sub.l∈R.sup.n×(p.sup.2.sup.·3) by using a linear transformation layer, wherein p represents a resolution of an image block, and n=ew/p.sup.2; calculating an average embedding h.sub.a of all the two-dimensional blocks according to a formula h.sub.a=(Σ.sub.i=1.sup.nh.sub.i)/n, wherein h.sub.i represents an embedding of an i.sup.th block obtained through a Gaussian distribution initialization, and i∈{1, . . . , n}; calculating an attention coefficient a.sub.i of the i.sup.th block according to a formula a.sub.i=q.sup.Tσ(W.sub.1h.sub.0+W.sub.2h.sub.i+W.sub.3h.sub.a), wherein q.sup.T represents a weight, σ represents the sigmoid function, h.sub.0 represents a class marker, and W.sub.1, W.sub.2, and W.sub.3 are weights; calculating a new embedding h.sub.l of each of the two-dimensional blocks according to a formula h.sub.l=(Σ.sub.i=1.sup.na.sub.ih.sub.i)/n; and calculating a new class marker h.sub.0′ according to a formula h.sub.0′=W.sub.4[h.sub.0∥h.sub.l], wherein W.sub.4 represents a weight; b-4) taking the new class marker h.sub.0′ and a sequence with an input size of h.sub.l∈R.sup.n×d.sup.c as an overall representation of a new image, wherein d.sub.c=d*m, d represents a dimension size of a head of each self-attention mechanism in a multi-head attention mechanism, and m represents a number of heads of the multi-head attention mechanism; adding position information in the new image, and then taking the new image as an input of a transformer encoder to complete the establishment of the object image re-recognition model.

3. The method for re-recognizing the object image based on the multi-feature information capture and correlation analysis according to claim 2, wherein the transformer encoder in step b-4) comprises the multi-head attention mechanism and a feedforward layer; the multi-head attention mechanism comprises a plurality of self-attention mechanisms; a weight Attention(h.sub.l,i) of an i.sup.th value in the sequence h.sub.l∈R.sup.n×d is calculated according to a formula Attention(h.sub.l,i)=Softmax(Q.sub.i.sup.TK.sub.i/√d)V.sub.i, wherein Q.sub.i represents an i.sup.th queried vector, T represents a transposition, K.sub.i represents a vector of a correlation between i.sup.th queried information and queried information from other blocks of the two-dimensional blocks, and V.sub.i represents a vector of the i.sup.th queried information; a new output embedding SA(h.sub.l) of the multi-head attention mechanism is calculated according to a formula SA(h.sub.l)=Proj(Concat.sub.i=1.sup.m(Attention(h.sub.l,i))); an input h′ of the feedforward layer is calculated according to a formula h′=ωLN(h.sub.l+SA(h.sub.l)); an output y of the transformer encoder is calculated according to a formula y=ωLN(h′+FFN(h′)), wherein Proj(·) represents a linear mapping, Concat(·) represents a stitching operation, FFN(h′)=∂W.sub.2(h.sub.lW.sub.1+c.sub.1)+c.sub.2, ∂ represents a Gaussian Error Linear Unit (GELU) activation function, c.sub.1 and c.sub.2 are learnable offsets, ω represents a ratio, and LN represents a normalization operation; and the new feature output from the first feature branch network and a feature y output from the second feature branch network are stitched into a feature vector of the object image.

4. The method for re-recognizing the object image based on the multi-feature information capture and correlation analysis according to claim 2, wherein in step c), a cross-entropy loss V.sub.ID is calculated according to a formula V.sub.ID=−Σ.sub.i=1.sup.ng.sub.i log(p.sub.i), wherein g.sub.i represents an indicator variable with g.sub.i=1 if y=i and g.sub.i=0 if y≠i, y represents a true class of an input image, n represents a number of classes in the training set, and p.sub.i represents a predicted probability of a class-i image; and the triplet loss function V.sub.t is calculated according to a formula V.sub.t=[∥ν.sub.a−ν.sub.p∥.sup.2−∥ν.sub.a−ν.sub.n∥.sup.2+α].sub.+, wherein α represents a spacing, ν.sub.a represents a sample of a class marker learned by a transformer, ν.sub.p represents a positive sample of the class marker learned by the transformer, ν.sub.n represents a negative sample of the class marker learned by the transformer, [d].sub.+ is max[d,0], and d=∥ν.sub.a−ν.sub.p∥.sup.2−∥ν.sub.a−ν.sub.n∥.sup.2+α.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0022] FIG. 1 shows a model of a method for re-recognizing an object image based on multi-feature information capture and correlation analysis according to the present disclosure.

[0023] FIG. 2 illustrates a feature map f of the vehicle image processed by using a convolutional layer.

[0024] FIG. 3 illustrates a feature map f of the vehicle image processed by using a linear transformation layer.

[0025] FIG. 4 illustrates a retrieval result.

DETAILED DESCRIPTION OF THE EMBODIMENTS

[0026] The present disclosure will be described in detail below with reference to FIG. 1.

[0027] A method for re-recognizing an object image based on multi-feature information capture and correlation analysis includes the following steps:

[0028] a) Collect a plurality of object images to form an object image re-recognition database, label ID information of an object image in the database, and divide the database into a training set and a test set.

[0029] b) Establish an object image re-recognition model based on multi-feature information capture and correlation analysis.

[0030] c) Optimize an objective function of the object image re-recognition model by using a cross-entropy loss function and a triplet loss function.

[0031] d) Manually mark the collected object images with ID information, input the marked object images into an optimized object image re-recognition model in step c) for training to obtain a trained object image re-recognition model, and store the trained object image re-recognition model.

[0032] e) Input a to-be-retrieved object image into the trained object image re-recognition model in step d) to obtain a feature of a to-be-retrieved object.

[0033] f) Compare the feature of the to-be-retrieved object with features of object images in the test set and sort comparison results by similarity measurement.

[0034] An input feature map is weighted by using a convolutional layer with a spatial attention mechanism and a channel attention mechanism, such that channel and spatial information is effectively combined. This not only focuses on important features and suppresses unnecessary features, but also improves the representation of the features of interest. A transformer is used: its multi-head attention mechanism can better process the features after an image is divided into blocks, capture more abundant feature information, and take into account the correlation between features, thereby obtaining good performance and improving the efficiency of object image retrieval. The convolutional layer with the channel attention mechanism and the spatial attention mechanism and the transformer with the multi-head attention mechanism are combined to focus globally on important features and better capture fine-grained features, thereby improving the performance of re-recognition.

[0035] Step b) includes the following steps:

[0036] b-1) Set an image input network to two branch networks, namely, a first feature branch network and a second feature branch network.

[0037] b-2) Input an object image h in the training set into the first feature branch network, where h∈R.sup.e×w×3, R represents real number space, e represents a quantity of horizontal pixels of the object image h, w represents a quantity of vertical pixels of the object image h, and 3 represents a quantity of channels of each RGB image; process the object image h by using a convolutional layer to obtain a feature map f; process the feature map f by using the channel attention mechanism: perform global average pooling and global maximum pooling on the feature map f to obtain two one-dimensional vectors, and normalize the two one-dimensional vectors through convolution, ReLU activation function, 1*1 convolution, and sigmoid function operations in turn to weight the feature map f; perform maximum pooling and average pooling on all channels at each position in the weighted feature map f by using the spatial attention mechanism to obtain a maximum pooled feature map and an average pooled feature map, and stitch the two; perform 7*7 convolution on the stitched feature map and then normalize it by using a batch normalization layer and a sigmoid function; and multiply the normalized stitched feature map by the feature map f to obtain a new feature.
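The channel- and spatial-attention weighting of step b-2) can be sketched in NumPy as follows. This is a minimal illustration, not the patent's exact implementation: the shared-MLP weights `w1`/`w2` (standing in for the 1*1 convolutions) and the 7*7 kernel `k` are hypothetical parameters, and batch normalization is omitted for brevity.

```python
import numpy as np

def channel_attention(f, w1, w2):
    """Weight channels: global avg & max pooling, shared MLP, sigmoid gate."""
    avg = f.mean(axis=(1, 2))                     # (C,) global average pooling
    mx = f.max(axis=(1, 2))                       # (C,) global maximum pooling
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)  # 1*1 conv -> ReLU -> 1*1 conv
    s = 1.0 / (1.0 + np.exp(-(mlp(avg) + mlp(mx))))  # sigmoid per channel
    return f * s[:, None, None]                   # weighted feature map

def spatial_attention(f, k):
    """Weight positions: per-position max & mean over channels, 7*7 conv, sigmoid."""
    stacked = np.stack([f.max(axis=0), f.mean(axis=0)])  # (2, H, W) stitched maps
    H, W = stacked.shape[1:]
    pad = np.pad(stacked, ((0, 0), (3, 3), (3, 3)))      # 'same' padding for 7*7
    conv = np.array([[(k * pad[:, y:y + 7, x:x + 7]).sum()
                      for x in range(W)] for y in range(H)])
    s = 1.0 / (1.0 + np.exp(-conv))               # sigmoid (batch norm omitted)
    return f * s[None, :, :]
```

In a real network the two blocks run in sequence, so a feature map keeps its shape while each channel and each spatial position is rescaled by its learned importance.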

[0038] b-3) Input the object image h in the training set into the second feature branch network, where h∈R.sup.e×w×3; divide the image h into n two-dimensional blocks; represent embeddings of the two-dimensional blocks as a one-dimensional vector h.sub.l∈R.sup.n×(p.sup.2.sup.·3) by using a linear transformation layer, where p represents the resolution of an image block, and n=ew/p.sup.2, such that n, the total quantity of blocks, is the length of the valid input sequence of the transformer; flatten these small blocks and map them into embeddings with a size of d; because important information of some edges and corners may be omitted when the image is divided into the blocks, allocate a different attention coefficient to each block by using an attention mechanism, and then add an additional class marker h.sub.0 to the sequence; calculate an average embedding h.sub.a of all the blocks according to a formula

[00005] h.sub.a=(Σ.sub.i=1.sup.nh.sub.i)/n,

where h.sub.i represents an embedding that is of an i.sup.th block and obtained through initialization based on a Gaussian distribution, and i∈{1, . . . , n}; calculate an attention coefficient a.sub.i of the i.sup.th block according to a formula a.sub.i=q.sup.Tσ(W.sub.1h.sub.0+W.sub.2h.sub.i+W.sub.3h.sub.a), where q.sup.T represents a weight, σ represents the sigmoid function, h.sub.0 represents the class marker, and W.sub.1, W.sub.2, and W.sub.3 are weights; calculate a new embedding h.sub.l of each block according to a formula

[00006] h.sub.l=(Σ.sub.i=1.sup.na.sub.ih.sub.i)/n;

and calculate a new class marker h.sub.0′ according to a formula h.sub.0′=W.sub.4[h.sub.0∥h.sub.l], where W.sub.4 represents a weight.
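The attention-weighted block embeddings and the updated class marker of step b-3) can be sketched as follows. This is a NumPy illustration under stated assumptions: the weights `q`, `W1`–`W4` are placeholder parameters rather than learned values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reweight_blocks(h, h0, q, W1, W2, W3, W4):
    """Attention-weight the n block embeddings and update the class marker."""
    n = h.shape[0]
    ha = h.mean(axis=0)                       # average embedding h_a
    # attention coefficient a_i = q^T sigmoid(W1 h0 + W2 h_i + W3 h_a)
    a = np.array([q @ sigmoid(W1 @ h0 + W2 @ hi + W3 @ ha) for hi in h])
    hl = (a[:, None] * h).sum(axis=0) / n     # new embedding h_l
    h0_new = W4 @ np.concatenate([h0, hl])    # h0' = W4 [h0 || h_l]
    return a, hl, h0_new
```

Each block thus contributes to the pooled embedding h.sub.l in proportion to its learned coefficient, and the class marker is refreshed from the concatenation [h.sub.0∥h.sub.l].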

[0039] b-4) Take the new class marker h.sub.0′ and a sequence with an input size of h.sub.l∈R.sup.n×d.sup.c as an overall representation of a new image, where d.sub.c=d*m, d represents a dimension size of a head of each self-attention mechanism in the multi-head attention mechanism, and m represents a quantity of heads of the multi-head attention mechanism; and add position information in the new image, and then take the new image as an input of a transformer encoder to complete the establishment of the object image re-recognition model.

[0040] The transformer encoder in step b-4) includes the multi-head attention mechanism and a feedforward layer. The multi-head attention mechanism is composed of a plurality of self-attention mechanisms. A weight Attention(h.sub.l,i) of an i.sup.th value in the sequence h.sub.l∈R.sup.n×d is calculated according to a formula

[00007] Attention(h.sub.l,i)=Softmax(Q.sub.i.sup.TK.sub.i/√d)V.sub.i,

where Q.sub.i represents an i.sup.th queried vector, T represents a transposition, K.sub.i represents a vector of a correlation between i.sup.th queried information and other information, and V.sub.i represents a vector of the i.sup.th queried information. A new output embedding SA(h.sub.l) of the multi-head attention mechanism is calculated according to a formula SA(h.sub.l)=Proj(Concat.sub.i=1.sup.m(Attention(h.sub.l,i))). An input h′ of the feedforward layer is calculated according to a formula h′=ωLN(h.sub.l+SA(h.sub.l)). An output y of the encoder is calculated according to a formula y=ωLN(h′+FFN(h′)), where Proj(·) represents a linear mapping, Concat(·) represents a stitching operation, FFN(h′)=∂W.sub.2(h.sub.lW.sub.1+c.sub.1)+c.sub.2, ∂ represents a GELU activation function, c.sub.1 and c.sub.2 are learnable offsets, ω represents a ratio, and LN represents a normalization operation. The feature output from the first feature branch network and the feature y output from the second feature branch network are stitched into a feature vector of the object image. A residual eigenvalue is rescaled by using a smaller value of ω, which helps to enhance the residual connection. After the attention coefficients are learned by the multi-head attention mechanism composed of the self-attention mechanisms, more abundant feature information is captured, and a degree of attention to each feature is obtained. In addition, residual design and layer normalization are added to prevent the gradient from vanishing and to accelerate convergence. The new feature on this branch is obtained by using a plurality of encoders.
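A single encoder layer of the kind described in step b-4) may be sketched as below. This is a simplified illustration, not the patent's exact parameterization: the projection matrices are placeholders, the standard scaled dot-product form with a √d denominator is assumed, and the feedforward sublayer follows the usual GELU transformer layout.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def encoder_layer(h, Wq, Wk, Wv, Wp, W1, c1, W2, c2, m, omega=1.0):
    """One transformer encoder layer: m-head attention, residual + LN, GELU FFN."""
    n, dc = h.shape
    d = dc // m                                   # per-head dimension, d_c = d*m
    Q, K, V = h @ Wq, h @ Wk, h @ Wv
    heads = []
    for i in range(m):                            # Attention(h_l, i) per head
        s = slice(i * d, (i + 1) * d)
        heads.append(softmax(Q[:, s] @ K[:, s].T / np.sqrt(d)) @ V[:, s])
    sa = np.concatenate(heads, axis=1) @ Wp       # SA(h_l) = Proj(Concat(...))
    h1 = omega * layer_norm(h + sa)               # h' = omega * LN(h_l + SA(h_l))
    ffn = gelu(h1 @ W1 + c1) @ W2 + c2            # feedforward sublayer
    return omega * layer_norm(h1 + ffn)           # y = omega * LN(h' + FFN(h'))
```

Stacking several such layers yields the "plurality of encoders" from which the second-branch feature is taken.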

[0041] In step c), a cross-entropy loss V.sub.ID is calculated according to a formula

[00008] V.sub.ID=−Σ.sub.i=1.sup.ng.sub.i log(p.sub.i), g.sub.i=1 if y=i, g.sub.i=0 if y≠i,

where g.sub.i represents an indicator variable, y represents a true class of an input image, n represents a number of classes in the training data set, and p.sub.i represents a predicted probability of a class-i image. The triplet loss function V.sub.t is calculated according to a formula V.sub.t=[∥ν.sub.a−ν.sub.p∥.sup.2−∥ν.sub.a−ν.sub.n∥.sup.2+α].sub.+, where α represents a spacing, ν.sub.a represents a sample of a class marker learned by the transformer, ν.sub.p represents a positive sample of the class marker learned by the transformer, ν.sub.n represents a negative sample of the class marker learned by the transformer, [d].sub.+ is max[d,0], and d=∥ν.sub.a−ν.sub.p∥.sup.2−∥ν.sub.a−ν.sub.n∥.sup.2+α.
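The two loss terms reduce to short computations: because g.sub.i is 1 only at the true class y, the cross-entropy sum collapses to −log p.sub.y, and the triplet loss is a hinged distance margin. A sketch:

```python
import numpy as np

def cross_entropy_loss(p, y):
    """V_ID = -sum_i g_i log(p_i); g_i is 1 only at the true class y."""
    return -np.log(p[y])

def triplet_loss(va, vp, vn, alpha):
    """V_t = [||va - vp||^2 - ||va - vn||^2 + alpha]_+ with [d]_+ = max(d, 0)."""
    d = np.sum((va - vp) ** 2) - np.sum((va - vn) ** 2) + alpha
    return max(d, 0.0)
```

The triplet term is zero once the negative sample is farther from the anchor than the positive sample by at least the spacing α, so optimization pressure remains only on hard triplets.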

[0042] Re-recognition of a transport vehicle in a large industrial park is taken as an example. An implementation of the present disclosure is as follows: At first, a plurality of images of the transport vehicle in the industrial park are collected to construct a re-recognition database, ID information of a vehicle image in the database is labeled, and the database is divided into a training set and a test set.

[0043] Then, a vehicle image re-recognition model based on multi-feature information capture and correlation analysis is established. The model is divided into a first feature branch network and a second feature branch network. In the first feature branch network, a vehicle image h in the training set is input, where h∈R.sup.e×w×3, R represents real number space, e represents a quantity of horizontal pixels of the vehicle image (e=256), w represents a quantity of vertical pixels of the vehicle image (w=256), and 3 represents a quantity of channels of each RGB image. The vehicle image is processed by using a convolutional layer to obtain a feature map f of the vehicle image, as shown in FIG. 2.

[0044] After that, the feature map f of the vehicle image is processed by using a channel attention mechanism. Global average pooling and global maximum pooling are performed on the feature map to obtain two one-dimensional vectors, and the two one-dimensional vectors are normalized through convolution, ReLU activation function, 1*1 convolution, and sigmoid function operations to weight the feature map. Maximum pooling and average pooling are performed on all channels at each position in a weighted feature map f by using a spatial attention mechanism to obtain a maximum pooled feature map and an average pooled feature map for stitching. 7*7 convolution is performed on a stitched feature map, and then the stitched feature map is normalized by using the batch normalization layer and a sigmoid function. A normalized stitched feature map is multiplied by the feature map f to obtain a new feature.

[0045] In the second feature branch network, the vehicle image h in the training set is input and divided into n two-dimensional vehicle image blocks (n=256). Embeddings of the two-dimensional vehicle image blocks are represented as a one-dimensional feature vector h.sub.l∈R.sup.n×(p.sup.2.sup..Math.3) of the vehicle image by using a linear transformation layer, where p represents resolution of a vehicle image block (p=16), and n=ew/p.sup.2, as shown in FIG. 3.
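With the embodiment's values, the block count follows directly from n=ew/p.sup.2:

```python
# Block count for the embodiment: e = w = 256 pixels, block resolution p = 16.
e = w = 256
p = 16
n = (e * w) // (p * p)   # n = ew / p^2 = 256 blocks
dim = p * p * 3          # each flattened RGB block has p^2 * 3 = 768 values
```

This is consistent with n=256 stated above and with the flattened block dimension used by the linear transformation layer.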

[0046] An average embedding h.sub.a of all the vehicle image blocks is calculated according to a formula

[00009] h.sub.a=(Σ.sub.i=1.sup.nh.sub.i)/n,

where h.sub.i represents an embedding that is of an i.sup.th vehicle image block and obtained through initialization based on a Gaussian distribution, and i∈{1, . . . , n}. An attention coefficient a.sub.i of the i.sup.th vehicle image block is calculated according to a formula a.sub.i=q.sup.Tσ(W.sub.1h.sub.0+W.sub.2h.sub.i+W.sub.3h.sub.a), where q.sup.T represents a weight, σ represents the sigmoid function, h.sub.0 represents a class marker of the vehicle image block, and W.sub.1, W.sub.2, and W.sub.3 are weights. A new embedding h.sub.l of each vehicle image block is calculated according to a formula

[00010] h.sub.l=(Σ.sub.i=1.sup.na.sub.ih.sub.i)/n.

A new class marker h.sub.0′ of the vehicle image block is calculated according to a formula h.sub.0′=W.sub.4[h.sub.0∥h.sub.l], where W.sub.4 represents a weight. The new class marker h.sub.0′ of the vehicle image block and a sequence with an input size of h.sub.l∈R.sup.n×d.sup.c are taken as an overall representation of a new vehicle image, where d.sub.c=d*m, d represents a dimension size of a head of each self-attention mechanism in a multi-head attention mechanism (d=96), and m represents a quantity of heads of the multi-head attention mechanism (m=8). Position information is added in the new vehicle image, and the new vehicle image is taken as an input of a transformer encoder. The transformer encoder includes the multi-head attention mechanism and a feedforward layer, and the multi-head attention mechanism is composed of a plurality of self-attention mechanisms. A weight Attention(h.sub.l,i) of an i.sup.th value in the sequence h.sub.l∈R.sup.n×d is calculated according to a formula

[00011] Attention(h.sub.l,i)=Softmax(Q.sub.i.sup.TK.sub.i/√d)V.sub.i,

where Q.sub.i represents a vector of an i.sup.th queried vehicle, T represents a transposition, K.sub.i represents a vector of a correlation between i.sup.th queried vehicle image block information and other vehicle image block information, and V.sub.i represents a vector of the i.sup.th queried vehicle image block information. A feature embedding SA(h.sub.l) of a new vehicle image output by the multi-head attention mechanism is calculated according to a formula SA(h.sub.l)=Proj(Concat.sub.i=1.sup.m(Attention(h.sub.l,i))). An input h′ of the feedforward layer is calculated according to a formula h′=ωLN(h.sub.l+SA(h.sub.l)). An output y of the encoder is calculated according to a formula y=ωLN(h′+FFN(h′)), where Proj(·) represents a linear mapping, Concat(·) represents a stitching operation, FFN(h′)=∂W.sub.2(h.sub.lW.sub.1+c.sub.1)+c.sub.2, ∂ represents a GELU activation function, c.sub.1 and c.sub.2 are learnable offsets, ω represents a ratio, and LN represents a normalization operation. Finally, the feature output from the first feature branch network and the feature y output from the second feature branch network are stitched into a feature vector of the vehicle image to complete the establishment of the vehicle image re-recognition model.

[0047] Then, the vehicle image re-recognition model is optimized by using a cross-entropy loss function and a triplet loss function. A cross-entropy loss V.sub.ID is calculated according to a formula

[00012] V.sub.ID=−Σ.sub.i=1.sup.ng.sub.i log(p.sub.i), g.sub.i=1 if y=i, g.sub.i=0 if y≠i,

where g.sub.i represents an indicator variable, y represents a true class of an input vehicle image, n represents a quantity of vehicle image classes in the training data set, and p.sub.i represents a predicted probability of a class-i vehicle image. The triplet loss function V.sub.t is calculated according to a formula V.sub.t=[∥ν.sub.a−ν.sub.p∥.sup.2−∥ν.sub.a−ν.sub.n∥.sup.2+α].sub.+, where α represents a spacing, ν.sub.a represents a sample of a class marker that is in the vehicle image and learned by a transformer, ν.sub.p represents a positive sample of the class marker that is in the vehicle image and learned by the transformer, ν.sub.n represents a negative sample of the class marker that is in the vehicle image and learned by the transformer, [d].sub.+ is max[d,0], and d=∥ν.sub.a−ν.sub.p∥.sup.2−∥ν.sub.a−ν.sub.n∥.sup.2+α.

[0048] A trained vehicle image re-recognition model is obtained after the optimization by using the loss functions and is stored.

[0049] A to-be-retrieved vehicle image is input into the trained vehicle image re-recognition model to obtain a feature of the to-be-retrieved vehicle image.

[0050] Finally, the feature of the to-be-retrieved vehicle image is compared with the features of the vehicle images in the test set, and the comparison results are sorted by similarity measurement. A retrieval result is shown in FIG. 4.
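The final ranking step can be sketched as a cosine-similarity sort. This is an illustrative choice of similarity measurement; the disclosure does not fix a particular metric.

```python
import numpy as np

def rank_gallery(query, gallery):
    """Sort gallery features by cosine similarity to the query, best match first."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                  # cosine similarity to each gallery feature
    order = np.argsort(-sims)     # descending similarity
    return order, sims[order]
```

The returned `order` gives the retrieval ranking over the test-set features, and `sims` the corresponding similarity scores.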

[0051] Finally, it should be noted that the above descriptions are only preferred embodiments of the present disclosure and are not intended to limit the present disclosure. Although the present disclosure is described in detail with reference to the foregoing embodiments, a person skilled in the art can still make modifications to the technical solutions described in the foregoing embodiments or make equivalent replacement of some technical features therein. Any modifications, equivalent substitutions, improvements, and the like made within the spirit and principle of the present disclosure should be included within the protection scope of the present disclosure.