Ear detection method with deep learning pairwise model based on contextual information
11521427 · 2022-12-06
Cpc classification
G06V10/44
PHYSICS
G06N3/082
PHYSICS
G06V10/774
PHYSICS
G06V10/762
PHYSICS
G06V10/26
PHYSICS
G06V40/171
PHYSICS
International classification
G06V10/762
PHYSICS
G06V10/26
PHYSICS
G06V10/44
PHYSICS
Abstract
An ear detection method with deep learning pairwise model based on contextual information belongs to the field of biometric recognition technologies, and addresses a problem that an ear location cannot be found in a large scene, especially in a background image containing a whole body. The method includes: performing preprocessing and object labeling on images; modifying an Oquab network to be a local model for four classes through transfer learning and training the local model; training two pairwise models of head and ear as well as body and head based on the local model; and performing joint detection for an ear through the local model, the two pairwise models and body features. The method uses a hierarchical relationship from large to small to establish contextual information, which can reduce the interference of other features and detect the location of the ear more accurately.
Claims
1. An ear detection method with deep learning pairwise model based on contextual information, comprising:
step 1, image preprocessing and object labeling, comprising: obtaining original images, performing data augmentation processing on the original images to obtain an image training set, and labeling bodies, heads and ears of images in the image training set with classes through rectangular boxes;
step 2, modifying an Oquab network to be a local model for four classes through transfer learning and training the local model, comprising:
i) removing an output layer and a last feature layer of the Oquab network, adding a fully connected layer including rectified linear unit (ReLU) and dropout functions for feature extraction, and adding an output layer including the four classes of body, head, ear and background;
ii) generating suggested candidate boxes for each image in the image training set through a signed sliding window (SSW) method, and adding truth values of the body, the head and the ear into the suggested candidate boxes for each image to form training samples;
iii) reading images of the training samples, calculating an image average value according to the training samples, subtracting the image average value from the training samples and then training; wherein the image average value is represented by $[M_r, M_g, M_b]$, and $M_r$ represents a red average value of the training samples, $M_g$ represents a green average value of the training samples, and $M_b$ represents a blue average value of the training samples; and
iv) employing an already set network structure, and training network parameters of the local model through a random gradient descent method with momentum;
step 3, training two pairwise models of head and ear as well as body and head based on the local model individually, comprising: selecting a first convolutional layer (conv1) through an eighth convolutional layer (conv8) of the local model as a front part of each pairwise model and connecting two fully connected layers (conv10, conv11) in parallel as a rear part of each pairwise model, wherein
1) pairwise model building, comprising: for one of the pairwise models, the front part thereof is the same as that of the trained local model, and the rear part thereof is the two fully connected layers connected in parallel; one of the two fully connected layers is a unitary potential field network model layer, and the other of the two fully connected layers is a pairwise model potential field network layer; and a joint score function is expressed as formula (1):
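$$S(y;\omega)=\alpha\sum_{i\in v_1} y_i\,\theta_i^{U}(\omega)+\beta\sum_{j\in v_2} y_j\,\theta_j^{U}(\omega)+\gamma\sum_{(i,j)\in\varepsilon} y_i\,y_j\,\theta_{i,j,k_{i,j}}^{P}(\omega) \qquad (1)$$
$$\theta_i^{U}=\varphi^{U}(f_i,\omega^{U}),\qquad \theta_j^{U}=\varphi^{U}(f_j,\omega^{U}) \qquad (2)$$
a joint potential field value corresponding to the head and the ear in pair is obtained through formula (3):
$$\theta_{i,j,k_{i,j}}^{P}=\varphi^{P}(f_i,f_j,\omega^{P}) \qquad (3)$$
$$s_p(\omega)=\max_{y:\,y_p=1}S(y;\omega)-\max_{y:\,y_p=0}S(y;\omega) \qquad (4)$$
$$L(\omega,\hat{y},X)=\sum_{i:\,\hat{y}_i=1}v\big(s_i(\omega)\big)+\sum_{i:\,\hat{y}_i=0}v\big(-s_i(\omega)\big) \qquad (5)$$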
2. The ear detection method with deep learning pairwise model based on contextual information according to claim 1, wherein weights of truth labels and a background weight are set based on a quantity of the truth labels and a quantity of background labels inversely proportional to the quantity of the truth labels and normalized, when using the formula (5) to calculate the final function loss value.
Description
BRIEF DESCRIPTION OF DRAWINGS
DETAILED DESCRIPTION OF EMBODIMENTS
(5) The invention will be described in detail below in combination with the accompanying drawings and embodiments.
(6) Please refer to
(7) Step 1: image preprocessing and object labeling
(8) In particular, 700 original images are obtained through network collection and personal shooting, and data augmentation is then applied to them, including image flipping, image resizing, image translation, image rotation, and noise addition, to obtain an image training set of more than 8000 images in total. Afterwards, the bodies, heads and ears in the images of the image training set are labeled with classes.
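The augmentation operations listed above can be sketched as follows. This is a minimal illustration with NumPy only; the function name `augment`, the shift of 5 pixels, the 90-degree rotation and the noise level are assumptions chosen for the example and are not values given in this description. Resizing is omitted because it needs an interpolation library.

```python
import numpy as np

def augment(image, rng, shift=(5, -5), noise_std=10.0):
    """Return four augmented copies of an H x W x 3 uint8 image."""
    flipped = image[:, ::-1, :]                          # horizontal flip
    shifted = np.roll(image, shift=shift, axis=(0, 1))   # translation; pixels wrap around
    rotated = np.rot90(image, k=1, axes=(0, 1))          # 90-degree rotation
    noise = rng.normal(0.0, noise_std, size=image.shape)
    noisy = np.clip(image.astype(np.float64) + noise, 0, 255).astype(np.uint8)
    return [flipped, shifted, rotated, noisy]

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 6, 3), dtype=np.uint8)
augmented = augment(image, rng)
```

In practice the rectangular-box labels of body, head and ear would have to be transformed together with each image; that bookkeeping is left out of the sketch.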
(9) Step 2: modifying an Oquab convolutional neural network (also referred to as Oquab network) to be a local model for four classes through transfer learning, and training the local model.
(10) Transfer learning can transfer feature parameters that a source network has learned from a large amount of data to a new network with only a small number of training samples. The Oquab network is proposed in Maxime Oquab, Leon Bottou, Ivan Laptev, Josef Sivic, “Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks”, CVPR, June 2014. The Oquab network is copied, its output layer and last feature layer are removed, a fully connected layer including rectified linear unit (ReLU) and dropout functions is added for feature extraction, and an output layer of four classes (i.e., body, head, ear and background) is added. For each image in the image training set, a signed sliding window (SSW) method is used to generate suggested candidate boxes, and the truth values of the body, the head and the ear are added to the suggested candidate boxes of that image to form the initial training samples. Afterwards, the images of the training samples are read, an image average value is calculated over the training samples, and the average value is subtracted from each training sample before training. The calculated image average value is $[M_r, M_g, M_b]$, where $M_r$, $M_g$ and $M_b$ represent the red, green and blue average values of the training samples, respectively. The already set network structure is then employed, and stochastic (random) gradient descent with momentum is used to train the network parameters of the local model.
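The mean subtraction and the momentum update described above can be sketched as follows. The function names, the learning rate of 0.01 and the momentum of 0.9 are assumptions for illustration; the description does not state these hyperparameters.

```python
import numpy as np

def channel_means(samples):
    """samples: N x H x W x 3 RGB array. Returns [M_r, M_g, M_b]."""
    return samples.reshape(-1, 3).mean(axis=0)

def subtract_mean(samples, means):
    """Subtract the per-channel average from every training sample."""
    return samples.astype(np.float64) - means

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One stochastic gradient descent step with classical momentum."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

samples = np.zeros((2, 4, 4, 3))
samples[..., 0], samples[..., 1], samples[..., 2] = 10.0, 20.0, 30.0
means = channel_means(samples)          # per-channel averages
centered = subtract_mean(samples, means)
w, vel = sgd_momentum_step(1.0, 2.0, 0.0)
```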
(11) The images inputted into the local model network are 224*224*3, and a whole network flow is shown in
(12) TABLE 1: network parameters
Layer (input 224*224*3) | Parameters | Output
Conv1 | 11*11*3*96, Stride = 4, Pad = 2*1*2*1, ReLU = 1, Mpool = 2 | 55*55*96*128
Conv2 | 5*5*96*256, Stride = 1, Pad = 2, ReLU = 1, Mpool = 2 | 27*27*256*128
Conv3 | 3*3*256*384, Stride = 1, Pad = 1, ReLU = 1 | 13*13*384*128
Conv4 | 3*3*384*384, Stride = 1, Pad = 1, ReLU = 1 | 13*13*384*128
Conv5 | 3*3*384*256, Stride = 1, Pad = 1, ReLU = 1, Mpool = 2 | 6*6*256*128
Conv6 | 6*6*256*6144, Stride = 1, Pad = 0, ReLU = 1, dropout = 1 | 1*1*6144*128
Conv7 | 1*1*6144*6144, Stride = 1, Pad = 0, ReLU = 1, dropout = 1 | 1*1*6144*128
Conv8 | 1*1*6144*2048, Stride = 1, Pad = 0, ReLU = 1, dropout = 1 | 1*1*2048*128
Conv9 | 1*1*2048*4, Stride = 1, Pad = 0, ReLU = 1, dropout = 1 | 1*1*2*128
Softmax loss function | Using cross entropy to calculate loss |
Conv10 | 1*1*2048*1, Stride = 1, Pad = 0, ReLU = 1, dropout = 1 | 1*1*1*128
Conv11 | 1*1*4096*16, Stride = 1, Pad = 0, ReLU = 1, dropout = 1 | 1*1*16*1024
loss | According to a unitary potential field output by conv10 and a pairwise potential field output by conv11, a target loss value of a mixed function is obtained by using formula (5). |
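The spatial sizes in the Output column of Table 1 can be checked with the usual convolution arithmetic. The sketch below assumes that Pad = 2*1*2*1 means a total padding of 3 pixels per axis and that each Mpool is a 3*3 max pooling with stride 2 (as in AlexNet-style networks); neither detail is stated explicitly here, so both are labeled assumptions.

```python
def conv_out(size, kernel, stride=1, pad_total=0):
    """Output width of a convolution or pooling window along one axis."""
    return (size + pad_total - kernel) // stride + 1

def trace_spatial_sizes(input_size=224):
    sizes = {}
    s = conv_out(input_size, 11, stride=4, pad_total=3)   # Conv1
    sizes["conv1"] = s
    s = conv_out(s, 3, stride=2)                          # assumed 3*3 pool, stride 2
    s = conv_out(s, 5, stride=1, pad_total=4)             # Conv2, Pad = 2 per side
    sizes["conv2"] = s
    s = conv_out(s, 3, stride=2)                          # assumed pool
    s = conv_out(s, 3, stride=1, pad_total=2)             # Conv3
    sizes["conv3"] = s
    s = conv_out(s, 3, stride=1, pad_total=2)             # Conv4
    sizes["conv4"] = s
    s = conv_out(s, 3, stride=1, pad_total=2)             # Conv5 before pooling
    s = conv_out(s, 3, stride=2)                          # Table 1 lists the pooled size
    sizes["conv5"] = s
    s = conv_out(s, 6)                                    # Conv6: 6*6 kernel, no padding
    sizes["conv6"] = s
    return sizes

sizes = trace_spatial_sizes()
```

Under these assumptions the trace gives 55, 27, 13, 13, 6 and 1, matching the spatial sizes listed for Conv1 through Conv6.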
(13) Step 3: two pairwise models of head and ear as well as body and head (i.e., a pairwise model 2 and pairwise model 1) are trained individually according to the local model.
(14) Specifically, first to eighth convolutional layers (i.e., conv1 to conv8) of the local model network are selected as a front part, and then a tenth convolutional layer (conv10) and an eleventh convolutional layer (conv11) are connected in parallel to the front part. A loss value can be obtained by inputting the unitary potential field values output by the conv10 and a pairwise potential field value output by the conv11 into the formula (5). A network flow of each the pairwise model is shown in
(15) 1. Building of Pairwise Model
(16) For one of the pairwise models, the front part of the pairwise model is the same as that of the trained local model, and two fully connected layers (i.e., conv10 and conv11) connected in parallel are taken as a rear part of the pairwise model. One of the two fully connected layers is a unitary potential field network model layer, the other one of the two fully connected layers is a pairwise model potential field network layer. A joint score function is defined by formula (1) as follows:
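$$S(y;\omega)=\alpha\sum_{i\in v_1} y_i\,\theta_i^{U}(\omega)+\beta\sum_{j\in v_2} y_j\,\theta_j^{U}(\omega)+\gamma\sum_{(i,j)\in\varepsilon} y_i\,y_j\,\theta_{i,j,k_{i,j}}^{P}(\omega) \qquad (1)$$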
(17) where $S(y;\omega)$ represents the joint score, and α, β and γ (α+β+γ=1) are penalty weights that represent the influence of the different potential fields on the joint score: α represents the influence of the head potential field, β represents the influence of the ear potential field, and γ represents the influence of the pairwise model. Because the head is a larger object with more features, it is easier to detect and less prone to errors and misses, so it is given a relatively small weight; the ear is a small object that is difficult to detect and easily mistaken, so it is given a larger penalty weight. $y_i$ ($i\in v_1$) and $y_j$ ($j\in v_2$) are binary variables, where $v_1$ and $v_2$ are the candidate sets of head and ear respectively. ε is the candidate set of head and ear pairs (i, j), which is also called the edge set.
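The role of the three weights can be illustrated with a small sketch. It assumes that the joint score is the α-weighted sum of the selected head unitary potentials, plus the β-weighted sum of the selected ear unitary potentials, plus the γ-weighted sum of the pairwise potentials over selected pairs, which is how the weights are described above. The function name, the example potentials and the weight values are hypothetical.

```python
import numpy as np

def joint_score(y_head, y_ear, theta_head, theta_ear, theta_pair,
                alpha, beta, gamma):
    """Weighted joint score for binary selections of head and ear candidates.

    theta_pair[i][j] is assumed to be the pairwise potential of head i and
    ear j, already looked up for the cluster the pair belongs to.
    """
    y_head = np.asarray(y_head, dtype=float)
    y_ear = np.asarray(y_ear, dtype=float)
    unary_head = alpha * np.dot(y_head, theta_head)
    unary_ear = beta * np.dot(y_ear, theta_ear)
    pairwise = gamma * (y_head @ np.asarray(theta_pair, dtype=float) @ y_ear)
    return float(unary_head + unary_ear + pairwise)

# Head 0 and ear 1 selected; the weights sum to 1 with the head weight smallest.
score = joint_score([1, 0], [0, 1], [2.0, 5.0], [1.0, 3.0],
                    [[10.0, 20.0], [30.0, 40.0]],
                    alpha=0.2, beta=0.3, gamma=0.5)
```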
(18) For a feature vector $f_i$ of one head and a feature vector $f_j$ of one ear, the corresponding unitary potential field values can be obtained through formula (2) as follows:
$$\theta_i^{U}=\varphi^{U}(f_i,\omega^{U}),\qquad \theta_j^{U}=\varphi^{U}(f_j,\omega^{U}) \qquad (2)$$
(19) A joint potential field value corresponding to the pairwise head and ear can be obtained through formula (3) as follows:
$$\theta_{i,j,k_{i,j}}^{P}=\varphi^{P}(f_i,f_j,\omega^{P}) \qquad (3)$$
(20) where $\theta_i^{U}$ represents the unitary potential field value of the head, $\theta_j^{U}$ represents the unitary potential field value of the ear, and $\varphi^{U}$ maps the candidate box features (also referred to as feature vectors) $f_i$ and $f_j$ to $\theta_i^{U}$ and $\theta_j^{U}$. $\theta_{i,j,k_{i,j}}^{P}$ represents the joint potential field value of the paired head and ear.
(21) For each candidate box p among the paired head and ear candidates, an individual score $s_p(\omega)$, defined as the maximum marginal difference of the joint score, is calculated through formula (4) as follows:
$$s_p(\omega)=\max_{y:\,y_p=1}S(y;\omega)-\max_{y:\,y_p=0}S(y;\omega) \qquad (4)$$
(22) where the numbers of head and ear candidates are equal, $v_1=v_2$, and $v_1+v_2=v$. When the value of v is small, formula (4) can be solved exactly by enumeration. When the value of v is large, quadratic pseudo-Boolean optimization is first used to obtain a suboptimal solution of formula (4) that labels some of the data candidates, and the remaining unlabeled candidates are then solved by enumeration and labeled. Once all data candidates are labeled, the loss of the pairwise model can be calculated. In this design, the quantity of candidate targets is set to 32 and the image scale corresponding to the model is small, so this is a small-scale target optimization. Solving the maximum marginal difference of the joint score is transformed into a quadratic pseudo-Boolean optimization problem, which is an important basic combinatorial optimization problem. A maximum-flow/minimum-cut (graph-cut) algorithm is used to solve the optimal solution for some of the variables, and the remaining unlabeled variables are solved by enumeration. This is a heuristic optimization algorithm that can converge to a good local solution through fast iteration.
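For a small number of candidates, the enumeration approach can be sketched directly. The sketch assumes that the score of a candidate p is the best joint score with p switched on minus the best joint score with p switched off, and it folds the penalty weights into the potentials. The graph-cut stage used for larger problems is not shown. All names and example values are hypothetical.

```python
from itertools import product
import numpy as np

def score(y, theta_u, theta_p):
    """Joint score for one labeling; theta_p holds edge potentials (upper triangle)."""
    y = np.asarray(y, dtype=float)
    return float(y @ theta_u + y @ theta_p @ y)

def max_marginal_scores(theta_u, theta_p):
    """Brute-force max-marginal difference for every candidate."""
    n = len(theta_u)
    best = np.full((n, 2), -np.inf)          # best[p, b]: best score with y_p = b
    for y in product((0, 1), repeat=n):      # 2**n labelings, feasible only for small n
        s = score(y, theta_u, theta_p)
        for p, y_p in enumerate(y):
            best[p, y_p] = max(best[p, y_p], s)
    return best[:, 1] - best[:, 0]

theta_u = np.array([1.0, -2.0])              # unitary potentials of two candidates
theta_p = np.array([[0.0, 3.0], [0.0, 0.0]]) # one edge between candidate 0 and 1
s_p = max_marginal_scores(theta_u, theta_p)
```

In this toy case the second candidate has a negative unitary potential but still receives a positive score, because pairing it with the first candidate raises the joint score.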
(23) The loss function is defined by formula (5) as follows:
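$$L(\omega,\hat{y},X)=\sum_{i:\,\hat{y}_i=1}v\big(s_i(\omega)\big)+\sum_{i:\,\hat{y}_i=0}v\big(-s_i(\omega)\big) \qquad (5)$$
(24) where $v(t)=\log(1+\exp(-t))$.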
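The function v(t) is the logistic loss, and it can be evaluated without overflow as sketched below. The summation in `pairwise_loss` assumes that candidates labeled as truth contribute v(s) and background candidates contribute v(−s); that form and the optional `weights` argument are assumptions for illustration.

```python
import numpy as np

def v(t):
    """Numerically stable log(1 + exp(-t))."""
    return np.logaddexp(0.0, -np.asarray(t, dtype=float))

def pairwise_loss(scores, labels, weights=None):
    """Sum of logistic losses; labels are 1 for truth and 0 for background."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    signed = np.where(labels == 1, scores, -scores)   # flip sign for background
    per_item = v(signed)
    if weights is not None:
        per_item = per_item * np.asarray(weights, dtype=float)
    return float(per_item.sum())

loss = pairwise_loss([0.0, 0.0], [1, 0])
```

A confident correct score (large positive for truth, large negative for background) contributes almost nothing, while a confident wrong score contributes a loss that grows roughly linearly.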
(25) 2. Training of Pairwise Model
(26) ① First, the scores of each image Img obtained through the local model are ranked from high to low and, combined with a non-maximum suppression method, used to select 32 head candidate boxes and 32 ear candidate boxes. Head and ear pairs are then formed from all the selected head candidate boxes and ear candidate boxes, and the paired head and ear data are sorted to form a total of 32*32=1024 candidate pairs, each with a layout of head-left and ear-right.
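The selection and pairing in this step can be sketched as follows. The greedy non-maximum suppression shown here is the common textbook variant, and the IoU threshold of 0.5 is an assumption; the description does not specify either. Boxes use the [y1, x1, y2, x2] layout.

```python
import numpy as np

def iou(box, boxes):
    """Intersection over union of one box against an array of boxes."""
    y1 = np.maximum(box[0], boxes[:, 0])
    x1 = np.maximum(box[1], boxes[:, 1])
    y2 = np.minimum(box[2], boxes[:, 2])
    x2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(y2 - y1, 0, None) * np.clip(x2 - x1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms_top_k(boxes, scores, k=32, iou_threshold=0.5):
    """Keep up to k boxes in descending score order, dropping heavy overlaps."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(np.asarray(scores))[::-1]
    keep = []
    while order.size > 0 and len(keep) < k:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep

def all_pairs(head_ids, ear_ids):
    """Every (head, ear) combination, head first."""
    return [(h, e) for h in head_ids for e in ear_ids]

kept = nms_top_k([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]],
                 [0.9, 0.8, 0.7])
pairs = all_pairs(range(32), range(32))
```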
(27) ② Then, a k-means algorithm is used to perform clustering analysis on the candidate pairs, and a class number k is assigned to each sample among the candidate pairs, where the class number k identifies the cluster center to which the sample belongs. This cluster center is used in the next step ③ to calculate the loss value; the class number enters implicitly through formula (3). The process may be as follows:
(28) k cluster centers, with k=16, are calculated for the 1024 candidate pairs, i.e., the candidate pairs of the pairwise model over all the samples; after the k-means method has been applied to analyze and cluster all the samples, the class numbers [1, 2, 3, ..., 16] are assigned to the samples. The clustering process may be as follows: let the rectangular box of the head truth value be represented by $[y_1, x_1, y_2, x_2]$ and the rectangular box of the ear truth value be represented by $[y_3, x_3, y_4, x_4]$, thereby forming a layout pair of head-left and ear-right.
(29)
(30) A clustering feature F may be expressed as follows:
(31)
(32) The k-means method is applied to perform clustering analysis through the feature F.
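The clustering step can be sketched as below. Because the expression of the clustering feature F is not reproduced in this text, `layout_feature` is a purely hypothetical stand-in (relative center offset and log size ratio of the ear box with respect to the head box), and the k-means loop is a plain Lloyd iteration written for illustration rather than the inventors' implementation.

```python
import numpy as np

def layout_feature(head, ear):
    """Hypothetical layout feature of a head box and an ear box ([y1, x1, y2, x2])."""
    hy1, hx1, hy2, hx2 = head
    ey1, ex1, ey2, ex2 = ear
    hh, hw = hy2 - hy1, hx2 - hx1
    return np.array([
        ((ey1 + ey2) - (hy1 + hy2)) / (2.0 * hh),   # vertical center offset
        ((ex1 + ex2) - (hx1 + hx2)) / (2.0 * hw),   # horizontal center offset
        np.log((ey2 - ey1) / hh),                   # relative height
        np.log((ex2 - ex1) / hw),                   # relative width
    ])

def kmeans(features, k=16, iters=50, seed=0):
    """Plain Lloyd k-means; returns 0-based labels and the cluster centers."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = features[labels == c]
            if len(members) > 0:                    # leave empty clusters where they are
                centers[c] = members.mean(axis=0)
    return labels, centers

f = layout_feature([0, 0, 10, 10], [4, 8, 6, 10])
labels, centers = kmeans([[0, 0], [0, 0.1], [10, 10], [10, 10.1]], k=2)
```

Adding 1 to the returned labels gives class numbers in the 1 to k range used in the text.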
(33) ③ The softmax layer of the local model is removed, and the head feature vector $f_i$ and the ear feature vector $f_j$ obtained from conv8 are taken as initial features and sent into conv10 and conv11 simultaneously. The unitary potential field values are obtained from conv10, and the pairwise potential field value is obtained from conv11. For formula (4), a maximum-flow/minimum-cut method is used to determine the class labels of the candidate boxes; if any candidate box remains unlabeled, the enumeration method is used to determine its class label. Formula (5) is then applied to calculate the final loss value. When the loss value is calculated, the weights of the truth labels are set larger than the background weight (the weights are set from the quantity of truth labels and the quantity of background labels, in inverse proportion to those quantities, and then normalized). In this way, a wrongly assigned class label causes a greater loss, which is added to the final loss. The gradient of the model is then calculated and the parameters are updated by back propagation, so as to obtain the trained model parameters $\omega^{U}$ and $\omega^{P}$ at the lowest loss value.
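One way to read the weighting rule is sketched below: each weight is taken inversely proportional to the number of labels of its kind, and the two weights are normalized to sum to 1. This reading, the function name and the example counts are assumptions; the description does not give an explicit expression.

```python
def class_weights(num_truth, num_background):
    """Normalized weights inversely proportional to the label counts."""
    inv_truth = 1.0 / num_truth
    inv_background = 1.0 / num_background
    total = inv_truth + inv_background
    return inv_truth / total, inv_background / total

# With 4 truth boxes among 32 candidates, the truth labels get the larger weight.
truth_w, background_w = class_weights(4, 28)
```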
(34) Based on the above design process, the pairwise model 2 of head and ear, which has the layout of head-left and ear-right, can be obtained.
(35) By repeating the above building and training operations of step 3, the pairwise model 1 of body and head, which has a layout of body-left and head-right, can be obtained.
(36) Step 4, performing joint detection for an ear by using the local model, the pairwise model 1, the pairwise model 2 and body features.
(37) Because an ear occupies a relatively small portion of the body, detecting the ear in an image whose scene is a half body or even a full body is a difficult problem. Referring to
(38) (1) Candidate boxes of a detected image are first obtained by using the SSW segmentation method and sent to the local model for detection. Local scores of the corresponding classes are obtained and ranked from high to low, and the top 32 candidate boxes of each class are selected through a non-maximum suppression method. According to the pairwise model 1 and the pairwise model 2, the obtained local features are input into conv10 and conv11 to obtain the unitary potential field values and pairwise potential field values of the detected image. Finally, scores are calculated from the unitary potential field values and the pairwise potential field values to obtain a head candidate set $C_h$ and an ear candidate set $C_e$, each ranked by score from high to low.
(39) (2) The pairwise model 1 and the local model are used to detect the head location. Because the body is a large object with rich features, the local model can easily detect its location. The location of the body is then combined with the head candidate box probabilities obtained from the local model: among the head candidate boxes intersecting the upper region of the body, those with high probabilities are selected as a head candidate set. According to the theory of human body proportions, the shoulder width is about 1.5 to 2.3 times the head height, so a head height $H_h$ can be calculated from the width of the body candidate box. Taking the top center of the body candidate box as reference, moving one head height upwards and downwards and one head height leftwards and rightwards forms a reference region $H_s$ of area $4H_h^2$. Candidate boxes intersecting the region $H_s$ are taken as a head candidate box set $H_c$. A head probability threshold $a_h$ is set, and the candidate boxes in $H_c$ whose head probability exceeds $a_h$ are used as head candidate targets $H_{ca}$.
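The geometry of this step can be sketched as follows. The shoulder-to-head ratio of 2.0 (a value inside the stated 1.5 to 2.3 range) and the threshold of 0.5 are assumptions for illustration, as are the function names. Boxes use the [y1, x1, y2, x2] layout with y increasing downwards.

```python
def head_reference_region(body_box, shoulder_ratio=2.0):
    """Square region of side 2*H_h centered on the top center of the body box."""
    y1, x1, y2, x2 = body_box
    head_h = (x2 - x1) / shoulder_ratio          # head height from body width
    cx = (x1 + x2) / 2.0
    region = [y1 - head_h, cx - head_h, y1 + head_h, cx + head_h]
    return region, head_h

def intersects(a, b):
    """True when two boxes overlap with non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def head_candidates(body_box, boxes, probs, a_h=0.5, shoulder_ratio=2.0):
    """Indices of boxes that intersect the region and exceed the threshold a_h."""
    region, _ = head_reference_region(body_box, shoulder_ratio)
    return [i for i, (b, p) in enumerate(zip(boxes, probs))
            if intersects(b, region) and p > a_h]

region, head_h = head_reference_region([100, 40, 300, 120])
selected = head_candidates([100, 40, 300, 120],
                           [[50, 60, 90, 100], [200, 60, 240, 100], [50, 60, 90, 100]],
                           [0.9, 0.9, 0.3])
```

For a body box 80 pixels wide, the head height is 40 and the region is 80 by 80, whose area equals four times the squared head height.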
(40) (3) $C_h$ head candidate targets are selected in descending order of the scores $S_h$ of the pairwise model 1 of body and head. A joint judgement is then performed on $H_{ca}$ and these $C_h$ head candidate targets to obtain their intersection. When the intersection is not empty, the head candidate boxes in the intersection are used as a head candidate target set $H_{sec}$; when the intersection is empty, the head candidate targets with higher scores in $H_{ca}$ are selected as the head candidate target set $H_{sec}$.
(41) (4) From the head candidate target set $H_{sec}$ obtained in the above step, the ear candidate targets corresponding to the head candidate boxes in $H_{sec}$ are calculated. According to the rule that “facial height is divided into approximately three equal parts and facial width into approximately five equal parts”, the ear lies roughly between the upper ⅓ part and the lower ⅓ part, taking the center line of the head height as reference; for a child, the center line is moved down to the lower ⅓ part of the head height. The left and right positions of the ear are at about the leftward ⅕ part and the rightward ⅕ part of the head width. Considering that the outward tilt of the head generally does not exceed 45 degrees, the method extends the left and right positions of the ear outwards by one ⅕ part each (or by three ⅕ parts to cover special cases), so as to bound the left-right range of the ear. Therefore, an embodiment of the invention sets the ear candidate region to the range $[-\tfrac{2}{5}H_h, \tfrac{2}{5}H_h]$, which is obtained from the range [−⅖, ⅖] of the head width, taking the left and right boundary lines as reference and replacing the head width with the head height $H_h$. According to the head regions in the set $H_{sec}$, the corresponding ear regions can be calculated as
(42)
Intersection of a segmentation target set and the ear regions
(43)
is taken as an ear candidate set $S_e$. All the candidate boxes in the set $S_e$ are ranked by their local-model ear scores from high to low to obtain an ear candidate target set $S_{ec}$. Then, the pairwise model 2 is applied to obtain a candidate box score set, and the intersection of the $C_e$ ear candidate boxes contained in the candidate box score set with the ear-containing candidate boxes in the head candidate target set $H_{sec}$ is computed. If the intersection is not empty, it is taken as an ear detection target set $C_{ec}$; if the intersection is empty, the ear candidate boxes with higher scores in the set $C_e$ are selected as the ear detection target set $C_{ec}$.
(44) (5) A joint judgement is performed on the set $C_{ec}$ and the set $S_{ec}$ to obtain their intersection. When the intersection is not empty, the ear candidate box in the intersection with the largest score from the pairwise model 2 is taken as the resultant ear object; when the intersection is empty, the ear candidate box with the largest score in the set $C_{ec}$ is selected as the resultant ear object.
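The intersect-then-fall-back pattern used in this joint judgement (and, in the same spirit, in step (3)) can be sketched as below. Candidate boxes are represented by integer identifiers, and the fallback is assumed to rank by the same pairwise-model score; both choices are assumptions made for the example.

```python
def select_ear(c_ec, s_ec, pairwise_scores):
    """Pick the final ear candidate from two candidate sets.

    c_ec, s_ec: lists of candidate identifiers.
    pairwise_scores: dict mapping identifier -> score from pairwise model 2.
    """
    s_ec_set = set(s_ec)
    common = [box for box in c_ec if box in s_ec_set]
    pool = common if common else list(c_ec)      # fall back to C_ec when empty
    return max(pool, key=lambda box: pairwise_scores[box])

scores = {3: 0.9, 5: 0.4, 7: 0.6, 9: 0.8}
agreed = select_ear([3, 5, 7], [5, 7, 9], scores)   # intersection is {5, 7}
fallback = select_ear([3, 5], [9], scores)          # empty intersection
```

Note that when the two sets agree, the box with the overall highest score (3 here) can lose to a lower-scored box that both sets support.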
(45) (6) Curve evolution of the ear outer contour is performed. In particular, the rectangular candidate box of the ear (i.e., the ear candidate box corresponding to the resultant ear object) is taken as the initial boundary, and curve evolution is performed through a Chan-Vese (C-V) method on the image in a region twice as large as that ear candidate box, thereby obtaining the curve contour of the ear. The ear contour pixel coordinate set is denoted $P_c=\{P_{i,j}\mid i,j\in N\}$. The coordinates (i, j) of the uppermost, lowermost, leftmost and rightmost pixels on the contour curve of the ear are extracted, and a rectangular box is then redrawn from the extracted coordinates as the resultant ear object region.
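The final redrawing of the rectangular box from the contour can be sketched as follows. The Chan-Vese evolution itself is not shown; the sketch only assumes that it has produced a set of (i, j) contour pixel coordinates, with i the row and j the column. The function name and example pixels are hypothetical.

```python
def contour_bounding_box(contour_pixels):
    """Tightest [y1, x1, y2, x2] box around a set of (row, column) pixels."""
    rows = [i for i, _ in contour_pixels]
    cols = [j for _, j in contour_pixels]
    return [min(rows), min(cols), max(rows), max(cols)]

# Uppermost, leftmost, lowermost and rightmost pixels define the box.
ear_box = contour_bounding_box([(12, 40), (30, 35), (45, 42), (28, 55)])
```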