Ear detection method with deep learning pairwise model based on contextual information

Abstract

An ear detection method with a deep learning pairwise model based on contextual information belongs to the field of biometric recognition and addresses the problem that an ear cannot be located in a large scene, especially in an image whose background contains a whole body. The method includes: performing preprocessing and object labeling on images; modifying an Oquab network through transfer learning into a four-class local model and training the local model; training two pairwise models, one of head and ear and one of body and head, based on the local model; and performing joint detection of an ear through the local model, the two pairwise models and body features. The method establishes contextual information through a hierarchical relationship from large to small, which reduces the interference of other features and detects the location of the ear more accurately.

Claims

1. An ear detection method with deep learning pairwise model based on contextual information, comprising:

step 1, image preprocessing and object labeling, comprising: obtaining original images, performing data augmentation processing on the original images to obtain an image training set, and labeling bodies, heads and ears of images in the image training set with classes through rectangular boxes;

step 2, modifying an Oquab network to be a local model for four classes through transfer learning and training the local model, comprising:

i) removing an output layer and a last feature layer of the Oquab network, adding a fully connected layer including rectified linear unit (ReLU) and dropout functions for feature extraction, and adding an output layer including the four classes of body, head, ear and background;

ii) generating suggested candidate boxes for each image in the image training set through a signed sliding window (SSW) method, and adding truth values of the body, the head and the ear into the suggested candidate boxes for each image to form training samples;

iii) reading images of the training samples, calculating an image average value according to the training samples, subtracting the image average value from the training samples and then training; wherein the image average value is represented by $[M_{r}, M_{g}, M_{b}]$, and $M_{r}$ represents a red average value of the training samples, $M_{g}$ represents a green average value of the training samples, and $M_{b}$ represents a blue average value of the training samples; and

iv) employing an already set network structure, and training network parameters of the local model through a stochastic gradient descent method with momentum;

step 3, training two pairwise models of head and ear as well as body and head based on the local model individually, comprising: selecting a first convolutional layer (conv1) through an eighth convolutional layer (conv8) of the local model as a front part of each pairwise model and connecting two fully connected layers (conv10, conv11) in parallel as a rear part of each pairwise model, wherein

1) pairwise model building, comprising: for one of the pairwise models, the front part thereof is the same as that of the trained local model, and the rear part thereof is the two fully connected layers connected in parallel; one of the two fully connected layers is a unitary potential field network model layer, and the other of the two fully connected layers is a pairwise model potential field network layer; and a joint score function is expressed as formula (1):

$\begin{matrix} S(y;\omega) = \alpha \sum_{i \in v_{1}} y_{i} \theta_{i}^{U}(\omega) + \beta \sum_{j \in v_{2}} y_{j} \theta_{j}^{U}(\omega) + \gamma \sum_{(i,j) \in \varepsilon} y_{i} y_{j} \theta_{i,j,k_{i,j}}^{P}(\omega) & (1) \end{matrix}$

where $S(y;\omega)$ represents a joint score; $\alpha$, $\beta$ and $\gamma$ are penalty weights, which represent influences of different potential fields on the joint score; $\alpha$ represents the influence of a head potential field on the joint score, $\beta$ represents the influence of an ear potential field on the joint score, $\gamma$ represents the influence of the pairwise model on the joint score, and $\alpha+\beta+\gamma=1$; $y_{i}$ and $y_{j}$ are binary variables, $v_{1}$ and $v_{2}$ are candidate variables of head and ear respectively; and $\varepsilon$ is a candidate set of pairwise head and ear formed by $(i, j)$, namely, an edge set; corresponding to a feature vector $f_{i}$ of single head and a feature vector $f_{j}$ of single ear, corresponding unitary potential field values are obtained through formula (2):

$\begin{matrix} \theta_{i}^{U} = \phi^{U}(f_{i}, \omega^{U}), \quad \theta_{j}^{U} = \phi^{U}(f_{j}, \omega^{U}) & (2) \end{matrix}$

a joint potential field value corresponding to the head and the ear in pair is obtained through formula (3):

$\begin{matrix} \theta_{i,j,k_{i,j}}^{P} = \phi_{k_{i,j}}^{P}(f_{i}, f_{j}, \omega^{P}) & (3) \end{matrix}$

where $\theta_{i}^{U}$ represents the unitary potential field value of the head, $\theta_{j}^{U}$ represents the unitary potential field value of the ear, $\phi^{U}$ maps the feature vectors $f_{i}$ and $f_{j}$ to $\theta_{i}^{U}$ and $\theta_{j}^{U}$, $\theta_{i,j,k_{i,j}}^{P}$ represents the joint potential field value of the head and the ear in pair, $\phi^{P}$ maps candidate features of the head and the ear in pair to $\theta_{i,j,k_{i,j}}^{P}$, a k-th component corresponds to a k-th cluster center index, and $\omega^{U}$ and $\omega^{P}$ are trainable parameters; for each pair $p$ of head and ear candidate boxes, an individual score $s_{p}(\omega)$ defined by a maximum marginal difference of joint score is obtained through formula (4):

$\begin{matrix} s_{p}(\omega) = \max_{y: y_{p} = 1} S(y;\omega) - \max_{y: y_{p} = 0} S(y;\omega) & (4) \end{matrix}$

where $v_{1}=v_{2}$ and $v_{1}+v_{2}=v$; when a value of $v$ is small, an enumeration method is used to solve an optimal solution of the formula (4); when the value of $v$ is large, a quadratic pseudo-Boolean function is first used to solve a suboptimal solution of the formula (4) and some of data are labeled, and then remaining unlabeled data are solved through the enumeration method and labeled; and after data all are labeled, a function loss of the pairwise model is calculated through a loss function expressed as formula (5):

$\begin{matrix} L(\omega, \hat{y}, X) = \sum_{i: \hat{y}_{i} = 1} v(s_{i}(\omega, x)) + \sum_{i: \hat{y}_{i} = 0} v(-s_{i}(\omega, x)) & (5) \end{matrix}$

where $v(t)=\log(1+\exp(-t))$;

2) pairwise model training, comprising:

① according to scores of images obtained through the local model, selecting head candidate boxes and ear candidate boxes by ranking the scores from high to low and using a non-maximum suppression method, forming head and ear pairs based on the selected head candidate boxes and ear candidate boxes, and sorting paired head and ear data to form candidate pairs each with a layout of head-left and ear-right;

② performing cluster analysis on the candidate pairs through a k-means method, and assigning a class number k to each of samples in the candidate pairs, wherein the class number k refers to a cluster center to which each sample belongs;

③ removing a softmax layer of the local model, taking the feature vector $f_{i}$ of head and the feature vector $f_{j}$ of ear obtained by the eighth convolutional layer (conv8) as initial features, sending the initial features simultaneously to the two fully connected layers (conv10, conv11) to thereby obtain the unitary potential field values from the unitary potential field network model layer and the joint potential field value from the pairwise model potential field network layer; determining class labels of candidate boxes by using a maximum flow minimum cut method for the formula (4), using the enumeration method to determine a class label of each remaining candidate box when there is a candidate box that has not been labeled, calculating a final function loss value through the formula (5), obtaining trained values of the trainable parameters $\omega^{U}$, $\omega^{P}$ under a lowest loss value by calculating a gradient differential of the pairwise model and updating parameters through back propagation, and thereby obtaining the pairwise model of head and ear with the layout of head-left and ear-right; and

repeating the above 1) and 2) in the step 3, and thereby obtaining the pairwise model of body and head with a layout of body-left and head-right;

step 4, performing joint detection for an ear through the local model, the two pairwise models and body features, comprising:

I) obtaining candidate boxes of a detected image through the SSW segmentation method, and sending the candidate boxes of the detected image to the local model for detection to obtain local scores of corresponding classes of the candidate boxes of the detected image, ranking the local scores from high to low, and selecting candidate boxes from the candidate boxes of the detected image for the corresponding classes through a non-maximum suppression method; inputting local features corresponding to the selected candidate boxes to the two fully connected layers to obtain unitary potential field values and pairwise potential field values of the detected image according to the two pairwise models; calculating based on the unitary potential field values and the pairwise potential field values, to obtain a head candidate set and an ear candidate set as per scores ranking from high to low;

II) detecting a head location through the pairwise model of body and head and the local model, comprising: calculating a head height $H_{h}$ according to a width of a body candidate box; moving upwards, downwards, leftwards and rightwards each with a distance of one head height $H_{h}$ to form a reference region $H_{s}=4H_{h}^{2}$ by taking a top center of the body candidate box as a reference, obtaining candidate boxes in the head candidate set intersecting with the reference region $H_{s}$ as a head candidate box set $H_{c}$, setting a head probability threshold $a_{h}$, and obtaining candidate boxes in the head candidate box set $H_{c}$ meeting a condition of $H_{c}>a_{h}$ as head candidate targets $H_{ca}$;

III) selecting $C_{h}$ number of head candidate targets as per scores $S_{h}$ of the pairwise model of body and head ranking from high to low, performing joint judgement on the head candidate targets $H_{ca}$ and the $C_{h}$ number of head candidate targets to obtain a first intersection of the head candidate targets $H_{ca}$ and the $C_{h}$ number of head candidate targets; taking the first intersection as a head candidate target set $H_{sec}$ when the first intersection is not empty, or selecting the head candidate targets with larger scores from the $C_{h}$ number of head candidate targets as the head candidate target set $H_{sec}$ when the first intersection is empty;

IV) calculating ear candidate targets corresponding to head candidate boxes in the head candidate target set $H_{sec}$ based on the head candidate target set $H_{sec}$, comprising: setting an ear candidate region in a range of $[- \frac{2}{5} H_{h}, \frac{2}{5} H_{h}]$ obtained according to a range of $[- \frac{2}{5}, \frac{2}{5}]$ of head width using left and right boundary lines as reference, with the head width replaced by the head height $H_{h}$, calculating corresponding ear regions $[- \frac{2}{5} H_{h}, \frac{2}{5} H_{h}]$ according to head regions in the head candidate target set $H_{sec}$, obtaining an intersection of a segmentation target set and the ear regions $[- \frac{2}{5} H_{h}, \frac{2}{5} H_{h}]$ as an ear candidate set $S_{e}$, obtaining an ear candidate target set $S_{ec}$ by ranking candidate boxes in the ear candidate set $S_{e}$ as per scores of the local model for ear from high to low; obtaining a candidate box score set by applying the pairwise model of head and ear, obtaining a second intersection of $C_{e}$ number of ear candidate boxes contained in the candidate box score set and ear-containing candidate boxes in the head candidate target set $H_{sec}$; taking the second intersection as an ear detection target set $C_{ec}$ when the second intersection is not empty, or selecting ear candidate boxes with larger scores from the $C_{e}$ number of ear candidate boxes as the ear detection target set $C_{ec}$ when the second intersection is empty;

V) performing joint judgement on the ear detection target set $C_{ec}$ and the ear candidate target set $S_{ec}$, obtaining a third intersection of the ear detection target set $C_{ec}$ and the ear candidate target set $S_{ec}$, taking an ear candidate box with a largest score of the pairwise model of head and ear in the third intersection as a resultant ear object when the third intersection is not empty, or selecting the ear candidate box with a largest score from the ear detection target set $C_{ec}$ as the resultant ear object when the third intersection is empty; and

VI) performing curve evolution of ear outer contour, comprising: obtaining a curve contour of the ear by taking the ear candidate box corresponding to the resultant ear object as an initial boundary and performing the curve evolution, through a Chan-Vese (C-V) method, on an image in a region twice as large as the ear candidate box corresponding to the resultant ear object; extracting coordinates of uppermost, lowermost, leftmost and rightmost pixels in the curve contour of the ear from an ear contour pixel coordinate set $P_{c}=\{P_{r,c} \mid r, c \in N\}$; and redrawing a rectangular box based on the coordinates of the uppermost, lowermost, leftmost and rightmost pixels as a resultant ear object region.

2. The ear detection method with deep learning pairwise model based on contextual information according to claim 1, wherein weights of truth labels and a background weight are set based on a quantity of the truth labels and a quantity of background labels inversely proportional to the quantity of the truth labels and normalized, when using the formula (5) to calculate the final function loss value.

Description

BRIEF DESCRIPTION OF DRAWINGS

(1) FIG. 1 is a schematic flowchart of a method according to an embodiment of the invention.

(2) FIG. 2 is a schematic view of a network flow of a local model according to an embodiment of the invention.

(3) FIG. 3 is a schematic view of a network flow of a pairwise model according to an embodiment of the invention.

(4) FIG. 4 is a schematic diagram of a model joint detection according to an embodiment of the invention.

DETAILED DESCRIPTION OF EMBODIMENTS

(5) The invention will be described in detail below in combination with the accompanying drawings and embodiments.

(6) Referring to FIG. 1, an embodiment of the invention provides an ear detection method with deep learning pairwise model based on contextual information, including the following steps.

(7) Step 1: image preprocessing and object labeling

(8) In particular, 700 original images are obtained through web collection and personal photography, and the original images are then subjected to data augmentation, including operations such as image flipping, image resizing, image translation, image rotation, and noise addition; an image training set of more than 8000 images in total is thereby obtained. Afterwards, the bodies, heads and ears in the images of the image training set are labeled with classes.
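As a non-limiting illustration, the augmentation operations listed above might be sketched in Python with NumPy as follows. The function name `augment`, the shift range, the noise level and the restriction to 90-degree rotations and subsampled resizing are assumptions made for brevity and are not taken from the embodiment.

```python
import numpy as np

def augment(image, rng):
    """Return five augmented copies of an H x W x 3 uint8 image (illustrative only)."""
    out = []
    # Horizontal flip.
    out.append(image[:, ::-1, :].copy())
    # Translation by a few pixels; np.roll wraps around at the border,
    # whereas a production pipeline might pad or crop instead.
    dy, dx = rng.integers(-10, 11, size=2)
    out.append(np.roll(image, shift=(int(dy), int(dx)), axis=(0, 1)))
    # Rotation by a multiple of 90 degrees (arbitrary angles need interpolation).
    out.append(np.rot90(image, k=int(rng.integers(1, 4)), axes=(0, 1)).copy())
    # Resize to half size by nearest-neighbour subsampling.
    out.append(image[::2, ::2, :].copy())
    # Additive Gaussian noise, clipped back to the valid pixel range.
    noisy = image.astype(np.float32) + rng.normal(0.0, 8.0, size=image.shape)
    out.append(np.clip(noisy, 0, 255).astype(np.uint8))
    return out
```

In such a sketch the labeled rectangular boxes would have to be transformed together with the pixels (except for noise addition), which is omitted here.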

(9) Step 2: modifying an Oquab convolutional neural network (also referred to as Oquab network) to be a local model for four classes through transfer learning, and training the local model.

(10) Transfer learning transfers feature parameters that a source network has learned from a large amount of data to a new network trained with a small number of samples. The Oquab network was proposed in Maxime Oquab, Leon Bottou, Ivan Laptev, Josef Sivic, “Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks”, CVPR, June 2014. The Oquab network is copied, its output layer and last feature layer are removed, a fully connected layer including rectified linear unit (ReLU) and dropout functions is added for feature extraction, and an output layer of four classes (i.e., body, head, ear and background) is added. For each image in the image training set, a signed sliding window (SSW) method is used to generate suggested candidate boxes, and truth values of the body, the head and the ear are added to the suggested candidate boxes of each image to form initial training samples. Afterwards, the images of the training samples are read, an image average value is calculated over the training samples, and the average value is subtracted from each training sample before training. The calculated image average value is $[M_{r}, M_{g}, M_{b}]$, where $M_{r}$, $M_{g}$ and $M_{b}$ represent the red, green and blue average values of the training samples, respectively. The already set network structure is then employed, and a stochastic gradient descent method with momentum is used to train the network parameters of the local model.
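By way of example only, the mean subtraction and one update of gradient descent with momentum might be sketched as follows. The function names, the learning rate and the momentum coefficient are assumptions; the embodiment does not specify them.

```python
import numpy as np

def channel_means(images):
    """Per-channel mean [M_r, M_g, M_b] over a batch of N x H x W x 3 RGB images."""
    return images.astype(np.float64).mean(axis=(0, 1, 2))

def subtract_mean(images, means):
    """Subtract the per-channel mean from every pixel of every image."""
    return images.astype(np.float64) - means

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One update: velocity <- momentum * velocity - lr * grad; w <- w + velocity."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity
```

The same mean vector computed on the training samples would normally be reused, unchanged, when a detected image is fed to the network.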

(11) The images inputted into the local model network are 224*224*3, and a whole network flow is shown in FIG. 2. See Table 1 for the parameters.

(12) TABLE 1 network parameters (input: 224*224*3)

Layer          Parameters                                                        Output
Conv1          11*11*3*96, Stride = 4, Pad = 2*1*2*1, ReLU = 1, Mpool = 2        55*55*96*128
Conv2          5*5*96*256, Stride = 1, Pad = 2, ReLU = 1, Mpool = 2              27*27*256*128
Conv3          3*3*256*384, Stride = 1, Pad = 1, ReLU = 1                        13*13*384*128
Conv4          3*3*384*384, Stride = 1, Pad = 1, ReLU = 1                        13*13*384*128
Conv5          3*3*384*256, Stride = 1, Pad = 1, ReLU = 1, Mpool = 2             6*6*256*128
Conv6          6*6*256*6144, Stride = 1, Pad = 0, ReLU = 1, dropout = 1          1*1*6144*128
Conv7          1*1*6144*6144, Stride = 1, Pad = 0, ReLU = 1, dropout = 1         1*1*6144*128
Conv8          1*1*6144*2048, Stride = 1, Pad = 0, ReLU = 1, dropout = 1         1*1*2048*128
Conv9          1*1*2048*4, Stride = 1, Pad = 0, ReLU = 1, dropout = 1            1*1*2*128
Softmax loss   Using cross entropy to calculate loss function
Conv10         1*1*2048*1, Stride = 1, Pad = 0, ReLU = 1, dropout = 1            1*1*1*128
Conv11         1*1*4096*16, Stride = 1, Pad = 0, ReLU = 1, dropout = 1           1*1*16*1024
loss           According to a unitary potential field output by conv10 and a pairwise potential field output by conv11, a target loss value of a mixed function is obtained by using formula (5).

(13) Step 3: two pairwise models of head and ear as well as body and head (i.e., a pairwise model 2 and pairwise model 1) are trained individually according to the local model.

(14) Specifically, the first to eighth convolutional layers (i.e., conv1 to conv8) of the local model network are selected as a front part, and a tenth convolutional layer (conv10) and an eleventh convolutional layer (conv11) are then connected in parallel to the front part. A loss value can be obtained by inputting the unitary potential field values output by conv10 and the pairwise potential field value output by conv11 into formula (5). The network flow of each pairwise model is shown in FIG. 3, and the parameters are shown in Table 1.

(15) 1. Building of Pairwise Model

(16) For one of the pairwise models, the front part of the pairwise model is the same as that of the trained local model, and two fully connected layers (i.e., conv10 and conv11) connected in parallel are taken as a rear part of the pairwise model. One of the two fully connected layers is a unitary potential field network model layer, the other one of the two fully connected layers is a pairwise model potential field network layer. A joint score function is defined by formula (1) as follows:
$\begin{matrix} S(y;\omega) = \alpha \sum_{i \in v_{1}} y_{i} \theta_{i}^{U}(\omega) + \beta \sum_{j \in v_{2}} y_{j} \theta_{j}^{U}(\omega) + \gamma \sum_{(i,j) \in \varepsilon} y_{i} y_{j} \theta_{i,j,k_{i,j}}^{P}(\omega); & (1) \end{matrix}$

(17) where $S(y;\omega)$ represents a joint score, and $\alpha$, $\beta$ and $\gamma$ ($\alpha+\beta+\gamma=1$) are penalty weights that represent the influences of the different potential fields on the joint score: $\alpha$ represents the influence of the head potential field, $\beta$ represents the influence of the ear potential field, and $\gamma$ represents the influence of the pairwise model. Because the head is a larger object with more features, it is easier to detect and the probability of error and loss is smaller, so it is given a relatively small weight; the ear is a small object that is difficult to detect and easily mistaken, so it is given a larger penalty weight. $y_{i}$ ($i \in v_{1}$) and $y_{j}$ ($j \in v_{2}$) are binary variables, and $v_{1}$ and $v_{2}$ are the candidate variables of head and ear, respectively. $\varepsilon$ is the candidate set of head and ear pairs formed by $(i, j)$, which is also called the edge set.

(18) For a feature vector $f_{i}$ of one head and a feature vector $f_{j}$ of one ear, the corresponding unitary potential field values can be obtained through formula (2) as follows:
$\begin{matrix} \theta_{i}^{U} = \phi^{U}(f_{i}, \omega^{U}), \quad \theta_{j}^{U} = \phi^{U}(f_{j}, \omega^{U}). & (2) \end{matrix}$

(19) A joint potential field value corresponding to the pairwise head and ear can be obtained through formula (3) as follows:
$\begin{matrix} \theta_{i,j,k_{i,j}}^{P} = \phi_{k_{i,j}}^{P}(f_{i}, f_{j}, \omega^{P}). & (3) \end{matrix}$

(20) where $\theta_{i}^{U}$ represents the unitary potential field value of the head, $\theta_{j}^{U}$ represents the unitary potential field value of the ear, and $\phi^{U}$ maps the candidate box features (also referred to as feature vectors) $f_{i}$ and $f_{j}$ to $\theta_{i}^{U}$ and $\theta_{j}^{U}$. $\theta_{i,j,k_{i,j}}^{P}$ represents the joint potential field value of the pairwise head and ear, and $\phi^{P}$ maps the candidate box features of the pairwise head and ear to $\theta_{i,j,k_{i,j}}^{P}$. A k-th component corresponds to a k-th cluster center index, and $\omega^{U}$ and $\omega^{P}$ are trainable parameters.

(21) For each pair $p$ of head and ear candidate boxes, an individual score $s_{p}(\omega)$, defined by a maximum marginal difference of the joint score, is calculated through formula (4) as follows:
$\begin{matrix} s_{p}(\omega) = \max_{y: y_{p} = 1} S(y;\omega) - \max_{y: y_{p} = 0} S(y;\omega). & (4) \end{matrix}$

(22) where $v_{1}=v_{2}$ and $v_{1}+v_{2}=v$. When the value of $v$ is small, an enumeration method can be used to solve formula (4) exactly. When the value of $v$ is large, a quadratic pseudo-Boolean function is first used to obtain a suboptimal solution of formula (4) and to label some of the candidates, and the remaining unlabeled candidates are then solved by the enumeration method and labeled. After all candidates are labeled, the function loss of the pairwise model can be calculated. In this design, the quantity of candidate targets is set to 32 and the image scale corresponding to the model is small, so this is a small-scale target optimization. Solving the maximum marginal difference of the joint score is transformed into a quadratic pseudo-Boolean function optimization problem, which is an important basic combinatorial optimization problem. A maximum flow minimum cut graph-cut algorithm is used to solve the optimal solution of some variables, and the remaining unlabeled variables are solved by the enumeration method. This is a heuristic optimization algorithm that can converge to a better local solution through fast iteration.
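For a small number of candidates, the enumeration approach to formulas (1) and (4) might be sketched as follows. This is a minimal illustration only: the function names are hypothetical, the pairwise potentials are passed as a dictionary keyed by (head index, ear index), and the index p is assumed to run over the head candidates first and then the ear candidates. The graph-cut stage used for large candidate sets is not shown.

```python
from itertools import product

def joint_score(y_head, y_ear, theta_head, theta_ear, theta_pair, alpha, beta, gamma):
    """Formula (1): weighted sum of unitary and pairwise potentials for one labeling."""
    s = alpha * sum(y * t for y, t in zip(y_head, theta_head))
    s += beta * sum(y * t for y, t in zip(y_ear, theta_ear))
    s += gamma * sum(y_head[i] * y_ear[j] * t for (i, j), t in theta_pair.items())
    return s

def max_marginal(p, theta_head, theta_ear, theta_pair, alpha, beta, gamma):
    """Formula (4) by brute force: best score with y_p = 1 minus best score with y_p = 0."""
    n_h, n_e = len(theta_head), len(theta_ear)
    best = {0: float("-inf"), 1: float("-inf")}
    # Enumerate all 2**(n_h + n_e) binary labelings; feasible only for small v.
    for y in product((0, 1), repeat=n_h + n_e):
        s = joint_score(y[:n_h], y[n_h:], theta_head, theta_ear, theta_pair,
                        alpha, beta, gamma)
        best[y[p]] = max(best[y[p]], s)
    return best[1] - best[0]
```

The cost doubles with every added candidate, which is consistent with the text reserving enumeration for small v or for the variables left unlabeled by the graph cut.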

(23) The loss function is defined by formula (5) as follows:
$\begin{matrix} L(\omega, \hat{y}, X) = \sum_{i: \hat{y}_{i} = 1} v(s_{i}(\omega, x)) + \sum_{i: \hat{y}_{i} = 0} v(-s_{i}(\omega, x)) & (5) \end{matrix}$

(24) where v(t)=log (1+exp (−t)).
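As an illustration, the loss of formula (5) might be evaluated as in the following sketch. The function names are hypothetical, and the two-branch form of v(t) is a standard numerical-stability rewrite of the same expression rather than something stated in the embodiment.

```python
import math

def v(t):
    """v(t) = log(1 + exp(-t)), evaluated without overflow for large |t|."""
    if t >= 0:
        return math.log1p(math.exp(-t))
    # For t < 0: log(1 + exp(-t)) = -t + log(1 + exp(t)).
    return -t + math.log1p(math.exp(t))

def pairwise_loss(scores, labels):
    """Formula (5): v(s_i) for candidates labeled 1 plus v(-s_i) for candidates labeled 0."""
    return sum(v(s) if y == 1 else v(-s) for s, y in zip(scores, labels))
```

A candidate labeled 1 with a high score, or labeled 0 with a strongly negative score, contributes a loss close to zero; the opposite cases contribute a loss that grows roughly linearly with the score magnitude.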

(25) 2. Training of Pairwise Model

(26) ① First, the scores that the local model assigns to the respective images Img are ranked from high to low and combined with a non-maximum suppression method to select 32 head candidate boxes and 32 ear candidate boxes, and head and ear pairs are then formed from all the selected head candidate boxes and ear candidate boxes. Afterwards, all the paired head and ear data are sorted to form a total of 32*32=1024 candidate pairs, each with a layout of head-left and ear-right.
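The selection and pairing in step ① might be sketched as follows. The function names and the overlap threshold of 0.5 are assumptions; the embodiment names non-maximum suppression but does not give its threshold. Boxes are written as (y1, x1, y2, x2).

```python
def iou(a, b):
    """Intersection over union of two boxes given as (y1, x1, y2, x2)."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms_top_k(boxes, scores, k=32, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of at most k kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if len(keep) == k:
                break
    return keep

def make_pairs(head_indices, ear_indices):
    """All (head, ear) index pairs, head first, matching the head-left/ear-right layout."""
    return [(h, e) for h in head_indices for e in ear_indices]
```

With 32 kept head boxes and 32 kept ear boxes, `make_pairs` yields the 1024 candidate pairs mentioned above.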

(27) ② Then, a k-means algorithm is used to perform clustering analysis on the candidate pairs, and a class number k is assigned to each sample in the candidate pairs, where the class number k refers to the cluster center to which the sample belongs. This cluster center is used in the next step ③ to calculate the loss value; it enters formula (3) implicitly through the cluster index. The process may be as follows:

(28) k cluster centers, with k=16, are calculated for the 1024 candidate pairs, i.e., the candidate pairs of the pairwise model of all the samples; after the k-means method is applied to analyze and cluster all the samples, the class numbers [1, 2, 3 . . . 16] are assigned to the samples. The clustering process may be as follows: let the rectangular box of the head truth value be represented by $[y_{1}, x_{1}, y_{2}, x_{2}]$ and the rectangular box of the ear truth value be represented by $[y_{3}, x_{3}, y_{4}, x_{4}]$, thereby forming a layout pair of head-left and ear-right.

(29) $\begin{matrix} H_{c} = (X_{h c}, Y_{h c}) = (\frac{x_{1} + x_{2}}{2}, \frac{y_{1} + y_{2}}{2}); & (6) \end{matrix}$
$\begin{matrix} (w_{h}, h_{h}) = (x_{2} - x_{1} + 1, y_{2} - y_{1} + 1); & (7) \end{matrix}$
$\begin{matrix} E_{c} = (X_{e c}, Y_{e c}) = (\frac{x_{3} + x_{4}}{2}, \frac{y_{3} + y_{4}}{2}); & (8) \end{matrix}$
$\begin{matrix} (w_{e}, h_{e}) = (x_{4} - x_{3} + 1, y_{4} - y_{3} + 1) . & (9) \end{matrix}$

(30) A clustering feature F may be expressed as follows:

(31) $\begin{matrix} f_{1} = \left| X_{h c} - X_{e c} \right|; & (10) \end{matrix}$
$\begin{matrix} f_{2} = \left| Y_{h c} - Y_{e c} \right|; & (11) \end{matrix}$
$\begin{matrix} f_{3} = \frac{w_{h} \cdot h_{h}}{w_{e} \cdot h_{e}}; & (12) \end{matrix}$
$\begin{matrix} F = (f_{1}, f_{2}, f_{3}) . & (13) \end{matrix}$

(32) The k-means method is applied to perform clustering analysis through the feature F.
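The clustering feature of formulas (6) to (13) and a basic k-means assignment might be sketched as follows. The function names are hypothetical, the head width is taken as the right edge minus the left edge plus one (consistent with formula (9) for the ear), and the plain Lloyd iteration with random initialization is an assumption, since the embodiment does not say which k-means variant is used.

```python
import numpy as np

def pair_feature(head_box, ear_box):
    """Feature F = (f1, f2, f3) for boxes given as [y1, x1, y2, x2]."""
    y1, x1, y2, x2 = head_box
    y3, x3, y4, x4 = ear_box
    x_hc, y_hc = (x1 + x2) / 2.0, (y1 + y2) / 2.0      # head center, formula (6)
    w_h, h_h = x2 - x1 + 1, y2 - y1 + 1                # head size, formula (7)
    x_ec, y_ec = (x3 + x4) / 2.0, (y3 + y4) / 2.0      # ear center, formula (8)
    w_e, h_e = x4 - x3 + 1, y4 - y3 + 1                # ear size, formula (9)
    return (abs(x_hc - x_ec), abs(y_hc - y_ec), (w_h * h_h) / (w_e * h_e))

def kmeans(features, k=16, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    x = np.asarray(features, dtype=np.float64)
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):                    # leave empty clusters in place
                centers[c] = x[labels == c].mean(axis=0)
    dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1), centers
```

The labels returned here are zero-based, so adding one gives the class numbers 1 to 16 used in the description.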

(33) ③ The softmax layer of the local model is removed, the head feature vector $f_{i}$ and the ear feature vector $f_{j}$ obtained by conv8 are taken as initial features, and the initial features are sent into conv10 and conv11 simultaneously. The unitary potential field values are obtained from conv10, and the pairwise potential field value is obtained from conv11. For formula (4), a maximum flow minimum cut method is used to determine the class labels of the candidate boxes. If any candidate box remains unlabeled, the enumeration method is used to determine the class label of each remaining candidate box. Formula (5) is then applied to calculate the final function loss value. When the loss value is calculated, the weights of truth labels are set to be larger than the background weight (the weights are set inversely proportional to the quantity of truth labels and the quantity of background labels, and are normalized). In this way, a wrongly assigned class label causes a greater loss, which is added to the final loss. Then, the gradient of the model is calculated and the parameters are updated by back propagation, so as to obtain the trained model parameters $\omega^{U}$ and $\omega^{P}$ at the lowest loss value.
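One possible reading of the label weighting described above is sketched below: each class weight is taken inversely proportional to the number of labels of that class, and the two weights are normalized to sum to one. This reading, the function names, and the requirement that both label counts be non-zero are assumptions.

```python
import math

def v(t):
    """v(t) = log(1 + exp(-t)), evaluated without overflow."""
    if t >= 0:
        return math.log1p(math.exp(-t))
    return -t + math.log1p(math.exp(t))

def label_weights(n_truth, n_background):
    """Inverse-frequency weights (w_truth, w_background), normalized to sum to 1.

    Assumes both counts are greater than zero.
    """
    inv_truth, inv_background = 1.0 / n_truth, 1.0 / n_background
    total = inv_truth + inv_background
    return inv_truth / total, inv_background / total

def weighted_loss(scores, labels):
    """Formula (5) with each term scaled by the weight of its label class."""
    n_truth = sum(1 for y in labels if y == 1)
    w_truth, w_background = label_weights(n_truth, len(labels) - n_truth)
    return sum(w_truth * v(s) if y == 1 else w_background * v(-s)
               for s, y in zip(scores, labels))
```

Because truth labels are normally far fewer than background labels, this scheme gives truth labels the larger weight, as the description requires.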

(34) Based on the above design process, a pairwise model 2 of head and ear, which has the layout of head-left and ear-right, can be obtained.

(35) By repeating the above operations of building and training in the step 3, a pairwise model 1 of body and head, which has a layout of body-left and head-right, can be obtained.

(36) Step 4, performing joint detection for an ear by using the local model, the pairwise model 1, the pairwise model 2 and body features.

(37) Because an ear occupies a relatively small portion of the body, detecting the ear in an image whose scene contains a half body or even a full body is difficult. Referring to FIG. 4, an embodiment of the invention trains the pairwise model 1 (body and head) and the pairwise model 2 (head and ear) as well as the local model for body, head and ear; the two pairwise models and the one local model are used to jointly judge and detect the ear. As shown in FIG. 4, Bkg represents an image background, bb1 represents a body rectangular box, bb2 represents a head rectangular box, and bb3 represents an ear rectangular box.

(38) (1) Candidate boxes of a detected image are first obtained by using an SSW segmentation method, and the obtained candidate boxes are sent to the local model for detection. Local scores of the corresponding classes are obtained and ranked from high to low, and the first 32 candidate boxes of each class are selected through a non-maximum suppression method. According to the pairwise model 1 and the pairwise model 2, the obtained local features are input into conv10 and conv11 to obtain the unitary potential field values and pairwise potential field values of the detected image. Finally, calculation is performed based on the unitary potential field values and the pairwise potential field values to obtain a head candidate set $C_{h}$ and an ear candidate set $C_{e}$, ranked by score from high to low.

(39) (2) The pairwise model 1 and the local model are used to detect a head location. Because the body is a large object with rich features, the local model detects the location of the body easily. The location information of the body is then combined with the head candidate box probabilities obtained from the local model: among the head candidate boxes intersecting an upper region of the body, those with high probabilities are selected as a head candidate set. According to the theory of human body proportions, the shoulder width is about 1.5 to 2.3 times the height of the head, so a head height $H_{h}$ can be calculated from the width of the body candidate box. Taking the top center of the body candidate box as a reference and moving upwards, downwards, leftwards and rightwards by one head height each forms a reference region $H_{s}=4H_{h}^{2}$. Candidate boxes intersecting the region $H_{s}$ are taken as a head candidate box set $H_{c}$. A head probability threshold $a_{h}$ is set, and candidate boxes meeting the condition of $H_{c}>a_{h}$ are used as head candidate targets $H_{ca}$.
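The construction of the reference region and the selection of head candidate targets might be sketched as follows. The function names, the shoulder-to-head ratio of 2.0 (one value inside the 1.5 to 2.3 range), the probability threshold of 0.5, and the use of the body box width as the shoulder width are assumptions. Boxes are written as (y1, x1, y2, x2), with y increasing downwards.

```python
def head_reference_region(body_box, shoulder_to_head=2.0):
    """Square of side 2 * H_h centered on the top center of the body box.

    Returns (region, H_h); the region area equals 4 * H_h ** 2.
    """
    y1, x1, y2, x2 = body_box
    head_height = (x2 - x1) / shoulder_to_head          # H_h from the body width
    center_x = (x1 + x2) / 2.0
    region = (y1 - head_height, center_x - head_height,
              y1 + head_height, center_x + head_height)
    return region, head_height

def intersects(a, b):
    """True if two (y1, x1, y2, x2) boxes overlap with non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def head_candidate_targets(body_box, head_boxes, head_probs,
                           prob_threshold=0.5, shoulder_to_head=2.0):
    """Indices of head boxes that intersect the region and exceed the threshold a_h."""
    region, _ = head_reference_region(body_box, shoulder_to_head)
    return [i for i, (box, p) in enumerate(zip(head_boxes, head_probs))
            if intersects(box, region) and p > prob_threshold]
```

For a body box 100 pixels wide and a ratio of 2.0, the head height is 50 pixels and the reference region is a 100 by 100 square straddling the top edge of the body box.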

(40) (3) $C_{h}$ head candidate targets are selected according to the scores $S_{h}$ of the pairwise model 1 of body and head, ranked from high to low. Joint judgement is then performed on $H_{ca}$ and the $C_{h}$ head candidate targets to obtain their intersection. When the intersection is not empty, the head candidate boxes of the intersection are used as a head candidate target set $H_{sec}$; when the intersection is empty, the head candidate targets with higher scores in $H_{ca}$ are selected as the head candidate target set $H_{sec}$.
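The joint judgement with its fallback might be sketched as follows. The function name, the default values, and the use of local-model scores to rank the fallback candidates are assumptions; candidates are identified by arbitrary hashable ids.

```python
def joint_judgement(h_ca, pairwise_scores, local_scores, c_h=5, fallback_n=1):
    """Intersect H_ca with the top-C_h candidates of the pairwise model.

    h_ca: ids of head candidate targets from the local model.
    pairwise_scores: dict id -> score S_h of the body-head pairwise model.
    local_scores: dict id -> local-model score, used only for the fallback.
    """
    top_pairwise = sorted(pairwise_scores, key=pairwise_scores.get, reverse=True)[:c_h]
    h_ca_set = set(h_ca)
    common = [i for i in top_pairwise if i in h_ca_set]
    if common:
        return common
    # Empty intersection: fall back to the best-scoring candidates in H_ca.
    return sorted(h_ca, key=lambda i: local_scores[i], reverse=True)[:fallback_n]
```

The same intersect-or-fall-back pattern recurs in the later ear selection steps, with different candidate sets and scores.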

(41) (4) According to the head candidate target set $H_{sec}$ obtained in the above step, ear candidate targets corresponding to the head candidate boxes in the set $H_{sec}$ are calculated. According to the rule of “facial height being divided into approximately three equal parts and facial width being divided into approximately five equal parts”, the ear is located roughly between the upper ⅓ part and the lower ⅓ part, taking the center line of the head height as reference; for a child, the center line is moved down to the lower ⅓ part of the head height. The left and right positions of the ear are at about ⅕ of the head width leftwards and rightwards. Considering that the outward tilt of the head generally does not exceed 45 degrees, the method extends the left and right positions of the ear outwards by one ⅕ part each (to cover special cases, they can be extended outwards by three ⅕ parts), so as to measure the left-right distance range of the ear. Therefore, an embodiment of the invention sets the ear candidate region to the range $[- \frac{2}{5} H_{h}, \frac{2}{5} H_{h}]$, which is obtained from the range $[- \frac{2}{5}, \frac{2}{5}]$ of the head width, measured from the left and right boundary lines, by replacing the head width with the head height $H_{h}$. According to the head regions in the set $H_{sec}$, the corresponding ear regions $[- \frac{2}{5} H_{h}, \frac{2}{5} H_{h}]$ can be calculated. The intersection of a segmentation target set and the ear regions $[- \frac{2}{5} H_{h}, \frac{2}{5} H_{h}]$ is taken as an ear candidate set $S_{e}$. All the candidate boxes in the set $S_{e}$ are ranked by the scores of the local model for ear from high to low to obtain an ear candidate target set $S_{ec}$. Then, the pairwise model 2 is applied to obtain a candidate box score set, and the intersection of the $C_{e}$ ear candidate boxes contained in the candidate box score set and the ear-containing candidate boxes in the head candidate target set $H_{sec}$ is obtained. If the intersection is not empty, it is taken as an ear detection target set $C_{ec}$; if the intersection is empty, the ear candidate boxes with higher scores in the set $C_{e}$ are selected as the ear detection target set $C_{ec}$.
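One way to turn the ear candidate range into rectangles is sketched below. The interpretation is an assumption: the horizontal range of ±⅖ of the head height is applied around each of the left and right boundary lines of the head box, and the vertical extent is taken as the middle third of the head height. The function name and the (y1, x1, y2, x2) box layout are likewise illustrative.

```python
def ear_regions(head_box, fraction=2.0 / 5.0):
    """Left and right ear search regions for a head box (y1, x1, y2, x2).

    Each region spans +/- fraction * H_h around a vertical boundary line of the
    head box and covers the middle third of the head height (assumed reading).
    """
    y1, x1, y2, x2 = head_box
    head_height = y2 - y1
    half_width = fraction * head_height
    top = y1 + head_height / 3.0
    bottom = y2 - head_height / 3.0
    left_region = (top, x1 - half_width, bottom, x1 + half_width)
    right_region = (top, x2 - half_width, bottom, x2 + half_width)
    return left_region, right_region
```

Candidate boxes from the segmentation target set that overlap either rectangle would then form the ear candidate set; for a child, the vertical band would be shifted downwards as described above.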

(44) (5) A joint judgement is performed on the set $C_{ec}$ and the set $S_{ec}$ to obtain their intersection. When the intersection is not empty, the ear candidate box with the largest score obtained by the pairwise model 2 in the intersection is taken as a resultant ear object; when the intersection is empty, the ear candidate box with the largest score in the set $C_{ec}$ is selected as the resultant ear object.

(45) (6) Curve evolution of the ear outer contour is performed. In particular, the rectangular candidate box of the ear (i.e., the ear candidate box corresponding to the resultant ear object) is taken as an initial boundary, and the curve evolution is performed through a Chan-Vese (C-V) method on the image in a region twice as large as the ear candidate box corresponding to the resultant ear object, thereby obtaining a curve contour of the ear. The ear contour pixel coordinate set is $P_{c}=\{P_{i,j} \mid i, j \in N\}$. The coordinates $(i, j)$ of the uppermost, lowermost, leftmost and rightmost pixels in the contour curve of the ear are extracted, and a rectangular box is then redrawn according to the extracted coordinates as a resultant ear object region.
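The two geometric operations around the curve evolution might be sketched as follows: enlarging the ear candidate box before evolution, and redrawing the rectangle from the extreme contour pixels afterwards. The Chan-Vese evolution itself is not shown; the binary contour mask is assumed to come from it. The function names and the reading of “twice as large” as doubling each side about the box center are assumptions.

```python
import numpy as np

def expanded_region(box, image_shape, scale=2.0):
    """Enlarge a (y1, x1, y2, x2) box about its center and clip it to the image."""
    y1, x1, y2, x2 = box
    cy, cx = (y1 + y2) / 2.0, (x1 + x2) / 2.0
    half_h, half_w = scale * (y2 - y1) / 2.0, scale * (x2 - x1) / 2.0
    top = max(0, int(round(cy - half_h)))
    left = max(0, int(round(cx - half_w)))
    bottom = min(image_shape[0] - 1, int(round(cy + half_h)))
    right = min(image_shape[1] - 1, int(round(cx + half_w)))
    return top, left, bottom, right

def box_from_contour(mask):
    """Bounding box (y1, x1, y2, x2) of the non-zero pixels of a contour mask.

    Returns None when the mask contains no contour pixels.
    """
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None
    # Uppermost, leftmost, lowermost and rightmost contour pixels.
    return int(rows.min()), int(cols.min()), int(rows.max()), int(cols.max())
```

If the mask is computed on the cropped region rather than on the full image, the offsets of the crop have to be added back to the returned coordinates.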

Cpc classification

PHYSICS: G06V10/82; G06V10/44; G06N3/082; G06N3/0464; G06V10/774; G06N3/096; G06N3/084; G06V10/762; G06V40/172; G06V10/26; G06N3/045; G06V40/171; G06N3/09

International classification

PHYSICS: G06V40/16; G06V10/82; G06N3/08; G06V10/762; G06V10/26; G06V10/44; G06V10/774
