Image processing device, an image processing method, and computer-readable recording medium
10657625 · 2020-05-19
Assignee
Inventors
Cpc classification
G06V10/758
PHYSICS
International classification
G06T3/40
PHYSICS
Abstract
An image processing device according to one exemplary aspect of the present invention includes: a scale space generation means for generating scaled samples from a given input region of interest; a feature extraction means for extracting features from the scaled samples; a likelihood estimation means for deriving an estimated probability distribution of the scaled samples by maximizing the likelihood of a given scaled sample and the parameters of the distribution; a probability distribution learning means for updating the model parameters given the correct distribution of the scaled samples; a template generation means for combining the previous estimates of the object features into a single template which represents the object appearance; an outlier rejection means for removing samples which have a probability below a threshold; and a feature matching means for obtaining the similarity between a given template and a scaled sample and selecting the sample with the maximum similarity as the final output.
Claims
1. An image processing device comprising: a feature extraction unit that extracts features from scaled samples generated from a given region of interest, after normalizing the samples; a maximum likelihood estimation unit that derives an estimated probability score of the scaled samples by maximizing the likelihood of a given scaled sample and a parameter of the probability distribution model; an estimation unit that combines the previous estimates of the object and its features into a single template which represents the object appearance, and that removes samples which have a probability score below a threshold; and a feature matching unit that obtains a similarity between a given template and a scaled sample and selects the sample with the maximum similarity as the final output.
2. The image processing device according to claim 1, further comprising a learning unit that updates the probability distribution model parameters given the distribution of the scaled samples and the template derived from the previous frames.
3. The image processing device according to claim 1, wherein the maximum likelihood estimation unit obtains the probability that a sample is generated by a distribution which is given by the model of the distribution of the features, the model is applied to the newly generated scaled samples, and a score is calculated based on the distance of the samples.
4. The image processing device according to claim 2, wherein the learning unit learns the probability distribution model parameters by one or more series of training samples and a template which are given as true samples and generated from the previous frames.
5. The image processing device according to claim 1, wherein the estimation unit combines the previous estimates of the object and its features into a single template which represents the object appearance.
6. An image processing method comprising: a step (a) of extracting features from scaled samples generated from a given region of interest, after normalizing the samples; a step (b) of deriving an estimated probability distribution score of the scaled samples by maximizing the likelihood of a given scaled sample and a parameter of the probability distribution model; a step (c) of combining the previous estimates of the object and its features into a single template which represents the object appearance; a step (d) of removing samples which have a probability score below a threshold; and a step (e) of obtaining a similarity between a given template and a scaled sample and selecting the sample with the maximum similarity as the final output.
7. The image processing method according to claim 6, further comprising a step (f) of updating the probability distribution model parameters given the distribution of the scaled samples and the template derived from the previous frames.
8. The image processing method according to claim 6, wherein in the step (b), the probability that a sample is generated by a distribution which is given by the model of the distribution of the features is obtained, the model is applied to the newly generated scaled samples, and a score is calculated based on the distance of the samples.
9. The image processing method according to claim 7, wherein in the step (f), the probability distribution model parameters are learnt by one or more series of training samples and a template which are given as true samples and generated from the previous frames.
10. The image processing method according to claim 6, wherein in the step (c), the previous estimates of the object and its features are combined into a single template which represents the object appearance.
11. A non-transitory computer-readable recording medium storing a program that causes a computer to operate as: a feature extraction unit that extracts features from scaled samples generated from a given region of interest, after normalizing the samples; a maximum likelihood estimation unit that derives an estimated probability score of the scaled samples by maximizing the likelihood of a given scaled sample and a parameter of the probability distribution model; an estimation unit that combines the previous estimates of the object and its features into a single template which represents the object appearance, and that removes samples which have a probability score below a threshold; and a feature matching unit that obtains a similarity between a given template and a scaled sample and selects the sample with the maximum similarity as the final output.
12. The non-transitory computer-readable recording medium according to claim 11, wherein the program further causes the computer to operate as: a learning unit that updates the probability distribution model parameters given the distribution of the scaled samples and the template derived from the previous frames.
13. The non-transitory computer-readable recording medium according to claim 11, wherein the maximum likelihood estimation unit obtains the probability that a sample is generated by a distribution which is given by the model of the distribution of the features, the model is applied to the newly generated scaled samples, and a score is calculated based on the distance of the samples.
14. The non-transitory computer-readable recording medium according to claim 12, wherein the learning unit learns the probability distribution model parameters by one or more series of training samples and a template which are given as true samples and generated from the previous frames.
15. The non-transitory computer-readable recording medium according to claim 11, wherein the estimation unit combines the previous estimates of the object and its features into a single template which represents the object appearance.
Description
BRIEF DESCRIPTION OF DRAWINGS
DESCRIPTION OF EMBODIMENTS
(9) To solve the technical problems discussed above, the overall approach is summarized here. The scale estimation process is decoupled from the location estimation process so as to speed up the process of tracking. Given the location of the object in the current frame, a number of scaled samples are generated. The likelihood of these samples is evaluated using the probability distribution model, which is learnt by parameter estimation using previous frames and a template. The template is generated from the features extracted from the object in previous frames, combined so as to represent the appearance of the object. Using the likelihood model, the outliers which have a probability below the threshold can be removed. Next, using feature matching, the scores of the samples are obtained and the sample with the highest score is selected as the output.
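By way of a non-limiting illustration, the decoupled scale-estimation loop summarized above may be sketched as follows. All function and variable names here are hypothetical, and the simple resampled-intensity feature is a stand-in for the features described later; this is an assumed sketch, not the claimed implementation.

```python
import numpy as np

def extract_features(patch, size=(8, 8)):
    # Stand-in feature extractor: resample the patch to a fixed grid and
    # L2-normalize it; the embodiment uses features such as HOG or LBP.
    rows = np.linspace(0, patch.shape[0] - 1, size[0]).astype(int)
    cols = np.linspace(0, patch.shape[1] - 1, size[1]).astype(int)
    v = patch[np.ix_(rows, cols)].astype(float).ravel()
    return v / (np.linalg.norm(v) + 1e-9)

def estimate_scale(frame, cx, cy, prev_scale, template, mu, cov_inv, thr):
    # One evaluation step: generate scaled samples around (cx, cy),
    # reject outliers by the likelihood of the difference to the
    # template, and pick the best-matching surviving sample.
    factors = np.array([0.9, 0.95, 1.0, 1.05, 1.1])
    feats = []
    for f in factors:
        half = max(2, int(prev_scale * f / 2))
        patch = frame[cy - half:cy + half, cx - half:cx + half]
        feats.append(extract_features(patch))
    feats = np.array(feats)
    diff = (feats - template) - mu          # intrapersonal differences
    # Log-likelihood up to a constant (negative Mahalanobis distance).
    ll = -0.5 * np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    kept = np.where(ll >= thr)[0]
    if kept.size == 0:
        kept = np.arange(len(factors))
    # Histogram-intersection matching score against the template.
    scores = np.minimum(feats[kept], template).sum(axis=1)
    return prev_scale * factors[kept[np.argmax(scores)]]
```

Note that this sketch applies the threshold `thr` to a log-likelihood rather than to the likelihood itself, which is merely an implementation convenience; an empirically chosen threshold would be selected accordingly.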
(10) According to the present invention, the scale of a tracked object can be estimated accurately and in real time.
(11) Another advantageous effect of the present invention is that there is no assumption on the relationship between the output score and scaled samples, unlike NPL 1 which assumes that the scores calculated by the filter are symmetric with respect to the scaled samples.
(12) An additional advantageous effect of the present invention is that the model parameter updating involves fixed sized vectors and matrices unlike in NPL 2 where the number of support vectors can increase after every frame.
(13) An additional advantageous effect of the present invention is that there is no need to calculate the projection matrix and hence no need for knowing the calibration information.
(14) Another advantageous effect of the present invention is that illumination change does not affect the scale estimation since all the calculation involves features which are invariant to illumination changes.
First Exemplary Embodiment
(15) Hereinafter, a first exemplary embodiment of the present invention will be described in detail.
(17) The input unit 101 receives a series of frames, i.e. images (for example, frames of a video, still images or the like), in the tracking phase. The input unit 101 may receive a series of frames, i.e. training frames, in the learning phase or before the learning phase. In the following description, the frames and a frame in the frames may be referred to as images and an image respectively. The training frames and a training frame in the training frames are referred to as training images and a training image respectively.
(18) The object tracking unit 102 tracks a region of an object, such as a face or another object which may include several parts, in the frames. In the following explanation, the object tracking unit 102 tracks a region of a face in the frame. It provides the location of the face in the frame, i.e. the x and y coordinates.
(19) The feature extraction unit 103 extracts features from the region of interest provided to it. Using the location provided by the object tracking unit 102, scaled samples are generated. These samples are then normalized to lie in the same coordinate system; the coordinates are defined in a coordinate system set in advance in the frames. Finally, the features are extracted from these samples. These features can be a combination of edge, texture, color and/or temporal information from the samples.
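As a purely illustrative sketch, the generation and normalization of scaled samples might look like the following, assuming a NumPy-style array frame; the function name, the fixed output size, and the nearest-neighbour resampling are hypothetical choices, not part of the embodiment.

```python
import numpy as np

def generate_scaled_samples(frame, cx, cy, base_size,
                            factors=(0.9, 1.0, 1.1), out_size=16):
    # Crop patches of several scales around (cx, cy) and resample each to
    # one fixed size, so that later feature extraction operates in a
    # common coordinate system.  Nearest-neighbour resampling for brevity.
    out = []
    for f in factors:
        half = max(1, int(round(base_size * f / 2)))
        y0, y1 = max(0, cy - half), min(frame.shape[0], cy + half)
        x0, x1 = max(0, cx - half), min(frame.shape[1], cx + half)
        patch = frame[y0:y1, x0:x1]
        rows = np.linspace(0, patch.shape[0] - 1, out_size).astype(int)
        cols = np.linspace(0, patch.shape[1] - 1, out_size).astype(int)
        out.append(patch[np.ix_(rows, cols)])
    return np.stack(out)  # shape: (len(factors), out_size, out_size)
```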
(20) The learning unit 104 learns the model by one or more series of training frames. More specifically, the learning unit 104 learns the model which will be used for calculating the likelihood of future samples, by features extracted from training frames. The learning unit 104 may calculate the mean vector and the covariance matrix from the features of the samples as part of the parameter learning for the model.
(21) The model essentially captures the distribution of the features of the scaled samples. More specifically, it captures the likelihood of a sample given the intra-class variation. The intra-class variations are the differences between the features of the same object, whereas the inter-class variations, which are caused by features of other objects, are assumed to be outliers since the object tracking unit 102 has already given the location. The model storage unit 105 stores the model's parameters, which are used to evaluate the model on any input sample.
(22) The maximum likelihood estimation unit 106 derives the probability of a scale sample using the model parameters stored in the model storage unit 105. The probability is used to eliminate the outliers by thresholding. This procedure eliminates the scale samples that are not consistent with the appearance of the object as represented by the features.
(23) The samples which are passed by the maximum likelihood estimation unit 106 are the input of the feature matching unit 107. In this unit, each of the features of the samples is directly matched and their similarity is calculated. The feature matching unit 107 may use, for example, a histogram intersection kernel or a Gaussian kernel to calculate the similarity score of the samples.
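The two kernels mentioned above can be written down directly. The following is a minimal sketch, assuming histogram-like feature vectors; the function names and the bandwidth parameter gamma of the Gaussian kernel are illustrative assumptions.

```python
import numpy as np

def intersection_score(template, feature):
    # Histogram intersection kernel: sum of element-wise minima.
    return float(np.minimum(template, feature).sum())

def gaussian_score(template, feature, gamma=1.0):
    # Gaussian (RBF) kernel on the squared feature difference; an
    # identical template and feature yields the maximum score 1.0.
    return float(np.exp(-gamma * np.sum((template - feature) ** 2)))
```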
(24) The estimation unit 108 selects the sample with the highest score as the estimated scale output. The features of the object at this scale are then combined linearly with the estimate from the previous frames, and this forms the template. The template is stored in the template storage unit 109.
(25) The output unit 110 outputs the final output state of the object, i.e. the position and the scale. The output unit 110 may plot predetermined marks on the frame at the positions represented by the x, y coordinates and the scale (width, height) of the object, and output the frame with the plotted marks.
(26) Next, an operation of the image processing device 100 according to the first exemplary embodiment will be explained in detail with reference to drawings.
(28) The operation of the image processing device 100 according to the first exemplary embodiment of the present invention can be broadly divided into a training phase and an evaluation phase. In this paragraph, an overview of the invention will be described with reference to the drawings.
(29) The estimation processing will be explained in detail later with reference to the drawings.
(30) Next, the output unit 110 outputs the estimated scale, i.e. the final output described above (Step S107). When processing of the image processing device 100 is not finished (NO in Step S108), the input unit 101 receives a next frame (Step S101). When processing of the image processing device 100 is finished by an instruction from a user of the image processing device 100 via an input device (not illustrated) (YES in Step S108), the image processing device 100 stops the processing.
(31) Next, an operation of the image processing device 100 according to the first exemplary embodiment in the training phase will be described in detail with reference to drawings.
(33) As described above, the models need to be learnt. So, before the scale estimation can be applied, a training phase is necessary, in which the models of the first exemplary embodiment are learnt. Given the frame and the object location, scaled samples are generated in Step S201. These samples are extracted around the region given by the object location and the scale of the previous frame. Next, features are extracted from these samples (Step S202). Extracted features refer to features such as HOG (Histogram of Oriented Gradients), LBP (Local Binary Patterns), normalized gradients, etc. In Step S203 we check whether the template already exists, i.e. whether we are in the first frame or not. If the template does not exist (NO in Step S203), it means we are in the first frame and we need to create the template (Step S204). The template is the features extracted from the current location and scale given by the tracker. Using the template and the features of the samples, we can update the model parameters (Step S205). This is done in the following way:
(34)
x̄=(1/N)Σ.sub.i=1.sup.Nx.sub.i [Math. 1]
(35) In the equation shown in Math 1, x bar is the mean or average of the samples. It is one of the parameters of the multivariate Gaussian distribution that is used in modeling. The x.sub.i is the vector of features of the i.sup.th sample and N is the total number of scaled samples.
(36)
Σ=(1/N)Σ.sub.i=1.sup.N(x.sub.i-x̄)(x.sub.i-x̄).sup.T [Math. 2]
(37) In the equation shown in Math 2, Σ (sigma) is the covariance matrix and T denotes the vector transpose. Using these two equations we can update the model parameters. Also, in case there is already a template, i.e. YES in Step S203, we need to update the template (Step S206) by linear interpolation using the following equation:
I.sub.i=αI.sub.i+(1-α)I.sub.i-1 [Math. 3]
(38) Here, in the equation shown in Math 3, I.sub.i is the template from the current frame and I.sub.i-1 is the template from the previous frame. The factor α (alpha) is a decay factor which is chosen experimentally.
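A minimal NumPy rendering of the parameter update of Math 1 and Math 2 and the template interpolation of Math 3 might read as follows; the function names and the default alpha are illustrative assumptions rather than values from the embodiment.

```python
import numpy as np

def update_model(features):
    # Math 1 and Math 2: mean vector and (biased, 1/N) covariance matrix
    # of the feature vectors of the N scaled samples (rows of `features`).
    mean = features.mean(axis=0)
    centered = features - mean
    cov = centered.T @ centered / features.shape[0]
    return mean, cov

def update_template(current, previous, alpha=0.1):
    # Math 3: linear interpolation of the current-frame template with the
    # previous one, using an experimentally chosen decay factor alpha.
    return alpha * current + (1.0 - alpha) * previous
```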
(39) Next, we store the model parameters in the model storage unit 105 (Step S207).
(41) Next, the evaluation phase, which consists of the estimation processing step, is explained. The estimation processing of the image processing device 100 will be described with reference to the drawings.
β=x.sub.i-I.sub.i [Math. 4]
(43) The equation in Math 4 represents a sample difference β (beta) between the template and the i.sup.th scaled sample. This is also known as the intrapersonal difference. According to this equation, a class of variation can be defined, i.e. the intrapersonal variation Ω (omega), as shown in the following equation:
P(β|Ω)=N(μ,Σ) [Math. 5]
(44) In the equation shown in Math 5, the probability P(β|Ω) of the intrapersonal difference, given the intrapersonal variation, is defined as a multivariate normal distribution N(μ,Σ). The parameters of this distribution are given by μ (mu) and Σ (sigma). The likelihood of observing β is:
(45)
P(β|Ω)=(1/((2π).sup.d/2|Σ|.sup.1/2))exp(-(1/2)(β-μ).sup.TΣ.sup.-1(β-μ)) [Math. 6]
(46) In the equation shown in Math 6, d is the dimension of the feature vectors. Using this equation we can get the likelihood of a sample. In the next step we check whether any likelihoods are below the threshold, i.e. Step S304. If there are outliers (YES in Step S304), we reject them and remove those samples from further processing in Step S305; the threshold used in this step is selected empirically. If there are no outliers (NO in Step S304), then we choose the sample with the maximum likelihood as the output scale, i.e. Step S306, and end the estimation processing.
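The likelihood evaluation of Math 6 and the outlier rejection of Steps S304 and S305 can be sketched as follows; this is an assumed illustration in which the threshold value is a hypothetical input rather than the empirically selected one of the embodiment.

```python
import numpy as np

def gaussian_likelihood(beta, mu, cov):
    # Math 6: likelihood of an intrapersonal difference beta under the
    # multivariate normal N(mu, cov) of dimension d.
    d = beta.shape[0]
    diff = beta - mu
    expo = -0.5 * diff @ np.linalg.solve(cov, diff)
    norm = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(cov))
    return np.exp(expo) / norm

def reject_outliers(betas, mu, cov, threshold):
    # Steps S304/S305: keep only the samples whose likelihood clears
    # the threshold; return the kept indices and all likelihoods.
    lik = np.array([gaussian_likelihood(b, mu, cov) for b in betas])
    return np.where(lik >= threshold)[0], lik
```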
(47) Next, in Step S307, feature matching is done between the features of the template and the samples. The matching can be done using kernel methods such as the intersection kernel, the Gaussian kernel, the polynomial kernel, etc.
(48)
s=Σ.sub.T=1.sup.d min(I.sub.T,x.sub.T) [Math. 7]
(49) The equation in Math 7 gives the matching score s between the template I and the feature x. Here, d is the dimension length of the features and T is the dimension index. In the next step, Step S308, we select the output as the sample with the maximum score.
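Assuming the intersection-kernel form of the matching score of Math 7, the scoring and maximum-score selection of Steps S307 and S308 could be sketched as follows (the function name is illustrative):

```python
import numpy as np

def select_best(template, samples):
    # Math 7 with the intersection kernel: score each surviving sample
    # (rows of `samples`) against the template by summing element-wise
    # minima, then return the index of the maximum score (Step S308).
    scores = np.minimum(samples, template).sum(axis=1)
    return int(np.argmax(scores)), scores
```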
(50) The first advantageous effect of the present exemplary embodiment is that the scale of the object can be estimated accurately and in real time.
(51) Other advantageous effects of the present exemplary embodiment will be explained in the following. An advantage of the present exemplary embodiment is that there is no need to calculate the projection matrix or to use the 3D coordinates of a known object as in PTL 1. Also, there is no effect of illumination change, since there is no need to calculate the contrast-to-variance ratio as in PTL 2. Secondly, heavy optimization techniques such as latent support vector machines are not needed, and hence real-time operation is easily possible. Moreover, rigid and non-rigid shapes can be easily tracked. Furthermore, exemplars for changes in shape, pose and parts are not needed.
Second Exemplary Embodiment
(52) Next, a second exemplary embodiment of the present invention will be described in detail with reference to drawings.
(54) The second exemplary embodiment has the same advantageous effect as the first advantageous effect of the first exemplary embodiment. The reason for this advantageous effect is the same as that of the first advantageous effect of the first exemplary embodiment.
Other Exemplary Embodiment
(55) Each of the image processing device 100 and the image processing device 100A can be implemented using a computer and a program controlling the computer, dedicated hardware, or a combination of a computer with a program controlling the computer and dedicated hardware.
(57) The processor 1001 loads the program, which causes the computer 1000 to operate as the image processing device 100 or the image processing device 100A, from the storage medium 1005 into the memory 1002. The processor 1001 operates as the image processing device 100 or the image processing device 100A by executing the program loaded in the memory 1002.
(58) The input unit 101, the object tracking unit 102, the feature extraction unit 103, the learning unit 104, the maximum likelihood estimation unit 106, the feature matching unit 107, the estimation unit 108 and the output unit 110 can be realized by a dedicated program, loaded into the memory 1002 from the storage medium 1005, that realizes each of the above-described units, and by the processor 1001 which executes the dedicated program. The model storage unit 105 and the template storage unit 109 can be realized by the memory 1002 and/or a storage device such as a hard disk device. A part or all of the input unit 101, the object tracking unit 102, the feature extraction unit 103, the learning unit 104, the model storage unit 105, the maximum likelihood estimation unit 106, the feature matching unit 107, the estimation unit 108, the template storage unit 109 and the output unit 110 can be realized by a dedicated circuit that realizes the functions of the above-described units.
(59) As a final point, it should be clear that the processes, techniques and methodology described and illustrated here are not limited or related to a particular apparatus; they can be implemented using a combination of components. Also, various types of general-purpose devices may be used in accordance with the instructions herein. The present invention has also been described using a particular set of examples; however, these are merely illustrative and not restrictive. For example, the described software may be implemented in a wide variety of languages such as C++, Java, Python, Perl, etc. Moreover, other implementations of the inventive technology will be apparent to those skilled in the art.
(60) While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, the invention is not limited to these embodiments. It will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present invention as defined by the claims.
REFERENCE SIGNS LIST
(61)
100 image processing device
100A image processing device
101 input unit
102 object tracking unit
103 feature extraction unit
104 learning unit
105 model storage unit
106 maximum likelihood estimation unit
107 feature matching unit
108 estimation unit
109 template storage unit
110 output unit
1000 computer
1001 processor
1002 memory
1003 storage device
1004 interface
1005 storage medium
1006 bus