Method for setting a tridimensional shape detection classifier and method for tridimensional shape detection using said shape detection classifier
09805256 · 2017-10-31
Assignee
Inventors
- Marcel Alcoverro Vidal (Barcelona, ES)
- Adolfo Lopez Mendez (Granollers, ES)
- Xavier Suau Cuadros (Cervera, ES)
CPC classification
G06F3/017
PHYSICS
G06V40/28
PHYSICS
International classification
Abstract
Method for setting a tridimensional shape detection classifier for detecting tridimensional shapes from depth images in which each pixel represents a depth distance from a source to a scene, the classifier comprising a forest of at least one binary tree (T) for obtaining the class probability (p) of a given shape and comprising nodes associated with a distance function (f) that, taking at least a pixel position in a patch, calculates a pixel distance. The method comprises, for each leaf node (L) of the binary tree, the configuration steps of creating candidate groups of parameters; obtaining positive patches (Ip) containing part of the shape to be detected; obtaining negative patches (In) not containing part of the shape to be detected; calculating in the leaf node the distance function of the obtained positive and negative patches, comparing the result of the distance function with its pixel distance threshold and computing its statistics; and selecting for the leaf node the candidate group of parameters that best separates the positive and negative patches into two groups for calculating the class probability of the shape in that leaf node using the distance function. Also disclosed are a method for shape detection from a depth image using the shape detection classifier; a data processing apparatus comprising means for carrying out the methods; and a computer program adapted to perform the methods.
Claims
1. A method for setting a tridimensional shape detection classifier for detecting tridimensional shapes from depth images in which each pixel represents a depth distance from a source to a scene, each depth image being dividable into one or more patches (I.sub.p, I.sub.n) of given dimensions, the classifier comprising a forest of at least one binary tree for obtaining the class probability of a given shape comprising nodes associated with a distance function (f) that, taking at least a pixel position in a patch, calculates a pixel distance, the method comprising: obtaining one or more positive patches (I.sub.p) of the given dimensions from a depth image, said one or more positive patches containing the shape to be detected, and obtaining one or more negative patches (I.sub.n) of the given dimensions from a depth image, said one or more negative patches not containing the shape to be detected; for each obtained positive or negative patch, traversing the binary trees starting from their root node, using the distance function at each node to decide to continue to one of the child nodes of the next level until a leaf node of each binary tree is reached; calculating in each reached leaf node the distance function for the patch using candidate groups of parameters, each candidate group of parameters comprising: at least a pixel position (u, v) in a patch, a depth clipping window (k) in the patch and a pixel distance threshold (θ); comparing the result of the distance function using the at least a pixel position and the depth clipping window with the pixel distance threshold of each candidate group, and computing its statistics; and when more than a predefined number of positive or negative patches are applied to a leaf node of the classifier: selecting for that leaf node the candidate group of parameters that best separates the positive and negative patches into two groups for calculating the class probability of the shape in that leaf node using the
distance function; creating a new level of the binary tree from that leaf node comprising two newly created leaf nodes, that leaf node thus becoming a node, and passing the statistics from that leaf node that has become a node to the newly created leaf nodes.
2. The method according to claim 1, wherein each candidate group of parameters comprises at least a random pixel position (u, v) in a patch, a random depth clipping window (k) in the patch and a random pixel distance threshold (θ).
3. The method according to claim 1, wherein at least part of the one or more positive patches and one or more negative patches are obtained from the same depth image.
4. The method according to claim 3, wherein the obtained one or more negative patches are the ones in each depth image with highest positive class probability according to the statistics at the reached leaf nodes of the binary trees.
5. The method according to claim 1, wherein the distance function calculates the relation between the depths represented by two pixels of the patch located at the random pixel positions (u, v) from the center (x) pixel of the patch, normalized with the depth evaluated at the center pixel, each depth being upper and lower limited by the random depth distance clipping window centered at the value of depth represented by the center pixel of the patch, as per the formula:
f(I,x)=C.sub.k(d.sub.I(x+u/d.sub.I(x)))−C.sub.k(d.sub.I(x+v/d.sub.I(x))), with C.sub.k(d)=max(d.sub.I(x)−k, min(d.sub.I(x)+k, d)).
6. A method for shape detection from a depth image using the shape detection classifier of claim 1, comprising the steps of: dividing the depth image into patches of given dimensions; traversing the binary trees by applying the distance function at each visited node using the associated pixel displacement (u,v) and maximum pixel distance (k) and comparing the result of the distance function with the associated pixel distance threshold (θ) and; obtaining the probability (p) of a specific shape leveraging on the statistics of the leaf nodes reached in each tree.
7. The method according to claim 6, further comprising: averaging the probabilities of a specific shape obtained from the different leaf nodes of binary trees of the forest according to:
p(c|I,x)=(1/M)Σ.sub.m=1.sup.M p.sub.l.sub.m(c|I,x).
8. The method according to claim 6, further comprising the steps of: casting a vote for a target shape per each patch whenever the probability of said shape is higher than the probability of not being said shape or being another shape, and estimating a probability density for the target shape using said votes.
9. The method according to claim 6, further comprising taking into account temporal consistency by recursively updating the probability density with votes aggregated from past time instants as follows:
p′(c|I.sub.t)=αp(c|I.sub.t)+(1−α)p′(c|I.sub.t-1).
10. The method according to claim 6, further comprising detecting the pixel location of the target shape as the pixel with maximum probability.
11. The method according to claim 6, wherein the probability is thresholded by locally integrating the probability measure through a circular surface of radius inversely proportional to the depth and centered at the global maximum.
12. The method according to claim 6, wherein the shape is a hand gesture.
13. A data processing apparatus comprising means for carrying out the method of claim 1.
14. A computer program stored on a non-transitory computer-readable medium and adapted to perform the method of claim 1.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Variants of the method for setting a tridimensional shape detection classifier and method for tridimensional shape detection using said shape detection classifier are illustrated by way of non-limiting example in the attached drawings. Specifically:
DETAILED DESCRIPTION OF THE DRAWINGS
(14) It can be seen that in the scene shown in the depth image of
(15) The classifier of the present invention comprises a forest having at least one binary tree used for obtaining the class probability of the shape. The nodes of the binary trees use a distance function that, taking at least a pixel position in a patch, calculates a pixel distance. The distance function is also known as a binary test or binary function.
(16) In the depicted embodiment, as can be seen in
(18) The distance function is calculated for every patch of the depth image using groups of random parameters, also known as random tests, as explained below.
(19) The method for setting the shape detection classifier of the present invention comprises creating candidate groups of parameters, each group comprising at least a random pixel position u, v in a patch, a random depth clipping window k in the patch and a random pixel distance threshold θ that will be used for classifying the result of the distance function in a given node. Between 100 and 5000 groups of parameters are advisable; 200 is a usual default value that yields useful detection in the average case. If the differences between the gesture and the background or negative examples are subtle, more groups of parameters are advisable.
(20) Obviously, the random pixel position u, v has to be a pixel within the patch; so, for a squared patch and a pixel position u, v expressed as a vertical or horizontal distance from the center pixel of the patch, the random values of the pixel position u, v have to point to a pixel within the patch. Although a vertical or horizontal distance from the center pixel of the patch can be used, other positions, relative or absolute, using Cartesian or polar coordinates, can also be used as long as the pointed pixels are the same in different patches.
(21) Also, the random depth clipping window k has to be between 0 and the maximum distance of the depth image. Typically, for a depth image stored as a grayscale image with 8 bits per pixel, k defines a total window between 0 and 256 depth levels.
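The generation of candidate groups described above can be sketched as follows; the function name, patch size and depth range are illustrative assumptions, not the patented implementation.

```python
import random

def make_candidate_groups(n_groups=200, patch_half=16, max_depth=255):
    """Each candidate group holds two random pixel offsets (u, v) inside
    the patch, a random depth clipping window k and a random pixel
    distance threshold theta (ranges assumed for illustration)."""
    groups = []
    for _ in range(n_groups):
        u = (random.randint(-patch_half, patch_half),
             random.randint(-patch_half, patch_half))
        v = (random.randint(-patch_half, patch_half),
             random.randint(-patch_half, patch_half))
        k = random.uniform(0, max_depth)               # clipping window
        theta = random.uniform(-max_depth, max_depth)  # comparison threshold
        groups.append({"u": u, "v": v, "k": k, "theta": theta})
    return groups
```

The default of 200 groups matches the default suggested in the text; subtler gestures would call for a larger `n_groups`.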
(22) As shown in
(23) In the embodiment depicted in
(24) For setting the classifier, one or more positive patches Ip containing the shape, or part of the shape, to be detected and one or more negative patches In not containing the shape to be detected must be obtained and processed in each node using the distance function with each group of random parameters.
(25) It is advisable that the one or more negative patches are the ones with highest positive class probability according to the statistics at the reached leaf nodes of the binary trees.
(26) Therefore, as shown in
(27) Following this procedure, it can be seen that some candidate groups will fail to separate the positive and negative patches, thus not being appropriate, as no information can be extracted from them.
(28) A rule for deciding that the positive and negative patches are correctly separated could be that at least 50% of the positive patches and at least 50% of the negative patches are correctly grouped and separated. Alternatively, the candidate group that maximizes the information gain in said node can be selected as the one that separates the positive and negative patches. As shown in
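The information gain criterion mentioned above can be sketched with the usual Shannon-entropy formulation (the exact impurity measure is an assumption; the text only names "information gain"):

```python
import math

def entropy(counts):
    """Shannon entropy of a list of per-class counts, in bits."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(left, right):
    """Gain of splitting the parent's (positive, negative) counts
    into a left group and a right group by one candidate test."""
    parent = [l + r for l, r in zip(left, right)]
    n, nl, nr = sum(parent), sum(left), sum(right)
    return entropy(parent) - (nl / n) * entropy(left) - (nr / n) * entropy(right)
```

A candidate group that perfectly separates 5 positive and 5 negative patches, e.g. `information_gain([5, 0], [0, 5])`, scores the maximum gain of 1 bit, while a group that leaves both sides mixed scores near zero.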
(29) When more than a predefined number of positive or negative patches have been applied to a leaf node of the shape detection classifier, namely between 100 and 2000, preferably 300, and also preferably when the information gain in said node reaches an information threshold between 0 and 0.1, a new level of the binary tree is created from the leaf node for further processing of one or more positive patches Ip and negative patches In, which are separated into the two groups of patches formed at the parent node, in each respective child node as shown in
(30) Also, the statistics from the parent node are passed to each respective child node, and the configuration steps previously detailed for the root node are applied, obtaining a candidate group of parameters that best separates the positive and negative patches into two groups in each child node.
(31) Therefore, all candidate groups of random parameters (u.sub.0 v.sub.0 k.sub.0 θ.sub.0); (u.sub.1 v.sub.1 k.sub.1 θ.sub.1) and (u.sub.2 v.sub.2 k.sub.2 θ.sub.2) can be tested in a child node. However, no more information can be obtained from the group of random parameters already used in any parent node, so it can be skipped. It is also possible to create new candidate groups of random parameters for each leaf node. In fact, it is preferable as
(32) This step can be reproduced again, recursively, creating new levels of the binary trees, as shown in
(35) As shown in
(36) p(c|I,x)=(1/M)Σ.sub.m=1.sup.M p.sub.l.sub.m(c|I,x)
(37) Where m=1 . . . M represents the different leaf nodes of the binary trees.
(38) Once the probability of a specific shape is obtained from the classifier, so that it can be determined whether or not a patch contains the hand gesture, for example if p>0.5, it is advisable to evaluate the patches surrounding said patch. This can be done by casting a vote for a target shape for each patch whenever the probability of said shape is higher than the probability of not being said shape or of being another shape, and estimating a probability density for the target shape using said votes. This way, if more than one patch covers the gesture, the result can be verified by evaluating the votes of the neighboring patches with p>0.5.
(39) Also, temporal consistency can be taken into account by recursively updating the probability density with votes aggregated from past time instants as follows:
p′(c|I.sub.t)=αp(c|I.sub.t)+(1−α)p′(c|I.sub.t-1)
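The recursive update above is an exponential smoothing of the probability density; a minimal sketch follows, where the value of the smoothing factor α is an assumption (the text does not fix it):

```python
# Recursive temporal update: p'(c|I_t) = alpha*p(c|I_t) + (1-alpha)*p'(c|I_{t-1}).
# alpha close to 1 trusts the current frame; alpha close to 0 favours history.
def update_density(p_t, p_prev, alpha=0.6):
    return alpha * p_t + (1.0 - alpha) * p_prev
```

Applied element-wise to the per-pixel density maps, this keeps temporal consistency while only one previous probability map needs to be stored.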
(40) It is also advisable to detect the pixel location of the target shape as the pixel with maximum probability, as it may help in tracking the gesture over time.
(41) Also, the probability can be thresholded by locally integrating the probability measure through a circular surface of radius inversely proportional to the depth and centered at the global maximum.
(42) Also, a system to perform the previous method is described. The system has two distinct modes of operation, the Training Mode and the Detection Mode. Depending on the mode in which it operates, distinct processing units are involved, as shown in the system diagram in
(43) In training mode, the system generates a decision forest data structure obtained by learning from the examples captured by the capturing device.
(44) In detection mode, the system generates events for the gestures detected in the depth maps captured by the capturing device.
(45) In the following, the processing units of the system are described in more detail.
(46) Training Mode
(47) The training mode allows the user to train the system to detect new gestures. Hence, the result of the process is a decision forest data structure that would be used to detect the specific gesture by the detection unit introduced above.
(48) The system operations are governed by the Interactive Machine Learning (IML) state machine. Depending on the state defined by the IML state machine, different sorts of information are displayed to the user, and the training unit acts accordingly. The goal is to provide the training unit with annotated positive and negative examples of the gesture, extracted from the images captured by the depth capturing device, in a semi-automatic (or fully-automatic) manner.
(49) An example of a state machine realization is described by the following set of states, with transitions described in
(50) This sequence of states can be executed for several iterations to capture more variability of the gesture, such as distinct contexts, backgrounds, distances or slightly modified poses. Also, more negative examples can be added at each iteration, improving the generalization of the detector. At the end of a training session, the application launches a testing mode that allows the user to check the performance of the current detector and identify its main failures. The random forest can also be stored on disk, such that the application can be stopped, the user can change the settings or the scenario, and, for example, another user can continue the training. The stored random forest is loaded and can be updated again with new data, so the trees continue growing from the point where they were left in the previous iteration.
(51) Training Unit. The training unit is based on Random Forests online learning. The procedure to train a classifier for a new gesture consists of growing the decision trees upon arrival of new samples. A tree starts with only one root node with a set of binary tests whose variables u, v, k and θ have been randomly selected. Two conditions should hold before splitting a node: 1) a minimum number of samples has already been seen by the node; 2) the split achieves a minimum information gain.
(52) When these conditions are fulfilled, the test that produces the greatest information gain is chosen. This process is applied to the right and left newly generated leaf nodes, and so on until the tree has grown to the required depth. At each split, the statistics for each class label of the parent node are propagated to the children such that leaf nodes can perform classification on the fly, even before observing new samples.
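The online splitting rule can be sketched as follows; the class and method names, as well as the default limits (300 samples, 0.05 gain), are illustrative assumptions drawn from the ranges given earlier in the text:

```python
# A leaf accumulates per-class statistics and splits only once it has seen
# enough samples AND some candidate test achieves a minimum information gain.
class Leaf:
    def __init__(self, stats=None):
        self.stats = stats or {"pos": 0, "neg": 0}  # may be inherited

    def update(self, label):
        self.stats[label] += 1

    def should_split(self, best_gain, min_samples=300, min_gain=0.05):
        seen = self.stats["pos"] + self.stats["neg"]
        return seen >= min_samples and best_gain >= min_gain

    def split(self):
        # Children start from a copy of the parent's statistics, so they
        # can classify on the fly before observing any new samples.
        return Leaf(dict(self.stats)), Leaf(dict(self.stats))
```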
(53) Hard negative mining using on-the-fly detection. The set of negative samples used for training is highly relevant to the performance of a detector. From the annotated images, collecting all the patches of the whole image is not practical, so methods rely on randomly sampling patches, as it is not clear in advance which patches from these images would be most useful as training samples.
(54) In this invention, the method collects the negative patches from the training images using the prediction of the online forest during the training phase. In this manner, the training process is done in a single iteration and the set of examples used to update the trees is reduced, so that redundant or non-informative patches are not used. The procedure is applied during training for each negative training image I.sub.neg as follows: 1. The probability of the positive class c for each pixel x in I.sub.neg, p(c|x), is computed on the fly using the statistics collected at the current leaf nodes. 2. A pseudo-probability value for each pixel is computed using a Parzen estimator with a Gaussian kernel. Then the location with maximal probability m.sub.c is obtained. We denote by maxp the probability at m.sub.c. 3. A set of N.sub.neg patches is collected within a neighborhood centered at m.sub.c. The number of patches collected is proportional to maxp, so that the worse the failure of the detector, the more negative samples producing the failure are used for training.
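The three steps above can be sketched as follows; the function name, the neighbourhood radius and the proportionality constant are illustrative assumptions, and the Parzen smoothing of step 2 is reduced here to taking the raw per-pixel probabilities:

```python
import random

def collect_hard_negatives(prob_map, n_max=50, radius=10):
    """prob_map: per-pixel positive-class probability p(c|x) for a negative
    image, computed on the fly from the current leaf statistics."""
    h, w = len(prob_map), len(prob_map[0])
    # Steps 1-2: locate the maximal (pseudo-)probability m_c and its value.
    m_c, max_p = (0, 0), -1.0
    for y in range(h):
        for x in range(w):
            if prob_map[y][x] > max_p:
                max_p, m_c = prob_map[y][x], (x, y)
    # Step 3: draw a number of patches proportional to max_p -- the worse
    # the detector's failure, the more hard negatives are collected.
    n_neg = int(round(n_max * max_p))
    patches = []
    for _ in range(n_neg):
        dx = random.randint(-radius, radius)
        dy = random.randint(-radius, radius)
        patches.append((min(max(m_c[0] + dx, 0), w - 1),
                        min(max(m_c[1] + dy, 0), h - 1)))
    return m_c, max_p, patches
```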
Depth Filtering Unit
(55) In the system, a depth data processing unit filters the data captured by the sensor in order to correct depth measurement errors. Such process exploits local context and prior knowledge of the sensor inaccuracies.
(56) Motion Detection Unit
(57) The action of performing a gesture implies body movement from the stand-by pose to the gesture pose. In this invention, a motion detection unit determines the regions of the image where objects or people in the scene have moved. This information is then exploited to detect the still gestures within those regions. In this manner the computational effort of the system is reduced, and it has a very low computational load while there is no activity in the scene.
(58) Gesture Detection Unit
(59) Given a set of patches from the depth image, the Gesture Detection unit generates hypotheses of the position of a gesture in the image and its class. In order to generate the hypotheses, the unit relies on a patch classifier and a function that integrates classifier votes at the different pixel locations.
(60) The patch classifier is based on a Random Forest classifier. Random forests are an ensemble of m randomized binary trees. Each node n in each tree has an associated weak classifier that consists of a binary function (from now on, binary test) of feature vectors obtained from the image. Moreover, each node n has a learned probability distribution p(c|I,x) that reflects how likely a class c is given a pixel x in the image I. To obtain the class probability for a patch, every tree is traversed recursively from the root, branching left or right depending on the result of the binary test applied to the patch at each node, until a leaf node is reached.
(61) The robustness of forests is based on the combination of several classification trees. The combination is obtained by averaging the distributions over the leaf nodes {l.sub.1 . . . l.sub.M} reached in all the M trees:
(62) p(c|I,x)=(1/M)Σ.sub.m=1.sup.M p.sub.l.sub.m(c|I,x)
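The traversal and averaging described above can be sketched compactly; the node structure and field names are assumptions for illustration:

```python
# Internal nodes hold a binary test; leaves hold a class distribution.
class TNode:
    def __init__(self, test=None, left=None, right=None, dist=None):
        self.test, self.left, self.right, self.dist = test, left, right, dist

def classify(tree, patch):
    node = tree
    while node.dist is None:                 # descend until a leaf is reached
        node = node.right if node.test(patch) else node.left
    return node.dist                         # p(c|I,x) stored at leaf l_m

def forest_probability(trees, patch):
    """Average the reached leaf distributions over the M trees."""
    dists = [classify(t, patch) for t in trees]
    return [sum(d[c] for d in dists) / len(dists) for c in range(len(dists[0]))]
```

For example, a two-tree forest whose reached leaves hold [0.2, 0.8] and [0.4, 0.6] yields the averaged distribution [0.3, 0.7].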
(63) In this invention, the binary tests rely on the depth value difference between two pixels located within a patch of a given size. In order to avoid background clutter, we impose a local context also in 3D space by introducing the clipping parameter that defines a maximum displacement in the Z axis, i.e., depth values.
(64) Given a depth image I, the binary test for a patch centered at x is the result of the comparison f(I, x)>θ, where f(I, x) is formulated as follows:
(65) f(I,x)=C.sub.k(d.sub.I(x+u/d.sub.I(x)))−C.sub.k(d.sub.I(x+v/d.sub.I(x))), with C.sub.k(d)=max(d.sub.I(x)−k, min(d.sub.I(x)+k, d))
where u and v are pixel displacements that fall within the patch size. Pixel displacements are normalized with the depth evaluated at pixel x in order to make the test features invariant to depth changes. The set of parameters u, v, k and the threshold θ for each of the nodes n are defined during the learning process, which is performed by the learning unit. For each node, the process builds a set of test functions with randomly generated parameters. Then, the optimal test for each node is selected according to an information gain criterion.
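A sketch of this depth-difference binary test follows; the function signature and the image layout (a row-major 2-D array of depth values) are assumptions:

```python
def binary_test(depth, x, u, v, k, theta):
    """Depth-comparison test f(I, x) > theta for a patch centred at x."""
    h, w = len(depth), len(depth[0])
    d_x = depth[x[1]][x[0]]                       # depth at the patch centre
    def clipped_depth(offset):
        # Normalize the displacement by the centre depth (depth invariance),
        # clamping the probe coordinates to the image bounds.
        px = min(max(int(x[0] + offset[0] / d_x), 0), w - 1)
        py = min(max(int(x[1] + offset[1] / d_x), 0), h - 1)
        # Clip the probed depth to a window of size k centred at d_x,
        # which suppresses background clutter far away in Z.
        return min(max(depth[py][px], d_x - k), d_x + k)
    f = clipped_depth(u) - clipped_depth(v)       # f(I, x)
    return f > theta                              # branch right if true
```

Note how a small clipping window k neutralises a distant background pixel: the probed depth is clamped near the centre depth, so the test no longer fires on background contrast.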
Gesture Localization Function
(66) For gesture detection and localization, a set of patches is provided to the detection forest, which casts a vote whenever a positive class has more probability than the negative class and the other positive classes. Requiring the positive scores to be larger than the negative scores yields a sparsified voting across the image (i.e., only a few patches happen to have larger positive scores). To detect a gesture, the function first estimates a probability density using the votes within a frame and then takes into account temporal consistency by recursively updating this distribution with votes aggregated from past time instants. In order to construct the probability density, it uses a Parzen estimator with a Gaussian kernel. In order to account for the time component of the approximated density, the density p(c|I.sub.t) is sequentially updated as follows:
p′(c|I.sub.t)=αp(c|I.sub.t)+(1−α)p′(c|I.sub.t-1)
(67) This is a simple yet effective method to keep temporal consistency of the cast votes, as it requires storing a single probability map.
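The Parzen estimation of the voting density can be sketched as a sum of Gaussian kernels centred at the vote locations; the bandwidth value and grid-based evaluation are illustrative assumptions:

```python
import math

def parzen_density(votes, grid_w, grid_h, sigma=2.0):
    """votes: list of (x, y) positions where a patch voted for the gesture.
    Returns an unnormalized density map built with a Gaussian kernel."""
    density = [[0.0] * grid_w for _ in range(grid_h)]
    for y in range(grid_h):
        for x in range(grid_w):
            for vx, vy in votes:
                d2 = (x - vx) ** 2 + (y - vy) ** 2
                density[y][x] += math.exp(-d2 / (2.0 * sigma ** 2))
    return density
```

Because voting is sparsified, only a few kernel terms contribute, and the density peaks where neighbouring patches agree on the gesture location.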
(68) The pixel location g.sub.c of a gesture class c>0 is computed as the pixel location with maximum probability. To ensure that such a maximum represents a target gesture, the probability volume V is thresholded by locally integrating the estimated pseudo-probability measure:
(69) V=∫.sub.S p′(c|I.sub.t)dS
where S is a circular surface element of radius inversely proportional to the depth, and centered at the global maximum. In this way, the localization is depth-invariant.
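The depth-invariant acceptance test can be sketched as below; the proportionality constant and the acceptance threshold are assumptions for illustration:

```python
def integrate_at_maximum(density, center, depth_at_center, c_radius=20.0):
    """Integrate the density over a circular surface centred at the global
    maximum, with radius inversely proportional to the depth there."""
    radius = c_radius / depth_at_center     # deeper target -> smaller disc
    cx, cy = center
    total = 0.0
    for y in range(len(density)):
        for x in range(len(density[0])):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                total += density[y][x]
    return total

def is_gesture(density, center, depth_at_center, threshold=1.5):
    # Accept the maximum as a target gesture only if enough probability
    # mass lies within the depth-scaled disc around it.
    return integrate_at_maximum(density, center, depth_at_center) >= threshold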
Gesture Tracking-by-Detection
(70) An aspect of the present invention relates to the tracking of gestures. Tracking implies obtaining the trajectories of the gestures present in the depth image. Trajectories are further used as mouse-like controls (e.g., sliders, pointers, etc.).
(71) Tracking is formulated as an energy minimisation problem with the following functional:
(72) E(x)=Σ.sub.j U(x.sub.j, z.sub.j)+Σ.sub.i,j P(x.sub.i, x.sub.j, z.sub.i, z.sub.j)
where x denotes the set of hand locations and Z the full set of observations (or evidence, measurements); x.sub.j are candidate locations in the current frame, z.sub.j are observations in the current frame; x.sub.i and z.sub.i are candidates and observations on the previous tracked frame.
(73) Hence, the first term represents the unary potentials and the second term gathers the pairwise potentials.
(74) Currently, the unary potentials are the detection responses obtained with the present invention, whereas the pairwise potentials are modelled as a sum of the following terms: Distance cost, which measures the 2D distance between two tracks x.sub.i, x.sub.j in consecutive frames; and Appearance cost, which measures the L1 norm between two candidate patches (depth data only).
(75) In practice, the problem is equivalent to minimising the energy in a bipartite graph (association), and hence it is solved with the Hungarian (Munkres) algorithm.
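The association step can be sketched as follows. The cost combines the distance and appearance terms named above (the weights and the scalar appearance descriptor are assumptions); for the handful of tracks involved, brute-force enumeration stands in for the Hungarian/Munkres algorithm mentioned in the text:

```python
from itertools import permutations

def association_cost(track, candidate, w_dist=1.0, w_app=1.0):
    """Pairwise cost: 2-D distance plus L1 appearance difference."""
    (tx, ty, t_app), (cx, cy, c_app) = track, candidate
    dist = ((tx - cx) ** 2 + (ty - cy) ** 2) ** 0.5   # 2-D distance cost
    app = abs(t_app - c_app)                          # L1 appearance cost
    return w_dist * dist + w_app * app

def best_assignment(tracks, candidates):
    """Minimum-cost bipartite matching by exhaustive enumeration
    (adequate for the few simultaneous hand tracks)."""
    n = len(tracks)
    best, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(association_cost(tracks[i], candidates[perm[i]])
                   for i in range(n))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost
```

For larger numbers of tracks, the O(n!) enumeration would be replaced by the Hungarian algorithm, which solves the same assignment in polynomial time.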
(76) A second step, called Track Management, follows the energy minimisation. It basically removes tracks when two hand positions/candidates overlap (duplicates), when tracks enter death zones (borders), or when tracks have been lost for a given period of time.
(77) Track management works by considering two track states, active and lost. Active tracks: energy minimization is solved only for active tracks, which are typically visualized on the screen with a color representing their ID. Lost tracks: after getting low tracking scores, the manager declares a track "lost". Lost tracks stay in a pool waiting for a new incoming hand detection to "resurrect" them and return them to active. However, lost tracks are removed if no detection appears after a few frames or if the track was lost near a death zone.
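The two-state lifecycle above can be sketched with a small class; field names and the frame limit are illustrative assumptions:

```python
class Track:
    """Minimal active/lost track lifecycle as described in the text."""
    def __init__(self, track_id, position):
        self.id, self.position = track_id, position
        self.state, self.frames_lost = "active", 0

    def on_low_score(self):
        self.state = "lost"            # wait in the pool for a detection

    def on_detection(self, position):
        # An incoming detection "resurrects" the track.
        self.position, self.state, self.frames_lost = position, "active", 0

    def tick(self, max_lost_frames=30):
        """Advance one frame; return False once the track should be removed."""
        if self.state == "lost":
            self.frames_lost += 1
            if self.frames_lost > max_lost_frames:
                return False           # no detection resurrected the track
        return True
```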
(78) In parallel, the gesture detector is run at every frame in order to incorporate new detections as new potential tracks.
APPLICATION EXAMPLES
(79) The system described herein may be used for a wide range of applications. For example, the system can be used to activate the functions of a device remotely. The device comprises an electronic mechanism for activating its functions. A set of gestures is defined and associated with distinct functions of the device. The present invention generates the electronic signals that activate the function of the device when the user performs the specific gesture. Home appliances are devices suitable to be controlled with the present invention. For example, the system can be used to modify the intensity of light bulbs, operate the thermostat and heating system, or choose songs and tune the volume of the audio equipment.
(80) Gestures convey a semantic component usually related to cultural context, which may differ from user to user. The present invention allows the user to train the system to detect the specific gesture the user prefers in order to activate certain device function such that it remains consistent with semantic prior conditions.
(81) Interaction with large screens is another application example of the present invention. Large screens are usually observed from a distance; hence, gestural interaction fulfills this need for remote control. In unconstrained conditions, users make gestures and hand movements for non-verbal communication, but the system should only react to gestures when the user actually wants to interact with the application. By means of the system presented herein, the interactive application can be activated or deactivated using a specific gesture not usually employed in natural conversation. In this manner, the gestural interaction interface of the system is not interfered with or activated by gestures unintentionally performed by people in front of the screen.