Fast and precise object alignment and 3D shape reconstruction from a single 2D image
10380788 · 2019-08-13
Cpc classification
G06T17/10
PHYSICS
G06T3/08
PHYSICS
Abstract
The innovation describes and discloses systems and methods related to deep neural networks employing machine learning to detect item 2D landmark points from a single image, such as those of an image of a face, and to estimate their 3D coordinates and shape rapidly and accurately. The system also provides for mapping by a feed-forward neural network that defines two criteria, one to learn to detect important shape landmark points on the image and another to recover their depth information. An aspect of the innovation may utilize camera models in a data augmentation approach that aids machine learning of a complex, non-linear mapping function. Other augmentation approaches are also considered.
Claims
1. A computer-implemented method for mapping a computerized 2D image into a 3D shape comprising: applying machine learning to a predetermined sample size of computerized 2D images, that application developing two criteria, a detection criterion and a transform criterion for a multi-layer deep neural network (DNN), receiving a non-sample computerized 2D image through a detector of the DNN that has been trained by the application of machine learning, regressing, with the detection criterion, the non-sample image that yields detected landmark points through the DNN recovering, with the transform criterion, depth information from image attributes; and mapping, with the transform criterion, the landmark points through linear or nonlinear functions of the DNN that yield a 3D shape corresponding to the non-sample image.
2. The method of claim 1, wherein the application of machine learning comprises augmentation steps of at least one of camera model, bounding box, local/global, gradient descent and Gaussian noise in a feed-forward manner.
3. The method of claim 2, further comprising generating additional samples from the predetermined sample size by applying the data augmentation step of camera model that applies affine (or Euclidean or projective) transformations to the sample images of the predetermined sample size and training the detector with the additional samples.
4. The method of claim 2, wherein the non-sample image is a face and the augmentation step of bounding box centers and resizes the non-sample image for regressing and mapping by the DNN.
5. The method of claim 2, wherein the non-sample image is any rigid or non-rigid object and the augmentation step of bounding box centers and resizes the non-sample image for regressing and mapping by the DNN.
6. The method of claim 2, wherein the augmentation step of local/global increases accuracy of landmark detection by the detector and provides improved efficiency of machine learning.
7. The method of claim 6, wherein the application of local/global is based on at least one of pairs, triplets and complex quadrilaterals.
8. The method of claim 2, wherein the augmentation step of gradient descent is applied to the transform criterion.
9. The method of claim 2, wherein the augmentation of Gaussian noise compensates for detection errors, missing or occluded landmark points related to either sample or non-sample image(s), or both.
10. The method of claim 1, wherein the application of machine learning comprises an augmentation step of at least applying learned weights from a recurrent layer of the DNN in a back-propagation manner.
11. The method of claim 9, wherein the recurrent layer uses backpropagation to supply learned weights to the landmark criterion and enables intermediate supervision of the machine learning.
12. A Deep Neural Network (DNN) system for mapping a computerized 2D image into a 3D shape comprising: a machine learning component that receives a predetermined sample size of computerized 2D images and that develops a detection criterion and a transform criterion, a detector for receiving a non-sample computerized 2D image, the detector having been trained by the machine learning component; and a functional mapping component that regresses and detects, with the detection criterion, landmark points of the non-sample image, recovers, with the transform criterion, depth information from attributes of the non-sample image, and maps, with the transform criterion, the landmark points and the recovered depth information into a 3D shape that corresponds to the non-sample image.
13. The DNN system of claim 12, wherein the machine learning component provides augmentation steps of at least one of camera model, bounding box, local/global, gradient descent and Gaussian noise in a feed-forward manner.
14. The DNN system of claim 13, wherein the machine learning component further generates additional samples from the predetermined sample size by applying the augmentation step of camera model that applies affine transformations to the sample images of the predetermined sample size; and the detector is trained with the additional samples.
15. The DNN system of claim 13, wherein the non-sample image is a face and the bounding box augmentation centers and resizes the non-sample image for regressing and mapping by the DNN.
16. The DNN system of claim 13, wherein the non-sample image is any rigid or non-rigid object and the bounding box augmentation centers and resizes the non-sample image for regressing and mapping by the DNN.
17. The DNN system of claim 13, wherein the local/global augmentation increases accuracy of landmark detection by the detector and provides improved efficiency of machine learning.
18. The DNN system of claim 17, wherein the local/global augmentation is based on at least one of pairs, triplets and complex quadrilaterals.
19. The DNN system of claim 13, wherein the gradient descent augmentation is applied to the transform criterion.
20. The DNN system of claim 13, wherein the Gaussian noise augmentation compensates for detection errors, missing or occluded landmark points related to either sample or non-sample image(s), or both.
21. The DNN system of claim 12, wherein the application of machine learning comprises an augmentation step of at least applying learned weights from a recurrent layer of the DNN in a back-propagation manner.
22. The DNN system of claim 21, wherein the recurrent layer uses backpropagation to supply learned weights to the landmark criterion and enables intermediate supervision of the machine learning.
Description
DETAILED DESCRIPTION
(10) In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the subject innovation. It may be evident, however, that the innovation can be practiced without these specific details. In other instances, well-known structures and devices are shown and/or described in order to facilitate describing the innovation.
(11) While specific characteristics are described herein, it is to be understood that the features, functions and benefits of the innovation can employ characteristics that vary from those described herein. These alternatives are to be included within the scope of the innovation and claims appended hereto.
(12)
(13) The innovation applies a novel algorithm that provides a fast and precise estimation of the 3D shape of an item, for example, a face, from a single 2D image of the item. As illustrated in the drawings, given an image $a \in \mathbb{R}^p$ ($p$ the number of pixels), a deep neural network defining a function $s = f(a)$ may yield the 3D coordinates of the $l$ landmark points defining the shape of the face, $s \in \mathbb{R}^{3l}$. As should be appreciated, given the large number of possible identities, illuminations, poses and expressions, a particular functional mapping $f(\cdot)$ may be difficult to estimate, and the innovation resolves this problem using a deep neural network. A deep neural network is a regression approach to estimate non-linear mappings of the form $s = f(a)$, where $a$ is the input and $s$ is the output. A deep neural network may have $p$ input and $3l$ output nodes. A complex 2D image to 3D shape mapping may be learned with a number of hidden layers and non-linear functions between layers of a deep neural network. It is to be appreciated that the term "learn" and its derivatives signify the application of machine learning techniques, as the innovation is directed to machine processing of images. This innovation is in sharp contrast to linear regression methods attempted previously as well as non-linear attempts to model 2D shape from a single image or 3D shape from multiple images.
(14) Further, compared to previous approaches, the deep neural network is able to learn from any number of 3D sample shapes, from few to many. A small number of samples might not seem sufficient for learning a regressor, but the innovative approach may also comprise data augmentation methods that circumvent the problem of an otherwise too limited sample size. For example, an embodiment of an innovative augmentation may use a camera model to generate multiple views of the same 3D shape and the matching 2D landmark points on the original sample image. Successful and accurate recovery of the 3D shape of faces from a single view has been demonstrated. Further, a multi-layer deep neural network can be trained very quickly, and testing runs faster than real time (greater than thirty frames per second).
(15) In another embodiment, deep neural networks enable modeling of complex, non-linear functions from large numbers of samples. Samples may include 2D images of faces $a_i$, $i = 1, \ldots, n$, with $n = n_1 + n_2$, where the first $n_1$ images have corresponding 2D and 3D shapes $s_i$, and the remaining $n_2$ images have only 2D shapes.
(16)
(17) Next, two optimization criteria are defined for an embodiment of a deep neural network. First, a criterion for accurate detection of 2D landmark points on an aligned image is derived. Second, a criterion for converting these 2D landmark points to 3D is defined. These two criteria are illustrated in the drawings.
(18)
(19) An example deep neural network for the detection of facial landmark points according to aspects of the innovation may provide a deep convolutional network defined with $p$ input nodes, $2l$ output nodes and 6 layers (as shown in the drawings).
(20) Turning next to the first optimization criterion, a deep neural network may employ machine learning to detect 2D landmark points of an input image accurately. In an embodiment, image samples and their corresponding 2D output variables (i.e., 2D landmark points) may be defined as the set $\{(a_1, o_1), \ldots, (a_n, o_n)\}$, where $o_i$ is the true (desirable) location of the 2D landmark points of the face. Note that $o_i$ is a vector of $2l$ image coordinates, $o_i = (u_{i1}, v_{i1}, \ldots, u_{il}, v_{il})^T$, where $(u_{ij}, v_{ij})^T$ is the $j$-th landmark point.
(21) A goal of a computer vision system is to identify the vector of mapping functions $f(a_i, w) = (f_1(a_i, w_1), \ldots, f_l(a_i, w_l))^T$ that converts an input image $a_i$ to an output vector $o_i$ of detections, with $w = (w_1, \ldots, w_l)^T$ a vector of parameters of the mapping functions. Hence, $f_j(a_i, w_j) = (\hat{u}_{ij}, \hat{v}_{ij})^T$ are the estimates of the 2D image coordinates $u_{ij}$ and $v_{ij}$, and $w_j$ are the parameters of the function $f_j$.
(22) For a fixed mapping function $f(a_i, w)$ (e.g., as may be used in a convolutional neural network), a goal of optimizing $w$ may be formally stated:
(23)
(24) where $\mathcal{L}_{local}(\cdot)$ denotes a loss function. Specifically, we use the $L^2$-loss defined as,
(25)
(26) where $o_{ij}$ is the $j$-th element of $o_i$, i.e., $o_{ij} \in \mathbb{R}^2$.
(27) Without loss of generality, and to simplify notation, the innovative approach uses $f_i$ in lieu of $f(a_i, w)$ and $f_{ij}$ instead of $f_j(a_i, w_j)$. Note that the functions $f_{ij}$ are the same for all $i$, but may be different for distinct values of $j$.
(28) The above derivations correspond to a local fit. That is, (1) and (2) attempt to optimize the fit of each one of the outputs independently and then take the average fit over all outputs. This approach has several solutions, even for a fixed fitting error. For example, the error can be equally distributed across all outputs, $\|f_{ij} - o_{ij}\|_2 \approx \|f_{ik} - o_{ik}\|_2$, $\forall j, k$, where $\|\cdot\|_2$ is the 2-norm of a vector. Or, most of the error may be in one (or a few) of the estimates: $\|f_{ij} - o_{ij}\|_2 \gg \|f_{ik} - o_{ik}\|_2$ and $\|f_{ik} - o_{ik}\|_2 \approx 0$, $\forall k \neq j$. In general, for a fixed fitting error, the latter example is less preferable, because it leads to large errors in one of the output variables. Large errors may indicate that an algorithm did not converge as expected, and its results may be less useful.
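The ambiguity of a purely local fit can be illustrated with a small numerical sketch. It assumes the local loss is the squared 2-norm error averaged over landmarks, which matches the description of (1) and (2) but is not quoted from them; the function name `local_l2_loss` and the toy coordinates are hypothetical.

```python
import numpy as np

def local_l2_loss(pred, target):
    """Squared 2-norm error per landmark, averaged over landmarks.

    pred, target: arrays of shape (l, 2). The averaging form is an
    assumption made for illustration.
    """
    per_landmark = np.sum((pred - target) ** 2, axis=1)
    return float(per_landmark.mean())

target = np.zeros((4, 2))

# Error spread evenly: every landmark is off by 1 pixel.
spread = target + np.array([1.0, 0.0])

# Error concentrated: one landmark is off by 2 pixels, the rest are exact.
concentrated = target.copy()
concentrated[0] = [2.0, 0.0]

loss_spread = local_l2_loss(spread, target)
loss_concentrated = local_l2_loss(concentrated, target)
worst_spread = np.linalg.norm(spread - target, axis=1).max()
worst_concentrated = np.linalg.norm(concentrated - target, axis=1).max()
```

Both predictions have the same local loss even though the worst single-landmark error differs by a factor of two, which is the situation the paragraph above describes.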
(29) A possible solution to this problem is to add an additional constraint to minimize
(30)
(31) with $c \geq 1$. However, this approach typically results in very slow training, limiting the amount of training data that can be efficiently used. With fewer training samples, generalization to unseen samples worsens, typically resulting in less accurate detections. Another typical problem of this equation is that the constraint is not flexible enough for current optimization algorithms. The innovative approach resolves these problems by adding a global fitting criterion that, instead of slowing or halting convergence, speeds it up.
(32) An aspect of the innovative approach is to note that the constraint in (2) is local because it measures the fit of each element of $o_i$ (i.e., $o_{ij}$) independently. By local, it is meant that each output is fit on its own, without regard to the other outputs. The same criterion can nonetheless be used differently to measure the fit of pairs of points; formally:
(33)
(34) where $g(d, e) = \|d - e\|_b$ is the $b$-norm of $d - e$ (e.g., the 2-norm, $g(d, e) = \sqrt{(d - e)^T (d - e)}$).
(35) An aspect of the innovative approach for these derivations is the realization that (4) is no longer local, since it takes into account the global structure of each pair of elements. Resolving the problems of (2) enumerated above with the addition of (4) yields accurate detections of landmark points and fast training.
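A pairwise global criterion of this kind can be sketched as follows. The sketch assumes that the loss compares $g$ evaluated on each pair of predicted landmarks with $g$ evaluated on the corresponding pair of true landmarks, squares the difference, and averages over all pairs; this aggregation, the function name `pairwise_global_loss` and the toy coordinates are illustrative assumptions rather than the exact form of (4).

```python
from itertools import combinations

import numpy as np

def pairwise_global_loss(pred, target, b=2):
    """Mean squared difference between pairwise b-norm distances.

    pred, target: arrays of shape (l, 2). g(d, e) = ||d - e||_b as in the
    text; averaging over all pairs is an assumption.
    """
    errors = []
    for j, k in combinations(range(len(pred)), 2):
        g_pred = np.linalg.norm(pred[j] - pred[k], ord=b)
        g_true = np.linalg.norm(target[j] - target[k], ord=b)
        errors.append((g_pred - g_true) ** 2)
    return float(np.mean(errors))

target = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]])

shifted = target + np.array([5.0, -3.0])   # whole shape translated
distorted = target.copy()
distorted[2] = [0.0, 6.0]                  # one landmark displaced

loss_same = pairwise_global_loss(target, target)
loss_shifted = pairwise_global_loss(shifted, target)
loss_distorted = pairwise_global_loss(distorted, target)
```

Under these assumptions the pairwise term is zero for a pure translation of the whole shape and positive when a single landmark is displaced, so it measures the structure of the shape rather than each point's position and complements the local term.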
(36) In some embodiments of a deep neural network, layers may be defined by $h(f_{ij}) = f_{ij} \in \mathbb{R}^2$ for landmark detection. In other embodiments, a global criterion may be extended to triplets; formally:
(37)
Here $g(x, z, u)$ is a function that computes the similarity between its three entries. Applied to the detection of landmark points, this means a norm can be computed as above, e.g., $g(x, z, u) = \|(x - z) + (z - u)\|_b$, but the area of the triangle defined by each triplet of landmark points can also be calculated; formally, $g(x, z, u) = |(x - z) \times (x - u)|$, where we assume the three landmark points are non-collinear.
(38) In still other embodiments, the global criterion may be extended to four or more points, for instance, to convex quadrilaterals as $g(x, z, u, v) = |(x - u) \times (z - v)|$. In such embodiments, for $t$ landmark points, the area of the polygon envelope can be computed, i.e., of a non-self-intersecting polygon contained by the $t$ landmark points $\{x_{i1}, \ldots, x_{it}\}$. This polygon may be computed as follows. First, a Delaunay triangulation of the image (for example, a face image) landmark points is computed. A polygon envelope is then obtained by connecting the set of $t$ landmark points in counter-clockwise order. Denoting this ordered set of landmark points $\tilde{x}_i = \{\tilde{x}_{i1}, \ldots, \tilde{x}_{it}\}$, the area is then given by:
(39)
where we used the subscript $a$ to denote area and $\tilde{x}_{ik} = (\tilde{x}_{ik1}, \tilde{x}_{ik2})^T$.
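For landmark points already ordered counter-clockwise, the area of the polygon envelope can be computed with the standard shoelace formula. The sketch below uses that formula as an assumption about how the area may be evaluated; the function name `polygon_area` is hypothetical, and the Delaunay step that orders the points is not shown.

```python
import numpy as np

def polygon_area(points):
    """Area of a non-self-intersecting polygon via the shoelace formula.

    points: array of shape (t, 2) with vertices in order around the polygon.
    """
    x, y = points[:, 0], points[:, 1]
    # Sum of x_k * y_{k+1} - x_{k+1} * y_k over consecutive vertices (cyclic).
    twice_area = np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1))
    return float(0.5 * abs(twice_area))

unit_square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
right_triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

square_area = polygon_area(unit_square)
triangle_area = polygon_area(right_triangle)
```

A global loss term could then compare the area of the predicted envelope with the area of the true envelope, in the same way the pairwise and triplet terms compare distances.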
(40) In some embodiments, we may use the combined local and global loss function given by $\mathcal{L}(f_i, y_i) = \alpha_0 \mathcal{L}_{local}(f_i, y_i) + \mathcal{L}_{global}(f_i, y_i)$, with the global loss defined as $\mathcal{L}_{global}(f_i, y_i) =$
(41)
(42) In an example implementation that demonstrates aspects of the innovation, $l$ was set to 66 and $n_1 + n_2 = 18{,}600$ samples were used. Additionally, the deep neural network used four convolutional layers, two max pooling layers and two fully connected layers. It is to be appreciated that normalization may be applied, with dropout, and rectified linear units (ReLU) at the end of each convolutional layer. An advantage of the embodiment is that learning from even very large datasets can be performed efficiently. In order to have a landmark detector invariant to any affine transformation and partial occlusions, a data augmentation approach may be used (as will also be discussed in the section Missing Data herein). Specifically, an additional 80,000 images were generated by applying two-dimensional affine transformations to an existing training set, i.e., scale, reflection, translation and rotation; scale was between 0.5 and 2, rotation was between −10° and 10°, and translation and reflection were randomly generated. This is equivalent to using a camera model. In order to make the network more robust to partial occlusions, random occluding boxes of $d \times d$ pixels may be added; in an example embodiment in which the item is a face, $d$ may be set between 0.2 and 0.4 times the inter-eye distance, and 25% of training images had partial occlusions.
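The affine augmentation described above can be sketched on landmark coordinates alone. The scale range follows the text and the rotation range is read as plus or minus 10 degrees; the translation range `max_shift`, the 50% reflection probability and the function name `random_affine` are assumptions made for illustration, and warping of the image pixels is omitted.

```python
import numpy as np

def random_affine(landmarks, rng, max_shift=20.0):
    """Apply a random 2D scale, rotation, reflection and translation.

    landmarks: array of shape (l, 2). Returns the transformed landmarks
    and the sampled scale. max_shift (pixels) is a hypothetical parameter.
    """
    scale = rng.uniform(0.5, 2.0)
    angle = np.deg2rad(rng.uniform(-10.0, 10.0))
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    if rng.random() < 0.5:
        rot = rot @ np.diag([-1.0, 1.0])   # horizontal reflection
    shift = rng.uniform(-max_shift, max_shift, size=2)
    return scale * landmarks @ rot.T + shift, scale

rng = np.random.default_rng(0)
landmarks = np.array([[10.0, 20.0], [40.0, 20.0], [25.0, 50.0]])
augmented, scale = random_affine(landmarks, rng)

# Distances between landmarks change only by the sampled scale factor.
original_dist = np.linalg.norm(landmarks[0] - landmarks[1])
augmented_dist = np.linalg.norm(augmented[0] - augmented[1])
```

Calling `random_affine` repeatedly on the same annotated sample yields as many additional training samples as desired.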
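(43) Returning to the second criterion, the recovery of 3D information (i.e., the depth value) for the 2D landmark points (as detected above, for example) is now described. Note that the $n$ 2D landmark points on the $i$-th image can be written in matrix form as
(44) $$U_i = \begin{pmatrix} u_{i1} & u_{i2} & \cdots & u_{in} \\ v_{i1} & v_{i2} & \cdots & v_{in} \end{pmatrix} \in \mathbb{R}^{2 \times n},$$
(45) and the goal is to recover the 3D coordinates of these 2D landmark points,
(46) $$S_i = \begin{pmatrix} x_{i1} & x_{i2} & \cdots & x_{in} \\ y_{i1} & y_{i2} & \cdots & y_{in} \\ z_{i1} & z_{i2} & \cdots & z_{in} \end{pmatrix} \in \mathbb{R}^{3 \times n},$$
(47) where $(x_{ij}, y_{ij}, z_{ij})^T$ are the 3D coordinates of the $j$-th face landmark.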
(48) With an embodiment using a weak-perspective camera model, with calibrated camera matrix
(49) $$M = \begin{pmatrix} \lambda & 0 & 0 \\ 0 & \lambda & 0 \end{pmatrix},$$
the weak-perspective projection of the face 3D landmark points may be given by
$$U_i = M S_i. \tag{7}$$
(50) This result is defined up to scale, since $u_i = \lambda x_i$ and $v_i = \lambda y_i$, where $\lambda$ is the scale factor of $M$, $x_i^T = (x_{i1}, x_{i2}, \ldots, x_{in})$, $y_i^T = (y_{i1}, y_{i2}, \ldots, y_{in})$, $z_i^T = (z_{i1}, z_{i2}, \ldots, z_{in})$, $u_i^T = (u_{i1}, u_{i2}, \ldots, u_{in})$ and $v_i^T = (v_{i1}, v_{i2}, \ldots, v_{in})$.
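The projection in Eq. (7) can be sketched in a few lines. The sketch assumes the weak-perspective camera matrix is a scaled orthographic projection with a single scale factor, named `scale` here as a hypothetical parameter; the toy coordinates are illustrative.

```python
import numpy as np

def weak_perspective_project(S, scale=1.0):
    """Project 3D landmarks S (shape 3 x n) to 2D (shape 2 x n).

    Assumes M = [[scale, 0, 0], [0, scale, 0]], i.e. the depth row of S
    is dropped and x, y are multiplied by one common scale factor.
    """
    M = np.array([[scale, 0.0, 0.0],
                  [0.0, scale, 0.0]])
    return M @ S

# Three landmarks, stored column-wise as (x, y, z).
S = np.array([[0.0, 1.0, 2.0],
              [0.0, 1.0, 0.0],
              [5.0, 6.0, 7.0]])
U = weak_perspective_project(S, scale=0.5)
```

Because only a common scale separates $(u_i, v_i)$ from $(x_i, y_i)$, the depth row cannot be read off the projection and must be estimated, which is the role of the network described next.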
(51) It is to be appreciated that this approach requires that variables are to be standardized when deriving the algorithm.
(52) Continuing with the description of the embodiment of a proposed neural network, it is to be noted that given a training set with $n$ 3D landmark points $\{S_i\}_{i=1}^{n}$, the aim is to learn the function $f: \mathbb{R}^{2n} \to \mathbb{R}^{3n}$, that is,
$$\hat{z}_i = f(\hat{x}_i, \hat{y}_i), \tag{8}$$
(53) where $\hat{x}_i$, $\hat{y}_i$ and $\hat{z}_i$ are obtained by standardizing $x_i$, $y_i$ and $z_i$ as follows,
(54)
(55) where
(56) It is to be appreciated that $x_i$, $y_i$ and $z_i$ are standardized to eliminate the effect of scaling and translation of the 3D face, as noted above. In this manner, the embodied deep neural network models the function $f(\cdot)$ using multiple layers. As discussed previously in regards to the drawings, each layer may be defined as
$$a^{(m+1)} = \tanh(\Omega^{(m)} a^{(m)} + b^{(m)}),$$
(57) where $a^{(m)} \in \mathbb{R}^d$ is an input vector, $a^{(m+1)} \in \mathbb{R}^r$ is the output vector, $d$ and $r$ specify the number of input and output nodes, respectively, and $\Omega^{(m)} \in \mathbb{R}^{r \times d}$ and $b^{(m)} \in \mathbb{R}^r$ are network parameters, with the former a weighting matrix and the latter a bias vector. An embodiment of the deep neural network may use a hyperbolic tangent function, $\tanh(\cdot)$.
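A forward pass through such a stack of tanh layers can be sketched as below. The layer widths (2n nodes per layer and n in the last) follow the implementation details given later in the description; the random initialization, the helper names `init_layers` and `forward`, and the use of `W` for the weighting matrix are assumptions, and no training is performed.

```python
import numpy as np

def init_layers(n, num_layers=6, seed=0):
    """Create (W, b) pairs for layers of width 2n, ..., 2n, n.

    Random initial weights are an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    sizes = [2 * n] * (num_layers - 1) + [n]
    return [(rng.normal(0.0, 0.1, size=(r, d)), np.zeros(r))
            for d, r in zip(sizes[:-1], sizes[1:])]

def forward(layers, a):
    """Apply a <- tanh(W a + b) for each layer in turn."""
    for W, b in layers:
        a = np.tanh(W @ a + b)
    return a

n = 5                                   # number of landmark points
layers = init_layers(n)
xy = np.linspace(-1.0, 1.0, 2 * n)      # standardized 2D coordinates
depth = forward(layers, xy)             # one depth estimate per landmark
```

With tanh on the last layer the outputs lie in (−1, 1), which is compatible with standardized depth values.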
(58) Deep neural network model parameter optimization includes an objective to minimize the sum of the Euclidean distances between the predicted depth location $a_i^{(m)}$ and the ground truth $\hat{z}_i$ of our $l$ 3D landmark points; formally:
(59)
(60) with $\|\cdot\|_2$ denoting the Euclidean distance between two vectors. The RMSProp algorithm, as discussed in Tieleman and Hinton's "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude" in COURSERA: Neural Networks for Machine Learning (2012), which is incorporated by reference herein in its entirety, may be utilized to optimize model parameters. In other embodiments, alternative optimization algorithms may be used. In a multi-layer neural network, an appropriate learning rate may vary widely over the course of training as well as between different parameters. RMSProp is a technique that updates the parameters of a neural network and adaptively adjusts the learning rate of each parameter separately to improve convergence to a solution.
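The per-parameter adaptation performed by RMSProp can be sketched on a toy quadratic objective. The update below is the commonly used form of the rule (a running average of squared gradients divides the step); the decay of 0.9, the small constant `eps` and the toy objective are assumptions, while the learning rate of 0.01 matches the example value mentioned in the description.

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update; cache is the running average of grad**2."""
    cache = decay * cache + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

# Toy objective: ||w - target||^2, whose gradient is 2 (w - target).
target = np.array([3.0, -1.0])
w = np.zeros(2)
cache = np.zeros(2)
for _ in range(2000):
    grad = 2.0 * (w - target)
    w, cache = rmsprop_step(w, grad, cache)
```

Each coordinate is rescaled by its own gradient history, so both parameters approach their targets at a similar pace even though their gradients differ in magnitude.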
(61) An embodiment of a deep neural network may contain six or more layers in a feed-forward arrangement. The number of nodes may be $2n$ in each layer except the last, in which the number of nodes is $n$. In other embodiments, the number of nodes may differ from layer to layer, but will generally be $n$ in the last layer since this is the number of landmark points to be reconstructed in 3D. In other embodiments, the number of layers may be four or more.
(62) When testing on the $t$-th face, we have $u_t$ and $v_t$, and want to estimate $x_t$, $y_t$ and $z_t$. From Eq. (7) we have $u_t = \lambda x_t$ and $v_t = \lambda y_t$, with $\lambda$ the scale factor of the weak-perspective camera.
(63) Thus, we first standardize the data,
(64)
(65) This yields $\hat{x}_t = \hat{u}_t$ and $\hat{y}_t = \hat{v}_t$. Therefore, we can directly feed $(\hat{u}_t, \hat{v}_t)$ into a trained neural network to obtain the depth $\hat{z}_t$. Then, the 3D shape of the object in an image, for example, a face, can be recovered as $(\hat{u}_t^T, \hat{v}_t^T, \hat{z}_t^T)^T$, a result that is defined up to scale.
(66) Training data may be divided into a training set and a validation set. In each of these two sets, data augmentation may be performed. Generally, augmentation may include handling noise in the data, missing data, and a variable number of training samples. Specifically, the weak-perspective camera model defined above may be used to generate new 2D views of the 3D landmark points given in the training set. This process may help the deep neural network learn how each 3D shape is seen from a large variety of 2D views (translation, rotation, scale). Early stopping may be enabled to prevent overfitting and accelerate the training process. For example, the training process may be stopped if the validation error does not decrease after 10 iterations. A learning rate may be set, for example, at 0.01.
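The early-stopping rule can be sketched as a small helper that scans a sequence of validation errors. The patience of 10 iterations follows the example in the text; the function name `early_stopping_epoch` and the error values are hypothetical.

```python
def early_stopping_epoch(val_errors, patience=10):
    """Return the index at which training would stop.

    Training stops once the validation error has failed to improve on
    its best value for `patience` consecutive iterations.
    """
    best = float("inf")
    since_best = 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, since_best = err, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(val_errors) - 1

# Error improves for three iterations, then plateaus above its best value.
errors = [1.0, 0.8, 0.7] + [0.75] * 20
stop_at = early_stopping_epoch(errors)
```

In practice the parameters saved at the best validation error would be the ones kept.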
(67) Missing Data
(68) To aid in solving a problem of missing data, an embodiment of a deep neural network may add a recurrent layer on top of a previous multi-layer neural network to jointly estimate both the 2D coordinates of missing 2D landmarks and their depth. The complete network may be trained in an end-to-end fashion.
(69) Turning now to the recurrent architecture illustrated in the drawings, $\|\cdot\|_2$ is the loss function used.
(70) In the recurrent layer, we use the notation $\hat{u}_{ij}^{(s)}$ and $\hat{v}_{ij}^{(s)}$ to specify the estimated values of $\hat{u}_{ij}$ and $\hat{v}_{ij}$ at iteration $s$. Here, $i$ specifies the $i$-th sample. The input to the deep neural network embodied above can then be written as $d_i^{(0)} = (\hat{u}_{i1}^{(0)}, \hat{v}_{i1}^{(0)}, \ldots, \hat{u}_{in}^{(0)}, \hat{v}_{in}^{(0)})$, with $s = 0$ specifying the initial input. If the values of $u_{ij}$ and $v_{ij}$ are missing, then $\hat{u}_{ij}^{(0)}$ and $\hat{v}_{ij}^{(0)}$ are set to zero. Otherwise the values of $u_{ij}$ and $v_{ij}$ are standardized using Eq. (12) to obtain $\hat{u}_{ij}^{(0)}$ and $\hat{v}_{ij}^{(0)}$.
(71) In subsequent iterations, from $s - 1$ to $s$, if the $j$-th landmark is not missing, $\hat{u}_{ij}^{(s)} = \hat{u}_{ij}^{(s-1)}$ and $\hat{v}_{ij}^{(s)} = \hat{v}_{ij}^{(s-1)}$. If the $j$-th landmark is missing, then $\hat{u}_{ij}^{(s)} = g\left(\sum_{k=1}^{2n} w_{k(2j-1)} d_{ik}^{(s-1)}\right)$ and $\hat{v}_{ij}^{(s)} = g\left(\sum_{k=1}^{2n} w_{k(2j)} d_{ik}^{(s-1)}\right)$, where $g(\cdot)$ can be the identity (linear) function or a nonlinear function (e.g., $\tanh(\cdot)$) and $w_{k(2j-1)}$, $w_{k(2j)}$, $k = 1, \ldots, 2n$, $j = 1, \ldots, n$, are the parameters of the recurrent layer.
(72) We set the number of iterations to $\tau$, which yields $d_i = \sum_{s=1}^{\tau} \gamma_s d_i^{(s)}$ as the final output of the recurrent layer, where $\gamma_s$ are learned weights. We initialize $\gamma_s$ such that $0 < \gamma_1 < \ldots < \gamma_\tau$ and $\sum_{s=1}^{\tau} \gamma_s = 1$. The vector $\gamma = (\gamma_1, \ldots, \gamma_\tau)^T$ is then learned using backpropagation. By using the weighted sum of the output at each step rather than the output at the last step as the final output of the recurrent layer, we can enforce intermediate supervision to make the recurrent layer gradually converge to the desirable output.
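The iteration of the recurrent layer can be sketched as follows. The sketch keeps observed coordinates fixed, re-estimates missing ones from the previous iterate through a weight matrix, and returns the weighted sum of the iterates. The random weight matrix, the names `recurrent_fill` and `gammas`, and the choice of three iterations are assumptions; in the described embodiment these parameters are learned by backpropagation.

```python
import numpy as np

def recurrent_fill(d0, missing, W, gammas, g=np.tanh):
    """Weighted sum of the iterates of the recurrent layer.

    d0: (2n,) standardized input with zeros at missing entries.
    missing: (2n,) boolean mask of missing coordinates.
    W: (2n, 2n) recurrent parameters; gammas: weights summing to 1.
    """
    d_prev = d0
    out = np.zeros_like(d0)
    for gamma in gammas:
        estimate = g(W.T @ d_prev)       # entry m: g(sum_k W[k, m] * d_prev[k])
        d_curr = np.where(missing, estimate, d_prev)
        out += gamma * d_curr
        d_prev = d_curr
    return out

n = 3
rng = np.random.default_rng(1)
W = rng.normal(0.0, 0.5, size=(2 * n, 2 * n))    # untrained, for illustration
d0 = np.array([0.2, -0.4, 0.0, 0.0, 0.7, 0.1])   # second landmark missing
missing = np.array([False, False, True, True, False, False])
gammas = np.array([0.1, 0.3, 0.6])               # increasing, sums to 1

d_final = recurrent_fill(d0, missing, W, gammas)
```

Because the weights sum to one and observed coordinates are copied unchanged at every iteration, the observed entries of the final output equal the input, and only the missing entries are estimated.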
(73) Data Augmentation Approach
(74) In many applications, the number of available training samples (i.e., 2D and corresponding 3D landmark points) may be small. However, any regressor designed to learn a mapping function $f(\cdot)$ may require a large number of training samples with the 2D landmarks as seen from as many cameras and views (i.e., translation, rotation, scale) as possible to reach an acceptable performance level. The trade-off may be resolved with a seemingly simple, yet efficient data augmentation approach.
(75) A key to our approach is to note that, for a given object, its 3D structure does not change. What changes are the 2D coordinates of the landmark points in an image of the given object. For example, scaling or rotating an object in 3D yields different 2D coordinates of the same object landmarks. Thus, our task is to generate as many of these possible sample views (of a given object) as possible.
(76) We do this with a camera model. Herein as described, we use an affine camera model to generate a very large number of images of the known 3D sample objects. In other embodiments, a different camera model may be used. We model the intrinsic (e.g., focal length) as well as the extrinsic parameters (e.g., 3D translation, rotation and scale). A specific embodiment is the use of the weak-perspective camera model.
(77) Another data augmentation concerns the modeling of imprecisely localized 2D landmark points. All detection algorithms yield imprecise detections (even when fiducial detections are done by humans). An embodiment of a deep neural network may address this problem by modeling the detection error as Gaussian noise, with zero mean and variance $\sigma$. A particular embodiment may use a small variance equivalent to about 3% of the size of the object. This means that, in addition to the 2D landmark points given by the camera models used above, a deep neural network will incorporate 2D landmark points that have been altered by adding this random Gaussian noise. This allows our neural network to learn to accurately recover the 3D shape of an object from imprecisely localized 2D landmark points.
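The noise augmentation can be sketched as below. The text gives the noise level as about 3% of the object size; treating that figure as the standard deviation of the noise is an assumption of this sketch, as are the function name `add_landmark_noise` and the toy coordinates.

```python
import numpy as np

def add_landmark_noise(landmarks, object_size, rng, fraction=0.03):
    """Add zero-mean Gaussian noise to 2D landmarks.

    Assumption: `fraction` of the object size is used as the standard
    deviation of the noise.
    """
    sigma = fraction * object_size
    return landmarks + rng.normal(0.0, sigma, size=landmarks.shape)

rng = np.random.default_rng(0)
landmarks = np.array([[10.0, 20.0], [40.0, 20.0], [25.0, 50.0]])
object_size = 30.0                       # e.g. width of the object in pixels

noisy = add_landmark_noise(landmarks, object_size, rng)
unchanged = add_landmark_noise(landmarks, object_size, rng, fraction=0.0)
```

Noisy copies generated this way are added to the training set alongside the views produced by the camera model.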
(78) It is important to note that, when the original training set is small, the deep neural network can still train efficiently using this method. In fact, we have found experimentally that we do not need a large number of training samples to obtain extremely low reconstruction errors. This is significant because deep neural nets most usually require very large training sets to learn to detect and recognize objects in images. Of course, even when the number of samples is large, our approach helps reduce the 3D reconstruction error by incorporating intrinsic and extrinsic camera parameters and detection errors which may not be well represented in the samples.
(79) Applied Noise and Missing Data to an Embodiment
(80) To determine how sensitive the proposed neural network is to inaccurate 2D landmark detections, we add independent random Gaussian noise with variance $\sigma$ to the elements in the databases as described in the preceding sections. That is, we add noise to the training samples. Specifically, we apply Gaussian noise to the 2D landmarks.
(81) Performance degrades little as $\sigma$ increases when noise is added to the CMU Motion Capture database. The average height of subjects in this dataset is 1,500 mm, and the variance of the noise added is about 3%. The proposed algorithm has been found to be robust to these inaccurate 2D landmark positions: the relative reconstruction error, averaged across the testing subjects for each landmark, compares favorably with and without noise. Results on publicly available databases, for example, the BU-3DFE Face Database, the FG3DCar Database and the Flag Flapping in the Wind sequence, have been obtained. The average width of the faces in BU-3DFE is 140 mm; hence, the variance of the detection error (noise) is 5%. The mean width of the car models in FG3DCar is 569 pixels; hence, the variance is 2%. The mean width of the flags is 386 mm; hence, the variance is 3%.
(82) Additionally, we tested the ability of the trained system to deal with missing data. Here, each training and validation sample had one or more randomly selected landmark points missing during training and testing. Comparative results with different numbers of missing landmark points are in Table 1. Reference (1) is Zhou, Leonardos, Hu, and Daniilidis, "3d shape estimation from 2d landmarks: A convex relaxation approach," published in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. Reference (2) is Zhou, Zhu, Leonardos, Derpanis, and Daniilidis, "Sparseness meets deepness: 3d human pose estimation from monocular video," published in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. Reference (3) is Ramakrishna, Kanade, and Sheikh, "Reconstructing 3d human pose from 2d image landmarks," published in ECCV, 2012, pp. 573-586. As can be seen in the table, our example deep neural network achieves smaller reconstruction errors than Zhou et al. even when our inputs had missing data and those of Zhou et al. did not. We also compare our approach with a simple nearest neighbor approach. In the nearest neighbor approach, for each testing sample, the 3D reconstruction is the 3D shape in the training set whose 2D projection has the smallest Euclidean distance to that of the test image.
(83) TABLE 1
Method | Human 3.6M | CMU MoCap Subject 13 | CMU MoCap Subject 14 | CMU MoCap Subject 15 | BU-3DFE Face | FG3DCar | Flag Flapping
Disclosed Embodiment | 0.0120 | 0.0231 | 0.0200 | 0.0095 | 0.0032 | 0.0020 | 0.0004
Disclosed Embodiment (with one missing) | 0.0314 | 0.0413 | 0.0396 | 0.0307 | 0.0035 | 0.0079 | 0.0038
Disclosed Embodiment (with two missing) | 0.0383 | 0.0728 | 0.0694 | 0.0693 | 0.0037 | 0.0086 | 0.0054
Nearest Neighbor (with one missing) | 0.0426 | 0.0983 | 0.0844 | 0.0497 | 0.0112 | 0.0129 | 0.0101
Nearest Neighbor (with two missing) | 0.0428 | 0.0992 | 0.0859 | 0.0509 | 0.0106 | 0.0123 | 0.0101
Methods evaluated on only a subset of the datasets (values listed left to right in column order):
Zhou et al. (1): 0.0653, 0.0643, 0.0405, 0.0053, 0.0042
Zhou et al. (2): 0.0359
Ramakrishna et al. (3): 0.0983, 0.0979, 0.0675
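The nearest neighbor baseline compared in Table 1 can be sketched in a few lines. The array layout (one 2 x n and one 3 x n matrix per training sample), the function name `nearest_neighbor_3d` and the random toy data are assumptions made for illustration.

```python
import numpy as np

def nearest_neighbor_3d(query_2d, train_2d, train_3d):
    """Return the training 3D shape whose 2D projection is closest.

    query_2d: (2, n); train_2d: (N, 2, n); train_3d: (N, 3, n).
    Closeness is the Euclidean distance over all 2D coordinates.
    """
    diffs = (train_2d - query_2d).reshape(len(train_2d), -1)
    dists = np.linalg.norm(diffs, axis=1)
    return train_3d[np.argmin(dists)]

rng = np.random.default_rng(0)
train_3d = rng.normal(size=(5, 3, 4))    # 5 training shapes, 4 landmarks each
train_2d = train_3d[:, :2, :]            # orthographic projection of each shape

query = train_2d[3] + 0.001              # slightly perturbed view of shape 3
recovered = nearest_neighbor_3d(query, train_2d, train_3d)
```

The baseline can only return shapes already present in the training set, which is consistent with its error in Table 1 changing little between one and two missing landmarks.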
(84)
(85) Detection error may be evaluated using Ground Truth Error (GTE) and Cross View Ground Truth Consistency Error (CVGTCE). For example, evaluating error based on eye distance, GTE is the average point-to-point Euclidean error between prediction and ground truth normalized by the Euclidean distance between the outer corners of the eyes. Formally,
(86) $$\mathrm{GTE}(S, \tilde{S}) = \frac{1}{n} \sum_{k=1}^{n} \frac{\|s_k - \tilde{s}_k\|}{d},$$
(87) where $\|\cdot\|$ is the $L_2$-norm, $S$ and $\tilde{S}$ are the 3D prediction and ground truth, $s_k$ and $\tilde{s}_k$ are the $k$-th 3D point of $S$ and $\tilde{S}$ respectively, $n$ is the number of points, and $d$ is the Euclidean distance between the outer corners of the eyes.
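The GTE measure can be sketched directly from its verbal definition: the mean point-to-point Euclidean error divided by the distance between the outer eye corners. Measuring the eye distance on the ground truth, the function name `gte`, and the landmark indices are assumptions of this sketch.

```python
import numpy as np

def gte(pred, truth, left_eye_idx, right_eye_idx):
    """Mean point-to-point error normalized by the outer eye-corner distance.

    pred, truth: arrays of shape (n, 3). The eye-corner indices depend on
    the landmark scheme in use and are hypothetical here.
    """
    d = np.linalg.norm(truth[left_eye_idx] - truth[right_eye_idx])
    errors = np.linalg.norm(pred - truth, axis=1)
    return float(errors.mean() / d)

truth = np.array([[-1.0, 0.0, 0.0],      # outer corner of one eye
                  [1.0, 0.0, 0.0],       # outer corner of the other eye
                  [0.0, -1.0, 0.5]])     # one further landmark
pred = truth + np.array([0.2, 0.0, 0.0]) # every point is off by 0.2

error = gte(pred, truth, left_eye_idx=0, right_eye_idx=1)
```

Here every point is off by 0.2 and the eye distance is 2, so the normalized error is 0.1, i.e. 10%.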
(88) CVGTCE is a measurement that evaluates cross-view consistency of the predicted landmarks by comparing the prediction and ground truth from a different view of the same target. Formally,
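(89) $$\mathrm{CVGTCE}(S,\tilde{S},P)=\frac{1}{n}\sum_{k=1}^{n}\frac{\lVert\tilde{s}_k-(cRs_k+t)\rVert_2}{d}$$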
(90) where $P=\{c, R, t\}$ encodes a rigid transformation, i.e., scale ($c$), rotation ($R$), and translation ($t$) between $S$ and $\tilde{S}$. These can be obtained by optimizing the following:
$$\{c,R,t\}=\underset{c,R,t}{\arg\min}\sum_{k=1}^{n}\lVert\tilde{s}_k-(cRs_k+t)\rVert_2.$$
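One way to obtain the scale, rotation, and translation is sketched below. It assumes the sum of squared residuals is minimized, in which case the closed-form solution of Umeyama applies. That least-squares choice, the function names `fit_similarity` and `cvgtce`, and the eye-corner index parameters are assumptions of this example, not details stated in the disclosure.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares scale c, rotation R, translation t with dst ~ c*R*src + t."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / n              # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    sign = np.ones(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        sign[-1] = -1.0                    # avoid returning a reflection
    R = U @ np.diag(sign) @ Vt
    var_src = (src_c ** 2).sum() / n
    c = (D * sign).sum() / var_src
    t = mu_dst - c * R @ mu_src
    return c, R, t

def cvgtce(pred, truth, left_eye_idx, right_eye_idx):
    """Align pred to truth, then average the residuals normalized by eye distance."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    c, R, t = fit_similarity(pred, truth)
    aligned = c * pred @ R.T + t           # rows are points, hence R.T
    d = np.linalg.norm(truth[left_eye_idx] - truth[right_eye_idx])
    return np.linalg.norm(truth - aligned, axis=1).mean() / d
```

Because the prediction is aligned to the ground truth before the residual is measured, this metric is insensitive to global pose and scale and reflects only errors in the recovered shape.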
(91) GTE and CVGTCE for testing images of the applied embodiment were 5.88% and 3.97%, respectively.
(92) TABLE 2. Comparisons of the GTE and CVGTCE on 3DFAW challenge dataset.

| Participant | CVGTCE | GTE |
|---|---|---|
| psxab5 | 3.4767 | 4.5623 |
| Disclosed Embodiment | 3.9700 | 5.8835 |
| rpiisl | 4.9488 | 6.2071 |
| trigeorgis | 5.4595 | 7.6403 |
| olgabellon | 5.9093 | 10.8001 |
(93) In another aspect of the innovation, and to compare with state-of-the-art methods, cross-database testing was performed using the images of the BP4D-S database of Zhang, Yin, Cohn, Canavan, Reale, Horowitz, Liu, and Girard, discussed previously. An embodiment of the approach using the model pre-trained on the 3DFAW dataset of the previous section was tested. That is, no images or 3D data from BP4D-S were used as part of the training procedures. The procedure by Jourabloo and Liu in Pose-invariant 3d face alignment from The International Conference on Computer Vision (ICCV), (2015) was used to ensure a fair comparison. 100 images with yaw angle between 0 and 10 degrees, 500 images with yaw angle between 10 and 20 degrees, and 500 other images with yaw angle between 20 and 30 degrees were randomly selected, for a total of 1,100 images. Since the landmarks in the BP4D-S database are different from those of the challenge database, 45 overlapping landmarks were selected to test an embodiment of the innovative approach. The reported error in Jourabloo and Liu was calculated using the average of point-wise estimation error (APE) as follows:
(94)
(95) As shown in Table 3, the embodiment of the approach described herein achieves the smallest APE compared with Jourabloo and Liu and the baseline (i.e., using the 3D mean face of the samples in Zhang, Yin, Cohn, Canavan, Reale, Horowitz, Liu, and Girard).
(96) TABLE 3. Comparisons of the APE on BP4D-S database.

| Disclosed Embodiment | PIFA (Jourabloo and Liu) | Baseline |
|---|---|---|
| 4.14 | 4.75 | 5.02 |
(97) The various tests confirm that embodiments of the innovative approach precisely detect 3D facial landmarks, even under large head poses and varied facial expressions.
(98) Turning now to
(99) A trained deep neural network system 602 then may receive an incoming non-sample 2D image (here, pictured as 2D incoming image 614) at detector 606. In one embodiment, if the incoming image is a face, then augmentation 610C may be applied. It is to be appreciated that augmentation 610C may also be applied to the predetermined image data set. Per the discussion previously presented, functional mapping component 616 may use the optimized landmark and transform criteria to regress 2D image characteristics to detect landmark points, to recover depth information from image attributes, and to map the landmark points to yield a 3D shape. Here, the yielded shape is shown as being exported to a process image database 622. It is to be appreciated that the yield may be used in other manners, including real-time or near-real-time display and use. It is to be further appreciated that the system also works for types of objects other than faces. Substituting any other object (e.g., car) for the word face in the figure yields an algorithm to recover the 3D shape of that object from a single 2D image.
(100) It is to be further appreciated that the functional mapping component 616 can engage in a backpropagation manner by providing learned weights 624 back to the data augmentation component 610, as has been discussed in embodiments previously.
(101) Turning to
(102) Functional mapping component 706 is here pictured in an alternative view, in that the component may be comprised of multiple layers and functions. Here, layers 1 through M 712 and functions 1 through N 714 are shown as being associated with a landmark criterion and layers M+1 through P and functions N+1 through Q are shown as being associated with a transform criterion (M, N, P, and Q being integers). It is to be appreciated that the earlier discussions concerning layers and functions are intended to be reflected in this alternative portrayal.
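The layered arrangement described above can be illustrated with a small feed-forward sketch in which a first group of layers regresses 2D landmark coordinates and a second group recovers a depth value per landmark. Everything concrete here is an assumption chosen for illustration: the layer sizes, the ReLU nonlinearity, the random initialization, the flattened feature-vector input, and the names `init_mlp`, `forward`, and `map_image_to_3d`.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Random weights and zero biases for a stack of fully connected layers."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Apply each layer in turn, with ReLU on all layers except the last."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def map_image_to_3d(image_features, landmark_params, transform_params, n_landmarks):
    """Two-stage mapping: features -> 2D landmarks -> depth -> (n, 3) shape."""
    # Stage 1 (layers associated with the landmark criterion): 2D landmarks
    landmarks_2d = forward(landmark_params, image_features)
    # Stage 2 (layers associated with the transform criterion): depth per landmark
    depth = forward(transform_params, landmarks_2d)
    xy = landmarks_2d.reshape(n_landmarks, 2)
    z = depth.reshape(n_landmarks, 1)
    return np.hstack([xy, z])
```

In this sketch the weights are untrained; in practice each group of layers would be fitted against its own criterion, as discussed earlier in the description.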
(103) Turning now to
(104) Example Computing Device
(105)
(106) Processor 921 may include one or more processors, each configured to execute instructions and process data to perform one or more functions associated with a computer for indexing images. Processor 921 may be communicatively coupled to RAM 922, ROM 923, storage 924, database 925, I/O devices 926, and interface 927. Processor 921 may be configured to execute sequences of computer program instructions to perform various processes. The computer program instructions may be loaded into RAM 922 for execution by processor 921. As used herein, processor refers to a physical hardware device that executes encoded instructions for performing functions on inputs and creating outputs.
(107) RAM 922 and ROM 923 may each include one or more devices for storing information associated with operation of processor 921. For example, ROM 923 may include a memory device configured to access and store information associated with controller 920, including information for identifying, initializing, and monitoring the operation of one or more components and subsystems. RAM 922 may include a memory device for storing data associated with one or more operations of processor 921. For example, ROM 923 may load instructions into RAM 922 for execution by processor 921.
(108) Storage 924 may include any type of mass storage device configured to store information that processor 921 may need to perform processes consistent with the disclosed embodiments. For example, storage 924 may include one or more magnetic and/or optical disk devices, such as hard drives, CD-ROMs, DVD-ROMs, or any other type of mass media device.
(109) Database 925 may include one or more software and/or hardware components that cooperate to store, organize, sort, filter, and/or arrange data used by controller 920 and/or processor 921. For example, database 925 may store hardware and/or software configuration data associated with input-output hardware devices and controllers, as described herein. It is contemplated that database 925 may store additional and/or different information than that listed above. It is to be appreciated that database 925 is portrayed in dashed lines. As discussed herein in relation to several embodiments, database 925 may be co-located within workspace 902 or, similar to network 928 (i.e., the Internet) and computing device 929, may exist outside of workspace 902.
(110) I/O devices 926 may include one or more components configured to communicate information with a user associated with controller 920. For example, I/O devices may include a console with an integrated keyboard and mouse to allow a user to maintain a database of images, update associations, and access digital content. I/O devices 926 may also include a display including a graphical user interface (GUI) for outputting information on a monitor. I/O devices 926 may also include peripheral devices such as, for example, a printer for printing information associated with controller 920, a user-accessible disk drive (e.g., a USB port, a floppy, CD-ROM, or DVD-ROM drive, etc.) to allow a user to input data stored on a portable media device, a microphone, a speaker system, or any other suitable type of interface device.
(111) Interface 927 may include one or more components configured to transmit and receive data via a communication network, such as the Internet, a local area network, a workstation peer-to-peer network, a direct link network, a wireless network, or any other suitable communication platform. For example, interface 927 may include one or more modulators, demodulators, multiplexers, de-multiplexers, network communication devices, wireless devices, antennas, modems, and any other type of device configured to enable data communication via a communication network.
(112) While the methods and systems have been described in connection with preferred embodiments and specific examples, it is not intended that the scope be limited to the particular embodiments set forth, as the embodiments herein are intended in all respects to be illustrative rather than restrictive.
(113) Unless otherwise expressly stated, it is in no way intended that any method set forth herein be construed as requiring that its steps be performed in a specific order. Accordingly, where a method claim does not actually recite an order to be followed by its steps or it is not otherwise specifically stated in the claims or descriptions that the steps are to be limited to a specific order, it is in no way intended that an order be inferred, in any respect. This holds for any possible non-express basis for interpretation, including: matters of logic with respect to arrangement of steps or operational flow; plain meaning derived from grammatical organization or punctuation; the number or type of embodiments described in the specification. Throughout this application, various publications may be referenced. The disclosures of these publications are incorporated by reference herein in their entirety into this application in order to more fully describe the state of the art to which the methods and systems pertain. It will be apparent to those skilled in the art that various modifications and variations can be made without departing from the scope or spirit. Other embodiments will be apparent to those skilled in the art from consideration of the specification and practice disclosed herein. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit being indicated by the following claims.
(114) What has been described above includes examples of the innovation. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the subject innovation, but one of ordinary skill in the art may recognize that many further combinations and permutations of the innovation are possible. Accordingly, the innovation is intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim.