Photo-realistic synthesis of three dimensional animation with facial features synchronized with speech
09613450 · 2017-04-04
Assignee
Inventors
- Lijuan Wang (Beijing, CN)
- Frank Soong (Beijing, CN)
- Qiang HUO (Beijing, CN)
- Zhengyou Zhang (Bellevue, WA)
CPC classification
G10L2021/105
PHYSICS
International classification
Abstract
Dynamic texture mapping is used to create a photorealistic three dimensional animation of an individual with facial features synchronized with desired speech. Audiovisual data of an individual reading a known script is obtained and stored in an audio library and an image library. The audiovisual data is processed to extract feature vectors used to train a statistical model. An input audio feature vector corresponding to desired speech with which the animation will be synchronized is provided. The statistical model is used to generate a trajectory of visual feature vectors that corresponds to the input audio feature vector. These visual feature vectors are used to identify a matching image sequence from the image library. The resulting sequence of images, concatenated from the image library, provides a photorealistic image sequence with facial features, such as lip movements, synchronized with the desired speech. This image sequence is applied to a three-dimensional model of the head.
Claims
1. A computer-implemented method for generating photo-realistic facial animation synchronized with speech, comprising: storing, in a computer storage device, a statistical model of audiovisual data over time, based on acoustic feature vectors from actual audio data and visual feature vectors of lips images extracted from real sample images of a head and facial features of an individual during a set of utterances by the individual; storing, in an image library, the real sample images of the individual's head and facial features during the set of utterances, including storing for each of the stored real sample images the visual feature vectors obtained from the lips image extracted from the real sample image as used to generate the statistical model; receiving an input set of acoustic feature vectors for the speech with which the facial animation is to be synchronized; using a computer processor, applying the received input set of acoustic feature vectors to the statistical model, the statistical model thereby generating a visual feature vector sequence; selecting, using a computer processor, a sequence of real sample images of the individual's head and facial features from the image library, such that the selected sequence matches the visual feature vector sequence generated using the statistical model by comparing visual feature vectors in the visual feature vector sequence with visual feature vectors associated with the real sample images in the image library; and using a computer processor, applying the selected sequence of real sample images to a three-dimensional model of a head to provide the photo-realistic facial animation synchronized with the speech.
2. The computer-implemented method of claim 1, further comprising generating the statistical model, wherein generating the statistical model comprises: obtaining actual audiovisual data including a plurality of samples including real sample images of the individual's facial features for a set of utterances; extracting the acoustic feature vectors and the visual feature vectors for each sample of the audiovisual data; and training the statistical model using the acoustic feature vectors and the visual feature vectors.
3. The computer-implemented method of claim 1, wherein generating the visual feature vector sequence comprises maximizing a likelihood function with respect to the input acoustic feature vectors and the statistical model.
4. The computer-implemented method of claim 1, wherein selecting the sequence of real sample images comprises selecting a set of real sample images that minimizes a cost function.
5. The computer-implemented method of claim 4, wherein the cost function comprises a target cost indicative of a difference between a visual feature vector in the generated visual feature vector sequence and a visual feature vector related to a real sample image.
6. The computer-implemented method of claim 5, wherein the cost function comprises a concatenation cost indicative of a difference between adjacent real sample images in the selected sequence of real sample images.
7. The computer-implemented method of claim 1, wherein selecting the sequence of real sample images from the image library comprises identifying a sequence of real sample images from the image library that matches the generated visual feature vector sequence based on both a target cost and a concatenation cost.
8. The computer-implemented method of claim 1, wherein applying the selected sequence of real sample images comprises: generating, using a computer processor, a sequence of images of the individual's head and facial features from the selected sequence of real sample images; accessing an animated three-dimensional model of a head of the individual comprising a plurality of frames corresponding to the generated sequence of images; and using a computer processor, applying the generated sequence of images to the three dimensional model as a texture, such that different frames of the animated three-dimensional model are textured by different images of the generated sequence of images, to provide the photo-realistic facial animation synchronized with the speech.
9. A computer system for generating photo-realistic facial animation synchronized with speech, comprising: a computer storage device storing a statistical model of audiovisual data over time, based on acoustic feature vectors from actual audio data and visual feature vectors of lips images extracted from real sample images of a head and facial features of an individual during a set of utterances by the individual; an image library storing real sample images of the individual's head and facial features during the set of utterances, the image library further storing for each of the stored real sample images the visual feature vectors obtained from the lips image extracted from the real sample image as used to generate the statistical model; a synthesis module having an input for receiving an input set of feature vectors for speech with which the facial animation is to be synchronized, and providing as an output a visual feature vector sequence corresponding to the input set of feature vectors according to the statistical model; an image selection module having an input for receiving the visual feature vector sequence from the output of the synthesis module, and accessing the image library using the received visual feature vector sequence to generate an output providing a sequence of real sample images of the individual's head and facial features from the image library having visual feature vectors that match the visual feature vectors in the visual feature vector sequence received from the synthesis module by comparing visual feature vectors in the visual feature vector sequence with visual feature vectors associated with the real sample images in the image library; and an animation module having an input for receiving a three dimensional model of a head and the sequence of real sample images from the image selection module, and an output providing the facial animation synchronized with the speech.
10. The computer system of claim 9, further comprising: a training module having an input receiving acoustic feature vectors and visual feature vectors from the audiovisual data of an individual's facial features during the set of utterances and providing as an output a statistical model of the audiovisual data over time.
11. The computer system of claim 10, wherein the training module comprises: a feature extraction module having an input for receiving the audiovisual data and providing an output including the acoustic feature vectors and the visual feature vectors corresponding to each sample of the audiovisual data; and a statistical model training module having an input for receiving the acoustic feature vectors and the visual feature vectors and providing as an output the statistical model.
12. The computer system of claim 9, wherein the synthesis module implements a maximum likelihood function with respect to the input acoustic feature vectors and the statistical model.
13. The computer system of claim 9, wherein the image selection module implements a cost function and identifies a set of real sample images that minimizes the cost function.
14. The computer system of claim 13, wherein the cost function comprises a target cost indicative of a difference between a visual feature vector in the visual feature vector sequence and a visual feature vector related to a real sample image.
15. The computer system of claim 14, wherein the cost function comprises a concatenation cost indicative of a difference between adjacent real sample images in the sequence of real sample images.
16. The computer system of claim 9, wherein the image selection module accesses the image library using the visual feature vector sequence to identify a sequence of real sample images from the image library that matches the visual feature vector sequence based on both a target cost and a concatenation cost.
17. A computer program product comprising: a computer storage device comprising at least one of a memory device or storage device; computer program instructions stored on the computer storage device that, when processed by a computing device, instruct the computing device to perform a method for generating photo-realistic facial animation synchronized with speech, comprising: storing, in a computer storage device, a statistical model of audiovisual data over time, based on acoustic feature vectors from actual audio data and visual feature vectors of lips images extracted from real sample images of a head and facial features of an individual during a set of utterances by the individual; storing, in an image library, real sample images of the individual's head and facial features during the set of utterances, the image library further storing for each of the stored real sample images the visual feature vectors obtained from the lips image extracted from the real sample image as used to generate the statistical model; receiving an input set of acoustic feature vectors for the speech with which the facial animation is to be synchronized; using a computer processor, applying the received input set of acoustic feature vectors to the statistical model, the statistical model thereby generating a visual feature vector sequence; selecting, using a computer processor, a sequence of real sample images of the individual's head and facial features from the image library, such that the selected sequence matches the visual feature vector sequence generated using the statistical model by comparing visual feature vectors in the visual feature vector sequence with visual feature vectors associated with the real sample images in the image library; and using a computer processor, applying the selected sequence of real sample images to a three-dimensional model of a head to provide the photo-realistic facial animation synchronized with the speech.
18. The computer program product of claim 17, further comprising generating the statistical model, wherein generating the statistical model comprises: obtaining audiovisual data including a plurality of samples, including real sample images of the individual's facial features for the set of utterances; extracting the acoustic feature vectors and the visual feature vectors for each sample of the audiovisual data; and training the statistical model using the acoustic feature vectors and the visual feature vectors.
19. The computer program product of claim 17, wherein selecting the sequence of real sample images comprises selecting a set of real sample images that minimizes a cost function.
20. The computer program product of claim 19, wherein the cost function comprises a target cost indicative of a difference between a visual feature vector in the generated visual feature vector sequence and a visual feature vector related to a real sample image, and a concatenation cost indicative of a difference between adjacent images in the sequence of real sample images.
Description
DETAILED DESCRIPTION
(8) The following section provides an example system environment in which generation of photorealistic three-dimensional animation can be used.
(9) Referring now to
(10) The application 100 can use a talking head for a variety of purposes. For example, the application 100 can be a computer-assisted language learning application, a language dictionary (e.g., to demonstrate pronunciation), an email reader, a news reader, a book reader, a text-to-speech system, an intelligent voice agent, an avatar of an individual for a virtual meeting room, a virtual agent in a dialogue system, video conferencing, online chatting, gaming, movie animation, or any other application that provides visual and speech-based interaction with an individual.
(11) In general, such an application 100 provides an input, such as text 110, or optionally speech 112, to a synthesis module 104, which in turn generates an image sequence 106 with lip movements synchronized with speech that matches the text or the input speech. The synthesis module 104 relies on a model 108, described in more detail below. The operation of the synthesis module also is described in more detail below. The image sequence is applied to a three-dimensional model 130 of the individual's head by a 3D animation system 132 to provide 3D animation 134.
(12) When text is provided by the application 100, the text 110 is input to a text-to-speech conversion module 114 to generate speech 112. The application 100 also might provide a speech signal 112, in which case the text-to-speech conversion is not used and the synthesis module generates an image sequence 106 using the speech signal 112.
(13) The speech signal 112 and the three-dimensional animation 134 are played back using a synchronized playback module 120, which generates audiovisual signals 122 that are output to the end user through an audiovisual output device 102. The synchronized playback module may reside in a computing device at the end user's location, such as a general-purpose computer or a game console (for example, the XBOX or KINECT consoles from Microsoft Corporation), or may reside in a remote computer.
(14) Having now described the application environment in which the synthesis of image sequences may be used and applied to a three dimensional model of a head, how such three dimensional animations are generated will now be described.
(15) Referring now to
(16)
(17) The model 204 is used by a synthesis module 206 to generate a visual feature vector sequence corresponding to an input set of feature vectors for speech with which the facial animation is to be synchronized. The input set of feature vectors for speech is derived from input 208, which may be text or speech. The visual feature vector sequence is used to select an image sample sequence from an image library (part of the model 204). This image sample sequence is processed to provide the photo-realistic image sequence 210 to be synchronized with speech signals corresponding to the input 208 of the synthesis module.
(18) The training module, in general, would be used once for each individual for whom a model is created for generating photorealistic image sequences. The synthesis module is used each time a new text or speech sequence is provided for which a new image sequence is to be synthesized from the model. It is possible to create, store, and re-use image sequences from the synthesis module instead of recomputing them each time.
(19) Also shown in
(20) For example, where the input is a video sequence containing a face rotating from a frontal view to a profile view before a fixed camera, the techniques described in Le Xin, Qiang Wang, Jianhua Tao, Xiaoou Tang, Tieniu Tan, and Harry Shum, Automatic 3D Face Modeling from Video, in Proc. ICCV'05, may be used. This technique involves performing automatic initialization in the first frame with an approximately frontal face. Then, to handle the case of low-quality images captured by a low-cost camera, the 2D feature matching, head poses, and underlying 3D face shape are estimated and refined iteratively in an efficient way based on image sequence segmentation. Finally, to take advantage of the sparse structure of the algorithm, a sparse bundle adjustment technique is further employed to speed up the computation.
(21) In some cases, the three-dimensional model of the head can be generated from a single frontal image of the individual, as described in Yuxiao Hu, Dalong Jiang, Shuicheng Yan, Lei Zhang, Hongjiang Zhang, Automatic 3D Reconstruction for Face Recognition, in Proc. of the Sixth IEEE International Conference on Automatic Face and Gesture Recognition (FGR'04). In particular, a frontal face image of a subject with normal illumination and neutral expression is input. A semi-supervised ranking prior likelihood model for accurate local search and a robust parameter estimation approach are used for face alignment. Based on this 2D alignment algorithm, 83 key feature points are automatically located. The feature points are accurate enough for face reconstruction in most cases. A general 3D face model is applied for personalized 3D face reconstruction. The 3D shapes are compressed by principal component analysis (PCA). After the 2D face alignment, the key feature points are used to compute the 3D shape coefficients of the eigenvectors. Then, the coefficients are used to reconstruct the 3D face shape. Finally, the face texture is extracted from the input image. By mapping the texture onto the 3D face geometry, the 3D face model for the input 2D face image is reconstructed.
(22) Alternatively, the three-dimensional model can be created through other techniques, such as sampling or motion capture or other common modeling techniques. The output image sequence 210 from the synthesis module is applied to the three-dimensional model 222 of the head in an animation module 224 to provide animation 226.
(23) Training of the statistical model will be described first in connection with
(24) In
(25) Because a reader typically moves his or her head naturally during recording, the images can be normalized for head position by a head pose normalization module 302. For example, the poses in each frame of the recorded audiovisual content are normalized and aligned to a full-frontal view. An example implementation of head pose normalization is to use the techniques found in Q. Wang, W. Zhang, X. Tang, H. Y. Shum, Real-time Bayesian 3-d pose tracking, IEEE Transactions on Circuits and Systems for Video Technology 16(12) (2006), pp. 1533-1541. Next, the images of just the articulators (i.e., the mouth, lips, teeth, tongue, etc.) are cropped out with a fixed rectangular window, and a library of lips sample images is made. These images also may be stored in the audiovisual database 300 and/or passed on to a visual feature extraction module 304.
(26) Using the library of lips sample images, the visual feature extraction module 304 generates a visual feature vector for each image. In one implementation, eigenvectors of the lips images are obtained by applying principal component analysis (PCA). In experiments, the top twenty eigenvectors captured about 90% of the accumulated variance; therefore, twenty eigenvectors are used for each lips image. Thus the visual feature vector for each lips image S is given by its PCA vector,
V^T = S^T W,  (1)
where W is the projection matrix formed by the top 20 eigenvectors of the lips images.
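The projection of Equation (1) can be sketched as follows. This is an illustrative NumPy sketch, not the patent's implementation: the function names, the 8×8 crop size, and the mean-centering step are assumptions added for demonstration.

```python
import numpy as np

def fit_lips_pca(lips_images, n_components=20):
    """Fit a PCA basis for flattened lips images (one image per row).

    Returns the mean image and a projection matrix W whose columns are
    the top eigenvectors, so a feature vector is (S - mean) @ W, in the
    spirit of Equation (1): V = S W.
    """
    X = np.asarray(lips_images, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean
    # Top eigenvectors of the covariance, obtained via SVD of the data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components].T          # shape: (n_pixels, n_components)
    return mean, W

def lips_feature(image, mean, W):
    """Project one flattened lips image onto the PCA basis."""
    return (np.asarray(image, dtype=float) - mean) @ W

# Example: 50 random 8x8 "lips crops" flattened to 64-dim vectors.
rng = np.random.default_rng(0)
images = rng.normal(size=(50, 64))
mean, W = fit_lips_pca(images, n_components=20)
v = lips_feature(images[0], mean, W)
```

Each lips image is thus reduced to a 20-dimensional vector, matching the dimensionality chosen in the text.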
(27) Acoustic feature vectors for the audio data related to each of the lips sample images also are created, using conventional techniques such as by computing the Mel-frequency cepstral coefficients (MFCCs).
(28) Next, the audio and video feature vectors 305 (which also may be stored in the audiovisual library) are used by a statistical model training module 307 to generate a statistical model 306. In one implementation, acoustic vectors A_t = [a_t^T, Δa_t^T, Δ²a_t^T]^T and visual vectors V_t = [v_t^T, Δv_t^T, Δ²v_t^T]^T are used, formed by augmenting the static features with their dynamic counterparts to represent the audio and video data. Audio-visual hidden Markov models (HMMs), λ, are trained by maximizing the joint probability p(A, V|λ) over the acoustic and visual training vectors. In order to capture contextual effects, context-dependent HMMs are trained, and tree-based clustering is applied to the acoustic and visual feature streams separately to improve the robustness of the corresponding models. For each audiovisual HMM state, a single Gaussian mixture model (GMM) is used to characterize the state output. Each state q has mean vectors μ_q^(A) and μ_q^(V). In one implementation, diagonal covariance matrices Σ_q^(AA) and Σ_q^(VV), and null covariance matrices Σ_q^(AV) and Σ_q^(VA), are used, by assuming independence between the audio and visual streams and between different components. Training of an HMM is described, for example, in Fundamentals of Speech Recognition by Lawrence Rabiner and Biing-Hwang Juang, Prentice-Hall, 1993.
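The augmentation of static features with their dynamic counterparts can be sketched as follows. This is an illustrative sketch only: the patent does not specify the delta-window coefficients, so a simple central difference with edge padding is assumed here.

```python
import numpy as np

def add_dynamics(static):
    """Augment static feature vectors (one frame per row) with simple
    first- and second-order differences (delta and delta-delta),
    producing rows of the form [a_t, delta a_t, delta^2 a_t]."""
    static = np.asarray(static, dtype=float)
    padded = np.pad(static, ((1, 1), (0, 0)), mode="edge")
    delta = 0.5 * (padded[2:] - padded[:-2])           # central difference
    padded_d = np.pad(delta, ((1, 1), (0, 0)), mode="edge")
    delta2 = 0.5 * (padded_d[2:] - padded_d[:-2])
    return np.hstack([static, delta, delta2])

frames = np.arange(12, dtype=float).reshape(6, 2)      # 6 frames, 2-dim static
aug = add_dynamics(frames)                             # 6 frames, 6-dim A_t
```

The same augmentation would be applied to both the acoustic and the visual feature streams before HMM training.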
(29) Referring now to
(30) Having now described how a statistical model is trained using audiovisual data, the process of synthesizing an image sequence using this model will now be described in more detail.
(31) Referring now to
(32) An implementation of module 500 is as follows. Given a continuous audiovisual HMM λ and acoustic feature vectors A = [A_1^T, A_2^T, . . . , A_T^T]^T, the module identifies a visual feature vector sequence V = [V_1^T, V_2^T, . . . , V_T^T]^T such that the following likelihood function is maximized:
p(V|A, λ) = Σ_{all Q} p(Q|A, λ) · p(V|A, Q, λ),  (2)
(33) Equation (2) is maximized with respect to V, where Q is the state sequence. In particular, at frame t, p(V_t|A_t, q_t, λ) is given by:
p(V_t|A_t, q_t, λ) = N(V_t; μ̂_{q_t}, Σ̂_{q_t}^(VV)),  (3)
μ̂_{q_t} = μ_{q_t}^(V) + Σ_{q_t}^(VA) (Σ_{q_t}^(AA))^{-1} (A_t − μ_{q_t}^(A)),  (4)
Σ̂_{q_t}^(VV) = Σ_{q_t}^(VV) − Σ_{q_t}^(VA) (Σ_{q_t}^(AA))^{-1} Σ_{q_t}^(AV).  (5)
(34) The optimal state sequence Q is determined by maximizing the likelihood function p(Q|A, λ) with respect to the given acoustic feature vectors A and model λ. Then, the logarithm of the likelihood function is written as
log p(V|A, Q, λ) = log N(V; μ̂^(V), Σ̂^(VV)) = −½ V^T (Σ̂^(VV))^{-1} V + V^T (Σ̂^(VV))^{-1} μ̂^(V) + K,  (6)
where
μ̂^(V) = [μ̂_{q_1}^T, μ̂_{q_2}^T, . . . , μ̂_{q_T}^T]^T,  (7)
Σ̂^(VV) = diag[Σ̂_{q_1}^(VV), Σ̂_{q_2}^(VV), . . . , Σ̂_{q_T}^(VV)].  (8)
(35) The constant K is independent of V. The relationship between a sequence of the static feature vectors C = [c_1^T, c_2^T, . . . , c_T^T]^T and a sequence of the static and dynamic feature vectors V can be represented as a linear conversion,
V = W_c C,  (9)
(36) where W_c is a transformation matrix, such as described in K. Tokuda, H. Zen, et al., The HMM-based speech synthesis system (HTS).
(37) By setting ∂ log p(W_c C|A, Q, λ)/∂C = 0, the visual feature vector sequence V̂_opt that maximizes the logarithmic likelihood function is given by
V̂_opt = W_c C_opt = W_c (W_c^T (Σ̂^(VV))^{-1} W_c)^{-1} W_c^T (Σ̂^(VV))^{-1} μ̂^(V).  (10)
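The closed-form solution of Equation (10) can be sketched numerically as follows. This is an illustrative NumPy sketch: the toy window matrix for one-dimensional static features and the example target means are assumptions for demonstration, not values from the patent.

```python
import numpy as np

def ml_trajectory(mu_hat, sigma_inv, Wc):
    """Solve Equation (10): V_opt = Wc (Wc^T S^-1 Wc)^-1 Wc^T S^-1 mu_hat,
    where S is the stacked covariance Sigma^(VV) and Wc maps the static
    trajectory C to the static+dynamic sequence V."""
    A = Wc.T @ sigma_inv @ Wc
    b = Wc.T @ sigma_inv @ mu_hat
    C_opt = np.linalg.solve(A, b)
    return Wc @ C_opt

def window_matrix(T):
    """Toy Wc for 1-dim static features: for each frame, one identity
    (static) row and one central-difference (delta) row, interleaved."""
    I = np.eye(T)
    D = np.zeros((T, T))
    for t in range(T):
        lo, hi = max(t - 1, 0), min(t + 1, T - 1)
        D[t, hi] += 0.5
        D[t, lo] -= 0.5
    rows = []
    for t in range(T):
        rows.append(I[t])
        rows.append(D[t])
    return np.array(rows)

T = 5
Wc = window_matrix(T)                      # shape (2T, T)
mu_hat = np.zeros(2 * T)
mu_hat[0::2] = [0.0, 1.0, 4.0, 1.0, 0.0]   # static means; delta means zero
V_opt = ml_trajectory(mu_hat, np.eye(2 * T), Wc)
```

With Σ̂ taken as the identity, the problem reduces to least squares, and the zero-mean delta rows act as a smoothness penalty, so the solved static trajectory is a smoothed version of the per-frame target means.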
(38) The visual feature vector sequence 506 is a compact description of articulator movements in the lower-rank eigenvector space of the lips images. However, the lips image sequence to which it corresponds, if used directly as an output image sequence, would be blurred due to: (1) dimensionality reduction in PCA; and (2) maximum likelihood (ML)-based model parameter estimation and trajectory generation. Therefore, this trajectory is instead used to guide selection of the real sample images, which in turn are concatenated to construct the output image sequence. In particular, an image selection module 508 receives the visual feature vector sequence 506 and searches the audiovisual database 510 for the real image sample sequence 512 in the library that is closest to the predicted trajectory, as the optimal solution. Thus, the articulator movement in the visual trajectory is reproduced, and photo-realistic rendering is provided by using real image samples.
(39) An implementation of the image selection module 508 is as follows. First, the total cost for a sequence of T selected samples S̃_1^T = {S̃_1, S̃_2, . . . , S̃_T} is the weighted sum of the target and concatenation costs:
C(V̂_1^T, S̃_1^T) = Σ_{i=1}^T w^t C^t(V̂_i, S̃_i) + Σ_{i=2}^T w^c C^c(S̃_{i−1}, S̃_i).  (11)
(40) The target cost of an image sample S̃_i is measured by the Euclidean distance between its PCA vector and the corresponding vector in the generated trajectory:
C^t(V̂_i, S̃_i) = ‖V̂_i − S̃_i^T W‖.  (12)
(41) The concatenation cost is measured by the normalized 2-D cross correlation (NCC) between two image samples S̃_i and S̃_j, as Equation (13) below shows. Since the correlation coefficient ranges in value from −1.0 to 1.0, NCC is by nature a normalized similarity score.
(42)
NCC(S_i, S_j) = Σ_{x,y} (S_i(x, y) − S̄_i)(S_j(x, y) − S̄_j) / √[Σ_{x,y} (S_i(x, y) − S̄_i)² · Σ_{x,y} (S_j(x, y) − S̄_j)²].  (13)
(43) Assume that the corresponding samples of S̃_i and S̃_j in the sample library are S_p and S_q, i.e., S̃_i = S_p and S̃_j = S_q, where p and q are the sample indexes in the video recording. Hence S_p and S_{p+1}, and S_{q−1} and S_q, are consecutive frames in the original recording. As defined in Equation (14), the concatenation cost between S̃_i and S̃_j is measured by the NCC of S_p and S_{q−1} and the NCC of S_{p+1} and S_q:
C^c(S̃_i, S̃_j) = C^c(S_p, S_q) = 1 − ½[NCC(S_p, S_{q−1}) + NCC(S_{p+1}, S_q)].  (14)
(44) Because NCC(S_p, S_p) = NCC(S_q, S_q) = 1, it follows that C^c(S_p, S_{p+1}) = C^c(S_{q−1}, S_q) = 0, so the selection of consecutive frames from the original recording is encouraged.
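The NCC similarity and the concatenation cost can be sketched as follows. This is an illustrative sketch: the image size and the random toy library are hypothetical, and the zero-mean form of NCC is assumed.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross correlation between two equally sized images,
    a similarity score in [-1, 1] in the spirit of Equation (13)."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def concat_cost(lib, p, q):
    """Concatenation cost of Equation (14) between library frames S_p and
    S_q; zero when q = p + 1, i.e. consecutive frames in the recording."""
    return 1.0 - 0.5 * (ncc(lib[p], lib[q - 1]) + ncc(lib[p + 1], lib[q]))

rng = np.random.default_rng(1)
lib = rng.normal(size=(10, 4, 4))   # toy library of 10 "frames" of 4x4 pixels
```

As the text notes, picking frames that were consecutive in the original recording drives this cost to zero, which favors smooth transitions.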
(45) The sample selection procedure is the task of determining the set of image samples S̃_1^T that minimizes the total cost defined by Equation (11), which is represented mathematically by Equation (15):
S̃_1^T = argmin_{S̃_1, . . . , S̃_T} C(V̂_1^T, S̃_1^T).  (15)
(46) Optimal sample selection can be performed with a Viterbi search. However, to obtain near real-time synthesis on a large dataset containing tens of thousands of samples, the search space is pruned. One example of such pruning is implemented in two parts. First, for every target frame in the trajectory, the K nearest samples are identified according to the target cost. The beam width K can be, for example, between 1 and N (the total number of images), and can be selected so as to provide the desired performance. Second, the remaining samples are pruned according to the concatenation cost.
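The pruned Viterbi-style search can be sketched as follows. This is an illustrative sketch: the cost functions, weights, beam width, and one-dimensional toy features are placeholders, not values from the patent.

```python
def select_samples(trajectory, library_feats, target_cost, concat_cost,
                   wt=1.0, wc=1.0, beam=5):
    """Viterbi-style sample selection minimizing a total cost in the form
    of Equation (11), with K-nearest pruning: for each target frame, only
    the `beam` samples closest in target cost remain as candidates."""
    T = len(trajectory)
    # Pruning step: K-nearest candidates per frame by target cost.
    cands = []
    for v in trajectory:
        scored = sorted(range(len(library_feats)),
                        key=lambda j: target_cost(v, library_feats[j]))
        cands.append(scored[:beam])
    # Dynamic programming over the pruned candidate sets.
    best = {j: (wt * target_cost(trajectory[0], library_feats[j]), [j])
            for j in cands[0]}
    for t in range(1, T):
        new_best = {}
        for j in cands[t]:
            tc = wt * target_cost(trajectory[t], library_feats[j])
            cost, path = min(
                ((c + wc * concat_cost(i, j), p) for i, (c, p) in best.items()),
                key=lambda x: x[0])
            new_best[j] = (cost + tc, path + [j])
        best = new_best
    return min(best.values(), key=lambda x: x[0])[1]

# Toy example: 1-dim features; concatenation cost favors consecutive indices.
feats = [float(j) for j in range(8)]
traj = [1.0, 2.0, 3.0]
tcost = lambda v, f: abs(v - f)
ccost = lambda i, j: 0.0 if j == i + 1 else 1.0
path = select_samples(traj, feats, tcost, ccost)
```

In this toy setup the search selects the consecutive library frames 1, 2, 3, mirroring how the concatenation cost of Equation (14) encourages runs of consecutive recorded frames.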
(47) The operation of a system such as shown in
(48) As a result of this image selection technique, a set of real images closely matching the predicted trajectory and smoothly transitioning between each other provide a photorealistic image sequence with lip movements that closely match the provided audio or text. This sequence is then applied to the three-dimensional model of the head.
(49) In particular, instead of using a precise model of geometry mesh deformation, local facial motion is obtained by overlaying a dynamic, time-varying texture (the image sequence generated by the synthesis module) on the structure. Unlike traditional texture mapping, which generates a single texture for a surface, multiple textures are used in rendering, at least one for each frame. The selection mechanism of the HMM enables a texture to be chosen from multiple textures according to the desired facial motions and expressions at different times. The selection mechanism of the HMM in the synthesis module can also be applied to the selection of images for different parts of the face, such as the eyes, wrinkle areas, and so on. By using dynamic texture mapping, several difficulties in rendering soft tissues such as the lips, tongue, eyes, and wrinkles are bypassed, and the 3D talking head is made to look photorealistic. With the automatically reconstructed 3D geometry model, the head pose, illumination, and facial expressions of the 3D talking head can be freely controlled. In particular, head movement can be controlled by rotating and translating the head mesh model, treating it as a rigid object. Different illumination can be realized by changing the lighting in the 3D rendering. Varying facial expressions, such as happy or sad, can be controlled by deforming the mesh model.
(50) The system for generating photorealistic three dimensional animations is designed to operate in a computing environment. The following description is intended to provide a brief, general description of a suitable computing environment in which this system can be implemented. The system can be implemented with numerous general purpose or special purpose computing hardware configurations. Examples of well known computing devices that may be suitable include, but are not limited to, personal computers, server computers, hand-held or laptop devices (for example, media players, notebook computers, cellular phones, personal data assistants, voice recorders), multiprocessor systems, microprocessor-based systems, set top boxes, game consoles, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
(51)
(52) With reference to
(53) Device 700 may also contain communications connection(s) 712 that allow the device to communicate with other devices. Communications connection(s) 712 is an example of communication media. Communication media typically carries computer program instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term modulated data signal means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal, thereby changing the configuration or state of the receiving device of the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
(54) Device 700 may have various input device(s) 714 such as a display, a keyboard, mouse, pen, camera, touch input device, and so on. Output device(s) 716 such as speakers, a printer, and so on may also be included. All of these devices are well known in the art and need not be discussed at length here.
(55) The system for generating photorealistic animation may be implemented in the general context of software, including computer-executable instructions and/or computer-interpreted instructions, such as program modules, being processed by a computing device. Generally, program modules include routines, programs, objects, components, data structures, and so on, that, when processed by the computing device, perform particular tasks or implement particular abstract data types. This system may be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
(56) Any or all of the aforementioned alternate embodiments described herein may be used in any combination desired to form additional hybrid embodiments. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. The specific features and acts described above are disclosed as example forms of implementing the claims.