Face replacement and alignment

Abstract

A face replacement system for replacing a target face with a source face can include a facial landmark determination model having a cascade multichannel convolutional neural network (CMC-CNN) to process both the target and the source face. A face warping module is able to warp the source face using determined facial landmarks that match the determined facial landmarks of the target face, and a face selection module is able to select a facial region of interest in the source face. An image blending module is used to blend the target face with the selected source region of interest.

Claims

1. A method comprising: accessing a source video frame including a source face of a person, the source face having source face shape; accessing a target video frame including a target face of another person; detecting a plurality of 2D source facial landmarks in the source face using a cascade multichannel convolutional neural network (CMC-CNN); detecting a corresponding plurality of 2D target facial landmarks in the target face using the CMC-CNN and that correspond to the plurality of 2D source facial landmarks; using a regressor to incrementally predict a target face shape based on the target face and an initial target face shape; finding another source video frame including another source face of the person that has another source face shape that is more similar to the target face shape than the source face shape; warping the other source face using the plurality of 2D source facial landmarks to match the corresponding plurality of 2D target facial landmarks; selecting a source facial region of interest in the other source face; and blending the target face with the source facial region of interest.

2. The method of claim 1, further comprising inputting the source video frame, the target video frame, the initial target face shape, and a ground truth shape to the CMC-CNN.

3. The method of claim 1, further comprising iterating through a cascade regression process to detect the corresponding plurality of 2D target facial landmarks.

4. The method of claim 1, wherein warping the other source face comprises using a Delaunay triangulation on the plurality of 2D source facial landmarks to maximize a minimum angle for a constructed triangle.

5. The method of claim 1, further comprising creating a binary mask from the source facial region of interest.

6. The method of claim 1, wherein blending the target face with the source facial region of interest comprises blending using a Poisson Image Editing.

7. A face replacement system for replacing a target face with a source face, comprising: a frame access module configured to access a source video frame including a source face of a person, wherein the source face includes a source face shape, and configured to access a target video frame includes a target face of another person; a facial landmark determination model configured to detect a plurality of 2D source facial landmarks in the source face using a cascade multichannel convolutional neural network (CMC-CNN) to process both the target and the source face and configured to detect a corresponding plurality of 2D target facial landmarks in the target face using the CMC-CNN and that correspond to the plurality of 2D source facial landmarks; a regressor configured to incrementally predict a target face shape based on the target face and an initial target face shape; a similarity module configured to find another source video frame including another source face of the person that has another source face shape that is more similar to the target face shape than the source face shape; a face warping module configured to warp the other source face using the plurality of 2D source facial landmarks to match the the corresponding plurality of 2D target facial landmarks; a face selection module configured to select a source facial region of interest in the other source face; and an image blending module configured to blend the target face with the source facial region of interest.

8. The face replacement system of claim 7, wherein the source video frame, the target video frame, the initial target face shape, and a ground truth shape are input to the CMC-CNN.

9. The face replacement system of claim 7, wherein the facial landmark determination model being configured to detect a plurality of 2D source facial landmarks comprises the facial landmark determination model being configured to iterate through a cascade regression process.

10. The face replacement system of claim 7, wherein the face warping module being configured to warp the other source face comprises the face warping module being configured to warp the source face using a Delaunay triangulation on the plurality of 2D source facial landmarks to maximize a minimum angle for a constructed triangle.

11. The face replacement system of claim 7, further comprising the face selection module being configured to create a binary mask from the selected facial region of interest in the source face.

12. The face replacement system of claim 7, wherein the image blending module being configured to blend the target face with the source facial region of interest comprising the image blending module being configured to use a Poisson Image Editing is used to blend the target face with the source facial region of interest selection.

13. A system comprising: a processor; and system memory coupled to the processor and storing instructions configured to cause the processor to: access a source video frame including a source face of a person, the source face having source face shape; access a target video frame including a target face of another person; detect a plurality of 2D source facial landmarks in the source face using a cascade multichannel convolutional neural network (CMC-CNN); detect a corresponding plurality of 2D target facial landmarks in the target face using the CMC-CNN and that correspond to the plurality of 2D source facial landmarks; use a regressor to incrementally predict a target face shape based on the target face and an initial target face shape; find another source video frame including another source face of the person that has another source face shape that is more similar to the target face shape than the source face shape; warp the other source face using the plurality of 2D source facial landmarks to match the corresponding plurality of 2D target facial landmarks; select a source facial region of interest in the other source face; and blend the target face with the source facial region of interest.

14. The system of claim 13, further comprising instructions configured to input the source video frame, the target video frame, the initial target face shape, and a ground truth shape to the CMC-CNN.

15. The system of claim 13, further comprising instructions configured to iterate through a cascade regression process to detect the corresponding plurality of 2D target facial landmarks.

16. The system of claim 13, wherein instructions configured to warp the other source face comprise instructions configured to use a Delaunay triangulation on the plurality of 2D source facial landmarks to maximize a minimum angle for a constructed triangle.

17. The system of claim 13, further comprising instructions configure to create a binary mask from the source facial region of interest.

18. The system of claim 13, wherein instructions configured to blend the target face with the source facial region of interest comprises instructions configured to blend using a Poisson Image Editing.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a system and method for video face replacement;

(2) FIG. 2 illustrates face replacement with facial landmark detection and seamless face blending;

(3) FIG. 3 illustrates Delaunay triangulation and local image warp;

(4) FIG. 4 illustrates binary mask construction and use;

(5) FIG. 5 illustrates face replacement results for male and female images.

(6) FIG. 6 illustrates a face alignment recurrent neural network architecture for a single image;

(7) FIG. 7 illustrates an algorithm usable with a face alignment recurrent neural network architecture for a single image;

(8) FIG. 8 illustrates a face alignment recurrent neural network architecture for video images; and

(9) FIG. 9 illustrates an algorithm usable with a face alignment recurrent neural network architecture for a video images.

DETAILED DESCRIPTION

(10) FIG. 1 is a cartoon illustrating a method 100 for an automatic face replacement system. Source and target face images are determined (step 110), and facial landmarks are determined using a cascade multichannel convolutional neural network (CMC-CNN) in step 112. The source face facial landmarks are warped to match target face facial landmarks (step 114), and face region of interest (RoI) is selected in step 116. In a final step, the target face is blended with source region of interest selection to complete the automatic face replacement.

(11) The described method of FIG. 1 can be implemented with system architectures such as discussed with respect to FIG. 2. As seen in FIG. 2, two videos can be input into the system 200: a source video 210 and a target video 220. The system 200 acts to select a complete face from source video 210 and transplant it into the target video 220 to provide a resultant video with a face replacement. A processing module 230 includes both facial landmark detection 240 and seamless face blending 250. The processing module 230 detects facial landmarks using a Cascade Multi-Channel Convolutional Neural Network (CMC-CNN) model based on deep learning in both source video and target video. In contrast to conventional CNN architectures, the CMC-CNN provides a multi-channel cascade able to reject non-useful facial landmarks at early, low resolution stages, while verifying facial landmarks during later high-resolution stages. Once determined, the facial landmarks can be used to construct 2D face models to capture facial motion in video. The processing module 230 also selects the candidate image from the source video.

(12) More specifically, the CMC-CNN model takes a single video frame I, an initial face shape S.sup.0 and the ground truth shapes as inputs, where S.sup.2*p denotes the 2D positions of facial landmarks, and p is the number of facial landmarks. The whole model works as a cascade.

(13) For an input facial image I.sub.i and the corresponding initial shape S.sub.i.sup.0, face shape can be predicted S.sub.i in a cascade manner. At stage t, the facial shape S.sub.i.sup.t is updated by refining S.sub.i.sup.t-1 with the shape increment S.sub.i.sup.t. The process can be presented as follows:
S.sub.i.sup.t=S.sub.i.sup.t-1+R.sup.t(I.sub.i,S.sub.i.sup.t-1)

(14) where R.sup.t denotes the regressor at stage t, which computes the shape increment S.sub.i.sup.t based on the image I.sub.i and previous facial shape S.sub.i.sup.t-1.

(15) In the training process, the t.sub.th stage regressor R.sup.t is learned by minimizing the alignment error on the training set {I.sub.i,.sub.i,S.sub.i.sup.0}.sub.i=1.sup.N. This process can be expressed as follows:

(16) $R^{t} = \underset{R^{t}}{\arg \min} {.Math.}_{i = 1}^{N} .Math. {\hat{S}}_{i} - {(S_{i}^{t - 1} + R^{t} (I_{i}, S_{i}^{t - 1}) .Math.}_{2}$

(17) where .sub.i denotes the ground truth shape of image I.sub.i.

(18) The predicted facial shape S.sub.i will be more and more close to the ground truth shape .sub.i through the cascade regression process. The process iterates until the predicted shape S.sub.i converges.

(19) Seamless face blending includes the steps of 1) face selection, 2) image warp, and 3) image blending. Face selection for a facial image I in the target video proceeds by obtaining its face shape S, and then finding a most similar image in the source video. First, all shapes are normalized by a meanshape. Then, l2norm is used to represent the similarity. More specifically, x.sub.i can be the position of the i.sup.th landmark in the image I and x.sub.i can be the position of the i.sup.th landmark of the meanshape. The scale parameter s and translation t can be calculated as follows:

(20) $s, t = \underset{s, t}{\arg \min} {.Math.}_{i = 1}^{p} {.Math. {\overline{x}}_{i} - (s .Math. x_{i} + t) .Math.}_{2}$

(21) Then the most similar face image in the source video can be retrieved as follows:

(22) ${idx = \underset{i}{\arg \min} .Math. S_{in}^{} - S_{i}^{}) .Math.}_{2}, i = 1, 2, 3 . . . M$

(23) Where S is the normalized shape, and M is the number of face images in the source video.

(24) Image warp proceeds by taking p facial landmarks and constructing a triangulation that would cover the convex hull of all the facial points. To achieve this, a Delaunay triangulation, which follows the max-min criterion can be constructed to maximize the minimum angles in all triangles. Next, a linear interpolation between two triangles is made. For instance, [(X.sub.1,Y.sub.1),(x.sub.1,y.sub.1)], [(X.sub.2,Y.sub.2),(x.sub.2,y.sub.2)] and [(X.sub.3,Y.sub.3),(x.sub.3,y.sub.3)] are three corresponding control points' coordinates, for which a linear interpolation function X=f(x,y) and Y=g(x,y) that overlays the triangles can be provided. This problem can be solved as follows:
Ax+By+CX+D=0

(25) where

(26) $A = .Math. \begin{matrix} y_{1} & X_{1} & 1 \\ y_{2} & X_{2} & 1 \\ y_{3} & X_{3} & 1 \end{matrix} .Math.; B = - .Math. \begin{matrix} x_{1} & X_{1} & 1 \\ x_{2} & X_{2} & 1 \\ x_{3} & X_{3} & 1 \end{matrix} .Math.; C = .Math. \begin{matrix} x_{1} & y_{1} & 1 \\ x_{2} & y_{2} & 1 \\ x_{3} & y_{3} & 1 \end{matrix} .Math.;$ $D = .Math. \begin{matrix} x_{1} & y_{1} & X_{1} \\ x_{2} & y_{2} & X_{2} \\ x_{3} & y_{3} & X_{3} \end{matrix} .Math. .$

(27) FIG. 3 illustrates the previously discussed Delaunay triangulation and local image warp with several pictures 300A, 300B, and 300C that include a face, facial landmarks, and triangulation. Picture 300A is the target face Delaunay triangulation; Picture 300B is the source face Delaunay triangulation; and Picture 300C is the warped source face referring to the target face Picture 300A.

(28) Image blending results in a blend of the warped source face into the target face, and is needed to produce natural and realistic face replacement results. The details of image blending are illustrated in 400A, 400B, and 400C of FIG. 4. Based on the detected landmarks in a face such as shown in 400A, a binary facial mask 400B is created. Some morphological operations such as corrosion and dilation can be applied to the binary mask to eliminate noise around the boundary of the mask. Then, utilizing the binary facial mask, an accurate facial Region of Interest (ROI) as seen in 400A. The system can automatically select the facial ROI, according to the detected facial landmarks. After getting the facial ROI in source image and target image, a Poisson Image Editing technique or other picture compositing method can be used to seamlessly blend the source face into target face. Some replacement results are shown in FIG. 5, where two sets of pictures (female and male) are illustrated. Pictures 500A are the target face images; 500B are the source face images; 500C are the replacement results without image blending; and 500D are the replacement results using Poisson Image Editing.

(29) Face alignment of video images, where key facial points are identified in a 2D image, can be approached with various methods. In one embodiment, for a single image version, given a data set with N training samples, denoted as {I.sub.i,.sub.i, S.sub.i.sup.0}.sub.i=1.sup.N, a network's parameter can be optimized as follows:

(30) $= \arg \min_{} f (I_{i}, {\hat{S}}_{i}, S_{i}^{0}, T,),$

(31) where .sub.i indicates the ground truth shape of image I.sub.i,S.sub.i.sup.0 indicates the initial shape, T indicates the stage number. In experiments, mean shape S is employed as the initial shape, which can be calculated as follows:

(32) $\overline{S} = \frac{1}{N} {.Math.}_{i = 1}^{N} {\hat{S}}_{i} .$

(33) f can be defined as:

(34) $f {.Math.}_{t = 1}^{T}_{t} {.Math.}_{i = 1}^{N} || (\hat{S_{i}} - S_{i}^{t - 1}) - R (I_{i}, S_{i}^{t - 1}, x_{i}^{t - 1}, .Math.,$

(35) where .sub.t indicates the factor of each stage, R indicates the regressor with parameter , x.sub.i.sup.t-1 indicates the middle level feature of stage t1. Also:

(36) $\begin{matrix} x_{i}^{t} = g (I_{i}, S_{i}^{t - 1}, x_{i}^{t - 1},), \\ x_{i}^{t - 1} = g (I_{i}, S_{i}^{t - 2}, x_{i}^{t - 2},), \\ .Math. \\ x_{i}^{1} = g (I_{i}, S_{i}^{0}, x_{i}^{0},), \\ x_{i}^{0} 0. \end{matrix}$

(37) which indicates that current stage t shape S.sub.i.sup.t is not only dependent on the stage t1 shape and middle-level feature X.sub.i.sup.t-1 but also all previous stage shapes and middle-level information.

(38) FIG. 6 and representative Algorithm 1 seen in FIG. 7, together illustrate an embodiment of a Face Alignment Recurrent Networks (FARN) system 600 for one single image. In the training process, the system 600 takes an entire image, an initial face shape, and ground truth shape as inputs 610. The system 600 first processes the whole image with several convolutional layers and max pooling layers 620 to produce a feature map 630. Then, for the initial face shape, Region of Interest (RoI) pooling 640 is provided around the region of each landmark. Then, these RoI pooling features are concatenated and mapped into fully-connected layers 642 and 646, and a long short-term memory (LSTM) layer 646. The network then outputs the predicted shape increment for the initial shape. In this embodiment, the initial shape can be updated (module 654). First, the above described method is recurrently processed, but with the initial shape 650 processed to provide an updated shape on the convolutional feature map. The process is set to recur T times. Note that ground truth shape increment is calculated (module 670) at each stage. LSTM layers of different stages can share weights and the network is end-to-end trained.

(39) The described system 600 and algorithm can be extended from image to video and can fully make use of information among frames.

(40) Similar to the previously discussed image version, given N.sub.V, N.sub.F long training video samples as {{I.sub.i,f,.sub.i,f}.sub.f=1.sup.N.sup.F,S.sub.i.sup.0}.sub.i=1.sup.N.sup.V, the same optimized function can be used as shown below. To make full use of the information between frames in videos, we define the f.sup.th frame of the video i image I.sub.i,f's initial shape S.sub.i,f.sup.0 as follows:
S.sub.i,f.sup.0=S.sub.i,f-1.sup.T

(41) Middle level information of previous frames is defined as follows:

(42) $\begin{matrix} x_{i, f}^{t} = g (I_{i, f}, S_{i, f}^{t - 1}, x_{i, f}^{t - 1},), \\ x_{i, f}^{t - 1} = g (I_{i, f}, S_{i, f}^{t - 2}, x_{i, f}^{t - 2},), \\ .Math. \\ \begin{matrix} x_{i, f}^{1} = g (I_{i, f}, S_{i, f}^{0}, x_{i, f}^{0},), \\ x_{i, f}^{0} = g (I_{i, f - 1}, S_{i, f - 1}^{T}, x_{i, f - 1}^{T},) . \end{matrix} \\ x_{i, 0}^{T} 0. \end{matrix}$

(43) The current stage t is not only dependent on the previous stage shapes and middle level information, but also on shapes and information in previous frames.

(44) FIG. 8 and representative Algorithm 2 seen in FIG. 9, together illustrate an embodiment of a Face Alignment Recurrent Networks system 800 suitable for video images. Similar to the network discussed with respect to FIG. 6, the system 800 processes multiple video image frames using initial face shape and ground truth shape as inputs. The system uses feature maps that are provided Region of Interest (RoI) pooling and concatenation layers. These layers are mapped into fully-connected layers, and a long short-term memory (LSTM) layer. The network then outputs the predicted shape increment for the initial shape.

(45) In effect, the disclosed methods turn existing cascade shape regression into a recurrent network-based approach, which can be jointly trained among stages to avoid over-strong/weak regressors as in the cascade fashion. In this way, the last several stage regressors can be trained well. Advantageously, in a deep neural network, the extracted middle level representation brings useful information for the shape estimation of the next stage. Such information can be modeled well in the LSTM layer. For face landmarks tracking, the current frame's results are dependent not only on the former frames' result, but also on the middle level information. Turning the existing cascade shape regression into a recurrent network based approach allows joint training between stages to avoid over-strong/weak regressors as in the cascade fashion.

(46) In the above disclosure, reference has been made to the accompanying drawings, which form a part hereof, and in which is shown by way of illustration specific implementations in which the disclosure may be practiced. It is understood that other implementations may be utilized and structural changes may be made without departing from the scope of the present disclosure. References in the specification to one embodiment, an embodiment, an example embodiment, etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to affect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.

(47) Implementations of the systems, devices, and methods disclosed herein may comprise or utilize a special purpose or general-purpose computer including computer hardware, such as, for example, one or more processors and system memory, as discussed herein. Implementations within the scope of the present disclosure may also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system. Computer-readable media that store computer-executable instructions are computer storage media (devices). Computer-readable media that carry computer-executable instructions are transmission media. Thus, by way of example, and not limitation, implementations of the disclosure can comprise at least two distinctly different kinds of computer-readable media: computer storage media (devices) and transmission media.

(48) Computer storage media (devices) includes RAM, ROM, EEPROM, CD-ROM, solid state drives (SSDs) (e.g., based on RAM), Flash memory, phase-change memory (PCM), other types of memory, other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer.

(49) An implementation of the devices, systems, and methods disclosed herein may communicate over a computer network. A network is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a transmission medium. Transmissions media can include a network and/or data links, which can be used to carry desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Combinations of the above should also be included within the scope of computer-readable media.

(50) Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. The computer executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the described features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.

(51) Those skilled in the art will appreciate that the disclosure may be practiced in network computing environments with many types of computer system configurations, including, an in-dash vehicle computer, personal computers, desktop computers, laptop computers, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, various storage devices, and the like. The disclosure may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. In a distributed system environment, program modules may be located in both local and remote memory storage devices.

(52) Further, where appropriate, functions described herein can be performed in one or more of: hardware, software, firmware, digital components, or analog components. For example, one or more application specific integrated circuits (ASICs) can be programmed to carry out one or more of the systems and procedures described herein. Certain terms are used throughout the description and claims to refer to particular system components. As one skilled in the art will appreciate, components may be referred to by different names. This document does not intend to distinguish between components that differ in name, but not function.

(53) It should be noted that the sensor embodiments discussed above may comprise computer hardware, software, firmware, or any combination thereof to perform at least a portion of their functions. For example, a sensor may include computer code configured to be executed in one or more processors, and may include hardware logic/electrical circuitry controlled by the computer code. These example devices are provided herein purposes of illustration, and are not intended to be limiting. Embodiments of the present disclosure may be implemented in further types of devices, as would be known to persons skilled in the relevant art(s).

(54) At least some embodiments of the disclosure have been directed to computer program products comprising such logic (e.g., in the form of software) stored on any computer useable medium. Such software, when executed in one or more data processing devices, causes a device to operate as described herein.

(55) While various embodiments of the present disclosure have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be apparent to persons skilled in the relevant art that various changes in form and detail can be made therein without departing from the spirit and scope of the disclosure. Thus, the breadth and scope of the present disclosure should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents. The foregoing description has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. Further, it should be noted that any or all of the aforementioned alternate implementations may be used in any combination desired to form additional hybrid implementations of the disclosure.

Face replacement and alignment

Assignee

Inventors

Cpc classification

Classification Explorer

G06V10/82

PHYSICS

Classification Explorer

G06T2207/20104

PHYSICS

Classification Explorer

G06T3/18

PHYSICS

Classification Explorer

G06V30/19173

PHYSICS

Classification Explorer

G06T3/153

PHYSICS

Classification Explorer

G06V40/161

PHYSICS

Classification Explorer

G06T2207/30201

PHYSICS

International classification

Classification Explorer

G06K9/00

PHYSICS

Classification Explorer

G06K9/66

PHYSICS

Classification Explorer

G06T3/00

PHYSICS

Abstract

Claims

Description