Apparatus and method for generating speech synthesis image
12573119 · 2026-03-10
Assignee
Inventors
CPC classification
G06T7/246
PHYSICS
G06T7/262
PHYSICS
International classification
G06T7/246
PHYSICS
Abstract
An apparatus according to an embodiment is a speech synthesis image generating apparatus based on machine learning. The apparatus includes a first global geometric transformation predictor to receive a source image and a target image including the same person, and predict a global geometric transformation for a global motion of the person between the source image and the target image based on the source image and the target image, a local geometric transformation predictor to predict a local geometric transformation for a local motion of the person between the source image and the target image based on preset input data, a geometric transformation combiner to calculate a full motion geometric transformation for a full motion of the person by combining the global geometric transformation and the local geometric transformation, and an image generator to reconstruct the target image based on the source image and the full motion geometric transformation.
Claims
1. An apparatus for generating a speech synthesis image based on machine learning, the apparatus comprising: at least one processor configured to implement; a first global geometric transformation predictor configured to be trained to receive each of a source image and a target image including the same person, and predict a global geometric transformation for a global motion of the person between the source image and the target image, based on the source image and the target image; a local geometric transformation predictor configured to be trained to predict a local geometric transformation for a local motion of the person between the source image and the target image, based on preset input data; a geometric transformation combiner configured to calculate a full motion geometric transformation for a full motion of the person by combining the global geometric transformation and the local geometric transformation; and an image generator configured to be trained to reconstruct the target image, based on the source image and the full motion geometric transformation, wherein the first global geometric transformation predictor is further configured to extract a geometric transformation into a source image heat map from a preset reference probability distribution, based on the source image, extract a geometric transformation into a target image heat map from the preset reference probability distribution, based on the target image, and calculate the global geometric transformation, based on the geometric transformation into the source image heat map from the reference probability distribution and the geometric transformation into the target image heat map from the reference probability distribution.
2. The apparatus according to claim 1, wherein the global motion is a motion of the person with an amount greater than or equal to a preset threshold amount of motion, and the local motion is a motion of a face when the person is speaking.
3. The apparatus according to claim 2, wherein the source image heat map is a probability distribution map in an image space as to whether each pixel in the source image is a pixel related to the global motion of the person, and the target image heat map is a probability distribution map in the image space as to whether each pixel in the target image is a pixel related to the global motion of the person.
4. The apparatus according to claim 2, wherein the local geometric transformation predictor comprises a first local geometric transformation predictor configured to be trained to predict a first local geometric transformation for a local speech motion of the person between the source image and the target image, based on the preset input data, and the local speech motion is a motion related to speech of the local motion of the person.
5. The apparatus according to claim 4, wherein the first local geometric transformation predictor is further configured to receive each of the source image and the target image and to be trained to predict the first local geometric transformation, based on the source image and the target image.
6. The apparatus according to claim 5, wherein the first local geometric transformation predictor is further configured to estimate, from the source image, source local geometric transformations that are a plurality of geometric transformations for the local speech motion of the person, estimate, from the target image, target local geometric transformations that are a plurality of geometric transformations for the local speech motion of the person, and calculate the first local geometric transformation, based on the source local geometric transformations and the target local geometric transformations.
7. The apparatus according to claim 4, wherein the local geometric transformation predictor further comprises a second local geometric transformation predictor configured to be trained to predict a second local geometric transformation for a local non-speech motion of the person between the source image and the target image, based on the preset input data, and the local non-speech motion is a motion not related to speech of the local motion of the person.
8. The apparatus according to claim 7, wherein the second local geometric transformation is further configured to receive each of a source partial image including only a motion not related to speech of the person from the source image and a target partial image including only a motion not related to speech of the person from the target image, and to be trained to predict the second local geometric transformation, based on the source partial image and the target partial image.
9. The apparatus according to claim 8, wherein the second local geometric transformation predictor is further configured to estimate, from the source partial image, source partial geometric transformations that are a plurality of geometric transformations for the local non-speech motion of the person, estimate, from the target partial image, target partial geometric transformations that are a plurality of geometric transformations for the local non-speech motion of the person, and calculate the second local geometric transformation, based on the source partial geometric transformations and the target partial geometric transformations.
10. The apparatus according to claim 7, wherein the geometric transformation combiner is further configured to calculate a full local geometric transformation by combining the first local geometric transformation and the second local geometric transformation, and calculate the full motion geometric transformation by combining the full local geometric transformation and the global geometric transformation.
11. The apparatus according to claim 1, wherein the first global geometric transformation predictor is further configured to calculate a geometric transformation into any i-th (i ∈ {1, 2, . . . , n}) (n is a natural number equal to or greater than 2) frame heat map in an image having n frames from a preset reference probability distribution when the image is input, and calculate a global geometric transformation between two adjacent frames in the image, based on the geometric transformation into the i-th frame heat map from the reference probability distribution.
12. The apparatus according to claim 11, further comprising a second global geometric transformation predictor configured to receive sequential voice signals corresponding to the n frames, and to be trained to predict a global geometric transformation between two adjacent frames in the image from the sequential voice signals.
13. The apparatus according to claim 12, wherein the second global geometric transformation predictor is further configured to adjust a parameter of an artificial neural network to minimize a difference between the global geometric transformation between the two adjacent frames which is predicted in the second global geometric transformation predictor and the global geometric transformation between the two adjacent frames which is calculated in the first global geometric transformation predictor.
14. The apparatus according to claim 13, wherein in a test process for speech synthesis image generation, the second global geometric transformation predictor is further configured to receive sequential voice signals of a person, calculate a global geometric transformation between two adjacent frames in an image corresponding to the sequential voice signals from the sequential voice signals, and calculate a global geometric transformation between a preset target frame and a preset start frame, based on the global geometric transformation between the two adjacent frames, the local geometric transformation predictor is further configured to receive each of the start frame and the target frame, and calculate a local geometric transformation between the target frame and the start frame, based on the start frame and the target frame, the geometric transformation combiner is further configured to calculate a full motion geometric transformation by combining the global geometric transformation and the local geometric transformation, and the image generator is further configured to receive the start frame and the full motion geometric transformation, and reconstruct the target frame from the start frame and the full motion geometric transformation.
15. A method for generating a speech synthesis image, based on machine learning that is performed in a computing device including one or more processors and a memory storing one or more programs executed by the one or more processors, the method comprising: training a first global geometric transformation predictor to receive each of a source image and a target image including the same person, and predict a global geometric transformation for a global motion of the person between the source image and the target image, based on the source image and the target image; training a local geometric transformation predictor to predict a local geometric transformation for a local motion of the person between the source image and the target image, based on preset input data; calculating, in a geometric transformation combiner, a full motion geometric transformation for a full motion of the person by combining the global geometric transformation and the local geometric transformation; and training an image generator to reconstruct the target image, based on the source image and the full motion geometric transformation, wherein the training of the first global geometric transformation predictor comprises: extracting a geometric transformation into a source image heat map from a preset reference probability distribution, based on the source image; extracting a geometric transformation into a target image heat map from the preset reference probability distribution, based on the target image; and calculating the global geometric transformation, based on the geometric transformation into the source image heat map from the reference probability distribution and the geometric transformation into the target image heat map from the reference probability distribution.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
(5)
DETAILED DESCRIPTION
(6) Hereinafter, a specific embodiment of the present disclosure will be described with reference to the drawings. The following detailed description is provided to aid in a comprehensive understanding of the methods, apparatus and/or systems described herein. However, this is illustrative only, and the present disclosure is not limited thereto.
(7) In describing the embodiments of the present disclosure, when it is determined that a detailed description of known technologies related to the present disclosure can unnecessarily obscure the subject matter of the present disclosure, the detailed description thereof will be omitted. In addition, terms to be described later are terms defined in consideration of functions in the present disclosure, which can vary according to the intention or custom of users or operators. Therefore, the definition should be made based on the contents throughout this specification. The terms used in the detailed description are only for describing embodiments of the present disclosure, and should not be limiting. Unless explicitly used otherwise, expressions in the singular form include the meaning of the plural form. In this description, expressions such as "comprising" or "including" are intended to refer to certain features, numbers, steps, actions, elements, or some or combinations thereof, and are not to be construed to exclude the presence or possibility of one or more other features, numbers, steps, actions, elements, or some or combinations thereof, other than those described.
(8) In addition, terms such as the first and second can be used to describe various components, but the components should not be limited by the terms. The above terms can be used for the purpose of distinguishing one component from another component. For example, without departing from the scope of the present disclosure, a first component can be referred to as a second component, and similarly, the second component can also be referred to as the first component.
(9) In the embodiments disclosed herein, the speech synthesis image is an image obtained by synthesizing a speech scene of a person through a machine learning model, and may also be referred to as a lip-sync image. The speech synthesis image may be an upper body image including the head and chest of a person, but is not limited thereto.
(10) In addition, a global motion may refer to a large motion of the person in the image in the overall frame. When the speech synthesis image is an upper body image, a global motion may refer to the motion of the entire upper body of the person in the image (e.g., the motion such as changing the posture of the upper body of the person in the image or turning the head of the person in the image). The global motion is a motion with an amount greater than or equal to a preset threshold amount, and the threshold amount may be set to represent a large motion of the person in the overall frame.
(11) In addition, a local motion may refer to a facial motion when a person in the image is speaking. That is, the local motion may refer to a change in facial expression, mouth and jaw motions, or the like that appear on the person's face when the person in the image is speaking. The local motion may be a motion with an amount below a threshold amount.
(12) In the embodiments disclosed herein, at the time of generating a speech synthesis image, generation of the global motion of a person in the image may be controlled by performing separation and estimation from the input image using a geometric transformation bottleneck, and generation of the local motion of the person in the image may be controlled using an input speech voice or the like.
(13) Specifically, when the full motion of a specific object in the image consists of a combination of motions of N independent elements, N geometric transformations may be required to fully estimate the full motion. Here, when the motions of the N independent elements are quantitatively different (for example, when the areas or volumes of parts of the object in the image to which the motions are applied are different or the sizes of the motions are different, and so on), the error associated with the motion of each element is proportional to the amount of the motion of the element.
(14) In this case, when the motion of an object is estimated through the machine learning model using K geometric transformations, where K is a number smaller than N (that is, K<N), the machine learning model prioritizes learning of the part with the largest error among the motions of the elements, and thus the K geometric transformations are derived so as to capture the element motions that are the largest.
(15) Therefore, when a bottleneck is formed by limiting the number of geometric transformations and element transformations constituting the geometric transformation, it is possible to separate and estimate the global motion of the person in the image (that is, large motion such as the motion of the head and torso of the person) from the local motion with a relatively small motion size (that is, facial motion when a person is speaking). The global motion may be a motion of a single element having the largest motion size, or may be a set of motions of a plurality of elements having a motion size greater than or equal to a preset size.
(16) In an exemplary embodiment, a single geometric transformation consisting of element transformations such as parallel translation, rotation, and scaling may be used to capture the global motion of a person in an image. Here, the parallel translation may be used to capture the overall movement of the upper body of the person in the image. Rotation and horizontal scaling may be used to capture the changes caused by the rotation of the head of the person in the image. The vertical scaling may be used to capture the vertical length change of the entire head and torso caused by the person in the image raising or lowering his or her head.
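By way of a non-limiting illustration, a single geometric transformation built from the element transformations named above may be sketched as a 2D affine matrix. The function names `compose_affine` and `apply_affine`, the 2x3 matrix layout, and the order of application (scaling, then rotation, then parallel translation) are assumptions made for this sketch only and are not taken from the embodiment.

```python
import numpy as np

def compose_affine(tx, ty, angle_rad, sx, sy):
    """Build a 2x3 affine matrix that scales, then rotates, then translates.

    The order of the element transformations is an assumption of this sketch.
    """
    rotation = np.array([[np.cos(angle_rad), -np.sin(angle_rad)],
                         [np.sin(angle_rad),  np.cos(angle_rad)]])
    scaling = np.diag([sx, sy])          # horizontal and vertical scaling
    linear = rotation @ scaling          # scaling is applied first
    translation = np.array([[tx], [ty]])
    return np.hstack([linear, translation])

def apply_affine(affine, points):
    """Apply a 2x3 affine matrix to an (N, 2) array of (x, y) points."""
    return points @ affine[:, :2].T + affine[:, 2]

# Quarter-turn rotation, vertical stretch by 2, and a small shift.
A = compose_affine(tx=0.1, ty=-0.05, angle_rad=np.pi / 2, sx=1.0, sy=2.0)
moved = apply_affine(A, np.array([[1.0, 0.0], [0.0, 1.0]]))
```

In this sketch the five scalar parameters form the bottleneck: a transformation restricted to them can follow a shift, turn, or stretch of the whole upper body but cannot represent fine mouth or eye motion.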
(17) For another example, the motion of the head of the person in the image may be captured through a geometric transformation consisting of parallel translation and rotation, and the motion of the torso of the person in the image may be captured through another geometric transformation consisting of only parallel translation.
(18) For another example, when a speech synthesis image includes only the head and the upper part of the neck of the person, the position of the neck is dependent on the head motion, and thus, the motion of the head of the person in the image may be captured using a single geometric transformation consisting of parallel translation and rotation.
(19) Meanwhile, the local motion of the person in the image may be divided into two motions. That is, the local motion may be divided into a motion related to speech (e.g., motions of the mouth (including lips) and jaw of the person) and a motion not related to speech (e.g., blinking of the eyes, eyebrow motion, frown, or the like, of the person). Hereinafter, the motion related to speech among the local motions may be referred to as a local speech motion. The motion not related to speech among the local motions may be referred to as a local non-speech motion.
(20) Here, for the local speech motion, an artificial neural network that uses image information of the person (source image, target image, etc.) as input may be added to the machine learning model, and the added artificial neural network may output a geometric transformation for the local motion related to speech (the local speech motion). This allows the geometric transformation for the local speech motion to be used in combination with the geometric transformation for the global motion. In this case, the components and the number of geometric transformations for predicting the local speech motion of the person can be appropriately set as needed.
(21) In addition, for the local non-speech motion, an artificial neural network that uses, as input, an image containing only motions not related to speech of the person (e.g., an image that contains only the areas around the eyes and eyebrows of the person), or feature points showing only the non-speech motion of the person, may be added to the machine learning model, and the added artificial neural network may output geometric transformations for the local motions not related to speech (the local non-speech motion). This allows the geometric transformation for the local non-speech motion to be used in combination with the geometric transformation for the global motion and the geometric transformation for the local speech motion.
(22)
(23) Referring to
(24) The first global geometric transformation predictor 102 may receive each of a source image I.sub.s and a target image I.sub.d. Here, the source image I.sub.s and the target image I.sub.d are a pair of images including the same person, and the speech synthesis image generating apparatus 100 may include an artificial neural network for generating the target image I.sub.d as the speech synthesis image by using the source image I.sub.s as an input.
(25) The source image I.sub.s and the target image I.sub.d may be a video part of an image (that is, including video and audio) in which a person is speaking. The source image I.sub.s and the target image I.sub.d may be images including the face and upper body of the person, but are not limited thereto.
(26) The first global geometric transformation predictor 102 may calculate a geometric transformation (hereinafter, may be referred to as a global geometric transformation) for a global motion between the source image I.sub.s and the target image I.sub.d. That is, the first global geometric transformation predictor 102 may calculate a global geometric transformation capable of expressing a difference in the global motion of the person between the source image I.sub.s and the target image I.sub.d (that is, a large motion such as the motion of the head and torso of the person). Hereinafter, the first global geometric transformation predictor 102 is described as calculating the global geometric transformation into the source image I.sub.s from the target image I.sub.d by way of example, but is not limited thereto, and may also calculate the global geometric transformation into the target image I.sub.d from the source image I.sub.s.
(27) Specifically, the first global geometric transformation predictor 102 may receive each of the source image I.sub.s and the target image I.sub.d, and may extract heat maps for the source image I.sub.s and the target image I.sub.d. That is, the first global geometric transformation predictor 102 may extract a heat map H.sub.s (source image heat map) for the source image I.sub.s from the source image I.sub.s. The first global geometric transformation predictor 102 may extract a heat map H.sub.d (target image heat map) for the target image I.sub.d from the target image I.sub.d.
(28) In an exemplary embodiment, the first global geometric transformation predictor 102 may be constituted by an artificial neural network based on a convolutional neural network (CNN), but is not limited thereto. The first global geometric transformation predictor 102 may extract each of the source image heat map H.sub.s and the target image heat map H.sub.d through Equation 1 below.
(29)
(30) Here, each of the source image heat map H.sub.s and the target image heat map H.sub.d may be a map represented by a probability distribution in an image space. That is, the source image heat map H.sub.s may be a probability distribution map in the image space as to whether each pixel in the source image I.sub.s is a pixel related to the global motion of the person. The target image heat map H.sub.d may be a probability distribution map in the image space as to whether each pixel in the target image I.sub.d is a pixel related to the global motion of the person. In order to achieve the above, an output end of the first global geometric transformation predictor 102 may include a 2D softmax layer.
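By way of a non-limiting illustration, the effect of a 2D softmax layer at the output end may be sketched as follows: a score map of arbitrary real values is normalized over the whole image plane so that it becomes a probability distribution in the image space. The function name `softmax_2d` and the random score map standing in for the last feature map of the network are assumptions made for this sketch only.

```python
import numpy as np

def softmax_2d(logits):
    """Normalize an (H, W) score map so that all of its entries sum to 1."""
    shifted = logits - logits.max()      # subtract the max for numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum()       # normalize jointly over both image axes

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 6))         # hypothetical stand-in for network scores
heat_map = softmax_2d(logits)
```

Because the normalization runs over both axes at once, each entry of `heat_map` can be read as the probability mass assigned to that pixel.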
(31) The first global geometric transformation predictor 102 may calculate each of a probability mean μ.sub.s of the source image heat map H.sub.s and a probability mean μ.sub.d of the target image heat map H.sub.d through Equation 2.
(32)
(33) The first global geometric transformation predictor 102 may calculate a covariance matrix of the source image heat map H.sub.s based on the probability mean μ.sub.s of the source image heat map H.sub.s, and may calculate a covariance matrix of the target image heat map H.sub.d based on the probability mean μ.sub.d of the target image heat map H.sub.d. The first global geometric transformation predictor 102 may calculate a covariance matrix K.sub.s of the source image heat map H.sub.s and a covariance matrix K.sub.d of the target image heat map H.sub.d through Equation 3.
(34)
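By way of a non-limiting illustration, the probability mean and the covariance matrix of a heat map may be sketched as the first and second moments of the pixel coordinates weighted by the heat map. The function names `pixel_grid` and `heat_map_moments`, and the choice of pixel coordinates normalized to [-1, 1], are assumptions made for this sketch only; the equations of the embodiment may use a different coordinate convention.

```python
import numpy as np

def pixel_grid(height, width):
    """Return an (H, W, 2) array of (x, y) coordinates normalized to [-1, 1]."""
    ys, xs = np.meshgrid(np.linspace(-1.0, 1.0, height),
                         np.linspace(-1.0, 1.0, width), indexing="ij")
    return np.stack([xs, ys], axis=-1)

def heat_map_moments(heat_map):
    """Return the probability mean (2,) and covariance matrix (2, 2) of a heat map."""
    grid = pixel_grid(*heat_map.shape)
    # Mean: sum over pixels of probability times coordinate.
    mean = (heat_map[..., None] * grid).sum(axis=(0, 1))
    # Covariance: probability-weighted outer product of the centered coordinates.
    centered = grid - mean
    outer = centered[..., :, None] * centered[..., None, :]
    cov = (heat_map[..., None, None] * outer).sum(axis=(0, 1))
    return mean, cov

# A uniform 3x3 heat map is centered on the image with equal spread on both axes.
uniform = np.full((3, 3), 1.0 / 9.0)
mean, cov = heat_map_moments(uniform)
```

The heat map is assumed to already sum to 1, as it would after a 2D softmax layer.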
(35) Here, the covariance matrix K.sub.s of the source image heat map H.sub.s and the covariance matrix K.sub.d of the target image heat map H.sub.d can each be decomposed through singular value decomposition, as expressed by Equation 4 below.
(36)
(37) When the covariance matrix K.sub.s of the source image heat map H.sub.s and the covariance matrix K.sub.d of the target image heat map H.sub.d are each m×n matrices, U.sub.s and U.sub.d may be unitary matrices having a size of m×m, Σ.sub.s and Σ.sub.d may be diagonal matrices having a size of m×n, and V.sub.s and V.sub.d may be unitary matrices having a size of n×n.
(38) The first global geometric transformation predictor 102 may calculate a geometric transformation into the source image heat map H.sub.s from a preset reference probability distribution H.sub.r based on the unitary matrix U.sub.s and the diagonal matrix Σ.sub.s according to the singular value decomposition of the covariance matrix K.sub.s of the source image heat map H.sub.s, and the probability mean μ.sub.s of the source image heat map H.sub.s. Here, the preset reference probability distribution H.sub.r may be a probability distribution in which a probability mean is 0, the covariance matrix is an identity matrix, and the main axis is aligned with an image axis.
(39) The first global geometric transformation predictor 102 may calculate a geometric transformation
(40)
into the source image heat map H.sub.s from the preset reference probability distribution H.sub.r through Equation 5 below.
(41)
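By way of a non-limiting illustration, one form of geometric transformation that is consistent with the above description (a reference distribution with zero mean and identity covariance, a unitary matrix and a diagonal matrix obtained by singular value decomposition, and a probability mean) may be sketched as an affine map whose linear part is the unitary matrix multiplied by the square root of the diagonal matrix and whose translation is the probability mean. This specific form, the 3x3 homogeneous layout, and the function name `transform_from_reference` are assumptions made for this sketch only.

```python
import numpy as np

def transform_from_reference(mean, cov):
    """Return a 3x3 homogeneous affine matrix mapping the reference
    distribution (zero mean, identity covariance) onto (mean, cov).

    Assumed form: linear part U * sqrt(Sigma), translation equal to the mean.
    """
    u, singular_values, _ = np.linalg.svd(cov)
    linear = u @ np.diag(np.sqrt(singular_values))
    affine = np.eye(3)
    affine[:2, :2] = linear
    affine[:2, 2] = mean
    return affine

mean = np.array([0.2, -0.1])                  # hypothetical probability mean
cov = np.array([[0.5, 0.1], [0.1, 0.3]])      # hypothetical covariance matrix
A = transform_from_reference(mean, cov)
```

Under this assumption, a point set with identity covariance is carried to one with covariance `cov`, because the linear part multiplied by its own transpose reproduces `cov` for a symmetric positive semi-definite matrix.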
(42) Further, the first global geometric transformation predictor 102 may calculate a geometric transformation into the target image heat map H.sub.d from the preset reference probability distribution H.sub.r based on the unitary matrix U.sub.d and the diagonal matrix Σ.sub.d according to the singular value decomposition of the covariance matrix K.sub.d of the target image heat map H.sub.d, and the probability mean μ.sub.d of the target image heat map H.sub.d. The first global geometric transformation predictor 102 may calculate a geometric transformation
(43)
into the target image heat map H.sub.d from the preset reference probability distribution H.sub.r through Equation 6 below.
(44)
(45) Meanwhile, it has been described here that the artificial neural network of the first global geometric transformation predictor 102 receives the source image I.sub.s and the target image I.sub.d to extract the source image heat map H.sub.s and the target image heat map H.sub.d, respectively, and the subsequent process is performed through calculations, but the embodiment is not limited thereto, and the artificial neural network of the first global geometric transformation predictor 102 may receive the source image I.sub.s and the target image I.sub.d to extract the geometric transformation
(46)
into the source image heat map H.sub.s from the preset reference probability distribution H.sub.r and the geometric transformation
(47)
into the target image heat map H.sub.d from the preset reference probability distribution H.sub.r, respectively.
(48) The first global geometric transformation predictor 102 may calculate a global geometric transformation into the source image I.sub.s from the target image I.sub.d based on the geometric transformation
(49)
into the source image heat map H.sub.s from the reference probability distribution H.sub.r and the geometric transformation
(50)
into the target image heat map H.sub.d from the reference probability distribution H.sub.r. The first global geometric transformation predictor 102 may calculate a global geometric transformation
(51)
into the source image I.sub.s from the target image I.sub.d through Equation 7 below.
(52)
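By way of a non-limiting illustration, a global geometric transformation into the source image from the target image may be sketched by chaining the two transformations from the reference probability distribution: the transformation into the target image heat map is inverted (target to reference) and followed by the transformation into the source image heat map (reference to source). This composition, the 3x3 homogeneous layout, and the function name `global_transform` are assumptions made for this sketch only.

```python
import numpy as np

def global_transform(a_source_from_ref, a_target_from_ref):
    """Compose target -> reference -> source as one 3x3 homogeneous matrix."""
    return a_source_from_ref @ np.linalg.inv(a_target_from_ref)

# Hypothetical transformations from the reference distribution.
a_s = np.array([[1.0, 0.0, 0.2],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
a_d = np.array([[2.0, 0.0, -0.1],
                [0.0, 0.5, 0.3],
                [0.0, 0.0, 1.0]])
a_sd = global_transform(a_s, a_d)

# A reference point, its position in the target, and its position in the source.
ref_point = np.array([0.4, -0.6, 1.0])
target_point = a_d @ ref_point
source_point = a_sd @ target_point
```

In this sketch, applying `a_sd` to the target-image position of a reference point yields the source-image position of the same reference point.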
(53) Meanwhile, it has been described here that the neural network of the first global geometric transformation predictor 102 receives the source image I.sub.s and the target image I.sub.d to extract the source image heat map H.sub.s and the target image heat map H.sub.d, respectively (that is, calculate the global geometric transformation based on the heat map), but the embodiment is not limited thereto, and a method of directly estimating the global geometric transformation from the source image I.sub.s and target image I.sub.d without a heatmap by using an artificial neural network such as an encoder-predictor structure may also be used.
(54) The first local geometric transformation predictor 104 may include an artificial neural network for estimating the local motion of a person in the speech synthesis image. In an exemplary embodiment, the artificial neural network may be trained to estimate a local speech motion of the person (motion related to speech, such as motions of the mouth and jaw of the person) from input image information.
(55) Specifically, the first local geometric transformation predictor 104 may receive the source image I.sub.s and the target image I.sub.d, respectively. That is, the first local geometric transformation predictor 104 may receive the same source image I.sub.s and target image I.sub.d as input to the first global geometric transformation predictor 102, respectively.
(56) The first local geometric transformation predictor 104 may estimate a plurality of geometric transformations (hereinafter referred to as source local geometric transformations) for the local motion of the person from the source image I.sub.s. That is, the first local geometric transformation predictor 104 may estimate, based on the source image I.sub.s, a plurality of geometric transformations (n geometric transformations, where n is a natural number greater than or equal to 2) capable of representing the local motions related to speech when the person speaks. In this case, the number of geometric transformations may be appropriately set as needed.
(57) In addition, the first local geometric transformation predictor 104 may estimate a plurality of geometric transformations (hereinafter referred to as target local geometric transformations) for the local motion of the person from the target image I.sub.d. That is, the first local geometric transformation predictor 104 may estimate, based on the target image I.sub.d, a plurality of geometric transformations (n geometric transformations, where n is a natural number greater than or equal to 2) capable of representing the local motions related to speech when the person speaks. In this case, the number of geometric transformations may be appropriately set as needed.
(58) The first local geometric transformation predictor 104 may estimate a source local geometric transformation
(59)
from the source image I.sub.s and may estimate a target local geometric transformation
(60)
from the target image I.sub.d through the following Equations 8 and 9.
(61)
(62) Here, k ∈ {1, . . . , n} (n is a natural number greater than or equal to 2), and F.sup.local1 is an artificial neural network constituting the first local geometric transformation predictor 104.
(63) The first local geometric transformation predictor 104 may calculate a local geometric transformation capable of expressing a difference in the local motion of a person (here, the local speech motion of the person) between the source image I.sub.s and the target image I.sub.d based on the source local geometric transformation
(64)
and the target local geometric transformation
(65)
Hereinafter, the first local geometric transformation predictor 104 is described as calculating the local geometric transformation into the source image I.sub.s from the target image I.sub.d by way of example, but is not limited thereto, and may also calculate the local geometric transformation into the target image I.sub.d from the source image I.sub.s.
(66) The first local geometric transformation predictor 104 may calculate a local geometric transformation
(67)
into the source image I.sub.s from the target image I.sub.d through Equation 10 below.
(68)
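As a minimal sketch of how a relative local transformation could be obtained from the per-image estimates, assume (the form of Equation 10 is not reproduced in this text) that each estimated transformation is a 2D affine map from a common reference coordinate system into the image, stored as a 3x3 homogeneous matrix. Under that assumption, the transformation into the source image from the target image is the source estimate composed with the inverse of the target estimate. The function names `matmul3`, `affine_inverse`, and `relative_transforms` are hypothetical and are introduced only for illustration.

```python
def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def affine_inverse(m):
    """Invert a homogeneous affine matrix [[a, b, tx], [c, d, ty], [0, 0, 1]]."""
    a, b, tx = m[0]
    c, d, ty = m[1]
    det = a * d - b * c
    if det == 0:
        raise ValueError("affine transformation is not invertible")
    ia, ib, ic, id_ = d / det, -b / det, -c / det, a / det
    return [[ia, ib, -(ia * tx + ib * ty)],
            [ic, id_, -(ic * tx + id_ * ty)],
            [0.0, 0.0, 1.0]]


def relative_transforms(source_from_ref, target_from_ref):
    """For each index k, map target coordinates to source coordinates.

    Assumption: both inputs are lists of n reference-to-image matrices.
    """
    return [matmul3(s, affine_inverse(d))
            for s, d in zip(source_from_ref, target_from_ref)]


# Example: the source estimate shifts by (2, 0), the target estimate by (0, 3).
source_k = [[1, 0, 2], [0, 1, 0], [0, 0, 1]]
target_k = [[1, 0, 0], [0, 1, 3], [0, 0, 1]]
result = relative_transforms([source_k], [target_k])
```

In this example the single resulting matrix is a shift by (2, -3), that is, the target shift is undone and the source shift is then applied.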
(69) Meanwhile, here, although the description is made in such a way that the artificial neural network of the first local geometric transformation predictor 104 receives the source image I.sub.s and the target image I.sub.d, respectively, and calculates the local geometric transformation, the present disclosure is not limited thereto. The artificial neural network of the first local geometric transformation predictor 104 may calculate the local geometric transformation by receiving feature points respectively extracted from the source image I.sub.s and the target image I.sub.d.
(70) In addition, the first local geometric transformation predictor 104 may calculate the local geometric transformation by receiving a partial image (e.g., an image in which parts of the source image I.sub.s and the target image I.sub.d are masked or cropped, etc.) containing only a part of the source image I.sub.s and the target image I.sub.d rather than the entire image.
(71) In addition, the first local geometric transformation predictor 104 may receive only the target image I.sub.d and calculate the local geometric transformation. In this case, the local geometric transformation
(72)
can be expressed as
(73)
(74) The geometric transformation combiner 106 may calculate a geometric transformation for the full motion of the person (that is, motion including both the global motion and the local motion) by combining the global geometric transformation calculated by the first global geometric transformation predictor 102 and the local geometric transformation calculated by the first local geometric transformation predictor 104. Hereinafter, the geometric transformation for the full motion of the person may be referred to as a full motion geometric transformation.
(75) In an exemplary embodiment, when the first global geometric transformation predictor 102 calculates the global geometric transformation
(76)
into the source image I.sub.s from the target image I.sub.d and the first local geometric transformation predictor 104 calculates the local geometric transformation
(77)
into the source image I.sub.s from the target image I.sub.d, the geometric transformation combiner 106 may calculate the full motion geometric transformation into the source image I.sub.s from the target image I.sub.d by combining the global geometric transformation
(78)
and the local geometric transformation
(79)
However, the embodiment is not limited thereto, and the full motion geometric transformation into the target image I.sub.d from the source image I.sub.s may be calculated.
(80) For example, as shown in Equation 11 below, the geometric transformation combiner 106 may calculate a full motion geometric transformation
(81)
by sequentially multiplying a plurality of (n) local geometric transformations
(82)
by the global geometric transformation
(83)
(84)
(85) For another example, as shown in Equation 12 below, the geometric transformation combiner 106 may calculate the full motion geometric transformation
(86)
by composing a set of geometric transformations with a plurality of (n) local geometric transformations
(87)
and the global geometric transformation
(88)
(89)
(90) For yet another example, as shown in Equation 13 below, the geometric transformation combiner 106 may calculate the full motion geometric transformation
(91)
by converting the coordinates of the target image I.sub.d into coordinates of a preset reference coordinate system through
(92)
applying the local geometric transformation
(93)
to the preset reference coordinate system, and then converting the coordinates of the preset reference coordinate system into the coordinates of the source image I.sub.s through
(94)
(95)
(96) In Equation 13, a homogeneous coordinates method may be used for multiplication between geometric transformations.
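The three-step route described for Equation 13 (target coordinates to the reference coordinate system, local transformation in the reference coordinate system, reference coordinates to the source image) can be sketched with homogeneous coordinates as follows. This is an illustration under assumptions, not the form of Equation 13 itself: the transformations are taken to be 2x3 affine matrices, and the helper names `to_homogeneous`, `matmul3`, `apply_transform`, and `full_motion` are hypothetical.

```python
def to_homogeneous(affine_2x3):
    """Append the row [0, 0, 1] so that affine matrices can be multiplied."""
    return [list(affine_2x3[0]), list(affine_2x3[1]), [0.0, 0.0, 1.0]]


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply_transform(m, point):
    """Apply a homogeneous affine matrix to a 2D point (x, y)."""
    x, y = point
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])


def full_motion(source_from_ref, local_k, ref_from_target):
    """Chain target -> reference -> local motion -> source for one index k."""
    return matmul3(source_from_ref, matmul3(local_k, ref_from_target))


# Example: shift into the reference frame, scale by 2 there, shift into the source.
ref_from_target = to_homogeneous([[1, 0, -1], [0, 1, 0]])
local_k = to_homogeneous([[2, 0, 0], [0, 2, 0]])
source_from_ref = to_homogeneous([[1, 0, 0], [0, 1, 5]])
full_k = full_motion(source_from_ref, local_k, ref_from_target)
mapped = apply_transform(full_k, (3, 4))
```

Because the rightmost matrix acts first, the target point (3, 4) becomes (2, 4) in the reference coordinate system, (4, 8) after the local transformation, and (4, 13) in the source image.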
(97) The optical flow predictor 108 may calculate an optical flow representing a motion (or change amount) in units of pixels between the source image I.sub.s and the target image I.sub.d by using the full motion geometric transformation calculated by the geometric transformation combiner 106 and the source image I.sub.s.
(98) In an exemplary embodiment, when the geometric transformation combiner 106 calculates the full motion geometric transformation
(99)
into the source image I.sub.s from the target image I.sub.d, the optical flow predictor 108 may calculate the optical flow from the target image I.sub.d to the source image I.sub.s based on the full motion geometric transformation
(100)
and the source image I.sub.s, which will be described below. However, the embodiment is not limited thereto, and the optical flow from the source image I.sub.s to the target image I.sub.d may be calculated.
(101) Specifically, the optical flow predictor 108 may transform the source image I.sub.s by applying the full motion geometric transformation
(102)
into the source image I.sub.s by using an image warping operator. In this case, the optical flow predictor 108 may transform the source image I.sub.s through Equation 14 below.
(103)
(104)
transformed source image I.sub.s; warp( ): operator for image warping; k: k ∈ {1, . . . , n} (n is a natural number greater than or equal to 2)
(105) In Equation 14, for the operator for image warping, a backward warping operation may be used that calculates coordinates of the source image I.sub.s corresponding to coordinates of the transformed source image
(106)
by applying each of n full motion geometric transformations
(107)
to the coordinates of the transformed source image
(108)
and estimates pixel values of the transformed source image
(109)
from pixel values of the source image I.sub.s using interpolation.
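The backward warping operation described above can be illustrated with a minimal sketch. It assumes a single-channel image stored as a list of rows, an affine transformation stored as a 3x3 homogeneous matrix, bilinear interpolation, and zero fill outside the image; none of these choices is specified in the text, and the function names are hypothetical.

```python
import math


def bilinear_sample(image, x, y):
    """Sample a grayscale image at real-valued (x, y); return 0.0 outside it."""
    h, w = len(image), len(image[0])
    if x < 0 or y < 0 or x > w - 1 or y > h - 1:
        return 0.0
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def backward_warp(source, transform):
    """For each output pixel, look up the corresponding source coordinates."""
    h, w = len(source), len(source[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Map output coordinates into the source image, then interpolate.
            sx = transform[0][0] * x + transform[0][1] * y + transform[0][2]
            sy = transform[1][0] * x + transform[1][1] * y + transform[1][2]
            out[y][x] = bilinear_sample(source, sx, sy)
    return out


source = [[0, 10, 20], [30, 40, 50]]
half_pixel_shift = [[1, 0, 0.5], [0, 1, 0], [0, 0, 1]]
warped = backward_warp(source, half_pixel_shift)
```

Iterating over output pixels rather than source pixels means every output pixel receives exactly one value, which is why backward warping avoids the holes that forward warping can leave.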
(110) The optical flow predictor 108 may calculate a weighted probability distribution map for estimating the optical flow based on the transformed source image
(111)
In this case, the optical flow predictor 108 may calculate a weighted probability distribution map P having n classes for each pixel by inputting the transformed source image
(112)
to the artificial neural network. This may be expressed by the Equation 15 below.
(113)
(114) Here, the artificial neural network may include a one-dimensional softmax layer at an output end to calculate the weighted probability distribution map P.
(115) Meanwhile, here, the transformed source image
(116)
is used as an input to the artificial neural network F.sup.flow, but the embodiment is not limited thereto and the weighted probability distribution map P may be calculated by using a feature tensor extracted from the source image I.sub.s as an input to the artificial neural network F.sup.flow.
(117) The optical flow predictor 108 may calculate the optical flow from the target image I.sub.d to the source image I.sub.s for each pixel by linearly combining the full motion geometric transformation
(118)
using a weighted probability distribution value corresponding to each pixel position of the transformed source image
(119)
The optical flow predictor 108 may calculate the optical flow from the target image I.sub.d to the source image I.sub.s for each pixel through Equation 16.
(120)
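The per-pixel linear combination can be sketched as follows. This is an illustration under assumptions rather than Equation 16 itself: the network output for a pixel is taken to be n raw scores that a softmax turns into weights, each full motion geometric transformation is a 3x3 homogeneous affine matrix, and the flow is expressed as the blended source coordinates for the pixel. The names `softmax` and `pixel_flow` are hypothetical.

```python
import math


def softmax(scores):
    """Convert n raw scores into n non-negative weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(v - peak) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pixel_flow(z, transforms, scores):
    """Blend the source coordinates proposed by each transformation for pixel z."""
    weights = softmax(scores)
    x, y = z
    fx = fy = 0.0
    for weight, t in zip(weights, transforms):
        fx += weight * (t[0][0] * x + t[0][1] * y + t[0][2])
        fy += weight * (t[1][0] * x + t[1][1] * y + t[1][2])
    return (fx, fy)


# Two candidate transformations with equal scores contribute equally.
shift_x = [[1, 0, 2], [0, 1, 0], [0, 0, 1]]
shift_y = [[1, 0, 0], [0, 1, 4], [0, 0, 1]]
flow = pixel_flow((1, 1), [shift_x, shift_y], [0.0, 0.0])
```

Here the two candidates map the pixel (1, 1) to (3, 1) and (1, 5), and equal weights blend them to (2, 3). Subtracting z from the result would give the flow as a displacement instead of as coordinates.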
(121) The image generator 110 may reconstruct and generate the target image I.sub.d based on the optical flow between the source image I.sub.s and the target image I.sub.d calculated by the optical flow predictor 108 and the source image I.sub.s.
(122) In an exemplary embodiment, when the optical flow predictor 108 calculates the optical flow f.sub.sd(z) from the target image I.sub.d to the source image I.sub.s for each pixel, the image generator 110 may reconstruct the target image I.sub.d based on that optical flow f.sub.sd(z) and the source image I.sub.s.
(123) Specifically, the image generator 110 may extract a feature tensor by inputting the source image I.sub.s into an artificial neural network (e.g., an encoder). In this case, the artificial neural network may encode the source image I.sub.s and extract a feature tensor from the source image I.sub.s.
(124) The image generator 110 may transform a feature tensor (I.sub.s) of the source image I.sub.s by using the optical flow f.sub.sd(z) from the target image I.sub.d to the source image I.sub.s for each pixel. The image generator 110 may transform the feature tensor (I.sub.s) of the source image I.sub.s through Equation 17 below.
(125)
(126) Here, as the operator warp( ) for image warping, a backward warping operator can be used.
(127) The image generator 110 may reconstruct the target image I.sub.d by inputting the transformed feature tensor (I.sub.s) of the source image to an artificial neural network (e.g., a decoder). The image generator 110 may train the artificial neural network to minimize the difference between the reconstructed target image I.sub.d and the actual target image I.sub.d.
(128) Meanwhile, when the training of the speech synthesis image generating apparatus 100 is completed, an arbitrary target image may be reconstructed from the source image by inputting the source image and the arbitrary target image to the first global geometric transformation predictor 102, and inputting the source image and the arbitrary target image to the first local geometric transformation predictor 104.
(129) According to the disclosed embodiment, the global motion and the local motion of a person in an image are separately estimated at the time of generating a speech synthesis image, thereby making it possible to reduce the overall size of the machine learning model for generating a speech synthesis image and reduce the number of computations used therefor.
(130) Meanwhile, it has been described here that the optical flow predictor 108 calculates the optical flow f.sub.sd(z) for each pixel and the image generator 110 reconstructs the target image using the optical flow f.sub.sd(z), but the embodiment is not limited thereto. The full motion geometric transformation
(131)
and the source image I.sub.s may be input to the image generator 110 without the process of calculating the optical flow for each pixel, and the target image may be reconstructed based on these inputs.
(132)
(133) Referring to
(134) That is, the speech synthesis image generating apparatus 100 illustrated in
(135) Here, as described with reference to
(136) The second local geometric transformation predictor 112 may include an artificial neural network for estimating a local non-speech motion of the person in the speech synthesis image. In an exemplary embodiment, the artificial neural network may be trained to estimate the local non-speech motion of the person (e.g., blinking of the eyes, eyebrow motion, frown, or the like of the person) from an input partial image (or feature point).
(137) The second local geometric transformation predictor 112 may receive a partial image including only a motion related to non-speech of the person. In an exemplary embodiment, the second local geometric transformation predictor 112 may receive each of a source partial image I.sub.s.sup.eyes including only areas around the eyes and eyebrows of the person in the source image and a target partial image I.sub.d.sup.eyes including only areas around the eyes and eyebrows of the person in the target image.
(138) Here, the source partial image I.sub.s.sup.eyes and the target partial image I.sub.d.sup.eyes may be images to which a mask is applied that covers the parts of the source image and the target image other than the parts around the eyes and eyebrows of the person, or images in which only the parts around the eyes and eyebrows of the person in the source image and the target image are cropped out.
(139) Meanwhile, it has been described here that the source partial image I.sub.s.sup.eyes and the target partial image I.sub.d.sup.eyes are input to the second local geometric transformation predictor 112, but the embodiment is not limited thereto, and each of the feature points of the source partial image I.sub.s.sup.eyes and the target partial image I.sub.d.sup.eyes may be input to the second local geometric transformation predictor 112.
(140) When information corresponding to the global motion of the person is present in input data (partial image or feature point) including only motion related to non-speech of the person, the second local geometric transformation predictor 112 may remove information corresponding to the global motion of the person from the input data. For example, when the input data is a partial image, the second local geometric transformation predictor 112 may fix the position and size of a motion part related to non-speech of the person in the partial image and remove information corresponding to the global motion of the person. Further, when the input data is a feature point, the second local geometric transformation predictor 112 may remove a value corresponding to the global motion of the person from feature point coordinates and leave only the motion value related to the non-speech of the person.
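For the feature point case, one possible way to remove position and size information is sketched below. The text does not specify how the value corresponding to the global motion is removed, so this normalization (subtracting the centroid and dividing by the root-mean-square radius) is an assumption, and the function name `remove_global_motion` is hypothetical. Note that it does not remove rotation.

```python
import math


def remove_global_motion(points):
    """Center feature points on their centroid and rescale to unit RMS radius."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    centered = [(x - cx, y - cy) for x, y in points]
    scale = math.sqrt(sum(x * x + y * y for x, y in centered) / n)
    if scale == 0:
        return centered  # all points coincide; nothing to rescale
    return [(x / scale, y / scale) for x, y in centered]


# The same eye-region shape at two different positions and sizes.
near = [(0, 0), (2, 0), (2, 2), (0, 2)]
far = [(10, 10), (16, 10), (16, 16), (10, 16)]
normalized_near = remove_global_motion(near)
normalized_far = remove_global_motion(far)
```

After normalization the two point sets coincide up to floating-point error, so any remaining difference between a source set and a target set would reflect only a change in shape, such as an eyelid closing.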
(141) The second local geometric transformation predictor 112 may estimate a plurality of geometric transformations (hereinafter, referred to as source partial geometric transformations) for a local non-speech motion of the person from the source partial image I.sub.s.sup.eyes. In addition, the second local geometric transformation predictor 112 may estimate a plurality of geometric transformations (hereinafter, referred to as target partial geometric transformations) for a local non-speech motion of the person from the target partial image I.sub.d.sup.eyes. In this case, the number of geometric transformations may be appropriately set as needed.
(142) The second local geometric transformation predictor 112 may estimate a source partial geometric transformation
(143)
from the source partial image I.sub.s.sup.eyes through Equation 18 below, and estimate a target partial geometric transformation
(144)
from the target partial image I.sub.d.sup.eyes.
(145)
(146) Here, k ∈ {1, . . . , n} (n is a natural number greater than or equal to 2), and F.sup.local2 is an artificial neural network constituting the second local geometric transformation predictor 112.
(147) The second local geometric transformation predictor 112 may calculate a second local geometric transformation capable of expressing the difference in the local non-speech motion of the person between the source image I.sub.s and the target image I.sub.d based on the source partial geometric transformation
(148)
and the target partial geometric transformation
(149)
Hereinafter, the second local geometric transformation predictor 112 is described as calculating the second local geometric transformation into the source image I.sub.s from the target image I.sub.d by way of example, but is not limited thereto, and may also calculate the second local geometric transformation into the target image I.sub.d from the source image I.sub.s.
(150) The second local geometric transformation predictor 112 may calculate a second local geometric transformation
(151)
into the source image I.sub.s from the target image I.sub.d through Equation 19 below.
(152)
(153) Meanwhile, the geometric transformation combiner 106 may calculate a full motion geometric transformation by combining the global geometric transformation calculated by the first global geometric transformation predictor 102, the first local geometric transformation calculated by the first local geometric transformation predictor 104, and the second local geometric transformation calculated by the second local geometric transformation predictor 112.
(154) In an exemplary embodiment, the geometric transformation combiner 106 may calculate a full local geometric transformation by combining the first local geometric transformation and the second local geometric transformation, and calculate the full motion geometric transformation by combining the full local geometric transformation and the global geometric transformation. Here, the method of calculating a full motion geometric transformation by combining the full local geometric transformation and the global geometric transformation may be performed in the same or similar manner as in Equations 11 to 13, and thus a detailed description thereof will be omitted.
(155) When the first local geometric transformation into the source image I.sub.s from the target image I.sub.d which is calculated by the first local geometric transformation predictor 104 is
(156)
the full local geometric transformation
(157)
which expresses the difference in the full local motion of the person (that is, including the local speech motion and local non-speech motion) between the source image I.sub.s and the target image I.sub.d, may be expressed as Equation 20 below.
(158)
(159) Furthermore, the optical flow predictor 108 and the image generator 110 are the same as those in the embodiment shown in
(160) It has been described here that both the first local geometric transformation predictor 104 and the second local geometric transformation predictor 112 are included, but the embodiment is not limited thereto, and when the geometric transformation for the local non-speech motion is estimated, the first local geometric transformation predictor 104 may be omitted.
(161) Meanwhile, in the disclosed embodiment, the relative change amount of the global geometric transformation of the person may be learned using a voice signal sequence. That is, a separate artificial neural network that uses a voice signal sequence (sequential voice signal) as an input may be added, and the artificial neural network may be trained to estimate the relative change amount of the global geometric transformation calculated by the first global geometric transformation predictor 102 shown in
(162)
(163) Here, when the first global geometric transformation predictor 102 is in a trained state, and an image I.sub.i (1 ≤ i ≤ n) having n frames is input to the first global geometric transformation predictor 102, the first global geometric transformation predictor 102 may calculate a geometric transformation
(164)
into the i-th frame heat map from the preset reference probability distribution H.sub.r.
(165) In addition, the first global geometric transformation predictor 102 may calculate a global geometric transformation
(166)
between two adjacent frames based on the geometric transformation
(167)
into the i-th frame heat map (i ∈ {1, . . . , n}) from the preset reference probability distribution H.sub.r. Here, the first global geometric transformation predictor 102 may calculate the global geometric transformation
(168)
between two frames through Equation 21 below.
(169)
(170) Meanwhile, in the training stage of the second global geometric transformation predictor 114, the second global geometric transformation predictor 114 may receive a sequential voice signal M.sub.i (1 ≤ i ≤ n) corresponding to an image I.sub.i (1 ≤ i ≤ n) having n frames. The second global geometric transformation predictor 114 may include an artificial neural network F.sup.seq that is trained to estimate the global geometric transformation
(171)
between two frames of the corresponding image from the input sequential voice signal M.sub.i.
(172) In this case, the second global geometric transformation predictor 114 may use the global geometric transformation
(173)
between two frames calculated by the first global geometric transformation predictor 102 as a correct answer value, and may train the artificial neural network F.sup.seq (that is, adjust the parameter or weight of the artificial neural network F.sup.seq) to minimize the difference between the global geometric transformation between the two frames output from the artificial neural network F.sup.seq and the correct answer value.
(174) The second global geometric transformation predictor 114 may estimate the global geometric transformation
(175)
between two frames of the corresponding image from the input sequential voice signal M.sub.i through Equation 22 below.
(176)
(177) As described above, when the training of the second global geometric transformation predictor 114 is completed, the global geometric transformation of the person may be predicted using the source image and the sequential voice signal as inputs. In this case, the global geometric transformation of the person is predicted through the second global geometric transformation predictor 114 instead of the first global geometric transformation predictor 102. The configuration of a speech synthesis image generating apparatus 100 for achieving the above is shown in
(178) Referring to
(179) The second global geometric transformation predictor 114 may receive a sequential voice signal of a predetermined person, and estimate a global geometric transformation
(180)
between two frames of an image corresponding to the sequential voice signal from the received sequential voice signal.
(181) The second global geometric transformation predictor 114 may calculate a global geometric transformation into a start frame (source image) from a target frame (i-th frame) based on the global geometric transformation
(182)
between two frames of the image corresponding to the sequential voice signal.
(183) Here, the start frame may be for providing information about the identity of the person. In this case, in order to provide information about the identity of the person, an embedding vector or the like for the person may be additionally input instead of the start frame or in addition to the start frame.
(184) Specifically, the second global geometric transformation predictor 114 may calculate a global geometric transformation
(185)
into the i-th frame (that is, target frame) from the start frame through Equation 23 below by using the source image as the start frame based on the global geometric transformation
(186)
between two frames of the image corresponding to the sequential voice signal.
(187)
(188) Next, the second global geometric transformation predictor 114 may calculate the global geometric transformation
(189)
into the start frame from the i-th frame, which is the target frame, through Equation 24 below.
(190)
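The two steps described for Equations 23 and 24 (accumulating the transformations between adjacent frames to reach the i-th frame from the start frame, then inverting the result) can be sketched as follows. The exact forms of the equations are not reproduced in this text, so the sketch assumes 3x3 homogeneous affine matrices and a left-to-right accumulation order; the function names are hypothetical.

```python
def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def affine_inverse(m):
    """Invert a homogeneous affine matrix [[a, b, tx], [c, d, ty], [0, 0, 1]]."""
    a, b, tx = m[0]
    c, d, ty = m[1]
    det = a * d - b * c
    if det == 0:
        raise ValueError("affine transformation is not invertible")
    ia, ib, ic, id_ = d / det, -b / det, -c / det, a / det
    return [[ia, ib, -(ia * tx + ib * ty)],
            [ic, id_, -(ic * tx + id_ * ty)],
            [0.0, 0.0, 1.0]]


def start_to_frame(adjacent):
    """Accumulate frame-to-next-frame transformations, earliest step first."""
    acc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for step in adjacent:
        acc = matmul3(step, acc)  # later steps are applied after earlier ones
    return acc


def frame_to_start(adjacent):
    """Invert the accumulated transformation to map back to the start frame."""
    return affine_inverse(start_to_frame(adjacent))


# Three adjacent-frame shifts: (1, 0), then (0, 2), then (3, 0).
steps = [[[1, 0, 1], [0, 1, 0], [0, 0, 1]],
         [[1, 0, 0], [0, 1, 2], [0, 0, 1]],
         [[1, 0, 3], [0, 1, 0], [0, 0, 1]]]
forward = start_to_frame(steps)
backward = frame_to_start(steps)
```

In this example the accumulated transformation is a shift by (4, 2) and its inverse is a shift by (-4, -2), so composing the two returns the identity.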
(191) The first local geometric transformation predictor 104 receives a source image and a target image, respectively. Here, the source image may be the start frame. In addition, the target image may be an image corresponding to the i-th frame (i.e., target frame).
(192) The first local geometric transformation predictor 104 may calculate a local geometric transformation
(193)
into the start frame from the target frame based on the source image and the target image.
(194) The geometric transformation combiner 106 may calculate a full motion geometric transformation
(195)
into the start frame from the target frame by combining the global geometric transformation
(196)
and the local geometric transformation
(197)
into the start frame from the target frame.
(198) The optical flow predictor 108 may receive the start frame and the full motion geometric transformation
(199)
into the start frame from the target frame, and calculate, from these inputs, an optical flow f.sub.1i from the target frame to the start frame for each pixel.
(200) The image generator 110 may receive the start frame and the optical flow f.sub.1i from the target frame to the start frame for each pixel, and reconstruct and generate the target frame from these inputs. As described above, according to the disclosed embodiment, it is possible to estimate the global motion of the person by using the sequential voice signal as an input, and to generate a speech synthesis image based on the estimation.
(201) Meanwhile, here, the first local geometric transformation predictor 104 is illustrated to estimate the local speech motion of the person, but the embodiment is not limited thereto, and the second local geometric transformation predictor 112 may be added to additionally estimate the local non-speech motion of the person.
(202)
(203) The illustrated computing environment 10 includes a computing device 12. In an embodiment, the computing device 12 can be the apparatus 100 for generating the speech synthesis image.
(204) The computing device 12 includes at least one processor 14, a computer-readable storage medium 16, and a communication bus 18. The processor 14 can cause the computing device 12 to operate according to the exemplary embodiment described above. For example, the processor 14 can execute one or more programs stored on the computer-readable storage medium 16. The one or more programs can include one or more computer-executable instructions, which, when executed by the processor 14, can be configured so that the computing device 12 performs operations according to the exemplary embodiment.
(205) The computer-readable storage medium 16 is configured so that the computer-executable instruction or program code, program data, and/or other suitable forms of information are stored. A program 20 stored in the computer-readable storage medium 16 includes a set of instructions executable by the processor 14. In one embodiment, the computer-readable storage medium 16 can be a memory (volatile memory such as a random access memory, non-volatile memory, or any suitable combination thereof), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, other types of storage media that are accessible by the computing device 12 and capable of storing desired information, or any suitable combination thereof.
(206) The communication bus 18 interconnects various other components of the computing device 12, including the processor 14 and the computer-readable storage medium 16.
(207) The computing device 12 can also include one or more input/output interfaces 22 that provide an interface for one or more input/output devices 24, and one or more network communication interfaces 26. The input/output interface 22 and the network communication interface 26 are connected to the communication bus 18. The input/output device 24 can be connected to other components of the computing device 12 through the input/output interface 22. The exemplary input/output device 24 can include a pointing device (such as a mouse or trackpad), a keyboard, a touch input device (such as a touch pad or touch screen), a speech or sound input device, input devices such as various types of sensor devices and/or photographing devices, and/or output devices such as a display device, a printer, a speaker, and/or a network card. The exemplary input/output device 24 may be included inside the computing device 12 as a component constituting the computing device 12, or may be connected to the computing device 12 as a separate device distinct from the computing device 12.
(208) Although representative embodiments of the present disclosure have been described in detail, those skilled in the art to which the present disclosure pertains will understand that various modifications can be made thereto within the limits that do not depart from the scope of the present disclosure. Therefore, the scope of rights of the present disclosure should not be limited to the described embodiments, but should be defined not only by claims set forth below but also by equivalents to the claims.