G10L2021/105

Tonal deafness compensation in an auditory prosthesis system
09956407 · 2018-05-01

Embodiments presented herein are generally directed to techniques for compensating for tonal deafness experienced by a recipient of an auditory prosthesis. More specifically, an auditory prosthesis system includes an external device configured to generate a graphical representation that enables the recipient to compensate for the reduced tonal perception associated with delivery of stimulation signals representative of speech signals. The external device is configured to analyze received speech signals to determine vocal articulator movement of the speaker of the speech signals and/or emotion of the speaker. The external device is further configured to display one or more animated visual cues representative of the detected vocal articulator movement and/or emotion.
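As a rough illustration of the analyze-then-display pipeline this abstract describes, a minimal Python sketch follows; the feature heuristics, thresholds, and function names are assumptions for illustration, not the patented implementation:

```python
# Hypothetical sketch of the external device's analysis pipeline: derive
# coarse articulator/emotion cues from an audio frame, then "display" them.
# All heuristics here are assumptions, not the patent's actual method.
import numpy as np

def analyze_speech(frame: np.ndarray) -> dict:
    """Derive coarse articulator/emotion cues from one audio frame."""
    energy = float(np.mean(frame ** 2))              # loudness -> mouth opening
    zero_crossings = int(np.sum(np.abs(np.diff(np.sign(frame)))) // 2)
    brightness = zero_crossings / len(frame)         # crude spectral proxy
    return {
        "mouth_opening": min(1.0, energy * 50.0),    # drives an animated mouth cue
        "emotion": "excited" if brightness > 0.2 and energy > 0.01 else "neutral",
    }

def render_cues(cues: dict) -> None:
    """Stand-in for the graphical display of animated visual cues."""
    print(f"mouth={cues['mouth_opening']:.2f} emotion={cues['emotion']}")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.normal(scale=0.05, size=16000)       # 1 s of stand-in audio @ 16 kHz
    render_cues(analyze_speech(frame))
```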

Computer generated head

A method of animating a computer generation of a head, the head having a mouth which moves in accordance with speech to be output by the head, said method comprising: providing an input related to the speech which is to be output by the movement of the lips; dividing said input into a sequence of acoustic units; selecting expression characteristics for the input text; converting said sequence of acoustic units to a sequence of image vectors using a statistical model, wherein said model has a plurality of model parameters describing probability distributions which relate an acoustic unit to an image vector, said image vector comprising a plurality of parameters which define a face of said head; and outputting said sequence of image vectors as video such that the mouth of said head moves to mime the speech associated with the input text with the selected expression, wherein a parameter of a predetermined type of each probability distribution in said selected expression is expressed as a weighted sum of parameters of the same type, and wherein the weighting used is expression dependent, such that converting said sequence of acoustic units to a sequence of image vectors comprises retrieving the expression dependent weights for said selected expression, wherein the parameters are provided in clusters, and each cluster comprises at least one sub-cluster, wherein said expression dependent weights are retrieved for each cluster such that there is one weight per sub-cluster.
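The weighted-sum structure in this claim (one expression-dependent weight per sub-cluster, summed across clusters) can be made concrete with a small sketch; the cluster layout, weights, and parameter values below are illustrative assumptions:

```python
# Minimal sketch of the expression-dependent weighted sum in the claim: a
# distribution parameter (here, a mean) is rebuilt as a weighted sum of
# cluster parameters, with one retrieved weight per sub-cluster. All names
# and numbers are illustrative assumptions.
import numpy as np

# parameters organised in clusters; each cluster holds sub-cluster parameters
clusters = {
    "cluster_1": {"sub_a": np.array([0.1, 0.4]), "sub_b": np.array([0.3, 0.2])},
    "cluster_2": {"sub_a": np.array([0.7, 0.1])},
}

# expression-dependent weights: one weight per sub-cluster of each cluster
expression_weights = {
    "happy": {"cluster_1": {"sub_a": 0.9, "sub_b": 0.4}, "cluster_2": {"sub_a": 0.2}},
    "sad":   {"cluster_1": {"sub_a": 0.1, "sub_b": 0.8}, "cluster_2": {"sub_a": 0.7}},
}

def expression_mean(expression: str) -> np.ndarray:
    """Weighted sum of same-type parameters across clusters and sub-clusters."""
    weights = expression_weights[expression]   # retrieve the per-expression weights
    total = np.zeros(2)
    for cname, subs in clusters.items():
        for sname, param in subs.items():
            total += weights[cname][sname] * param
    return total

print(expression_mean("happy"))   # mean used when converting units to image vectors
```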

Generation of animation using icons in text
09953450 · 2018-04-24

There is described a method for creating an animation, comprising: inserting at least one icon within a text related to the animation, the at least one icon being associated with an action to be performed by one of an entity and a part of an entity, at a point in time corresponding to a position of the at least one icon in the text, and a given feature of an appearance of the at least one icon being associated with one of the entity and the part of the entity; and executing the text and the at least one icon in order to generate the animation.
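A minimal sketch of how icon positions in a text could map to timed actions for an entity part, assuming a toy token syntax and a fixed words-to-seconds rule (neither is specified by the abstract):

```python
# Hedged sketch of the icon-in-text idea: icon tokens embedded in a script
# are mapped to timed actions for an entity or entity part. The token syntax
# and the timing rule (one word ~= 0.4 s) are assumptions for illustration.
ICON_ACTIONS = {":wave:": ("hand", "wave"), ":nod:": ("head", "nod")}

def parse_script(text: str, seconds_per_word: float = 0.4):
    """Return (time, entity_part, action) events from icon positions."""
    events, words_seen = [], 0
    for token in text.split():
        if token in ICON_ACTIONS:
            part, action = ICON_ACTIONS[token]   # icon appearance -> entity part
            events.append((words_seen * seconds_per_word, part, action))
        else:
            words_seen += 1                      # position in text -> point in time
    return events

print(parse_script("Hello there :wave: nice to meet you :nod:"))
# [(0.8, 'hand', 'wave'), (2.4, 'head', 'nod')]
```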

System and method for lip-syncing a face to target speech using a machine learning model

A processor-implemented method is provided for lip-syncing a face to target speech of a live session, in one or more languages, in sync and with improved visual quality, using a machine learning model and a pre-trained lip-sync model. The method includes (i) determining a visual representation of the face and an audio representation, the visual representation including crops of the face at a first timestamp; (ii) modifying the crops of the face to obtain masked crops; (iii) obtaining a reference frame from the visual representation at a second timestamp; (iv) combining the masked crops at the first timestamp with the reference frame to obtain lower-half crops; (v) training the machine learning model by providing historical lower-half crops and historical audio representations as training data; (vi) generating lip-synced frames for the face to the target speech; and (vii) generating in-sync lip-synced frames with the pre-trained lip-sync model.
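Steps (ii) through (iv) amount to masking the mouth region of the current crop and stacking it with a reference frame from another timestamp; a hedged sketch, with image shapes and the masking convention assumed:

```python
# Sketch of the input construction in steps (ii)-(iv): mask the lower half of
# each face crop at the current timestamp and pair it with a reference frame
# from a second timestamp. Shapes and the masking convention are assumptions.
import numpy as np

def build_model_input(crop_t1: np.ndarray, reference_t2: np.ndarray) -> np.ndarray:
    """crop_t1, reference_t2: (H, W, 3) face crops; returns an (H, W, 6) input."""
    masked = crop_t1.copy()
    h = masked.shape[0]
    masked[h // 2:, :, :] = 0.0          # hide the mouth region to be generated
    return np.concatenate([masked, reference_t2], axis=-1)  # channel-wise stack

crop = np.random.rand(96, 96, 3)
ref = np.random.rand(96, 96, 3)
x = build_model_input(crop, ref)
print(x.shape)  # (96, 96, 6) -> fed with the audio representation to the model
```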

Method and device for generating speech video using audio signal

A device according to an embodiment has one or more processors and a memory storing one or more programs executable by the one or more processors. The device includes a first encoder configured to receive a person background image corresponding to a video part of a speech video of a person and extract an image feature vector from the person background image, a second encoder configured to receive a speech audio signal corresponding to an audio part of the speech video and extract a voice feature vector from the speech audio signal, a combiner configured to generate a combined vector by combining the image feature vector output from the first encoder and the voice feature vector output from the second encoder, and a decoder configured to reconstruct the speech video of the person using the combined vector as an input.
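A minimal PyTorch sketch of the two-encoder/combiner/decoder arrangement described above; the layer sizes, the concatenation combiner, and the mel-style audio input are assumptions, not the disclosed network:

```python
# Minimal sketch of the described architecture: image encoder + audio encoder
# -> combined vector -> decoder reconstructing a speech-video frame. All layer
# sizes are arbitrary assumptions.
import torch
import torch.nn as nn

class SpeechVideoModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.image_encoder = nn.Sequential(   # person background image -> image feature vector
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 64))
        self.audio_encoder = nn.Sequential(   # speech audio features -> voice feature vector
            nn.Linear(80, 64), nn.ReLU())
        self.decoder = nn.Sequential(         # combined vector -> reconstructed frame
            nn.Linear(128, 3 * 32 * 32), nn.Unflatten(1, (3, 32, 32)))

    def forward(self, image, audio):
        combined = torch.cat(                 # combiner: concatenate the two vectors
            [self.image_encoder(image), self.audio_encoder(audio)], dim=1)
        return self.decoder(combined)

model = SpeechVideoModel()
frame = model(torch.rand(1, 3, 32, 32), torch.rand(1, 80))
print(frame.shape)  # torch.Size([1, 3, 32, 32])
```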

SYSTEMS AND METHODS FOR GENERATING COMPOSITE MEDIA USING DISTRIBUTED NETWORKS
20240371161 · 2024-11-07

Distributed systems and methods for generating composite media, including receiving a media context that defines the media to be generated, the media context including a definition of a sequence of media segment specifications and an identification of a set of remote devices. For each media segment specification, a reference segment may be generated and transmitted to at least one remote device. A media segment, recorded by a camera, may be received from each of the remote devices. Verified media segments may replace the corresponding reference segments. The media segments may be aggregated and an updated sequence of media segments may be defined. An instance of the media context that includes a subset of the updated sequence of media segments may then be generated.
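The media-context flow might be modeled with a few small data structures; the types and the verification rule below are illustrative assumptions:

```python
# Illustrative sketch of the media-context flow: an ordered sequence of
# segment specifications plus a set of remote devices; reference segments
# stand in until a verified device recording replaces them. All types and
# the verification rule are assumptions.
from dataclasses import dataclass

@dataclass
class Segment:
    spec: str
    source: str = "reference"      # "reference" until a verified recording arrives

@dataclass
class MediaContext:
    segments: list                 # ordered media segment specifications
    devices: set                   # identified remote devices

def ingest(ctx: MediaContext, index: int, device: str, verified: bool) -> None:
    """Replace a reference segment with a device-recorded one if verified."""
    if device in ctx.devices and verified:
        ctx.segments[index].source = device

ctx = MediaContext([Segment("intro"), Segment("outro")], {"phone-A", "phone-B"})
ingest(ctx, 0, "phone-A", verified=True)
print([(s.spec, s.source) for s in ctx.segments])
# [('intro', 'phone-A'), ('outro', 'reference')]
```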

Generating a visually consistent alternative audio for redubbing visual speech

There are provided systems and methods for generating a visually consistent alternative audio for redubbing visual speech, using a processor configured to sample a dynamic viseme sequence corresponding to a given utterance by a speaker in a video, identify a plurality of phonemes corresponding to the dynamic viseme sequence, construct a graph of the plurality of phonemes that synchronize with a sequence of lip movements of the mouth of the speaker in the dynamic viseme sequence, and use the graph to generate an alternative phrase that substantially matches the sequence of lip movements of the mouth of the speaker in the video.
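Because each dynamic viseme is consistent with several phonemes, candidate redub phrases can be read off as paths through a layered phoneme graph; a toy sketch with an assumed viseme-to-phoneme table:

```python
# Hedged sketch of the redubbing idea: each dynamic viseme admits several
# phonemes, so alternative utterances are paths through a layered graph that
# stays consistent with the observed lip movements. The viseme-to-phoneme
# table is a toy assumption.
from itertools import product

VISEME_TO_PHONEMES = {
    "bilabial": ["p", "b", "m"],
    "open":     ["aa", "ah"],
    "rounded":  ["uw", "ow"],
}

def alternative_sequences(viseme_sequence):
    """Enumerate phoneme paths that synchronize with the lip-movement sequence."""
    layers = [VISEME_TO_PHONEMES[v] for v in viseme_sequence]
    return [" ".join(path) for path in product(*layers)]

for phrase in alternative_sequences(["bilabial", "open", "bilabial"])[:4]:
    print(phrase)   # e.g. 'p aa p', 'p aa b', ... candidate redub phonetics
```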

Systems and methods for speech animation using visemes with phonetic boundary context

Speech animation may be performed using visemes with phonetic boundary context. A viseme unit may comprise an animation that simulates lip movement of an animated entity. Individual ones of the viseme units may correspond to one or more complete phonemes and phoneme context of the one or more complete phonemes. Phoneme context may include a phoneme that is adjacent to the one or more complete phonemes that correspond to a given viseme unit. Potential sets of viseme units that correspond with individual phoneme string portions may be determined. One of the potential sets of viseme units may be selected for individual ones of the phoneme string portions based on a fit metric that conveys a match between individual ones of the potential sets and the corresponding phoneme string portion.
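A sketch of the selection step, assuming a simple context-overlap count as the fit metric (the abstract does not define the metric):

```python
# Sketch of viseme-unit selection with phonetic boundary context: each
# candidate unit covers some complete phonemes plus adjacent context
# phonemes, and the unit whose context best matches the phoneme string
# portion wins. The fit metric (context-overlap count) is an assumption.
CANDIDATE_UNITS = [
    {"phonemes": ("b", "aa"), "context": ("sil", "t"), "clip": "unit_01"},
    {"phonemes": ("b", "aa"), "context": ("m", "t"),   "clip": "unit_02"},
]

def fit(unit, left_context, right_context):
    """Higher is better: counts boundary phonemes the unit was animated with."""
    return (unit["context"][0] == left_context) + (unit["context"][1] == right_context)

def select_unit(portion, left_context, right_context):
    candidates = [u for u in CANDIDATE_UNITS if u["phonemes"] == portion]
    return max(candidates, key=lambda u: fit(u, left_context, right_context))

print(select_unit(("b", "aa"), "m", "t")["clip"])   # -> 'unit_02'
```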

Image Processing Device
20180061109 · 2018-03-01

Device comprising a memory (2) storing sound data, three-dimensional surface data, and a plurality of control data sets representing control points, defined by coordinate data and associated with sound data, and a processor (4) which, on the basis of first and second successive sound data and of first three-dimensional surface data, selects the control data sets associated with the first and second sound data and defines second three-dimensional surface data by applying a displacement to each point. The displacement of a given point is calculated as the sum, over the control points, of displacement vectors, each obtained as the sum of a first and a second vector weighted by a ratio: the value of a two-variable function with a zero limit at infinity, applied to the given point and the control point, divided by the sum of the values of this function applied to the given point and each of the control points. The first vector represents the displacement of the control point between the first and the second sound data. The second vector is the difference between the coordinate data of the point and the coordinate data of the control point in the first sound data, multiplied by a coefficient dependent on the gradient of the first vector.
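Read as formulas, the weighting described above is a normalized kernel sum; in assumed notation (not the patent's):

```latex
% Assumed notation: d(p) is the displacement applied to point p, the c_i are
% the control points selected for the first sound data, v^{(1)}_i is control
% point i's displacement between the first and second sound data, f is the
% two-variable function with a zero limit at infinity, and alpha is the
% coefficient depending on the gradient of the first vector.
\[
  d(p) \;=\; \sum_i \frac{f(p, c_i)}{\sum_j f(p, c_j)}
             \left( v^{(1)}_i + v^{(2)}_i \right),
  \qquad
  v^{(2)}_i \;=\; \alpha\!\left(\nabla v^{(1)}_i\right)\,\bigl(p - c_i\bigr).
\]
```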

PROVIDING AUDIO AND VIDEO FEEDBACK WITH CHARACTER BASED ON VOICE COMMAND
20180047391 · 2018-02-15

Provided are methods of dynamically and selectively providing audio and video feedback in response to a voice command. A method may include recognizing a voice command in user speech received through a user device, generating at least one of audio data and video data by analyzing the voice command and associated context information, and selectively outputting the audio data and the video data through at least one of a display device and a speaker coupled to the user device, based on the analysis result.
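A minimal sketch of the selective-output decision, with an assumed routing rule (visual answers only when a display is available):

```python
# Minimal sketch of the selective-output step: analyze a recognized command
# plus context, then route generated audio/video feedback to the speaker
# and/or display. The routing rule here is an illustrative assumption.
def respond(command: str, context: dict) -> dict:
    """Return which feedback channels to use and with what payload."""
    wants_visual = "show" in command or "weather" in command
    out = {"audio": f'Spoken answer to "{command}"'}
    if wants_visual and context.get("display_available", False):
        out["video"] = f'Character animation answering "{command}"'
    return out

print(respond("show me the weather", {"display_available": True}))
print(respond("what time is it",     {"display_available": False}))
```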