G10L2021/105

Configuration for remote multi-channel language interpretation performed via imagery and corresponding audio at a display-based device

A configuration is implemented to receive, with a processor from a customer care platform, a request for spoken language interpretation of a user query from a first spoken language to a second spoken language. The first spoken language is spoken by a user situated at a display-based device that is remotely situated from the customer care platform. The user query is sent from the display-based device by the user to the customer care platform. The configuration performs, at a language interpretation platform, a first spoken language interpretation of the user query from the first spoken language to the second spoken language. Further, the configuration transmits, from the language interpretation platform to the customer care platform, the first spoken language interpretation so that a customer care representative speaking the second spoken language understands the first spoken language being spoken by the user.

Facial recognition method for video conference and server using the method

A facial recognition method for video conferencing that requires reduced bandwidth and transmits video and audio frames synchronously first determines whether a 3D body model of a first user at a local end has already been retrieved, or is otherwise retrievable, from a historical database. Multiple audio frames of the first user are collected, and audio frequencies in a specific range are filtered out. An envelope curve of the first audio frames, together with multiple attack time periods and multiple release time periods of the envelope curve, is calculated and correlated with lip movements of the first user. Information packets of the same, along with head-rotating and limb-swinging images of the first user, are transmitted to a remote second user so that the 3D body model can simulate and show the lip shapes and other movements of the first user.
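The envelope-curve step described above can be sketched minimally. In the function below, the per-frame peak envelope, the 0.5 threshold, and all names are illustrative assumptions rather than details taken from the patent; upward threshold crossings stand in for attack periods and downward crossings for release periods:

```python
import numpy as np

def envelope_attack_release(frames, threshold=0.5):
    """Compute a simple amplitude envelope over audio frames and locate
    attack periods (envelope rising past a threshold) and release periods
    (envelope falling back below it). Parameters are illustrative."""
    env = np.array([np.max(np.abs(f)) for f in frames])  # per-frame peak envelope
    above = env >= threshold * env.max()
    attacks, releases = [], []
    for i in range(1, len(above)):
        if above[i] and not above[i - 1]:
            attacks.append(i)   # upward crossing: mouth plausibly opening
        elif not above[i] and above[i - 1]:
            releases.append(i)  # downward crossing: mouth plausibly closing
    return env, attacks, releases
```

Attack and release indices like these could then be correlated with lip-opening and lip-closing motions of the 3D body model.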

Systems And Methods For Machine-Generated Avatars
20200321020 · 2020-10-08

Systems and methods are disclosed for creating a machine-generated avatar. A machine-generated avatar is an avatar generated by processing video and audio information extracted from a recording of a human speaking a reading corpus, enabling the created avatar to say an unlimited number of utterances, i.e., utterances that were not recorded. The video and audio processing consists of machine learning algorithms that may create predictive models based upon pixel, semantic, phonetic, intonation, and wavelet features.

USING MACHINE-LEARNING MODELS TO DETERMINE MOVEMENTS OF A MOUTH CORRESPONDING TO LIVE SPEECH
20200294495 · 2020-09-17

Disclosed systems and methods predict visemes from an audio sequence. In an example, a viseme-generation application accesses a first audio sequence that is mapped to a sequence of visemes. The first audio sequence has a first length and represents phonemes. The application adjusts a second length of a second audio sequence such that the second length equals the first length and represents the phonemes. The application adjusts the sequence of visemes to the second audio sequence such that phonemes in the second audio sequence correspond to the phonemes in the first audio sequence. The application trains a machine-learning model with the second audio sequence and the sequence of visemes. The machine-learning model predicts an additional sequence of visemes based on an additional sequence of audio.
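The length-normalization step, in which the second audio sequence is stretched to the first sequence's length and the viseme sequence is re-timed accordingly, can be sketched as follows. Linear interpolation and nearest-neighbour index mapping are stand-ins chosen for illustration; the abstract does not specify the warping method:

```python
import numpy as np

def stretch_audio(audio, target_len):
    """Resample an audio sequence to target_len samples by linear
    interpolation (an assumed, simple form of length adjustment)."""
    src = np.linspace(0, len(audio) - 1, target_len)
    return np.interp(src, np.arange(len(audio)), audio)

def align_visemes(visemes, len_first, len_second):
    """Re-time a viseme sequence defined over an audio sequence of length
    len_first so it covers a sequence of length len_second."""
    idx = np.floor(np.arange(len_second) * len_first / len_second).astype(int)
    return [visemes[i] for i in idx]
```

Pairs of (stretched audio, re-timed visemes) produced this way would form the training examples for the machine-learning model.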

Viseme data generation

Systems and methods for viseme data generation are disclosed. Uncompressed audio data is generated and/or utilized to determine the beats per minute of the audio data. Visemes are associated with the audio data utilizing a Viterbi algorithm and the beats per minute. A time-stamped list of viseme data is generated that associates the visemes with the portions of the audio data that they correspond to. An animatronic toy and/or an animation is caused to lip sync using the viseme data while audio corresponding to the audio data is output.
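The Viterbi assignment step can be illustrated with a minimal decoder. Here `emission` and `transition` are assumed log-score matrices supplied by the caller; the beats-per-minute weighting the abstract mentions is omitted from this sketch:

```python
import numpy as np

def viterbi_visemes(emission, transition):
    """Minimal Viterbi decode. emission[t, s] is the log-score of viseme s
    at audio frame t; transition[s1, s2] is the log-score of moving from
    viseme s1 to s2. Returns the best viseme index per frame."""
    T, S = emission.shape
    score = emission[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transition   # score of each (prev, curr) pair
        back[t] = cand.argmax(axis=0)        # best previous viseme per current
        score = cand.max(axis=0) + emission[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

Pairing each decoded viseme index with its frame timestamp would yield the time-stamped viseme list used to drive the animatronic toy or animation.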

Audio-based face tracking and lip syncing for natural facial animation and lip movement

In one embodiment, a method includes receiving an audio signal comprising a plurality of speech units, processing the audio signal to associate each of the speech units with a corresponding lip animation, determining pitch information associated with each of the plurality of speech units, processing the pitch information of each of the plurality of speech units to associate at least one of the speech units with a facial-component animation, and presenting the audio signal with a displayed animation of a face, wherein the animation of the face displays the lip animation associated with each of the speech units and the facial-component animation associated with the at least one speech unit. The animation of the face may be displayed in real time with the audio signal. The facial-component animation may include animation of the lips, eyebrows, eyelids, and other portions of the upper face.
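The mapping from pitch information to a facial-component animation could be as simple as a threshold rule. The baseline frequency, the 1.3/0.8 multipliers, and the animation names below are illustrative assumptions, not values from the disclosure:

```python
def facial_component_for_pitch(pitch_hz, baseline_hz=150.0):
    """Pick a facial-component animation for a speech unit from its pitch.
    Speech units pitched well above the speaker's baseline trigger an
    eyebrow raise; unusually low pitch lowers the eyelids (assumed rule)."""
    if pitch_hz > 1.3 * baseline_hz:
        return "eyebrow_raise"
    if pitch_hz < 0.8 * baseline_hz:
        return "eyelid_lower"
    return None  # no facial-component animation for this speech unit
```

A real system would estimate `pitch_hz` per speech unit from the audio signal and blend the selected animation with the lip animation already assigned to that unit.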

Disambiguation of virtual reality information using multi-modal data including speech

Functionality is disclosed herein for using a framework for a VR/AR application to utilize different services. In some configurations, a VR/AR application can utilize different services, such as an animation service, a multi-modal disambiguation service, a virtual platform service, a recognition service, an automatic speech recognition (ASR) service, a text-to-speech (TTS) service, a search service, as well as one or more other services. Instead of a developer of the VR/AR application having to develop programming code to implement features provided by one or more of the services, the developer may utilize functionality of existing services that are available from a service provider network.

Method and Apparatus for Processing Information
20200234478 · 2020-07-23

Embodiments of the present disclosure provide a method and apparatus for processing information. A method may include: generating voice response information based on voice information sent by a user; generating a phoneme sequence based on the voice response information; generating mouth movement information based on the phoneme sequence, the mouth movement information being used for controlling a mouth movement of a displayed three-dimensional human image when playing the voice response information; and playing the voice response information, and controlling the mouth movement of the three-dimensional human image based on the mouth movement information.
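The phoneme-to-mouth-movement step can be sketched as a many-to-one lookup followed by time-stamping. The table entries, the 40 ms frame duration, and the output format below are illustrative assumptions; the disclosure does not fix them:

```python
# Illustrative many-to-one phoneme-to-mouth-shape table (ARPAbet-like keys).
PHONEME_TO_MOUTH = {
    "p": "closed", "b": "closed", "m": "closed",
    "f": "lip_teeth", "v": "lip_teeth",
    "aa": "wide_open", "iy": "spread", "uw": "rounded",
}

def mouth_movements(phonemes, frame_ms=40):
    """Turn a phoneme sequence into time-stamped mouth movement information
    suitable for driving a displayed 3D human image (assumed format)."""
    return [
        {"t_ms": i * frame_ms, "shape": PHONEME_TO_MOUTH.get(p, "neutral")}
        for i, p in enumerate(phonemes)
    ]
```

Playing the voice response while stepping through this list would keep the image's mouth movement synchronized with the synthesized audio.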

PERIOCULAR AND AUDIO SYNTHESIS OF A FULL FACE IMAGE
20200226830 · 2020-07-16

Systems and methods for synthesizing an image of the face by a head-mounted device (HMD) are disclosed. The HMD may not be able to observe a portion of the face. The systems and methods described herein can generate a mapping from a conformation of the portion of the face that is not imaged to a conformation of the portion of the face observed. The HMD can receive an image of a portion of the face and use the mapping to determine a conformation of the portion of the face that is not observed. The HMD can combine the observed and unobserved portions to synthesize a full face image.
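One simple way to realize such a mapping is a least-squares linear regression from observed (periocular) feature vectors to parameters of the unobserved lower face. This is a stand-in for whatever mapping the HMD actually learns; the feature and parameter representations are assumptions:

```python
import numpy as np

def fit_face_mapping(observed, full):
    """Fit a linear map W so that observed @ W approximates the lower-face
    parameters, using least squares (an assumed, minimal model)."""
    X = np.asarray(observed)
    Y = np.asarray(full)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def synthesize_lower_face(W, periocular):
    """Predict unobserved lower-face parameters from a periocular vector."""
    return np.asarray(periocular) @ W
```

At run time, the HMD would apply the fitted map to each captured periocular image's features and stitch the predicted conformation onto the observed portion to form the full face image.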

System, method, and computer program for transmitting face models based on face data points

A system, method, and computer program are provided for transmitting face models based on face data points. In use, a first image is received and at least one face associated with the first image is identified. Next, a face model is created of the at least one face by determining a structure of the at least one face, wherein the face model includes one or more face data points. The face model is transmitted. Additionally, a real-time stream is enabled of the at least one face, and a real-time face model is determined of the real-time stream using the face model. The real-time face model is then transmitted.
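A face model built from data points lends itself to compact transmission: send the full landmark set once, then stream only the points that moved. The JSON packet format and the delta-encoding threshold below are illustrative assumptions, not claimed details:

```python
import json

def face_model_packet(landmarks, prev=None, eps=1e-3):
    """Serialize a face model as (x, y) landmark data points. For a
    real-time stream, include only points that moved more than eps since
    the previous frame (assumed optimization)."""
    if prev is None:
        changed = dict(enumerate(landmarks))  # first frame: send everything
    else:
        changed = {i: p for i, (p, q) in enumerate(zip(landmarks, prev))
                   if abs(p[0] - q[0]) > eps or abs(p[1] - q[1]) > eps}
    return json.dumps({"points": {str(i): p for i, p in changed.items()}})
```

The receiver would apply each packet's points onto its copy of the face model, reconstructing the real-time face model without retransmitting the full structure every frame.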