G10L21/10

Telephone system for the hearing impaired

A telephone system is described herein that is configured to assist a hearing-impaired person with telephone communications as well as face-to-face conversations. In telephone communication sessions, the telephone system audibly emits spoken utterances while simultaneously depicting a transcription of those utterances on a display. When not employed in a telephone communication session, the telephone system displays transcriptions of spoken utterances of people who are in proximity to the telephone system.
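The abstract's two captioning modes (in-call versus ambient) can be sketched as a simple mode switch; the function and field names below are illustrative, not from the patent:

```python
def route_caption(in_call, utterance_text):
    """Select the captioning mode the abstract describes: during a
    telephone session, caption the call audio; otherwise caption
    nearby speakers picked up by the device's microphone."""
    mode = "call" if in_call else "ambient"
    return {"mode": mode, "caption": utterance_text}

in_call_caption = route_caption(True, "hello")
ambient_caption = route_caption(False, "nice weather today")
```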

System and method for data augmentation for multi-microphone signal processing

A method, computer program product, and computing system for receiving a signal from each microphone of a plurality of microphones, thus defining a plurality of signals. One or more inter-microphone gain-based augmentations may be performed on the plurality of signals, thus defining one or more inter-microphone gain-augmented signals.
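The abstract does not specify the augmentation itself; one plausible reading, sketched below, scales each microphone channel by an independent random gain drawn in dB (the function name, the (num_mics, num_samples) layout, and the ±6 dB range are all assumptions):

```python
import numpy as np

def gain_augment(signals, rng=None, max_gain_db=6.0):
    """Apply a random per-microphone gain to a multi-channel signal.

    `signals` is a (num_mics, num_samples) array; each channel is
    scaled by an independent gain drawn uniformly in dB, simulating
    inter-microphone level differences for data augmentation.
    """
    rng = rng or np.random.default_rng()
    num_mics = signals.shape[0]
    gains_db = rng.uniform(-max_gain_db, max_gain_db, size=num_mics)
    gains = 10.0 ** (gains_db / 20.0)          # dB -> linear amplitude
    return signals * gains[:, np.newaxis]

# Example: four microphones, one second of audio at 16 kHz
mics = np.random.default_rng(0).standard_normal((4, 16000))
augmented = gain_augment(mics)
```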

Systems and methods for machine-generated avatars
11551705 · 2023-01-10

Systems and methods are disclosed for creating a machine-generated avatar. A machine-generated avatar is an avatar generated by processing video and audio information extracted from a recording of a human speaking a reading corpus, enabling the created avatar to say an unlimited number of utterances, i.e., utterances that were not recorded. The video and audio processing uses machine learning algorithms that may create predictive models based upon pixel, semantic, phonetic, intonation, and wavelet features.

METHOD FOR OUTPUTTING BLEND SHAPE VALUE, STORAGE MEDIUM, AND ELECTRONIC DEVICE

A method for outputting a blend shape value includes: performing feature extraction on obtained target audio data to obtain a target audio feature vector; inputting the target audio feature vector and a target identifier into an audio-driven animation model; inputting the target audio feature vector into an audio encoding layer, determining an input feature vector of a next layer at a (2t−n)/2 time point based on an input feature vector of a previous layer between a t time point and a t−n time point, determining a feature vector having a causal relationship with the input feature vector of the previous layer as a valid feature vector, outputting sequentially target-audio encoding features, and inputting the target identifier into a one-hot encoding layer for binary vector encoding to obtain a target-identifier encoding feature; and outputting a blend shape value corresponding to the target audio data.
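The one-hot identifier encoding and the causal midpoint rule lend themselves to a minimal sketch; taking the mean over the causal window is an illustrative stand-in for the model's learned layer, and all names here are assumptions:

```python
import numpy as np

def one_hot(identifier, num_identities):
    """Binary (one-hot) vector encoding of a target identifier."""
    vec = np.zeros(num_identities)
    vec[identifier] = 1.0
    return vec

def causal_midpoint(prev_layer, t, n):
    """Next-layer feature at time (2t - n) / 2, computed from the
    previous layer's window [t - n, t]. Only features at or before
    the midpoint are treated as valid (causal), per the abstract."""
    mid = (2 * t - n) // 2
    window = prev_layer[t - n : mid + 1]   # causal (valid) features only
    return window.mean(axis=0)

identity_code = one_hot(2, 4)
features = np.arange(10, dtype=float).reshape(10, 1)  # toy previous-layer features
mid_feature = causal_midpoint(features, t=9, n=4)
```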

System and method for dual mode presentation of content in a target language to improve listening fluency in the target language
11551568 · 2023-01-10

Embodiments of a language learning system and method for implementing or assisting in self-study for improving listening fluency in a target language are disclosed. Such embodiments may simultaneously present the same piece of content in an auditory presentation and a corresponding visual presentation of a transcript of the auditory presentation, where the two presentations are adapted to work in tandem to increase the effectiveness of language learning for users.
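Presenting the auditory and visual modes in tandem implies time-aligned transcript segments; a minimal sketch, assuming the transcript is a list of (start_sec, end_sec, text) tuples (a representation not specified in the abstract):

```python
def active_segment(transcript, t):
    """Return the transcript text whose time span covers playback
    time t, so the visual presentation tracks the audio in tandem.
    `transcript` is a list of (start_sec, end_sec, text) tuples."""
    for start, end, text in transcript:
        if start <= t < end:
            return text
    return None

segments = [(0.0, 1.5, "Hello"), (1.5, 3.0, "world")]
current = active_segment(segments, 2.0)
```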

Systems and methods for animation generation

Systems and methods for animating from audio in accordance with embodiments of the invention are illustrated. One embodiment includes a method for generating animation from audio. The method includes steps for receiving input audio data, generating an embedding for the input audio data, and generating several predictions for several tasks from the generated embedding. The several predictions include at least one of blendshape weights, event detection, and/or voice activity detection. The method includes steps for generating a final prediction from the several predictions, where the final prediction includes a set of blendshape weights, and generating an output based on the generated final prediction.
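The shared-embedding, multi-task structure can be sketched with illustrative linear heads; the 52 blendshapes, 8 event classes, and the VAD-gating rule for the final prediction are assumptions, not taken from the abstract:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
embedding = rng.standard_normal(64)         # shared audio embedding

# Illustrative linear heads, one per task
W_blend = rng.standard_normal((52, 64))     # 52 ARKit-style blendshapes
W_event = rng.standard_normal((8, 64))      # 8 event classes
W_vad = rng.standard_normal((1, 64))        # voice activity

blendshapes = sigmoid(W_blend @ embedding)  # weights in [0, 1]
events = sigmoid(W_event @ embedding)
vad = sigmoid(W_vad @ embedding)

# Final prediction: gate the blendshape weights by detected voice activity
final_blendshapes = blendshapes * (vad > 0.5)
```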

SPEECH IMAGE PROVIDING METHOD AND COMPUTING DEVICE FOR PERFORMING THE SAME
20230005202 · 2023-01-05

A computing device according to an embodiment includes one or more processors, a memory storing one or more programs executed by the one or more processors, a standby state image generating module configured to generate a standby state image in which a person is in a standby state, and generate a back-motion image set including a plurality of back-motion images at a preset frame interval from the standby state image for image interpolation with a preset reference frame of the standby state image, a speech state image generating module configured to generate a speech state image in which a person is in a speech state based on a source of speech content, and an image playback module configured to generate a synthetic speech image by combining the standby state image and the speech state image while playing the standby state image.
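Storing back-motion images at a preset frame interval suggests a nearest-frame lookup during playback before splicing in the speech-state image; a minimal sketch under that assumption (the function name and clamping rule are illustrative):

```python
def nearest_back_motion(frame_index, interval, num_back_frames):
    """Pick the back-motion image captured closest to the current
    standby frame. Back-motion images are stored every `interval`
    standby frames, so back-motion index k covers standby frame
    k * interval; the result is clamped to the stored set."""
    k = round(frame_index / interval)
    return max(0, min(k, num_back_frames - 1))

# Standby frame 7, back-motion images stored every 5 frames, 3 stored images
chosen = nearest_back_motion(7, 5, 3)
```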

DIARISATION AUGMENTED REALITY AIDE

An image of a real-world environment including one or more users is received from an image capture device. A mask status of a first user is determined by a processor based on the image. A stream of audio including speech from the one or more users is captured from one or more audio transceivers. A first user speech is identified in the stream of audio by the processor. The stream of audio is parsed, by the processor, based on the first user speech and on an audio processing technique, to create a first user speech element. An augmented view that includes the first user speech element is generated, for a wearable computing device, based on the first user speech and based on the mask status.
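Combining diarised speech with the mask status into an overlay element can be sketched as follows; the rule that a masked speaker's captions are emphasized (since the wearer cannot lip-read) is an illustrative assumption, as are all names:

```python
from dataclasses import dataclass

@dataclass
class SpeechElement:
    speaker_id: int
    text: str
    emphasized: bool   # render prominently when lips are not visible

def parse_speech(speaker_id, transcript, mask_on):
    """Turn diarised speech for one user into an overlay element for
    the augmented view. A masked speaker's captions are emphasized;
    this rule is illustrative, not from the abstract."""
    return SpeechElement(speaker_id, transcript.strip(), emphasized=mask_on)

element = parse_speech(1, "  hello there  ", mask_on=True)
```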