G10L21/10

METHOD AND APPARATUS FOR PROCESSING SPEECH, ELECTRONIC DEVICE AND STORAGE MEDIUM

A method for processing speech includes: acquiring an original speech; extracting a spectrogram from the original speech; acquiring a speech synthesis model, where the speech synthesis model comprises a first generation sub-model and a second generation sub-model; generating a harmonic structure of the spectrogram by invoking the first generation sub-model to process the spectrogram; and generating a target speech by invoking the second generation sub-model to process the harmonic structure and the spectrogram.
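The two-stage pipeline described above can be sketched as follows. The abstract does not specify the internals of either sub-model, so both are stand-ins here: the "harmonic structure" generator is approximated by keeping the strongest spectral bins, and the second sub-model by a simple blend-and-invert step.

```python
import numpy as np

def extract_spectrogram(speech, n_fft=64, hop=32):
    """Magnitude spectrogram via a short-time FFT (simplified, no windowing)."""
    frames = [speech[i:i + n_fft] for i in range(0, len(speech) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

def first_sub_model(spectrogram):
    """Stand-in for the harmonic-structure generator: keep the strongest bins per frame."""
    mask = spectrogram >= np.percentile(spectrogram, 90, axis=1, keepdims=True)
    return spectrogram * mask

def second_sub_model(harmonic, spectrogram):
    """Stand-in for the target-speech generator: blend both inputs and invert to time domain."""
    blended = 0.5 * harmonic + 0.5 * spectrogram
    return np.fft.irfft(blended, axis=1).ravel()

# A 440 Hz tone as the "original speech".
speech = np.sin(2 * np.pi * 440 * np.arange(1024) / 16000)
spec = extract_spectrogram(speech)
harmonic = first_sub_model(spec)
target = second_sub_model(harmonic, spec)
```

The key structural point is that the second sub-model consumes *both* the harmonic structure and the original spectrogram, matching the claim language.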

FINGERPRINTING DATA TO DETECT VARIANCES
20230017644 · 2023-01-19

A system and method for characterizing the data used to train a model for machine learning inference. Training data and production data may both be fingerprinted, and the fingerprints may be compared to detect undesirable variances between training and production data. This may allow performance issues relating to differences in the training data set versus the production data set to be more easily identified. Parameters used for characterization can be determined based on the type of training data such as numerical data, image data, or audio data.
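For numerical data, a fingerprint of the kind described could be a vector of distribution statistics, with variance detection as a tolerance check between the training-data and production-data fingerprints. The statistics and the tolerance rule below are illustrative choices, not taken from the abstract:

```python
import statistics

def fingerprint(values):
    """Characterize a numeric data set by simple distribution statistics."""
    return {
        "mean": statistics.fmean(values),
        "stdev": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }

def variance_report(train_fp, prod_fp, tolerance=0.25):
    """Flag every statistic whose relative drift exceeds the tolerance."""
    flagged = []
    for key in train_fp:
        base = abs(train_fp[key]) or 1.0
        if abs(train_fp[key] - prod_fp[key]) / base > tolerance:
            flagged.append(key)
    return flagged

train = fingerprint([1.0, 2.0, 3.0, 4.0, 5.0])
prod = fingerprint([10.0, 20.0, 30.0, 40.0, 50.0])
drifted = variance_report(train, prod)  # production data is scaled 10x, so all stats drift
```

Image or audio data would call for different characterization parameters (e.g. histogram or spectral summaries), which is the type-dependent choice the abstract alludes to.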

ELECTRONIC DEVICE FOR GENERATING MOUTH SHAPE AND METHOD FOR OPERATING THEREOF

An electronic device includes at least one processor and at least one memory storing instructions executable by the at least one processor and operatively connected to the at least one processor. The at least one processor is configured to: acquire voice data to be synthesized with at least one first image; generate a plurality of mouth shape candidates by using the voice data; select a mouth shape candidate from among the plurality of mouth shape candidates; generate at least one second image based on the selected mouth shape candidate and at least a portion of each of the at least one first image; and generate at least one third image by applying at least one super-resolution model to the at least one second image.
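The candidate-generation, selection, and super-resolution steps can be sketched as below. Everything here is a stand-in: mouth shapes are reduced to a single "openness" value, selection scores candidates against the voice frame's energy, and the super-resolution model is replaced by nearest-neighbour upsampling.

```python
import random

def generate_candidates(voice_frame, n=4, seed=0):
    """Hypothetical generator: propose several mouth-openness values for a voice frame."""
    rng = random.Random(seed)
    energy = sum(abs(s) for s in voice_frame) / len(voice_frame)
    return [min(1.0, energy + rng.uniform(-0.1, 0.1)) for _ in range(n)]

def select_candidate(candidates, energy):
    """Stand-in scoring rule: pick the candidate closest to the frame's energy."""
    return min(candidates, key=lambda c: abs(c - energy))

def upscale(image, factor=2):
    """Stand-in for the super-resolution model: nearest-neighbour upsampling."""
    return [[px for px in row for _ in range(factor)]
            for row in image for _ in range(factor)]

frame = [0.2, -0.4, 0.6, -0.8]          # one frame of "voice data"
energy = sum(abs(s) for s in frame) / len(frame)
cands = generate_candidates(frame)       # plurality of mouth shape candidates
best = select_candidate(cands, energy)   # selected candidate
mouth_patch = [[best, best], [best, best]]  # toy "second image" built from the candidate
hi_res = upscale(mouth_patch)            # "third image" after super-resolution
```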

PROCESSING SPEECH SIGNALS OF A USER TO GENERATE A VISUAL REPRESENTATION OF THE USER
20230223022 · 2023-07-13

A computing system for generating image data representing a speaker's face includes a detection device configured to route data representing a voice signal to one or more processors, and a data processing device comprising the one or more processors, configured to generate a representation of the speaker who produced the voice signal in response to receiving it. The data processing device executes a voice embedding function to generate, from the voice signal, a feature vector representing one or more signal features of the voice signal; maps a signal feature of the feature vector to a visual feature of the speaker via a modality transfer function specifying a relationship between the visual feature and the signal feature; and generates a visual representation of at least a portion of the speaker based on the mapping, the visual representation comprising the visual feature.
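The embedding → modality-transfer → rendering chain can be sketched as follows. The two signal features (RMS energy and zero-crossing rate), the linear mapping, and the named visual features are all illustrative assumptions; the abstract specifies only that such a mapping exists.

```python
import math

def voice_embedding(signal):
    """Toy embedding: a two-element feature vector (RMS energy, zero-crossing rate)."""
    rms = math.sqrt(sum(s * s for s in signal) / len(signal))
    zcr = sum(1 for a, b in zip(signal, signal[1:]) if a * b < 0) / len(signal)
    return [rms, zcr]

def modality_transfer(feature_vector):
    """Hypothetical linear mapping from signal features to visual features."""
    rms, zcr = feature_vector
    return {"jaw_width": 0.4 + 0.5 * rms, "lip_thickness": 0.3 + 0.4 * zcr}

def render(visual_features):
    """Stand-in renderer: emit the generated portion of the face as rounded values."""
    return {k: round(v, 3) for k, v in visual_features.items()}

signal = [0.0, 0.5, -0.5, 0.5, -0.5, 0.0]   # toy voice signal
features = voice_embedding(signal)           # feature vector of signal features
face = render(modality_transfer(features))   # visual representation comprising the features
```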

Automated transcript generation from multi-channel audio

Systems and methods are described for generating a transcript of a legal proceeding or other multi-speaker conversation or performance in real time or near-real time using multi-channel audio capture. Different speakers or participants in a conversation may each be assigned a separate microphone that is placed in proximity to the given speaker, where each audio channel includes audio captured by a different microphone. Filters may be applied to isolate each channel to include speech utterances of a different speaker, and these filtered channels of audio data may then be processed in parallel to generate speech-to-text results that are interleaved to form a generated transcript.
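The per-channel processing and interleaving step can be sketched as below. The speech-to-text stage is a stand-in that assumes each channel already yields time-stamped text segments; the interleaving itself is just a merge by start time, which is the structural point of the abstract.

```python
def transcribe_channel(speaker, segments):
    """Stand-in STT: each segment is (start_time, text) already recognized on one channel."""
    return [(start, speaker, text) for start, text in segments]

def interleave(*channels):
    """Merge per-channel results into one transcript ordered by utterance start time."""
    merged = sorted((seg for ch in channels for seg in ch), key=lambda s: s[0])
    return [f"{speaker}: {text}" for _, speaker, text in merged]

# Two microphones, one per participant; channels are processed independently.
ch1 = transcribe_channel("Counsel", [(0.0, "Please state your name."), (6.2, "Thank you.")])
ch2 = transcribe_channel("Witness", [(2.5, "Jane Doe.")])
transcript = interleave(ch1, ch2)
```

In a real system the two `transcribe_channel` calls would run in parallel against separate filtered audio streams; only the final merge needs all channels.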

Viseme data generation for presentation while content is output

Systems and methods for viseme data generation are disclosed. Uncompressed audio data is generated and/or utilized to determine the beats per minute of the audio data. Visemes are associated with the audio data utilizing a Viterbi algorithm and the beats per minute. A time-stamped list of viseme data is generated that associates the visemes with the portions of the audio data that they correspond to. An animatronic toy and/or an animation is caused to lip sync using the viseme data while audio corresponding to the audio data is output.
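The output format described, a time-stamped list associating visemes with portions of the audio, can be sketched as below. The Viterbi alignment step is omitted; this sketch assumes the phoneme sequence is already known and simply derives timestamps from the beats per minute. The phoneme-to-viseme table is an illustrative subset.

```python
# Illustrative mapping; real systems use a larger phoneme-to-viseme table.
PHONEME_TO_VISEME = {"AA": "open", "M": "closed", "F": "teeth-lip", "UW": "rounded"}

def viseme_timeline(phonemes, bpm):
    """Assign one viseme per beat; timestamps follow from the beats per minute."""
    beat = 60.0 / bpm
    return [(round(i * beat, 3), PHONEME_TO_VISEME[p]) for i, p in enumerate(phonemes)]

# At 120 BPM each beat lasts 0.5 s, so visemes land at 0.0, 0.5, 1.0, ...
timeline = viseme_timeline(["M", "AA", "UW"], bpm=120)
```

An animation or animatronic driver would then consume `timeline` while the corresponding audio plays, firing each viseme at its timestamp.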

Telephone system for the hearing impaired

A telephone system is described that assists a hearing-impaired person with telephone communications as well as face-to-face conversations. During a telephone communication session, the system audibly emits spoken utterances while simultaneously depicting a transcription of those utterances on a display. When not engaged in a telephone communication session, the system displays transcriptions of spoken utterances of people in proximity to it.