Patent classifications
G10L21/10
DIARISATION AUGMENTED REALITY AIDE
An image of a real-world environment including one or more users is received from an image capture device. A mask status of a first user is determined by a processor based on the image. A stream of audio including speech from one or more users is captured from one or more audio transceivers. A first user speech from the stream of audio is identified by the processor. The stream of audio is parsed, by the processor and based on the first user speech and based on an audio processing technique, to create a first user speech element. An augmented view that includes the first user speech element is generated, for a wearable computing device, based on the first user speech and based on the mask status.
METHOD AND DEVICE FOR GENERATING SPEECH VIDEO USING AUDIO SIGNAL
A device according to an embodiment has one or more processors and a memory storing one or more programs executable by the one or more processors. The device includes a first encoder configured to receive a person background image corresponding to a video part of a speech video of a person and extract an image feature vector from the person background image, a second encoder configured to receive a speech audio signal corresponding to an audio part of the speech video and extract a voice feature vector from the speech audio signal, a combiner configured to generate a combined vector by combining the image feature vector output from the first encoder and the voice feature vector output from the second encoder, and a decoder configured to reconstruct the speech video of the person using the combined vector as an input.
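The data flow in this abstract (two encoders feeding a combiner) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not say how features are extracted or combined, so the toy encoders (`encode_image`, `encode_audio`) and the choice of concatenation in `combine` are hypothetical stand-ins, and the decoder is omitted.

```python
from typing import List, Sequence


def encode_image(pixels: Sequence[Sequence[float]]) -> List[float]:
    """Toy stand-in for the first encoder: one feature per image row (row mean)."""
    return [sum(row) / len(row) for row in pixels]


def encode_audio(samples: Sequence[float], frame: int = 4) -> List[float]:
    """Toy stand-in for the second encoder: mean energy of each full frame."""
    return [
        sum(s * s for s in samples[i:i + frame]) / frame
        for i in range(0, len(samples) - frame + 1, frame)
    ]


def combine(image_vec: Sequence[float], voice_vec: Sequence[float]) -> List[float]:
    """Combiner, assumed here to be plain concatenation of the two vectors."""
    return list(image_vec) + list(voice_vec)


# Hypothetical inputs: a 2x2 "image" and eight audio samples.
image = [[0.0, 1.0], [1.0, 1.0]]
audio = [1.0, -1.0, 1.0, -1.0, 0.0, 0.0, 0.0, 0.0]

combined = combine(encode_image(image), encode_audio(audio))
```

In a real system the combined vector would be passed to a learned decoder that reconstructs the video frames; that stage is not sketched here.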
METHOD AND DEVICE FOR GENERATING SPEECH MOVING IMAGE
A device for generating a speech moving image according to an embodiment includes a first encoder that receives a person background image in which a portion related to speech of a person that is a video part of the speech moving image of the person is covered with a mask, extracts an image feature vector from the person background image, and compresses the extracted image feature vector, a second encoder that receives a speech audio signal that is an audio part of the speech moving image, extracts a voice feature vector from the speech audio signal, and compresses the extracted voice feature vector, a combination unit that generates a combination vector of the compressed image feature vector and the compressed voice feature vector, and an image reconstruction unit that reconstructs the speech moving image of the person with the combination vector as an input.
System and Method For Identifying Sentiment (Emotions) In A Speech Audio Input
In a system and method for enabling a user to identify the emotions of speakers during a telephone or online conversation, spoken audio input is pre-processed using a one-dimensional Mel Spectrogram and/or a two-dimensional Mel-Frequency Cepstral Coefficient (MFCC) matrix, the two-dimensional matrix is reduced to a single-dimension output, and at least one emotion in the audio input is identified using a convolutional or recurrent neural network.
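The reduction of a two-dimensional coefficient matrix to a one-dimensional output can be sketched as follows. The abstract does not state which reduction is used, so averaging each coefficient across time frames is an assumption, and `reduce_to_1d` and the sample matrix are hypothetical. The matrix is hand-written rather than computed from real audio.

```python
from typing import List, Sequence


def reduce_to_1d(matrix: Sequence[Sequence[float]]) -> List[float]:
    """Collapse a (coefficients x frames) matrix to one value per coefficient.

    Assumption: the reduction is a mean over the time axis.
    """
    return [sum(row) / len(row) for row in matrix]


# Hypothetical MFCC-like matrix: 2 coefficients, 3 time frames.
mfcc_like = [
    [1.0, 3.0, 5.0],
    [2.0, 2.0, 2.0],
]

features = reduce_to_1d(mfcc_like)
```

The resulting vector has one entry per coefficient and could then be fed to a classifier such as the convolutional or recurrent network the abstract mentions.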
METHOD AND A SERVER FOR GENERATING A WAVEFORM
There are provided servers and methods of generating a waveform based on a spectrogram and a noise input. The method includes acquiring a trained flow-based vocoder including invertible blocks, and an untrained feed-forward vocoder including non-invertible blocks, which form a student-teacher network. The method includes executing a training process in the student-teacher network during which the server generates (i) a teacher waveform by the trained flow-based vocoder using a first spectrogram and a first noise input, (ii) a student waveform by the untrained feed-forward vocoder using the first spectrogram and the first noise input, and (iii) a loss value for the given training iteration using the teacher waveform and the student waveform. The server then trains the untrained feed-forward vocoder to generate the waveform. The trained feed-forward vocoder is then used in lieu of the trained flow-based vocoder for generating waveforms based on spectrograms and noise inputs.
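The student-teacher training loop described above can be sketched with a toy example. This is not the patented method: both "vocoders" are hypothetical one-line functions, the student has a single trainable weight, and the loss is assumed to be mean squared error between the two waveforms. The abstract specifies none of these details.

```python
from typing import List, Sequence


def teacher_vocoder(spectrogram: Sequence[float], noise: Sequence[float]) -> List[float]:
    """Toy stand-in for the trained teacher: a fixed mapping with gain 2.0."""
    return [2.0 * s + z for s, z in zip(spectrogram, noise)]


def student_vocoder(spectrogram: Sequence[float], noise: Sequence[float],
                    weight: float) -> List[float]:
    """Toy stand-in for the student: same form, but the gain is trainable."""
    return [weight * s + z for s, z in zip(spectrogram, noise)]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two waveforms (assumed loss)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def train_student(spectrogram: Sequence[float], noise: Sequence[float],
                  steps: int = 100, lr: float = 0.1) -> float:
    """Fit the student's weight by gradient descent on the teacher's output."""
    weight = 0.0
    for _ in range(steps):
        # Both networks receive the same spectrogram and the same noise input.
        target = teacher_vocoder(spectrogram, noise)
        output = student_vocoder(spectrogram, noise, weight)
        # Gradient of the MSE loss with respect to the student's weight.
        grad = sum(2.0 * (o - t) * s
                   for o, t, s in zip(output, target, spectrogram)) / len(spectrogram)
        weight -= lr * grad
    return weight


spectrogram = [1.0, 0.5, -1.0]
noise = [0.1, -0.2, 0.05]

initial_loss = mse(student_vocoder(spectrogram, noise, 0.0),
                   teacher_vocoder(spectrogram, noise))
trained_weight = train_student(spectrogram, noise)
final_loss = mse(student_vocoder(spectrogram, noise, trained_weight),
                 teacher_vocoder(spectrogram, noise))
```

After training, the student alone would be used for inference, which mirrors the abstract's final step of using the feed-forward vocoder in place of the flow-based one.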
Speech sentiment analysis using a speech sentiment classifier pretrained with pseudo sentiment labels
The present disclosure describes a system, method, and computer program for predicting sentiment labels for audio speech utterances using an audio speech sentiment classifier pretrained with pseudo sentiment labels. A speech sentiment classifier for audio speech (“a speech sentiment classifier”) is pretrained in an unsupervised manner by leveraging a pseudo labeler previously trained to predict sentiments for text. Specifically, a text-trained pseudo labeler is used to autogenerate pseudo sentiment labels for the audio speech utterances using transcriptions of the utterances, and the speech sentiment classifier is trained to predict the pseudo sentiment labels given corresponding embeddings of the audio speech utterances. The speech sentiment classifier is then fine-tuned using a sentiment-annotated dataset of audio speech utterances, which may be significantly smaller than the unannotated dataset used in the unsupervised pretraining phase.
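The pretraining step (pseudo labels from transcripts, classifier trained on audio embeddings) can be sketched as below. All names and data are hypothetical: `pseudo_label` is a keyword rule standing in for a trained text sentiment model, the embeddings are hand-written two-dimensional vectors, and the classifier is assumed to be a nearest-centroid model for brevity. The later fine-tuning on annotated data is not shown.

```python
from typing import Dict, List, Sequence


def pseudo_label(transcript: str) -> str:
    """Hypothetical text labeler: a keyword rule in place of a trained model."""
    positive_words = {"great", "thanks", "love"}
    words = set(transcript.lower().split())
    return "positive" if positive_words & words else "negative"


def fit_centroids(embeddings: Sequence[Sequence[float]],
                  labels: Sequence[str]) -> Dict[str, List[float]]:
    """Train a nearest-centroid classifier: one mean embedding per label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for vec, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids: Dict[str, List[float]], vec: Sequence[float]) -> str:
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def distance(label: str) -> float:
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[label]))
    return min(centroids, key=distance)


# Unannotated utterances: transcripts plus hand-written audio embeddings.
transcripts = ["thanks this is great", "this is terrible",
               "love it", "not working again"]
embeddings = [[0.9, 0.1], [0.1, 0.8], [0.8, 0.2], [0.2, 0.9]]

# Labels come from the text labeler, not from human annotation.
pseudo_labels = [pseudo_label(t) for t in transcripts]
centroids = fit_centroids(embeddings, pseudo_labels)
prediction = predict(centroids, [0.7, 0.3])
```

The key point is that the classifier only ever sees audio embeddings as input, while its training targets are derived from text.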
VIRTUAL OBJECT LIP DRIVING METHOD, MODEL TRAINING METHOD, RELEVANT DEVICES AND ELECTRONIC DEVICE
A virtual object lip driving method performed by an electronic device includes: obtaining a speech segment and target face image data about a virtual object; and inputting the speech segment and the target face image data into a first target model to perform a first lip driving operation, so as to obtain first lip image data about the virtual object driven by the speech segment. The first target model is trained in accordance with a first model and a second model, the first model is a lip-speech synchronization discriminative model with respect to lip image data, and the second model is a lip-speech synchronization discriminative model with respect to a lip region in the lip image data.