Patent classifications
G10L15/02
AUDIO MATCHING METHOD AND RELATED DEVICE
Embodiments of the present application disclose an audio matching method and a related device. The audio matching method includes: obtaining audio data and video data; extracting to-be-recognized audio information from the audio data; extracting lip movement information of N users from the video data, where N is an integer greater than 1; inputting the to-be-recognized audio information and the lip movement information of the N users into a target feature matching model, to obtain a matching degree between the lip movement information of each of the N users and the to-be-recognized audio information; and determining the user whose lip movement information has the highest matching degree as the target user to whom the to-be-recognized audio information belongs.
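As a hedged illustration of the final matching step, the sketch below scores each user's lip-movement embedding against the audio embedding and selects the argmax. The cosine-similarity scorer and the embedding shapes are assumptions; the abstract does not specify the internals of the target feature matching model.

```python
# Minimal sketch of the matching step, assuming hypothetical pre-trained
# encoders have already produced embeddings; cosine similarity stands in
# for the patent's unspecified "target feature matching model".
import torch
import torch.nn.functional as F

def match_audio_to_user(audio_feat: torch.Tensor,
                        lip_feats: torch.Tensor) -> tuple[int, torch.Tensor]:
    """audio_feat: (D,) embedding of the to-be-recognized audio.
    lip_feats: (N, D) embeddings of the N users' lip movement sequences.
    Returns the index of the best-matching user and all matching degrees."""
    scores = F.cosine_similarity(lip_feats, audio_feat.unsqueeze(0), dim=1)  # (N,)
    return int(scores.argmax()), scores

# Toy usage: 4 users, 128-dimensional embeddings.
audio = torch.randn(128)
lips = torch.randn(4, 128)
best_user, degrees = match_audio_to_user(audio, lips)
print(f"audio attributed to user {best_user}; matching degrees = {degrees.tolist()}")
```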
Method and apparatus for predicting customer satisfaction from a conversation
A method and an apparatus for predicting the satisfaction of a customer following a call between the customer and an agent. The method comprises receiving a transcribed text of the call, dividing the transcribed text into a plurality of conversational phases, extracting at least one call feature for each of the phases, receiving call metadata, extracting metadata features from the call metadata, combining the call features and the metadata features, and generating an output, using a trained machine learning (ML) model, based on the combined features, indicating whether or not the customer is satisfied. The ML model is trained to generate this output from an input of the combined features.
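A minimal sketch of the feature pipeline follows. The equal-length phase split, the negative-word call feature, the duration/hold metadata features, and the logistic-regression classifier are all illustrative assumptions; the abstract fixes only the overall combine-then-classify structure.

```python
# Hedged sketch: per-phase call features plus metadata features, combined
# and fed to a classifier. All concrete feature choices are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def phase_features(transcript: str, n_phases: int = 3) -> list[float]:
    """Split the transcript into equal-length chunks (an assumption; the
    patent divides by conversational phase) and extract one toy feature
    per phase: the fraction of 'negative' words."""
    words = transcript.lower().split()
    negative = {"no", "not", "problem", "cancel", "refund", "bad"}
    chunk = max(1, len(words) // n_phases)
    feats = []
    for i in range(n_phases):
        phase = words[i * chunk:(i + 1) * chunk] or [""]
        feats.append(sum(w in negative for w in phase) / len(phase))
    return feats

def combined_features(transcript: str, metadata: dict) -> np.ndarray:
    # Metadata features: normalized call duration and hold count
    # (illustrative choices, not from the abstract).
    meta = [metadata["duration_s"] / 600.0, metadata["holds"]]
    return np.array(phase_features(transcript) + meta)

# Toy labeled calls standing in for historical training data.
X = np.array([combined_features("thanks great service", {"duration_s": 120, "holds": 0}),
              combined_features("no refund bad problem cancel", {"duration_s": 900, "holds": 3})])
y = np.array([1, 0])  # 1 = satisfied, 0 = not satisfied
model = LogisticRegression().fit(X, y)
print(model.predict(X))
```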
METHOD AND APPARATUS FOR TARGET EXAGGERATION FOR DEEP LEARNING-BASED SPEECH ENHANCEMENT
The present disclosure relates to a speech enhancement apparatus, and specifically to a method and apparatus for target exaggeration in deep learning-based speech enhancement. According to an embodiment of the present disclosure, the apparatus preserves the speech component of a noisy speech signal while performing speech enhancement that removes the noise component.
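The abstract does not define the exaggeration itself, so the sketch below shows one plausible reading: the training target (an ideal ratio mask here) is pushed toward 1 in speech-dominant time-frequency bins, biasing a trained network toward preserving speech. The exponent-based exaggeration and the mask-based formulation are assumptions, not the patent's actual formula.

```python
# Minimal numpy sketch of one plausible "target exaggeration" scheme:
# raising the ideal ratio mask (IRM) to an exponent beta < 1 inflates
# mid-range values, exaggerating the speech-preserving training target.
import numpy as np

def exaggerated_irm(speech_mag: np.ndarray, noise_mag: np.ndarray,
                    beta: float = 0.5) -> np.ndarray:
    """IRM computed from speech/noise magnitudes, then exaggerated."""
    irm = speech_mag**2 / (speech_mag**2 + noise_mag**2 + 1e-8)
    return irm**beta

# Toy spectrogram magnitudes: 2 frames x 4 frequency bins.
speech = np.array([[1.0, 0.5, 0.1, 0.0], [0.8, 0.4, 0.2, 0.1]])
noise = np.array([[0.1, 0.5, 0.9, 1.0], [0.2, 0.4, 0.8, 0.9]])
print(np.round(exaggerated_irm(speech, noise), 3))
```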
Deep multi-channel acoustic modeling using multiple microphone array geometries
Techniques for speech processing using a deep neural network (DNN) based acoustic model front-end are described. A new modeling approach directly models multi-channel audio data received from a microphone array using a first model (e.g., multi-geometry/multi-channel DNN) that is trained using a plurality of microphone array geometries. Thus, the first model may receive a variable number of microphone channels, generate multiple outputs using multiple microphone array geometries, and select the best output as a first feature vector that may be used similarly to beamformed features generated by an acoustic beamformer. A second model (e.g., feature extraction DNN) processes the first feature vector and transforms it to a second feature vector having a lower-dimensional representation. A third model (e.g., classification DNN) processes the second feature vector to perform acoustic unit classification and generate text data. The DNN front-end enables improved performance despite a reduction in the number of microphones.
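A skeleton of the three-stage front-end, under stated assumptions: the layer sizes, the flattening of channels into one input vector, and the omission of the multi-geometry output-selection step are all placeholders, since the abstract leaves those internals open.

```python
# Sketch of the three-stage DNN front-end: multi-channel model ->
# feature-extraction model -> acoustic-unit classifier. Dimensions and
# geometry handling are illustrative assumptions.
import torch
import torch.nn as nn

class MultiChannelFrontEnd(nn.Module):
    def __init__(self, n_channels=7, feat_dim=257, embed_dim=256,
                 low_dim=64, n_units=40):
        super().__init__()
        # First model: maps stacked per-channel features to a beamformer-like
        # feature vector (stands in for the multi-geometry/multi-channel DNN).
        self.multi_channel = nn.Sequential(
            nn.Linear(n_channels * feat_dim, embed_dim), nn.ReLU())
        # Second model: feature-extraction DNN to a lower-dimensional vector.
        self.feature = nn.Sequential(nn.Linear(embed_dim, low_dim), nn.ReLU())
        # Third model: classification DNN over acoustic units.
        self.classifier = nn.Linear(low_dim, n_units)

    def forward(self, x):                        # x: (batch, n_channels, feat_dim)
        v1 = self.multi_channel(x.flatten(1))    # first feature vector
        v2 = self.feature(v1)                    # lower-dimensional vector
        return self.classifier(v2)               # acoustic-unit logits

logits = MultiChannelFrontEnd()(torch.randn(2, 7, 257))
print(logits.shape)  # torch.Size([2, 40])
```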
Automatic synthesis of translated speech using speaker-specific phonemes
An embodiment includes converting an original audio signal to an original text string, the original audio signal being from a recording of the original text string spoken by a specific person in a source language. The embodiment generates a translated text string by translating the original text string from the source language to a target language, including translation of a word from the source language to the target language. The embodiment assembles a standard phoneme sequence from a set of standard phonemes, where the standard phoneme sequence includes a standard pronunciation of the translated word. The embodiment also associates a custom phoneme with a standard phoneme of the standard phoneme sequence, where the custom phoneme includes the specific person's pronunciation of a sound in the translated word. The embodiment synthesizes the translated text string to a translated audio signal including the translated word pronounced using the custom phoneme.
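The sketch below illustrates the phoneme-assembly step: a standard phoneme sequence is built from a toy lexicon, the speaker's custom phonemes are substituted where available, and the result is handed to a synthesis stub. The lexicon, the custom-phoneme table, and the synthesize() stub are illustrative assumptions.

```python
# Hedged sketch of standard-to-custom phoneme substitution before TTS.
STANDARD_PHONEMES = {"hello": ["HH", "AH", "L", "OW"]}  # toy lexicon

def assemble_with_custom_phonemes(word: str,
                                  custom: dict[str, str]) -> list[str]:
    """Assemble the standard phoneme sequence for a translated word, then
    swap in the speaker's custom phonemes where they exist."""
    standard = STANDARD_PHONEMES[word]
    return [custom.get(p, p) for p in standard]

def synthesize(phonemes: list[str]) -> bytes:
    """Stub for the TTS back-end; a real system would render audio here."""
    return " ".join(phonemes).encode()

# The speaker pronounces OW with a custom variant captured as OW*.
speaker_phonemes = {"OW": "OW*"}
seq = assemble_with_custom_phonemes("hello", speaker_phonemes)
print(seq, synthesize(seq))
```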