G10L17/02

COMPUTERIZED MONITORING OF DIGITAL AUDIO SIGNALS
20220399945 · 2022-12-15 ·

A digital audio quality monitoring device uses a deep neural network (DNN) to provide accurate estimates of signal-to-noise ratio (SNR) from a limited set of features extracted from incoming audio. Some embodiments improve the SNR estimate accuracy by selecting a DNN model from a plurality of available models based on a codec used to compress/decompress the incoming audio. Each model has been trained on audio compressed/decompressed by a codec associated with the model, and the monitoring device selects the model associated with the codec used to compress/decompress the incoming audio. Other embodiments are also provided.
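The codec-conditioned model selection described above can be sketched as a simple registry lookup. The model names, feature set, and the stand-in scoring functions below are illustrative assumptions, not the patent's actual DNN implementation:

```python
def estimate_snr_opus(features):
    # Stand-in for a DNN model trained on Opus-compressed audio.
    return 10.0 + 0.5 * sum(features)

def estimate_snr_aac(features):
    # Stand-in for a DNN model trained on AAC-compressed audio.
    return 12.0 + 0.4 * sum(features)

# One model per codec, each trained on audio processed by that codec.
MODEL_REGISTRY = {
    "opus": estimate_snr_opus,
    "aac": estimate_snr_aac,
}

def monitor_snr(features, codec):
    """Select the model associated with the stream's codec, then estimate SNR."""
    model = MODEL_REGISTRY.get(codec)
    if model is None:
        raise ValueError(f"no model trained for codec {codec!r}")
    return model(features)
```

The key point is only the dispatch: the same limited feature vector is routed to whichever model was trained under the matching codec's compression artifacts.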

Processing Voice Commands
20220392435 · 2022-12-08 ·

Recorded background noises, and other contextual data, may be used to assist in resolving ambiguity in spoken voice commands. The background noises may comprise sounds from entities in a room other than the user issuing the voice commands. One such entity may be a content item being watched by the user, and the captured background noises may comprise audio of the content item. The content item may be identified based on the captured audio of the content item in the background noises, and the identification may be used to interpret the ambiguous voice command. Additional contextual information associated with the voice commands (e.g., identifications of the users in the room) and/or the content item (e.g., the video quality of the content item, a service outputting the content item, a genre of the content item, etc.) may be used to identify the content item.
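A minimal sketch of this disambiguation flow, with the audio fingerprinting and content catalog mocked out (the fingerprint keys, catalog fields, and command phrasing are assumptions for illustration):

```python
# Mock catalog mapping a background-audio fingerprint to a content item.
CONTENT_CATALOG = {
    "fp_galaxy_wars": {"title": "Galaxy Wars", "genre": "sci-fi", "service": "StreamCo"},
    "fp_cooking_now": {"title": "Cooking Now", "genre": "cooking", "service": "FoodTV"},
}

def identify_content(background_fingerprint, context=None):
    """Look up the content item whose audio matches the captured background."""
    item = CONTENT_CATALOG.get(background_fingerprint)
    if item and context:
        # Contextual hints (e.g. genre, service) confirm or reject the match.
        for key, value in context.items():
            if item.get(key) != value:
                return None
    return item

def resolve_command(command, background_fingerprint, context=None):
    """Interpret an ambiguous command against the identified content item."""
    item = identify_content(background_fingerprint, context)
    if command == "record this" and item:
        return f"record {item['title']}"
    return command  # fall back to the literal command
```

Here the ambiguous referent "this" is resolved by identifying what is playing in the room from its audio in the background capture.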

SYSTEM AND METHOD FOR COOPERATIVE PLAN-BASED UTTERANCE-GUIDED MULTIMODAL DIALOGUE
20220392454 · 2022-12-08 ·

Methods and systems for multimodal conversational dialogue are disclosed. The multimodal conversational dialogue system includes multiple sensors to detect multimodal inputs from a user. The multimodal conversational dialogue system includes a multimodal semantic parser that performs semantic parsing and multimodal fusion of the multimodal inputs to determine a goal of the user. The multimodal conversational dialogue system includes a dialogue manager that generates a dialogue with the user in real-time. The dialogue includes system-generated utterances that are used to conduct a conversation between the user and the multimodal conversational dialogue system.
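A toy sketch of the described pipeline: multimodal inputs are fused into a user goal, and a dialogue manager produces system utterances. The function names, the gesture representation, and the deictic-resolution rule are assumptions for illustration only:

```python
def fuse_inputs(speech_text, gesture):
    """Multimodal fusion: resolve a deictic word using the gesture channel."""
    if gesture and "that" in speech_text:
        # e.g. a pointing gesture supplies the referent of "that".
        speech_text = speech_text.replace("that", gesture["target"])
    return {"goal": speech_text}

class DialogueManager:
    """Generates system utterances toward the user's fused goal."""

    def __init__(self):
        self.history = []

    def respond(self, goal):
        utterance = f"Okay, I will {goal['goal']}."
        self.history.append(utterance)
        return utterance
```

The fusion step is where the multiple sensor channels meet: speech alone is ambiguous, but combined with a gesture it yields a concrete goal the dialogue manager can act on.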

System and method for efficient processing of universal background models for speaker recognition
11521622 · 2022-12-06 ·

A system and method for efficient universal background model (UBM) training for speaker recognition, including: receiving an audio input, divisible into a plurality of audio frames, wherein at least a first audio frame of the plurality of audio frames includes an audio sample having a length above a first threshold; extracting at least one identifying feature from the first audio frame and generating a feature vector based on the at least one identifying feature; generating an optimized training sequence computation based on the feature vector and a Gaussian Mixture Model (GMM), wherein the GMM is associated with a plurality of components, wherein each of the plurality of components is defined by a covariance matrix, a mean vector, and a weight vector; and updating any of the associated components of the GMM based on the generated optimized training sequence computation.
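The core GMM computation can be illustrated with a minimal E-step over scalar features: per-frame component responsibilities, then the zeroth- and first-order statistics used to update the components. This is a generic EM sketch under a diagonal-covariance assumption, not the patent's optimized training sequence:

```python
import math

def log_gaussian(x, mean, var):
    # Log-density of a 1-D Gaussian with variance var.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def responsibilities(x, weights, means, variances):
    """Posterior probability of each GMM component for a scalar feature x."""
    logs = [math.log(w) + log_gaussian(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    mx = max(logs)  # log-sum-exp trick for numerical stability
    exps = [math.exp(l - mx) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]

def accumulate(frames, weights, means, variances):
    """Accumulate zeroth/first-order statistics over frames (one E-step pass)."""
    n = [0.0] * len(weights)  # soft counts per component
    f = [0.0] * len(weights)  # first-order sums per component
    for x in frames:
        gamma = responsibilities(x, weights, means, variances)
        for k, g in enumerate(gamma):
            n[k] += g
            f[k] += g * x
    return n, f
```

The accumulated statistics are what the weight, mean, and covariance updates of each component are computed from; the patent's contribution concerns making that training sequence efficient.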

Privacy-Aware Meeting Room Transcription from Audio-Visual Stream

A method for a privacy-aware transcription includes receiving an audio-visual signal including audio data and image data for a speech environment and a privacy request from a participant in the speech environment, where the privacy request indicates a privacy condition of the participant. The method further includes segmenting the audio data into a plurality of segments. For each segment, the method includes determining an identity of a speaker of a corresponding segment of the audio data based on the image data and determining whether the identity of the speaker of the corresponding segment includes the participant associated with the privacy condition. When the identity of the speaker of the corresponding segment includes the participant, the method includes applying the privacy condition to the corresponding segment. The method also includes processing the plurality of segments of the audio data to determine a transcript for the audio data.
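The per-segment privacy check can be sketched as below. Speaker identification from the image data is mocked as a precomputed label on each segment, and "redact" is one illustrative privacy condition; both are assumptions, not the claimed implementation:

```python
def apply_privacy(segments, privacy_requests):
    """Apply each opted-out participant's privacy condition to their segments."""
    out = []
    for seg in segments:
        # In the described method, the speaker identity would be determined
        # per segment from the image data; here it is a precomputed label.
        speaker = seg["speaker"]
        if privacy_requests.get(speaker) == "redact":
            out.append({"speaker": speaker, "text": "[REDACTED]"})
        else:
            out.append(seg)
    return out
```

After this pass, the segments are processed as usual to produce the transcript, with the privacy-conditioned segments already handled.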

SPEAKER IDENTIFICATION APPARATUS, SPEAKER IDENTIFICATION METHOD, AND RECORDING MEDIUM
20220383880 · 2022-12-01 ·

A speaker identification apparatus that identifies a speaker of utterance data indicating a voice of an utterance subjected to identification includes: an emotion estimator that estimates, from an acoustic feature value calculated from the utterance data, an emotion contained in the voice of the utterance indicated by the utterance data, using a trained deep neural network (DNN); and a speaker identification processor that outputs, based on the acoustic feature value calculated from the utterance data, a score for identifying the speaker of the utterance data, using an estimation result of the emotion estimator.
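A schematic sketch of emotion-conditioned speaker scoring: an emotion is first estimated from the acoustic features, and the speaker score is then computed with that estimate taken into account. The threshold rule, the energy feature, and the per-emotion offsets are illustrative stand-ins for the trained DNNs described above:

```python
def estimate_emotion(acoustic_features):
    """Stand-in emotion estimator: high mean energy -> 'angry', else 'neutral'."""
    energy = sum(acoustic_features) / len(acoustic_features)
    return "angry" if energy > 0.5 else "neutral"

def speaker_score(acoustic_features, enrolled_mean, emotion_offsets):
    """Score a speaker match, compensating for the estimated emotion."""
    emotion = estimate_emotion(acoustic_features)
    observed = sum(acoustic_features) / len(acoustic_features)
    # Offset the comparison by an emotion-specific bias (an assumption here),
    # so emotional speech is not penalized as a speaker mismatch.
    return -abs(observed - enrolled_mean - emotion_offsets.get(emotion, 0.0))
```

The point of the two-stage design is that emotion shifts a speaker's acoustics; estimating the emotion first lets the identification score discount that shift.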