G10L15/14

LARGE SCALE PRIVACY-PRESERVING SPEECH RECOGNITION SYSTEM USING FEDERATED LEARNING
20220383857 · 2022-12-01

A method for implementing a privacy-preserving automatic speech recognition system using federated learning. The method includes receiving, from respective client devices, at a cloud server, local acoustic model weights for a neural network-based acoustic model of a local automatic speech recognition system running on the respective client devices, wherein the local acoustic model weights are generated at the respective client devices without labelled data, updating a global automatic speech recognition system based on (a) the local acoustic model weights received from the respective client devices and (b) global acoustic model weights of the global automatic speech recognition system derived from labelled data to obtain an updated global automatic speech recognition system, and sending the updated global automatic speech recognition system to the respective client devices to operate as a new local automatic speech recognition system.
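The server-side update described above amounts to federated averaging: blend the clients' locally trained weights with the label-derived global weights. A minimal sketch, assuming a simple convex combination; the function name, the mixing factor `alpha`, and the use of plain weight vectors are illustrative assumptions, not details from the patent.

```python
# Hypothetical sketch of the server-side federated update: client
# acoustic-model weights (trained without labels) are averaged and
# blended with global weights derived from labelled data.
import numpy as np

def update_global_model(client_weights, global_weights, alpha=0.5):
    """Blend locally trained client weights with the global model."""
    # Element-wise mean across all client weight vectors.
    client_avg = np.mean(np.stack(client_weights), axis=0)
    # Convex combination of client consensus and label-derived weights.
    return alpha * client_avg + (1 - alpha) * global_weights

clients = [np.array([0.2, 0.4]), np.array([0.6, 0.8])]
updated = update_global_model(clients, np.array([0.0, 1.0]))
print(updated)  # -> [0.2 0.8]
```

The updated weights would then be sent back to the client devices to serve as the new local model, closing the federated loop.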

DYNAMIC SPEECH RECOGNITION METHODS AND SYSTEMS WITH USER-CONFIGURABLE PERFORMANCE

Methods and systems are provided for assisting operation of a vehicle using speech recognition. One method involves identifying a user-configured speech recognition performance setting value selected from among a plurality of speech recognition performance setting values, selecting a speech recognition model configuration corresponding to the user-configured speech recognition performance setting value from among a plurality of speech recognition model configurations, where each speech recognition model configuration of the plurality of speech recognition model configurations corresponds to a respective one of the plurality of speech recognition performance setting values, and recognizing an audio input as an input state using the speech recognition model configuration corresponding to the user-configured speech recognition performance setting value.
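The configuration-selection step is essentially a one-to-one mapping from user-facing performance setting values to model configurations. A minimal sketch under that reading; the setting names, `beam_width`, and `vocabulary` fields are invented for illustration.

```python
# Illustrative mapping: each user-configured performance setting
# value corresponds to exactly one speech recognition model
# configuration (all names and values here are assumptions).
MODEL_CONFIGS = {
    "fast":     {"beam_width": 4,  "vocabulary": "commands_only"},
    "balanced": {"beam_width": 16, "vocabulary": "full"},
    "accurate": {"beam_width": 64, "vocabulary": "full"},
}

def select_model_config(setting_value):
    """Return the model configuration for the user-configured
    performance setting value, per the claimed selection step."""
    if setting_value not in MODEL_CONFIGS:
        raise ValueError(f"unknown setting: {setting_value}")
    return MODEL_CONFIGS[setting_value]

print(select_model_config("fast")["beam_width"])  # -> 4
```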

MITIGATING VOICE FREQUENCY LOSS

Computer-implemented methods, computer program products, and computer systems for mitigating frequency loss may include one or more processors configured for receiving first audio data corresponding to unobstructed user utterances, receiving second audio data corresponding to first obstructed user utterances, generating a frequency loss (FL) model representing frequency loss between the first audio data and the second audio data, receiving third audio data corresponding to one or more second obstructed user utterances, processing the third audio data using the FL model to generate fourth audio data corresponding to a frequency loss mitigated version of the second obstructed user utterances, and transmitting the fourth audio data to a recipient computing device. The first obstructed user utterances are obstructed by a facemask and the one or more second obstructed user utterances are obstructed by the facemask. The FL model may be executed as an audio plugin in a web conferencing program.
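One plausible reading of the FL model is a per-frequency gain estimated from paired unobstructed/obstructed recordings and then applied to new obstructed audio. A rough sketch under that assumption, using NumPy FFTs; the band resolution, the absence of smoothing, and the function names are all illustrative choices, not the patent's method.

```python
# Sketch of a frequency-loss model: estimate per-bin attenuation
# from paired clean/masked audio, then apply the inverse gain to
# new obstructed utterances. Details here are assumptions.
import numpy as np

def fit_fl_model(unobstructed, obstructed, eps=1e-8):
    """Per-frequency-bin gain that maps masked magnitudes back to
    the unobstructed magnitudes (the learned 'frequency loss')."""
    clean_mag = np.abs(np.fft.rfft(unobstructed))
    masked_mag = np.abs(np.fft.rfft(obstructed))
    return clean_mag / (masked_mag + eps)

def mitigate(obstructed, gain):
    """Boost the attenuated bins of a new obstructed utterance."""
    spectrum = np.fft.rfft(obstructed)
    return np.fft.irfft(spectrum * gain, n=len(obstructed))
```

Packaged behind a streaming buffer, such a transform could plausibly run as the audio plugin the abstract mentions, processing each captured frame before it reaches the conferencing program.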

APPARATUS AND METHOD FOR COMPOSITIONAL SPOKEN LANGUAGE UNDERSTANDING
20220375457 · 2022-11-24

A method includes identifying multiple tokens contained in an input utterance. The method also includes generating slot labels for at least some of the tokens contained in the input utterance using a trained machine learning model. The method further includes determining at least one action to be performed in response to the input utterance based on at least one of the slot labels. The trained machine learning model is trained to use attention distributions generated such that (i) the attention distributions associated with tokens having dissimilar slot labels are forced to be different and (ii) the attention distribution associated with each token is forced to not focus primarily on that token itself.
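The two training constraints can be read as auxiliary loss terms over the model's attention matrix: penalize self-attention mass on a token's own position, and reward divergence between the attention rows of tokens with dissimilar slot labels. A hedged NumPy sketch under that interpretation; the L1 row distance and the simple sum of the two terms are illustrative choices, not the patent's formulation.

```python
# Illustrative auxiliary losses for the two attention constraints:
# (i) attention rows of tokens with different slot labels should
# differ; (ii) a token's attention should not focus on itself.
import numpy as np

def attention_losses(attn, slot_labels):
    """attn: (T, T) row-stochastic attention; slot_labels: length T."""
    T = attn.shape[0]
    # (ii) discourage self-focus: mean attention mass on the diagonal.
    self_focus = np.mean(np.diag(attn))
    # (i) reward (negate) distance between rows with different labels.
    dissim, pairs = 0.0, 0
    for i in range(T):
        for j in range(i + 1, T):
            if slot_labels[i] != slot_labels[j]:
                dissim += np.abs(attn[i] - attn[j]).sum()
                pairs += 1
    separation = -(dissim / pairs) if pairs else 0.0
    return self_focus + separation  # lower is better
```

Added to the slot-labeling loss during training, terms like these would push the model toward the attention behavior the abstract describes.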

Voice Control of a Media Playback System

Multiple aspects of systems and methods for voice control and related features and functionality for various embodiments of media playback devices, networked microphone devices, microphone-equipped media playback devices, and speaker-equipped networked microphone devices are disclosed and described herein, including but not limited to designating and managing default networked devices, audio response playback, room-corrected voice detection, content mixing, music service selection, metadata exchange between networked playback systems and networked microphone systems, handling loss of pairing between networked devices, actions based on user identification, and other voice control of networked devices.

AUTOMATIC CLASSIFICATION OF PHONE CALLS USING REPRESENTATION LEARNING BASED ON THE HIERARCHICAL PITMAN-YOR PROCESS
20230055948 · 2023-02-23

Embodiments of the disclosed technology include a representation learning model for classification of natural language text. In embodiments, a classification model comprises a feature model and a classifier. The feature model may be hierarchical in nature: data may pass through a series of representations, decreasing in specificity and increasing in generality. Intermediate levels of representation may then be used as automatically learned features to train a statistical classifier. Specifically, the feature model may be based on a hierarchical Pitman-Yor process. In embodiments, once the feature model has been expressed as a Bayesian Belief Network and some aspect of the feature model has been selected for prediction, the feature model may be attached to the classifier. In embodiments, after training, potentially using a mix of labeled and unlabeled data, the classification model can be used to classify documents such as call transcripts based on topics of conversation represented in the transcripts.
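At the core of a Pitman-Yor process is its "seating rule": an observation joins an existing cluster with probability proportional to that cluster's count minus a discount, or opens a new cluster with probability proportional to the concentration plus the discount times the current number of clusters, yielding the power-law cluster sizes useful for text. A single-level sketch of that rule only; the full hierarchical feature model and Bayesian Belief Network attachment described above are far richer than this.

```python
# Single-level Pitman-Yor seating rule (CRP with discount), shown
# only to illustrate the sampling behavior behind the hierarchical
# feature model; this is not the patent's full hierarchy.
import random

def pitman_yor_assignments(n, discount=0.5, theta=1.0, seed=0):
    """Sample cluster assignments for n items."""
    rng = random.Random(seed)
    counts = []       # items per cluster so far
    assignments = []
    for _ in range(n):
        k = len(counts)
        # Unnormalized weights: existing clusters, then a new cluster.
        weights = [c - discount for c in counts] + [theta + discount * k]
        r = rng.random() * sum(weights)
        for table, w in enumerate(weights):
            r -= w
            if r < 0:
                break
        if table == k:
            counts.append(1)      # open a new cluster
        else:
            counts[table] += 1    # join an existing cluster
        assignments.append(table)
    return assignments
```

In the representation-learning setting sketched by the abstract, cluster assignments at intermediate levels of such a hierarchy would serve as the automatically learned features fed to the statistical classifier.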