G10L15/16

METHODS AND APPARATUS TO OPERATE A MOBILE CAMERA FOR LOW-POWER USAGE
20230237791 · 2023-07-27 ·

Disclosed examples include accessing sensor data; recognizing, by executing an instruction with programmable circuitry, a feature in the sensor data based on a convolutional neural network; and transitioning, by executing an instruction with the programmable circuitry, a mobile device between at least two of motion feature detection, audio feature detection, or camera feature detection after the feature is recognized in the sensor data, the mobile device to operate at a different level of power consumption after the transition than before the transition.
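
The transition logic described above can be pictured as a small state machine. The following is a minimal sketch only: the `Mode` names, their ordering from lowest to highest power, and the rule of moving one step up when a feature is recognized are assumptions made for illustration, not details taken from the abstract.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical detection modes, assumed ordered from lowest to highest power."""
    MOTION = 1
    AUDIO = 2
    CAMERA = 3


# Assumed escalation order; the abstract only says the device moves between
# at least two of these modes and that power consumption differs afterwards.
ESCALATION_ORDER = [Mode.MOTION, Mode.AUDIO, Mode.CAMERA]


def next_mode(mode: Mode, feature_recognized: bool) -> Mode:
    """Return the mode to run next, given whether a feature was recognized."""
    if not feature_recognized:
        return mode  # nothing recognized: stay in the current mode
    index = ESCALATION_ORDER.index(mode)
    # Move one step up, saturating at the highest-power mode.
    return ESCALATION_ORDER[min(index + 1, len(ESCALATION_ORDER) - 1)]
```

In this sketch the recognizer itself (the convolutional neural network) is outside the function; only its yes/no result drives the transition.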

MINIMUM WORD ERROR RATE TRAINING FOR ATTENTION-BASED SEQUENCE-TO-SEQUENCE MODELS

Methods, systems, and apparatus, including computer programs encoded on computer-readable storage media, for speech recognition using attention-based sequence-to-sequence models. In some implementations, audio data indicating acoustic characteristics of an utterance is received. A sequence of feature vectors indicative of the acoustic characteristics of the utterance is generated. The sequence of feature vectors is processed using a speech recognition model that has been trained using a loss function that uses a set of speech recognition hypothesis samples, the speech recognition model including an encoder, an attention module, and a decoder. The encoder and decoder each include one or more recurrent neural network layers. A sequence of output vectors representing distributions over a predetermined set of linguistic units is obtained. A transcription for the utterance is obtained based on the sequence of output vectors. Data indicating the transcription of the utterance is provided.
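
One way a loss function can use a set of hypothesis samples is to take the expected number of word errors over those samples. The sketch below assumes that form: hypothesis probabilities are renormalized over the sample set and the mean error is subtracted as a baseline. The function names, the renormalization, and the baseline are illustrative assumptions, not details stated in the abstract.

```python
import math


def word_errors(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # extra word in hyp
                            curr[j - 1] + 1,      # word missing from hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def expected_word_error_loss(hyps, log_probs, ref):
    """Expected word errors over sampled hypotheses, relative to their mean."""
    # Renormalize the model's probabilities over the sampled hypotheses only.
    top = max(log_probs)
    weights = [math.exp(lp - top) for lp in log_probs]
    total = sum(weights)
    probs = [w / total for w in weights]

    errors = [word_errors(h, ref) for h in hyps]
    mean_error = sum(errors) / len(errors)  # assumed variance-reduction baseline
    return sum(p * (e - mean_error) for p, e in zip(probs, errors))
```

The loss is negative when the model puts more probability on hypotheses with fewer errors than average, so minimizing it shifts probability toward lower-error transcriptions.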

Systems and Methods for Training Dual-Mode Machine-Learned Speech Recognition Models

Systems and methods of the present disclosure are directed to a computing system, including one or more processors and a machine-learned multi-mode speech recognition model configured to operate in a streaming recognition mode or a contextual recognition mode. The computing system can perform operations including obtaining speech data and a ground truth label and processing the speech data using the contextual recognition mode to obtain contextual prediction data. The operations can include evaluating a difference between the contextual prediction data and the ground truth label and processing the speech data using the streaming recognition mode to obtain streaming prediction data. The operations can include evaluating a difference between the streaming prediction data and the ground truth label, and a difference between the contextual prediction data and the streaming prediction data. The operations can include adjusting parameters of the speech recognition model.
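
The differences evaluated above could be combined into a single training objective. The sketch below is one possible reading for a single prediction step: it assumes cross-entropy against the label for each mode and a KL divergence between the two modes' predictions. The function names, the choice of these measures, and the `distill_weight` parameter are assumptions for illustration only.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability assigned to the ground truth label index."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same units."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def dual_mode_loss(contextual, streaming, label, distill_weight=1.0):
    """Sum of both modes' label losses plus a contextual-to-streaming term."""
    return (cross_entropy(contextual, label)
            + cross_entropy(streaming, label)
            + distill_weight * kl_divergence(contextual, streaming))
```

When the two modes produce the same distribution, the third term vanishes and the loss reduces to twice the label loss.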

INTELLIGENT SUNSHADING LOUVER CONTROL SYSTEM AND METHOD BASED ON GESTURE RECOGNITION
20230236669 · 2023-07-27 ·

A sunshading louver control system and method based on gesture recognition are provided. Indoor images are collected in real time. Based on the indoor images, the action region of a gesture motion is located and feature extraction is performed on it to obtain hand motion parameters. The hand motion parameters are analyzed to obtain gesture information. The gesture information is compared with preset gesture information. When the preset gesture information contains the gesture information, the sunshading louver is adjusted according to a preset operation logic corresponding to the gesture information. When the preset gesture information does not contain the gesture information, the gesture information is determined to be invalid. The lifting and rotation angles of the louver can be controlled automatically according to various gestures of indoor personnel, thereby rapidly regulating indoor illumination and natural ventilation and meeting occupants' changing demands for the indoor environment.
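
The comparison against preset gestures amounts to a table lookup followed by an adjustment. The sketch below illustrates that step only; the gesture names, step sizes, and the `LouverState` fields are hypothetical, and image capture and gesture extraction are assumed to have already produced a gesture label.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LouverState:
    """Hypothetical louver state: lift height and slat rotation."""
    lift_percent: int
    angle_degrees: int


# Hypothetical preset gestures mapped to (field to adjust, signed step).
PRESET_GESTURES = {
    "swipe_up": ("lift_percent", 10),
    "swipe_down": ("lift_percent", -10),
    "rotate_cw": ("angle_degrees", 15),
    "rotate_ccw": ("angle_degrees", -15),
}


def apply_gesture(state, gesture):
    """Return (new_state, valid). Unknown gestures leave the state unchanged."""
    if gesture not in PRESET_GESTURES:
        return state, False  # gesture is not among the presets: invalid
    field, step = PRESET_GESTURES[gesture]
    return replace(state, **{field: getattr(state, field) + step}), True
```

A real controller would also clamp the lift and angle to the louver's mechanical limits, which this sketch omits.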

EXTERNAL LANGUAGE MODEL INFORMATION INTEGRATED INTO NEURAL TRANSDUCER MODEL
20230237989 · 2023-07-27 ·

A computer-implemented method for training a neural transducer is provided including, by using audio data and transcription data of the audio data as input data, obtaining outputs from a trained language model and a seed neural transducer, respectively, combining the outputs to obtain a supervisory output, and updating parameters of another neural transducer in training so that its output is close to the supervisory output. The neural transducer can be a Recurrent Neural Network Transducer (RNN-T).
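
To make the combine-then-match idea concrete, the sketch below builds a supervisory distribution from two output distributions and measures how far a student's output is from it. It assumes log-linear interpolation as the combination rule and KL divergence as the closeness measure; both choices, the function names, and the `lm_weight` value are illustrative assumptions. It also assumes both models score the same set of units, which ignores the blank symbol that a transducer emits and a language model does not.

```python
import math


def combine_outputs(seed_probs, lm_probs, lm_weight=0.3):
    """Log-linear interpolation of two distributions, renormalized to sum to 1."""
    scores = [math.log(s) + lm_weight * math.log(l)
              for s, l in zip(seed_probs, lm_probs)]
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(supervisory, student):
    """KL(supervisory || student): zero when the student matches the target."""
    return sum(t * math.log(t / s)
               for t, s in zip(supervisory, student) if t > 0)
```

Training would then update the student's parameters to reduce `distillation_loss`, leaving the seed transducer and language model fixed.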

DATA SORTING FOR GENERATING RNN-T MODELS
20230237987 · 2023-07-27 ·

A computer-implemented method for preparing training data for a speech recognition model is provided including obtaining a plurality of sentences from a corpus, dividing each phoneme in each sentence of the plurality of sentences into three hidden states, calculating, for each sentence of the plurality of sentences, a score based on a variation in duration of the three hidden states of each phoneme in the sentence, and sorting the plurality of sentences by using the calculated scores.
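
The scoring and sorting steps can be sketched as follows. The abstract does not define how the variation in duration is measured or in which direction the sentences are sorted, so this example assumes the population variance of the three state durations, averaged over the phonemes in a sentence, and an ascending sort. The data layout (a list of three-element duration tuples per sentence) is also an assumption.

```python
from statistics import mean, pvariance


def sentence_score(phoneme_durations):
    """Average, over phonemes, of the variance of the three state durations."""
    return mean([pvariance(states) for states in phoneme_durations])


def sort_sentences(corpus):
    """Sort (sentence, phoneme_durations) pairs by ascending score."""
    return sorted(corpus, key=lambda item: sentence_score(item[1]))


# Hypothetical corpus: each phoneme is (state 1, state 2, state 3) in frames.
corpus = [
    ("uneven", [(1, 5, 9), (2, 2, 8)]),
    ("even", [(3, 3, 3), (4, 4, 4)]),
]
ordered = [text for text, _ in sort_sentences(corpus)]
```

Here the sentence whose phonemes have evenly distributed state durations sorts first; reversing the order is a one-argument change (`reverse=True`) if the opposite ordering is wanted.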