G10L15/02

Method for training speech recognition model, method and system for speech recognition

Disclosed are a method for training a speech recognition model and a method and system for speech recognition. The disclosure relates to the field of speech recognition and includes: inputting an audio training sample into an acoustic encoder to encode the acoustic features of the sample and determine an acoustic encoded state vector; inputting a preset vocabulary into a language predictor to determine a text prediction vector; inputting the text prediction vector into a text mapping layer to obtain a text output probability distribution; calculating a first loss function from a target text sequence corresponding to the audio training sample and the text output probability distribution; inputting the text prediction vector and the acoustic encoded state vector into a joint network to calculate a second loss function; and performing iterative optimization according to the first and second loss functions.
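
The two-loss training scheme above can be sketched with a toy forward pass. Everything here is a stand-in assumption: random linear layers replace the real encoder, predictor, and joint network, and a simplified per-pair cross-entropy replaces the full transducer alignment loss for the second term.

```python
import numpy as np

rng = np.random.default_rng(0)

V, D, T, U = 8, 4, 5, 3   # vocab size, feature dim, audio frames, target length

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Acoustic encoder output: one encoded state vector per audio frame
# (random stand-in for a real encoder network).
acoustic_states = rng.normal(size=(T, D))

# Language predictor output: one text prediction vector per target position.
text_pred = rng.normal(size=(U, D))

# Text mapping layer: project prediction vectors to a vocabulary distribution.
W_map = rng.normal(size=(D, V))
text_probs = softmax(text_pred @ W_map)            # (U, V)

# First loss: cross-entropy between the target text sequence and the
# text output probability distribution.
target = np.array([2, 5, 1])                       # hypothetical target ids
loss1 = -np.log(text_probs[np.arange(U), target]).mean()

# Joint network: combine every (frame, position) pair; a per-pair
# cross-entropy stands in for the full transducer loss over alignments.
W_joint = rng.normal(size=(D, V))
joint = softmax((acoustic_states[:, None, :] + text_pred[None, :, :]) @ W_joint)  # (T, U, V)
loss2 = -np.log(joint[:, np.arange(U), target]).mean()

# Iterative optimization would descend on the sum of both losses.
total_loss = loss1 + loss2
```

In the described method the two losses are optimized jointly, so the language predictor receives a direct text-side training signal in addition to the gradient flowing through the joint network.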

Automatic detection and analytics using sensors

A method and device for automatic meeting detection and analysis. A mobile electronic device includes multiple sensors configured to selectively capture sensor data. A classifier is configured to analyze the sensor data to detect a meeting zone for a meeting with multiple participants. A processor device is configured to control the multiple sensors and the classifier to trigger sensor data capture.
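
The control flow described above can be sketched as a classifier gating sensor capture. The feature layout, weights, and threshold below are illustrative assumptions, not the patent's actual classifier.

```python
import numpy as np

class MeetingDetector:
    """Sketch of a sensor-driven meeting-zone detector: a classifier
    scores sensor features, and the processor triggers full capture
    only when a meeting zone is likely (all values are assumptions)."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.captures = []

    def classify(self, features):
        # Stand-in classifier: logistic score over
        # [speech_level, num_detected_voices].
        w = np.array([0.8, 0.6])
        return 1.0 / (1.0 + np.exp(-(features @ w - 1.0)))

    def step(self, features):
        # Processor logic: control the sensors and classifier, and
        # trigger sensor data capture on a positive detection.
        score = self.classify(np.asarray(features, dtype=float))
        if score > self.threshold:
            self.captures.append(features)
            return True
        return False

detector = MeetingDetector()
quiet_room = detector.step([0.1, 0.0])    # low speech level, no voices
meeting = detector.step([1.5, 3.0])       # sustained speech, several voices
```

Gating capture on the classifier output, rather than recording continuously, is what lets the device capture sensor data "selectively".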

Speech feature extraction apparatus, speech feature extraction method, and computer-readable storage medium

A speech feature extraction apparatus 100 includes a voice activity detection unit 103 that drops non-voice frames from the frames corresponding to an input speech utterance and calculates a posterior of being voiced for each frame; a voice activity detection process unit 106 that calculates function values, from the given voice activity detection posteriors, to serve as weights when pooling frames into an utterance-level feature; and an utterance-level feature extraction unit 112 that extracts the utterance-level feature from the frames, on the basis of multiple frame-level features, using the function values.
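
The pooling step above can be sketched in a few lines: frames with low voice-activity posteriors are dropped, and the remaining posteriors are normalized into pooling weights. The drop threshold and the choice of normalized posteriors as the weighting function are assumptions for illustration.

```python
import numpy as np

def vad_weighted_pooling(frame_features, vad_posteriors, drop_threshold=0.3):
    """Pool frame-level features into one utterance-level feature,
    weighting each kept frame by its voice-activity posterior."""
    feats = np.asarray(frame_features, dtype=float)    # (num_frames, dim)
    post = np.asarray(vad_posteriors, dtype=float)     # (num_frames,)

    # Voice activity detection: drop frames judged non-voice.
    keep = post >= drop_threshold
    feats, post = feats[keep], post[keep]

    # Function values used as pooling weights (normalized posteriors).
    weights = post / post.sum()
    return weights @ feats                             # (dim,)

frames = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
posteriors = [0.9, 0.9, 0.1]          # third frame is non-voice
utt_feature = vad_weighted_pooling(frames, posteriors)
```

Note how the loud but non-voice third frame contributes nothing to the utterance-level feature, which is the point of posterior-weighted pooling.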

Systems and methods for response selection in multi-party conversations with dynamic topic tracking

Embodiments described herein provide a dynamic topic tracking mechanism that tracks how conversation topics change from one utterance to another and uses the tracking information to rank candidate responses. A pre-trained language model may be used for response selection in multi-party conversations, which consists of two steps: (1) topic-based pre-training that embeds topic information into the language model with self-supervised learning, and (2) multi-task learning on the pre-trained model that jointly trains response selection with dynamic topic prediction and disentanglement tasks.
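
The ranking idea can be sketched with a toy stand-in: bag-of-words counts replace the pre-trained language model's embeddings, a running word-count state replaces the learned topic tracker, and candidates are ranked by cosine similarity to that state. All of this is a simplification for illustration, not the described model.

```python
from collections import Counter
import math

def bow(text):
    # Bag-of-words vector (toy stand-in for language-model embeddings).
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def track_topic(utterances):
    # Dynamic topic state: accumulate counts utterance by utterance,
    # so topic drift across the conversation shifts the state.
    topic = Counter()
    for u in utterances:
        topic.update(bow(u))
    return topic

def rank_responses(conversation, candidates):
    # Use the tracked topic state to rank candidate responses.
    topic = track_topic(conversation)
    return sorted(candidates, key=lambda c: cosine(bow(c), topic), reverse=True)

convo = ["did the model converge", "the training loss is still high"]
ranked = rank_responses(convo, ["try lowering the learning rate for training",
                                "weather is nice today"])
```

The on-topic candidate outranks the off-topic one because it shares vocabulary with the tracked topic state; the real mechanism learns this ranking jointly with topic prediction and disentanglement.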

System-on-a-chip incorporating artificial neural network and general-purpose processor circuitry

A circuit system and a method for analyzing audio or video input data, capable of detecting, classifying, and post-processing patterns in an input data stream. The circuit system may consist of one or more digital processors, one or more configurable spiking neural network circuits, and digital logic for the selection of two-dimensional input data. The system may use the neural network circuits to detect and classify patterns, and one or more of the digital processors to perform further detailed analyses on the input data and to signal the result of an analysis to outputs of the system.
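
The division of labor described above (a neural circuit detects patterns cheaply; a processor analyzes flagged data in detail) can be sketched in software. The leaky integrate-and-fire detector, its threshold, and the analysis fields are all illustrative assumptions, not the patented circuit.

```python
import numpy as np

def spiking_detector(segment, threshold=2.0):
    """Toy stand-in for a configurable spiking-neural-network circuit:
    leaky integrate-and-fire over the input, reporting whether any
    spike fired (threshold and leak factor are assumptions)."""
    membrane = 0.0
    for x in segment:
        membrane = 0.9 * membrane + x      # leaky integration
        if membrane >= threshold:
            return True                    # spike -> pattern detected
    return False

def processor_analysis(segment):
    # Detailed analysis on the general-purpose processor side, run
    # only for segments the neural circuit flagged.
    seg = np.asarray(segment, dtype=float)
    return {"energy": float((seg ** 2).sum()), "peak": float(seg.max())}

# Input data stream: the detector gates which segments reach the
# expensive analysis stage, and the results go to the system outputs.
stream = [[0.1, 0.1, 0.1], [0.5, 1.0, 1.5], [0.0, 0.2, 0.1]]
results = [processor_analysis(s) for s in stream if spiking_detector(s)]
```

Only the middle segment accumulates enough input to spike, so the processor analyzes one segment instead of three, which is the efficiency argument for pairing a cheap detector with a general-purpose processor.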