Patent classifications
G10L25/87
Enhancing signature word detection in voice assistants
Systems and methods for detecting a spoken sentence in a speech recognition system are disclosed herein. Speech data is buffered based on an audio signal captured at a computing device operating in an active mode. The speech data is buffered irrespective of whether it comprises a signature word. The buffered speech data is processed to detect the presence of a sentence comprising at least one command and a query for the computing device. Processing the buffered speech data includes detecting the signature word in the buffered speech data and, in response to detecting the signature word, initiating detection of the sentence in the buffered speech data.
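A minimal sketch of how such buffer-first, detect-later processing might look, assuming fixed-length frames and two hypothetical callables (`signature_detector`, `sentence_detector`) standing in for the signature word and sentence detection models; the buffer length and frame rate are illustrative, not taken from the patent.

```python
from collections import deque

BUFFER_SECONDS = 10
FRAMES_PER_SECOND = 100  # assumed 10 ms frames


class BufferedWakeWordPipeline:
    def __init__(self, signature_detector, sentence_detector):
        self.buffer = deque(maxlen=BUFFER_SECONDS * FRAMES_PER_SECOND)
        self.signature_detector = signature_detector  # frames -> bool
        self.sentence_detector = sentence_detector    # frames -> command/query or None

    def push_frame(self, frame):
        # Buffer speech data irrespective of whether it contains the signature word.
        self.buffer.append(frame)
        frames = list(self.buffer)
        if self.signature_detector(frames):
            # Signature word found: initiate detection of the sentence
            # (command plus query) over the already-buffered speech data.
            return self.sentence_detector(frames)
        return None
```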
AUDIO ONSET DETECTION METHOD AND APPARATUS
An audio onset detection method and apparatus, an electronic device, and a computer-readable storage medium. The audio onset detection method comprises: determining a first voice frequency spectrum parameter corresponding to each frequency band according to a frequency domain signal corresponding to an audio signal; for each frequency band, determining a second voice frequency spectrum parameter of the current frequency band according to the first voice frequency spectrum parameter of the current frequency band and the first voice frequency spectrum parameters of frequency bands positioned before the current frequency band in time sequence; and determining one or more onset positions of notes and syllables in the audio according to the second voice frequency spectrum parameters corresponding to the frequency bands.
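The description maps roughly onto a spectral-flux style onset detector. Below is a hedged sketch under that interpretation: the first parameter is the per-band magnitude of each frame, the second parameter compares the current frame against a short history of earlier frames, and onsets are peaks of the resulting novelty curve. Window sizes, the history length, and the peak-picking rule are assumptions.

```python
import numpy as np

def detect_onsets(signal, sr, frame_len=1024, hop=512, history=3):
    # First parameter: magnitude spectrum of each frame (one value per band).
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    mags = np.array([
        np.abs(np.fft.rfft(signal[i * hop:i * hop + frame_len] * window))
        for i in range(n_frames)
    ])

    # Second parameter: positive difference between the current frame and the
    # mean of the preceding `history` frames, per band, summed over bands.
    novelty = np.zeros(n_frames)
    for t in range(history, n_frames):
        prev = mags[t - history:t].mean(axis=0)
        novelty[t] = np.maximum(mags[t] - prev, 0.0).sum()

    # Onsets: local maxima of the novelty curve above an adaptive threshold.
    thresh = novelty.mean() + novelty.std()
    return [t * hop / sr for t in range(1, n_frames - 1)
            if novelty[t] > thresh
            and novelty[t] >= novelty[t - 1]
            and novelty[t] > novelty[t + 1]]
```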
VOICE ACTIVITY DETECTION METHOD AND APPARATUS, ELECTRONIC DEVICE AND STORAGE MEDIUM
The present disclosure discloses a voice activity detection method and apparatus, an electronic device and a storage medium, and relates to the field of artificial intelligence, such as deep learning and intelligent voice. The method may include: acquiring time-aligned voice data and video data; performing a first detection of a voice start point and a voice end point of the voice data using a voice detection model obtained by a training operation; performing a second detection of a lip movement start point and a lip movement end point of the video data; and correcting a result of the first detection using a result of the second detection, and taking the corrected result as the voice activity detection result. The solution of the present disclosure may improve the accuracy of the voice activity detection result.
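A toy illustration of the correction step only, assuming both detections produce start and end timestamps on a shared clock; the tolerance and the rule of preferring the lip-movement endpoints when the two detections disagree are assumptions, not taken from the disclosure.

```python
def correct_vad(voice_start, voice_end, lip_start, lip_end, tolerance=0.3):
    """All arguments are timestamps in seconds on a shared clock."""
    start, end = voice_start, voice_end
    # If the audio start point drifts too far from the lip-movement start,
    # trust the video; likewise for the end point.
    if abs(voice_start - lip_start) > tolerance:
        start = lip_start
    if abs(voice_end - lip_end) > tolerance:
        end = lip_end
    return start, end
```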
METHOD FOR FACILITATING SPEECH ACTIVITY DETECTION FOR STREAMING SPEECH RECOGNITION
The present disclosure relates to a system and method for automatic recording of speech. The system is configured for end-of-sentence detection and may also serve as a punctuation predictor. The system uses interrelated Natural Language Processing (NLP) and Automatic Speech Recognition (ASR) with a switching mechanism. The switching mechanism decides when the ASR should start or stop recording for processing. The decision is made by a temporal neural network that tells the switching mechanism whether a meaningful sentence has been formed. The temporal neural network is a sequence-to-classification network trained on a large dataset of news articles.
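A hedged sketch of the switching idea: a sequence classifier scores whether the running transcript already forms a meaningful sentence, and recording stops once the score crosses a threshold. `asr_stream` and `sentence_classifier` are hypothetical stand-ins for the streaming ASR output and the temporal neural network.

```python
def run_segment(asr_stream, sentence_classifier, threshold=0.9):
    transcript = []
    for partial_words in asr_stream:             # new words as they are recognized
        transcript.extend(partial_words)
        score = sentence_classifier(transcript)  # P(meaningful sentence formed)
        if score >= threshold:
            # Switching mechanism: stop recording and hand the segment to NLP.
            return " ".join(transcript) + "."
    return " ".join(transcript)
```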
FEDERATED LEARNING WITH SOUND TO DETECT ANOMALIES IN THE INDUSTRIAL EQUIPMENT
In some example embodiments, there may be provided a method that includes receiving a machine learning model provided by a central server configured to provide federated learning; receiving first audio data obtained from at least one audio sensor monitoring at least one machine located at a first edge node; training, based on the first audio data, the machine learning model; providing parameter information to the central server in order to enable the federated learning among a plurality of edge nodes; receiving an aggregate machine learning model provided by the central server; and detecting an anomalous state of the at least one machine. Related systems, methods, and articles of manufacture are also described.
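A sketch of the edge-node side of this loop, using a toy linear model so a federated round is concrete; the training rule, the anomaly threshold, and the class interface are simplified assumptions rather than the described embodiment.

```python
import numpy as np

class EdgeNode:
    def __init__(self, weights):
        self.weights = np.asarray(weights, dtype=float)  # model from the central server

    def train(self, audio_features, labels, lr=0.01, epochs=5):
        # Train the received model on locally captured machine audio.
        for _ in range(epochs):
            preds = audio_features @ self.weights
            grad = audio_features.T @ (preds - labels) / len(labels)
            self.weights -= lr * grad
        return self.weights  # parameter information returned to the server

    def set_aggregate(self, aggregate_weights):
        # Replace local parameters with the server's federated aggregate.
        self.weights = np.asarray(aggregate_weights, dtype=float)

    def is_anomalous(self, feature_vector, threshold=0.5):
        # Anomaly decision for one audio observation of the monitored machine.
        return float(feature_vector @ self.weights) > threshold
```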
METHOD FOR PROCESSING AN AUDIO SIGNAL, METHOD FOR CONTROLLING AN APPARATUS AND ASSOCIATED SYSTEM
In a method for processing an audio signal, the audio signal is continuously analyzed substantially in real time from a recognized beginning of the speech input to provide a speech analysis result. The speech analysis result is used to dynamically define an end of the speech input. A speech data stream is provided based on the audio signal between the beginning and the end. The speech data stream may be further analyzed to identify one or more speech commands.
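One way such a dynamically defined end point could work, assuming the analysis result boils down to a flag that says whether the partial utterance already looks like a complete command: the trailing-silence timeout that ends the input shrinks once the command looks complete. `is_speech`, `looks_complete`, and the timeouts (in frames) are illustrative assumptions.

```python
def capture_speech(frames, is_speech, looks_complete,
                   short_timeout=5, long_timeout=30):
    """`frames` yields audio frames after the recognized beginning of speech."""
    stream, silent = [], 0
    for frame in frames:
        stream.append(frame)
        if is_speech(frame):
            silent = 0
        else:
            silent += 1
        # Continuous, near-real-time analysis of everything heard so far; the
        # end of the speech input is defined dynamically from that result by
        # accepting less trailing silence once the utterance looks complete.
        timeout = short_timeout if looks_complete(stream) else long_timeout
        if silent >= timeout:
            break
    return stream  # speech data stream between the beginning and the end
```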
SOUND SOURCE LOCALIZATION MODEL TRAINING AND SOUND SOURCE LOCALIZATION METHOD, AND APPARATUS
The present disclosure provides a method for training a sound source localization model and a sound source localization method, and relates to the field of artificial intelligence technologies such as voice processing and deep learning. The method for training the sound source localization model includes: obtaining a sample audio according to an audio signal including a wake-up word; extracting an audio feature of at least one audio frame in the sample audio, and marking a direction label and a mask label of the at least one audio frame; and training a neural network model using the audio feature, the direction label and the mask label of the at least one audio frame, to obtain a sound source localization model. The sound source localization method includes: acquiring a to-be-processed audio signal, and extracting an audio feature of each audio frame in the to-be-processed audio signal; inputting the audio feature of each audio frame into a sound source localization model to obtain sound source direction information outputted by the sound source localization model for each audio frame; determining a wake-up word endpoint frame in the to-be-processed audio signal; and obtaining a sound source direction of the to-be-processed audio signal according to the sound source direction information corresponding to the wake-up word endpoint frame.
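A sketch of the inference path only, assuming per-frame features, a model that returns a direction per frame, and a separate wake-up word endpoint detector; `extract_features`, `localization_model`, and `find_wakeword_endpoint` are hypothetical stand-ins for components the abstract leaves unspecified.

```python
import numpy as np

def localize(audio_signal, extract_features, localization_model,
             find_wakeword_endpoint):
    # Audio feature per frame of the to-be-processed signal, shape (n_frames, dim).
    features = extract_features(audio_signal)
    # Sound source direction information for every frame.
    directions = np.asarray([localization_model(f) for f in features])
    # Frame index at which the wake-up word ends.
    endpoint = find_wakeword_endpoint(audio_signal)
    # Direction of the source, read at the wake-up word endpoint frame.
    return directions[endpoint]
```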