Patent classifications
G10L25/27
ELECTRONIC DEVICE AND METHOD FOR ANALYZING SPEECH RECOGNITION RESULTS
An electronic device and a method for analyzing a speech recognition result are provided. The electronic device includes a display module configured to provide information to an outside of the electronic device, a processor electrically connected to the display module, and a memory electrically connected to the processor. The processor is configured to generate feature information of a text corresponding to a user utterance based on the text, determine an output domain for processing the user utterance based on the feature information of the text, identify an expected domain predetermined by a user, extract, from the memory, feature information associated with the output domain and feature information associated with the expected domain, and display the feature information associated with the output domain and the feature information associated with the expected domain using the display module.
Hybrid input machine learning frameworks
There is a need for more accurate and more efficient hybrid-input prediction steps/operations. This need can be addressed by, for example, techniques for efficient joint processing of data objects. In one example, a method includes: processing an audio data object using an audio processing machine learning model to generate an audio-based feature data object; processing an acceleration data object using an acceleration processing machine learning model to generate an acceleration-based feature data object; processing the audio-based feature data object and the acceleration-based feature data object using a feature synthesis machine learning model in order to generate a hybrid-input prediction data object; and performing one or more prediction-based actions based at least in part on the hybrid-input prediction data object.
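The data flow in this abstract (two modality-specific models feeding a synthesis model, followed by an action) can be sketched as below. This is only an illustrative sketch: the hand-written feature functions stand in for trained models, and every name, feature choice, weight, and threshold is a hypothetical assumption rather than anything stated in the abstract.

```python
import math
from typing import List, Sequence, Tuple


def audio_features(samples: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for the audio processing model:
    returns [RMS energy, zero-crossing rate]."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)
    return [rms, crossings / (len(samples) - 1)]


def acceleration_features(readings: Sequence[Tuple[float, float, float]]) -> List[float]:
    """Hypothetical stand-in for the acceleration processing model:
    returns [mean magnitude, peak magnitude] of 3-axis readings."""
    mags = [math.sqrt(x * x + y * y + z * z) for x, y, z in readings]
    return [sum(mags) / len(mags), max(mags)]


def synthesize(audio_feat: Sequence[float], accel_feat: Sequence[float],
               weights: Sequence[float], bias: float) -> float:
    """Hypothetical feature synthesis model: logistic regression over the
    concatenated feature vectors, producing a score in (0, 1)."""
    joint = list(audio_feat) + list(accel_feat)
    z = sum(w * f for w, f in zip(weights, joint)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def act_on(prediction: float, threshold: float = 0.5) -> str:
    """Hypothetical prediction-based action: alert when the score is high."""
    return "alert" if prediction >= threshold else "ignore"


audio = [0.5, -0.5, 0.5, -0.5]
accel = [(0.0, 0.0, 1.0), (3.0, 4.0, 0.0)]
score = synthesize(audio_features(audio), acceleration_features(accel),
                   weights=[1.0, 1.0, 1.0, 1.0], bias=-10.0)
print(act_on(score))
```

The point of the structure is that each modality is reduced to its own feature object before the joint model sees it, so either upstream model can be replaced without changing the synthesis step.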
METHODS OF TRAINING ACOUSTIC SCENE CLASSIFICATION MODEL AND CLASSIFYING ACOUSTIC SCENE AND ELECTRONIC DEVICE FOR PERFORMING THE METHODS
Disclosed are methods of training an acoustic scene classification model and classifying an acoustic scene, and an electronic device for performing the methods. The training method of an acoustic scene classification model includes: inputting training data labeled as an acoustic scene to the acoustic scene classification model, which is repeatedly trained by using the training data, and outputting a first result predicting the acoustic scene; updating a weight of an auxiliary model configured to induce training of the acoustic scene classification model, based on a weight of the acoustic scene classification model and a weight of the auxiliary model in a previous epoch; inputting the training data to the auxiliary model and outputting a second result; calculating a cost function based on the first result, the second result, and the labeling of the acoustic data; and updating the weight of the acoustic scene classification model based on the cost function.
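One way to read the auxiliary-model step is as an exponential moving average of the main model's weights, combined with a cost that mixes a label term and an agreement term. The abstract does not state the exact update rule or cost, so the sketch below is an assumption: the `decay` factor, the cross-entropy plus mean-squared-error cost, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def update_auxiliary(aux_prev: Sequence[float], main: Sequence[float],
                     decay: float = 0.9) -> List[float]:
    """Assumed update rule: blend the auxiliary weights from the previous
    epoch with the current weights of the classification model."""
    return [decay * a + (1.0 - decay) * m for a, m in zip(aux_prev, main)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cost(first: Sequence[float], second: Sequence[float], label: int,
         consistency_weight: float = 1.0) -> float:
    """Assumed cost: cross-entropy of the first result against the label,
    plus a penalty when the first and second results disagree."""
    cross_entropy = -math.log(first[label])
    disagreement = sum((p - q) ** 2 for p, q in zip(first, second)) / len(first)
    return cross_entropy + consistency_weight * disagreement


# One illustrative step with made-up numbers.
aux_weights = update_auxiliary([1.0, 0.0], [0.0, 1.0], decay=0.5)
first_result = softmax([2.0, 0.5, 0.1])    # classification model output
second_result = softmax([1.5, 0.7, 0.2])   # auxiliary model output
print(aux_weights, cost(first_result, second_result, label=0))
```

In a real training loop, the gradient of this cost would be taken only with respect to the classification model's weights, while the auxiliary weights would change only through the blending step.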
SYSTEM AND METHOD FOR CLUSTER-BASED AUDIO EVENT DETECTION
Methods, systems, and apparatuses are provided for audio event detection, where the determination of a type of sound data is made at the cluster level rather than at the frame level. The techniques provided are thus more robust to the local behavior of features of an audio signal or audio recording. The audio event detection is performed by using Gaussian mixture models (GMMs) to classify each cluster or by extracting an i-vector from each cluster. Each cluster may be classified based on an i-vector classification using a support vector machine or probabilistic linear discriminant analysis. The audio event detection significantly reduces potential smoothing error and avoids any dependency on accurate window-size tuning. Segmentation may be performed using a generalized likelihood ratio and a Bayesian information criterion, and the segments may be clustered using hierarchical agglomerative clustering. Audio frames may be clustered using K-means and GMMs.
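The difference between frame-level and cluster-level decisions can be illustrated with a small sketch. It assumes the frames have already been clustered and, for brevity, uses a single one-dimensional Gaussian per class as a stand-in for a full GMM; the class names, parameters, and function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Sequence, Tuple


def log_gauss(x: float, mean: float, var: float) -> float:
    """Log-density of a 1-D Gaussian (stand-in for a per-class GMM)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def classify_clusters(frames: Sequence[float], cluster_ids: Sequence[int],
                      class_models: Dict[str, Tuple[float, float]]) -> Dict[int, str]:
    """Sum each class's log-likelihood over all frames of a cluster and
    assign the cluster to the class with the highest total."""
    totals: Dict[int, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for x, cid in zip(frames, cluster_ids):
        for name, (mean, var) in class_models.items():
            totals[cid][name] += log_gauss(x, mean, var)
    return {cid: max(scores, key=scores.get) for cid, scores in totals.items()}


models = {"speech": (0.0, 1.0), "music": (5.0, 1.0)}  # hypothetical (mean, variance)
frames = [0.1, -0.2, 3.0, 4.9, 5.2, 4.8]               # one feature value per frame
clusters = [0, 0, 0, 1, 1, 1]                          # assumed clustering result
print(classify_clusters(frames, clusters, models))
```

Taken on its own, the frame with value 3.0 is closer to the "music" model, but pooling evidence over its cluster still yields "speech". This is the robustness to local feature behavior that the abstract attributes to cluster-level decisions, obtained without a smoothing window.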
Analyzing changes in vocal power within music content using frequency spectrums
Technologies are described for identifying familiar or interesting parts of music content by analyzing changes in vocal power using frequency spectrums. For example, a frequency spectrum can be generated from digitized audio. Using the frequency spectrum, the harmonic content and percussive content can be separated. The vocal content can then be separated from the harmonic and/or percussive content. The vocal content can then be processed to identify surge points in the digitized audio. In some implementations, the vocal content is included in the harmonic content during the separation procedure and is then separated from the harmonic content.
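The final step, locating surge points in the vocal content, might look like the sketch below. It assumes the separation steps have already produced a per-frame vocal power sequence, and it adopts a hypothetical definition of a surge point: a frame where average power in the following window is at least `min_ratio` times the average in the preceding window. The abstract does not define surge points this precisely.

```python
from typing import List, Sequence


def surge_points(power: Sequence[float], window: int = 2,
                 min_ratio: float = 2.0) -> List[int]:
    """Return frame indices where mean vocal power over the next `window`
    frames is at least `min_ratio` times the mean over the previous
    `window` frames (an assumed definition of a surge)."""
    surges = []
    for i in range(window, len(power) - window + 1):
        before = sum(power[i - window:i]) / window
        after = sum(power[i:i + window]) / window
        if before > 0 and after / before >= min_ratio:
            surges.append(i)
    return surges


# Hypothetical vocal power per frame: quiet verse, then a loud chorus.
vocal_power = [1.0, 1.0, 1.0, 1.0, 4.0, 4.0, 4.0, 4.0]
print(surge_points(vocal_power, window=2, min_ratio=3.0))
```

Comparing windows rather than adjacent frames makes the detector less sensitive to single-frame spikes. The window length and ratio would need tuning on real material.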
Enhanced accuracy of user presence status determination
Technologies are described herein for enhancing a user presence status determination. Visual data may be received from a depth camera configured to be arranged within a three-dimensional space. A current user presence status of a user in the three-dimensional space may be determined based on the visual data. A previous user presence status of the user may be transformed to the current user presence status, responsive to determining the current user presence status of the user.