Patent classifications
G10L25/27
APPARATUS AND METHOD FOR SPEECH-EMOTION RECOGNITION WITH QUANTIFIED EMOTIONAL STATES
A method for training a speech-emotion recognition classifier under a continuous, self-updating, and re-trainable ASER machine learning model, wherein the training data is generated by: obtaining an utterance from a human speech source; processing the utterance in an emotion evaluation and rating process with normalization; extracting the features of the utterance; quantifying the feature attributes of the extracted features by labelling, tagging, and weighting them, with their values assigned on measurable scales; and hashing the quantified feature attributes in a feature attribute hashing process to obtain hash values for creating a feature vector space. The run-time speech-emotion recognition comprises: extracting the features of an utterance; recognizing, by the trained recognition classifier, the emotions and levels of intensity of the utterance units; and computing a quantified emotional state of the utterance by fusing the recognized emotions and levels of intensity with the quantified extracted feature attributes according to their respective weightings.
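As a rough illustration of the run-time fusion step only, the following Python sketch combines per-unit recognized emotions and intensity levels with weighted, quantified feature attributes into a single quantified emotional state. The function name, the 0-1 intensity scale, and the example attributes and weights are illustrative assumptions, not the claimed implementation.

# Minimal sketch of the run-time fusion step: recognized emotions with
# intensity levels are fused with weighted, quantified feature attributes
# to yield a quantified emotional state for the utterance. Names, scales,
# and weights are illustrative assumptions.
def quantify_emotional_state(recognized, attributes):
    """recognized: list of (emotion, intensity) pairs per utterance unit,
    intensity on a 0-1 scale; attributes: dict mapping a quantified
    feature attribute to a (value, weight) pair."""
    # Accumulate intensity mass per emotion across utterance units.
    scores = {}
    for emotion, intensity in recognized:
        scores[emotion] = scores.get(emotion, 0.0) + intensity
    # Fuse in the feature attributes by their respective weightings.
    modifier = sum(value * weight for value, weight in attributes.values())
    total = sum(scores.values()) or 1.0
    # Normalize to a per-emotion share, scaled by the attribute modifier.
    return {e: (s / total) * (1.0 + modifier) for e, s in scores.items()}

state = quantify_emotional_state(
    [("anger", 0.8), ("anger", 0.6), ("neutral", 0.2)],
    {"pitch_variance": (0.7, 0.3), "speech_rate": (0.5, 0.2)},
)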
EXTRANEOUS VOICE REMOVAL FROM AUDIO IN A COMMUNICATION SESSION
The technology disclosed herein enables removal of extraneous voices from audio in a communication session. In a particular embodiment, a method includes receiving audio captured from an endpoint operated by a user on a communication session. The method further includes identifying an extraneous voice in the audio, wherein the voice is from a person other than the user, and removing the extraneous voice from the audio. After removing the extraneous voice, the method includes transmitting the audio to another endpoint on the communication session.
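A minimal Python sketch of what such filtering could look like, assuming an enrolled speaker embedding for the user and a frame-level embedding function supplied by the caller; the cosine-similarity test and the 0.75 threshold are illustrative assumptions rather than the disclosed mechanism.

import numpy as np

def remove_extraneous_voices(frames, user_embedding, embed, threshold=0.75):
    """frames: list of 1-D numpy arrays of PCM samples; embed: callable
    mapping a frame to a unit-norm speaker-embedding vector."""
    cleaned = []
    for frame in frames:
        similarity = float(np.dot(embed(frame), user_embedding))
        # Keep frames that match the enrolled user; mute everyone else.
        cleaned.append(frame if similarity >= threshold else np.zeros_like(frame))
    # The cleaned audio is what would be transmitted to the other endpoint.
    return np.concatenate(cleaned)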
ACTIVITY RECOGNITION IN DARK VIDEO BASED ON BOTH AUDIO AND VIDEO CONTENT
Videos captured in low-light conditions can be processed to identify an activity being performed in the video. The processing may use both the video and audio streams to identify the activity. The video stream is processed to generate a darkness-aware feature, which may be used to modulate the features generated from the audio and video streams. The audio features may be used to generate a video attention feature, and the video features may be used to generate an audio attention feature. The audio and video attention features may also be used in modulating the audio and video features. The modulated audio and video features may then be used to predict an activity occurring in the video.
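As a toy illustration of the fusion scheme, the numpy sketch below stands in for a real network: a darkness-aware scalar gates the two streams, each stream yields an attention vector that re-weights the other, and the modulated features are concatenated for an activity classifier. The dimensions, the sigmoid gating, and the assumption that a higher darkness score shifts weight toward audio are all illustrative.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse(video_feat, audio_feat, darkness_feat):
    """video_feat, audio_feat, darkness_feat: 1-D feature vectors."""
    # Darkness-aware modulation: darker scenes lean more on audio.
    gate = sigmoid(darkness_feat.mean())
    video_mod = (1.0 - gate) * video_feat
    audio_mod = gate * audio_feat
    # Cross-modal attention: each stream re-weights the other.
    video_attn = sigmoid(audio_mod)   # audio features -> video attention
    audio_attn = sigmoid(video_mod)   # video features -> audio attention
    # Modulated features, ready for an activity-prediction head.
    return np.concatenate([video_mod * video_attn, audio_mod * audio_attn])

rng = np.random.default_rng(0)
fused = fuse(rng.normal(size=128), rng.normal(size=128), rng.normal(size=16))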
AUDIO SAMPLES TO DETECT DEVICE ANOMALIES
Example implementations relate to using audio samples to detect device anomalies. For example, a computing device comprises a processing resource and a non-transitory computer-readable medium storing instructions executable by the processing resource to: generate a matrix of audio information for a plurality of audio samples of a device; select audio information from one of the plurality of audio samples; generate a plurality of principal components for the selected audio information utilizing a principal component expansion; select a principal component from the plurality of principal components based on a quantity of variance; and detect an anomaly of the device based on a comparison between a real-time audio sample of the device and the selected principal component.
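A minimal numpy sketch of the described flow: build a matrix of audio features from reference samples, take the principal component with the largest variance, and flag a real-time sample whose projection deviates from the reference projections. The SVD-based expansion and the 3-sigma threshold are illustrative assumptions.

import numpy as np

def fit_reference(samples):
    """samples: (n_samples, n_features) matrix of audio information."""
    mean = samples.mean(axis=0)
    # Principal component expansion via SVD of the centered matrix;
    # rows of vt are ordered by decreasing variance, so vt[0] is the
    # component selected for its quantity of variance.
    _, _, vt = np.linalg.svd(samples - mean, full_matrices=False)
    top = vt[0]
    projections = (samples - mean) @ top
    return mean, top, projections.mean(), projections.std()

def is_anomalous(sample, mean, component, proj_mean, proj_std, k=3.0):
    # Compare the real-time sample against the selected component.
    projection = (sample - mean) @ component
    return abs(projection - proj_mean) > k * proj_std

rng = np.random.default_rng(1)
model = fit_reference(rng.normal(size=(100, 32)))
print(is_anomalous(rng.normal(size=32), *model))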
PATTERN RECOGNITION DEVICE, PATTERN RECOGNITION METHOD, AND COMPUTER PROGRAM PRODUCT
According to an embodiment, a pattern recognition device is configured to divide an input signal into a plurality of elements, convert the divided elements into feature vectors of the same dimensionality to generate a set of feature vectors, and evaluate the set of feature vectors using a recognition dictionary including models corresponding to respective classes, so as to output a recognition result representing a class or a set of classes to which the input signal belongs. Each model includes sub-models, each corresponding to one of the possible division patterns in which a signal to be classified into that model's class can be divided into a plurality of elements. A label expressing a model that includes a sub-model conforming to the set of feature vectors, or a set of labels expressing a set of models that include such sub-models, is output as the recognition result.
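A minimal Python sketch of the matching step, in which each class model holds one sub-model per possible division pattern (here, simply a sequence of mean vectors) and the best-scoring class label is returned; the Euclidean scoring is an illustrative assumption.

import numpy as np

def score(feature_vectors, sub_model):
    """Negative total distance between aligned elements; higher is better."""
    if len(feature_vectors) != len(sub_model):
        return -np.inf  # this division pattern does not apply
    return -sum(np.linalg.norm(f - m) for f, m in zip(feature_vectors, sub_model))

def recognize(feature_vectors, dictionary):
    """dictionary: {label: [sub_model, ...]}, one sub-model per pattern."""
    best = max(
        ((label, score(feature_vectors, sm))
         for label, subs in dictionary.items() for sm in subs),
        key=lambda pair: pair[1],
    )
    return best[0]

dictionary = {
    "A": [[np.zeros(8)], [np.zeros(8), np.ones(8)]],  # two division patterns
    "B": [[np.ones(8)]],
}
print(recognize([np.ones(8)], dictionary))  # -> "B"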
PATTERN RECOGNITION DEVICE, PATTERN RECOGNITION METHOD, AND COMPUTER PROGRAM PRODUCT
According to an embodiment, a pattern recognition device recognizes a pattern of an input signal by converting the input signal to a feature vector and matching the feature vector against a recognition dictionary. The recognition dictionary includes a dictionary subspace basis vector for expressing a dictionary subspace, which is a subspace of the space of the feature vector, and a plurality of probability parameters for converting a similarity calculated from the feature vector and the dictionary subspace into a likelihood. The device includes a recognition unit configured to calculate the similarity using a quadratic polynomial of the value of the inner product of the feature vector and the dictionary subspace basis vector, and to calculate the likelihood using the similarity and an exponential function of a linear sum of the probability parameters. The recognition dictionary is trained with an expectation-maximization method under a constraint condition between the probability parameters.
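A minimal numpy sketch of the recognition unit under common subspace-method assumptions: similarity is the squared projection of a unit-norm feature vector onto the dictionary subspace, which is a quadratic polynomial of its inner products with the basis vectors, and likelihood is an exponential of a linear combination with per-class probability parameters (a, b). In practice those parameters would come from the EM training; the values here are placeholders.

import numpy as np

def similarity(x, basis):
    """basis: (k, d) orthonormal rows spanning the dictionary subspace."""
    # Quadratic polynomial of the inner products with the basis vectors.
    return float(np.sum((basis @ x) ** 2))

def likelihood(x, basis, a, b):
    # Exponential function of a linear sum involving the parameters.
    return float(np.exp(a * similarity(x, basis) + b))

rng = np.random.default_rng(2)
basis = np.linalg.qr(rng.normal(size=(64, 2)))[0].T  # 2-D subspace of R^64
x = rng.normal(size=64)
x /= np.linalg.norm(x)
print(likelihood(x, basis, a=4.0, b=-2.0))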