System and Method For Identifying Sentiment (Emotions) In A Speech Audio Input
20220392485 · 2022-12-08
Abstract
In a system and method for enabling a user to identify the emotions of speakers during a telephone or online conversation, spoken audio input is pre-processed using a one-dimensional Mel Spectrogram and/or a two-dimensional Mel-Frequency Cepstral Coefficient (MFCC) matrix, the two-dimensional matrix is reduced to a single-dimension output, and at least one emotion in the audio input is identified using a convolutional or recurrent neural network.
Claims
1. A method of analyzing spoken audio input for emotional content, comprising capturing a time series of the spoken audio input, pre-processing the time series of the spoken audio input by transforming the time series using Mel Spectrograms, or a two-dimensional Mel-Frequency Cepstral Coefficient (MFCC) matrix, and feeding a single dimensional output into a neural network to identify at least one emotion in the audio input.
2. A method of claim 1, wherein, in the case of an MFCC, the two-dimensional matrix is reduced to a single dimension output by means of mean normalization or other dimensionality reduction.
3. A method of claim 1, wherein the audio input comprises an analog input or a digital input configured to define a pre-defined number of frequency values.
4. A method of claim 1, further comprising representing at least one emotion in visual form to a user.
5. A method of claim 4, wherein the visual form includes one or more of written description, and graphic representation, of the at least one emotion.
6. A method of claim 1, further comprising representing the at least one emotion in auditory form to a user.
7. A method of claim 1, wherein the neural network includes layers for performing one or more of the following steps: recognizing patterns, reducing dimensionality, reducing data overfitting, transforming data into useful numbers, stabilizing and standardizing the data, reducing the number of data channels, and linearly altering data sizes.
8. A neural network structure for analyzing emotions in speech, comprising a recurrent or convolutional neural network architecture that includes a one-dimensional convolutional layer or recurrent layer to recognize patterns in series, and a layer to linearly alter data sizes.
9. A neural network structure of claim 8, wherein the neural network structure comprises a convolutional neural network architecture, and further comprises one or more of: layers to reduce dimensionality, layers to reduce data overfitting, layers to transform data into useful numbers, layers to stabilize and standardize data, and layers to reduce the number of data channels.
10. A neural network structure of claim 8, wherein the neural network structure comprises a recurrent neural network architecture, and further comprises one or more of: layers to transform data into useful numbers, layers to reduce the number of data channels, and layers to recurrently recognize patterns in series by using feedback connections and cell states.
11. A system for analyzing spoken audio input from one or more participants, for emotional content, comprising a pre-processing stage configured to generate a one-dimensional Mel Spectrogram or a two-dimensional Mel-Frequency Cepstral Coefficient (MFCC) matrix and to reduce the two-dimensional matrix to a single dimension output, and a neural network to identify at least one emotion in the audio input.
12. A system of claim 11, further comprising a display screen for representing the at least one emotion in visual form to one or more participants.
13. A system of claim 12, wherein the visual form includes one or more of written description, and graphic representation, of the at least one emotion.
14. A system of claim 11, wherein the participants are speakers taking part in a telephone or online conversation.
15. A system of claim 11, further comprising an audio output for representing the at least one emotion in auditory form to a participant.
16. A system of claim 11, wherein the neural network comprises a recurrent neural network with a recurrent layer to recognize patterns in series, or a convolutional neural network with a one-dimensional convolutional layer to recognize patterns, the neural network further including a layer to linearly alter data sizes.
17. A system of claim 16, wherein the neural network structure comprises a convolutional neural network architecture, and further comprises one or more of: layers to reduce dimensionality, layers to reduce data overfitting, layers to transform data into useful numbers, layers to stabilize and standardize data, and layers to reduce the number of data channels.
18. A system of claim 16, wherein the neural network structure comprises a recurrent neural network architecture, and further comprises one or more of: layers to transform data into useful numbers, layers to reduce the number of data channels, and layers to recurrently recognize patterns in series by using feedback connections and cell states.
19. A system of claim 11, wherein the system is part of an online conference call network and the participants are connected to the network by means of user access devices.
20. A system of claim 19, wherein the user access devices include one or more of cell phone, tablet, laptop, or desktop computer.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0008]
[0009]
[0010]
[0011]
[0012]
[0013]
[0014]
[0015]
[0016]
DETAILED DESCRIPTION OF THE DISCLOSURE
[0017] The novel features believed to be characteristic of the exemplary embodiment(s) are set forth with particularity in the appended claims. The disclosure itself, however, both as to its organization and method of operation, together with further objects and advantages thereof, may best be understood by reference to the following description taken in connection with the accompanying drawings.
[0018] The non-limiting exemplary embodiment(s) will now be described more fully hereinafter with reference to the accompanying drawings, in which a preferred embodiment of the disclosure is shown. Such exemplary embodiment(s) may, however, be embodied in many different forms and should not be construed as limited to the embodiment set forth herein. Rather, these embodiment(s) are provided so that this application will be thorough and complete, and will fully convey the true scope of the disclosure to those skilled in the art.
[0019] The below disclosed subject matter is to be considered illustrative, and not restrictive, and any appended claim(s) are intended to cover all such modifications, enhancements, and other embodiment(s) which fall within the true scope of the non-limiting exemplary embodiment(s). Thus, to the maximum extent allowed by law, the scope of the non-limiting exemplary embodiment(s) is to be determined by the broadest permissible interpretation of the claim(s) and their equivalents, and shall not be restricted or limited by the foregoing detailed description.
[0020] References in the specification to “an exemplary embodiment”, “an embodiment”, “a preferred embodiment”, “an alternative embodiment” and similar phrases mean that a particular feature, structure, or characteristic described in connection with the embodiment(s) is included in at least an embodiment of the disclosure. The appearances of the phrase “a non-limiting exemplary embodiment” in various places in the specification are not necessarily all meant to refer to the same embodiment.
[0021] The non-limiting exemplary embodiments of the present disclosure discussed with respect to
[0022] As a broad overview, the implementations discussed below with respect to
[0023] Advantageously, the present disclosure interfaces speech emotion signals with a screen of a user interface to express emotions as written feedback or in terms of icons or graphic images that are illustrative of the emotion. As mentioned above, while the present embodiments specifically describe speech analysis to extract emotional content, it will be appreciated that this may be augmented with facial and body imaging content using computer vision to corroborate the emotional analysis obtained from the voice sample. The structure, data, and classification of the speech's emotion have been altered significantly over the prior art. The neural network has a unique architecture, wherein the training data frame contains different classes along with Mel Spectrum and/or MFCC values. The classes of the data frame include the emotions that the model classifies, along with other signal information, such as silence and noise, to identify non-speech that does not include emotion information. In one embodiment the model classifies the following emotions: happy, sad, fearful, disgusted, angry, surprise, and neutral. The Mel Spectrum and/or MFCC values are the data that the neural network uses to find patterns that can be used to identify each class in this embodiment; these are included in the data frame alongside the classes.
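By way of illustration only, the training data frame described above might be assembled as in the following minimal sketch. The class names "silence" and "noise", the class ordering, the one-hot label encoding, the helper name build_data_frame, and the toy four-value feature vectors are all assumptions made for this example and are not taken from the disclosure.

```python
import numpy as np

# The seven emotions named in the text, plus two assumed non-speech classes.
CLASSES = ["happy", "sad", "fearful", "disgusted", "angry",
           "surprise", "neutral", "silence", "noise"]
CLASS_TO_INDEX = {name: index for index, name in enumerate(CLASSES)}


def build_data_frame(examples):
    """Stack (feature_vector, label) pairs into features and one-hot labels.

    Hypothetical helper: each feature vector stands in for the Mel Spectrum
    or MFCC values of one audio clip.
    """
    features = np.stack([np.asarray(vec, dtype=np.float32) for vec, _ in examples])
    labels = np.zeros((len(examples), len(CLASSES)), dtype=np.float32)
    for row, (_, label) in enumerate(examples):
        labels[row, CLASS_TO_INDEX[label]] = 1.0
    return features, labels


# Toy values only; real rows would hold Mel Spectrum or MFCC values.
examples = [
    ([0.1, 0.2, 0.3, 0.4], "happy"),
    ([0.0, 0.0, 0.0, 0.0], "silence"),
    ([0.9, 0.1, 0.5, 0.2], "angry"),
]
features, labels = build_data_frame(examples)
```

Keeping the non-speech classes in the same label set as the emotions lets a single classifier reject silence and noise instead of forcing an emotion onto them.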
[0024] Referring to
[0025] As shown in
[0026]
[0027] The pre-processing and neural network of
[0028] The pre-processing step involves processing a time series of the spoken audio input by transforming the time series into a one-dimensional Mel Spectrogram, or a two-dimensional Mel-Frequency Cepstral Coefficient (MFCC) matrix, which is then reduced back to a single-dimension output by means of mean normalization or other dimensionality reduction, before feeding the single dimensional output into the neural network to identify at least one emotion in the audio input. This is illustrated and discussed below with respect to
[0029] Thus, one embodiment of the present disclosure provides, among other things, a system and method for recognizing sentiment from a spoken audio input signal. It provides, among other things, a computerized method for recognizing one or more emotions from the spoken audio input signal. This computerized method includes using a computer to extract exemplary features from the spoken audio input signal, wherein the method includes using a pre-processing stage involving Mel Spectrograms.
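As a non-limiting sketch of the Mel Spectrogram feature extraction mentioned above, the following code frames a time series, takes the power spectrum of each frame, and applies a bank of triangular mel-scale filters. The frame length, hop size, number of mel bands, Hann window, the particular mel-scale formula, and the 440 Hz test tone are assumptions chosen for the example; the disclosure does not specify them.

```python
import numpy as np


def hz_to_mel(hz):
    # One common mel-scale formula (an assumed choice).
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale: (n_mels, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def mel_spectrogram(signal, sample_rate, n_fft=512, hop=256, n_mels=40):
    """Return an (n_mels, n_frames) mel power spectrogram of a 1-D signal."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_fft // 2 + 1)
    return mel_filterbank(n_mels, n_fft, sample_rate) @ power.T


# One second of a synthetic 440 Hz tone stands in for captured speech.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2.0 * np.pi * 440.0 * t)
spec = mel_spectrogram(tone, sample_rate)
```

In practice an audio library would normally supply these steps; the hand-written version is shown only to make each stage of the transformation visible.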
[0030] The Mel Spectrum value at each time point goes through mean normalization and is then passed through a multi-layered recurrent neural network as discussed above, or through a one-dimensional convolutional neural network, where the sentiments associated with the spoken audio input signal are learned during the training phase, and subsequently applied to new audio to select the emotion during runtime.
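The reduction and normalization step can be sketched as follows. The disclosure does not give a formula for mean normalization, so this example assumes one possible reading: each row of a two-dimensional matrix (coefficients by frames) is averaged over time to yield a single-dimension vector, and that vector is then rescaled with the textbook formula (x - mean) / (max - min). The function names and the small example matrix are hypothetical.

```python
import numpy as np


def reduce_over_time(matrix):
    """Collapse (n_coefficients, n_frames) to (n_coefficients,) by row averages."""
    return np.asarray(matrix, dtype=np.float64).mean(axis=1)


def mean_normalize(vector):
    """Textbook mean normalization: (x - mean) / (max - min)."""
    vector = np.asarray(vector, dtype=np.float64)
    span = vector.max() - vector.min()
    if span == 0.0:
        # A constant vector carries no contrast; return zeros to avoid dividing by zero.
        return np.zeros_like(vector)
    return (vector - vector.mean()) / span


# Toy 3-coefficient, 2-frame matrix standing in for an MFCC matrix.
mfcc = np.array([[1.0, 3.0],
                 [4.0, 6.0],
                 [7.0, 11.0]])
reduced = reduce_over_time(mfcc)       # row means: [2., 5., 9.]
normalized = mean_normalize(reduced)   # zero mean, unit range
```

The resulting vector has a fixed length regardless of clip duration, which is what allows it to be fed to a network with a fixed input size.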
[0031] The present disclosure also provides, among other things, a system, wherein the system includes a computer program product (software) for execution on a computer or mobile device, which may also be implemented as a non-transitory, computer readable medium for recognizing one or more emotions from the spoken audio input signal, the computer readable medium having program code stored therein that when executed is configured to use a computer to extract exemplary features from the spoken audio input signal, wherein the logic of the software includes a Mel Spectrogram and/or Mel-Frequency Cepstral Coefficient (MFCC), and includes a mean normalization step, the system further including a computer configured to define a recurrent or convolutional neural network, which is used to assign at least one sentiment value to the spoken audio input signal based on a first comparison to training data.
[0032] In the present implementation of the invention the sentiment value output is communicated as an output signal to a user interface (e.g., computer or cell phone screen) to provide visual feedback associated with the sentiment value.
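A minimal sketch of how a sentiment value might be turned into visual feedback is given below. The function name render_feedback, the message wording, and the icon file paths are hypothetical and serve only to illustrate pairing a written description with a graphic representation.

```python
def render_feedback(scores, classes):
    """Pick the highest-scoring class and describe it for display.

    Hypothetical helper: returns the label, a written description, and the
    path of an icon that a user interface could show (paths are made up).
    """
    best = max(range(len(scores)), key=scores.__getitem__)
    label = classes[best]
    return {
        "label": label,
        "text": f"Detected emotion: {label} ({scores[best]:.0%})",
        "icon": f"icons/{label}.png",
    }


feedback = render_feedback([0.1, 0.7, 0.2], ["happy", "sad", "angry"])
print(feedback["text"])  # Detected emotion: sad (70%)
```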
[0033]
[0041] Together, this combination of layers shown in the figures is trained on classified input data to generate the mathematical weights and biases stored in the proprietary trained neural network of the same structure. When the software is utilized, new input data is run through this trained neural network to classify emotions. It will be appreciated that the various implementations of the layers mentioned above are by way of example only. Other layer implementations can be used to achieve the same purpose of reducing dimensionality, reducing overfitting, etc. For example, linear alteration of the data size, dimensionality reduction, and/or reduction in the number of data channels could be achieved using rescaling, reshaping, or attention layers, etc. Reduction in data overfitting could be achieved using grid search, activity regularization, or LASSO layers, etc. Data transformation into useful numbers could be achieved using regularization, encoding, or discretization layers, etc. Stabilizing and standardizing data could be achieved using average, masking, or maximum/minimum layers, etc.
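The inference path through a one-dimensional convolutional network can be sketched as below. This is an illustration under stated assumptions and not the trained network of the disclosure: the layer types chosen for each function (max pooling for dimensionality reduction, a per-channel standardization as a stand-in for a normalization layer), the layer sizes, and the random weights are all assumed. A dropout layer, which would address overfitting, acts only during training and is therefore omitted from this forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)


def conv1d(x, kernels, bias):
    """Valid 1-D convolution. x: (length, in_ch); kernels: (width, in_ch, out_ch)."""
    width = kernels.shape[0]
    steps = x.shape[0] - width + 1
    out = np.empty((steps, kernels.shape[2]))
    for t in range(steps):
        out[t] = np.tensordot(x[t:t + width], kernels, axes=([0, 1], [0, 1])) + bias
    return out


def relu(x):
    return np.maximum(x, 0.0)


def max_pool1d(x, size):
    """Keep the maximum of each non-overlapping block along the length axis."""
    steps = x.shape[0] // size
    return x[:steps * size].reshape(steps, size, x.shape[1]).max(axis=1)


def standardize(x, eps=1e-5):
    """Zero-mean, unit-variance scaling per channel (stand-in for a normalization layer)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def forward(features, params):
    x = features[:, None]                                # (length, 1 channel)
    x = relu(conv1d(x, params["k"], params["kb"]))       # recognize patterns, activate
    x = standardize(x)                                   # stabilize and standardize
    x = max_pool1d(x, 2)                                 # reduce dimensionality
    x = x.reshape(-1)                                    # flatten
    return softmax(x @ params["w"] + params["b"])        # dense layer to class scores


n_features, n_classes = 40, 7
# Random weights for illustration; a trained network would load learned values.
params = {
    "k": rng.normal(size=(5, 1, 8)) * 0.1,
    "kb": np.zeros(8),
    "w": rng.normal(size=(18 * 8, n_classes)) * 0.1,     # (40 - 5 + 1) // 2 = 18 steps
    "b": np.zeros(n_classes),
}
probs = forward(rng.normal(size=n_features), params)
```

The output is one probability per class; the class with the highest probability would be reported as the identified emotion.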
[0042] As mentioned above, and as illustrated in
[0043] Referring to
[0044] One implementation of a system of the present invention is shown in
[0045] While the above embodiment made use of a convolutional neural network (CNN) for the machine learning model, another implementation makes use of a recurrent neural network (RNN).
[0046] simple RNN layers configured to recognize patterns,
[0047] flatten layers to linearly alter data sizes,
[0048] activation layers to transform data into useful numbers,
[0049] dense layers to reduce the number of data channels, and
[0050] long short-term memory layers to recurrently recognize patterns in series by using feedback connections and cell states.
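To illustrate what is meant by feedback connections and cell states, the following sketch steps a single long short-term memory (LSTM) cell over a one-dimensional feature sequence. The gate ordering, hidden size, input length, and random weights are assumptions for the example; a trained network would use learned weights and may stack further layers as listed above.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4 * hidden, n_in), U: (4 * hidden, hidden), b: (4 * hidden,).
    Gate order (input, forget, output, candidate) is an assumed convention.
    """
    hidden = h_prev.shape[0]
    gates = W @ x + U @ h_prev + b           # U @ h_prev is the feedback connection
    i = sigmoid(gates[0:hidden])
    f = sigmoid(gates[hidden:2 * hidden])
    o = sigmoid(gates[2 * hidden:3 * hidden])
    g = np.tanh(gates[3 * hidden:])
    c = f * c_prev + i * g                   # cell state carries long-term memory
    h = o * np.tanh(c)                       # hidden state is fed back at the next step
    return h, c


def run_lstm(sequence, W, U, b):
    """Feed a 1-D sequence through the cell and return the final hidden state."""
    hidden = U.shape[1]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for value in sequence:
        h, c = lstm_step(np.atleast_1d(value), h, c, W, U, b)
    return h


rng = np.random.default_rng(0)
hidden, n_in = 16, 1
W = rng.normal(size=(4 * hidden, n_in)) * 0.1
U = rng.normal(size=(4 * hidden, hidden)) * 0.1
b = np.zeros(4 * hidden)
final_state = run_lstm(rng.normal(size=40), W, U, b)
```

The final hidden state summarizes the whole sequence and could be passed to a dense layer to produce one score per emotion class.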
[0051] Again, it will be appreciated that the various functions of the RNN can be implemented using different layers. For example, linear alteration of the data size, dimensionality reduction, and/or reduction in the number of data channels could be achieved using rescaling, reshaping, or attention layers, etc. Data transformation into useful numbers could be achieved using regularization, encoding, or discretization layers, etc. Recurrent recognition of patterns in series using feedback connections and cell states could be achieved using convolutional LSTM layers, gated recurrent units, or stacked recurrent layers, etc.
[0052] Together, this combination of layers is trained on classified input data to generate the mathematical weights and biases stored in the proprietary trained neural network of the same structure. When the software is utilized, new input data is run through this trained neural network to classify emotions.
[0053] The processor may include a microprocessor or other device capable of being programmed or configured to perform computations and instruction processing in accordance with the disclosure. Such other devices may include microcontrollers, digital signal processors (DSP), Complex Programmable Logic Device (CPLD), Field Programmable Gate Arrays (FPGA), application-specific integrated circuits (ASIC), discrete gate logic, and/or other integrated circuits, hardware or firmware in lieu of or in addition to a microprocessor.
[0054] Functions and process steps described herein may be performed using programmed computer devices and related hardware, peripherals, equipment and networks. Such programming may comprise operating systems, software applications, software modules, scripts, files, data, digital signal processors (DSP), application-specific integrated circuit (ASIC), discrete gate logic, or other hardware, firmware, or any conventional programmable software, collectively referred to herein as a module.
[0055] The computer programs (e.g., the operating system, pre-processing stage and neural network) are typically stored in a memory that includes the programmable software instructions that are executed by the processor. In particular, the programmable software instructions include a plurality of chronological operating steps that define a control logic algorithm for performing the intended functions of the present disclosure. Such software instructions may be written in a variety of computer program languages such as C++, C#, etc.
[0056] The memory, which enables storage of data in addition to the computer programs, may include RAM, ROM, flash memory and any other form of readable and writable storage medium known in the art or hereafter developed. The memory may be a separate component or an integral part of another component such as the processor.
[0057] Further, alternative software implementations including, but not limited to, distributed processing or component/object distributed processing, parallel processing, or virtual machine processing can be constructed to implement methods described herein.
[0058] The present specification describes components and functions that may be implemented according to particular standards and protocols (e.g., TCP/IP, UDP/IP, HTML, HTTP, etc.).
[0059] In a non-limiting exemplary embodiment, a microphone may be used to capture verbal input signals, and the system may include a user interface, e.g., a keyboard, mouse, etc.
[0060] While the disclosure has been described with respect to certain specific embodiment(s), it will be appreciated that many modifications and changes may be made by those skilled in the art without departing from the spirit of the disclosure. It is intended, therefore, by the description hereinabove to cover all such modifications and changes as fall within the true spirit and scope of the disclosure. In particular, with respect to the above description, it is to be realized that the optimum dimensional relationships for the parts of the exemplary embodiment(s) may include variations in size, materials, shape, form, function and manner of operation.