Method of recognising a sound event
11587556 · 2023-02-21
Assignee
Inventors
- Christopher James Mitchell (Cambridgeshire, GB)
- Sacha Krstulovic (Cambridgeshire, GB)
- Cagdas Bilen (Cambridgeshire, GB)
- Juan Azcarreta Ortiz (Cambridgeshire, GB)
- Giacomo Ferroni (Cambridgeshire, GB)
- Arnoldas Jasonas (Cambridgeshire, GB)
- Francesco Tuveri (Cambridgeshire, GB)
CPC classification
G10L15/02
PHYSICS
International classification
G10L15/02
PHYSICS
Abstract
A method for recognising at least one of a non-verbal sound event and a scene in an audio signal comprising a sequence of frames of audio data, the method comprising: for each frame of the sequence: processing the frame of audio data to extract multiple acoustic features for the frame of audio data; and classifying the acoustic features to classify the frame by determining, for each of a set of sound classes, a score that the frame represents the sound class; processing the sound class scores for multiple frames of the sequence of frames to generate, for each frame, a sound class decision for each frame; and processing the sound class decisions for the sequence of frames to recognise the at least one of a non-verbal sound event and a scene.
Claims
1. A method for recognising at least one of a non-verbal sound event and a scene in an audio signal comprising a sequence of frames of audio data, the method comprising: for each frame of the sequence: processing the frame of audio data to extract multiple acoustic features for the frame of audio data; and classifying the acoustic features to classify the frame by determining, for each of a set of sound classes, a score that the frame represents the sound class, wherein classifying the acoustic features comprises classifying the frame of audio data using a set of first classifiers; processing the sound class scores for multiple frames of the sequence of frames to generate, for each frame, a sound class decision for each frame, wherein processing the sound class scores includes applying a temporal structure constraint to the sound class scores to generate the sound class decision, wherein applying the temporal structure constraint comprises processing the sound class scores to determine whether a consistency constraint is met over the sequence of frames and processing the sound class scores using a second classifier, wherein the second classifier is a neural network; and processing the sound class decisions for the sequence of frames to recognise the at least one of a non-verbal sound event and a scene.
2. The method of claim 1, wherein classifying the acoustic features comprises classifying the frame of audio data using a set of first classifiers, and wherein applying the temporal structure constraint comprises processing the sound class scores using a Viterbi optimal path search algorithm.
3. The method of claim 1, wherein the set of first classifiers comprises a set of neural network classifiers.
4. The method of claim 1, wherein processing the frame of audio data to extract the acoustic features for the frame of audio data comprises determining a feature vector defining the acoustic features for the frame of audio data.
5. The method of claim 1, wherein the frame of audio data comprises time domain audio data for a time window, and wherein processing the frame of audio data to extract the acoustic features for the frame of audio data comprises transforming the frame of audio data into frequency domain audio data.
6. The method of claim 1 wherein processing the frame of audio data to extract multiple acoustic features for the frame of audio data comprises processing the frame of audio data using a feature extraction neural network to extract the acoustic features for the frame.
7. The method of claim 1, wherein prior to said classifying the acoustic features to classify the frame, the method comprises concatenating the multiple acoustic features for the frame of audio data with multiple acoustic features for an adjacent frame of audio data in the sequence.
8. The method of claim 1, further comprising adjusting the sound class scores for multiple frames of the sequence of frames based on one or more of: knowledge about one or more of the sound classes; and knowledge about an environment in which the audio data was captured.
9. The method of claim 1, wherein processing the sound class scores for multiple frames of the sequence of frames to generate, for each frame, a sound class decision for each frame comprises using an optimal path search algorithm across more than one frame.
10. The method of claim 9, wherein the optimal path search algorithm is a Viterbi algorithm.
11. The method of claim 1, wherein processing the sound class scores for multiple frames of the sequence of frames to generate, for each frame, a sound class decision for each frame comprises: filtering the sound class scores for the multiple frames to generate a smoothed score for each frame; and comparing each smoothed score to a threshold to determine a sound class decision for each frame.
12. The method of claim 1, wherein processing the class decisions for the sequence of frames to recognise the at least one of a non-verbal sound event and scene further comprises determining a start and an end time of the at least one of a non-verbal sound event and a scene.
13. A non-transitory data carrier carrying processor control code which when running on a device causes the device to perform the method of claim 1.
14. A computer system configured to implement the method of claim 1.
15. A consumer electronic device comprising the computer system of claim 14.
16. A system for recognising at least one of a non-verbal sound event and a scene in an audio signal comprising a sequence of frames of audio data, the system comprising a microphone to capture the audio data and one or more processors, wherein the system is configured to: for each frame of the sequence: process the frame of audio data to extract multiple acoustic features for the frame of audio data; and classify the acoustic features to classify the frame by determining, for each of a set of sound classes, a score that the frame represents the sound class, wherein classifying the acoustic features comprises classifying the frame of audio data using a set of first classifiers; process the sound class scores for multiple frames of the sequence of frames to generate, for each frame, a sound class decision for each frame, wherein processing the sound class scores includes applying a temporal structure constraint to the sound class scores to generate the sound class decision, wherein applying the temporal structure constraint comprises processing the sound class scores to determine whether a consistency constraint is met over the sequence of frames and processing the sound class scores using a second classifier, wherein the second classifier is a neural network; and process the class decisions for the sequence of frames to recognise the at least one of a non-verbal sound event and scene.
17. A sound recognition device for recognising at least one of a non-verbal sound event and scene in an audio signal comprising a sequence of frames of audio data, the sound recognition device comprising: a microphone to capture the audio data; and a processor configured to: receive the audio data from the microphone; and for each frame of the sequence: process the frame of audio data to extract multiple acoustic features for the frame of audio data; and classify the acoustic features to classify the frame by determining, for each of a set of sound classes, a score that the frame represents the sound class, wherein classifying the acoustic features comprises classifying the frame of audio data using a set of first classifiers; process the sound class scores for multiple frames of the sequence of frames to generate, for each frame, a sound class decision for each frame, wherein processing the sound class scores includes applying a temporal structure constraint to the sound class scores to generate the sound class decision, wherein applying the temporal structure constraint comprises processing the sound class scores to determine whether a consistency constraint is met over the sequence of frames and processing the sound class scores using a second classifier, wherein the second classifier is a neural network; and process the class decisions for the sequence of frames to recognise the at least one of a non-verbal sound event and scene.
18. The method of claim 1, wherein applying the temporal structure constraint comprises one or more of (i) requiring that a specified number or proportion of the sequence of frames have a similar sound class score; (ii) requiring that a specified number or proportion of the sequence of frames have the same sound class decisions; (iii) requiring that a consistency metric is satisfied for the sequence of frames; or (iv) processing the sound class scores of the sequence of frames using a process which is responsive to a history of the sound class scores.
19. A method for recognising at least one of a non-verbal sound event and a scene in an audio signal comprising a sequence of frames of audio data, the method comprising: for each frame of the sequence: processing the frame of audio data to extract multiple acoustic features for the frame of audio data; and classifying the acoustic features to classify the frame by determining, for each of a set of sound classes, a score that the frame represents the sound class, wherein classifying the acoustic features comprises classifying the frame of audio data using a set of first classifiers; processing the sound class scores for multiple frames of the sequence of frames to generate, for each frame, a sound class decision for each frame, wherein processing the sound class scores includes applying a temporal structure constraint to the sound class scores to generate the sound class decision, wherein applying the temporal structure constraint comprises processing the sound class scores to determine whether a consistency constraint is met over the sequence of frames, and processing the sound class scores using a Viterbi optimal path search algorithm; and processing the sound class decisions for the sequence of frames to recognise the at least one of a non-verbal sound event and a scene.
Description
BRIEF DESCRIPTION OF DRAWINGS
(1) Embodiments of the invention will be described, by way of example, with reference to the accompanying drawings.
DETAILED DESCRIPTION OF THE DRAWINGS
(7) The system comprises a device 101. The device 101 may be any type of electronic device. The device 101 may be a consumer electronic device. For example, the consumer electronic device 101 may be a smartphone, a headphone, a smart speaker, a car, a digital personal assistant, a personal computer or a tablet computer. The device 101 comprises a memory 102, a processor 103, a microphone 105, an analogue to digital converter (ADC) 106, an interface 108 and an interface 107. The processor 103 is connected to the memory 102, the microphone 105, the ADC 106, the interface 108 and the interface 107. The processor 103 is configured to recognise a non-verbal sound event and/or scene by running computer code stored on the memory 102. For example, the processor 103 is configured to perform the method 200 described below.
(8) The microphone 105 is configured to convert a sound into an audio signal. The audio signal may be an analogue signal, in which case the microphone 105 is coupled to the ADC 106 via the interface 108. The ADC 106 is configured to convert the analogue audio signal into a digital signal. The digital audio signal can then be processed by the processor 103. In embodiments, a microphone array (not shown) may be used in place of the microphone 105.
(9) Although the ADC 106 and the microphone 105 are shown as part of the device 101, one or more of the ADC 106 and the microphone 105 may be located remotely to the device 101. If one or more of the ADC 106 and the microphone 105 are located remotely to the device 101, the processor 103 is configured to communicate with the ADC 106 and/or the microphone 105 via the interface 108 and optionally further via the interface 107.
(10) The processor 103 may further be configured to communicate with a remote computing system 109. The remote computing system 109 is configured to recognise a non-verbal sound event and/or scene; therefore, the processing steps required to recognise a non-verbal sound event and/or scene may be spread between the processor 103 and the processor 113. The remote computing system 109 comprises a processor 113, an interface 111 and a memory 115. The interface 107 of the device 101 is configured to interact with the interface 111 of the remote computing system 109 to enable this sharing of processing steps.
(12) A step 201 shows acquiring a digital audio sample 215. The audio sample may have been acquired by a microphone, for example the microphone 105 described above.
(13) The digital audio sample 215 is grouped into a series of 32 ms long frames with a 16 ms hop size. If the sampling frequency is 16 kHz, then this is equivalent to the digital audio sample 215 being grouped into a series of frames that each comprise 512 audio samples, with a 256-sample hop size.
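By way of illustration, the framing described above may be sketched as follows. This is a minimal numpy sketch under the stated assumptions (16 kHz mono input, 512-sample frames, 256-sample hop); the function name is illustrative.

```python
import numpy as np

def frame_audio(samples: np.ndarray, frame_len: int = 512, hop: int = 256) -> np.ndarray:
    """Group a 16 kHz signal into 32 ms frames with a 16 ms hop."""
    n_frames = 1 + (len(samples) - frame_len) // hop
    return np.stack([samples[i * hop : i * hop + frame_len] for i in range(n_frames)])

frames = frame_audio(np.zeros(16000))  # one second of audio -> 61 frames
```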
(14) Once the digital audio sample 215 has been acquired, feature extraction is performed on the frames of the digital audio sample 215, as shown in the step 203. The feature extraction 203 results in a sequence of feature frames 217. The feature extraction step 203 comprises transforming the digital audio sample 215 into a series of multidimensional feature vectors (i.e. frames), for example emitted every 16 ms. The feature extraction of step 203 may be implemented in a variety of ways.
(15) One implementation of feature extraction step 203 is to perform one or more signal processing algorithms on the frames of the digital audio sample 215. An example of a signal processing algorithm is an algorithm that processes a power spectrum of the frame to extract a spectral flatness value for the frame. A further example is a signal processing algorithm that extracts harmonics and their relative amplitudes from the frame.
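As an illustration of one such hand-crafted feature, a spectral flatness value can be computed as the ratio of the geometric mean to the arithmetic mean of the frame's power spectrum. The sketch below is one common formulation, offered as an assumption rather than a transcription of the patent's algorithm.

```python
import numpy as np

def spectral_flatness(frame: np.ndarray, eps: float = 1e-12) -> float:
    """Ratio of geometric to arithmetic mean of the power spectrum."""
    power = np.abs(np.fft.rfft(frame)) ** 2 + eps  # eps guards log(0)
    geometric_mean = np.exp(np.mean(np.log(power)))
    return float(geometric_mean / np.mean(power))
```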
(16) An additional or alternative implementation of the feature extraction step 203 is to use a Deep Neural Network (DNN) to extract a number of acoustic features for a frame. A DNN can be configured to extract audio feature vectors of any dimension. A bottleneck DNN embedding or any other appropriate DNN embedding may be used to extract acoustic features. Here a neural network bottleneck may refer to a neural network which has a bottleneck layer between an input layer and an output layer of the neural network, where a number of units in a bottleneck layer is less than that of the input layer and less than that of the output layer, so that the bottleneck layer is forced to construct a generalised representation of the acoustic input.
(17) The feature vector stacking step 205 is an optional step of the method 200. The feature vector stacking step 205 comprises concatenating the acoustic feature vectors 217 into larger acoustic feature vectors 219. The concatenation comprises grouping adjacent feature vectors into one longer (i.e. a higher dimensional) feature vector.
(18) For example, if an acoustic feature vector comprises 32 features, the feature vector stacking step 205 may produce a 352-dimensional stacked feature vector by concatenating an acoustic feature vector with the 5 acoustic feature vectors before and after the considered acoustic feature vector (352 dimensions = 32 dimensions × 11 frames, where 11 frames = 5 preceding acoustic feature vectors + 1 central acoustic feature vector + 5 following acoustic feature vectors).
(19) An alternative example of the feature vector stacking step 205 would be to stack 15 acoustic feature vectors before and after a central acoustic feature vector, where an original acoustic feature vector having 43 features would produce a stacked acoustic feature vector with 1333 dimensions (1333d=43d×31 acoustic feature vectors, where 31 acoustic feature vectors=15 before+1 central+15 after).
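The stacking operation itself can be sketched as below; the edge-padding policy (repeating the first and last frames) is an assumption, as the text does not say how boundary frames are handled.

```python
import numpy as np

def stack_features(feats: np.ndarray, context: int = 5) -> np.ndarray:
    """(n_frames, n_dims) -> (n_frames, n_dims * (2*context + 1))."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[i : i + 2 * context + 1].ravel()
                     for i in range(feats.shape[0])])

stacked = stack_features(np.zeros((100, 32)), context=5)  # -> (100, 352)
```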
(20) The acoustic modelling step 207 comprises classifying the acoustic features to classify the frame by determining, for each of a set of sound classes, a score that the frame represents the sound class. The acoustic modelling step 207 comprises using a deep neural network (DNN) trained to classify each incoming stacked or non-stacked acoustic feature vector into a sound class (e.g. glass break, dog bark, baby cry etc.). Therefore, the input of the DNN is an acoustic feature vector and the output is a score for each sound class. The scores for each sound class for a frame may collectively be referred to as a frame score vector. For example, the DNN used in the step 207 is configured to output a score for each sound class modelled by the system every 16 ms.
(21) An example DNN used in step 207 is a feed-forward fully connected DNN having 992 inputs (a concatenated feature vector comprising 15 acoustic vectors before and 15 acoustic vectors after a central acoustic vector, i.e. 31 frames × 32 dimensions in total). The example DNN has 3 hidden layers with 128 units per layer and ReLU activations.
(22) Alternatively, a convolutional neural network (CNN), a recurrent neural network (RNN) and/or some other form of deep neural network architecture or combination thereof could be used.
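A sketch of the example feed-forward acoustic model is given below, using PyTorch as one possible framework; the framework choice and the softmax output are assumptions, as the text specifies only the layer sizes and ReLU activations.

```python
import torch.nn as nn

n_classes = 6  # illustrative: matches the 6-class example later in the text

acoustic_model = nn.Sequential(
    nn.Linear(992, 128), nn.ReLU(),   # 992 = 31 stacked frames x 32 dims
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, n_classes),
    nn.Softmax(dim=-1),               # per-class scores for one frame
)
```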
(23) A schematic example of an output of the DNN is shown at 221. In this example, there are three different sound classes represented by three colours: grey (227), red (223) and blue (225). The horizontal axis represents time and the vertical axis represents a value of a score (where a downward vertical direction represents a high score). Each dot is a score value corresponding to a frame of audio data.
(24) A score warping step 209 is an optional step that follows 207. In step 209, the scores are reweighted according to probabilities learned from application-related data. In other words, the scores output by the DNN in step 207 are adjusted based on some form of knowledge other than the audio data acquired in step 201. The knowledge may be referred to as external information, examples of such external information can be seen at 208.
(25) As an example, the score warping 209 may comprise using prior probabilities of sound event and/or scene occurrence for a given application to reweight one or more scores. For example, for sound recognition in busy homes, the scores for any sound class related to speech events and/or scenes would be weighted up. In contrast, for sound recognition in unoccupied homes, the scores for any sound class related to speech events and/or scenes would be weighted down.
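A minimal sketch of such reweighting is given below; the class names and weight values are invented for illustration.

```python
import numpy as np

# Illustrative priors: weight speech up (busy home) or down (unoccupied home).
class_weights = np.array([1.0, 2.0, 0.5])  # e.g. world, speech, glass_break

def warp_scores(frame_scores: np.ndarray) -> np.ndarray:
    """Reweight per-class scores by application priors and renormalise."""
    warped = frame_scores * class_weights
    return warped / warped.sum()
```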
(26) Long-term acoustic analysis is performed at step 211. The long-term acoustic analysis performed at step 211 comprises processing the sound class scores for multiple frames of the sequence of frames to generate, for each frame, a sound class decision for each frame. The long-term acoustic analysis performed at step 211 outputs frame-level classification decisions after integrating longer term temporal information, typically spanning one or several seconds, into the frame-level scoring.
(27) As an example, if there are four sound classes A, B, C and D, the long-term acoustic analysis performed at step 211 comprises receiving a sequence of vectors. Each vector has four dimensions, where each dimension represents an (optionally reweighted) score for a class. The long-term acoustic analysis performed at step 211 comprises processing the multiple vectors that represent a long-term window, typically a 1.6 second (100 score values) long context window. The long-term acoustic analysis performed at step 211 then outputs a classification decision for each frame (i.e. the output is A, B, C or D for each frame, rather than 4 scores for each frame). The long-term acoustic analysis performed at step 211 therefore uses information derived from frames across a long-term window.
(28) The long-term acoustic analysis can be used in conjunction with external duration or co-occurrence models. For example, transition matrices can be used to impart long-term information and can be trained independently of the Viterbi algorithm. Transition matrices are an example of a co-occurrence model and also, implicitly, a duration model. Co-occurrence models comprise information representing a relation or an order of events and/or scenes. An explicit model of duration probabilities can be trained from ground truth labels (i.e. known data), for example by fitting a Gaussian probability density function on the durations of one or several baby cries as labelled by human listeners. In this example, a baby cry may last between 0.1 s and 2.5 s and be 1.3 s long on average. More generally, the statistics of duration can be learned from external data, for example from label durations or from a specific study on the duration of a specific sound event and/or scene. Many types of model can be used as long as they are able to generate some sort of class-dependent duration or co-occurrence score/weight (e.g. graphs, decision trees etc.) which can, for example, be used to rescore one or more Viterbi paths or, alternatively, be combined with the sound class scores by some method other than the Viterbi algorithm across the long term, for example across a sequence of score frames spanning 1.6 s.
(29) Examples of the long-term acoustic analysis performed at step 211 are given below, where the long-term acoustic analysis may thus apply a temporal structure constraint.
(30) Score smoothing and thresholding
(31) Viterbi optimal path search
(32) A recurrent DNN trained to integrate the frame decisions across a long-term window.
(33) In more detail:
(34) a) Score Smoothing and Thresholding Across Long Term Window
(35) Median filtering or some other form of long-term low-pass filtering (for example a moving average filter) may be applied to the score values spanned by the long-term window. The smoothed scores may then be thresholded to turn the scores into class decisions, e.g., when a baby cry score is above the threshold then the decision for that frame is baby cry, otherwise the decision is world (“not a baby”). There is one threshold per class/per score.
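A sketch of this option, assuming scores arranged as an (n_frames, n_classes) array and per-class thresholds passed in by the caller; the median-filter length is an illustrative value.

```python
import numpy as np
from scipy.signal import medfilt

def smooth_and_threshold(scores: np.ndarray, thresholds: np.ndarray,
                         kernel: int = 11) -> np.ndarray:
    """Median-filter each class's score track, then apply per-class thresholds.

    scores: (n_frames, n_classes); returns boolean decisions per frame/class.
    """
    smoothed = np.stack([medfilt(scores[:, c], kernel)
                         for c in range(scores.shape[1])], axis=1)
    return smoothed > thresholds  # one threshold per class/per score
```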
(36) b) Viterbi Optimal Path Search Across a Long Term Window
(37) The input to the Viterbi algorithm used to perform step 211 comprises:
- A state-space definition: there are S states where each state (s_i) is a sound class, for example: s_0==world; s_1==baby_cry; s_2==glass_break; etc. In one configuration there are 6 states; in general there are as many states as there are classes to be recognised, plus an extra state representing all other sounds (labelled as a 'world' class, i.e. a non-target sound class, in the above).
- An array of initial probabilities: this is an S-sized array, where the i-th element is the probability that the decoded sequence starts with state i. In an example, these probabilities are all equal (for example, all equal to 1/S).
- A transition matrix A: this is an S×S matrix where the element (i, j) is the probability of moving from state i to state j. In an example configuration, this matrix is used to block transitions between target classes: for example, the probabilities of row 0 (world class) are all greater than zero, which means the state can move from world to all other target classes, but in row 1 (baby cry) only columns 0 and 1 are non-zero, which means that from baby cry the state can either stay in the baby cry state or move to the world state. Corresponding rules apply for the other rows.
- An emission matrix: this is an N×S matrix where the element (i, j) is the score (given by the acoustic model, after warping) of observing class j at time frame i. In an example, N is equal to 100. In this example, the time window is 100 frames long (i.e. 1.6 seconds) and it moves in steps of 100 frames, so there is no overlap.
(38) In other words, every time that the Viterbi algorithm is called, the Viterbi algorithm receives as an input, for example, 100 sound class scores and outputs 100 sound class decisions.
(39) The settings are flexible, i.e., the number of frames could be set to a longer horizon and/or the frames could overlap.
(40) Transition matrices can be used to forbid the transition between certain classes, for example, a dog bark decision can be forbidden to appear amongst a majority of baby cry decisions.
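A compact decoder over these structures is sketched below; working in the log domain is our addition for numerical stability, not something the text specifies.

```python
import numpy as np

def viterbi(emissions: np.ndarray, transitions: np.ndarray,
            initial: np.ndarray) -> np.ndarray:
    """emissions: (N, S) class scores; transitions: (S, S); initial: (S,).

    Returns the most likely state (sound class) for each of the N frames.
    """
    eps = 1e-12                        # guards log(0) for blocked transitions
    n, _ = emissions.shape
    log_e = np.log(emissions + eps)
    log_t = np.log(transitions + eps)
    delta = np.log(initial + eps) + log_e[0]
    backptr = np.zeros_like(emissions, dtype=int)
    for t in range(1, n):
        cand = delta[:, None] + log_t  # score of moving from state i to j
        backptr[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + log_e[t]
    path = np.zeros(n, dtype=int)
    path[-1] = delta.argmax()
    for t in range(n - 1, 0, -1):      # follow back-pointers to recover path
        path[t - 1] = backptr[t, path[t]]
    return path
```

For example, with 100 warped score frames and the transition matrix described above, `viterbi(scores, A, np.full(S, 1 / S))` returns 100 sound class decisions.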
(41) c) DNN Across a Long-Term Window
Examples of a DNN used to perform the long-term acoustic analysis performed at step 211 are:
(42) A long short-term memory recurrent neural network (LSTM-RNN) with 101 stacked frame score vectors (50 frames before and after a target frame), where score frame vectors contain 6 scores (one for each of 6 classes) as input. Thus, the input size is a 101 by 6 tensor. The rest of the DNN comprises 1 LSTM hidden layer with 50 units, hard sigmoid recurrent activation, and tanh activation. The output layer has 6 units for a 6-class system.
(43) A gated recurrent unit RNN (GRU-RNN): the input size is similarly a 101 by 6 tensor, after which there are 2 GRU hidden layers with 50 units each, and tanh activation. Before the output layer, a temporal max pooling with a pool size of 2 is performed. The output layer has 6 units for a 6-class system.
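The LSTM variant can be sketched in Keras, whose activation names match the text's "hard sigmoid recurrent activation"; the framework choice and the softmax output layer are assumptions.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(101, 6)),  # 101 stacked score frames, 6 classes
    tf.keras.layers.LSTM(50, activation="tanh",
                         recurrent_activation="hard_sigmoid"),
    tf.keras.layers.Dense(6, activation="softmax"),  # 6-class output
])
```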
(44) Long-term information can be inflected by external duration or co-occurrence models, for example transition matrices in case b) of using a Viterbi optimal path search, or by an external model made by learning typical event and/or scene lengths, for example probabilities of event and/or scene duration captured by some machine learning method, typically DNNs.
(45) At the step 213, the sound class decisions for a sequence of frames are processed to recognise a non-verbal sound event and/or scene. In an example, the sound class decisions for multiple frames are input and an indication of one or more non-verbal sound events and/or scenes is output. Examples of how step 213 may be performed are explained below; one or more of these examples may be implemented in the step 213 (a combined sketch follows this list):
a) the sound class decisions for each frame may be grouped into long-term event and/or scene symbols with a start time, an end time and a duration;
b) a sequence of sound class decisions of the same class may be discarded if it is shorter than a sound event and/or scene duration threshold defined individually for each sound class. For example: a sequence of "baby cry" sound class decisions can be discarded if the sequence is collectively shorter than 160 milliseconds (approximately equivalent to 10 frames); a sequence of "smoke alarm" sound class decisions can be discarded if the sequence is collectively shorter than 0.4 seconds (approximately equivalent to 25 frames). The sound event and/or scene duration thresholds can be set manually for each class;
c) multiple non-verbal sound events and/or scenes of the same sound class that intersect a particular time window may be merged into a single non-verbal sound event and/or scene. For example, if two "baby cry" non-verbal sound events and/or scenes are determined to happen within a 4 second interval then they are merged into a single "baby cry" non-verbal sound event and/or scene, where the window duration (4 seconds in the above example) is a parameter which can be manually tuned. The window duration can be different for each sound class.
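The sketch below combines the three examples under stated assumptions: decisions arrive as one label per 16 ms frame, 'world' is the non-target class, and the per-class minimum durations are supplied by the caller.

```python
from itertools import groupby

def decisions_to_events(decisions, hop_s=0.016, min_dur=None, merge_gap=4.0):
    """Group per-frame decisions into (class, start, end) events, drop events
    shorter than a per-class minimum duration, and merge same-class events
    closer together than merge_gap seconds."""
    min_dur = min_dur or {}            # e.g. {"baby_cry": 0.16, "smoke_alarm": 0.4}
    events, t = [], 0
    for cls, run in groupby(decisions):
        n = len(list(run))
        start, end = t * hop_s, (t + n) * hop_s
        t += n
        if cls != "world" and end - start >= min_dur.get(cls, 0.0):
            if events and events[-1][0] == cls and start - events[-1][2] <= merge_gap:
                events[-1] = (cls, events[-1][1], end)  # merge with previous
            else:
                events.append((cls, start, end))
    return events
```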
(47) The first step (302) of the process 300 is to capture audio data comprising multiple frames. The audio data may be captured by the microphone 105 and processed using the ADC 106. The processed audio data is output from the ADC 106 to the processor 103 via the interface 108. The processed audio data is referred to below simply as the audio data.
(48) At step 304 the audio data is processed to extract multiple acoustic features for each frame.
(49) At step 306, a sound class score is determined for each frame, for each of a set of sound classes, indicating that the frame represents the sound class. Step 306 may comprise classifying the acoustic features to classify the frame by determining, for each of a set of sound classes, a score that the frame represents the sound class.
(50) The next step (308) of the process 300 is to generate a sound class decision for each frame. This is performed by processing the sound class scores for multiple frames of the sequence of frames to generate, for each frame, a sound class decision for each frame.
(51) The next step of the process 300 is to process (step 310) the sound class decisions to recognise a non-speech sound event and/or scene.
(52) In response to recognising a non-speech sound event and/or scene, the system may optionally output a communication to a user device or a further computing device. The system may provide a visual, acoustic, or other indicator in response to recognising a non-speech sound event and/or scene.
(54) At step 402, data is input into the Neural Network. In an example, the Neural Network is configured to receive acoustic feature data of multiple frames and output sound class scores for a frame.
(55) At step 404, the output of the Neural Network is compared with training data to determine a loss, as determined using a loss function. For example, the outputted sound class scores for a frame are compared to the ground truth (sound class labels) for the frame. A loss is calculated for one or more sound classes; preferably, a loss is calculated for each of the sound classes.
(56) At step 406, the loss is back propagated. Following the back propagation, the weightings of the Neural Network are updated at step 408.
(57) In an example, a loss function comprising the following features is used to determine a loss. The loss function directly optimises the classification of multi-frame events and/or scenes, rather than considering only the classification of each short-time audio frame individually, and without resorting to an additional optimisation stage.
(58) An example loss function for training the machine learning model(s) of the system described above is:

$$\sum_i y_i \log x_i$$
(59) wherein $i$ represents a frame, $y_i$ is a sound class label for frame $i$, and $x_i$ represents one or more sound class scores for frame $i$ output by the recurrent neural network. $y_i$ may be ground truth and may be a vector comprising labels for each sound class. In this example, the machine learning models may be one or more neural networks.
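A direct transcription of this term as a sketch, assuming one-hot label vectors $y$ and per-class scores $x$ for each frame; the small epsilon is our addition to guard against a log of zero.

```python
import numpy as np

def cross_entropy_term(y: np.ndarray, x: np.ndarray, eps: float = 1e-12) -> float:
    """y, x: (n_frames, n_classes); returns sum_i y_i log x_i."""
    return float(np.sum(y * np.log(x + eps)))
```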
(60) Another example loss function for training the machine learning model(s) of the system described above combines several criteria: frames whose ground truth is the non-target sound class should be classified correctly; a sufficient proportion of the frames of a target sound event and/or scene should be classified correctly; the frame classifications should be temporally smooth; and a frame should not trigger more than one target class.
(61) In this example, the machine learning models may be one or more neural networks.
(62) Each of these criteria can be enforced with one or more specific penalty terms, each of which is explained in more detail below.
(63) Non-Target Cross Entropy
(64) The set of sound classes may comprise one or more target sound classes and one non-target sound class. A target sound class is a sound class that the described system is configured to recognise (for example "baby crying", "dog barking" or "female speaking"). The non-target sound class is a sound class that comprises all sound classes that are not target sound classes. If there are no audio events and/or scenes (that have a corresponding target sound class) in a frame, then the frame will be classified as having the non-target sound class. The non-target sound class is representative of an absence of each of the one or more target sound classes.
(65) The non-target cross entropy term penalises incorrect classification of frames whose ground truth is the non-target sound class, and can be determined by:
$$\sum_{i \,\in\, \text{non-target}} y_i \log x_i$$
(66) wherein $i$ represents a frame having a ground truth of the non-target sound class representative of an absence of each of the one or more target sound classes, $y_i$ is a sound class label for frame $i$, and $x_i$ represents one or more sound class scores for frame $i$ output by the recurrent neural network. $y_i$ may be ground truth and may be a vector comprising labels for each sound class.
(67) Target Loss
(68) For a class, in order to successfully recognise the sound event and/or scene associated with the class, it may not be necessary to correctly classify every frame. Rather, it may be sufficient to correctly classify only a percentage of the frames associated with the sound event and/or scene. For example, for a sound event and/or scene that typically has a short time duration, it may be advantageous to correctly classify the majority of the frames having the class associated with the sound event and/or scene. For a sound event and/or scene that typically has a long time duration, correctly classifying only a small percentage of the frames having the class could be sufficient. For this purpose, a weighted pooling of the scores within a class can be used. Thus, a term of the loss function may be determined as:
$$\sum_j \left[ \left( \sum_{i \,\in\, \text{label}_j} y_i \right) \log \, \mathrm{pool}_\beta\!\left(x_i, \forall i \in \text{label}_j\right) \right]$$
(69) wherein $j$ represents a target sound class, $i \in \text{label}_j$ represents a frame that has been classified as sound class $j$, $y_i$ is a sound class label for frame $i$ (i.e. the ground truth), $x_i$ represents one or more sound class scores for frame $i$ output by the recurrent neural network, and $\mathrm{pool}_\beta(x_i, \forall i \in \text{label}_j)$ is a function of the sound class scores and comprises a parameter $\beta$.
(70) $\mathrm{pool}_\beta(\,)$ is the pooling function combining a number of outputs, and may be defined as a softmax-weighted average:
(71)
$$\mathrm{pool}_\beta(x_i, \forall i) = \frac{\sum_i x_i\, e^{\beta x_i}}{\sum_i e^{\beta x_i}}$$
(72) which is equivalent to average pooling for $\beta = 0$ and max-pooling when $\beta \to \infty$.
(73) With a suitable $\beta$ parameter, this loss term leads to high values when no frames create a detection, and much lower values when a sufficient number of frames have a detection, leaving the other frames unconstrained.
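The pooling behaviour can be sketched as below, matching the softmax-weighted form given above (itself a reconstruction from the stated limits, as the source equation is not preserved).

```python
import numpy as np

def pool_beta(x: np.ndarray, beta: float) -> float:
    """Softmax-weighted average: mean at beta=0, approaches max as beta grows."""
    w = np.exp(beta * x)
    return float(np.sum(x * w) / np.sum(w))

x = np.array([0.1, 0.2, 0.9])
assert np.isclose(pool_beta(x, 0.0), x.mean())   # average pooling
print(pool_beta(x, 50.0))                        # ~0.9, near max-pooling
```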
(74) Smoothness Loss
(75) As discussed above, temporally continuous (i.e. smooth) classifications are preferable to temporally inconsistent classifications, as they are more likely to be considered as a recognition. Thus, a loss term that penalises non-smooth changes in the score for the labelled class can be used, as determined below:
$$\sum_{i \,\in\, \text{target}} y_i \log\!\left(1 - \left(x_i - \frac{x_{i-1} + x_{i+1}}{2}\right)^{2}\right)$$
(76) wherein $i$ represents a frame, $y_i$ represents a sound class label for frame $i$, $x_i$ represents one or more sound class scores for frame $i$ output by the recurrent neural network, $x_{i-1}$ represents one or more sound class scores for frame $i-1$, wherein frame $i-1$ is the frame that precedes frame $i$ in the sequence, and $x_{i+1}$ represents one or more sound class scores for frame $i+1$, wherein frame $i+1$ is the frame that follows frame $i$ in the sequence.
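A sketch of this term for a single target class, with labels and scores as per-frame vectors; the epsilon is our addition to keep the logarithm finite when the squared deviation reaches 1.

```python
import numpy as np

def smoothness_term(y: np.ndarray, x: np.ndarray, eps: float = 1e-12) -> float:
    """y, x: (n_frames,) labels and scores for one target class."""
    inner = x[1:-1] - (x[:-2] + x[2:]) / 2.0   # deviation from neighbour average
    return float(np.sum(y[1:-1] * np.log(1.0 - inner ** 2 + eps)))
```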
(77) Cross-Trigger Loss
(78) In an example, there is a loss term that penalises a frame being classified as more than one class. The loss term increases as further classes are triggered on the target label (except the world class, since missed detections are not as important). An example term performing such a function is:
$$-\sum_j \left[ \sum_{i \,\in\, \text{label}_j} y_i \log\!\left(1 - \mathrm{pool}_{\max}\!\left(x_{i,k}, \forall k \neq j\right)\right) \right]$$
(79) wherein $j$ represents a target sound class, $i \in \text{label}_j$ represents a frame $i$ having a ground truth of target sound class $j$, $y_i$ represents a sound class label for frame $i$, $x_i$ represents one or more sound class scores for frame $i$ output by the recurrent neural network, and $\mathrm{pool}_{\max}(x_{i,k}, \forall k \neq j)$ represents the highest sound class score for frame $i$ that is not the score for the target class $j$.