METHODS AND SYSTEMS FOR DETERMINING COMPACT SEMANTIC REPRESENTATIONS OF DIGITAL AUDIO SIGNALS
20220238087 · 2022-07-28
Inventors
- Peter Berg STEFFENSEN (Copenhagen K, DK)
- Mikael HENDERSON (Copenhagen K, DK)
- Uffe ANDERSEN (Copenhagen K, DK)
- Thomas JØRGENSEN (Copenhagen K, DK)
CPC classification: G10H2240/085, G10H2250/311, G10H2250/225, G10H2240/081, G10H2240/141
Abstract
A method and system for determining a compact semantic representation of a digital audio signal using a computer-based system by calculating at least one low-level feature matrix from the digital audio signal; processing the low-level feature matrix or matrices using pre-trained machine learning engines including an ensemble of modules, wherein each module in the ensemble is trained to predict one of a plurality of high-level feature values; and concatenating the obtained plurality of high-level feature values into a descriptor vector. The calculated descriptor vectors can be used alone, or in an arbitrary or temporally ordered combination with further descriptor vectors calculated from different audio signals extracted from the same music track, as a compact semantic representation of the respective music track.
Claims
1-27. (canceled)
28. A method for determining a compact semantic representation of a digital audio signal using a computer-based system, the method comprising: providing a digital audio signal; calculating, using a digital signal processor module, a low-level feature matrix from the digital audio signal, the low-level feature matrix comprising numerical values corresponding to a low-level audio feature in a temporal sequence; calculating, using a general extractor module, a high-level feature matrix from the low-level feature matrix, the high-level feature matrix comprising numerical values corresponding to a high-level audio feature; calculating, using a feature-specific extractor module, a number n.sub.f of high-level feature vectors from the high-level feature matrix, each high-level feature vector comprising numerical values corresponding to a high-level audio feature; calculating, using a feature-specific regressor module, a number n.sub.f of high-level feature values from the number n.sub.f of high-level feature vectors; wherein each high-level feature value represents a musical or emotional characteristic of the digital audio signal; and calculating a descriptor vector by concatenating the number n.sub.f of high-level feature values.
29. The method according to claim 28, wherein the low-level feature matrix is a vertical concatenation of the Mel-spectrogram of the digital audio signal and its subsequent first and second derivatives, and the low-level feature matrix preferably comprises a number of rows ranging from 1 to 1000, more preferably 1 to 200, most preferably 102 rows; and a number of columns ranging from 1 to 5000, more preferably 1 to 1000, most preferably 612 columns.
30. The method according to claim 28, wherein the general extractor module uses a pre-trained Convolutional Neural Network, CNN, model, wherein the architecture of the CNN model comprises: an input block configured for normalizing the low-level feature matrix using a batch normalization layer; followed by four consecutive convolutional blocks; and an output layer.
31. The method according to claim 30, wherein each of the four consecutive convolutional blocks comprises: a 2-dimensional convolutional layer, a batch normalization layer, an Exponential Linear Unit, a 2-dimensional max pooling layer, and a dropout layer; and wherein the convolutional layer of the first convolutional block comprises 64 filters, while the convolutional layers of the further consecutive blocks comprise 128 filters.
32. The method according to claim 30, wherein the CNN model is pre-trained in isolation from the rest of the modules as a musical genre classifier model by: replacing the output layer with a recurrent layer and a decision layer in the architecture of the CNN model; providing a number n.sub.l of labeled digital audio signals, wherein each labeled digital audio signal comprises an associated ground truth musical genre; training the CNN model by using the labeled digital audio signals as input, and iterating over a number of N epochs; and after the training, replacing the recurrent layer and decision layer with an output layer in the architecture of the CNN model; wherein the number n.sub.l is 1≤n.sub.l≤100,000,000, more preferably 100,000≤n.sub.l≤10,000,000, more preferably 300,000≤n.sub.l≤400,000, most preferably n.sub.l=340,000; and wherein the number of training epochs is 1≤N≤1000, more preferably 1≤N≤100, most preferably N=40.
33. The method according to claim 32, wherein the recurrent layer comprises two Gated Recurrent Units, GRU, layers, and a dropout layer; and the decision layer comprises a fully connected layer.
34. The method according to claim 28, wherein the high-level feature matrix comprises a number of rows ranging from 1 to 1000, more preferably 1 to 100, most preferably 32 rows; and a number of columns ranging from 1 to 1000, more preferably 1 to 500, most preferably 128 columns.
35. The method according to claim 28, wherein the feature-specific extractor module uses an ensemble of a number n.sub.f of a pre-trained Recurrent Neural Network, RNN, models, wherein the architecture of the RNN models may differ from each other, and a preferred RNN model architecture comprises: two Gated Recurrent Units, GRU, layers, and a dropout layer.
36. The method according to claim 35, wherein each of the RNN models in the ensemble is pre-trained as a regressor to predict one target value from the number n.sub.f of high-level feature values by: providing an additional, fully connected layer of one unit in the architecture of the RNN model, providing a number of annotated digital audio signals, wherein each annotated digital audio signal comprises a number of annotations, the number of annotations comprising ground truth values X.sub.GT for high-level features of the respective annotated digital audio signal; training each RNN model to predict one target value X.sub.P from the high-level feature values by using the annotated digital audio signals as input, and iterating until the Mean Absolute Error, MAE, between the one predicted target value X.sub.P and the corresponding ground truth value X.sub.GT meets a predefined threshold T; and after the training, removing the fully connected layer from the architecture of the RNN model; wherein the total number n.sub.a of annotations is 1≤n.sub.a≤100,000, more preferably 50,000≤n.sub.a≤100,000, most preferably 70,000≤n.sub.a≤80,000.
37. The method according to claim 28, wherein the high-level feature vector is a 1-dimensional vector comprising a number of values ranging from 1 to 1024, more preferably from 1 to 512, most preferably comprising either 33, 128 or 256 values.
38. The method according to claim 28, wherein the feature-specific regressor module uses an ensemble of a number n.sub.f of pre-trained Gaussian Process Regressor, GPR, models, wherein: each GPR model is specifically configured to predict one target value from the number n.sub.f of high-level feature values, and each GPR model uses a rational quadratic kernel, wherein the kernel function k for points x.sub.i,x.sub.j is given by: k(x.sub.i,x.sub.j)=σ.sup.2(1+d(x.sub.i,x.sub.j).sup.2/(2αl.sup.2)).sup.−α, wherein d(x.sub.i,x.sub.j) denotes the Euclidean distance between the points.
39. The method according to claim 38, wherein each of the GPR models in the ensemble is pre-trained as a regressor to predict one target value from the number n.sub.f of high-level feature values by: providing a number of annotated digital audio signals, wherein each annotated digital audio signal comprises a number of annotations, the number of annotations comprising ground truth values for high-level features of the respective annotated digital audio signal; training each GPR model to predict one target value from the high-level feature values by using the annotated digital audio signals as input, and iterating until the Mean Absolute Error, MAE, between the one predicted target value and the corresponding ground truth value meets a predefined threshold; repeating the above steps by performing a hyper-parameter grid search on the parameters σ, α and l of the kernel by assigning each parameter a value from a predefined list of [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8], and using Mean Squared Error, MSE, as the evaluation metric, until the combination of three hyper-parameters that obtains the lowest MSE is identified; and keeping the model with the smallest error by comparing the MAE and MSE; wherein the total number n.sub.a of annotations is 1≤n.sub.a≤100,000, more preferably 50,000≤n.sub.a≤100,000, most preferably 70,000≤n.sub.a≤80,000.
40. The method according to claim 28, further comprising training a descriptor profiler engine, the descriptor profiler engine comprising the digital signal processor module, the general extractor module, the feature-specific extractor module, and the feature-specific regressor module, by: providing a number n.sub.aa of auto-annotated digital audio signals, wherein each auto-annotated digital audio signal comprises an associated descriptor vector comprising truth values for different musical or emotional characteristics of the digital audio signal; training the descriptor profiler engine by using the auto-annotated digital audio signals as input, and iterating the modules until the Mean Absolute Error, MAE, between calculated values of descriptor vectors and truth values of associated descriptor vectors meets a predefined threshold; and calculating, using the trained descriptor profiler engine, descriptor vectors for un-annotated digital audio signals with no associated descriptor vectors, wherein the number n.sub.aa is 1≤n.sub.aa≤100,000,000, more preferably 100,000≤n.sub.aa≤1,000,000, most preferably 500,000≤n.sub.aa≤600,000.
41. A method for determining a compact semantic representation of a digital audio signal using a computer-based system, the method comprising: providing a digital audio signal; calculating, using a low-level feature extractor module, from the digital audio signal, a Mel-spectrogram, and a Mel Frequency Cepstral Coefficients, MFCC, matrix; processing, using a low-level feature pre-processor module, the Mel-spectrogram and MFCC matrix, wherein the Mel-spectrogram is subjected separately to at least a Multi Auto Regression Analysis, MARA, process and a Dynamic Histogram, DH, process, and the MFCC matrix is subjected separately to at least an Auto Regression Analysis, ARA, process and a MARA process, wherein the output of each MARA process is a first order multivariate autoregression matrix, the output of each ARA process is a third order autoregression matrix, and the output of each DH process is a dynamic histogram matrix; and calculating, using an ensemble learning module, a number n.sub.f of high-level feature values by: feeding the output matrices from the low-level feature pre-processor module as a group parallelly into a number n.sub.f of ensemble learning blocks within the ensemble learning module, each ensemble learning block further comprising a number n.sub.GP of parallelly executed Gaussian Processes, GPs, wherein each of the GPs receives at least one of the output matrices and outputs a predicted high-level feature value, and picking, as the output of each ensemble learning block, the best candidate from the predicted high-level feature values, using statistical data, as one of the number n.sub.f of high-level feature values, wherein each high-level feature value represents a musical or emotional characteristic of the digital audio signal; and calculating a descriptor vector by concatenating the number n.sub.f of high-level feature values.
42. The method according to claim 41, wherein picking the best candidate from the predicted high-level feature values comprises: determining, using a predefined database of statistical probabilities regarding the ability of each GP to predict a certain high-level feature value, the GP within the ensemble learning block with the lowest probability to predict the respective high-level feature value, and discarding its output; and picking the predicted high-level feature value with a numerical value in the middle from within the remaining outputs.
43. The method according to claim 41, further comprising: training an auto-annotating engine, the auto-annotating engine comprising the low-level feature extractor module, the low-level feature pre-processor module, and the ensemble learning module; providing a number of annotated digital audio signals, wherein each annotated digital audio signal comprises a number of annotations, the number of annotations comprising ground truth values for high-level features of a respective annotated digital audio signal; training the auto-annotating engine by using the annotated digital audio signals as input and training the Gaussian Processes using ordinal regression; and calculating, using the trained auto-annotating engine, descriptor vectors for un-annotated digital audio signals, the descriptor vectors comprising predicted high-level features, wherein the total number n.sub.a of annotations is 1≤n.sub.a≤100,000, more preferably 50,000≤n.sub.a≤100,000, most preferably 70,000≤n.sub.a≤80,000.
44. The method according to claim 43, wherein providing the number n.sub.aa of auto-annotated digital audio signals comprises: calculating the associated descriptor vector using a method according to claim 43.
45. The method according to claim 44, further comprising: storing the descriptor vector in a database alone, or in an arbitrary or temporally ordered combination with one or more further descriptor vectors, as a compact semantic representation of a music track, wherein each of the descriptor vectors is calculated from a different audio signal extracted from the same music track.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0143] In the following detailed portion of the present disclosure, the aspects, embodiments and implementations will be explained in more detail with reference to the example embodiments shown in the drawings, in which:
[0144]-[0159] (figure descriptions, including the pre-training of a GPR model in accordance with a possible implementation form of the first aspect)
DETAILED DESCRIPTION
[0161] In the context of the present disclosure ‘semantic’ refers to the broader meaning of the term used in relation to data models in software engineering describing the meaning of instances. A semantic data model in this interpretation is an abstraction that defines how stored symbols (the instance data) relate to the real world, and includes the capability to express information that enables parties to the information exchange to interpret meaning (semantics) from the instances, without the need to know the meta-model itself.
[0162] Thus, the term ‘compact semantic representation’ refers to efficiently sized digital information (data in a database) that expresses relations to high-level concepts (meaning) in the real world (e.g. musical and emotional characteristics) and provides means to compare associated objects (digital audio signals or music tracks) without the need to know what high-level concept each piece of data exactly represents.
[0163] In an initial step 101, a digital audio signal 1 is provided.
[0164] In this context, ‘digital audio signal’ refers to any sound (e.g. music or speech) that has been recorded as or converted into digital form, where the sound wave (a continuous signal) is encoded as numerical samples in a continuous sequence (a discrete-time signal). The average number of samples encoded in one second is called the sampling frequency (or sampling rate).
[0165] In a preferred embodiment, the provided audio signal 1 is sampled at 22050 Hz and converted to mono by averaging the two channels of a stereo signal. However, it should be understood that any suitable sampling rate and channel conversion can be used for providing the digital audio signal 1. The digital audio signal 1 can be provided in the form of an e.g. audio file on a storage medium 22 of computer-based system 20.
[0166] In an embodiment, the duration L.sub.s of the digital audio signal 1 ranges from 1 s to 60 s, more preferably from 5 s to 30 s. In a preferred embodiment, the duration L.sub.s of the digital audio signal is 15 s.
[0167] In an embodiment, the digital audio signal 1 is a representative segment extracted from a music track 11.
[0168] In a next step 102, a low-level feature matrix 2 is calculated from the digital audio signal 1 using a digital signal processor module 12. The numerical values of the low-level feature matrix 2 correspond to values of certain low-level audio features, arranged in a temporal sequence according to the temporal information from the digital audio signal 1.
[0169] The object of this digital signal processing step 102 is to transform the input audio signal 1 into a new space of variables that simplifies further analysis and processing.
[0170] A ‘matrix’ in this context is meant to be interpreted in a broad sense, simply defining an entity comprising a plurality of values in a specific arrangement of rows and columns.
[0171] The term ‘low-level audio feature’ in this context refers to numerical values describing the contents of an audio signal on a signal level (as opposed to high-level features referring to an abstracted, symbolic level) and are determined according to different kinds of inspections such as temporal, spectral, etc. In particular, the temporal sequence of low-level audio features in this context may refer to a Mel-spectrogram, a Mel Frequency Cepstrum Coefficient (MFCC) vector, a Constant-Q transform, a Variable-Q transform, or a Short Time Fourier Transform (STFT). Further examples may include, but are not limited to, those of fast Fourier transforms (FFTs), digital Fourier transforms (DFTs), Modified Discrete Cosine Transforms (MDCTs), Modified Discrete Sine Transforms (MDSTs), Quadrature Mirror Filters (QMFs), Complex QMFs (CQMFs), discrete wavelet transforms (DWTs), or wavelet coefficients.
[0172] In an embodiment the low-level feature matrix 2 is a vertical concatenation of the Mel-spectrogram of the digital audio signal 1 and its subsequent first and second derivatives.
[0173] In a possible embodiment, the Mel-spectrogram is computed by extracting a number of Mel frequency bands from a Short-Time Fourier Transform of the digital audio signal 1 using a Hanning window of 1024 samples with 512 samples of overlap (50% of overlap). In possible embodiments the number of Mel bands ranges from 10 to 50, more preferably from 20 to 40, more preferably the number of used Mel bands is 34. In a possible embodiment, the formulation of the Mel-filters uses the HTK formula. In a possible embodiment, each of the bands of the Mel-spectrogram is divided by the number of filters in the band. Finally, the result is squared (raised to the power of two) and transformed to the decibel scale.
[0174] In possible embodiments the low-level feature matrix 2 comprises a number of rows ranging from 1 to 1000, more preferably 1 to 200, most preferably 102 rows; and a number of columns ranging from 1 to 5000, more preferably 1 to 1000, most preferably 612 columns.
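By way of illustration, the following Python sketch (assuming the librosa library, which the present disclosure does not name) computes such a low-level feature matrix: a 34-band HTK Mel-spectrogram from 1024-sample Hanning windows with 512 samples of overlap, squared and converted to the decibel scale, then stacked with its first and second derivatives to give 102 rows. The per-band division by the number of filters is omitted for brevity, and all function and parameter choices are illustrative assumptions rather than the disclosed implementation.

```python
# Hedged sketch of steps 101-102; librosa and the parameter choices below are
# assumptions for illustration, not the disclosed implementation.
import numpy as np
import librosa

def low_level_feature_matrix(path, sr=22050, n_mels=34):
    y, _ = librosa.load(path, sr=sr, mono=True)               # mono signal sampled at 22050 Hz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=512,               # Hanning window of 1024 samples, 50% overlap
        window="hann", n_mels=n_mels, htk=True, power=1.0)    # HTK formulation of the Mel filters
    mel_db = librosa.power_to_db(mel ** 2)                    # square, then transform to the decibel scale
    d1 = librosa.feature.delta(mel_db, order=1)               # first derivative
    d2 = librosa.feature.delta(mel_db, order=2)               # second derivative
    return np.vstack([mel_db, d1, d2])                        # vertical concatenation: 3 x 34 = 102 rows
```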
[0175] In a next step 103, a high-level feature matrix 3 is calculated from the low-level feature matrix 2 using a general extractor module 13. The numerical values of the high-level feature matrix 3 each correspond to a high-level audio feature. This component maps the input data from the low-level matrix space into a latent-space. The dimensions of the latent-space are lower, which is useful for classification or regression tasks, such as identifying the mood or the genre of a digital audio signal 1.
[0177] As explained above the term ‘low-level audio feature’ in the present disclosure refers to numerical values describing the contents of an audio signal on a signal level and are determined according to different kinds of inspections (such as temporal, spectral, etc.). The term ‘high-level audio feature’ in contrast refers to numerical values on an abstracted, symbolic level determined based on numerical values of low-level audio features.
[0178] In possible embodiments the high-level feature matrix 3 comprises a number of rows ranging from 1 to 1000, more preferably 1 to 100, most preferably 32 rows; and a number of columns ranging from 1 to 1000, more preferably 1 to 500, most preferably 128 columns.
[0179] In a next step 104, a number n.sub.f of high-level feature vectors 4 are calculated from the high-level feature matrix 3 using a feature-specific extractor module 14. The numerical values in the high-level feature vectors 4 each correspond to a high-level audio feature.
[0180] A ‘vector’ in this context is meant to be interpreted in a broad sense, simply defining an entity comprising a plurality of values in a specific order or arrangement.
[0181] In an embodiment the number of high-level feature vectors 4 is between 1≤n.sub.f≤256, more preferably between 10≤n.sub.f≤50. In a preferred embodiment the number of high-level feature vectors 4 is n.sub.f=34.
[0182] In further possible embodiments, the high-level feature vector 4 is a 1-dimensional vector comprising a number of values ranging from 1 to 1024, more preferably from 1 to 512. In most preferred embodiments the high-level feature vector 4 is a 1-dimensional vector comprising either 33, 128 or 256 values.
[0183] In a next step 105, a number n.sub.f of high-level feature values 5 are calculated from the number n.sub.f of high-level feature vectors 4 using a feature-specific regressor module 15, wherein each high-level feature value 5 represents a musical or emotional characteristic of the digital audio signal 1.
[0184] According to the possible embodiments mentioned above, in some embodiments the number n.sub.f of high-level feature values 5 ranges between 1≤n.sub.f≤256, more preferably between 10≤n.sub.f≤50. In a preferred embodiment the number of high-level feature values 5 is n.sub.f=34.
[0185] In possible embodiments a high-level feature value 5 may represent a perceived musical characteristic corresponding to the musical style, musical genre, musical sub-genre, rhythm, tempo, vocals, or instrumentation of the respective digital audio signal 1; or a perceived emotional characteristic corresponding to the mood of the respective digital audio signal 1.
[0186] In a possible embodiment, the high-level feature values 5 correspond to a number of moods (such as ‘Angry’, ‘Joy’, or ‘Sad’), a number of musical genres (such as ‘Jazz’, ‘Folk’, or ‘Pop’), and a number of stylistic features (such as ‘Beat Type’, ‘Sound Texture’, or ‘Prominent Instrument’).
[0187] In a possible embodiment each high-level feature value 5 can take a discrete numerical value between a minimum value v.sub.min and a maximum value v.sub.max, wherein v.sub.min represents an absence and v.sub.max represents a maximum intensity of the musical or emotional characteristic in the digital audio signal 1.
[0188] In possible embodiments the minimum discrete numerical value is v.sub.min=1, and the maximum discrete numerical value can range between 1<v.sub.max≤100, more preferably 5≤v.sub.max≤10, more preferably the maximum discrete numerical value is v.sub.max=7.
[0189] In a next step 106, a descriptor vector 6 is calculated by concatenating the number n.sub.f of high-level feature values 5.
[0190] Similarly as above, a ‘descriptor vector’ in this context is meant to be interpreted in a broad sense, simply defining an entity comprising a plurality of high-level feature values in a specific order or arrangement that represent the digital audio signal 1.
[0192] In this embodiment, the general extractor module 13 is a pre-trained Convolutional Neural Network (CNN) 17, wherein the architecture of the CNN 17 comprises:
[0193] an input block 171 configured for normalizing the low-level feature matrix 2 using a batch normalization layer; followed by
[0194] four consecutive convolutional blocks 172; and
[0195] an output layer 173.
[0196] In an embodiment each of the four consecutive convolutional blocks 172 comprises
[0197] a 2-dimensional convolutional layer 1721,
[0198] a batch normalization layer 1722,
[0199] an Exponential Linear Unit (ELU) 1723 as the activation function,
[0200] a 2-dimensional max pooling layer 1724, and
[0201] a dropout layer 1725.
[0202] In a possible embodiment, the convolutional layer 1721 of the first convolutional block comprises 64 filters, while the convolutional layers 1721 of the further consecutive blocks comprise 128 filters. In possible embodiments, the size of each filter is between 2×2 and 10×10, preferably 3×3. In further possible embodiments, the dropout layers have a rate for removing units between 0.1 and 0.5, more preferably a rate of 0.1.
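A minimal sketch of this general extractor architecture, assuming TensorFlow/Keras (the disclosure does not specify a framework), is given below; the 64/128 filter counts, 3×3 kernels, ELU activations and 0.1 dropout rate follow the preferred embodiment, while the pooling sizes and input shape are illustrative assumptions.

```python
# Hedged Keras sketch of the general extractor CNN of paragraphs [0192]-[0202].
import tensorflow as tf
from tensorflow.keras import layers, models

def build_general_extractor(input_shape=(102, 612, 1)):
    inputs = tf.keras.Input(shape=input_shape)                 # low-level feature matrix 2
    x = layers.BatchNormalization()(inputs)                    # input block 171
    for n_filters in (64, 128, 128, 128):                      # four consecutive convolutional blocks 172
        x = layers.Conv2D(n_filters, (3, 3), padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("elu")(x)                        # Exponential Linear Unit 1723
        x = layers.MaxPooling2D(pool_size=(2, 2))(x)           # pooling size is an assumption
        x = layers.Dropout(0.1)(x)
    outputs = x                                                # output layer 173: high-level feature matrix 3
    return models.Model(inputs, outputs)
```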
[0204] In an initial step the output layer 173 is replaced with a recurrent layer 174 and a decision layer 175 in the architecture of the CNN model 17.
[0205] In a possible embodiment the recurrent layer 174 comprises two Gated Recurrent Units (GRU) layers 1741, and a dropout layer 1742.
[0206] In a further possible embodiment, the decision layer 175 comprises a fully connected layer 1751.
[0207] In a next step a number n.sub.l of labeled digital audio signals 9 are provided, each labeled digital audio signal 9 comprising an associated ground truth musical genre, or simply ‘label’ (indicated on the figure as ‘LABEL’).
[0208] In a next step the CNN model 17 is trained by using the labeled digital audio signals 9 as input and iterating over a number of N epochs.
[0209] In the final step, after the training, the recurrent layer 174 and decision layer 175 are replaced back with an output layer 173 in the architecture of the CNN model 17.
[0210] In possible embodiments the number n.sub.l of labeled digital audio signals 9 is between 1≤n.sub.l≤100,000,000, more preferably between 100,000≤n.sub.l≤10,000,000, more preferably between 300,000≤n.sub.l≤400,000. In a most preferred embodiment, the number of labeled digital audio signals 9 is n.sub.l=340,000.
[0211] In further possible embodiments the number of training epochs is between 1≤N≤1000, more preferably between 1≤N≤100. In a most preferred embodiment, the number of training epochs is N=40.
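Continuing the sketch above, the pre-training configuration of this embodiment might look as follows in Keras; the reshaping of the CNN output into a sequence, the GRU unit count and the number of genre classes are assumptions not specified in this passage.

```python
# Hedged sketch of the genre-classification pre-training head of paragraphs [0204]-[0211]:
# the output layer 173 is replaced by a recurrent layer 174 (two GRU layers and dropout)
# and a decision layer 175 (fully connected layer).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_genre_classifier(cnn, n_genres=20, gru_units=128):   # n_genres and gru_units are assumptions
    x = cnn.output
    x = layers.Reshape((-1, x.shape[-1]))(x)                   # flatten spatial axes into a sequence (assumption)
    x = layers.GRU(gru_units, return_sequences=True)(x)        # recurrent layer 174: first GRU layer
    x = layers.GRU(gru_units)(x)                               # second GRU layer
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(n_genres, activation="softmax")(x)  # decision layer 175: fully connected layer
    model = models.Model(cnn.input, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# classifier = build_genre_classifier(build_general_extractor())
# classifier.fit(labeled_signals, genre_labels, epochs=40)     # N = 40 epochs (preferred embodiment)
```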
[0213] In the same way as the general extractor module 13, the feature-specific extractor module 14 maps the input data (the high-level feature matrix 3) to a latent-space, but in this case, the latent-space is specific to one of the values of the descriptor vector 6. There are a number n.sub.f of pre-trained RNN models 18, one for each value of the descriptor vector 6, which is why this component is referred to as an ensemble. Each model in the ensemble is based on Recurrent Neural Networks, and while the input for all of the models 18 is the high-level feature matrix 3, the subsequent architectures of the RNN models 18 may differ from each other.
[0214] In a preferred embodiment, the RNN model 18 architecture comprises two GRU layers 181, and a dropout layer 182. In a possible embodiment, the GRU layers 181 comprise a number of units between 1 and 100, most preferably 33 units. In further possible embodiments, the dropout layer 182 has a rate for removing units between 0.1 and 0.9, more preferably a rate of 0.5.
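A minimal sketch of one such RNN model, again assuming Keras, is shown below; the 33 GRU units, 0.5 dropout rate and 32×128 input follow the preferred embodiments, and everything else is an illustrative assumption.

```python
# Hedged sketch of one RNN model 18 of the feature-specific extractor ([0213]-[0214]).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_feature_specific_extractor(n_frames=32, n_features=128, units=33):
    inputs = tf.keras.Input(shape=(n_frames, n_features))      # high-level feature matrix 3
    x = layers.GRU(units, return_sequences=True)(inputs)       # first GRU layer 181
    x = layers.GRU(units)(x)                                   # second GRU layer 181
    x = layers.Dropout(0.5)(x)                                 # dropout layer 182
    return models.Model(inputs, x)                             # output: high-level feature vector 4 (33 values)
```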
[0216] In an initial step, an additional, fully connected layer 183 of one unit in the architecture of the RNN model 18 is provided.
[0217] In a next step, a number of annotated digital audio signals 7 is provided, wherein each annotated digital audio signal 7 comprises a number of annotations A, the number of annotations comprising ground truth values X.sub.GT for high-level features of the respective annotated digital audio signal 7. The annotations may further comprise a starting point in seconds referring to the original digital audio signal 1 or a music track 11 that the digital audio signal 1 was extracted from.
[0218] In a next step, each RNN model 18 is trained to predict one target value X.sub.P from the high-level feature values 5 by using the annotated digital audio signals 7 as input, and iterating until the Mean Absolute Error (MAE) between the one predicted target value X.sub.P and the corresponding ground truth value X.sub.GT meets a predefined threshold T.
[0219] In the final step, after the training, the fully connected layer 183 is removed from the architecture of the RNN model 18.
[0220] In possible embodiments, the total number n.sub.a of annotations is between 1≤n.sub.a≤100,000, more preferably between 50,000≤n.sub.a≤100,000, most preferably between 70,000≤n.sub.a≤80,000.
[0222] Each GPR model 19 in the ensemble is specifically configured to predict one target value from the number n.sub.f of high-level feature values 5.
[0223] In an embodiment, each GPR model 19 uses a rational quadratic kernel, wherein the kernel function k for points x.sub.i,x.sub.j is given by the formula (also shown in the drawings): k(x.sub.i,x.sub.j)=σ.sup.2(1+d(x.sub.i,x.sub.j).sup.2/(2αl.sup.2)).sup.−α
[0224] wherein d(x.sub.i,x.sub.j) denotes the Euclidean distance between the points x.sub.i and x.sub.j, and
[0225] {σ,α,l}∈[0.0,0.2,0.4,0.6,0.8,1.0,1.2,1.4,1.6,1.8]
[0226] In an embodiment, the implementation for the GPR uses the python module ‘scikit-learn’.
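A hedged sketch of one GPR model built with scikit-learn (the module named in the disclosure) follows; treating σ as a constant-kernel factor multiplying the rational quadratic kernel is an assumption about the exact formulation.

```python
# Hedged sketch of one GPR model 19 of paragraphs [0222]-[0226] using scikit-learn.
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RationalQuadratic

def build_gpr(sigma=1.0, alpha=0.6, length_scale=0.8):
    # rational quadratic kernel with hyperparameters sigma, alpha and l (length_scale)
    kernel = ConstantKernel(sigma ** 2) * RationalQuadratic(length_scale=length_scale, alpha=alpha)
    return GaussianProcessRegressor(kernel=kernel)

# X: high-level feature vectors 4 (one row per annotated signal), y: ground truth values X_GT
# gpr = build_gpr().fit(X, y); x_p = gpr.predict(X_new)        # predicted high-level feature value
```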
[0228] In an initial step 1091, a number of annotated digital audio signals 7 are provided, wherein each annotated digital audio signal 7 comprises a number of annotations, the number of annotations comprising ground truth values for high-level features of the respective annotated digital audio signal 7.
[0229] In a next step 1092, each GPR model 19 is trained 19 to predict one target value from the high-level feature values 5 by using the annotated digital audio signals 7 as input, and iterating until the Mean Absolute Error (MAE) between the one predicted target value and the corresponding ground truth value meets a predefined threshold.
[0230] In a next step 1093, the above steps are repeated by performing a hyperparameter grid search on the parameters σ, α and l of the kernel by assigning each parameter a value from a predefined list of values:
[0231] [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8].
[0232] Parameters which define a model architecture are referred to as ‘hyperparameters’ and thus this process of searching for the ideal model architecture is referred to as ‘hyperparameter grid search’. A grid search will go through a manually specified subset of the values for each hyperparameter with the goal to determine what are the values for these hyperparameters that provide the best model.
[0233] In the case of a GPR model 19, the hyperparameters are sigma, alpha, and l, and the hyperparameter search comprises assigning to each of the hyperparameters one of the values in the list [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8], training the GPR model 19, and evaluating it.
[0234] An example of few iterations of the search is the following:
[0235] [Iteration 1]
[0236] step 1—assign values to each hyperparameter:
[0237] sigma=0.0, alpha=0.0, l=0.0
[0238] step 2—train GPR model
[0239] step 3—evaluate model
[0240] [Iteration 2]
[0241] step 1—assign values to each hyperparameter:
[0242] sigma=0.2, alpha=0.0, l=0.0
[0243] step 2—train GPR model
[0244] step 3—evaluate model
[0245] [Iteration 3]
[0246] step 1—assign values to each hyperparameter:
[0247] sigma=0.4, alpha=0.0, l=0.0
[0248] step 2—train GPR model
[0249] step 3—evaluate model
[0250] [ . . . ]
[0251] [Iteration 1000]
[0252] step 1—assign values to each hyperparameter:
[0253] sigma=1.8, alpha=1.8, l=1.8
[0254] step 2—train GPR model
[0255] step 3—evaluate model
[0256] For this step, the evaluation metric used is the Mean Squared Error, MSE. Each set of values for the hyperparameters yields a different MSE. The outcome of the hyperparameter grid search is finding the combination of the three hyperparameters that obtains the lowest MSE, and the training is carried out until this combination is identified.
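The grid search can be sketched as follows; the train/validation split, the reuse of scikit-learn, and the skipping of degenerate zero-valued kernel parameters are assumptions made for illustration.

```python
# Sketch of the hyperparameter grid search of paragraphs [0230]-[0256]: every (sigma, alpha, l)
# combination from the predefined list is tried and scored by Mean Squared Error.
from itertools import product
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RationalQuadratic
from sklearn.metrics import mean_squared_error

GRID = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8]

def grid_search(X_train, y_train, X_val, y_val):
    best = None
    for sigma, alpha, l in product(GRID, repeat=3):            # up to 10**3 = 1000 combinations
        if 0.0 in (sigma, alpha, l):
            continue                                           # zero values make the kernel degenerate (assumption)
        kernel = ConstantKernel(sigma ** 2) * RationalQuadratic(length_scale=l, alpha=alpha)
        gpr = GaussianProcessRegressor(kernel=kernel).fit(X_train, y_train)
        mse = mean_squared_error(y_val, gpr.predict(X_val))
        if best is None or mse < best[0]:
            best = (mse, {"sigma": sigma, "alpha": alpha, "l": l})
    return best                                                # lowest MSE and its hyperparameter combination
```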
[0257] In a next step 1094, the obtained smallest MAE and MSE values from the above steps are compared, and the model with the smallest error is identified and used as the pre-trained GPR model.
[0258] In possible embodiments, the total number n.sub.a of annotations is between 1≤n.sub.a≤100,000, more preferably between 50,000≤n.sub.a≤100,000, most preferably between 70,000≤n.sub.a≤80,000.
[0260] The descriptor profiler engine 16 comprises a digital signal processor module 12, a general extractor module 13, a feature-specific extractor module 14, and a feature-specific regressor module 15 according to any of the possible embodiments described above.
[0261] In an initial step 1101, a number n.sub.aa of auto-annotated digital audio signals 8 are provided, wherein each auto-annotated digital audio signal 8 comprises an associated descriptor vector 6A comprising truth values for different musical or emotional characteristics of the digital audio signal 1.
[0262] In a next step 1102, the descriptor profiler engine 16 is trained by using the auto-annotated digital audio signals 8 as input and iterating the parameters of the modules 12 to 15 until the MAE between calculated values of descriptor vectors 6 and truth values of associated descriptor vectors 6A meets a predefined threshold. This training step 1102 results in a trained descriptor profiler engine 16T.
[0263] In a possible embodiment, the trained descriptor profiler engine 16T is validated in a further step, using the set of annotated digital audio signals 7 as described above, wherein each annotated digital audio signal 7 comprises a number of annotations, the number of annotations comprising ground truth values for high-level features of the respective annotated digital audio signal 7, and wherein the total number n.sub.a of annotations is between 1≤n.sub.a≤100,000, more preferably between 50,000≤n.sub.a≤100,000, most preferably between 70,000≤n.sub.a≤80,000.
[0264] In a final step 1103, descriptor vectors 6 are calculated, using the trained descriptor profiler engine 16T, for un-annotated digital audio signals 10 which have no descriptor vectors 6A associated therewith.
[0265] In possible embodiments, the number n.sub.aa of auto-annotated digital audio signals 8 is between 1≤n.sub.aa≤100,000,000, more preferably 100,000≤n.sub.aa≤1,000,000. In most preferred embodiments, the number n.sub.aa of auto-annotated digital audio signals 8 is between 500,000≤n.sub.aa≤600,000.
[0266] In a possible embodiment, the training of the descriptor profiler engine 16 is an iterative optimization process, wherein in each iteration, the general extractor module 13 and the feature-specific extractor module 14 are trained with the auto-annotated digital audio signals 8, and the feature-specific regressor module 15 is trained, as described above, using the annotated digital audio signals 7. In a final step, the descriptor profiler engine 16 updates the annotations of the auto-annotated data set. This process is repeated until there is no improvement on the evaluation. Thus, in the first iteration, the auto-annotations come from the number n.sub.aa of auto-annotated digital audio signals 8, but in the following iterations, it is the descriptor profiler engine 16 that creates them.
[0267] In an embodiment, the trained descriptor profiler engine 16T is evaluated by computing the MAE between manual annotations and the predictions (high-level feature values 5 of calculated descriptor vectors 6) of the trained descriptor profiler engine 16T. A small MAE suggests the model is correct in predicting, while a large MAE suggests that the predictions are not accurate.
[0269] In an initial step 201, a digital audio signal 1 is provided. Similarly as above, the duration L.sub.s of the digital audio signal 1 may range from 1 s to 60 s, more preferably from 5 s to 30 s. In a preferred embodiment, the duration L.sub.s of the digital audio signal is 15 s.
[0270] In an embodiment, the digital audio signal 1 is a representative segment extracted from a music track 11, wherein ‘music track’ refers to any piece of music, either a song or an instrumental music piece, created (composed) by either a human or a machine. In this context, duration L.sub.s of the digital audio signal can be any duration that is shorter than the duration of the music track 11 itself and can be determined by taking into account factors such as copyright limitations, or the most efficient use of computing power.
[0271] In a next step 202, a Mel-spectrogram 2A, and a Mel Frequency Cepstral Coefficients (MFCC) matrix 2B is calculated from the digital audio signal 1 using a low-level feature extractor module 23.
[0272] Mel Frequency Cepstral Coefficients (MFCCs) are used in digital signal processing as a compact representation of the spectral envelope of a digital audio signal and provide a good description of the timbre of a digital audio signal 1. This step 202 can comprise further sub-steps. In an implementation, a lowpass filter is applied to the digital audio signal 1 before calculating the linear frequency spectrogram, preferably followed by downsampling the digital audio signal 1 to a single channel (mono) signal using a sample rate of 22050 Hz.
[0273] In a possible embodiment, the Mel-spectrogram and the MFCCs are computed by extracting a number of Mel frequency bands from a Short-Time Fourier Transform of the digital audio signal 1 using a Hanning window of 1024 samples with 512 samples of overlap (50% of overlap). In possible embodiments the number of Mel bands ranges from 10 to 50, more preferably from 20 to 40, more preferably the number of used Mel bands is 34. In a possible embodiment, the formulation of the Mel-filters uses the HTK formula. In a possible embodiment, each of the bands of the Mel-spectrogram is divided by the number of filters in the band.
[0274] This step accounts for the non-linear frequency perception of the human auditory system while reducing the number of spectral values to a fewer number of Mel bands. Further reduction of the number of bands can be achieved by applying a non-linear companding function, such that higher Mel-bands are mapped into single bands under the assumption that most of the rhythm information in the music signal is located in lower frequency regions. In a possible embodiment, the MFCCs are calculated by applying a cosine transformation on the Mel spectrogram. The MFCCs can then be concatenated into an MFCC matrix 2B.
[0275] In possible embodiments, the size of the Mel-spectrogram 2A, and the MFCC matrix 2B ranges between the dimensions of 1 to 100 rows and 1 to 1000 columns, with a preferred size of 34×612.
[0276] In a next step 203, the Mel-spectrogram 2A and MFCC matrix 2B are processed using a low-level feature pre-processor module 24. The Mel-spectrogram 2A is subjected separately to at least a Multi Auto Regression Analysis (MARA) process and a Dynamic Histogram (DH) process. The MFCC matrix 2B is subjected separately to at least an Auto Regression Analysis (ARA) process and a MARA process. The output of each MARA process is a first order multivariate autoregression matrix (with a preferred size of 34×34), the output of each ARA process is a third order autoregression matrix (with a preferred size of 34×4), and the output of each DH process is a dynamic histogram matrix (with a preferred size of 17×12), thus resulting in altogether at least 4 matrices (two first order multivariate autoregression matrices, a dynamic histogram matrix, and a third order autoregression matrix).
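As an illustration, a first order multivariate autoregression matrix can be estimated as sketched below; the plain least-squares estimator is an assumption, since the disclosure does not specify the fitting procedure.

```python
# Hedged sketch of the first order Multi Auto Regression Analysis (MARA) of paragraph [0276]:
# a coefficient matrix A is fitted so that each frame is predicted from the previous frame,
# giving a 34 x 34 matrix for a 34-band input.
import numpy as np

def mara_matrix(features):
    """features: rows are Mel bands or MFCCs (e.g. 34), columns are time frames (e.g. 612)."""
    past, present = features[:, :-1], features[:, 1:]
    coeffs, *_ = np.linalg.lstsq(past.T, present.T, rcond=None)  # solve present ≈ A @ past
    return coeffs.T                                              # first order multivariate autoregression matrix

# mel_mara = mara_matrix(mel_spectrogram)      # preferred size 34 x 34
# mfcc_mara = mara_matrix(mfcc_matrix)
```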
[0277] In a next step 204, a number n.sub.f of high-level feature values 5 are calculated using an ensemble learning module 25. In an embodiment, the ensemble learning module 25 comprises a number n.sub.f of ensemble learning blocks 25A, each ensemble learning block 25A further comprising a number n.sub.GP of parallelly executed Gaussian Processes (GPs), wherein each of the learning blocks 25A is configured to predict as an output one specific high-level feature value 5.
[0278] In possible embodiments, the number of parallelly executed GPs is between 1<n.sub.GP≤10, more preferably 1<n.sub.GP≤5.
[0279] In a most preferred embodiment, the number of parallelly executed GPs is n.sub.GP=4.
[0280] In further possible embodiments, the number n.sub.f of high-level feature values 5 and ensemble learning blocks 25A is between 1≤n.sub.f≤256, more preferably between 10≤n.sub.f≤50. In a preferred embodiment n.sub.f=34.
[0281] Within the step of calculating 204 high-level feature values 5, in a first step 2041 the output matrices from the low-level feature pre-processor module 24 are fed as a group parallelly into all of the ensemble learning blocks 25A within the ensemble learning module 25.
[0282] In an embodiment, the at least 4 output matrices from the low-level feature pre-processor module 24 are fed into the ensemble learning blocks 25A so that each of the GPs within the ensemble learning block 25A receives at least one of the output matrices.
[0283] In a possible embodiment, the output matrices are fed into the ensemble learning blocks 25A so that each of the GPs within the ensemble learning block 25A receives exactly one of the output matrices.
[0284] In a preferred embodiment, 4 output matrices (two first order multivariate autoregression matrices, a dynamic histogram matrix, and a third order autoregression matrix) are fed into a number n.sub.f of ensemble learning blocks 25A so that each of the 4 GPs within one ensemble learning block 25A receives exactly one of the output matrices.
[0285] After processing the output matrices from the low-level feature pre-processor module 24, each GP outputs a predicted high-level feature value (X.sub.p) 5A.
[0286] In a next step 2042, the best candidate from the predicted high-level feature values 5A is selected as the output high-level feature value 5 of each ensemble learning block 25A. The selection is automatic and based on statistical data of predicting probabilities of the different GPs regarding a certain high-level feature value 5 that the respective ensemble learning block 25A is expected to predict.
[0287] In a final step 205 a descriptor vector 6 is calculated by concatenating the number n.sub.f of high-level feature values 5 obtained as the output of the number n.sub.f of ensemble learning blocks 25A within the ensemble learning module 25.
[0288] In an exemplary embodiment illustrated in the drawings, the step 2042 of picking the best candidate from the predicted high-level feature values 5A comprises:
[0289] determining 2043 the GP within the ensemble learning block 25A with the lowest probability to predict the high-level feature value 5 that the respective ensemble learning block 25A is expected to predict, using a predefined database of statistical probabilities regarding the ability of each GP to predict a certain high-level feature value 5.
[0290] For each high-level feature value 5 of the descriptor vector 6, and for each GP component, it has been rated how confident a specific combination is to predict the correct high-level feature value 5. All the information is available in a database. As an example, for the GP1 component, when ‘Blues’ is predicted with the value ‘5’, the confidence of that prediction is 0.85, meaning that the correct prediction is achieved 85% of the time.
[0291] The output of the identified GP with the lowest correct prediction probability is then discarded and, from the remaining outputs 5A, the output 5A with a median numerical value is picked 2044 as the high-level feature value 5 predicted by the ensemble learning block 25A.
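This selection rule might be sketched as follows; the flat lists of predictions and confidences are an assumed data layout for the database of statistical probabilities.

```python
# Sketch of the candidate selection of paragraphs [0289]-[0291]: the GP with the lowest
# probability of a correct prediction is discarded, and the median of the remaining
# predicted values is returned as the high-level feature value 5.
import statistics

def pick_best_candidate(predictions, confidences):
    """predictions: predicted high-level feature values 5A, one per GP;
    confidences: matching probabilities of a correct prediction (e.g. 0.85 for GP1 / 'Blues' = 5)."""
    worst = confidences.index(min(confidences))                 # GP least likely to predict correctly
    remaining = [p for i, p in enumerate(predictions) if i != worst]
    return statistics.median_low(remaining)                     # value in the middle of the remaining outputs
```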
[0293] The auto-annotating engine 26 comprises a low-level feature extractor module 23, a low-level feature pre-processor module 24, and an ensemble learning module 25 according to any of the possible embodiments described above.
[0294] In an initial step 2061, a number of annotated digital audio signals 7 are provided, wherein each annotated digital audio signal 7 comprises a number of annotations, the number of annotations comprising ground truth values for high-level features of the respective annotated digital audio signal 7.
[0295] In a next step 2062, the auto-annotating engine 26 is trained by training the Gaussian Processes using ordinal regression, using the annotated digital audio signals 7 as input. This training step 2062 results in a trained auto-annotating engine 26T.
[0296] In a final step 2063, descriptor vectors 6 comprising predicted high-level features are calculated, using the trained auto-annotating engine 26T, for un-annotated digital audio signals 10.
[0297] In possible embodiments, the total number n.sub.a of annotations is between 1≤n.sub.a≤100,000, more preferably between 50,000≤n.sub.a≤100,000, most preferably between 70,000≤n.sub.a≤80,000.
[0298] In regression tasks, it is common to report metrics such as the Mean Squared Error, MSE, and the coefficient of determination, R.sup.2. As with the MAE, the lower the MSE score, the better. For R.sup.2, in contrast, higher scores are better, and the best possible score is 1.0.
[0299] Table 1 reports the testing results for these three metrics for both the Auto-annotating Engine (AAE) 26 and the Descriptor Profiler Engine (DPE) 16 in accordance with the present disclosure.
TABLE 1

  Metric     AAE            DPE
  MAE        1.08 ± 0.00    0.88 ± 0.19
  MSE        2.45 ± 0.00    1.69 ± 0.59
  R.sup.2    0.44 ± 0.00    0.57 ± 0.14
[0301] In this particular embodiment, which connects the two models of the auto-annotating engine 26 and the descriptor profiler engine 16, the associated descriptor vectors 6A, during the step 1101 of providing the number n.sub.aa of auto-annotated digital audio signals 8 for the descriptor profiler engine 16, are calculated using a trained auto-annotating engine 26T according to any of the embodiments described above.
[0303] The computer-based system 20 may include a processor 21, a storage device 22, a memory 27, a communications interface 28, an internal bus 29, an input interface 30, and an output interface 31, and other components not shown explicitly in the drawings.
[0304] In some embodiments the computer-based system 20 includes a digital signal processor (DSP) module 12 configured to calculate a low-level feature matrix 2 from a digital audio signal 1; a general extractor (GE) module 13 configured to calculate a high-level feature matrix 3 from a low-level feature matrix 2; a feature-specific extractor (FSE) module 14 configured to calculate high-level feature vectors 4 from a high-level feature matrix 3; a feature-specific regressor (FSR) module 15 configured to calculate high-level feature values 5 from high-level feature vectors 4; and optionally, a descriptor profiler engine 16 comprising a DSP module 12, a GE module 13, an FSE module, and an FSR module in accordance with the present disclosure.
[0305] In some embodiments the computer-based system 20 further includes a low-level feature extractor module (LLFE) 23 configured to process a digital audio signal 1 and extract therefrom a Mel-spectrogram 2A and/or an MFCC matrix 2B; a low-level feature pre-processor (LLFPP) module 24 configured to process a Mel-spectrogram 2A and/or an MFCC matrix 2B; an ensemble learning (EL) module 25 comprising ensemble learning blocks 25A configured to calculate one or more high-level feature values 5 from the output data from the LLFPP module 24; and optionally an auto-annotating engine 26 comprising an LLFE module 23, an LLFPP module 24, and an EL module 25 in accordance with the present disclosure.
[0306] While only one of each component is illustrated, the computer-based system 20 can include more than one of some or all of the components.
[0307] A processor 21 may control the operation and various functions of the computer-based system 20. As described in detail above, the processor 21 can be configured to control the components of the computer-based system 20 to execute a method for determining a compact semantic representation of a digital audio signal 1 in accordance with the present disclosure. The processor 21 can include any components, circuitry, or logic operative to drive the functionality of the computer-based system 20. For example, the processor 21 can include one or more processors acting under the control of an application.
[0308] A storage device 22 may store information and instructions to be executed by the processor 21. The storage device 22 can be any suitable type of storage medium offering permanent or semi-permanent memory. For example, the storage device 22 can include one or more storage mediums, including for example, a hard drive, Flash, or other EPROM or EEPROM.
[0309] In some embodiments, instructions (optionally in the form of an executed application) can be stored in a memory 27. The memory 27 can include cache memory, flash memory, read only memory, random access memory, or any other suitable type of memory. In some embodiments, the memory 27 can be dedicated specifically to storing firmware for a processor 21. For example, the memory 27 can store firmware for device applications.
[0310] An internal bus 29 may provide a data transfer path for transferring data to, from, or between a storage device 22, a processor 21, a memory 27, a communications interface 28, and some or all of the other components of the computer-based system 20.
[0311] A communications interface 28 enables the computer-based system 20 to communicate with other computer-based systems, or enables devices of the computer-based system (such as a client and server) to communicate with each other, either directly or via a computer network 34. For example, communications interface 28 can include Wi-Fi enabling circuitry that permits wireless communication according to one of the 802.11 standards or a private network.
[0312] Other wired or wireless protocol standards, such as Bluetooth, can be used in addition or instead.
[0313] An input interface 30 and output interface 31 can provide a user interface for a user 33 to interact with the computer-based system 20.
[0314] An input interface 30 may enable a user to provide input and feedback to the computer-based system 20. The input interface 30 can take any of a variety of forms, such as one or more of a button, keypad, keyboard, mouse, dial, click wheel, touch screen, or accelerometer.
[0315] An output interface 31 can provide an interface by which the computer-based system 20 can provide visual or audio output to a user 33 via e.g. an audio interface or a display screen. The audio interface can include any type of speaker, such as computer speakers or headphones, and a display screen can include, for example, a liquid crystal display, a touchscreen display, or any other type of display.
[0316] The computer-based system 20 may comprise a client device or a server, or both a client device and a server in data communication.
[0317] The client device may be a portable media player, a cellular telephone, pocket-sized personal computer, a personal digital assistant (PDA), a smartphone, a desktop computer, a laptop computer, and any other device capable of communicating via wires or wirelessly (with or without the aid of a wireless enabling accessory device).
[0318] The server may include any suitable types of servers that are configured to store and provide data to a client device (e.g., file server, database server, web server, or media server). The server can store media and other data (e.g., digital audio signals 1 of music tracks 11, and any type of associated information such as metadata or descriptor vectors 6), and the server can receive data download requests from the client device.
[0319] The server can communicate with the client device over a communications link which can include any suitable wired or wireless communications link, or combinations thereof, by which data may be exchanged between server and client. For example, the communications link can include a satellite link, a fiber-optic link, a cable link, an Internet link, or any other suitable wired or wireless link. The communications link is in an embodiment configured to enable data transmission using any suitable communications protocol supported by the medium of the communications link. Such communications protocols may include, for example, Wi-Fi (e.g., a 802.11 protocol), Ethernet, Bluetooth (registered trademark), radio frequency systems (e.g., 900 MHz, 2.4 GHz, and 5.6 GHz communication systems), infrared, TCP/IP (e.g., and the protocols used in each of the TCP/IP layers), HTTP, BitTorrent, FTP, RTP, RTSP, SSH, any other communications protocol, or any combination thereof.
[0320] There may further be provided a database on the server, configured to store a plurality of digital audio signals 1 and/or associated metadata or descriptor vectors 6 as explained above, whereby the database may be part of, or in data communication with, the client device and/or the server device. The database can also be a separate entity in data communication with the client device.
[0322] In a first step, a number of digital audio signals 1 are extracted from the same music track 11A in accordance with a respective possible implementation of the method of the present disclosure as described above.
[0323] In a next step, a number of descriptor vectors 6 are calculated from the digital audio signals 1 in accordance with a respective possible implementation of the method of the present disclosure as described above.
[0324] In preferred embodiments, the descriptor vectors 6 are calculated from the digital audio signals 1 using either a trained auto-annotating engine 26T or a trained descriptor profiler engine 16T in accordance with the present disclosure.
[0325] The descriptor vectors 6 can be stored in a database separately, or in an arbitrary or temporally ordered combination, as a compact semantic representation of the music track 11A.
[0326] The above steps are then repeated for at least one further music track 11B, resulting in further digital audio signals 1, and ultimately, further descriptor vectors 6, which can also be stored in a database separately, or in an arbitrary or temporally ordered combination, as a compact semantic representation of the music track 11B.
[0327] In a next step, these compact semantic representations are used for determining similarities between the two music tracks 11A,11B according to any known method or device designed for determining similarities between entities based on associated numerical vectors. The result of such methods or devices is usually a similarity score between the music tracks.
[0328] Even though in this exemplary implementation only two music tracks 11A,11B are compared, it should be understood that the method can also be used for comparing a larger plurality of music tracks and for determining a similarity ranking between a plurality of music tracks.
[0329] In a possible embodiment, determining similarities between two or more music tracks 11 comprises calculating distances between the descriptor vectors 6 in the vector space. In a possible embodiment the distance between the descriptor vectors 6 is determined by calculating their respective pairwise (Euclidean) distances in the vector space, whereby the shorter pairwise (Euclidean) distance represents a higher degree of similarity between the respective descriptor vectors 6. In a further possible embodiment, the respective pairwise distances between the descriptor vectors 6 are calculated with the inclusion of an optional step whereby Dynamic Time Warping is applied between the descriptor vectors 6. Similarly as above, the shorter pairwise (Euclidean) distance represents a higher degree of similarity between the respective descriptor vectors 6.
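A minimal sketch of such a pairwise Euclidean comparison, assuming SciPy and omitting the optional Dynamic Time Warping step, is given below; the conversion of distances into a similarity score is an illustrative assumption.

```python
# Sketch of the similarity computation of paragraph [0329]: pairwise Euclidean distances
# between descriptor vectors 6, where a shorter distance represents a higher degree of similarity.
from scipy.spatial.distance import cdist

def similarity_scores(descriptors_a, descriptors_b):
    """descriptors_*: arrays of shape (n_segments, n_f) holding descriptor vectors 6 of two tracks."""
    distances = cdist(descriptors_a, descriptors_b, metric="euclidean")
    return 1.0 / (1.0 + distances)               # higher value = more similar (assumed scoring convention)
```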
[0330] In another possible embodiment, determining the similarities comprises calculating an audio similarity index between each of the music tracks 11 by comparing their respective descriptor vectors 6 separately, or in an arbitrary or temporally ordered combination (according to their compact semantic representations). The audio similarity indexes may be stored (and optionally visualized) in the form of an audio similarity matrix 32, wherein each row and column represents a high-level feature value 5 or one of the plurality of music tracks 11, and each value in the matrix 32 is the audio similarity index between the respective high-level feature values 5 or the music tracks 11 that their column and row represent. Thus, the diagonal values of the matrix 32 will always be of highest value as they show the highest possible degree of similarity.
[0331] The audio similarity matrices 32 between each of the two (or more) music tracks 11A,11B can later be used to generate similarity-based playlists of the music tracks 11, or to categorize a multitude of music tracks 11 into groups according to musical or emotional characteristics.
[0332] The various aspects and implementations have been described in conjunction with various embodiments herein. However, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed subject-matter, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word “comprising” does not exclude other elements or steps, and the indefinite article “a” or “an” does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. A computer program may be stored/distributed on a suitable medium, such as an optical storage medium or a solid-state medium supplied together with or as part of other hardware, but may also be distributed in other forms, such as via the Internet or other wired or wireless telecommunication systems.
[0333] The reference signs used in the claims shall not be construed as limiting the scope.