Generation of a signature of a musical audio signal

10217469 ยท 2019-02-26

Assignee

Inventors

Cpc classification

International classification

Abstract

The invention concerns a method for generating a signature of a musical audio signal of a given duration, the method comprising the following steps: modelling (104) the musical audio signal to obtain, for each frequency band of a set of n frequency bands, a diagram representing the energy of the audio signal for the frequency band, on the basis of the time during said given duration; determining (103) musical transition times t.sub.k of the audio signal during the given duration; associating (105) each musical transition time tk with an item of local information comprising a vector of n values representative, respectively, of the energy of the audio signal in each of the n diagrams obtained between musical transition time tk and a subsequent musical transition time tk+1 and/or a vector of n values representative, respectively, of the energy of the audio signal in each of the n diagrams obtained between musical transition time tk and a preceding musical transition time tk1; determining (106), on the basis of the local information associated with each musical transition time tk, a key associated with the musical transition time, the determined keys forming a first set of keys of the audio signal; generating (107) a signature of the musical audio signal comprising pairs of keys from the first set of keys and associated musical transition times tk.

Claims

1. A computer-implemented method for the generation of a signature of a musical audio signal of a given duration, wherein the method comprises the following steps: modelling said musical audio signal in order to obtain, for each frequency band of a set of n frequency bands, with n strictly greater than 1, a diagram representing the energy of the audio signal for said frequency band, as a function of time during said given duration; determining musical transition times t.sub.k of the audio signal during the given duration, said musical transition times t.sub.k correspond respectively to local maximums of an onset function calculated on the basis of the musical audio signal; associating each musical transition time t.sub.k with an item of local information comprising a vector of n values representative respectively of the energy of the audio signal in each of the n diagrams obtained between the musical transition time t.sub.k and a subsequent musical transition time t.sub.k+1 and/or a vector of n values representative respectively of the energy of the audio signal in each of the n diagrams obtained between the musical transition time t.sub.k and a preceding musical transition time t.sub.k1; determining, as a function of the item of local information associated with each musical transition time t.sub.k, a key associated with said musical transition time, the determined keys forming a first set of keys of the audio signal; generating and storing in a memory a signature of the musical audio signal comprising couples, each couple comprising a key from the first set of keys and an associated musical transition time t.sub.k.

2. The computer-implemented method according to claim 1, in which the keys in the first set of keys are obtained by quantization of the vectors of n values representative respectively of the energy of the signal in each frequency band in the set of n frequency bands in a period comprised between the musical transition time t.sub.k and a following musical transition time t.sub.k+1 or by quantization of the vectors of n values representative respectively of the energy of the signal in each frequency band in the set of n frequency bands in a period comprised between the musical transition time t.sub.k and a preceding musical transition time t.sub.k1.

3. The computer implemented method according to claim 2, in which each key in the first set of keys is a vector of n bits, each bit of index i being determined by comparison between the energy value of index i in the vector of n values on the basis of which said key is determined, and the average of the values of said vector of n values.

4. The computer implemented method according to claim 3, in which if the energy value of index i in the vector of n values on the basis of which the key is determined is greater than the average of the values of said vector of n values, then the bit of index i is equal to 1, the bit of index i otherwise being equal to 0.

5. A non-transitory computer readable storage medium, with a program stored thereon, said program containing instructions for implementing the method according to claim 1, when this program is executed by a processor.

6. A device for generating a signature of a musical audio signal of a given duration, wherein the device comprises: a unit for modelling said musical audio signal in order to obtain, for each frequency band of a set of n frequency bands, a diagram representing the energy of the audio signal for said frequency band, as a function of time during said given duration; a first unit for determining musical transition times t.sub.k of the audio signal during the given duration; a unit for associating each musical transition time t.sub.k with an item of local information comprising a vector of n values representative respectively of the energy of the audio signal in each of the n diagrams obtained between the musical transition time t.sub.k and a subsequent musical transition time t.sub.k+1 and/or a vector of n values representative respectively of the energy of the audio signal in each of the n diagrams obtained between the musical transition time t.sub.k and a preceding musical transition time t.sub.k1; a second unit for determining, as a function of the item of local information associated with each musical transition time t.sub.k, a key associated with said musical transition time, the determined keys forming a first set of keys of the audio signal; a unit for generating a signature of the musical audio signal comprising couples, each couple comprising a key from the first set of keys and an associated musical transition time t.sub.k.

Description

(1) Other features and advantages of the invention will become apparent on examination of the detailed description below, and the attached drawings in which: FIG. 1 shows a method for the generation of a signature of a musical audio signal of a given duration according to the invention; FIG. 2 shows a superimposition of an onset function calculated on the basis of a musical audio signal and from a chromagram determined on the basis of the same musical audio signal according to the invention; FIG. 3 shows a method for identifying a query musical audio signal from reference musical audio signals, the query audio signal and the reference audio signals being represented by respective signatures obtained by implementation of the method according to FIG. 1. FIG. 4 is a time/time diagram representing a scatter plot according to an embodiment of the invention; FIG. 5 shows a device for generating a signature of a musical audio signal of a given duration according to an embodiment of the invention;

(2) FIG. 6 shows a device 600 for identifying a query musical audio signal among reference musical audio signals, the query audio signal and the reference audio signals being represented by respective signatures obtained by implementation of the method according to FIG. 1, according to the invention.

(3) FIG. 1 shows a method for the generation of a signature of a musical audio signal of a given duration according to the invention;

(4) In a step 101, a musical audio signal is received by a device for the generation of a signature according to the invention, which will be illustrated subsequently with reference to FIG. 5. Such a musical audio signal can correspond to an audio signal for which the user would like to access meta-data, such as the name of the artist or the title. Such an audio signal is called a query audio signal. The musical audio signal can also correspond to a reference audio signal the signature of which is intended to be stored in a database with a view to a subsequent comparison with a signature of a query audio signal (as detailed below with reference to FIG. 3) with a view to identifying the query audio signal.

(5) As explained previously, the invention relates to allowing not only near-exact matching (i.e. effective search for an audio excerpt in a large database with robustness to distortion) but also approximate matching (namely identifying an audio excerpt with a reference audio signal which is similar thereto within a large database).

(6) Based on the musical audio signal received, a step of determining states of the musical audio signal is carried out with reference to step 102.

(7) Step 102 comprises, in a step 103, the determination of musical transition times t.sub.k of the musical audio signal, during the given duration of the musical audio signal.

(8) In parallel with step 103, in a step 104, the step of determining states comprises the modelling of the musical audio signal in order to obtain, for each frequency band of a set of n frequency bands, a diagram representing the energy of the audio signal for the frequency band in question, as a function of time during the aforementioned given duration.

(9) Steps 103 and 104 will be better understood with reference to FIG. 2, which shows a superimposition of an onset function 201 calculated on the basis of the musical audio signal and a chromagram 205 determined on the basis of the musical audio signal. It should be noted that the determination of the chromagram 205 and of the onset function 201 can both require the application of a Fourier transform to the query audio signal, with different windowings. In order to avoid any time shifting between the onset function 201 and the chromagram 205, provision can be made in the invention to synchronize the Fourier transforms.

(10) The present invention can for example provide detection of the musical transition times t.sub.k as corresponding to the local maxima 202 of the onset function.

(11) The chromagram 205 is constituted by n diagrams 206.i, i varying from 1 to n, illustrating the energy of the musical audio signal for each frequency band. The greater the energy, the lighter the chromagram is.

(12) The method according to the invention next comprises a step 105 of defining states of the musical audio signal. A state s.sub.k is defined by associating a time t.sub.k with an item of local information 1.sub.k:
s.sub.k=(t.sub.k,i.sub.k).

(13) The item of local information i.sub.k can be given by the following characteristics:
I.sub.k=(c.sub.k,l,c.sub.k,r)

(14) in which c.sub.k,l is a mean chroma vector to the left of t.sub.k and c.sub.k,r is a mean chroma vector to the right of t.sub.k.

(15) The vector c.sub.k,r is a vector of n values representative respectively of the energy of the musical audio signal in each of the n diagrams 206 obtained in the period 203 between the musical transition time t.sub.k and a subsequent musical transition time t.sub.k+1. The vector c.sub.k,l is a vector of n values representative respectively of the energy of the query audio signal in each of the n diagrams 206.i obtained in the period 204 between the musical transition time t.sub.k and a preceding musical transition time t.sub.k+1. For example, the component of index i of the vector c.sub.k,r (hereinafter denoted c.sub.k,r,i) corresponds to the average energy of the frequency band represented by the diagram 206.i over the period 203 and the component of index i of the vector c.sub.k,l (hereinafter denoted c.sub.k,l,i) corresponds to the average energy of the frequency band represented by the diagram 206.i over the period 204.

(16) It should be noted that the item of local information i.sub.k can correspond only to the mean chroma vector c.sub.k,r to the right of t.sub.k or to the mean chroma vector c.sub.k,l to the left of t.sub.k.

(17) The advantage of the definition of states presented above is that the query audio signal is represented in a compact form representative of rhythmic and harmonic information on the musical audio signal. Thus, the present invention allows near-exact matching applications but also approximate matching of a query audio signal.

(18) The method according to the invention then comprises a step 106 of determining, as a function of the item of local information associated with each musical transition time t.sub.k, a key associated with the musical transition time t.sub.k, the determined keys forming a first set of keys of the musical audio signal;

(19) A first set of keys should be generated, which are invariant with respect to the variations in the musical audio signal and sufficiently discriminative. The invariance with respect to variations must cover the classical post-processing distortions (equalization, pitch shifting, etc) but also recordings of one and the same title under different conditions (matching of similar audio signals).

(20) According to the invention, a key can be determined for each determined state s.sub.k of the musical audio signal, and can correspond to a quantized version of one of the vectors c.sub.k,l and c.sub.k,r. The example used hereinafter will be that of a quantized version of the vector c.sub.k,r. Each component c.sub.k,r,i of the vector c.sub.k,r can for example be quantized using a single bit {tilde over (c)}.sub.k,r,i, in the following manner:

(21) C ~ k , r , i = { 1 if c k , r , i > c _ k , r 0 otherwise

(22) with c.sub.k,r corresponding to the average of the c.sub.k,r,is for i varying between 1 and n.

(23) Thus, the quantization of the vector c.sub.k,r is deterministic. Provision can also be made in the invention, so as to limit the risk of changing a key at this stage of the procedure, to generate a probabilistic set of binary vectors {tilde over (c)}.sub.k,r, associated with their respective probabilities.

(24) Each binary vector {tilde over (c)}.sub.k,r constitutes a key of the musical audio signal and is stored in association with the musical transition time t.sub.k.

(25) The first set of keys generated in this way and the times t.sub.k respectively associated with these keys thus form a signature of the musical audio signal, in a step 107.

(26) Such a signature according to the invention contains rhythmic and harmonic information on the musical audio signal, thus allowing approximate matching as well as near-exact matching. Such a method for the generation of a signature is applied both to the query audio signal and also to the reference audio signal with a view to the storage of their respective signatures in a database.

(27) When the method described above is applied to a reference audio signal, provision is made moreover for the definition of an index function. By index function is meant a function which inputs a key and returns a set of values.

(28) The benefit of such an operation is that it can potentially be carried out with a complexity O(1). In the present case, the values are pointers to the reference audio signals in the database and the keys are audio characteristics which characterize these reference audio signals.

(29) Thus, when the method for the generation of a signature has been applied to all the reference audio signals, the keys of all the references are extracted and stored in the index. Thus, the values of a key correspond to the identifiers of the reference audio signals as well as the musical transition times t.sub.k corresponding to the key in the reference audio signals.

(30) FIG. 3 shows a method for identifying a query musical audio signal among reference musical audio signals, the query audio signal and the reference audio signals being represented by respective signatures obtained by implementation of the method according to FIG. 1.

(31) In a step 301, a query audio signal is received by an identification device which will subsequently be presented with reference to FIG. 6.

(32) In a step 302, a signature of the query audio signal is obtained, by applying the method according to FIG. 1 to the query audio signal. The signature thus comprises a first set of keys K associated respectively with musical transition times t.sub.k (it should be noted that the index k of the musical transition times is independent of the symbol K representing a key).

(33) In a step 303, access is gained to a database comprising the signatures of the reference audio signals, with a view to constituting, using the index function, and for each signature of the reference signal having at least one key from the second set in common with the first set of keys of the signature of the query signal, a third set of keys comprising the keys from the second set in common with the first set. It should be noted that a key K in the third set can be associated with several musical transition times in the first set and/or with several musical transition times in the second set.

(34) The method then comprises a step 304 of determining a score as a function of a matching between each couple of musical transition times associated with each key in the third set. Determining the score is carried out for each signature of the reference signal comprising at least one key in common with the first set of keys of the signature of the query audio signal. Determining the score can for example consist of determining an affine function matching a greater number of musical transition times of the reference signal that are associated with the keys in the third set with the musical transition times of the query signal that are associated with the same keys in the third set, the score allocated to the signature of the reference signal being equal to the aforementioned greater number.

(35) A detailed example of determining the score of a signature of the reference signal based on the third set of keys common to this signature of the reference signal and to the signature of the query signal is shown with reference to FIG. 4.

(36) FIG. 4 is a time/time diagram representing a scatter plot 403, each dot corresponding to a key in the third set, the x-axis 401 of the dot being the musical transition time of the reference signal associated with the key in the third set and the y-axis of the dot corresponding to the musical transition time of the query signal associated with the key in the third set. Thus a scatter plot is constructed for each signature of the reference signal which shares at least one key in common with the signature of the query signal.

(37) Based on this scatter plot, a straight line 404 is determined connecting a maximum of dots of the constructed scatter plot. Determining the straight line 404 can be carried out by applying a Hough algorithm to the set of couples of musical transition times associated with the keys in the third set.

(38) The score of the signature of the reference signal for which the scatter plot was constructed can then be equal to the number of dots connected by the straight line 404 determined in the scatter plot. In FIG. 4, it can be seen that a straight line 404 comprising a large number of dots is obtained. The fact that many dots are situated outside the straight line is explained by the fact that a key from the first set can be present several times in the second set. This reflects the repetition of a key in the title represented by the reference audio signal. It should moreover be borne in mind that due to the very strong quantization applied in order to determine the keys, the latter are not very discriminative (which is intentional, and useful for allowing approximate matching of the query audio signal).

(39) In a step 305, at least one signature of a reference signal having one of the highest scores is selected, as a signature of a candidate reference signal to correspond to the query signal. For example, provision can be made in the invention to select M signatures of candidate reference signals, M being able to be predetermined or being able to depend on the number of keys in the first set of keys of the signature of the query signal, for example. No restriction is attached to the value of M.

(40) In a step 306, a set of signatures of reference signals having the highest scores is selected, and post-processing is applied: to candidate reference signals, represented by the signatures of the set, and to the query signal,
with a view to identifying the query signal as one of the reference signals. The set of signatures can for example comprise the aforementioned M signatures.

(41) Such a step of post-processing can utilize complex calculations, without however involving too many resources, inasmuch as the database of reference signals has been reduced to M candidate reference signals.

(42) An example of post-processing is described below. However, the invention is in no way restricted to this example, and any known method can be applied for selecting one of the candidate reference audio signals as a match for the query audio signal. The state of the art proposes numerous techniques for this purpose.

(43) At this stage of the procedure, the selected M reference audio signals should be compared with the query audio signal.

(44) By way of example, there is proposed below a dynamic programming approach which is based on the modelling of audio signals by states as previously described.

(45) Firstly, for each signature of a reference signal having keys in common with the first set of keys, the query audio signal can be modified as a function of the affine function determined for obtaining the score of the signature of the reference signal.

(46) In fact, the straight line 404 defines an affine function characterized by a slope referenced 405 in FIG. 4 and denoted a, and by a y-axis at the origin referenced 406 and denoted b. The straight line 404 means that a key extracted from the query audio signal at a time t will correspond to a key in the reference audio signal at time a(tb).

(47) The inverse of a then represents a time stretching of the query audio signal while b represents the time in the reference audio signal at which the query audio signal starts. Thus, the query audio signal can be modified by scaling the query signal as a function of the slope a and by time-cropping of the query audio signal as a function of the y-axis at the origin b.

(48) The query audio signal thus modified is represented by the sequence of states {s.sub.1 . . . s.sub.i . . . s.sub.m}, m being the number of musical transition times determined previously in the time interval corresponding to the duration of the query audio signal. In order to represent the states of the candidate reference signals, use will be made hereinafter of the notation {.sub.1 . . . .sub.j . . . .sub.p}, p being the number of musical transition times in the candidate reference signal, p being dependent on the candidate reference signal.

(49) In order to dynamically align two sequences of states, three types of penalties should be defined:

(50) f.sup.d(s.sub.i): penalty assigned for the deletion of state s.sub.i;

(51) f.sup.i(s.sub.i): penalty assigned on insertion of state s.sub.i;

(52) f.sup.s(s.sub.i, .sub.j): penalty assigned to the substitution of state s.sub.i by state .sub.j.

(53) By way of example, the functions f.sup.d and f.sup.i can both adopt a constant value equal to 0.3.

(54) Knowing that s.sub.i=(t.sub.i,(c.sub.i,l,c.sub.i,r)) and .sub.j=(.sub.j,(.sub.j,l,.sub.j,r)) denoting .sub.j the musical transition time of index j in the reference signal, .sub.j,l the mean chroma vector to the left of .sub.j, and .sub.j,r the mean chroma vector to the right of .sub.j, f.sup.s is defined by:

(55) f s ( s i , j ) = cos ( c i , l , j , l ) .Math. cos ( c i , r , j , r ) .Math. e .Math. t i - j .Math. c

(56) The first cosine term penalizes the resemblance of the mean chroma vector to the left of t.sub.i with the mean chroma vector to the left of .sub.i. The second cosine term penalizes the resemblance of the mean chroma vector to the right of t.sub.i with the mean chroma vector to the right of .sub.j. The exponential term penalizes the timing error between the occurrence of state s.sub.i in the modified query signal, and the occurrence of state .sub.j in the reference signal.

(57) Dynamic programming consists of iteratively filling a matrix D. For each couple (i,j){1 . . . m}{1 . . . p}, D(i,j) contains the score of the alignment of the subsequence {s.sub.1 . . . s.sub.i} with the sub-sequence {.sub.1 . . . .sub.j}. D is calculated as follows:

(58) D ( i , j ) = max { D ( i , j - 1 ) .Math. f d ( j ) D ( i - 1 , j - 1 ) .Math. f s ( s i , j ) D ( i - 1 , j ) .Math. f i ( s i ) }

(59) The score of the alignment of {s.sub.1 . . . s.sub.m} with {.sub.i . . . .sub.p} is finally given by D(m,p).

(60) In a step 307, at the end of the post-processing, the query audio signal is matched in a near-exact or approximate manner as corresponding to the reference audio signal having the highest score D(m,p). Such a match can furthermore be conditional on a step of comparison with a predetermined threshold: if the highest score D(m,p) is below this threshold, no match is obtained for the query audio signal.

(61) FIG. 5 shows a device 500 for the generation of a signature of a musical audio signal of a given duration.

(62) For this purpose, the device 500 comprises a reception unit 501 of a musical audio signal. As previously described, the musical audio signal can be a query audio signal or a reference audio signal.

(63) The device 500 also comprises a modelling unit 502 of the musical audio signal in order to obtain, for each frequency band of a set of n frequency bands, a diagram representing the energy of the audio signal for the frequency band, as a function of time during the given duration. The diagrams obtained can be grouped together in the form of the chromagram 205 shown in FIG. 2. In particular, the modelling unit 502 is capable of implementing step 104 of the method shown in FIG. 1.

(64) The device 500 comprises a first unit 502 for determining musical transition times t.sub.k of the audio signal during the given duration, which is capable of implementing step 103 of the method shown in FIG. 1.

(65) In order to synchronize the operations carried out by the units 502 and 503, the device 500 comprises a synchronization unit 504 capable of implementing step 104 of the method shown in FIG. 1.

(66) The device 500 comprises a unit 505 for associating each musical transition time t.sub.k with an item of local information comprising a vector of n values representative respectively of the energy of the audio signal in each of the n diagrams obtained between the musical transition time t.sub.k and a subsequent musical transition time t.sub.k+1 and/or a vector of n values representative respectively of the energy of the audio signal in each of the n diagrams obtained between the musical transition time t.sub.k and a preceding musical transition time t.sub.k1. The association unit 505 is capable of implementing step 105 of the method shown in FIG. 1.

(67) The device 500 also comprises a second unit 506 determining, as a function of the item of local information associated with each musical transition time t.sub.k, a key associated with the musical transition time, the determined keys forming a first set of keys of the audio signal. The determination unit 506 is capable of implementing step 106 of the method shown in FIG. 1.

(68) The device 500 comprises a unit 507 for generating a signature of the musical audio signal comprising couples of keys from the first set of keys and associated musical transition times t.sub.k, which is capable of implementing step 107 of the method shown in FIG. 1. The generation unit 507 can moreover be capable of transmitting the musical audio signal as well as the generated signature to a remote entity.

(69) For example, when the musical audio signal is a query audio signal, it can be transmitted with the generated signature to a remote server responsible for identifying the query audio signal, or more generally to a device according to FIG. 6. In this case the device 500 can be included in a user terminal such as a mobile telecommunications terminal, a portable or fixed computer, etc.

(70) As a variant, when the musical audio signal is a reference audio signal, it can be transmitted with the generated signature to a reference database capable of storing the reference audio signal in association with the generated signature.

(71) FIG. 6 shows a device 600 for the identification of a query musical signal among reference musical signals, the query signal and the reference signals being represented by respective signatures obtained by implementation of the method according to FIG. 1. For this purpose, the device 600 can be linked to the device 500 and can thus receive, by means of a reception unit 601, a query audio signal accompanied by a signature. In this case, a unit 602 extracts the signature of the query audio signal for communication to a constituting unit 603, described hereinafter. As a variant, the reception unit 601 can receive a query audio signal without a signature, the signature being generated by a unit 602. The unit 602 can thus correspond to the signature generation device 500 and can transmit the generated signature to the constituting unit 603.

(72) Moreover, the device 600 can be linked to a database 604 storing signatures respectively associated with reference audio signals. In the database 604, each signature of a reference signal thus comprises a second set of keys with which are associated respective musical transition times .sub.k. According to the invention, the database 604 can be incorporated into the device 600.

(73) The device 600 comprises the constituting unit 603, which accesses the database 604 and which, for each signature of a reference signal having at least one key from the second set in common with the first set of keys of the signature of the query signal, constitutes a third set of keys comprising the keys from the second set in common with the first set, each key from the third set being associated both with the musical transition time associated with the key in the signature of the reference signal, and with the musical transition time associated with the key in the signature of the query signal. The constituting unit 603 is capable of implementing step 303 of the method according to FIG. 3.

(74) The device 600 also comprises a unit 605 for determining a score as a function of a matching between each couple of musical transition times associated with a key from the third set, which is capable of implementing step 304 of the method according to FIG. 3.

(75) The device 600 comprises a unit 606 for selecting at least one signature of a reference signal having one of the highest scores, as a signature of a candidate reference signal for identifying the query signal. The selection unit 606 is capable of implementing step 305 of the method according to FIG. 3.

(76) The device also comprises a post-processing unit 607 capable of implementing step 306 of the method according to FIG. 3, so as to determine among a set of candidate reference signals, the reference audio signal matching the query audio signal. The unit 607 moreover cannot return any result when the highest score of a reference audio signal is below the threshold introduced beforehand.

(77) The device 600 comprises a transmission unit 608 for returning the reference audio signal matching the query audio signal to a user terminal or to a display unit of the device 600, not shown in FIG. 6.

(78) The invention also provides for the case in which the device 600 does not include the unit 306. In this case, the transmission unit 608 transmits the candidate query signals to a remote entity which can include the unit 306, which is particularly advantageous in cases where the device 600 has limited calculation capacities.

(79) It should be noted that the device 500 according to FIG. 5 and the device 600 according to FIG. 6 can be grouped together within one and the same overall device (when the device 500 replaces the unit 602 of the device 600). The overall device can then be used in a user terminal or in a remote server accessible from a user terminal. In the latter case, the user terminal transmits a query signal to the remote server.

(80) The present invention also provides for a computer program containing instructions for implementation of the method shown in FIG. 1, when this program is executed by a processor, and a computer program containing instructions for implementation of the method shown in FIG. 3, when this program is executed by a processor.