Audio matching
11556587 · 2023-01-17
Assignee
Inventors
CPC classification
G06F16/635
PHYSICS
International classification
G06F16/635
PHYSICS
Abstract
An audio matching technique generates audio fingerprints from a captured audio signal. Coarse and fine fingerprints are generated from the captured audio. The coarse fingerprint is used to match with a set of coarse fingerprints stored in a database to identify a subset of possibly matching database entries. The fine fingerprint is then used to perform a detailed comparison with fine fingerprints associated with the subset of possibly matching database entries in order to find a match for the captured audio signal.
Claims
1. An audio matching system comprising: an audio input for capturing an audio signal; an audio database comprising a plurality of database entries, each entry of the database entries being associated with audio content, and said each entry of the database entries comprising: i) a fine database acoustic fingerprint representative of the associated audio content; and ii) information relating to the associated audio content; and one or more processors configured to: process the captured audio signal to generate a fine query acoustic fingerprint representative of the captured audio signal, wherein the fine query acoustic fingerprint comprises an array of values; generate a coarse query acoustic fingerprint representative of the captured audio signal from the fine query acoustic fingerprint by applying a set of different filters to the fine query acoustic fingerprint; apply a filter from the set of filters to a portion of the array of values, by weighting each value of the values of the portion with a respective filter coefficient and by combining the weighted values, to generate a value of the coarse query acoustic fingerprint, wherein the filter comprises a plurality of filter coefficients; match, based on the generated value of the coarse query acoustic fingerprint, the coarse query acoustic fingerprint with coarse database acoustic fingerprints associated with said plurality of the database entries to identify a subset of possibly matching database entries; match the fine query acoustic fingerprint with fine database acoustic fingerprints of the database entries in said subset of the possibly matching database entries to identify a matching database entry; and output a matching response comprising information of the identified matching database entry.
2. The audio matching system according to claim 1, wherein each entry comprises an associated coarse database acoustic fingerprint, or wherein the one or more processors are configured to generate the coarse database acoustic fingerprint associated with a database entry.
3. The audio matching system according to claim 1, wherein each filter from the set of filters is applied to a plurality of portions of the array to generate a corresponding plurality of values of the coarse query acoustic fingerprint.
4. The audio matching system according to claim 3, wherein the one or more processors are configured to generate a row or column of the coarse query acoustic fingerprint in response to applying each filter to the fine query acoustic fingerprint.
5. The audio matching system according to claim 1, wherein the fine query acoustic fingerprint has a greater bit rate than the coarse query acoustic fingerprint.
6. The audio matching system according to claim 1, wherein the one or more processors are configured to generate a spectrogram of the captured audio signal and the one or more processors are configured to generate the fine query acoustic fingerprint from the spectrogram.
7. The audio matching system according to claim 6, wherein the one or more processors are configured to apply a set of filters to the spectrogram to generate the fine query acoustic fingerprint.
8. The audio matching system according to claim 7, wherein the one or more processors are configured to apply a set of filters to the spectrogram that is different to the set of filters applied to generate the coarse query acoustic fingerprint.
9. The audio matching system according to claim 7 wherein the one or more processors are configured to apply a filter from the set of filters to a portion of the spectrogram to generate a value of the fine query acoustic fingerprint.
10. The audio matching system according to claim 9, wherein each filter comprises a plurality of filter coefficients and wherein the one or more processors are configured to apply the filter to the portion of the spectrogram by weighting each value of the values of the portion with a respective filter coefficient and by combining the weighted values to generate the value of the fine query acoustic fingerprint.
11. The audio matching system according to claim 10, wherein each filter from the set of filters is applied to a plurality of portions of the spectrogram to generate a corresponding plurality of values of the fine query acoustic fingerprint.
12. The audio matching system according to claim 11, wherein the one or more processors are configured to generate a row or column of the fine query acoustic fingerprint in response to applying each filter to the spectrogram.
13. The audio matching system according to claim 12, wherein the one or more processors are configured to order rows or columns of the fine query acoustic fingerprint that are generated by applying the set of filters to the spectrogram so that similar rows or columns are adjacent to each other.
14. The audio matching system according to claim 13, wherein the one or more processors are configured to order the rows or columns of the fine query acoustic fingerprint in order to increase coherence between neighbouring rows or columns of the fine query acoustic fingerprint.
15. The audio matching system according to claim 1, wherein each filter has an associated offset that defines portions of the fine query acoustic fingerprint to which the filter is applied.
16. The audio matching system according to claim 1, wherein the audio input is configured to capture one of: an acoustic signal and an electromagnetic signal.
17. An audio matching method, the method comprising: capturing, by one or more processors, an audio signal; providing, by the one or more processors, an audio database comprising a plurality of database entries, each entry of the database entries being associated with audio content, and said each entry of the database entries comprising: a fine database acoustic fingerprint representative of the associated audio content and information relating to the associated audio content; processing, by the one or more processors, the captured audio signal to generate a fine query acoustic fingerprint representative of the captured audio signal, wherein the fine query acoustic fingerprint comprises an array of values; generating, by the one or more processors, a coarse query acoustic fingerprint representative of the captured audio signal from the fine query acoustic fingerprint by applying a set of different filters to the fine query acoustic fingerprint; applying, by the one or more processors, a filter from the set of filters to a portion of the array of values, by weighting each value of the values of the portion with a respective filter coefficient and by combining the weighted values, to generate a value of the coarse query acoustic fingerprint, wherein the filter comprises a plurality of filter coefficients; matching, based on the generated value of the coarse query acoustic fingerprint, the coarse query acoustic fingerprint with coarse database acoustic fingerprints of the plurality of the database entries to identify a subset of possibly matching database entries; matching, by the one or more processors, the fine query acoustic fingerprint with fine database acoustic fingerprints of the database entries in said subset of the possibly matching database entries to identify a matching database entry; and outputting, by the one or more processors, a matching response comprising information relating to the identified matching database entry.
18. An audio matching system comprising: an audio database comprising a plurality of database entries, each entry of the database entries being associated with audio content, and said each entry of the database entries: i) comprising a first database acoustic fingerprint representative of the associated audio content having a first bit rate; ii) having an associated second database acoustic fingerprint representative of the associated audio content having a second bit rate that is lower than the first bit rate; and iii) comprising information relating to the associated audio content; and one or more processors configured to: capture an audio signal; process the captured audio signal to generate a first query acoustic fingerprint representative of the captured audio signal having said first bit rate, wherein the first query acoustic fingerprint comprises an array of values; apply a filter from a set of filters to a portion of the array of values, by weighting each value of the values of the portion with a respective filter coefficient and by combining the weighted values, to generate a value of a coarse query acoustic fingerprint; generate a second query acoustic fingerprint representative of the captured audio signal having said second bit rate by applying a set of different filters to the first query acoustic fingerprint; match the second query acoustic fingerprint with second database acoustic fingerprints associated with said plurality of the database entries to identify a subset of possibly matching database entries; match, based on the generated value of the coarse query acoustic fingerprint, the first query acoustic fingerprint with first database acoustic fingerprints of the database entries in said subset of the possibly matching database entries to identify a matching database entry; and output a matching response comprising information of the identified matching database entry.
19. The audio matching system according to claim 18, wherein each entry comprises an associated second database acoustic fingerprint, or wherein the one or more processors are configured to generate the second database acoustic fingerprint associated with a database entry.
20. An audio matching system comprising: a user device, an audio matching server and an audio database comprising a plurality of database entries, each entry of the database entries being associated with audio content, and said each entry of the database entries: i) comprising a first database acoustic fingerprint representative of the associated audio content having a first bit rate; ii) having an associated second database acoustic fingerprint representative of the associated audio content having a second bit rate that is lower than the first bit rate; and iii) comprising information relating to the associated audio content; wherein the user device has one or more processors configured to: capture an audio signal; process the captured audio signal to generate a first query acoustic fingerprint representative of the captured audio signal having said first bit rate, wherein the first query acoustic fingerprint comprises an array of values; wherein the audio matching server has one or more processors; wherein the one or more processors of the user device or the one or more processors of the audio matching server is configured to: generate a second query acoustic fingerprint representative of the captured audio signal having said second bit rate; apply a filter from a set of filters to a portion of the array of values, by weighting each value of the values of the portion with a respective filter coefficient and by combining the weighted values, to generate a value of a coarse query acoustic fingerprint, wherein the filter comprises a plurality of filter coefficients; wherein the one or more processors of the audio matching server is configured to: match the second query acoustic fingerprint with second database acoustic fingerprints associated with said plurality of database entries to identify a subset of possibly matching database entries; match, based on the generated value of the coarse query acoustic fingerprint, the first query acoustic fingerprint with first database acoustic fingerprints of the database entries in said subset of the possibly matching database entries to identify a matching database entry; and output a matching response comprising information of the identified matching database entry.
21. A user device for use in an audio matching system, the user device comprising: an audio input for capturing an audio signal; an audio database comprising a plurality of database entries, each entry of the database entries being associated with audio content, and said each entry of the database entries comprising: i) a fine database acoustic fingerprint representative of the associated audio content; and ii) information relating to the associated audio content; and one or more processors configured to: process the captured audio signal to generate a fine query acoustic fingerprint representative of the captured audio signal, wherein the fine query acoustic fingerprint comprises an array of values; generate a coarse query acoustic fingerprint representative of the captured audio signal from the fine query acoustic fingerprint by applying a set of different filters to the fine query acoustic fingerprint; apply a filter from the set of filters to a portion of the array of values, by weighting each value of the values of the portion with a respective filter coefficient and by combining the weighted values, to generate a value of the coarse query acoustic fingerprint, wherein the filter comprises a plurality of filter coefficients; match, based on the generated value of the coarse query acoustic fingerprint, the coarse query acoustic fingerprint with coarse database acoustic fingerprints associated with said plurality of the database entries to identify a subset of possibly matching database entries; match the fine query acoustic fingerprint with fine database acoustic fingerprints of the database entries in said subset of the possibly matching database entries to identify a matching database entry; and output a matching response comprising information of the identified matching database entry.
22. An audio matching server for use in an audio matching system, the audio matching server comprising: an audio input for capturing an audio signal; an audio database comprising a plurality of database entries, each entry of the database entries being associated with audio content, and said each entry of the database entries comprising: i) a fine database acoustic fingerprint representative of the associated audio content; and ii) information relating to the associated audio content; and one or more processors configured to: process the captured audio signal to generate a fine query acoustic fingerprint representative of the captured audio signal, wherein the fine query acoustic fingerprint comprises an array of values; generate a coarse query acoustic fingerprint representative of the captured audio signal from the fine query acoustic fingerprint by applying a set of different filters to the fine query acoustic fingerprint; apply a filter from the set of filters to a portion of the array of values, by weighting each value of the values of the portion with a respective filter coefficient and by combining the weighted values, to generate a value of the coarse query acoustic fingerprint, wherein the filter comprises a plurality of filter coefficients; match, based on the generated value of the coarse query acoustic fingerprint, the coarse query acoustic fingerprint with coarse database acoustic fingerprints associated with said plurality of the database entries to identify a subset of possibly matching database entries; match the fine query acoustic fingerprint with fine database acoustic fingerprints of the database entries in said subset of the possibly matching database entries to identify a matching database entry; and output a matching response comprising information of the identified matching database entry.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) These and other aspects of the invention will become apparent from the following detailed description of exemplary embodiments, which are described with reference to the accompanying drawings.
(24) Overview
(26) A more detailed description will now be given of the main parts of the above described audio matching system.
(27) User Cellular Telephone
(29) When making a voice call, the processor 63 compresses the received audio and then passes it to an RF processing unit 57 which modulates the compressed audio data onto one or more RF carrier signals for transmission to the base station 7 via the antenna 27. Similarly, compressed audio signals received via the antenna 27 are fed to the RF processing unit 57, which demodulates the received RF signals to recover the compressed audio data from the RF carrier signal(s), which is then passed to the processor 63 for decompression. The regenerated audio samples are then output to the loudspeaker 25 via the digital to analogue converter 59 and the amplifier 61.
(31) Application Software—Frequency Analysis
(33) As is well known in the art, a windowing function (such as a Hamming window) is typically used to extract the frames 39 of samples from the incoming audio signal (D(t))—to reduce distortions introduced by the extraction. Once a frame 39 of samples has been extracted, the frequency analysis unit 34 performs a frequency analysis process on the audio samples to determine the frequency content within a defined frequency band of interest, which will typically be a portion of the passband of the filter 51. In this embodiment, this frequency band of interest is limited to the band 475 Hz to 2.52 kHz. Other frequency bands may of course be used.
(34) As those skilled in the art will appreciate, the frequency analysis process performed by the frequency analysis unit 34 can be done in a number of different ways, such as by using a Fast Fourier Transform (FFT), a Discrete Cosine Transform (DCT), wavelet transforms, or an array of filter banks. In the preferred embodiment, wavelet transforms are used. This frequency analysis will generate, for each frame 39 of audio samples, a vector of numbers representing the frequency content (amplitude) in each of a number (K) of frequency sub-bands within the defined frequency band of interest (e.g. 475 Hz to 2.52 kHz).
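To make the frame-by-frame analysis concrete, the following Python sketch builds a K x L spectrogram from Hamming-windowed frames using an FFT (one of the alternatives mentioned above; the preferred embodiment uses wavelets, which this sketch does not reproduce). The frame length, hop size, sample rate and the choice of 16 sub-bands are illustrative assumptions; only the 475 Hz to 2.52 kHz band of interest comes from the description:

```python
import numpy as np

def spectrogram(samples, frame_len=1024, hop=512, fs=8000,
                f_lo=475.0, f_hi=2520.0, n_bands=16):
    """Frame the signal with a Hamming window, FFT each frame, and pool
    FFT magnitudes into n_bands sub-bands covering f_lo..f_hi.
    frame_len, hop, fs and n_bands are illustrative choices."""
    win = np.hamming(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    in_band = (freqs >= f_lo) & (freqs <= f_hi)
    band_idx = np.array_split(np.where(in_band)[0], n_bands)
    cols = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        spec = np.abs(np.fft.rfft(samples[start:start + frame_len] * win))
        cols.append([spec[idx].mean() for idx in band_idx])
    # K x L matrix: rows = frequency sub-bands, columns = time frames
    return np.array(cols).T
```

Each column of the returned matrix is the sub-band amplitude vector for one frame, giving the K x L layout described in the text.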
(36) The spectrogram thus generated is effectively a K×L matrix of values representing the audio clip. The rows of the matrix represent the different frequency sub-bands and the different columns represent different points of time within the audio clip. The individual value in the spectrogram at location (i, j) corresponds to the amplitude of the frequency component in sub-band i at time j. Of course, the matrix could be written in the transpose—with the columns representing the frequency sub-bands and the rows representing the time points. Therefore, references to rows and columns in this document are interchangeable.
(37) Application Software—Fingerprint Generation
(41) The filter type 47-2 can be formed from the following matrix of coefficients: [−1 2 −1]
(44) As discussed above, the fingerprint 43 is generated by applying a first optimised set 45 of these filters 47 to the spectrogram 35. Each different filter 47 in this first set 45 will produce a different row of the final fingerprint 43. Thus, concatenating the different binary vectors 44 produced by applying the first set 45 of filters to the spectrogram 35 into a matrix, forms the final output 2D fingerprint 43. The order in which the binary vectors 44 are concatenated is determined in advance and the same first set 45 of filters and ordering are used to generate corresponding fingerprints for the entries in the database 15, so that the audio matching server 5 can compare fingerprints that have been generated in the same manner. In this embodiment, the first set 45 of filters comprises thirty-two different filters and so the fingerprint that is generated will be a 32 by L matrix of binary values. Thirty-two filters were chosen as this allows for convenient processing of the fingerprints by a 32 bit or a 64 bit processor (which may be used, for example, to perform the fingerprint matching process in the audio matching server 5). However, as those skilled in the art will appreciate, any number of filters may be used in the first set 45.
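A minimal sketch of this step: each filter is applied across the spectrogram and the responses are binarised to form one row of the 2-D fingerprint. For simplicity the filters below are 1-D masks applied along the time axis of a single sub-band selected by an offset; the actual optimised set 45 of small 2-D filters and the exact binarisation rule are not reproduced, so the sign threshold and the example filters (including the [−1 2 −1] type shown above) are assumptions:

```python
import numpy as np

def fingerprint_rows(spec, filters):
    """Apply each (offset, coefficients) filter along the time axis of the
    sub-band it is anchored at, then binarise the responses by sign.
    Each filter contributes one row of the binary fingerprint.
    The sign threshold and 1-D simplification are assumptions."""
    rows = []
    for offset, coeffs in filters:
        w = np.asarray(coeffs, dtype=float)
        band = spec[offset]                       # anchored sub-band
        resp = np.convolve(band, w[::-1], mode="same")  # correlate with filter
        rows.append((resp > 0).astype(np.uint8))
    return np.vstack(rows)                        # n_filters x L binary matrix

# Example filter set, including the [-1 2 -1] type from the description
filters = [(0, [-1, 2, -1]), (1, [-1, 2, -1]), (2, [1, -1])]
```

With thirty-two such filters the result would be the 32 by L binary fingerprint described above.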
(45) Further, as before, the rows and columns of the fingerprint 43 are interchangeable. Thus instead of the binary vectors 44 forming the rows of the fingerprint 43, they may be used to form the columns. In this case the fingerprint will be an L by 32 matrix of binary values. As long as the same process is performed to generate the fingerprints for the entries in the audio database 15, the orientation of the fingerprint 43 does not matter.
(46) Fingerprint Row/Column Ordering
(47) As discussed above, the binary vectors 44 generated by applying the filters 47 in the optimised set 45 to the spectrogram 35 are concatenated together in an order that is defined in advance. Normally, the order does not matter, as long as the same order is applied when generating the fingerprints for the entries in the database 15. This means that in a conventional fingerprint, the 1s and 0s will appear randomly distributed throughout the fingerprint, such as for the example fingerprint 43-1.
(48) As those skilled in the art will appreciate, as the ordering of the binary vectors 44 is known in advance, the individual binarised values could be written directly in to the relevant part of the fingerprint 43 without being written into a binary vector 44 first. The explanation above has been given for ease in understanding the way in which the fingerprint 43 is generated.
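The ordering of rows so that similar rows are adjacent can be achieved in many ways; the patent does not prescribe a particular algorithm. One simple, hypothetical heuristic is a greedy nearest-neighbour ordering by Hamming distance:

```python
import numpy as np

def coherent_order(rows):
    """Greedy ordering: start from row 0 and repeatedly append the
    remaining row with the smallest Hamming distance to the last one,
    so that similar rows end up adjacent. This is one illustrative
    heuristic, not the ordering used in the patent."""
    remaining = list(range(1, len(rows)))
    order = [0]
    while remaining:
        last = rows[order[-1]]
        nxt = min(remaining,
                  key=lambda i: int(np.count_nonzero(rows[i] != last)))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

The same fixed ordering would then be applied when generating the database fingerprints, so that query and database fingerprints remain comparable.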
(49) Application Software—Matching Response
(51) Audio Matching Server
(53) In this embodiment, the processor 201 is controlled by software instructions stored in memory 209 (although in other embodiments, the processor 201 may be formed from one or more dedicated hardware processors—such as Application Specific Integrated Circuits). The software instructions include an operating system 211 that controls the overall operation of the audio matching server 5; a communications control module 213 that controls communications between the audio matching server 5 and the user device 1 and the database 15; a coarse fingerprint generation unit 215 that generates a coarse fingerprint from the query fingerprint 43 received from a user device 1; a coarse fingerprint matching unit 217 that matches the coarse fingerprint generated by the coarse fingerprint generation unit 215 with coarse fingerprints stored in the database 15; a fine fingerprint matching unit 219 that matches the fingerprint 43 received from the user device 1 with a fine fingerprint of a subset of the entries in the database 15 to identify a matching entry; and a matching response reporting unit 220 that reports the matching results back to the user device 1 in a matching response message 46. As will be explained in more detail below, the coarse fingerprint generation unit 215 generates the coarse fingerprint using a second set 221 of optimised filters that is stored in the memory 209.
(54) As discussed above, the audio matching server 5 performs matching operations between coarse/fine fingerprints corresponding to a query received from the user device 1 and coarse/fine fingerprints stored within the database 15. To distinguish between these different fingerprints, the fingerprint 43 received from the user device 1 will be referred to as the “fine query fingerprint” 43 and the coarse fingerprint that is generated from it will be referred to as the “coarse query fingerprint”. The fingerprints stored in the database 15 will be referred to as “coarse database fingerprints” and “fine database fingerprints”.
(55) Coarse Fingerprint Generation Unit
(56) As mentioned above, in this embodiment, the coarse fingerprint generation unit 215 generates the coarse query fingerprint from the fine query fingerprint 43 received from the user device 1. This is advantageous as it means that the user device 1 does not need to transmit, for example, the spectrogram 35 of the audio clip to the audio matching server 5 in order for the coarse query fingerprint to be generated.
(60) The inventors have found that the coarse fingerprint generated in this way is still sufficiently distinctive to allow it to be used in an initial search of the database 15 in order to significantly reduce the number of database entries that have to be compared with the fine query fingerprint. This is because of the specific ordering that was performed to generate the fine query fingerprint. This ordering means that there are larger (information containing) bit patterns in the fine fingerprint, and the information contained in these larger bit patterns survives to some extent through the process of generating the corresponding coarse fingerprints. If a more traditional (random looking) fine fingerprint (such as the example fingerprint 43-1 discussed above) were used, much of this information would be lost when generating the coarse fingerprint.
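One way to picture the coarse fingerprint generation is as a filtering-and-rebinarising of the fine fingerprint. The sketch below weights each block of the fine fingerprint with uniform coefficients (a 4x4 block average) and re-binarises with a 0.5 threshold; these choices are purely illustrative, since the server actually uses the second optimised set 221 of filters, which this does not reproduce:

```python
import numpy as np

def coarse_from_fine(fine_fp, block=(4, 4)):
    """Derive a lower bit-rate coarse fingerprint from a binary fine
    fingerprint: average each block (uniform filter coefficients) and
    re-binarise. Block size and threshold are assumptions."""
    bh, bw = block
    K, L = fine_fp.shape
    K2, L2 = K // bh, L // bw
    blocks = fine_fp[:K2 * bh, :L2 * bw].reshape(K2, bh, L2, bw)
    means = blocks.mean(axis=(1, 3))          # weighted combination per block
    return (means > 0.5).astype(np.uint8)     # coarse binary fingerprint
```

Because the fine fingerprint's rows were ordered for coherence, neighbouring bits are correlated and the block averages retain useful structure.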
(61) Coarse Fingerprint Matching Unit
(62) Once the coarse query fingerprint 309 has been generated, the coarse fingerprint matching unit 217 compares the coarse query fingerprint 309 with coarse database fingerprints stored in the database 15 in order to identify a subset of the database entries that may be a possible match.
(65) Thus, once the coarse query fingerprint 309 has been generated, the coarse fingerprint matching unit 217 matches (i.e. compares) the coarse query fingerprint 309 with the coarse database fingerprint 325 stored in every entry 320 of the database 15 in order to identify a number of possible matching entries.
(67) In this embodiment the bit-wise comparison considers the percentage of non-matching bits. Thus if there is not a match then the expected percentage of non-matching bits should be around 50% (or 0.5). If there is a match then the expected percentage of non-matching bits should be close to zero.
(68) The result of the comparison between the coarse query fingerprint 309 and the coarse database fingerprints 325 includes a list of database entries 320 that might match the coarse query fingerprint 309.
(69) Fine Fingerprint Matching Unit
(70) The comparison results obtained from the coarse fingerprint matching unit 217 are then passed to the fine fingerprint matching unit 219, which uses this information to restrict the matching operation that it performs between the fine query fingerprint 43 and the fine database fingerprints 323. In particular, the fine fingerprint matching unit 219 uses the list of possible matching entries so that the fine fingerprint comparisons are restricted to just the fine fingerprints in the database entries identified in this list of possible matching entries. Further, if the comparison results include timing information indicating the time within the audio content where the match was found in the coarse database fingerprint 325, then the fine fingerprint matching unit 219 uses this timing information to restrict the comparison between the fine query fingerprint 43 and the corresponding fine database fingerprint 323 to around this timing. So, for example, if the match was found at 135 seconds from the start of the coarse fingerprint 325, then the fine fingerprint matching unit 219 may restrict the matching process so that the fine query fingerprint 43 is only matched with the portions of the fine database fingerprint between times 130 and 145 seconds from the start.
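The timing restriction can be sketched as follows. The conversion factor between fingerprint columns and seconds (cols_per_sec) and the asymmetric 5 s before / 10 s after window (mirroring the 130 to 145 second example above) are assumptions for the sketch:

```python
import numpy as np

def fine_match_windowed(fine_q, fine_db, coarse_off, cols_per_sec=10.0,
                        window_sec=(5.0, 10.0)):
    """Compare the fine query fingerprint only within a time window around
    the offset reported by the coarse match (coarse_off, already converted
    to fine-fingerprint columns). Returns (best offset, bit error rate)."""
    before, after = window_sec
    lo = max(0, coarse_off - int(before * cols_per_sec))
    hi = min(fine_db.shape[1] - fine_q.shape[1],
             coarse_off + int(after * cols_per_sec))
    n = fine_q.shape[1]
    offsets = range(lo, hi + 1)
    bers = [np.mean(fine_q != fine_db[:, o:o + n]) for o in offsets]
    i = int(np.argmin(bers))
    return lo + i, float(bers[i])
```

Restricting the search window in this way is what makes the two-stage (coarse then fine) search cheaper than fine matching against every entry.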
(72) Matching Response Reporting Unit
(73) The matching response reporting unit 220 either receives a “nil” report or the identifier for the database entry 320 that matches with the fine query fingerprint. If a “nil” report is received then the matching response reporting unit 220 returns a “nil” response back to the user device 1. If a database identifier is received, then the matching response reporting unit 220 retrieves relevant information from the corresponding database entry 320. The information retrieved may include the stored metadata 322 and/or stored links 327 from the identified database entry 320. This information is then returned to the user device 1 in a matching response message 46.
(74) Training
(75) Identifying the First Optimised Set of Filters
(76) The above description describes the operation of an audio matching system that uses audio fingerprints to identify captured audio. In order to generate the fine fingerprint 43, a first set 45 of optimised filters was applied to the spectrogram 35 of the captured audio. The way in which this first set 45 of optimised filters is determined will now be explained. This process happens in advance during a training routine.
(83) Unfortunately, it is not possible simply to determine the distributions for all 16,000 possible filters and then pick the ones that have the best discrimination (the largest separation between the matching and non-matching distributions, the smallest variances, etc.), as many of the filters will effectively be isolating the same characteristic feature in the audio signal. That is, many of the distributions for the different filters will be highly correlated with one another. It is possible to identify these correlations by looking at the covariance between the filter distributions and to use this information in an optimisation process to find the optimum combination of filters. The aim of that optimisation can be to minimise the chance of “false positives” (falsely declaring a pair a “match”) and to minimise the chance of false negatives (falsely declaring a pair a “non-match”) when the generated fingerprint is being matched against the database fingerprints. These are contradictory demands, as in general reducing the chance of false positives increases the chance of false negatives. To address this, we can define a certain accepted rate of false positives (P_FP,accept) and then, subject to this constraint, find the optimal set of filters that minimises the false-negative rate.
(84) To calculate P_FP,accept, note that the distribution resulting from a set of filters is the sum of normal distributions and therefore a normal distribution itself. Therefore, if the distributions for matching and non-matching pairs are well separated, a bit error rate threshold (γ) can be set between them to discriminate matching pairs from non-matching pairs.
(85) The chance of a false positive is based on the chance of the bit error rate of a non-matching pair falling below the threshold (γ), which, for a normal distribution, is given by:
(86) P_FP = (1/2)·erfc((μ_N − γ)/(√2·σ_N))
Where μ_N is the mean bit error rate for a pair of non-matching fingerprints, σ_N is the standard deviation of the bit error rate for a pair of non-matching fingerprints and erfc is the standard complementary error function.
(87) When a fingerprint is being matched against a large database of fingerprints, the probability of a false positive depends on the size of the database (D) and can be approximated as:
(88) P.sub.FP,total≈D·½erfc((μ.sub.N−γ)/(√{square root over (2)}σ.sub.N))
which is set to the acceptance rate. This equation can be inverted to find the corresponding threshold value that will achieve this accepted rate of false positives:
γ=μ.sub.N−√{square root over (2)}σ.sub.Nerfc.sup.−1(2P.sub.FP,accept/D)
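The threshold calculation above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `erfc_inv` and `fp_threshold` are invented here, and erfc.sup.−1 is obtained from the standard normal quantile, since the Python standard library does not expose an inverse complementary error function directly.

```python
from statistics import NormalDist
import math

def erfc_inv(y):
    # erfc(x) = 2*Phi(-sqrt(2)*x), so erfc_inv(y) = -Phi_inv(y/2)/sqrt(2),
    # where Phi is the standard normal CDF.
    return -NormalDist().inv_cdf(y / 2.0) / math.sqrt(2.0)

def fp_threshold(mu_n, sigma_n, p_fp_accept, db_size):
    # gamma = mu_N - sqrt(2) * sigma_N * erfc_inv(2 * P_FP,accept / D)
    return mu_n - math.sqrt(2.0) * sigma_n * erfc_inv(2.0 * p_fp_accept / db_size)
```

For example, with μ.sub.N=0.5, σ.sub.N=0.05, an accepted rate of 0.01 and a database of one million entries, the threshold falls well below the non-matching mean, so only a tiny fraction of non-matching entries can produce a bit error rate below it.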
(89) Therefore, given that the threshold is set as above, the recognition rate can now be maximised by minimising the chance of a false negative. The result is:
(90) P.sub.FN=½erfc((γ−μ.sub.M)/(√{square root over (2)}σ.sub.M))=½erfc((μ.sub.N−μ.sub.M−√{square root over (2)}σ.sub.Nerfc.sup.−1(2P.sub.FP,accept/D))/(√{square root over (2)}σ.sub.M))
(91) Where μ.sub.M is the mean bit error rate for a pair of matching fingerprints, σ.sub.M is the standard deviation of the bit error rate for a pair of matching fingerprints, μ.sub.N is the mean bit error rate for a pair of non-matching fingerprints, σ.sub.N is the standard deviation of the bit error rate for a pair of non-matching fingerprints and erfc is the standard complementary error function.
(92) Since the complementary error function is a monotonically decreasing function, minimising the chance of a false negative, i.e. minimising the above function, is equivalent to maximising the argument of the complementary error function, here called the first ‘Score’, S.sup.(1):
(93) S.sup.(1)=(μ.sub.N−μ.sub.M−√{square root over (2)}σ.sub.Nerfc.sup.−1(2P.sub.FP,accept/D))/(√{square root over (2)}σ.sub.M)
(94) Thus, the aim of the optimisation process is to find the set 45 of filters with aggregate parameters (μ.sub.M, μ.sub.N, σ.sub.M, σ.sub.N) that result in the highest score S.sup.(1).
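As a sketch (with an invented function name `score_1` and invented parameter values), the score can be evaluated directly from the four aggregate parameters; better separation between the matching and non-matching means, or a smaller matching standard deviation, increases the score:

```python
from statistics import NormalDist
import math

def erfc_inv(y):
    # erfc(x) = 2*Phi(-sqrt(2)*x)  =>  erfc_inv(y) = -Phi_inv(y/2)/sqrt(2)
    return -NormalDist().inv_cdf(y / 2.0) / math.sqrt(2.0)

def score_1(mu_m, sigma_m, mu_n, sigma_n, p_fp_accept, db_size):
    # S(1): larger is better, i.e. fewer false negatives at the
    # accepted false-positive rate for a database of db_size entries.
    return ((mu_n - mu_m - math.sqrt(2.0) * sigma_n *
             erfc_inv(2.0 * p_fp_accept / db_size)) /
            (math.sqrt(2.0) * sigma_m))
```

Note that the database size D only shifts the score through the erfc.sup.−1 term: a larger database demands a lower threshold and so a larger separation for the same score.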
(95) These aggregate parameters over the set 45 of filters are related to the individual parameters of the individual filters in the set 45 as follows:
(96) μ.sub.M=(1/n)Σ.sub.i=1.sup.nμ.sub.M(i) and μ.sub.N=(1/n)Σ.sub.i=1.sup.nμ.sub.N(i)
(97) Where n is the number of filters in the set 45. The aggregated variance (square of the standard deviation) becomes a combination of the variances of the individual filters belonging to the set 45 as well as the covariance between pairs of filters in the set 45, as follows:
(98) (σ.sub.M).sup.2=(1/n.sup.2)(Σ.sub.i=1.sup.n(σ.sub.M(i)).sup.2+Σ.sub.l≠kCOV.sub.M.sup.(l,k)) and (σ.sub.N).sup.2=(1/n.sup.2)(Σ.sub.i=1.sup.n(σ.sub.N(i)).sup.2+Σ.sub.l≠kCOV.sub.N.sup.(l,k))
(99) Where COV.sup.(l,k) is the covariance between filter l and filter k.
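A minimal sketch of this aggregation, with illustrative names, and assuming the aggregate bit error rate is the average of the individual filters' bit error rates: the aggregate mean is the mean of the per-filter means, and the aggregate variance is the variance of that average, i.e. the sum of every entry of the covariance matrix (whose diagonal holds the per-filter variances) divided by n squared.

```python
def aggregate_stats(means, cov):
    # means: per-filter mean bit error rates (length n)
    # cov:   full n-by-n covariance matrix; cov[i][i] is filter i's variance
    n = len(means)
    mu = sum(means) / n
    # variance of the average: all variances plus all pairwise covariances
    var = sum(cov[l][k] for l in range(n) for k in range(n)) / n ** 2
    return mu, var
```

For two uncorrelated filters the aggregate variance halves; for two perfectly correlated filters it does not shrink at all, which is exactly why correlated filters add little discriminating power to the set.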
(100) The means and variances for individual filters for both matching and non-matching pairs of audio clips can be determined from the training process discussed above with reference to
(101) μ.sub.M(i)=(1/P)Σ.sub.p=1.sup.Pb.sub.M,p(i) and μ.sub.N(i)=(1/Q)Σ.sub.q=1.sup.Qb.sub.N,q(i)
where b.sub.M,p(i) is the bit error rate produced by filter i for the p-th of the P matching training pairs and b.sub.N,q(i) is the bit error rate produced by filter i for the q-th of the Q non-matching training pairs.
(102) And the corresponding variances from:
(103) (σ.sub.M(i)).sup.2=(1/P)Σ.sub.p=1.sup.P(b.sub.M,p(i)−μ.sub.M(i)).sup.2 and (σ.sub.N(i)).sup.2=(1/Q)Σ.sub.q=1.sup.Q(b.sub.N,q(i)−μ.sub.N(i)).sup.2
(104) Moreover, the covariance value (COV.sub.M.sup.(i,j)) between two filters (i and j) for matching pairs of audio clips can be determined from:
(105) COV.sub.M.sup.(i,j)=(1/P)Σ.sub.p=1.sup.P(b.sub.M,p(i)−μ.sub.M(i))(b.sub.M,p(j)−μ.sub.M(j))
(106) And the covariance value (COV.sub.N.sup.(i,j)) between two filters (i and j) for non-matching pairs of audio clips can be determined from:
(107) COV.sub.N.sup.(i,j)=(1/Q)Σ.sub.q=1.sup.Q(b.sub.N,q(i)−μ.sub.N(i))(b.sub.N,q(j)−μ.sub.N(j))
(108) This involves the calculation and storage of N.sub.f (the number of filters considered, which as discussed above is roughly 16,000) values of (μ.sub.M(i), σ.sub.M(i)); N.sub.f values of (μ.sub.N(i), σ.sub.N(i)); and (the dominant part) 2(N.sub.f).sup.2 covariance values. From these values it is then possible to calculate the above score S.sup.(1) for any combination of filters.
(109) It is not practical to calculate this score for every single combination of n filters from this set of 16,000 possible filters—the number of combinations prohibits this. However, it is possible to use a Dynamic Programming technique to break this problem down into an iterative path searching problem through a trellis of nodes that propagates and scores paths through the trellis of nodes. This means that the optimum path can be found through the trellis without having to consider and score all paths.
(110) Such a trellis 409 is illustrated in
(111) One of the advantages of using a dynamic programming technique to find the best path through the trellis is that the scores for each path can be accumulated during the path traversal process. Specifically, consider the instance where a candidate path currently ends at node q at column number K in the trellis (i.e. K filters have been selected so far), which represents a set of filters l=1, 2, . . . K. In this case, the aggregate means μ.sub.M and μ.sub.N can be updated toward node r at column K+1 by incorporating the means of the filter at node r, μ.sub.M(r) and μ.sub.N(r), i.e.:
(112) μ.sub.M.sup.r=(Kμ.sub.M.sup.q+μ.sub.M(r))/(K+1) and μ.sub.N.sup.r=(Kμ.sub.N.sup.q+μ.sub.N(r))/(K+1)
where μ.sub.M.sup.q and μ.sub.N.sup.q are the aggregate means at node q, combining filters l=1, 2, . . . K (i.e. at column K) and μ.sub.M.sup.r and μ.sub.N.sup.r are the aggregate means at node r. Similarly, variances (σ.sub.M).sup.2 and (σ.sub.N).sup.2 can be updated from column K to column K+1 as follows:
(113) (σ.sub.M.sup.r).sup.2=(K.sup.2(σ.sub.M.sup.q).sup.2+(σ.sub.M(r)).sup.2+2Σ.sub.l=1.sup.KCOV.sub.M.sup.(l,r))/(K+1).sup.2 and (σ.sub.N.sup.r).sup.2=(K.sup.2(σ.sub.N.sup.q).sup.2+(σ.sub.N(r)).sup.2+2Σ.sub.l=1.sup.KCOV.sub.N.sup.(l,r))/(K+1).sup.2
where the covariance of the added filter at node r with all previous filters in the path must be taken into account. The updated metrics can then be used to recalculate the score S at node r.
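The incremental update can be checked against a from-scratch computation. The sketch below uses illustrative names and assumes the aggregate mean is the average of the individual filter means; it extends a K-filter path with one new filter using only that filter's own statistics and its covariances with the filters already on the path:

```python
def extend_path_stats(K, mu_q, var_q, mu_new_filter, var_new_filter, cov_with_path):
    # K filters selected so far; (mu_q, var_q) are the aggregate statistics
    # at node q. cov_with_path holds the covariance of the added filter
    # with each of the K filters already on the path.
    mu = (K * mu_q + mu_new_filter) / (K + 1)
    var = (K ** 2 * var_q + var_new_filter + 2 * sum(cov_with_path)) / (K + 1) ** 2
    return mu, var
```

Because only the new filter's row of the covariance matrix is touched, extending and scoring a path costs O(K) rather than the O(K.sup.2) of recomputing the aggregate variance from scratch, which is what makes accumulating scores during path propagation attractive.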
(114) As those skilled in the art will appreciate, the trellis 409 illustrated in
(115) Identifying the Second Optimised Set of Filters
(116) As discussed above, in order that a meaningful coarse fingerprint 309 can be generated from the fine fingerprint 43, the rows (or columns) of the fine fingerprint 43 have to be ordered so that there is some level of coherence in the fine fingerprint—i.e. filters that tend to produce similar results are ordered next to each other. In this way, filters that are generally correlated with each other are next to each other. This results in a fine fingerprint 43 that is less random in appearance—i.e. having larger areas of the same binary value (as illustrated by the fingerprint 43-2 shown in
(117) As discussed above, the covariance values that are determined for two filters give information on the correlation between the two filters. Therefore, we can determine the order of the filters in dependence upon the covariance values calculated for the n filters in the optimised set 45 of filters. This can be achieved, for example, using a reverse Cuthill-McKee ordering on the largest covariance values for the n filters. The determined ordering information is also provided to the user devices 1 with the first optimised set 45 of filters.
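A minimal pure-Python sketch of the reverse Cuthill-McKee ordering is given below, assuming the covariance matrix has first been thresholded into an adjacency structure linking each filter to the filters it is most strongly correlated with. In practice a library routine such as scipy.sparse.csgraph.reverse_cuthill_mckee would typically be used instead; the function and variable names here are illustrative.

```python
from collections import deque

def reverse_cuthill_mckee(adj):
    # adj maps each filter index to the set of filters it is strongly
    # correlated with (covariance above a chosen threshold).
    visited, order = set(), []
    # start each connected component from a lowest-degree node
    for start in sorted(adj, key=lambda v: len(adj[v])):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # visit neighbours in order of increasing degree
            for w in sorted(adj[v] - visited, key=lambda u: len(adj[u])):
                visited.add(w)
                queue.append(w)
    return order[::-1]
```

The breadth-first traversal from a low-degree node keeps strongly connected (i.e. strongly correlated) filters close together in the final ordering, which is the coherence property the coarse fingerprint relies on.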
(118) A similar training process can then be applied to determine the second set 221 of optimised filters. The main difference between this training process and the training process discussed above is that the filters are applied to fine fingerprints that are obtained for the matching and non-matching pairs of audio clips. Also, the optimisation process has a different goal.
(119)
(120) The optimisation goal for determining the second set 221 of filters is to find the filters that will result in a minimum subset of possibly matching database entries whilst not excluding the correct entry in the database 15, which will therefore reduce the number of comparisons required of the fine fingerprint 43. The database entries that must be searched in more detail (i.e. those for which a comparison between fine fingerprints will be performed) are those that fall below some second threshold γ.sup.(2) (which will be different from the γ threshold used above). The expected number (N.sub.r) of database entries below the threshold is given by:
(121) N.sub.r=D·½erfc((μ.sub.N.sup.(2)−γ.sup.(2))/(√{square root over (2)}σ.sub.N.sup.(2)))
where all the parameters are representative of the coarse fingerprints and not the fine fingerprints—as signified by the superscript (2). To quantify N.sub.r, the threshold γ.sup.(2) must be set. This threshold is set by defining an acceptable probability (P.sub.accept) of a false negative (falsely classifying a matching fingerprint as a non-matching fingerprint) from:
(122) P.sub.accept=½erfc((γ.sup.(2)−μ.sub.M.sup.(2))/(√{square root over (2)}σ.sub.M.sup.(2)))
(123) This equation can be inverted to yield:
γ.sup.(2)=μ.sub.M.sup.(2)+σ.sub.M.sup.(2)√{square root over (2)}erfc.sup.−1(2P.sub.accept)
which provides the threshold γ.sup.(2) for a given acceptable false negative rate. Inserting this threshold into the equation for the expected number (N.sub.r) of database entries below the threshold yields:
(124) N.sub.r=D·½erfc((μ.sub.N.sup.(2)−μ.sub.M.sup.(2)−√{square root over (2)}σ.sub.M.sup.(2)erfc.sup.−1(2P.sub.accept))/(√{square root over (2)}σ.sub.N.sup.(2)))
(125) In order to minimise this number, we need to find the combination of filters that will maximise the argument of this complementary error function. Thus, the score to be maximised in this second optimisation process is given by:
(126) S.sup.(2)=(μ.sub.N.sup.(2)−μ.sub.M.sup.(2)−√{square root over (2)}σ.sub.M.sup.(2)erfc.sup.−1(2P.sub.accept))/(√{square root over (2)}σ.sub.N.sup.(2))
(127) Again, the task is to find the combination of filters that maximises this score; where the aggregate mean and variances (μ.sup.(2).sub.N, μ.sup.(2).sub.M, σ.sup.(2).sub.N and σ.sup.(2).sub.M) for any combination of filters can be calculated using the means, variances and covariances determined for each filter of the combination during the training process illustrated in
(128) μ.sub.M.sup.(2)=(1/n)Σ.sub.i=1.sup.nμ.sub.M.sup.(2)(i) and μ.sub.N.sup.(2)=(1/n)Σ.sub.i=1.sup.nμ.sub.N.sup.(2)(i)
(129) Where n is the number of filters in the second set 221 of optimised filters. The aggregate variance (square of the standard deviation) becomes a combination of the variances of the individual filters belonging to the set 221 as well as the covariance between pairs of filters, as follows:
(130) (σ.sub.M.sup.(2)).sup.2=(1/n.sup.2)(Σ.sub.i=1.sup.n(σ.sub.M.sup.(2)(i)).sup.2+Σ.sub.l≠kCOV.sub.M.sup.(2)(l,k)) and (σ.sub.N.sup.(2)).sup.2=(1/n.sup.2)(Σ.sub.i=1.sup.n(σ.sub.N.sup.(2)(i)).sup.2+Σ.sub.l≠kCOV.sub.N.sup.(2)(l,k))
(131) Where COV.sup.(2)(l,k) is the covariance between filter l and filter k.
(132) The means, variances and covariances for individual filters for both matching and non-matching pairs of audio clips are determined from the training process discussed above with reference to
(133) μ.sub.M.sup.(2)(i)=(1/P)Σ.sub.p=1.sup.Pb.sub.M,p.sup.(2)(i) and μ.sub.N.sup.(2)(i)=(1/Q)Σ.sub.q=1.sup.Qb.sub.N,q.sup.(2)(i)
where b.sub.M,p.sup.(2)(i) and b.sub.N,q.sup.(2)(i) are the bit error rates produced by filter i for the p-th of the P matching and the q-th of the Q non-matching training pairs of coarse fingerprints, respectively.
(134) And the corresponding variances from:
(135) (σ.sub.M.sup.(2)(i)).sup.2=(1/P)Σ.sub.p=1.sup.P(b.sub.M,p.sup.(2)(i)−μ.sub.M.sup.(2)(i)).sup.2 and (σ.sub.N.sup.(2)(i)).sup.2=(1/Q)Σ.sub.q=1.sup.Q(b.sub.N,q.sup.(2)(i)−μ.sub.N.sup.(2)(i)).sup.2
(136) Moreover, the covariance value (COV.sub.M.sup.(2)(i,j)) between two filters (i and j) for matching pairs of audio clips can be determined from:
(137) COV.sub.M.sup.(2)(i,j)=(1/P)Σ.sub.p=1.sup.P(b.sub.M,p.sup.(2)(i)−μ.sub.M.sup.(2)(i))(b.sub.M,p.sup.(2)(j)−μ.sub.M.sup.(2)(j))
(138) And the covariance value (COV.sub.N.sup.(2)(i,j)) between two filters (i and j) for non-matching pairs of audio clips can be determined from:
(139) COV.sub.N.sup.(2)(i,j)=(1/Q)Σ.sub.q=1.sup.Q(b.sub.N,q.sup.(2)(i)−μ.sub.N.sup.(2)(i))(b.sub.N,q.sup.(2)(j)−μ.sub.N.sup.(2)(j))
(140) As before, it is not practical to calculate the above score (S.sup.(2)) for every possible combination of n filters from the set of 16,000 possible filters—the number of possible combinations is too large. However, as before, we can use Dynamic Programming to find the path having the maximum score (S.sup.(2)) using the trellis 409 and the path propagation techniques discussed above. This dynamic programming procedure will identify the best path through the trellis 409—which in turn identifies the best combination of filters to form the second set 221 of optimised filters that are used by the audio matching server 5 to generate the coarse fingerprints from fine fingerprints.
(141) As before, the path scoring can be accumulated during the dynamic programming path propagation to find the best path through the trellis—so it is not necessary to recalculate the score S.sup.(2) each time a new node (filter) is added to a candidate path. Instead the score is updated using the individual statistics for the filter associated with the new node r, column K+1, when coming from node q at column number K as before:
(142) μ.sub.M.sup.(2),r=(Kμ.sub.M.sup.(2),q+μ.sub.M.sup.(2)(r))/(K+1) and μ.sub.N.sup.(2),r=(Kμ.sub.N.sup.(2),q+μ.sub.N.sup.(2)(r))/(K+1)
where μ.sub.M.sup.(2),q and μ.sub.N.sup.(2),q are the aggregate means at node q and μ.sub.M.sup.(2)(r) and μ.sub.N.sup.(2)(r) are the individual means of the filter at node r. The aggregate variances are updated in the same way, taking into account the covariance of the added filter with all previous filters in the path:
(143) (σ.sub.M.sup.(2),r).sup.2=(K.sup.2(σ.sub.M.sup.(2),q).sup.2+(σ.sub.M.sup.(2)(r)).sup.2+2Σ.sub.l=1.sup.KCOV.sub.M.sup.(2)(l,r))/(K+1).sup.2 and (σ.sub.N.sup.(2),r).sup.2=(K.sup.2(σ.sub.N.sup.(2),q).sup.2+(σ.sub.N.sup.(2)(r)).sup.2+2Σ.sub.l=1.sup.KCOV.sub.N.sup.(2)(l,r))/(K+1).sup.2
(144) The updated metrics can then be used to recalculate the score S.sup.(2) at node r.
(145) Modifications and Further Embodiments
(146) An embodiment has been described above illustrating the way in which fingerprints may be created for the identification of an audio signal in an audio database. As those skilled in the art will appreciate, various modifications and improvements can be made to the above embodiment and some of these modifications will now be described.
(147) In the above embodiment, the user device generated a fine fingerprint which it transmitted to the audio matching server which generated a coarse fingerprint from the fine fingerprint. In another embodiment, the user device itself may calculate the coarse fingerprint and send it together with the fine fingerprint to the audio matching server.
(148) In the above embodiments, a coarse fingerprint was generated from the fine fingerprint. This is particularly beneficial in the scenario where a user device determines the fine fingerprint and sends it to a remote server for comparison with the database entries. In other embodiments where the user device calculates both the coarse fingerprint and the fine fingerprint, the coarse fingerprint can be determined from the spectrogram of the captured audio rather than from the fine fingerprint. In this case, the second set 221 of optimised filters would be trained using the second score described above—but based on binarised vectors obtained by applying the filters to the spectrogram rather than to the fine fingerprint. This would also be the case if the user device transmitted the fine fingerprint and the spectrogram to the remote server—which then calculated the coarse fingerprint from the received spectrogram. However, this latter possibility is not preferred as it requires the spectrogram (which is a large data structure) to be transmitted from the user device to the server.
(149) In the above embodiments, the user device or the audio matching server generated a coarse fingerprint from a fine fingerprint using a set of optimised filters that are applied to the fine fingerprint. In a simpler embodiment, the coarse fingerprint could be generated simply by sub-sampling the fine fingerprint or by averaging the fine fingerprint. However, it is preferred to apply the above described second set of optimised filters to the fine fingerprint as it has been found that the resulting coarse fingerprint is better at minimising the number of database entries that are found to be possibly matching whilst minimising false positives and false negatives.
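As a sketch of the simpler alternative mentioned above, a coarse fingerprint could be formed by averaging blocks of fine-fingerprint time steps and re-binarising by majority vote. The function name and the block size are illustrative, and the fine fingerprint is represented here as a plain list of 0/1 rows:

```python
def coarse_by_averaging(fine, factor=10):
    # Collapse each run of `factor` fine-fingerprint columns (time steps)
    # into one coarse bit per row by majority vote; any ragged tail of
    # fewer than `factor` columns is dropped.
    coarse = []
    for row in fine:
        t = len(row) - len(row) % factor
        coarse.append([1 if sum(row[i:i + factor]) * 2 >= factor else 0
                       for i in range(0, t, factor)])
    return coarse
```

Sub-sampling (keeping every tenth column) would be cheaper still, but averaging uses all the fine-fingerprint bits, so a single noisy bit cannot flip a coarse bit.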
(150) In embodiments where the size of the database is relatively small, the audio matching server and the database may form part of the user device itself. In this case, there is no need for the user device to transmit any fingerprint data over the telecommunications network or over the computer network. This data would simply be sent between the different software components being run in the user device (although any results of the matching may be transmitted over the network to a server).
(151)
(152) In the above embodiments, the optimisation process set an acceptable false positive rate and then found the set of filters that minimised the false negative rate. In another embodiment, the optimisation process may set an acceptable false negative rate and then find the set of filters that minimises the false positive rate. In a further embodiment the optimisation process may set an acceptable false positive rate and an acceptable false negative rate and then find the set of filters that minimises some other cost function.
(153) In the above embodiment, the dynamic programming processes selected the path through the trellis 409 having the highest score. As those skilled in the art will appreciate, the best or optimum path that is chosen does not have to be the one having the highest score—for example the path having the second highest or the third highest score could be used instead.
(154) In the above embodiment, a user device captured sounds using a microphone and the audio samples were processed using a software application stored on the user device. As those skilled in the art will appreciate, some or all of this processing may be performed by dedicated hardware circuits, although software is preferred due to its ability to be added to the portable user device after manufacture and its ability to be updated once loaded. The software for causing the portable user device to operate in the above manner may be provided as a signal or on a carrier such as a compact disc or other carrier medium. Additionally, a range of other portable devices may be used, such as laptop computers, PDAs, tablet computers and the like. Similarly, the software forming part of the audio matching server may be replaced with suitable hardware circuits—such as Application Specific Integrated Circuits.
(155) The above embodiments have described a fingerprint based audio matching system. This system may also be used together with a watermarking type of audio matching system that detects watermarks that have been hidden in the audio. In particular, if no watermark can be found in some captured audio, then the above fingerprint audio recognition can be used to identify the captured audio.
(156) In the above embodiments, the sets of optimised filters used five different types of filter. In other embodiments more or fewer types of filter may be used. Further, it is not essential to use filters that are rectangular in shape—other irregular shapes of filters could be used (such as “L” shaped filters). The shape of filter just defines the neighbouring values in the spectrogram (or in the fine fingerprint) that are weighted by the corresponding coefficient in the filter and then combined together.
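The last point can be made concrete with a short sketch: applying a filter at a given position multiplies each covered value by its coefficient, sums the results and binarises by sign. Any coefficient layout, rectangular or not, fits this scheme, since an “L” shape is simply a bounding rectangle with some zero coefficients. The function name is illustrative:

```python
def apply_filter(grid, coeffs, top, left):
    # The filter 'shape' defines which neighbouring values of the grid
    # (spectrogram or fine fingerprint) are weighted and combined; the
    # coefficient layout is anchored at (top, left).
    total = sum(coeffs[r][c] * grid[top + r][left + c]
                for r in range(len(coeffs))
                for c in range(len(coeffs[0])))
    return 1 if total > 0 else 0
```

For example, a 1×2 filter with coefficients [1, −1] outputs 1 wherever the covered value exceeds its right-hand neighbour.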
(157) In the above embodiment, when generating the fine fingerprint, each filter of the first set of optimised filters was stepped along the spectrogram one time step at a time. This meant that the fine fingerprint had the same temporal dimension as the original spectrogram. As those skilled in the art will appreciate, the fine fingerprint could omit some of these data points—at the start or end of the spectrogram. A larger step size could also be used—for example one time point could be skipped at each step. In this case the fine fingerprint would have a temporal duration half that of the spectrogram. So if the spectrogram had 500 time points then the generated fine fingerprint would have 250 time points.
(158) In the embodiment described above, the coarse fingerprint was generated with a time resolution 1/10.sup.th that of the spectrogram. That is, 10 time points were skipped between steps when the second set of optimised filters were stepped across the spectrogram. As those skilled in the art will appreciate, other step sizes could of course be used to achieve different compression of the data in the time dimension.
(159) In the above embodiments, the audio signal captured by the user device was an acoustic signal. In other embodiments, the user device may capture the audio signal as an electromagnetic signal received via the antenna of the user device; or in the case that the user device is not a portable device and is, for example, a personal computer or a set-top-box or a smart television, the audio signal may be captured via a signal received over a broadcast television network (e.g. a satellite network, a cable network, an ADSL network or the like), the Internet or some other computer network.
(160) In the above embodiments each entry in the database contained a coarse fingerprint and a fine fingerprint. In another embodiment, each database entry may not contain the coarse fingerprint—which may instead be generated when needed from the fine database fingerprint. The coarse database fingerprint may be generated from the fine database fingerprint in a number of different ways—just as the coarse query fingerprint can be determined in a number of different ways from the fine query fingerprint. These different ways of determining the coarse fingerprint from the fine fingerprint are discussed above and will not be repeated again. Needless to say, the technique used to generate the coarse query fingerprint should be the same as the technique that is used to generate the coarse database fingerprint.