Acoustic fingerprint extraction and matching

10089994 · 2018-10-02

Abstract

A method of acoustic matching of audio recordings by means of acoustic fingerprinting is disclosed. An acoustic fingerprint is extracted from a fragment of an audio recording. The fingerprint represents a highly discriminative, compact digital digest (acoustic hash) of the acoustic recording and consists of smaller digital entities called acoustic sub-fingerprints (acoustic hash-words), computed from perceptually essential properties of the acoustic recording. Two acoustic fingerprints corresponding to two audio fragments are matched to determine the degree of acoustic similarity of the two audio fragments.

Claims

1. An automated method for extracting an acoustic sub-fingerprint from an audio signal fragment, said method comprising: using at least one computer processor to perform the steps of: a: dividing an audio signal into a plurality of time-separated signal frames (frames) of equal time lengths of at least 0.5 seconds, wherein all frames overlap in time by at least 50% with at least one other frame, but wherein at least some frames are non-overlapping in time with other frames; b: selecting a plurality of non-overlapping frames to produce at least one cluster of frames, each selected frame in a given cluster of frames thus being a cluster frame; wherein the minimal distance between centers of said cluster frames is equal or greater than a time-length of one frame; c: decomposing each cluster frame into a plurality of substantially overlapping frequency bands to produce a corresponding plurality of frequency band signals, wherein said frequency bands overlap in frequency by at least 50% with at least one other frequency band, and wherein at least some frequency bands are non-adjacent frequency bands that do not overlap in frequency with other frequency bands; d: for each cluster frame, calculating a quantitative value of a selected signal property of frequency band signals of selected frequency bands of that cluster frame, thus producing a plurality of calculated signal property values, said selected signal property being any of: average energy, peak energy, energy valley, zero crossing, and normalized energy; e: using a feature vector algorithm and said calculated signal property values of said cluster frames to produce a feature-vector of said cluster; f: using a sub-fingerprint algorithm to digitize said feature-vector of said cluster and produce said acoustic sub-fingerprint.

2. The method of claim 1, wherein said feature vector algorithm performs the steps of: over at least two of said cluster frames, within individual cluster frames, selecting pairs of non-adjacent frequency bands, and calculating a difference between said calculated signal property values of said pairs of non-adjacent frequency bands, thus obtaining within-frame non-adjacent band signal property delta values; within said individual cluster frames, combining said within-frame non-adjacent band signal property delta values to produce an individual frame delta set for said individual cluster frame; selecting pairs of said cluster frames, each cluster frame having a position within said cluster, and using said position within said cluster to calculate derivatives of corresponding pairs of said individual frame delta sets, thus producing between-frame delta derivative values; and producing said feature-vector of said cluster by combining said between-frame delta derivative values.

3. The method of claim 1, wherein said feature vector algorithm performs the steps of: within individual cluster frames, selecting pairs of non-adjacent frequency bands, and obtaining within-frame non-adjacent band signal property delta values by calculating differences between signal property values of said selected pairs; within individual cluster frames, further combining a plurality of said within-frame non-adjacent band signal property delta values to produce an individual frame delta set; within individual cluster frames, further obtaining a within-frame delta derivative value by calculating a difference between said within frame non-adjacent band signal property delta values at two positions of said individual frame delta set; producing said feature vector by combining, over said cluster frames, said within-frame delta derivative values.

4. The method of claim 1, wherein said feature vector algorithm performs the steps of: within individual cluster frames, selecting pairs of non-adjacent frequency bands, and obtaining within-frame non-adjacent band signal property delta values by calculating differences between their signal property values; within individual cluster frames, further combining a plurality of said within-frame non-adjacent band signal property delta values to produce an individual frame delta set; producing said feature vector by combining, over said cluster frames, said frame delta sets from said individual cluster frames.

5. The method of claim 1 wherein said feature-vector of said cluster comprises a vector comprising positive and negative feature-vector numeric values, and said sub-fingerprint algorithm digitizes said feature-vector of said cluster to a simplified vector of binary numbers by setting positive feature vector numeric values to 1, and other feature vector numeric values to 0, thus producing a digitized acoustic sub-fingerprint.

6. The method of claim 1, further used in an automated method for extracting a timeless fingerprint characterizing at least a fragment of an audio signal, said method comprising: using at least one computer processor to perform the steps of: a: dividing any of an audio signal, or a fragment of said audio signal with a time length greater than 3 seconds, into a plurality of time-overlapping signal frames (frames); b: creating a plurality of frame clusters, each frame cluster (cluster of frames) comprising at least two non-overlapping frames; wherein each frame cluster comprises frames (cluster frames) that are disjoint, non-adjacent, and substantially spaced from other frame cluster frames; c: selecting frame clusters, and using the method of claim 1 to compute sub-fingerprints for at least some of said selected frame clusters, thus producing a set of sub-fingerprints, wherein each selected said frame cluster produces a single sub-fingerprint; d: removing sub-fingerprints having repetitive values from said set of sub-fingerprints, thus producing a refined set of sub-fingerprints for this plurality of frame clusters; e: producing said timeless fingerprint by combining, in an arbitrary order, and without any additional information, at least some selected sub-fingerprints from said refined sub-fingerprint set.

7. The method of claim 6, wherein said sub-fingerprints do not carry information about a time location or position of said selected frame clusters relative to said at least a fragment of an audio signal; and wherein said sub-fingerprints do not carry information about a time location or position of said selected frame clusters relative to a time location or position of other clusters of frames used to generate other sub-fingerprints comprising said timeless fingerprint.

8. The method of claim 6 wherein said at least some selected sub-fingerprints from said refined sub-fingerprint set are combined in an arbitrary manner which is independent from an order in which corresponding frame clusters of said audio signal appear in said at least a fragment of an audio signal.

9. The method of claim 6, further used in a method for numerically calculating a degree of acoustic similarity of a first and a second audio sample, said method comprising: using at least one computer processor to perform the steps of: a: splitting said first audio sample into a set of first sample fragments, and splitting the second audio sample into a set of second sample fragments, said first audio sample and said second audio sample having a time duration of at least 3 seconds; b: producing a set of first audio sample timeless fingerprints by using acoustic properties of all said first sample fragments to compute a set of first acoustic sub-fingerprints, and combining selected said first acoustic sub-fingerprints in an arbitrary order; and producing a set of second audio sample timeless fingerprints by using acoustic properties of all said second sample fragments to compute a set of second acoustic sub-fingerprints, and combining selected said second acoustic sub-fingerprints in an arbitrary order; c: producing a first timeless super-fingerprint by selecting at least some first audio sample timeless fingerprints from said set of first audio sample timeless fingerprints, and combining them in an arbitrary order; and producing a second timeless super-fingerprint by selecting at least some second audio sample timeless fingerprints from said set of second audio sample timeless fingerprints, and combining them in an arbitrary order; d: matching said first and second timeless super-fingerprints by pairing first audio sample timeless fingerprints from said first timeless super-fingerprint with second audio sample timeless fingerprints from said second timeless super-fingerprint, thus producing a plurality of fingerprint pairs, and for each fingerprint pair in said plurality of fingerprint pairs, calculating how many identical sub-fingerprints (hits) are contained in both fingerprints of the pair, thus producing a hit-list; e: calculating, using said hit-list, a degree of acoustic similarity of said first and second audio samples.

10. The method of claim 9 wherein: relative positions and temporal relations of any of said sub-fingerprints comprising any of said timeless fingerprints are unknown; and said sub-fingerprints do not carry temporal information about their location within any corresponding sample fragments of any said audio samples; and relative positions and temporal relations of any of said timeless fingerprints in any of said timeless super-fingerprints are unknown; and said timeless fingerprints in any of said timeless super-fingerprints do not carry temporal information about their location relative to the other timeless fingerprints of other said timeless super-fingerprints.

11. The method of claim 9, further omitting consecutive sub-fingerprints with repetitive values when combining, in step b, any of said first acoustic sub-fingerprints or said second acoustic sub-fingerprints to produce any of said first audio sample timeless fingerprints or said second audio sample timeless fingerprints.

12. The method of claim 9, further calculating said degree of acoustic similarity by determining: if a number of hits in said hit-list exceeds a predetermined threshold; or if a maximal number of hits in said hit-list exceeds a predetermined threshold.

13. The method of claim 9, further calculating said degree of acoustic similarity by calculating a sum of hit-list values in positions wherein a number of hits exceeds a predetermined threshold.

14. The method of claim 9, further calculating said degree of acoustic similarity by: normalizing each value of said hit-list by dividing said value by a total amount of sub-fingerprints contained in a shortest timeless fingerprint of a corresponding fingerprint pair related to said value, thus producing a normalized hit-list; and calculating a sum of selected normalized hit-list values.

15. The method of claim 9, further calculating said degree of acoustic similarity by: normalizing each value of said hit-list by dividing said value by an amount of sub-fingerprints contained in a shortest timeless fingerprint of a corresponding fingerprint pair related to said value, thus producing a normalized hit-list; and calculating a sum of those normalized hit-list values in positions wherein a number of hits surpasses a predetermined threshold and/or a normalized value surpasses a predetermined threshold.

16. The method of claim 9, further calculating said degree of acoustic similarity by: normalizing each value of said hit-list by dividing said value by a total amount of sub-fingerprints contained in a shortest timeless fingerprint of a corresponding fingerprint pair related to said value, thus producing a normalized hit-list; calculating an amount of positions in said normalized hit-list where a number of hits surpasses a predetermined threshold and/or a normalized value surpasses a predetermined threshold; and normalizing said amount of positions by a total amount of values in said normalized hit-list.

17. The method of claim 9, further calculating said degree of acoustic similarity by: normalizing each value of said hit-list by dividing said value by a total amount of sub-fingerprints contained in a shortest timeless fingerprint of a corresponding fingerprint pair related to said value, thus producing a normalized hit-list; and calculating any of a peak value, median value, and average value of selected values in said normalized hit-list.

18. An automated method for extracting a timeless fingerprint characterizing at least a fragment of an audio signal, said method comprising: using at least one computer processor to perform the steps of: a: dividing any of an audio signal, or a fragment of said audio signal with a time length greater than 3 seconds, into a plurality of time-overlapping signal frames (frames); b: creating a plurality of frame clusters, each frame cluster comprising at least two non-overlapping frames; wherein each frame cluster comprises frames that are disjoint, non-adjacent, and substantially spaced from other frame cluster frames; c: selecting frame clusters, and computing sub-fingerprints for at least some of said selected frame clusters, thus producing a set of sub-fingerprints, wherein each selected frame cluster produces a single sub-fingerprint; d: removing sub-fingerprints having repetitive values from said set of sub-fingerprints, thus producing a refined set of sub-fingerprints for this plurality of frame clusters; e: producing said timeless fingerprint by combining, in an arbitrary order, and without any additional information, at least some selected sub-fingerprints from said refined sub-fingerprint set.

19. A method for numerically calculating a degree of acoustic similarity of a first and a second audio sample, said method comprising: using at least one computer processor to perform the steps of: a: splitting said first audio sample into a set of first sample fragments, and splitting the second audio sample into a set of second sample fragments, said first audio sample and said second audio sample having a time duration of at least 3 seconds; b: producing a set of first audio sample timeless fingerprints by using acoustic properties of all said first sample fragments to compute a set of first acoustic sub-fingerprints, and combining selected said first acoustic sub-fingerprints in an arbitrary order; and producing a set of second audio sample timeless fingerprints by using acoustic properties of all said second sample fragments to compute a set of second acoustic sub-fingerprints, and combining selected said second acoustic sub-fingerprints in an arbitrary order; c: producing a first timeless super-fingerprint by selecting at least some first audio sample timeless fingerprints from said set of first audio sample timeless fingerprints, and combining them in an arbitrary order; and producing a second timeless super-fingerprint by selecting at least some second audio sample timeless fingerprints from said set of second audio sample timeless fingerprints, and combining them in an arbitrary order; d: matching said first and second timeless super-fingerprints by pairing first audio sample timeless fingerprints from said first timeless super-fingerprint with second audio sample timeless fingerprints from said second timeless super-fingerprint, thus producing a plurality of fingerprint pairs, and for each fingerprint pair in said plurality of fingerprint pairs, calculating how many identical sub-fingerprints (hits) are contained in both fingerprints of the pair, thus producing a hit-list; e: calculating, using said hit-list, a degree of acoustic similarity of said first and second audio samples.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 shows the segmentation of input time-domain audio signal into overlapping frames.

(2) FIG. 2 shows applying a window function to the frame signal.

(3) FIG. 3 shows decomposition of the frame signal into semi-logarithmically scaled, substantially overlapping frequency bands.

(4) FIG. 4 shows a cluster of frames consisting of two non-overlapping, non-adjacent signal frames.

(5) FIG. 5 shows calculating the delta (difference) value of signal property values (e.g. energy) of non-adjacent bands of a single frame.

(6) FIG. 6 shows calculating the derivative (difference) value from two delta values of two non-adjacent frames i,j comprising a single cluster.

(7) FIG. 7 shows a simplified flowchart of the sub-fingerprint extraction procedure.

(8) FIG. 8 shows a fingerprint of a fragment combined of sub-fingerprints in arbitrary order.

(9) FIG. 9 shows matching of two fingerprints.

(10) FIG. 10 shows producing super-fingerprint of a long audio recording by combining its fingerprints in arbitrary order.

(11) FIG. 11 Shows matching of two super-fingerprints.

DETAILED DESCRIPTION OF THE INVENTION

(12) The present invention, in some embodiments thereof, relates to a method and system of acoustic fingerprinting that allow determining the acoustic similarity of audio recordings. Note that all steps disclosed herein are intended to be automatically implemented by one or more computer processors.

(13) At a pre-processing stage, it is often useful to first convert any digital (sampled) multi-channel audio data to a single (mono) channel and downsample the audio data to a low sampling rate, thus providing sufficient audio bandwidth and acoustic data for the human auditory system to recognize the audio.

(14) In a preferred embodiment, the selected operational sampling rate may be 8000 Hz, which corresponds to a typical telephony audio channel bandwidth.

(15) Acoustic Sub-Fingerprint Extraction

(16) The time-domain signal of an audio fragment is segmented into overlapping frames, and the frame signals are weighted with a suitable window function such as the Hann or Hamming window.

(17) The segmentation of the audio fragment signal into overlapping frames is depicted in FIG. 1. The time-domain audio signal 101 is shown as a waveform on a time-amplitude plot. The audio signal is segmented into signal frames 103 having sequential numbers i−1, i, i+1, . . . and overlapping by more than 50% with each other. Weighting the frame time-domain signal with a window function is depicted in FIG. 2. Frame signal sample data 201 is multiplied by the window function 203 to produce the weighted signal frame 205.

(18) In a preferred embodiment, the frame size is selected to be large, 4096 samples (which corresponds to approximately 0.5 seconds at an 8000 Hz sampling rate), with the aim of providing substantial signal averaging and increased resistance to noise. The overlap factor is selected to be large as well. In a preferred embodiment, it is set to 64 (a hop of 64 samples between consecutive frames), which leads to the extraction of approximately 125 signal frames per second of audio. A Hann window is used as the weighting function.
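The segmentation and windowing stages above can be sketched as follows (a minimal illustration, assuming NumPy; the function name is hypothetical, while the frame size and hop follow the preferred embodiment):

```python
import numpy as np

def segment_frames(signal, frame_size=4096, hop=64):
    """Split a mono signal into overlapping frames and weight each
    frame with a Hann window.

    frame_size=4096 and hop=64 match the preferred embodiment:
    ~0.5 s frames at 8000 Hz, ~125 frames per second of audio."""
    window = np.hanning(frame_size)
    n_frames = (len(signal) - frame_size) // hop + 1
    return np.stack([signal[n * hop : n * hop + frame_size] * window
                     for n in range(n_frames)])
```

Note that consecutive frames share 4032 of their 4096 samples, i.e. overlap by far more than the 50% minimum the claims require.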

(19) The time-domain audio signal of each frame is then decomposed into several frequency bands, which significantly overlap with each other. Decomposition of the signal into n semi-logarithmically scaled frequency bands with >50% overlap is illustrated in FIG. 3. The signal frame 301 is decomposed into frequency bands 303 overlapping with each other and having semi-logarithmically scaled bandwidth.

(20) In a preferred embodiment of the algorithm, the frequency bands are constructed so that each next band contains at least half of the bandwidth of the previous band. The decomposition can be performed using different methods such as filter bank, FFT decomposition, etc.

(21) The bands can be spaced on a linear or logarithmic scale. A logarithmic scale better accommodates the sliding of spectral details up and down that results from time/speed modification.

(22) In a preferred embodiment, FFT decomposition is performed on each frame (4096 samples), and the obtained frequency spectrum is divided into 16 significantly overlapping (more than 50%) frequency bands with semi-logarithmic bandwidth scaling (i.e. having bandwidth increasing with frequency).
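A sketch of the band decomposition described above follows. The exact band edges are assumptions for illustration; the patent only requires more than 50% overlap between bands and bandwidth that increases with frequency (semi-logarithmic scaling):

```python
import numpy as np

def band_energies(frame, sr=8000, n_bands=16, f_min=300.0):
    """Average FFT power in overlapping, semi-logarithmically scaled
    frequency bands. Band edges here are illustrative only."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # Geometrically spaced center frequencies up to ~80% of Nyquist.
    centers = np.geomspace(f_min, 0.8 * sr / 2, n_bands)
    energies = []
    for fc in centers:
        # Each band spans roughly one octave; since neighbouring
        # centers are much closer than an octave apart, adjacent
        # bands overlap by well over 50%.
        mask = (freqs >= fc / np.sqrt(2)) & (freqs < fc * np.sqrt(2))
        energies.append(spectrum[mask].mean() if mask.any() else 0.0)
    return np.asarray(energies)
```

A filter bank could equally be used in place of the FFT, as paragraph (20) notes; the output in either case is one signal property value per band per frame.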

(23) The invention's frequency band construction method, using large band overlap, has a significant advantage over the non-overlapping, disjoint-band approach, and demonstrates improved resistance to playback speed variation, which inevitably causes salient spectral features to slip from one frequency to another due to spectrum stretching.

(24) The number of bands in a single frame should be selected in consideration of the required bit-length of a single sub-fingerprint and other related factors, as described hereinafter.

(25) The long-term approach of the invention's method comprises using a cluster of several disjoint, substantially distant, non-overlapping signal frames to form a single acoustic sub-fingerprint. Namely, for the reference frame having sequential number i in the source audio fragment, in order to form a cluster of frames, one or more additional preceding disjoint signal frames are selected, such as frames with numbers i−n, i−2n, . . . , where n is large enough to provide at least one frame size of separation between two frames in the cluster. The selected frames should not overlap with each other, in order to contribute independent, uncorrelated information to their combination in the cluster.

(26) In a preferred embodiment, each cluster is formed by three frames. Two, four, or more frames can be used in other possible implementations, depending on the required bit-length of a single sub-fingerprint, the selected number of bands in one frame, and other related factors.

(27) In a preferred embodiment, the distance between centers of two closest frames in the cluster is set to be twice the frame size.

(28) FIG. 4 illustrates two non-overlapping, time-separated signal frames of an audio signal 401, namely, the reference frame 403 having number i and its preceding frame 405 having number i−n, forming a single cluster 407.

(29) A single cluster of frames is used to produce a single sub-fingerprint. In a preferred embodiment, each consecutive cluster of frames, corresponding to each consecutive reference frame, is used to extract a corresponding sub-fingerprint. In other embodiments, the clusters used for sub-fingerprint extraction can be selected based on a specific criterion (e.g. frame signal level thresholding) or even randomly. The exact mechanism and criteria for selecting clusters suitable for sub-fingerprint extraction lie outside the scope of this description. The fingerprint matching and searching approaches disclosed hereinafter rely on neither the succession nor the contiguity of the sub-fingerprints in the fingerprint.

(30) In order to form a single sub-fingerprint, the spectral band data contained in the frames of the cluster is first quantitatively characterized. The characterization is based on a selected perceptually essential property of the band signal (e.g. average band energy). A feature-vector is then constructed by applying a specific calculation method to the numerical values of the spectral band data property and combining the calculated values into one set. The constructed feature-vector is then converted into sub-fingerprint binary data using a pre-defined computation rule.

(31) Various signal properties can be used to generate (extract) the feature-vector and the corresponding sub-fingerprint data from the data carried by the band signals of the frames in the cluster. Examples of signal properties that can be used for this purpose include zero-crossing rate, peak energy, and average energy. In the simplest implementation, the feature-vector can be produced by combining the calculated band signal property values into one vector, and the sub-fingerprint can then be derived from this simple feature-vector by rounding its values to the closer of two pre-defined levels to obtain the corresponding 0's and 1's for the data-bits of the sub-fingerprint.

(32) The invention is based, in part, on the insight, obtained from experimental work, that the difference (delta) of average energies in non-adjacent bands of a frame represents a robust, resistant, and highly discriminative signal property. Its derivative over time, i.e. the quantitative rate of change of said delta from one frame of the cluster to another (note that the frames are non-overlapping and time-separated), has been selected as the basis for feature-vector extraction in a preferred embodiment. The derivative is well preserved under various real-world audio transformations such as EQ, lossy audio codec compression, and even over-the-air transduction, and therefore represents a suitable fingerprinting feature.

(33) In a preferred embodiment, the feature-vector of the cluster and its corresponding sub-fingerprint can be computed by performing the following procedure:

(34) a) calculating signal property values (e.g. average energy) E_k^i of the band signals in each frame of the cluster (where i denotes the frame number and k denotes the number of the frequency band in the frame);

(35) b) calculating differences (deltas) Δ_{k,l}^i of the calculated signal property values E_k^i in selected non-adjacent bands k, l of each frame i:
Δ_{k,l}^i = E_k^i − E_l^i, |l − k| > 1;

(36) c) calculating differences (derivatives) ∂Δ_{k,l}^{i,j} of the delta values over two cluster frames i, j (along the time axis) in the corresponding bands k, l:
∂Δ_{k,l}^{i,j} = Δ_{k,l}^i − Δ_{k,l}^j, |i − j| >> 1,

(37) where i, j are non-overlapping, time-separated frames;

(38) d) combining the calculated derivative values for different combinations of bands k, l and frames i, j within the cluster into one set to produce the feature-vector V representing this cluster:
V = {∂Δ_{k,l}^{i,j}};

(39) e) computing the sub-fingerprint binary data F by quantizing the calculated values of the feature-vector V (the derivative values).
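Steps (a) through (d) can be sketched as follows (the band pairing and function name are illustrative assumptions, using 0-based indices; step (e), quantization, is covered later):

```python
import numpy as np

# Hypothetical non-adjacent band pairs (k, l) with |l - k| > 1,
# mirroring the preferred embodiment's (1,5), (2,6), ..., (12,16)
# pairing, written 0-based here.
BAND_PAIRS = [(k, k + 4) for k in range(12)]

def cluster_feature_vector(cluster_energies):
    """cluster_energies: one array of band signal property values
    (e.g. 16 average band energies) per cluster frame.

    Returns the between-frame delta derivatives as one flat
    feature-vector."""
    # Step b: within-frame deltas over non-adjacent band pairs.
    deltas = [np.array([e[k] - e[l] for k, l in BAND_PAIRS])
              for e in cluster_energies]
    # Steps c-d: derivatives of the deltas between adjacent cluster
    # frames, combined into a single feature-vector.
    return np.concatenate([deltas[i] - deltas[i + 1]
                           for i in range(len(deltas) - 1)])
```

With three cluster frames and 12 band pairs, this yields the 24-value feature-vector of the preferred embodiment.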

(40) In another possible embodiment, the derivatives in step (c) are calculated on the same frame, but for different pairs of bands: ∂Δ_{(k,l),(m,n)}^i = Δ_{k,l}^i − Δ_{m,n}^i, wherein i is a frame number, and k, l, m, n are band numbers. Other variations of this method are possible.

(41) FIG. 5 depicts the calculation of a single energy delta Δ_{k,l}^i using two values of a selected signal property (e.g. band signal energy) of two non-adjacent frequency bands l, k of frame i, namely:
Δ_{k,l}^i = E_k^i − E_l^i.

(42) FIG. 6 depicts the calculation of the derivative (difference) ∂Δ_{k,l}^{i,j} of two deltas Δ_{k,l}^i, Δ_{k,l}^j in two non-adjacent frames i, j comprising a single cluster.

(43) In some embodiments of the invention, the feature vector algorithm and the at least one computer processor can perform the following steps. Over at least two of the cluster frames, and within individual cluster frames, the algorithm selects pairs of non-adjacent frequency bands and calculates the difference between the calculated signal property values of each pair of non-adjacent frequency bands. This lets the algorithm obtain within-frame non-adjacent band signal property delta values. Then, within the individual cluster frames, the algorithm combines these within-frame non-adjacent band signal property delta values to produce an individual frame delta set for that individual cluster frame. The algorithm then selects pairs of these cluster frames (each cluster frame having a position within the cluster), and uses these positions within the cluster to calculate derivatives of the corresponding pairs of individual frame delta sets. This process lets the algorithm produce the between-frame delta derivative values. The algorithm can then produce the feature-vector of the cluster by combining these between-frame delta derivative values.

(44) More specifically, in a preferred embodiment, when the selected signal property is the average energy, the energy deltas Δ_{k,l}^i are calculated in each of the three non-adjacent frames i = i1, i2, i3 comprising the cluster, as the difference of the average energies E_k^i in non-adjacent bands with indexes (k,l) ∈ {(1,5), (2,6), . . . , (12,16)}. As a result, 12 energy deltas Δ_{k,l}^i are extracted from each frame i = i1, i2, i3 of the cluster. Derivatives of the deltas are then calculated as the difference of the deltas in adjacent cluster frames, i.e.
∂Δ_{k,l}^{i1,i2} = Δ_{k,l}^{i1} − Δ_{k,l}^{i2},
∂Δ_{k,l}^{i2,i3} = Δ_{k,l}^{i2} − Δ_{k,l}^{i3}.

(45) As a result, 24 derivative values are obtained, and the feature-vector V is constructed:
V = {∂Δ_{1,5}^{i1,i2}, ∂Δ_{2,6}^{i1,i2}, . . . , ∂Δ_{12,16}^{i1,i2}, ∂Δ_{1,5}^{i2,i3}, ∂Δ_{2,6}^{i2,i3}, . . . , ∂Δ_{12,16}^{i2,i3}}.

(46) Digitization: The feature-vector of the cluster will often initially comprise positive and negative numeric values. To simplify it further, the sub-fingerprint algorithm can digitize this cluster feature-vector into a vector of binary numbers, for example by setting positive feature-vector values to 1 and other values to 0. This produces a digitized acoustic sub-fingerprint.

(47) More specifically, in a preferred embodiment, the acoustic sub-fingerprint F data bits can be extracted by quantizing the sign of the feature-vector V values (i.e. quantizing the derivatives): F = sign(V). Namely, a bit is set to 1 if the corresponding derivative ∂Δ_{k,l}^{i,j} is positive, and to 0 otherwise. As a result, the extracted sub-fingerprint F consists of 24 bits (3 bytes).
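Sign quantization of the 24-value feature-vector into a 3-byte sub-fingerprint might look like this (the bit-packing order is an arbitrary illustrative choice):

```python
def sub_fingerprint(feature_vector):
    """Quantize each feature-vector value to 1 if positive, else 0,
    and pack the bits into one integer (24 values -> 24 bits, 3 bytes)."""
    bits = 0
    for v in feature_vector:
        bits = (bits << 1) | (1 if v > 0 else 0)
    return bits
```

For example, `sub_fingerprint([0.5, -1.2, 3.1])` packs the signs as binary 101, i.e. 5.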

(48) Other feature vector construction methods: In some embodiments, the feature vector algorithm can instead select, within individual cluster frames, pairs of non-adjacent frequency bands, and obtain within-frame non-adjacent band signal property delta values by calculating the differences between the signal property values of these pairs. The algorithm can then combine, within individual cluster frames, a plurality of these within-frame delta values to produce an individual frame delta set, and produce the feature vector by combining, over the cluster frames, the frame delta sets from the individual cluster frames.

(49) More specifically, in some embodiments, each cluster is formed by two frames, and the feature-vector V is constructed directly from the values of the in-frame deltas: V = {Δ_{k,l}^i}, where i = i1, i2 and (k,l) ∈ {(1,5), (2,6), . . . , (12,16)}. This method also yields a feature-vector having 24 values. The sub-fingerprint F is extracted by quantizing the sign of the feature-vector V values: F = sign(V). As a result, the extracted sub-fingerprint F consists of 24 bits (3 bytes).
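The two-frame variant can be sketched under the same illustrative 0-based band pairing (names are assumptions, not from the patent):

```python
import numpy as np

# Illustrative non-adjacent band pairs, 0-based.
BAND_PAIRS = [(k, k + 4) for k in range(12)]

def two_frame_feature_vector(energies_i1, energies_i2):
    """Two-frame cluster variant: the feature-vector is simply the
    in-frame deltas of both cluster frames, concatenated (24 values)."""
    def deltas(e):
        return np.array([e[k] - e[l] for k, l in BAND_PAIRS])
    return np.concatenate([deltas(energies_i1), deltas(energies_i2)])
```

Here no between-frame derivative is taken; the deltas themselves are quantized by sign to produce the 24-bit sub-fingerprint.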

(50) A summary of the invention's sub-fingerprint extraction process is depicted in FIG. 7 as a simplified flowchart. The flowchart summarizes the steps of the procedure for extracting a sub-fingerprint from an audio signal fragment.

(51) The invention's long-term acoustic sub-fingerprint extraction method results in the extraction of very robust sub-fingerprints that remain invariant under aggressive transformations of the sound. In particular, due to the substantially large frame size used for audio partitioning, most of the sub-fingerprints remain intact even under significant (several percent) playback speed variation and time scale modification.

(52) Additionally, the use of non-overlapping and largely spaced frames in the clusters producing sub-fingerprints, combined with the nature of the selected signal property and the large number of bits in the produced sub-fingerprint, results in extraction of highly representative and highly discriminative sub-fingerprints.

(53) Timeless Fingerprint Extraction

(54) In prior art systems, the fingerprint of an audio fragment is usually generated by combining its sub-fingerprints into one set together with corresponding additional data, such as explicit or implicit information about the time-location of the sub-fingerprints in the source audio fragment. In particular, some fingerprint extraction techniques are based on coupling the sub-fingerprint data with its corresponding time-position (time-stamp) in the source signal. Other methods imply combining sub-fingerprints exactly in the order of their arrival, i.e. in the same order as their corresponding reference frames appear in the source signal. These methods are usually imposed by the need to compensate for the imperfect degree of discrimination of the sub-fingerprints using the added extra information (such as the absolute or at least relative sub-fingerprint time-location) in order to reduce the error rate of fingerprint matching and search. The added information inevitably increases the size of the fingerprint data.

(55) The timeless fingerprinting method disclosed herein removes the necessity to use any extra information about sub-fingerprint time-locations. Due to the specifics and the special properties of the invention's acoustic sub-fingerprint extraction method, the extracted acoustic sub-fingerprints represent highly discriminative data entities. A sub-fingerprint produced by the invention's method represents a highly distinguishing, unambiguous quantitative characteristic of the source acoustic signal not only at the specific time-location but also, with a large degree of confidence, over a long signal range around the source time-location from which it has been extracted. For real-world audio recordings, such as speech or music having dynamic acoustic content, the sub-fingerprints derived from them by the described method have almost no repetitive values in non-adjacent positions over very long fragments of audio. Thus, combining together several highly discriminative sub-fingerprints originating from an audio fragment into one set results in high-entropy data and therefore, with a very large degree of confidence, ensures that no other acoustically different audio fragment can produce a fingerprint containing the same sub-fingerprints. This insight is an important aspect of the invention's fingerprint extraction method.

(56) Thus, the invention's timeless fingerprint extraction method consists of combining the highly discriminative sub-fingerprints extracted from an audio fragment into one set without adding any auxiliary information such as the sub-fingerprint absolute or relative temporal data. The use of multi-byte high-entropy sub-fingerprints dramatically reduces the probability of collisions, i.e. the probability of encountering another acoustically different fragment that would result in extraction of the same sub-fingerprints. At the same time, the omission of any auxiliary information saves a significant amount of storage space for the fingerprint data, allows producing compact fingerprints, and reduces the fingerprint database size very significantly.

(57) Moreover, since no relative or absolute temporal data is stored in the fingerprint, the order of the sub-fingerprints in the timeless fingerprint becomes meaningless too. This allows the fingerprint extraction to be sped up significantly by parallelizing the sub-fingerprint extraction process in capable computational systems and storing the extracted sub-fingerprints in the order in which they arrive from the extractor, without the need to preserve their original sequential order relative to the source audio stream.

(58) Importantly, the same set of sub-fingerprints extracted from an audio fragment can be used to create timeless fingerprints of different scale, containing all of the extracted sub-fingerprints or only a part of them. Different versions of timeless fingerprints (such as more detailed and more compact, intersecting and disjoint) can be created from the same set of sub-fingerprints.

(59) Thus in some embodiments, the invention can be used in an automated method for extracting a timeless fingerprint that characterizes at least a fragment of an audio signal. This embodiment of the invention can, for example comprise:

(60) a: using at least one computer processor to divide any of an audio signal, or a fragment of said audio signal with a time length greater than 3 seconds, into a plurality of time-overlapping signal frames (frames).

(61) b: using at least one computer processor to create a plurality of frame clusters, each frame cluster (cluster of frames) comprising at least two non-overlapping frames; wherein each frame cluster comprises frames (cluster frames) that are disjoint, non-adjacent, and substantially spaced from other frame cluster frames.

(62) c: using at least one computer processor to select these frame clusters, and use the previously discussed acoustic sub-fingerprint methods to compute sub-fingerprints for at least some of these selected frame clusters, thus producing a set of sub-fingerprints, wherein each selected frame cluster produces a single sub-fingerprint.

(63) d: using at least one computer processor to remove those sub-fingerprints that have repetitive values from this set of sub-fingerprints, thus producing a refined set of sub-fingerprints for this plurality of frame clusters.

(64) e: producing one or more timeless fingerprints by combining, in an arbitrary order, and without any additional information, at least some selected sub-fingerprints from this refined sub-fingerprint set.

(65) The above fingerprints are considered timeless because the sub-fingerprints do not carry information about a time location or position of the selected frame clusters relative to the audio signal or fragment of the audio signal. Further, these sub-fingerprints also do not carry information about a time location or position of these selected frame clusters relative to a time location or position of other clusters of frames used to generate other sub-fingerprints comprising this timeless fingerprint.

(66) Note also that in some embodiments at least some selected sub-fingerprints from the refined sub-fingerprint set are combined in an arbitrary manner, independent of the order in which their corresponding frame clusters appear in the audio signal or fragment of the audio signal.

(67) Eventually, the invention's timeless fingerprint extraction method consists of combining the acoustic sub-fingerprints F_1, F_2, . . . , F_n extracted from the audio fragment into one set ℱ = {F_i}, i = 1 . . . n, in arbitrary (in particular, even random) order, and without any additional information (such as any absolute or relative temporal information). For example: ℱ = {F_5, F_1, . . . , F_{n-3}, . . . , F_3}. Any repetitive sub-fingerprints can be omitted to produce a compact fingerprint. It is also possible to apply additional filtering of the sub-fingerprint set in order to further reduce its size. Finally, the same set of sub-fingerprints can be used to produce various different, intersecting or non-intersecting sub-fingerprint sub-sets and their corresponding fingerprints.
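A minimal sketch of this set-forming step (names are illustrative): an unordered Python frozenset naturally discards both the arrival order and any repetitive sub-fingerprint values, mirroring the description above.

```python
def make_timeless_fingerprint(sub_fingerprints):
    """Combine sub-fingerprints in arbitrary order, omitting repetitive values;
    no time-stamps or sequential order are stored."""
    return frozenset(sub_fingerprints)

# Toy 24-bit sub-fingerprint values; one value repeats and is stored only once
subs = [0xA1B2C3, 0x0F0F0F, 0xA1B2C3, 0x123456]
fp = make_timeless_fingerprint(subs)
assert len(fp) == 3
```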

(68) In a preferred embodiment of the invention's fingerprint extraction method, the set of sub-fingerprints extracted from the audio fragment is refined by removing any consecutive sub-fingerprints having repetitive values. Additionally, any unreliable sub-fingerprints (e.g. sub-fingerprints extracted from signal frames containing energy deltas with low values) are removed from the set. The sub-fingerprints from the refined set are combined together into one block of data to produce the acoustic fingerprint of the audio fragment. Appropriate sub-fingerprint data alignment is maintained within the fingerprint data block. Such an implementation results in extraction of fingerprints containing approximately 30 sub-fingerprints per second of audio on average. With the 24-bit (3 byte) long sub-fingerprints, this results in extraction of approximately 90 bytes of acoustic fingerprint data per second of audio. Thus, the ratio of the size of the extracted acoustic characterization information (with the data rate of 90 bytes per second) to the size of the source raw audio information (with the data rate of 16 Kbytes per second for 8000 Hz/16 bit PCM audio) is around 0.0056, which corresponds to data reduction by a factor of 177.
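The data-rate arithmetic of this paragraph can be verified directly; the figures below are taken from the text, and integer division reproduces the stated reduction factor of 177:

```python
# Check of the data-reduction figures in (68)
sub_fp_per_second = 30      # average sub-fingerprints per second of audio
sub_fp_bytes = 3            # 24-bit sub-fingerprints
fingerprint_rate = sub_fp_per_second * sub_fp_bytes  # 90 bytes per second

pcm_rate = 8000 * 2         # 16 Kbytes per second (8000 Hz, 16-bit PCM)

assert fingerprint_rate == 90
assert abs(fingerprint_rate / pcm_rate - 0.005625) < 1e-12  # ratio ~0.0056
assert pcm_rate // fingerprint_rate == 177                  # reduction factor
```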

(69) Combining the acoustic sub-fingerprints of an audio fragment in an arbitrary order into one timeless fingerprint is depicted in FIG. 8. Sub-fingerprints 803 are extracted from an audio fragment 801 and are then combined into one fingerprint block 805 in an arbitrary order (original sequential order is not preserved, and some fingerprints are omitted).

(70) In a preferred embodiment, the size of audio fragment producing a single fingerprint has been selected to be 10-20 seconds long.

(71) A timeless fingerprint matching technique described hereinafter makes no use of the order, temporal, or any other additional information about the sub-fingerprints comprising the fingerprint.

(72) Determining Acoustic Similarity by Matching of Timeless Fingerprints

(73) High-entropy fingerprint data, combined with the high degree of discrimination and robustness of the sub-fingerprints comprising the fingerprint, allows reliable, low-error matching of two fingerprints without requiring any extra information, such as information about the temporal location of the sub-fingerprints in the source audio fragment and/or relative to each other.

(74) In order to perform reliable matching of two even substantially long audio fragments and to judge their acoustic similarity with rather high confidence, it is enough to apply the most straightforward approach and simply calculate the number of identical sub-fingerprints (hits) contained in the corresponding timeless fingerprints of the two audio fragments. The resulting number of hits only has to be normalized and thresholded with an experimentally adjusted threshold to declare the two fingerprints a match or a miss. No additional special logic, bit-error rate (BER) calculation, or distance measuring (such as Euclidean, Manhattan or Hamming distance) is involved in the process. This approach allows performing efficient search even in a large database, provided that the database consists of fingerprints having sizes comparable to the size of the matched fingerprint.

(75) More specifically, in a preferred embodiment, matching of two audio fragments A^1 and A^2 with their corresponding timeless fingerprints ℱ^1 = {F_i^1} and ℱ^2 = {F_j^2} is performed by finding a set C^{1,2} consisting of the acoustic sub-fingerprints F contained in both fingerprints: C^{1,2} = {F | F ∈ ℱ^1 ∧ F ∈ ℱ^2}. The size of the set C^{1,2} is the number of hits, h^{1,2} = |C^{1,2}| (where the operator |·| denotes the number of members). The degree of acoustic similarity d of A^1 and A^2 is then computed as:

(76) d = h^{1,2} / min(|ℱ^1|, |ℱ^2|)

(77) The resulting degree of acoustic similarity d is a value between 0.0 and 1.0. For example, if fragment A^1 is a sub-fragment of A^2, then ℱ^1 ⊆ ℱ^2 and therefore C^{1,2} = ℱ^1, h^{1,2} = |C^{1,2}| = |ℱ^1|, which leads to d = 1.0. To the contrary, if fragments A^1 and A^2 are totally different acoustically, then ℱ^1 ∩ ℱ^2 = ∅, C^{1,2} = ∅, h^{1,2} = 0, and therefore d = 0.0.
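This matching rule can be sketched in a few lines of Python (the hit count as the size of a set intersection, normalized by the smaller fingerprint; names are illustrative):

```python
def acoustic_similarity(fp1, fp2):
    """d = |fp1 intersect fp2| / min(|fp1|, |fp2|), per paragraphs (75)-(76)."""
    hits = len(set(fp1) & set(fp2))
    return hits / min(len(fp1), len(fp2))

fp_a = {1, 2, 3, 4}
fp_b = {1, 2, 3, 4, 5, 6, 7, 8}  # fp_a corresponds to a sub-fragment of fp_b
assert acoustic_similarity(fp_a, fp_b) == 1.0
assert acoustic_similarity({1, 2}, {3, 4}) == 0.0  # acoustically different
```

In practice the similarity value would be compared against an experimentally adjusted threshold, as paragraph (74) describes.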

(78) The matching process described hereinbefore is depicted in FIG. 9. Two timeless fingerprints ℱ^1 = {F_i^1} and ℱ^2 = {F_j^2} are analyzed, each containing its corresponding acoustic sub-fingerprints. Identical sub-fingerprints (hits) contained in both fingerprints are identified and are stored in the hit-list C^{1,2}. The number of sub-fingerprints contained in the hit-list C^{1,2} gives the number of hits h^{1,2}: |C^{1,2}| = h^{1,2}.

(79) When applied to high-entropy fingerprints, the invention's matching method demonstrates high reliability and robustness and results in a low error rate. In a preferred embodiment operating with fingerprints corresponding to approximately 10-15 second long audio fragments, matching of fingerprints against a set of audio fragments originating from different music recordings and containing a few million fingerprints results in very high accuracy, with an average error rate of <0.001% (less than one error per 100000 audio fragments).

(80) In some embodiments, the previously discussed timeless fingerprint techniques can further be used to numerically calculate a degree of acoustic similarity of a first and a second audio sample. This can be done by:

(81) a: using at least one computer processor to split the first audio sample into a set of first sample fragments, and also split the second audio sample into a set of second sample fragments. Here the first and second audio samples will typically have a time duration of at least 3 seconds.

(82) b: using at least one computer processor to produce a set of first audio sample timeless fingerprints by using acoustic properties of all the first sample fragments to compute a set of first acoustic sub-fingerprints, then selecting and combining these first acoustic sub-fingerprints in an arbitrary order. Similarly, this computer processor will also produce a set of second audio sample timeless fingerprints by using acoustic properties of all the second sample fragments to compute a set of second acoustic sub-fingerprints, and then selecting and combining these second acoustic sub-fingerprints in an arbitrary order.

(83) c: using at least one computer processor to produce a first timeless super-fingerprint by selecting at least some first audio sample timeless fingerprints from the set of first audio sample timeless fingerprints, and combining them in an arbitrary order. Similarly, using the computer processor to produce a second timeless super-fingerprint by selecting at least some second audio sample timeless fingerprints from the set of second audio sample timeless fingerprints, and combining them in an arbitrary order.

(84) d: using at least one computer processor to match the first and second timeless super-fingerprints by pairing first audio sample timeless fingerprints from the first timeless super-fingerprint with second audio sample timeless fingerprints from the second timeless super-fingerprint, thus producing a plurality of fingerprint pairs. Then, for each fingerprint pair in the plurality of fingerprint pairs, using the computer processor(s) to calculate how many identical sub-fingerprints (hits) are contained in both fingerprints of the pair, thus producing a hit-list.

(85) e: using at least one computer processor to calculate, using this hit-list, a degree of acoustic similarity of the first and second audio samples.

(86) Note that in the above discussion, in at least some embodiments, the relative positions and temporal relations of any of said sub-fingerprints comprising any of said timeless fingerprints will often be unknown. Further these sub-fingerprints will often not carry temporal information about their location within any corresponding sample fragments of any said audio samples. Additionally, the relative positions and temporal relations of any of said timeless fingerprints in any of said timeless super-fingerprints will often also be unknown. Finally, the timeless fingerprints in any of said timeless super-fingerprints will typically not carry temporal information about their location relative to the other timeless fingerprints of other said timeless super-fingerprints.

(87) More specifically, matching of long audio tracks can be done by matching their corresponding timeless super-fingerprints. Namely, a long audio recording is first segmented into several shorter fragments (e.g. 10-15 seconds long), and a timeless fingerprint is extracted from each fragment by performing the procedures described hereinbefore. The fragments can be disjoint or overlapping. Timeless fingerprints originating from the same track are combined into a group of fingerprints to produce the timeless super-fingerprint of the long audio recording, as shown in FIG. 10. Similar to the process of producing timeless fingerprints from sub-fingerprints described hereinbefore, the timeless super-fingerprint is produced from a group of timeless fingerprints by combining them together in an arbitrary order and without adding any additional temporal information. All or only part of the available fingerprints can be used to form the super-fingerprint. For the super-fingerprint to be representative of an audio recording, it is enough to associate the fingerprints comprising the super-fingerprint with the audio recording from which they originate by adding a short identifier associated with the recording.

(88) In a preferred embodiment of super-fingerprint extraction, a long audio recording R (e.g. a music track) is segmented into 10-15 second long fragments A^i to produce a set of fragments A = {A^i | A^i ⊂ R, ∪_i A^i = R} of the audio recording R. A fingerprint ℱ^i is extracted from each fragment A^i of the set A. The extracted fingerprints are merged together into one group to produce the super-fingerprint S = {ℱ^i} of the audio recording R.
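A hedged sketch of this super-fingerprint formation, assuming per-fragment timeless fingerprints are already available (names are illustrative; pooling into a Python set also drops exact-duplicate fingerprints, which is one permissible choice given that paragraph (87) allows using only part of the available fingerprints):

```python
def make_super_fingerprint(fragment_fingerprints):
    """Pool per-fragment timeless fingerprints into one unordered group;
    no temporal information about the fragments is kept."""
    return set(map(frozenset, fragment_fingerprints))

# Toy per-fragment fingerprints of one track (each a set of sub-fingerprints)
track_fps = [{1, 2, 3}, {4, 5}, {1, 2, 3}]
sfp = make_super_fingerprint(track_fps)
assert len(sfp) == 2  # identical fingerprints collapse; order is irrelevant
```

In a real system each pooled fingerprint would additionally carry the short recording identifier mentioned in paragraph (87).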

(89) FIG. 10 depicts the process of combining several fingerprints ℱ^i, corresponding to fragments A^i of a long audio recording R, in arbitrary order (their original sequential order is not preserved) into one set to produce a super-fingerprint S = {ℱ^i} of the audio recording R.

(90) Matching of two audio recordings R_1 and R_2 can be performed by matching their corresponding super-fingerprints S_1 and S_2 produced from their corresponding sets of audio fragments A_1 and A_2.

(91) Matching of two super-fingerprints S_1 = {ℱ_1^i} and S_2 = {ℱ_2^j} can be done by matching pairs of the fingerprints they contain, wherein the first fingerprint ℱ_1^i in the matched pair (ℱ_1^i, ℱ_2^j) is taken from the first super-fingerprint S_1, and the second fingerprint ℱ_2^j in the matched pair (ℱ_1^i, ℱ_2^j) is taken from the second super-fingerprint S_2. The acoustic similarity of the two audio recordings is then calculated by combining and processing the results of the pair matching.

(92) Various different methods can be applied to determine the degree of acoustic similarity of the two audio recordings based on the matching results of their corresponding timeless fingerprint pairs.

(93) In some embodiments, the degree of acoustic similarity between the audio recordings and their corresponding super-fingerprints can be calculated by performing operations such as determining if a number of hits in said hit-list exceeds a predetermined threshold; or determining if a maximal number of hits in said hit-list exceeds a predetermined threshold.

(94) In one particular, simple embodiment, the degree of acoustic similarity D of two audio recordings R_1 and R_2 is determined as D = max({d^{i,j}}), where d^{i,j} is the degree of acoustic similarity of the fingerprint pair (ℱ_1^i, ℱ_2^j), ℱ_1^i ∈ S_1, ℱ_2^j ∈ S_2, in the corresponding timeless super-fingerprints of the two audio recordings R_1 and R_2.
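This simple embodiment can be sketched directly, with d computed per pair as in paragraph (76) (names are illustrative):

```python
def simple_super_similarity(sfp1, sfp2):
    """D = max over all fingerprint pairs (f1, f2) of d(f1, f2)."""
    return max(
        len(f1 & f2) / min(len(f1), len(f2))
        for f1 in sfp1
        for f2 in sfp2
    )

s1 = [{1, 2, 3}, {4, 5}]
s2 = [{4, 5, 6}]
assert simple_super_similarity(s1, s2) == 1.0  # pair ({4,5}, {4,5,6}) matches fully
```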

(95) In another, more sophisticated embodiment, depicted in FIG. 11, the number of hits h^{i,j} is calculated for all pairs of fingerprints ℱ_1^i ∈ S_1 and ℱ_2^j ∈ S_2 (all-with-all matching), and a hit-list H = {h^{i,j}} is produced, wherein h^{i,j} = |C^{i,j}|, C^{i,j} = {F | F ∈ ℱ_1^i ∧ F ∈ ℱ_2^j}. A normalized hit-list H̄ is then calculated as:

(96) H̄ = { h^{i,j} / min(|ℱ_1^i|, |ℱ_2^j|) | ∀ i,j: h^{i,j} ∈ H }

(97) The degree of acoustic similarity D is then calculated from the number of members in the normalized hit-list H̄ having values greater than a pre-defined threshold t, 0 < t ≤ 1, in the following way:

(98) H_t = { h | h ∈ H̄, h > t }, D = |H_t| / |H̄|
Put alternatively, in some embodiments, the degree of acoustic similarity can be calculated by using at least one computer processor to normalize each value of the previously discussed hit-list, dividing that value by the total number of sub-fingerprints contained in the shortest timeless fingerprint of the fingerprint pair corresponding to that value, thus producing a normalized hit-list. The at least one computer processor can further calculate the number of positions in this normalized hit-list where the number of hits and/or the normalized value surpasses a predetermined threshold. The computer processor can then further normalize this number of positions by the total number of values in the normalized hit-list.
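A sketch that follows formulas (95)-(98) literally: all-with-all matching produces the hit-list, each value is normalized by the shorter fingerprint of its pair, and D is the fraction of normalized values above the threshold. Zero-hit pairs are kept in the list, as the formulas state; the names and the default threshold are illustrative:

```python
def super_similarity(sfp1, sfp2, t=0.5):
    """D per (95)-(98): all-with-all pair matching, normalization, thresholding."""
    norm_hits = []
    for f1 in sfp1:
        for f2 in sfp2:
            hits = len(f1 & f2)                       # h^{i,j} = |C^{i,j}|
            norm_hits.append(hits / min(len(f1), len(f2)))
    above = [h for h in norm_hits if h > t]           # H_t
    return len(above) / len(norm_hits)                # D = |H_t| / |H-bar|

a = [frozenset({1, 2, 3})]
b = [frozenset({1, 2, 3})]
assert super_similarity(a, b) == 1.0    # identical single-fragment recordings
assert super_similarity([{1, 2}], [{3, 4}]) == 0.0  # totally different
```

Note that for multi-fragment recordings the cross pairs of non-corresponding fragments contribute zeros to the normalized hit-list, so D reflects the fraction of above-threshold pairs rather than a simple maximum.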

(99) In the case of two identical audio recordings R_1 and R_2, their corresponding super-fingerprints will contain identical fingerprints, and the normalized hit-list H̄ will consist of all 1's, which results in D = 1.0. To the contrary, in the case of two totally different audio recordings, R_1 ≠ R_2, their super-fingerprints will consist of unequal fingerprints, and the normalized hit-list H̄ will contain all zero values, resulting in D = 0.0.