Audio signal processing methods and systems

Abstract

Described are methods and systems of identifying one or more fundamental frequency component(s) of an audio signal. The methods and systems may include any one or more of an audio event receiving step, a signal discretization step, a masking step, and/or a transcription step.

Claims

1. A method of identifying at least one fundamental frequency component of an audio signal, the method comprising receiving and recording an audio event, and converting the recorded audio event into an audio signal, the method further comprising: (a) filtering the audio signal to produce a plurality of sub-band time domain signals; (b) transforming the plurality of sub-band time domain signals into a plurality of sub-band frequency domain signals by mathematical operators; (c) summing together the plurality of sub-band frequency domain signals to yield a single spectrum; (d) calculating the bispectrum of each of the plurality of sub-band time domain signals; (e) summing together the bispectra calculated in (d); (f) calculating the diagonal of the summed bispectra; (g) multiplying the single spectrum and the diagonal of the summed bispectra to produce a product spectrum; and (h) identifying at least one fundamental frequency component of the audio signal from the product spectrum or information contained in the product spectrum.

2. The method according to claim 1, wherein at least one identifiable fundamental frequency component is matched with a known audio event such that identification of the at least one fundamental frequency component enables identification of the known audio event.

3. The method according to claim 1, wherein the method further comprises visually representing on a screen or other display means at least one selected from the group consisting of: the product spectrum; information contained in the product spectrum; identifiable fundamental frequency components; and a representation of identifiable known audio events in the audio signal.

4. The method according to claim 1, wherein the product spectrum includes a plurality of peaks, and wherein at least one fundamental frequency component of the audio signal is identifiable from the locations of the peaks in the product spectrum.

5. The method according to claim 1, wherein filtering of the audio signal is carried out using a constant-Q filterbank applying a constant ratio of frequency to bandwidth across frequencies of the audio signal.

6. The method according to claim 5, wherein the filterbank comprises a plurality of spectrum analyzers and a plurality of filter and decimate blocks.

7. The method according to claim 1, wherein the mathematical operators for transforming the plurality of sub-band time domain signals into the plurality of sub-band frequency domain signals comprise fast Fourier transforms.

8. The method according to claim 1, wherein the audio signal comprises a plurality of audio signal segments, and wherein fundamental frequency components of the audio signal are identifiable from product spectra produced by the operation of steps (a) to (g) on the audio signal segments, or from the information contained in such product spectra for the audio signal segments.

9. The method according to claim 1, wherein the audio event comprises a plurality of audio event segments, each being converted into a plurality of audio signal segments, wherein fundamental frequency components of the audio event are identifiable from product spectra produced by operation of steps (a) to (g) of claim 1 on the audio signal segments, or from the information contained in such product spectra for the audio signal segments.

10. The method according to claim 1, wherein the method includes a signal discretization step, and wherein the signal discretization step enables discretizing the audio signal into time-based segments of varying sizes.

11. The method according to claim 10, wherein the segment size of the time-based segment is determinable by the energy characteristics of the audio signal.

12. The method according to claim 1, wherein the method includes a masking step, and wherein the masking step comprises applying a quantizing algorithm to map the frequency spectra of the product spectrum produced in step (g), and a mask bank consisting of a plurality of masks to be applied to the mapped frequency spectra.

13. The method according to claim 12, wherein the quantizing algorithm effects mapping the frequency spectra of the product spectrum to a series of audio event-specific frequency ranges, the mapped frequency spectra together constituting an array.

14. The method according to claim 12, wherein at least one mask in the mask bank contains fundamental frequency spectra associated with at least one known audio event.

15. The method according to claim 14, wherein the fundamental frequency spectra of a plurality of masks in the mask bank is set in accordance with the identified fundamental frequency component.

16. The method according to claim 13, wherein the mask bank operates by applying at least one mask to the array such that the frequency spectra of the at least one mask is subtracted from the frequency spectra in the array, in an iterative fashion from the lowest applicable fundamental frequency spectra mark to the highest applicable fundamental frequency spectra mark, until there is no frequency spectra left in the array below a minimum signal amplitude threshold.

17. The method according to claim 13, wherein the particular masks of the mask bank to be applied to the array are chosen based on which at least one fundamental frequency component(s) are identifiable in the product spectrum of the audio signal.

18. The method according to claim 13, further comprising iterative application of the masking step, wherein iterative application of the masking step comprises performing cross-correlation between the diagonal of the summed bispectra and masks in the mask bank, then selecting the mask having the highest cross-correlation value, the high correlation mask is then subtracted from the array, and this process continues iteratively until no frequency content below a minimum threshold remains in the array.

19. The method according to claim 14, wherein the masking step comprises producing a final array identifying each of the at least one known audio event present in the audio signal, wherein the at least one known audio event identifiable in the final array is determinable by observing which of the masks in the masking step are applied.

20. The method according to claim 19, wherein the method includes a transcription step, and wherein the transcription step comprises converting known audio events, identifiable by the masking step or by the product spectrum, into a visually representable transcription of the identified known audio events.

21. A non-transitory computer-readable medium for identifying at least one fundamental frequency component of an audio signal or audio event, the non-transitory computer-readable medium comprising: code components configured to enable a computer to perform a method of identifying at least one fundamental frequency component of an audio signal, the method comprising receiving and recording an audio event, and converting the recorded audio event into an audio signal, the method further comprising: (a) filtering the audio signal to produce a plurality of sub-band time domain signals; (b) transforming the plurality of sub-band time domain signals into a plurality of sub-band frequency domain signals by mathematical operators; (c) summing together the plurality of sub-band frequency domain signals to yield a single spectrum; (d) calculating the bispectrum of each of the plurality of sub-band time domain signals; (e) summing together the bispectra calculated in (d); (f) calculating the diagonal the summed bispectra; (g) multiplying the single spectrum and the diagonal of the summed bispectra to produce a product spectrum; and (h) identifying at least one fundamental frequency component of the audio signal from the product spectrum or information contained in the product spectrum.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) Possible and preferred features of this disclosure will now be described with particular reference to preferred embodiments of the disclosure in the accompanying drawings. However, it is to be understood that the features illustrated in and described with reference to the drawings are not to be construed as limiting on the scope of the disclosure. In the drawings:

(2) FIG. 1 illustrates a preferred method for identifying fundamental frequency component(s), or MIFFC, embodying this disclosure;

(3) FIG. 1A illustrates a filterbank including a series of spectrum analyzers and filter and decimate blocks;

(4) FIG. 1B illustrates three major triad chordsC4 major triad, D4 major triad and G4 major triad.

(5) FIG. 2 illustrates a preferred method embodying this disclosure including an AERS, SDS, MIFFC, MS and TS;

(6) FIG. 3 illustrates a preferred system embodying this disclosure; and

(7) FIG. 4 is a diagram of a computer-readable medium embodying this disclosure.

DETAILED DESCRIPTION

(8) In relation to the applications and embodiments of the disclosure described herein, while the descriptions may, at times, present the methods and systems of the disclosure in a practical or working context, the disclosure is intended to be understood as providing the framework for the relevant steps and actions to be carried out, but not limited to scenarios where the methods are being carried out. More definitively, the disclosure may relate to the framework or structures necessary for improved signal processing, not limited to systems or instances where that improved processing is actually carried out.

(9) Referring to FIG. 1, there is depicted a method for identifying fundamental frequency component(s) 10, or MIFFC, for resolving the FFCs of a single time-domain frame of a complex audio signal, represented by the function x.sub.p[n] and also called a single-frame time domain signal (SFTDS). The MIFFC 10 comprises a filtering block 30, a DCQBS block 50, then a multiplication of the outputs of each of these blocks, yielding a product spectrum 60, which contains information about FFCs present in the original SFTDS input.

(10) Filtering Block

(11) First, a function representing an SFTDS is received as input into the filtering block 30 of the MIFFC 10. The SFTDS is pre-processed to contain that part of the signal occurring between a pre-determined onset and offset time. The SFTDS passes through a constant-Q filterbank 35 to produce multiple sub-band time-domain signals (SBTDSs) 38.

(12) The Constant-Q Filterbank

(13) The constant-Q applies a constant ratio of frequency to bandwidth (or resolution), represented by the letter Q, and is structured to generate good frequency resolution at the cost of poorer time resolution at the lower frequencies, and good time resolution at the cost of poorer frequency resolution at high frequencies.

(14) This choice is made because the frequency spacing between two human ear-distinguishable sound events may only be in the order of 1 or 2 Hz for lower frequency events; however, in the higher ranges, frequency spacing between adjacent human ear-distinguishable events is in the order of thousands of Hz. This means frequency resolution is not as important at higher frequencies as it is at low frequencies for humans. Furthermore, the human ear is most sensitive to sounds in the 3-4 kHz channel so a large proportion of sound events that the human ear is trained to distinguish occur in this region of the frequency spectrum.

(15) In the context of musical sounds, since the notes of melodies typically have notes of shorter duration than harmony or bass voices, it is logical to dedicate temporal resolution to higher frequencies. The above explains why a constant-Q filterbank is chosen; it also explains why such a filterbank is suitable in the context of analyzing music audio signals.

(16) With reference to FIG. 1A, the filterbank 35 is composed of a series of spectrum analyzers 31 and filter and decimate blocks 36 (one of each are labelled in FIG. 1A), in order to selectively filter the audio signal 4. Inside each spectrum analyzer block 31, there is preferably a Hanning window sub-block 32 having a length related to onset and offset times of the SFTDS.

(17) Specifically, the length of each frame is measured in sample numbers of digital audio data, which correspond to duration (in seconds). The actual sample number depends on the sampling rate of the generated audio signal; a sample rate of 11 kHz is taken. This means that 11,000 samples of audio data per second are generated. If the onset of the sound is at 1 second and the offset is at 2 seconds, this would mean that the onset sample number is 11,000 and the offset sample number is 22,000. Alternatives to Hanning windows include Gaussian and Hamming windows. Inside each spectrum analyzer block 31 is a fast Fourier transform sub-block 33. Alternative Transforms that may be used include Discrete Cosine Transforms and Discrete Wavelet Transforms, which may be suitable depending on the purpose and objectives of the analysis.

(18) Inside each filter and decimate block 36, there is an anti-aliasing low-pass filter sub-block 37 and a decimation sub-block 37A. The pairs of spectrum analyzer and filter and decimate blocks 31 and 36 work to selectively filter the audio signal 4 into pre-determined frequency channels. At the lowest channel filter of the filterbank 35, good quality frequency resolution is achieved at the cost of poor time resolution. While the center frequencies of the filter sub-blocks change, the bandwidth is preserved across each pre-determined frequency channel, resulting in a constant-Q filterbank 35.

(19) The numbers of pairs of spectrum analyzer and filter and decimate blocks 31 and 36 can be chosen depending on the frequency characteristics of the input signal. For example, when analyzing the frequency of audio signals from piano music, since the piano has eight octaves, eight pairs of these blocks can be used.

(20) The following equations derive the constant-Q transform. Bearing close relation to the Fourier transform, the constant-Q transform (CQT) contains a bank of filters, however, in contrast, it has geometrically spaced center frequencies:
f.sub.i=f.sub.o.Math.2.sup.i/b
for iZ, where b indicates the number of filters per octave. The bandwidth of the kth filter is chosen so as to preserve the octave relationship with the adjacent Fourier domain:

(21) ${BW}_{i} = f_{i + 1} - f_{i} = f_{i} (2^{\frac{1}{b}} - 1)$
In other words the transform can be thought of as a series of logarithmically spaced filters, with the kth filter having a spectral width some multiple of the previous filter's width. This produces a constant ratio of frequency:bandwidth (resolution), whereby

(22) $Q = \frac{f_{i}}{{BW}_{i}} = {(2^{\frac{1}{b}} - 1)}^{- 1}$
where f.sub.i is the center frequency of the ith band filter and BW.sub.i is the corresponding bandwidth. In Constant-Q filters, Q.sub.i=Q, where i custom character Q is constant and the bandwidth is preserved across each octave. From the above, the constant-Q transform may be derived as

(23) $x^{cq} [k] := \frac{1}{N_{k}} {.Math.}_{n = 0}^{N_{k}} x [n] w_{N_{k}} [n]^{\frac{- 2 j Qn}{N_{k}}}$
Where N.sub.k is the window length, w.sub.Nk is the windowing function, which is a function of window length, and the digital frequency is 2Q/N.sub.k. This constant-Q transform is applied in the diagonal bispectrum (or DCQBS) block described below.

(24) For a music signal context, in equation for Q above, by tweaking f.sub.i and b, it is possible to match note frequencies. Since there are 12 semitones (increments in frequency) in one octave, this can be achieved by choosing b=12 and f.sub.i corresponding to the center frequency of each filter. This can be helpful later in frequency analysis because the signals are already segmented into audio event ranges, so less spurious FFC note information is present. Different values for f.sub.i and b can be chosen so that the filterbank 35 is suited to the frequency structure of the input source. The total number of filters is represented by N.

(25) Returning to FIG. 1, after passing through the filterbank 35, the single audio frame input is filtered into N sub-band time domain signals 38. Each SBTDS is acted on by an FFT function in the spectrum analyzer blocks 31 to produce N sub-band frequency domain signals 39 (or SBFDS), which are then summed to deliver a constant-Q FFT single spectrum 40, being the single spectrum of the SFTDS that was originally input into the filtering block 30.

(26) In summary, the filtering block 30 produces two outputs: an FFT single spectrum 40 and N SBTDS 38. The user may specify the number of channels, b, being used so as to allow a trade-off between computational expense and frequency resolution in the constant-Q spectrum.

(27) DCQBS Block

(28) The DCQBS block 50 receives the N SBTDSs 38 as inputs and the bispectrum calculator 55 individually calculates the bispectrum for each. The bispectrum is described in detail below. Let an audio signal be defined by: x[k] where k custom character
k is the sample number, where k is an integer (e.g., x[1], . . . , x[22,000]).

(29) The magnitude spectrum of a signal is defined as the first order spectrum, produced by the discrete Fourier transform:

(30) $X () = {.Math.}_{k = -}^{} x [k]^{- j k}$

(31) The power spectral density (PSD) of a signal is defined as the second order spectrum:
PSD.sub.x()=X()X*()

(32) The bispectrum, B, is defined as the third order spectrum:
B.sub.x[.sub.1,.sub.2]=X(.sub.1)X(.sub.2)X*(.sub.1+.sub.2)

(33) After calculating the bispectrum for each N time-domain sub-band signal, the N bispectra are then summed to calculate a full, constant-Q bispectrum 54. Mathematically, the full constant-Q bispectrum 54 is a symmetric, complex-valued non-negative, positive-semi-definite matrix. Another name for this type of matrix is a diagonally dominant matrix. The mathematical diagonal of this matrix is taken by the diagonalizer 57, yielding a quasi-spectrum called the diagonal bispectrum 56. The benefit of taking the diagonal is two-fold: first, it is faster to compute than the full Constant-Q bispectrum due to having substantially less data points (more specifically, for an MM matrix, M.sup.2 points are required, whereas, its diagonal contains only M points, effectively square-rooting the number of required calculations). More importantly, the diagonal bispectrum 56 yields peaks at the fundamental frequencies of each input signal. In more detail, the diagonal constant-Q bispectrum 56 contains information pertaining to all frequencies, with constant bandwidth to frequency ratio, and it removes a great deal of harmonic content from the signal information while boosting the fundamental frequency amplitudes (after multiplication with the single spectrum), which permits a more accurate reading of the fundamental frequencies in a given signal.

(34) The output of the diagonalizer 57, the diagonal bispectrum 56, is then multiplied by the single spectrum 40 from the filtering block 30 to yield the product spectrum 60 as an output.

(35) Mathematics of the Product Spectrum

(36) The product spectrum 60 is the result of multiplying the single spectrum 40 with the diagonal bispectrum 56 of the SFTDS 20. It is described by recalling the bispectrum as:
B.sub.x[.sub.1,.sub.2]=X(.sub.1)X(.sub.2)X*(.sub.1+.sub.2)

(37) The diagonal constant-Q bispectrum is given by applying a constant-Q transform (see above) to the bispectrum, then taking the diagonal:
B.sub.X.sub.CQ[.sub.1,.sub.2]=X.sub.CQ(.sub.1)X.sub.CQ(.sub.2)X*.sub.CQ(.sub.1+.sub.2)
Diagonal Constant-Q Bispectrum:diag(B.sub.X.sub.CQ[.sub.1,w.sub.2])=diag(X.sub.CQ(.sub.1)X.sub.CQ(.sub.2)X*.sub.CQ(.sub.1+.sub.2))

(38) Now, by multiplying the result with the single constant-Q spectrum, the product spectrum is yielded:
diag(B.sub.X.sub.CQ[.sub.1,.sub.2])=diag(X.sub.CQ(.sub.1)X.sub.CQ(.sub.2)X*.sub.CQ(.sub.1+.sub.2)X.sub.CQ())

(39) The product spectrum 60 contains information about FFCs present in the original SFTDS, and this will be described below with reference to an application.

(40) Application

(41) This application describes the MIFFC 10 used to resolve the fundamental frequencies of known audio event constituting notes played on a piano, also with reference to FIG. 1. In this example, the audio signal 4 comprises three chords on the piano are played one after the other: C4 major triad (notes C, E, G, beginning with C in the 4.sup.th octave), D4 major triad (notes D, F#, A beginning with D in the 4.sup.th octave), and G4 major triad (notes G, B, D beginning with G in the 4.sup.th octave). This corresponds to the sheet music notation in FIG. 1B.

(42) Each of the chords is discretized in pre-processing so that the audio signal 4 representing these notes is constituted by three SFTDSs, x.sub.1 [n], x.sub.2[n] and x.sub.3[n], which are consecutively inserted into the filtering block 30. The length of each of the three SFTFDs is the same, and is determined by the length of time that each chord is played. Since the range of notes played is spread over two octaves, 16 channels are chosen for the filterbank 35. The first chord, whose SFTDS is represented by x.sub.1[n], passes through the filterbank 35 to produce 16-time sub-band domain signals (SBTDS), x.sub.1[k] (k: 1, 2 . . . 16). Similarly, 16 SBTDSs are resolved for each of x.sub.2[k] and x.sub.3[k].

(43) The filtering block 30 also applies an FFT to each of the 16 SBTDSs for x.sub.1[k], x.sub.2[k] and x.sub.3[k], to produce 16 sub-band frequency domain signals (SBFDSs) 38 for each of the chords. These sets of 16 SBFTSs are then summed together to form the single spectrum 40 for each of the chords; the single spectra are here identified as SS.sub.1, SS.sub.2, and SS.sub.3.

(44) The other output of the filtering block 30 is the 16 sub-band time-domain signals 38 for each of x.sub.1[k], x.sub.2[k] and x.sub.3[k], which are sequentially input into the DCQBS block 50. In the DCQBS block 50 of the MIFFC 10 in this application of the disclosure, the bispectrum of each of the SBTDSs for the first chord is calculated, summed and then the resulting matrix is diagonalized to produce the diagonal constant-Q bispectrum 56; then the same process is undertaken for the second and third chords. These three diagonal constant-Q bispectra 56 are represented here by DB.sub.1, DB.sub.2 and DB.sub.3.

(45) The diagonal constant-Q bispectra 56 for each of the chords are then multiplied with their corresponding single spectra 40 (i.e., DB.sub.1SS.sub.1; DB.sub.2SS.sub.2; and DB.sub.1SS.sub.1) to produce the product spectra 60 for each chord: PS.sub.3, PS.sub.3, and PS.sub.3. The fundamental frequencies of each of the notes in the known audio event constituting the C4 major triad chord, C (262 Hz), E (329 Hz) and G (392 Hz), are each clearly identifiable from the product spectrum 60 for the first chord from three frequency peaks in the product spectrum 60 localized at or around 262 Hz, 329 Hz, and 392 Hz. The fundamental frequencies for each of the notes in the known audio event constituting the D4 major triad chord and the known audio event constituting the G4 major triad chord are similarly resolvable from PS.sub.2 and PS.sub.3, respectively, based on the location of the frequency peaks in each respective product spectrum 60.

(46) Other Applications

(47) Just as the MIFFC 10 resolves information about the FFCs of a given musical signal, it is equally able to resolve information about the FFCs of other audio signals such as underwater sounds. Instead of a 16-channel filterbank (which was dependent on the two octaves over which piano music signal ranged in the first application), a filterbank 35 with a smaller or larger number of channels would be chosen to capture the range of frequencies in an underwater context. For example, the MIFFC 10 would preferably have a large number of channels if it were to distinguish between each of the following: (i) background noise of a very low frequency (e.g., resulting from underwater drilling); (ii) sounds emitted by a first category of sea-creatures (e.g., dolphins, whose vocalizations are said to range from 1 kHz to 200 kHz); and (iii) sounds emitted by a second category of sea-creatures (e.g., whales, whose vocalizations are said to range from 10 Hz to 30 kHz).

(48) In a related application, the MIFFC 10 could also be applied so as to investigate the FFCs of sounds emitted by creatures, underwater, on land or in the air, which may be useful in the context of geo-locating these creatures, or more generally, in analysis of the signal characteristics of sounds emitted by creatures, especially in situations where there are multiple sound sources and/or sounds having multiple FFCs.

(49) Similarly, the MIFFC 10 can be used to identify FFCs of vocal audio signals in situations where multiple persons are speaking simultaneously, for example, where signals from a first person with a high pitch voice may interfere with signals from a second person with a low pitch voice. Improved resolution of FFCs of vocal audio signals has application in hearing aids, and, in particular, the cochlear implant, to enhance hearing. In one particular application of the disclosure, the signal analysis of a hearing aid can be improved to assist a hearing impaired person achieve something approximating the cocktail party effect (when that person would not otherwise be able to do so). The cocktail party effect refers to the phenomenon of a listener being able to focus his or her auditory attention on a particular stimulus while filtering out a range of other stimuli, much the same way that a partygoer can focus on a single conversation in a noisy room. In this situation, by resolving the fundamental frequency components of differently pitched speakers in a room, the MIFFC can assist in a hearing impaired person's capacity to distinguish one speaker from another.

(50) A second embodiment of the disclosure is illustrated in FIG. 2, which depicts a five-step method 100 including an audio event receiving step (AERS) 1, a signal discretization step (SDS) 5, a method for identifying fundamental frequency component(s) (MIFFC) 10, a masking step (MS) 70, and a transcription step (TS) 80.

(51) Audio Event Receiving Step (AERS)

(52) The AERS 1 is preferably implemented by a microphone 2 for recording an audio event 3. The audio signal x[n] 4 is generated with a sampling frequency and resolution according to the quality of the signal.

(53) Signal Discretization Step (SDS)

(54) The SDS 5 discretizes the audio signal 4 into time-based windows. The SDS 5 discretizes the audio signal 4 by comparing the energy characteristics (the Note Average Energy approach) of the signal 4 to make a series of SFTDSs 20. The SDS 5 resolves the onset and offset times for each discretizable segment of the audio event 3. The SDS 5 determines the window length of each SFTDS 20 by reference to periodicity in the signal so that rapidly changing signals preferably have smaller window sizes and slowly changing signals have larger windows.

(55) Method for Identifying the Fundamental Frequency Component(s) (MIFFC)

(56) The MIFFC 10 of the second embodiment of the disclosure contains a constant-Q filterbank 35 as described in relation to the first embodiment. The MIFFC 10 of the second embodiment is further capable of performing the same actions as the MIFFC 10 in the first embodiment; that is, it has a filtering block 30 and a DCQBS block 50, which (collectively) are able to resolve multiple SBTDSs 38 from each SFTDS 20; apply fast Fourier transforms to create an equivalent SBFDS 39 for each SBTDS 38; sum together the SBFDSs 39 to form the single spectrum 40 for each SFTDS 20; calculate the bispectrum for each of the SBTDS 38 and then sum these bispectra together and diagonalize the result to form the diagonal bispectrum 56 for each SFTDS 20; and multiply the single spectrum 40 with the diagonal bispectrum 56 to produce the product spectrum 60 for each single frame of the audio fed through the MIFFC 10. FFCs (which can be associated with known audio events) of each SFTDS 20 are then identifiable from the product spectra produced.

(57) Masking Step (MS)

(58) The MS 70 applies a plurality (e.g., 88) of masks to sequentially resolve the presence of known audio events (e.g., notes) in the audio signal 4, one SFTFS 20 at a time. The MS 70 has masks that are made to be specific to the audio event 3 to be analyzed. The masks are made in the same acoustic environment (i.e., having the same echo, noise, and other acoustic dynamics) as that of the audio event 3 to be analyzed. The same audio source that is to be analyzed is used to produce the known audio events forming the masks and the full range of known audio events able to be produced by that audio source are captured by the masks. The MS 70 acts to check and refine the work of the MIFFC 10 to more accurately resolve the known audio events in the audio signal 4. The MS 70 operates in an iterative fashion to remove the frequency content associated with known audio events (each corresponding to a mask) in order to determine which known audio events are present in the audio signal 4.

(59) The MS 70 is set up by first creating a mask bank 75, after which the MS 70 is permitted to operate on the audio signal 4. The mask bank 75 is formed by separately recording, storing and calculating the diagonal bispectrum (DCQBS) 56 for each known audio event that is expected to be present in the audio signal 4 and using these as masks. The number of masks stored is the total number of known audio events that are expected to be present in the audio signal 4 under analysis. The masks applied to the audio signal 4 correspond to the masks associated with the fundamental frequencies indicated to be present in that audio signal 4 by the product spectrum 60 produced by the MIFFC 10, in accordance with the first embodiment of the disclosure described above.

(60) The mask bank 75 and the process of its application to the audio signal 4 use the product spectrum 60 as input audio signal 4. The MS 70 applies a threshold 71 to the signal so that discrete signals having a product spectrum amplitude less than the threshold amplitude are floored to zero. The threshold amplitude is chosen to be a fraction (one tenth) of the maximum amplitude of the audio signal 4.

(61) The MS 70 includes a quantizing algorithm 72 that maps the frequency axis of the product spectrum 60 to audio event-specific ranges. It starts by quantizing the lower frequencies before moving to the higher frequencies. The quantizing algorithm 72 iterates over each SFTDS 20 and resolves the audio event-specific ranges present in the audio signal 4. Then the mask bank 75 is applied, whereby masks are subtracted from the output of the quantizing algorithm 72 for each fundamental frequency indicated as present in the product spectrum 60 of the MIFFC 10. By iterative application of the MS 70, when there is no substantive amplitude remaining in the signal operated on by the MS 70, the SFTDS 20 is completely resolved (and, this is done until all SFTDSs 20 of the audio signal 4 have passed through the MS 70). The result is that, based on the masks applied to fully account for the spectral content of the audio signal 4, an array 76 of known audio events (or notes) associated with the masks is produced by the MS 70. This process continues until the final array 77 associated with all SFTDSs 20 has been produced. The final array 77 of data thereby indicates which known audio events (e.g., notes) are present in the entire audio signal 4. The final array 77 is used to check that the known audio events (notes) identified by the MIFFC 10 were correctly identified.

(62) Transcription Step (TS)

(63) The TS 80 includes a converter 81 for converting the final array 77 of the MS 70 into a file format 82 that is specific to the audio event 3. In the case of musical audio events, such a file form is the MIDI file. Then, the TS 80 uses an interpreter/transcriber 83 to read the MIDI file and then transcribe the audio event 3. The output transcription 84 comprises a visual representation of each known audio event identified (e.g., notes on a music staff).

(64) Each of the AERS 1, SDS 5, MIFFC 10, MS 70 and TS 80 in the second embodiment are realized by a written computer program that can be performed by a computer. In the case of the AERS 1, an appropriate audio event receiving and transducing device is connected to or inbuilt in a computer that is to carry out the AERS 1. The written program contains step by step instructions as to the logical and mathematical operations to be performed by the SDS 5, MIFFC 10, MS 70 and TS 80 on the audio signal 4 generated by the AERS 1 that represents the audio event 3.

(65) Application

(66) This application of the disclosure, with reference to FIG. 2, is a five-step method for converting a 10-second piece of random polyphonic notes played on a piano into sheet music. The method involves polyphonic mask building and polyphonic music transcription.

(67) The first step is the AERS 1, which uses a low-impedance microphone with neutral frequency response setting (suited to the broad frequency range of the piano) to transduce the audio events 3 (piano music) into an electrical signal. The sound from the piano is received using a sampling frequency of 12 kHz (well above the highest frequency note of the 88.sup.th key on a piano, C8, having 4186 Hz), with 16-bit resolution. These numbers are chosen to minimize computation but deliver sufficient performance.

(68) The audio signal 4 corresponding to the received random polyphonic piano notes is discretized into a series of SFTDSs 20. This is the second step of the method illustrated in FIG. 2. The Note Average Energy discretization approach is used to determine the length of each SFTDS 20. The signal is discretized (i.e., all the onset and offset times for the notes have been detected) when all of the SFTDS 20 have been resolved by the SDS 5.

(69) During the third step, the MIFFC 10 of the piano audio signal is applied. The filtering block 30 receives each SFTDS 20 and employs a constant-Q filterbank 35 to filter each SFTDS 20 of the signal into N (here, 88) SBTDSs 38, the number of sub-bands being chosen to correspond to the 88 different piano notes. The filterbank 35 similarly uses a series of 88 filter and decimate blocks 36 and spectrum analyzer blocks 31, and a hanning window 32 with a sample rate of 11 kHz.

(70) Each SBTDS 20 is fed through a fast Fourier transform function 33, which converts the signals to SBFTDs 39, which are summed to realize the constant-Q FFC single spectrum 40. The filtering block 30 provides two outputs: an FFT single spectrum 40 and 88 time-domain sub-band signals 38.

(71) The DCQBS block 50 receives these 88 sub-band time-domain signals 38 and calculates the bispectrum for each, individually. The 88 bispectra are then summed to calculate a full, constant-Q bispectrum 54 and then the diagonal of this matrix is taken, yielding the diagonal bispectrum 56. This signal is then multiplied by the single spectrum 40 from the filtering block 30 to yield the product spectrum 60, which is visually represented on a screen (the visual representation is not depicted in FIG. 2).

(72) From the product spectra 60 for each of the SFTDS 20, the user can identify the known audio events (piano notes) played during the 10 second piece. The notes are identifiable because they are matched to specific FFCs of the audio signal 4 and the FFCs are identifiable from the peaks in the product spectra 60 resulting from the third step of the method. This completes the third step of the method.

(73) While a useful method of confirming the known audio events present in an audio event, the masking step 70 is not necessary to identify the known audio events in an audio event because they can be obtained from product spectra 60 alone. In both polyphonic mask building and polyphonic music transcription, the masking step 70, being step four of the method, is of greater importance for higher polyphony audio events (where numerous FFCs are present in the signal).

(74) The mask bank 75 is formed prior to the AERS 1 receiving the 10 second random selection of notes in step one. It is formed by separately recording and calculating the product spectra 60 for each of the 88 piano notes, from the lowest note, A0, to the highest note, C8, and thereby forming a mask for each of these notes. The mask bank 75 illustrated in FIG. 2 has been formed by: inputting the product spectrum 60 for each of the 88 piano notes into the masking step 70; applying a threshold 71 to the signal by removing amplitudes of the signal that are less than or equal to 0.1 the maximum amplitude of the power spectrum (to minimize the spurious frequency content entering the method); applying the quantizing algorithm 72 to the signal so that the frequency axis of the product spectrum 60 is mapped to audio event-specific ranges (here the ranges are related to the frequency ranges, a negligible error, associated with MIDI numbers for the piano). This is an important step as higher order harmonics of lower notes are not the same as higher note fundamentals, due to equal-temperament tuning. In this application, the mapping is from frequency (Hz) to MIDI note number; the resultant signal is a 108 point array containing peaks at the detected MIDI-range locations; and the note masks (88 108-point MIDI pitch arrays) are then stored for application against the recorded random polyphonic piano notes.

(75) The masks are then used as templates to remove frequency content to progressively remove the superfluous harmonic frequency content in the signal to resolve the notes present in each SFTDS 20 of the random polyphonic piano music.

(76) As a concrete example for illustrative purposes, consider the C4 triad chord, D4 triad chord and G4 triad chord referred to in the context of FIG. 2. From the product spectra 60 for each of the three SFTDS 20, the user can identify the three chords played. The notes are identifiable because they are matched to specific FFCs of the audio signal 4 and the FFCs are identifiable from the peaks in the product spectra 60 resulting from the MIFFC 10. Then, in the masking step 70, three peaks in the array are found: MIDI note-number 60 (corresponding to known audio event C4), MIDI note-number 64 (corresponding to known audio event E4), and MIDI note-number 67 (corresponding to known audio event G4). In the presently described application, the method finds the lowest MIDI-note (lowest pitch) peak in the input signal first. Once found, the corresponding mask from the mask bank 75 is selected and multiplied by the amplitude of the input peak. In this case, the lowest pitch peak is C4, with amplitude of 221 Hz, which is multiplied by the C4 mask. The adjusted amplitude mask is then subtracted from the MIDI-spectrum output. Finally, the threshold-adjusted output MIDI array is calculated. The mask bank 75 has been iteratively applied to resolve all notes, the end result is empty MIDI-note output array, indicating that no more information is present for the first chord; the method then moves to the next chord, the D4 major triad, for processing; and then to the final chord, the G4 major triad, for processing. In this way, the masking step 70 complements and confirms the MIFFC 10 that identified the three chords being present in the audio signal 4. It is intended that the masking step 70 will be increasingly valuable for high polyphony audio events (such as, where four or more notes are played at the same time).

(77) In step five of the process, the transcription step 80, the final array output 77 of the masking step 70 (constituting a series of MIDI note-numbers) is input into a converter 81 so as to convert the array into a MIDI file 82. This conversion adds the quality of timing (obtained from signal onset and offset times for the SFTDS 20) to each of the notes resolved in the final array to create a consolidated MIDI file. A number of open source and proprietary computer programs can perform this task of converting a note array and timing information into a MIDI file format, including Sibelius, FL Studio, Cubase, Reason, Logic, Pro-tools, or a combination of these programs.

(78) The transcription step 80 then interprets the MIDI file (which contains sufficient information about the notes played and their timing to permit their notation on a musical staff, in accordance with usual notation conventions) and produces a sheet music transcription 84, which visually depicts the note(s) contained in each of the SFTDS 20. A number of open source and proprietary transcribing programs can assist in performing this task including Sibelius, Finale, Encore and MuseScore, or a combination of these programs.

(79) Then, the process is repeated for each of the SFTDSs 20 of the discretized signal produced by the second step of the method, until all of the random polyphonic notes played on the piano (constituting the audio event 3) have been transcribed to sheet music 84.

(80) FIG. 3 illustrates a computer-implemented system 10, which is a further embodiment of the disclosure. In the third embodiment of the disclosure, there is a system that includes two computers 20 and 30 connected by a network 40. In this system, the first computer is indicated by 20 and the second computer is labeled 30. The first computer 20 receives the audio event 3 and converts it into an audio signal (not shown in FIG. 3). Then, the SDS, MIFFC, MS and TS are performed on the audio signal, producing a transcription of the audio signal (also not shown in FIG. 3). The first computer 20 sends the transcribed audio signal over the network to the second computer 30, which has a database of transcribed audio signals stored in its memory. The second computer 30 is able to compare and match the transcription sent to it to a transcription in its memory. The second computer 30 then communicates over the network 40 to the first computer 10 the information from the matched transcription to enable the visual representation 50 of the matched transcription. This example describes how a song-matching system may operate, whereby the audio event 3 received by the first computer is an excerpt of a musical song, and the transcription (matched by the second computer) displayed on the screen of the first computer is sheet music for that musical song.

(81) FIG. 4 illustrates a computer-readable medium 10 embodying this disclosure; namely, software code for operating the MIFFC. The computer-readable medium 10 comprises a universal serial bus stick containing code components (not shown) configured to enable a computer 20 to perform the MIFFC and visually represent the identified FFCs on the computer screen 50.

(82) Throughout the specification and claims, the word comprise and its derivatives are intended to have an inclusive rather than exclusive meaning unless the contrary is expressly stated or the context requires otherwise. That is, the word comprise and its derivatives will be taken to indicate the inclusion of not only the listed components, steps or features that it directly references, but also other components, steps or features not specifically listed, unless the contrary is expressly stated or the context requires otherwise.

(83) In this specification, the term computer-readable medium may be used to refer generally to media devices including, but not limited to, removable storage drives and hard disks. These media devices may contain software that is readable by a computer system and the disclosure is intended to encompass such media devices.

(84) An algorithm or computer-implementable method is here, and generally, considered to be a self-consistent sequence of acts or operations leading to a desired result. These include physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as values, elements, terms, numbers, or the like.

(85) Unless specifically stated otherwise, use of terms throughout the specification such as transforming, computing, calculating, determining, resolving, or the like, refer to the action and/or processes of a computer or computing system, or similar numerical calculating apparatus, that manipulate and/or transform data represented as physical, such as electronic, quantities within the computing system's registers and/or memories into other data similarly represented as physical quantities within the computing system's memories, registers or other such information storage, transmission or display devices. It should be understood, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities.

(86) It will be appreciated by those skilled in the art that many modifications and variations may be made to the embodiments described herein without departing from the spirit or scope of the disclosure.

Audio signal processing methods and systems

Inventors

Cpc classification

Classification Explorer

G10H1/383

PHYSICS

Classification Explorer

G10H2210/041

PHYSICS

Classification Explorer

G10H2250/251

PHYSICS

Classification Explorer

G10H2210/081

PHYSICS

Classification Explorer

G10H2210/066

PHYSICS

Classification Explorer

H04R3/04

ELECTRICITY

Classification Explorer

G10H2250/285

PHYSICS

Classification Explorer

G10L25/90

PHYSICS

Classification Explorer

G10H2250/215

PHYSICS

Classification Explorer

G10H2250/225

PHYSICS

Classification Explorer

G10H1/125

PHYSICS

Classification Explorer

G10L25/18

PHYSICS

Classification Explorer

G10H2250/235

PHYSICS

Classification Explorer

G10H2210/086

PHYSICS

International classification

Classification Explorer

H04R3/04

ELECTRICITY

Classification Explorer

G10H1/38

PHYSICS

Classification Explorer

G10L25/18

PHYSICS

Classification Explorer

G10H1/02

PHYSICS

Classification Explorer

G10H1/12

PHYSICS

Abstract

Claims

Description