CONTEXT-DEPENDENT PIANO MUSIC TRANSCRIPTION WITH CONVOLUTIONAL SPARSE CODING
20170243571 · 2017-08-24
Inventors
- Andrea Cogliati (Rochester, NY, US)
- Zhiyao Duan (Penfield, NY, US)
- Brendt Egon Wohlberg (Santa Fe, NM, US)
CPC classification
- G10H2240/145 (PHYSICS)
- G10H2250/145 (PHYSICS)
- G10H2210/066 (PHYSICS)
- G10H2210/051 (PHYSICS)
- G10H2210/086 (PHYSICS)
Abstract
The present disclosure presents a novel approach to automatic transcription of piano music in a context-dependent setting. Embodiments described herein may employ an efficient algorithm for convolutional sparse coding to approximate a music waveform as a summation of piano note waveforms convolved with associated temporal activations. The piano note waveforms may be pre-recorded for the particular piano that is to be transcribed and may optionally be pre-recorded in the specific environment where the performance is to take place. During transcription, the note waveforms may be fixed, and the associated temporal activations may be estimated and post-processed to obtain the pitch and onset transcription. Experiments have shown that embodiments of the disclosure significantly outperform state-of-the-art music transcription methods trained in the same context-dependent setting, in both transcription accuracy and time precision, in various scenarios including synthetic, anechoic, noisy, and reverberant environments.
Claims
1. A method of transcribing a musical performance played on a piano, the method comprising: generating a waveform dictionary for use with the piano playing the musical performance, the waveform dictionary being generated in a supervised manner by recording a plurality of waveforms in a non-transitory computer-readable storage medium, each of the plurality of waveforms being associated with a key of the piano; recording the musical performance played on the piano; determining a plurality of activation vectors associated with the recorded performance using the plurality of recorded waveforms, each of the plurality of activation vectors corresponding to a key of the piano and comprising one or more activations of the corresponding key over time; detecting local maxima from the plurality of activation vectors; inferring note onsets from the detected local maxima; and outputting the inferred note onsets and the determined plurality of activation vectors.
2. The method of claim 1, wherein the plurality of recorded waveforms are associated with each individual piano note of the piano.
3. The method of claim 1, wherein the plurality of recorded waveforms each have a duration of 0.5 second or more.
4. The method of claim 1, wherein the plurality of activation vectors are determined using a convolutional sparse coding algorithm.
5. The method of claim 1, wherein detecting local maxima from the plurality of activation vectors comprises discarding subsequent maxima following an initial local maximum that are within a predetermined time window.
6. The method of claim 5, wherein the predetermined time window is at least 50 ms.
7. The method of claim 1, wherein detecting local maxima from the plurality of activation vectors comprises discarding local maxima that are below a threshold that is associated with a highest peak in the plurality of activation vectors.
8. The method of claim 7, wherein the threshold is 10% of the highest peak in the plurality of activation vectors such that local maxima that are 10% or less than the highest peak in the plurality of activation vectors are discarded.
9. A system for transcribing a musical performance played on a piano, the system comprising: an audio recorder for recording a plurality of waveforms associated with keys of the piano and for recording the musical performance played on the piano; a non-transitory computer-readable storage medium operably coupled with the audio recorder for storing the plurality of waveforms associated with keys of the piano to form a dictionary of elements and for storing the musical performance played on the piano; a computer processor operably coupled with the non-transitory computer-readable storage medium and configured to: determine a plurality of activation vectors associated with the stored performance using the plurality of stored waveforms, each of the plurality of activation vectors corresponding to a key of the piano and comprising one or more activations of the corresponding key over time; detect local maxima from the plurality of activation vectors; infer note onsets from the detected local maxima; and output the inferred note onsets and the determined plurality of activation vectors.
10. The system of claim 9, wherein the plurality of stored waveforms are associated with all individual piano notes of the piano.
11. The system of claim 9, wherein the plurality of stored waveforms each have a duration of one second or more.
12. The system of claim 9, wherein the plurality of activation vectors are determined by the computer processor using a convolutional sparse coding algorithm.
13. The system of claim 9, wherein the computer processor detects local maxima from the plurality of activation vectors by discarding subsequent maxima following an initial local maximum that are within a predetermined time window.
14. The system of claim 13, wherein the predetermined time window is at least 50 ms.
15. The system of claim 9, wherein the computer processor detects local maxima from the plurality of activation vectors by discarding local maxima that are below a threshold that is associated with a highest peak in the plurality of activation vectors.
16. The system of claim 15, wherein the threshold is 10% of the highest peak in the plurality of activation vectors such that local maxima that are 10% or less than the highest peak in the plurality of activation vectors are discarded.
17. A non-transitory computer-readable storage medium comprising a set of computer executable instructions for transcribing a musical performance played on an instrument, wherein execution of the instructions by a computer processor causes the computer processor to carry out the steps of: generating a waveform dictionary for use with the instrument playing the musical performance, the waveform dictionary being trained in a supervised manner by recording a plurality of waveforms in a non-transitory computer-readable storage medium, each of the plurality of waveforms being associated with a key of the instrument; recording the musical performance played on the instrument; determining a plurality of activation vectors associated with the recorded performance using the plurality of recorded waveforms, each of the plurality of activation vectors corresponding to a key of the instrument and comprising one or more activations of the corresponding key over time; detecting local maxima from the plurality of activation vectors; inferring note onsets from the detected local maxima; and outputting the inferred note onsets and the determined plurality of activation vectors.
18. The non-transitory computer-readable storage medium of claim 17, wherein the plurality of activation vectors are determined using a convolutional sparse coding algorithm.
19. The non-transitory computer-readable storage medium of claim 17, wherein detecting local maxima from the plurality of activation vectors comprises discarding local maxima that are below a threshold that is associated with a highest peak in the plurality of activation vectors.
20. The non-transitory computer-readable storage medium of claim 17, wherein detecting local maxima from the plurality of activation vectors comprises discarding subsequent maxima following an initial local maximum that are within a predetermined time window.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF THE DISCLOSURE
[0054] The subject matter of embodiments of the present invention is described here with specificity, but the claimed subject matter may be embodied in other ways, may include different elements or steps, and may be used in conjunction with other existing or future technologies. While the below embodiments are described in the context of automated transcription of a piano performance, those of skill in the art will recognize that the systems and methods described herein can also transcribe performances by other instruments.
[0056] The present disclosure describes a novel time-domain approach for transcribing polyphonic piano performances at the note-level. More specifically, the piano audio waveform may be modeled as a convolution of note waveforms (i.e., dictionary templates) and their activation weights (i.e., transcription of note onsets). Embodiments of the disclosure are useful for musicians, both professionals and amateurs, to transcribe their performances with much higher accuracy than state-of-the-art approaches. Compared to current state-of-the-art AMT approaches, embodiments of the disclosure may have one or more of the following advantages:
[0057] The transcription may be performed in the time domain and may avoid the time-frequency resolution trade-off by imposing structural constraints on the analyzed signal—i.e., a context specific dictionary and sparsity on the atom activations—resulting in better performance, especially for low-pitched notes;
[0058] Temporal evolution of piano notes may be modeled and pitch and onset may be estimated simultaneously in the same framework;
[0059] A much higher transcription accuracy and time precision may be achieved compared to a state-of-the-art AMT approach;
[0060] Embodiments may work in reverberant environments and may be robust to stationary noise to a certain degree.
[0061] As set forth above, a monaural, polyphonic piano audio recording, s(t), may be approximated with a sum of dictionary elements, d_m(t), representing the waveform of each individual note of the piano, convolved with their activation vectors, x_m(t):
s(t) ≈ Σ_m d_m(t) * x_m(t). (1)
[0062] The dictionary elements, d_m(t), may be pre-set by sampling 12 the individual notes of a piano (e.g., all or a portion thereof) and may be fixed during transcription. In some embodiments, the dictionary elements may be pre-learned in a supervised manner by sampling 12 each individual note of a piano at a certain dynamic level, e.g., forte (80-91 dB sound pressure level (SPL) at 10 feet away from the piano), for 1 s. For example, in certain experimental implementations, a sampling frequency of 11,025 Hz was used to reduce the computational workload. The length of sampling may be selected by a parameter search. The choice of dynamic level is not critical; however, louder dynamics may produce better results than softer dynamics in certain embodiments. This may be due to the higher signal-to-noise and signal-to-quantization-noise ratios of the louder note templates. Softer dynamics in music notation include piano (p) and mezzo piano (mp); their intensity ranges may be 44-55 dB SPL and 55-67 dB SPL, respectively.
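The model of Eq. (1) can be sketched in a few lines of NumPy. The templates below are synthetic decaying sinusoids standing in for the recorded note waveforms, and all names and chosen frequencies are illustrative, not part of the disclosure:

```python
import numpy as np

def synthesize(dictionary, activations):
    """Render Eq. (1): s(t) ~ sum_m d_m(t) * x_m(t), via 'full' convolution."""
    note_len = dictionary.shape[1]
    act_len = activations.shape[1]
    s = np.zeros(act_len + note_len - 1)
    for d_m, x_m in zip(dictionary, activations):
        s += np.convolve(d_m, x_m)  # convolve each note template with its activations
    return s

fs = 11025                        # sampling rate used in the experiments
t = np.arange(int(0.5 * fs)) / fs
# Stand-in "note" templates: decaying sinusoids (the real templates are
# recorded from the piano itself, 1 s long, at forte)
dictionary = np.stack([np.exp(-3.0 * t) * np.sin(2 * np.pi * f * t)
                       for f in (262.0, 330.0)])   # roughly C4 and E4
activations = np.zeros((2, fs))   # one second of activations per note
activations[0, 0] = 1.0           # first note at t = 0
activations[1, fs // 2] = 0.8     # second note at t = 0.5 s, softer
s = synthesize(dictionary, activations)
```

Transcription inverts this synthesis: given s and the fixed dictionary, the activation vectors are recovered.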
[0063] Another possible reason may be the richer spectral content of louder note templates. When trying to approximate a soft note with a louder template, the reconstructed signal may contain extra partials that are cancelled by negative activations of other notes to lower the data fidelity error. On the other hand, when trying to reconstruct a loud note with a softer template, the reconstructed signal may lack partials that need to be introduced with positive activations of other notes to increase the data fidelity. Optionally, embodiments described herein may be configured to only consider positive activations so negative activations do not introduce transcription errors, while positive activations might introduce false positives.
[0064] In certain embodiments, a dictionary may be trained 12 for a specific piano and acoustic environment. In fact, the training process may take less than 3 minutes in some embodiments (e.g., to record all notes of an 88-note piano). For example, in some embodiments, each note of a piano may be played for about 1 second to train a dictionary. In some scenarios, such as piano practice, the acoustic environment of the piano may not substantially change, and a previously trained dictionary may be reused. Even for a piano performance in a new acoustic environment, taking an insubstantial amount of time (e.g., less than 5 minutes, and in some embodiments about 3 minutes or less) to train the dictionary in addition to stage setup is acceptable for highly accurate transcription of the performance throughout the concert.
[0065] In some embodiments, the monaural, polyphonic piano audio recording, s(t), is recorded 14 under the same conditions in which the plurality of waveforms are recorded 12. Embodiments of the disclosure may be less sensitive to reverb when the audio to be transcribed is recorded 14 in the same environment used for the dictionary training session, as discussed in further detail below.
[0066] Once the dictionary is trained 12 and the piano performance is recorded 14, the recorded performance may be processed 16 using the plurality of recorded waveforms 12.
[0067] The activations, x_m(t), may be estimated 22 using an efficient convolutional sparse coding algorithm [51], [55]. The following provides background for convolutional sparse coding and an efficient algorithm for its application to automatic music transcription.
[0068] A. Convolutional Sparse Coding
[0069] Sparse coding, the inverse problem of sparse representation of a particular signal, has been approached in several ways. One of the most widely used formulations is Basis Pursuit DeNoising (BPDN) [49]:

arg min_x (1/2)‖Dx − s‖₂² + λ‖x‖₁, (2)

where s is a signal to approximate, D is a dictionary matrix, x is the vector of activations of dictionary elements, and λ is a regularization parameter controlling the sparsity of x.
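For illustration, the BPDN objective can be minimized with a basic iterative shrinkage-thresholding (ISTA) loop. This is a didactic sketch on a random dictionary, not the ADMM-based solver discussed below; `soft` and `bpdn_ista` are hypothetical names:

```python
import numpy as np

def soft(v, thresh):
    """Soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def bpdn_ista(D, s, lam, n_iter=500):
    """Minimize (1/2)||D x - s||_2^2 + lam ||x||_1 by ISTA (illustrative only)."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1 / Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        x = soft(x - step * D.T @ (D @ x - s), step * lam)
    return x

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))           # toy overcomplete dictionary
x_true = np.zeros(128)
x_true[[5, 40]] = [1.5, -2.0]                # 2-sparse ground truth
s = D @ x_true
x_hat = bpdn_ista(D, s, lam=0.1)
```

The soft-thresholding step is what produces exact zeros in x, i.e., the sparsity controlled by λ.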
[0070] Convolutional Sparse Coding (CSC), also called shift-invariant sparse representation, extends the idea of sparse representation by using convolution instead of multiplication. Replacing the multiplication operator in Eq. (2) with convolution yields Convolutional Basis Pursuit DeNoising (CBPDN) [50]:

arg min_{x_m} (1/2)‖Σ_m d_m * x_m − s‖₂² + λ Σ_m ‖x_m‖₁, (3)

where {d_m} is a set of dictionary elements, also called filters; {x_m} is a set of activations, also called coefficient maps; and λ controls the sparsity penalty on the coefficient maps x_m. Higher values of λ lead to sparser coefficient maps and a lower-fidelity approximation to the signal, s.
[0071] CSC has been widely applied to various image processing problems, including classification, reconstruction, denoising, and coding [51]. In the audio domain, s represents the audio waveform for analysis, {d_m} represents a set of audio atoms, and {x_m} represents their activations. Its applications to audio signals include music representations [38], [52] and audio classification [53]. However, its adoption for audio has been limited by its computational complexity, which has favored faster factorization techniques such as NMF or PLCA.
[0072] CSC is computationally very expensive, due to the presence of the convolution operator. A straightforward implementation in the time domain [54] has a complexity of O(M²N²L), where M is the number of atoms in the dictionary, N is the size of the signal, and L is the length of the atoms.
[0073] B. Efficient Convolutional Sparse Coding
[0074] While any fast convolutional sparse coding algorithm may be used, an efficient algorithm for CSC has recently been proposed [51], [55]. This algorithm is based on the Alternating Direction Method of Multipliers (ADMM) for convex optimization [56], and it iterates over updates on three sets of variables. One of these updates is trivial, another can be computed in closed form with low computational cost, and the third involves a computationally expensive optimization due to the presence of the convolution operator. A natural way to reduce the computational complexity of convolution is to use the Fast Fourier Transform (FFT), as proposed by Bristow et al. [57], with a computational complexity of O(M³N). The computational cost of this subproblem has been further reduced to O(MN) by exploiting the particular structure of the linear systems resulting from the transformation into the spectral domain [51], [55]. The overall complexity of the resulting algorithm is O(MN log N), since it is dominated by the cost of the FFTs.
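The FFT speedup rests on the convolution theorem: pointwise multiplication of spectra computes linear convolution, once the inputs are zero-padded to the full output length. A minimal check (names are illustrative):

```python
import numpy as np

def fft_convolve(d, x):
    """Linear convolution via the FFT: O(N log N) rather than O(N L)."""
    n = len(d) + len(x) - 1                  # full linear-convolution length
    return np.fft.irfft(np.fft.rfft(d, n) * np.fft.rfft(x, n), n)

rng = np.random.default_rng(1)
d = rng.standard_normal(100)                 # an atom
x = rng.standard_normal(1000)                # an activation vector
direct = np.convolve(d, x)                   # time-domain convolution
spectral = fft_convolve(d, x)                # frequency-domain convolution
```

Both paths compute the same linear convolution; the spectral route is what makes the ADMM subproblem tractable.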
[0075] The activation vectors may be estimated 22 from the audio signal using an open source implementation [58] of the efficient convolutional sparse coding algorithm described above. In some embodiments, the sampling frequency of the audio mixture to be transcribed may be configured to match the sampling frequency used for the training stage (e.g., step 12). Accordingly, the audio mixtures may be downsampled as needed. For example, in some experimental implementations, the audio mixtures were downsampled to the sampling frequency of 11,025 Hz, mentioned above.
[0076] In some embodiments, 500 iterations may be used. Optionally, 200-400 iterations may be used in other embodiments as the algorithm generally converges after approximately 200 iterations. The result of this step is a set of raw activation vectors, which can be noisy due to the mismatch between the atoms in the dictionary and the instances in the audio mixture. Note that no non-negativity constraints may be applied in the formulation, so the activations can contain negative values. Negative activations can appear in order to correct mismatches in loudness and duration between the dictionary element and the actual note in the sound mixture. However, because the waveform of each note may be quite consistent across different instances, the strongest activations may be generally positive.
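As a self-contained illustration of estimating the activations (the experiments used the efficient ADMM solver available in the open-source implementation [58]), the following ISTA loop minimizes the CBPDN objective directly in the time domain. The atom shapes, frequencies, and function names are invented for the example; this is a sketch, not the disclosure's algorithm:

```python
import numpy as np

def soft(v, thresh):
    """Soft-thresholding (proximal operator of the l1 norm)."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def csc_ista(dictionary, s, lam, n_iter=150):
    """Estimate the activations x_m of Eq. (1) by ISTA on the CBPDN objective.
    Illustrative only; the disclosure uses the ADMM solver of [51], [55]."""
    M, L = dictionary.shape
    N = len(s) - L + 1                  # activation length for a 'full' convolution
    # Upper bound on the Lipschitz constant of the gradient, via the FFT
    lip = sum(np.max(np.abs(np.fft.fft(d, len(s))) ** 2) for d in dictionary)
    step = 1.0 / lip
    X = np.zeros((M, N))
    for _ in range(n_iter):
        r = sum(np.convolve(d, x) for d, x in zip(dictionary, X)) - s
        # The gradient w.r.t. x_m is the correlation of the residual with d_m
        grad = np.stack([np.correlate(r, d, mode='valid') for d in dictionary])
        X = soft(X - step * grad, step * lam)
    return X

# Toy problem: two decaying-sinusoid "notes", one impulse activation each
fs = 4000
t = np.arange(int(0.1 * fs)) / fs
atoms = np.stack([np.exp(-20.0 * t) * np.sin(2 * np.pi * f * t)
                  for f in (200.0, 320.0)])
atoms /= np.linalg.norm(atoms, axis=1, keepdims=True)   # unit-norm templates
X_true = np.zeros((2, 1600))
X_true[0, 100] = 1.0
X_true[1, 900] = 0.7
s = sum(np.convolve(d, x) for d, x in zip(atoms, X_true))
X_hat = csc_ista(atoms, s, lam=0.01)
```

Each row of `X_hat` peaks at the corresponding note's onset sample; the raw values around the peaks are the noisy activations that the post-processing stage cleans up.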
[0077] These activation vectors may be impulse trains, with each impulse indicating the onset of the corresponding note at a time. As mentioned above, however, in practice the estimated activations may contain some noise. After post-processing, the activation vectors may resemble impulse trains, and may recover the underlying ground-truth note-level transcription of the piece, an example of which is provided below.
[0078] For post processing, peak picking may be performed 24 by detecting local maxima from the raw activation vectors to infer note onsets. However, because the activation vectors are noisy, multiple closely located peaks are often detected from the activation of one note. To deal with this problem, the earliest peak within a time window may be kept and the others may be discarded 26. This may enforce local sparsity of each activation vector. In some embodiments, a 50 ms time window was selected because it represents a realistic limit on how fast a performer can play the same note repeatedly. For example,
[0079] Thereafter, the resulting peaks may be binarized 28 to keep only peaks that are within a predetermined threshold of the highest peak in the entire activation matrix. For example, in some embodiments, only the peaks that are higher than 10% of the highest peak may be kept. This step 28 may reduce ghost notes (i.e., false positives) and may increase the precision of the transcription. Optionally, the threshold may be between 1%-15% of the highest peak.
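The two post-processing steps described above, earliest-peak-in-window suppression followed by thresholding against the global maximum, can be sketched as follows; the function name and the toy activation matrix are illustrative:

```python
import numpy as np

def pick_onsets(activations, fs, window_s=0.05, rel_thresh=0.10):
    """Post-process raw activation vectors into note onsets:
    (1) per note, keep only the earliest local maximum within each 50 ms window;
    (2) discard kept peaks at or below 10% of the global maximum (binarization)."""
    window = int(window_s * fs)
    global_max = activations.max()
    onsets = []                                   # (note_index, sample_index) pairs
    for m, x in enumerate(activations):
        # Local maxima: samples strictly greater than both neighbors
        peaks = np.flatnonzero((x[1:-1] > x[:-2]) & (x[1:-1] > x[2:])) + 1
        kept = []
        for i in peaks:
            if kept and i - kept[-1] <= window:   # within the window of a kept peak
                continue
            kept.append(i)
        onsets.extend((m, i) for i in kept if x[i] > rel_thresh * global_max)
    return onsets

acts = np.zeros((1, 1000))
acts[0, 100] = 1.0    # true onset
acts[0, 120] = 0.8    # spurious peak 20 ms later (fs = 1000): window-suppressed
acts[0, 300] = 0.5    # second onset
acts[0, 400] = 0.05   # ghost note below 10% of the global maximum: discarded
onsets = pick_onsets(acts, fs=1000)   # [(0, 100), (0, 300)]
```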
[0080] After processing the recorded musical performance 16 to determine the activation vectors and the note onsets, the inferred note onsets and activation vectors 18 may be outputted to a user (e.g., printed, displayed, electronically copied/transmitted, or the like). For example, the activation vectors and note onsets may be outputted in the form of a music notation or a piano roll associated with the musical performance.
[0084] In certain embodiments, methods and systems may be based on the assumption that the waveform of a note of the piano is consistent when the note is played at different times. This assumption is valid, thanks to the mechanism of piano note production [59]. Each piano key is associated with a hammer, one to three strings, and a damper that touches the string(s) by default. When the key is pressed, the hammer strikes the string(s) while the damper is raised from the string(s). The string(s) vibrate freely to produce the note waveform until the damper returns to the string(s), when the key is released. The frequency of the note is determined by the string(s); it is stable and cannot be changed by the performer (e.g., vibrato is impossible). The loudness of the note is determined by the velocity of the hammer strike, which is affected by how hard the key is pressed. Modern pianos generally have three foot pedals: sustain pedal, sostenuto pedal, and soft pedal; some models omit the sostenuto pedal. The sustain pedal is commonly used. When it is pressed, all dampers of all notes are released from all strings, no matter whether a key is pressed or released. Therefore, its usage only affects the offset of a note, if the slight sympathetic vibration of strings across notes is ignored. The sostenuto pedal behaves similarly, but only releases dampers that are already raised without affecting other dampers. The soft pedal changes the way that the hammer strikes the string(s), hence it affects the timbre or the loudness, but its use is rare compared to the use of the other pedals.
[0086] Plumbley et al. [38] suggested a model similar to the one proposed here, but with two major differences. First, they attempted an unsupervised approach, learning the dictionary atoms from the audio mixture using an oracle estimate of the number of individual notes present in the piece; the dictionary atoms were then manually labeled and ordered to represent the individual notes. Second, they used very short dictionary elements (125 ms), which was found insufficient to achieve good transcription accuracy. Moreover, their experimental section was limited to a single piano piece, and no evaluation of the transcription was performed.
EXPERIMENTS
[0087] Experiments were conducted to answer two questions: (1) How sensitive is the proposed method to key parameters such as the sparsity parameter λ, and the length and loudness of the dictionary elements; and (2) how does the proposed method compare with state-of-the-art piano transcription methods in different settings such as anechoic, noisy, and reverberant environments?
[0088] In order to validate the method in a realistic scenario embodiments described herein were tested on pieces performed on a Disklavier, which is an acoustic piano with mechanical actuators that can be controlled via MIDI input. The Disklavier enables a realistic performance on an acoustic piano along with its ground truth note-level transcription. The ENSTDkCl collection of the MAPS dataset [23] was used. This collection contains 30 pieces of different styles and genres generated from high quality MIDI files that were manually edited to achieve realistic and expressive performances. The audio was recorded in a close microphone setting to minimize the effects of reverb.
[0089] F-measure was used to evaluate the note-level transcription [4]. It is defined as the harmonic mean of precision and recall, where precision is defined as the percentage of correctly transcribed notes among all transcribed notes, and recall is defined as the percentage of correctly transcribed notes among all ground-truth notes. A note is considered correctly transcribed if its estimated discretized pitch is the same as a reference note in the ground-truth and the estimated onset is within a given tolerance value (e.g., ±50 ms) of the reference note. Offsets were not considered in deciding the correctness.
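The note-level F-measure described above can be computed directly. This sketch assumes notes are represented as (pitch, onset-in-seconds) pairs and greedily matches each estimated note to at most one reference note:

```python
def note_f_measure(est, ref, tol=0.05):
    """Note-level F-measure [4]: a note is correct if its pitch matches a
    reference note and its onset is within +/- tol seconds of that note's onset.
    Each reference note can be matched at most once; offsets are ignored."""
    unmatched = list(ref)
    correct = 0
    for pitch, onset in est:
        for j, (ref_pitch, ref_onset) in enumerate(unmatched):
            if pitch == ref_pitch and abs(onset - ref_onset) <= tol:
                correct += 1
                unmatched.pop(j)
                break
    precision = correct / len(est) if est else 0.0
    recall = correct / len(ref) if ref else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two of three estimates match the reference: precision 2/3, recall 1, F = 0.8
est = [(60, 0.00), (64, 0.52), (67, 1.00)]
ref = [(60, 0.01), (64, 0.50)]
f = note_f_measure(est, ref)
```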
[0090] A. Parameter Dependency
[0091] To investigate the dependency of the performance on the parameter λ, a grid search was performed with values of λ logarithmically spaced from 0.4 to 0.0004 on the original ENSTDkCl collection in the MAPS dataset [23]. The results are shown in
[0092] The performance of the method and system with respect to the length of the dictionary elements was also investigated.
[0093] Finally, the effect of the dynamic level of the dictionary atoms was investigated. In general, the proposed method was found to be very robust to differences in dynamic levels, but better results may be obtained when louder dynamics were used during the training. A possible explanation can be seen in
[0094] B. Comparison to the State of the Art
[0095] Embodiments of the method described herein were compared with a state-of-the-art AMT method proposed by Benetos and Dixon [29], which was submitted for evaluation to MIREX 2013 as BW3. The method will be referred to as BW3-MIREX13. This method is based on probabilistic latent component analysis of a log-spectrogram energy and uses pre-extracted note templates from isolated notes. The templates are also pre-shifted along the log-frequency in order to support vibrato and frequency deviations, which are not an issue for piano music. The method is frame-based and does not model the temporal evolution of notes. To make a fair comparison, dictionary templates of both BW3-MIREX13 and the proposed method were learned on individual notes of the piano that was used in the test pieces. The implementation provided by the author was used along with the provided parameters, with the only exception of the hop size, which was reduced to 5 ms to test the onset detection accuracy.
[0096] 1) Anechoic Settings: In addition to the test with the original MAPS dataset, the proposed method was also tested on the same pieces re-synthesized with a virtual piano, in order to set a baseline of the performance in an ideal scenario, i.e., absence of noise and reverb. For the baseline experiment, all the pieces have been re-rendered from the MIDI files using a digital audio workstation (Logic Pro 9) with a sampled virtual piano plug-in (Steinway Concert Grand Piano from the Garritan Personal Orchestra); no reverb was used at any stage. For this set of experiments multiple onset tolerance values were tested to show the highest onset precision achieved by the proposed method.
[0097] The results are shown in
[0099] The degradation of performance on the acoustic piano with small tolerance values drove further inspection of the algorithm and the ground truth. It was noticed that the audio and the ground truth transcription in the MAPS database are in fact not consistently lined up, i.e., different pieces show a different delay between the activation of the note in the MIDI file and the corresponding onset in the audio file.
[0100] 2) Robustness to Noise: In this section, the robustness of the proposed method to noise was investigated and the results were compared with BW3-MIREX13. Both white and pink noise were tested on the original ENSTDkCl collection of MAPS. White and pink noises can represent typical background noises (e.g., air conditioning) in houses or practice rooms. The results are shown in
[0101] 3) Robustness to Reverberation: In the third set of experiments, the performance of the proposed method was tested in the presence of reverberation. Reverberation exists in almost all real-world performing and recording environments, however, few systems have been designed and evaluated in reverberant environments in the literature. Reverberation is not even mentioned in recent surveys [1], [61]. A real impulse response of an untreated recording space was used with a T60 of about 2.5 s, and convolved it with the dictionary elements and the audio files.
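One reason training in the target room helps is that reverberation commutes with the model of Eq. (1): convolving the templates with the room impulse response and then applying the activations gives the same signal as reverberating the dry mixture, by associativity of convolution. A quick numerical check, using a synthetic stand-in for the measured impulse response (the experiments used a real response with a T60 of about 2.5 s):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical synthetic RIR: exponentially decaying noise (a stand-in only)
h = rng.standard_normal(300) * np.exp(-np.linspace(0.0, 5.0, 300))
d = rng.standard_normal(400)                    # a dry note template
x = np.zeros(1000)
x[50] = 1.0                                     # two activations of the note
x[600] = 0.7

# Training the dictionary in the room, (d * h) * x, is equivalent to
# reverberating the dry model output, (d * x) * h
wet_template_mix = np.convolve(np.convolve(d, h), x)
reverberated_mix = np.convolve(np.convolve(d, x), h)
```

Because the two are identical, a dictionary recorded in the performance room already accounts for the reverberation in the audio to be transcribed.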
[0102] Accordingly, some embodiments of the present disclosure provide an automatic music transcription algorithm based on convolutional sparse coding in the time domain. The proposed algorithm consistently outperforms a state-of-the-art algorithm trained in the same scenario in all synthetic, anechoic, noisy, and reverberant settings, except for the case of pink noise at SNR=0 dB. The proposed method achieves high transcription accuracy and time precision in a variety of different scenarios, and is highly robust to moderate amounts of noise. It may also be highly insensitive to reverb when the dictionary training session is performed in the same environment used for recording the audio to be transcribed.
[0103] In further embodiments, a dictionary may be obtained or provided that contains notes of different lengths and different dynamics which may be used to estimate note offsets or dynamics. In such embodiments, group sparsity constraints may be introduced in order to avoid the concurrent activations of multiple templates for the same pitch.
[0104] While the methods and systems are described above for transcribing piano performances, other embodiments may be utilized on other percussive and plucked pitched instruments such as harpsichord, marimba, classical guitar, bells and carillon, given the consistent nature of their notes and the model's ability to capture temporal evolutions.
[0105] One or more computing devices may be adapted to provide desired functionality by accessing software instructions rendered in a computer-readable form. When software is used, any suitable programming, scripting, or other type of language or combinations of languages may be used to implement the teachings contained herein. However, software need not be used exclusively, or at all. For example, some embodiments of the methods and systems set forth herein may also be implemented by hard-wired logic or other circuitry, including but not limited to application-specific circuits. Combinations of computer-executed software and hard-wired logic or other circuitry may be suitable as well.
[0106] Embodiments of the methods disclosed herein may be executed by one or more suitable computing devices. Such system(s) may comprise one or more computing devices adapted to perform one or more embodiments of the methods disclosed herein. As noted above, such devices may access one or more computer-readable media that embody computer-readable instructions which, when executed by at least one computer, cause the at least one computer to implement one or more embodiments of the methods of the present subject matter. Additionally or alternatively, the computing device(s) may comprise circuitry that renders the device(s) operative to implement one or more of the methods of the present subject matter.
[0107] Any suitable computer-readable medium or media may be used to implement or practice the presently-disclosed subject matter, including but not limited to, diskettes, drives, and other magnetic-based storage media, optical storage media, including disks (e.g., CD-ROMS, DVD-ROMS, variants thereof, etc.), flash, RAM, ROM, and other memory devices, and the like.
[0108] The subject matter of embodiments of the present invention is described here with specificity, but this description is not necessarily intended to limit the scope of the claims. The claimed subject matter may be embodied in other ways, may include different elements or steps, and may be used in conjunction with other existing or future technologies. This description should not be interpreted as implying any particular order or arrangement among or between various steps or elements except when the order of individual steps or arrangement of elements is explicitly described.
[0109] Different arrangements of the components depicted in the drawings or described above, as well as components and steps not shown or described are possible. Similarly, some features and sub-combinations are useful and may be employed without reference to other features and sub-combinations. Embodiments of the invention have been described for illustrative and not restrictive purposes, and alternative embodiments will become apparent to readers of this patent. Accordingly, the present invention is not limited to the embodiments described above or depicted in the drawings, and various embodiments and modifications may be made without departing from the scope of the claims below.