Enhanced chroma extraction from an audio codec
09697840 · 2017-07-04
Assignee
Inventors
CPC classification: G10H1/383; G10H2250/225; G10L19/02; G10H2210/066 (PHYSICS)
International classification: G10L19/02 (PHYSICS)
Abstract
The present document relates to methods and systems for music information retrieval (MIR). In particular, the present document relates to methods and systems for extracting a chroma vector from an audio signal. A method (900) for determining a chroma vector (100) for a block of samples of an audio signal (301) is described. The method (900) comprises receiving (901) a corresponding block of frequency coefficients derived from the block of samples of the audio signal (301) from a core encoder (412) of a spectral band replication based audio encoder (410) adapted to generate an encoded bitstream (305) of the audio signal (301) from the block of frequency coefficients; and determining (904) the chroma vector (100) for the block of samples of the audio signal (301) based on the received block of frequency coefficients.
Claims
1. A method for processing a block of samples of an audio signal, the method being performed at a spectral band replication based audio encoder which includes a core encoder adapted to derive a block of frequency coefficients from the block of samples of the audio signal and to generate an encoded bitstream of the audio signal from the block of frequency coefficients, and the method comprising: receiving the block of frequency coefficients from the core encoder of the spectral band replication based audio encoder; determining a chroma vector for the block of samples of the audio signal based on the received block of frequency coefficients, wherein determining the chroma vector comprises applying frequency dependent psychoacoustic processing to the received block of frequency coefficients or to one or more frequency coefficients which are determined on the basis of the received block of frequency coefficients; determining melodic and/or harmonic content of the block of samples of the audio signal based on the chroma vector for the block of samples of the audio signal; and storing the melodic and/or harmonic content on media or transferring the melodic and/or harmonic content via a network.
2. The method of claim 1, wherein the block of samples of the audio signal comprises N succeeding short-blocks of M samples each, respectively; the received block of frequency coefficients comprises N corresponding short-blocks of M frequency coefficients each, respectively, and wherein the method further comprises: estimating a long-block of frequency coefficients corresponding to the block of samples of the audio signal from the N short-blocks of M frequency coefficients; wherein the estimated long-block of frequency coefficients has an increased frequency resolution compared to the N short-blocks of frequency coefficients; and determining the chroma vector for the block of samples of the audio signal based on the estimated long-block of frequency coefficients.
3. The method of claim 2, wherein estimating the long-block of frequency coefficients comprises interleaving corresponding frequency coefficients of the N short-blocks of frequency coefficients, thereby yielding an interleaved long-block of frequency coefficients.
4. The method of claim 3, wherein estimating the long-block of frequency coefficients comprises decorrelating the N corresponding frequency coefficients of the N short-blocks of frequency coefficients by applying a transform with energy compaction property to the interleaved long-block of frequency coefficients.
5. The method of claim 2, wherein estimating the long-block of frequency coefficients comprises: forming a plurality of sub-sets of the N short-blocks of frequency coefficients; wherein the number of short-blocks per sub-set is selected based on the audio signal; for each sub-set, interleaving corresponding frequency coefficients of the short-blocks of frequency coefficients, thereby yielding an interleaved intermediate-block of frequency coefficients of the sub-set; and for each sub-set, applying a transform with energy compaction property, e.g. a DCT-II transform, to the interleaved intermediate-block of frequency coefficients of the sub-set, thereby yielding a plurality of estimated intermediate-blocks of frequency coefficients for the plurality of sub-sets.
6. The method of claim 5, wherein the frequency dependent psychoacoustic processing is applied to one of the plurality of estimated intermediate-blocks of frequency coefficients.
7. The method of claim 2, wherein estimating the long-block of frequency coefficients comprises applying a polyphase conversion to the N short-blocks of M frequency coefficients, wherein the polyphase conversion is based on a conversion matrix for mathematically transforming the N short-blocks of M frequency coefficients to an accurate long-block of NM frequency coefficients; and the polyphase conversion makes use of an approximation of the conversion matrix with a fraction of conversion matrix coefficients set to zero.
8. The method of claim 2, wherein estimating the long-block of frequency coefficients comprises: forming a plurality of sub-sets of the N short-blocks of frequency coefficients; wherein the number L of short-blocks per sub-set is selected based on the audio signal, L<N; applying an intermediate polyphase conversion to the plurality of sub-sets, thereby yielding a plurality of estimated intermediate-blocks of frequency coefficients; wherein the intermediate polyphase conversion is based on an intermediate conversion matrix for mathematically transforming L short-blocks of M frequency coefficients to an accurate intermediate-block of LM frequency coefficients; and wherein the intermediate polyphase conversion makes use of an approximation of the intermediate conversion matrix with a fraction of intermediate conversion matrix coefficients set to zero.
9. The method of claim 2, further comprising: estimating a super long-block of frequency coefficients corresponding to a plurality of blocks of samples from a corresponding plurality of long-blocks of frequency coefficients; wherein the estimated super long-block of frequency coefficients has an increased frequency resolution compared to the plurality of long-blocks of frequency coefficients.
10. The method of claim 9, wherein the frequency dependent psychoacoustic processing is applied to the estimated super long-block of frequency coefficients.
11. The method of claim 2, wherein the frequency dependent psychoacoustic processing is applied to the estimated long-block of frequency coefficients.
12. The method of claim 1, wherein applying frequency dependent psychoacoustic processing comprises: comparing a value derived from at least one frequency coefficient of the received block of frequency coefficients or from at least one frequency coefficient being determined on the basis of the received block of frequency coefficients to a frequency dependent energy threshold; and setting the frequency coefficient to zero if the frequency coefficient is below the energy threshold.
13. The method of claim 12, wherein the derived value corresponds to an average energy derived from a plurality of frequency coefficients for a corresponding plurality of frequencies.
14. The method of claim 1, wherein determining the chroma vector comprises: classifying plural frequency coefficients of the received block of frequency coefficients or being determined on the basis of the received block of frequency coefficients to tone classes of the chroma vector; and determining cumulated energies for the tone classes of the chroma vector based on the classified frequency coefficients.
15. An audio encoder adapted to encode an audio signal, the audio encoder comprising: a core encoder adapted to encode a downsampled component of the audio signal, wherein the core encoder is adapted to encode a block of samples of the downsampled component of the audio signal by transforming the block of samples of the downsampled component of the audio signal from the time domain into the frequency domain, thereby yielding a corresponding block of frequency coefficients in the frequency domain; and a processor adapted to determine a chroma vector of the block of samples of the downsampled component of the audio signal based on the block of frequency coefficients received from the core encoder, wherein the processor is further adapted to determine the chroma vector by applying frequency dependent psychoacoustic processing to the received block of frequency coefficients or to one or more frequency coefficients which are determined on the basis of the received block of frequency coefficients; wherein the chroma vector of the block of samples of the audio signal is indicative of melodic and/or harmonic content of the block of samples of the audio signal; wherein the melodic and/or harmonic content is to be stored on media or transferred via a network.
16. The encoder of claim 15, further comprising a spectral band replication encoder adapted to encode a corresponding high frequency component of the audio signal and also comprising a multiplexer adapted to generate an encoded bitstream from data provided by the core encoder and the spectral band replication encoder, wherein the multiplexer is adapted to add information derived from the chroma vector as metadata to the encoded bitstream.
17. An audio decoder adapted to decode an audio signal, the audio decoder being adapted to receive an encoded bitstream and adapted to extract a block of frequency coefficients from the encoded bitstream; wherein the extracted block of frequency coefficients is associated with a corresponding block of samples of a downsampled component of the audio signal; and the audio decoder comprising: a processor adapted to determine a chroma vector of the block of samples of the audio signal based on the extracted block of frequency coefficients, wherein the processor is further adapted to determine the chroma vector by applying frequency dependent psychoacoustic processing to the extracted block of frequency coefficients or to one or more frequency coefficients which are determined on the basis of the extracted block of frequency coefficients; wherein the processor is further adapted to determine melodic and/or harmonic content of the block of samples of the audio signal based on the chroma vector for the block of samples of the audio signal; wherein the melodic and/or harmonic content is to be stored on media or transferred via a network.
18. A non-transitory computer readable medium storing a software program adapted for execution on a processor and for performing the method steps of claim 1 when carried out on the processor.
19. A computer program product including a non-transitory computer readable medium comprising executable instructions for performing the method steps of claim 1 when executed on a computer.
Description
DESCRIPTION OF THE DRAWINGS
(1) The invention is explained below in an exemplary manner with reference to the accompanying drawings.
DETAILED DESCRIPTION OF THE INVENTION
(11) Today's storage solutions have the capacity to provide huge databases of musical content to users. Online streaming services like Simfy offer more than 13 million songs (audio files or audio signals), and these streaming services are faced with the challenge of navigating through large databases and of selecting and streaming appropriate music tracks to their subscribers. Similarly, users with a large personal collection of music stored in a database face the same problem of selecting appropriate music. In order to be able to handle such large amounts of data, new ways of discovering music are desirable. In particular, it may be beneficial for a music retrieval system to propose similar kinds of music to a user when the user's preferred taste of music is known.
(12) In order to identify musical similarity, numerous high-level semantic features such as tempo, rhythm, beat, harmony, melody, genre and mood may be required and may need to be extracted from the musical content. Music Information Retrieval (MIR) offers methods to compute many of these musical features. Most MIR strategies rely on a mid-level descriptor, from which the necessary high-level musical features are obtained. One example of a mid-level descriptor is the so-called chroma vector 100 illustrated in FIG. 1.
(13) As illustrated in FIG. 1, the chroma vector 100 indicates the distribution of the spectral energy of a block of the audio signal across the twelve semitone classes C, C#, D, D#, E, F, F#, G, G#, A, A# and B.
(14) This distribution of semitone classes captures the harmonic content of an audio signal. The progression of chroma vectors over time is known as a chromagram. The chroma vectors and the chromagram representation may be used to identify chord names (e.g., a C major chord comprising large chroma vector values of C, E, and G), to estimate the overall key of an audio signal (the key identifies the tonic triad, the chord, major/minor, which represents the final point of rest of a musical piece, or the focal point of a section of the musical piece), to estimate the mode of an audio signal (wherein the mode is a type of scale, e.g. a musical piece in a major or minor key), to detect intra- and inter-song similarity (harmony/melody similarity within a song or harmony/melody similarity over a collection of songs to create a playlist of similar songs), to identify a song and/or to extract a chorus of the song.
(15) As such, chroma vectors can be obtained by spectrally folding a short-term spectrum of the audio signal into a single octave and subsequently fragmenting the folded spectrum into a twelve-dimensional vector. This operation relies on an appropriate time-frequency representation of the audio signal, preferably one having a high resolution in the frequency domain. The computation of such a time-frequency transformation of the audio signal is computationally intensive and accounts for most of the computational cost of known chromagram computation schemes.
(16) In the following, the basic scheme for determining a chroma vector is described. As can be seen from Table 1 (frequencies in Hz for semitones of Western music in the fourth octave), a direct mapping of tones to frequencies is possible when the reference pitch, generally 440 Hz for the tone A4, is known.
(17) TABLE 1

    Hz    264  275  297  317  330  352  367  396  422  440  475  495  528
    tone    C   C#    D   D#    E    F   F#    G   G#    A   A#    B    C
The factor between the frequencies of two adjacent semitones is

(18) 2^(1/12) ≈ 1.0595,

and thus the factor between two octaves is

(19) (2^(1/12))^12 = 2.

Since doubling the frequency is equivalent to raising a tone by one octave, this system can be seen as periodic and can be displayed in the cylindrical coordinate system 102, where the radial axis represents one of the 12 tones or one of the chroma values (referred to as c) and where the longitudinal position represents the tone height (referred to as h). Consequently, the perceived pitch or frequency f can be written as f = 2^(c+h), with c in [0, 1) and h an integer.
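For illustration, the two factors and the pitch formula f = 2^(c+h) can be checked numerically. The following sketch (in Python, which the document itself does not prescribe) simply encodes the equal-temperament relations above:

```python
# Factor between the frequencies of two adjacent semitones: 2^(1/12).
SEMITONE_FACTOR = 2.0 ** (1.0 / 12.0)

# Stacking twelve semitone steps yields one octave, i.e. a factor of 2.
OCTAVE_FACTOR = SEMITONE_FACTOR ** 12

def pitch_frequency(c, h):
    """Perceived pitch f = 2^(c + h): chroma c in [0, 1), tone height h (integer)."""
    if not 0.0 <= c < 1.0:
        raise ValueError("chroma c must lie in [0, 1)")
    return 2.0 ** (c + h)
```

Raising the tone height h by one doubles the frequency, which is exactly the octave periodicity exploited by the helix representation.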
(20) When analyzing an audio signal (e.g. a musical piece) concerning its melody and harmony, a visual display showing its harmonic information over time is desirable. One way is the so-called chromagram, where the spectral content of one frame is mapped onto a twelve-dimensional vector of semitones, called a chroma vector, and plotted versus time. The chroma value c can be obtained from a given frequency f by transposing the above-mentioned equation as c = log2(f) - floor(log2(f)), where floor(.) is the flooring operation, which corresponds to the spectral folding of the plurality of octaves onto a single octave (depicted by the Helix representation 102). Alternatively, the chroma vector may be determined by using a set of 12 bandpass filters per octave, wherein each bandpass filter is adapted to extract the spectral energy of a particular chroma from the magnitude spectrum of the audio signal at a particular time instant. As such, the spectral energy which corresponds to each chroma (or tone class) may be isolated from the magnitude spectrum and subsequently summed up to yield the chroma value c for the particular chroma. An example bandpass filter 200 for the class of tone A is illustrated in FIG. 2.
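As an illustrative sketch (not the codec's actual implementation), the spectral folding and the accumulation of spectral energy per tone class described above may be written as follows; the reference frequency of 440 Hz for tone class 0 is an assumption for the example:

```python
import math

def chroma_class(freq_hz, ref_hz=440.0):
    """Map a frequency to one of 12 tone classes (0 = class of the reference tone).

    Folds all octaves onto one by discarding the integer part of log2,
    then quantizes the fractional chroma value c in [0, 1) into 12 bins.
    """
    if freq_hz <= 0.0:
        raise ValueError("frequency must be positive")
    ratio = math.log2(freq_hz / ref_hz)   # octaves above/below the reference
    c = ratio - math.floor(ratio)         # chroma value c = log2(f) - floor(log2(f))
    return round(c * 12) % 12             # nearest of the 12 semitone classes

def chroma_vector(freqs, magnitudes):
    """Accumulate spectral energy per tone class into a 12-dimensional chroma vector."""
    v = [0.0] * 12
    for f, m in zip(freqs, magnitudes):
        v[chroma_class(f)] += m * m       # sum magnitude-squared energy per class
    return v
```

For example, 440 Hz (A4) and 880 Hz (A5) fold onto the same tone class, while C4 (about 261.63 Hz) lands three semitone classes above the A class.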
(21) As outlined above, the determination of a chroma vector and a chromagram requires an appropriate time-frequency representation of the audio signal, which is typically linked to high computational complexity. In the present document, it is proposed to reduce the computational effort by integrating the MIR process into an existing audio processing scheme which already makes use of a similar time-frequency transformation. Desirable qualities of such an existing audio processing scheme are a time-frequency representation with a high frequency resolution, an efficient implementation of the time-frequency transformation, and the availability of additional modules that can be used to improve the reliability and quality of the resulting chromagram.
(22) Audio signals (notably music signals) are typically stored and/or transmitted in an encoded (i.e. compressed) format. This means that MIR processes should work in conjunction with encoded audio signals. It is therefore proposed to determine a chroma vector and/or a chromagram of an audio signal in conjunction with an audio encoder which makes use of a time-frequency transformation. In particular, it is proposed to make use of a high efficiency (HE) encoder/decoder, i.e. an encoder/decoder which makes use of spectral band replication (SBR). An example of such an SBR based encoder/decoder is the HE-AAC (advanced audio coding) encoder/decoder. The HE-AAC codec was designed to deliver a rich listening experience at very low bit-rates and is thus widely used in broadcasting, mobile streaming and download services. An alternative SBR based codec is e.g. the mp3PRO codec, which makes use of an mp3 core encoder instead of an AAC core encoder. In the following, reference will be made to an HE-AAC codec. It should be noted, however, that the proposed methods and systems are also applicable to other audio codecs, notably to other SBR based codecs.
(23) As such, it is proposed in the present document to make use of the time-frequency transformation available in HE-AAC in order to determine the chroma vectors/the chromagram of an audio signal. In this way, the computational complexity of the chroma vector determination is significantly reduced. Another advantage of using an audio encoder to obtain chromagrams, besides the saving of computational costs, is the fact that typical audio codecs focus on human perception. This means that typical audio codecs (such as the HE-AAC codec) provide good psychoacoustic tools that may be suitable for further chromagram enhancement. In other words, it is proposed to make use of the psychoacoustic tools available within an audio encoder to enhance the reliability of a chromagram.
(24) Furthermore, it should be noted that also the audio encoder itself benefits from the presence of an additional chromagram computation module since the chromagram computation module allows computing helpful metadata, e.g. chord information, which may be included into the metadata of the bitstream generated by the audio encoder. This additional metadata can be used to offer an enhanced consumer experience at the decoder side. In particular, the additional metadata may be used for further MIR applications.
(26) The chromagram determination module 310 makes use of a time-frequency transformation 311 to determine a short term magnitude spectrum 101 of the audio signal 301. Subsequently, the sequence of chroma vectors (i.e. the chromagram 313) is determined in unit 312 from the sequence of short-term magnitude spectra 101.
(29) The high frequency component of the audio signal is encoded using SBR parameters. For this purpose, the audio signal 301 is analyzed using an analysis filter bank 413 (e.g. a quadrature mirror filter bank (QMF) having e.g. 64 frequency bands). As a result, a plurality of subband signals of the audio signal is obtained, wherein at each time instant t (or at each sample k), the plurality of subband signals provides an indication of the spectrum of the audio signal 301 at this time instant t. The plurality of subband signals is provided to the SBR encoder 414. The SBR encoder 414 determines a plurality of SBR parameters, wherein the plurality of SBR parameters enables the reconstruction of the high frequency component of the audio signal from the (reconstructed) low frequency component at the corresponding decoder 430. The SBR encoder 414 typically determines the plurality of SBR parameters such that a reconstructed high frequency component that is determined based on the plurality of SBR parameters and the (reconstructed) low frequency component approximates the original high frequency component. For this purpose, the SBR encoder 414 may make use of an error minimization criterion (e.g. a mean square error criterion) based on the original high frequency component and the reconstructed high frequency component.
(30) The plurality of SBR parameters and the encoded bitstream of the low frequency component are joined within a multiplexer 415 (e.g. the encoder unit 304) to provide an overall bitstream, e.g. an HE-AAC bitstream 305, which may be stored or which may be transmitted. The overall bitstream 305 also comprises information regarding SBR encoder settings, which were used by the SBR encoder 414 to determine the plurality of SBR parameters. In addition, it is proposed in the present document to add metadata derived from a chromagram 313, 353 of the audio signal 301 to the overall bitstream 305.
(31) A corresponding decoder 430 may generate an uncompressed audio signal at the sampling rate fs_out=fs_in from the overall bitstream 305. The core decoder 431 separates the SBR parameters from the encoded bitstream of the low frequency component. Furthermore, the core decoder 431 (e.g. an AAC decoder) decodes the encoded bitstream of the low frequency component to provide a time domain signal of the reconstructed low frequency component at the internal sampling rate fs of the decoder 430. The reconstructed low frequency component is analyzed using an analysis filter bank 432. It should be noted that, in the dual-rate mode, the internal sampling rate fs of the decoder 430 differs from the input sampling rate fs_in and the output sampling rate fs_out, because the AAC decoder 431 works in the downsampled domain, i.e. at an internal sampling rate fs which is half the input sampling rate fs_in and half the output sampling rate fs_out of the audio signal 301.
(32) The analysis filter bank 432 (e.g. a quadrature mirror filter bank having e.g. 32 frequency bands) typically has only half the number of frequency bands compared to the analysis filter bank 413 used at the encoder 410. This is due to the fact that only the reconstructed low frequency component and not the entire audio signal has to be analyzed. The resulting plurality of subband signals of the reconstructed low frequency component are used in the SBR decoder 433 in conjunction with the received SBR parameters to generate a plurality of subband signals of the reconstructed high frequency component. Subsequently, a synthesis filter bank 434 (e.g. a quadrature mirror filter bank of e.g. 64 frequency bands) is used to provide the reconstructed audio signal in the time domain. Typically, the synthesis filter bank 434 has a number of frequency bands, which is double the number of frequency bands of the analysis filter bank 432. The plurality of subband signals of the reconstructed low frequency component may be fed to the lower half of the frequency bands of the synthesis filter bank 434 and the plurality of subband signals of the reconstructed high frequency component may be fed to the higher half of the frequency bands of the synthesis filter bank 434. The reconstructed audio signal at the output of the synthesis filter bank 434 has an internal sampling rate of 2fs which corresponds to the signal sampling rates fs_out=fs_in.
(33) As such, the HE-AAC codec 400 provides a time-frequency transformation 413 for the determination of the SBR parameters. This time-frequency transformation 413, however, typically has a very low frequency resolution and is therefore not suitable for chromagram determination. On the other hand, the core encoder 412, notably the AAC core encoder, also makes use of a time-frequency transformation (typically an MDCT) with a higher frequency resolution.
(34) The AAC core encoder breaks an audio signal into a sequence of segments, called blocks or frames. A time domain filter, called a window, provides smooth transitions from block to block by modifying the data in these blocks. The AAC core encoder is adapted to dynamically switch between two block lengths: M=1024 samples and M=128 samples, referred to as long-blocks and short-blocks, respectively. As such, the AAC core encoder is adapted to encode audio signals that alternate between tonal passages (steady-state signals with harmonically rich complex spectra, encoded using long-blocks) and impulsive passages (transient signals, encoded using sequences of eight short-blocks).
(35) Each block of samples is converted into the frequency domain using a Modified Discrete Cosine Transform (MDCT). In order to circumvent the problem of spectral leakage, which typically occurs in the context of block-based (also referred to as frame-based) time-frequency transformations, the MDCT makes use of overlapping windows, i.e. the MDCT is an example of a so-called lapped transform. This is illustrated in FIG. 5.
(36) X[k] = sum_{l=0}^{2M-1} x[l] · cos( (π/M) · (l + 1/2 + M/2) · (k + 1/2) ), for k = 0, ..., M-1.

This means that M frequency coefficients X[k] are determined from 2M signal samples x[l].
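A direct, non-optimized sketch of this transform (an O(M^2) double loop over the formula above; real encoders use fast algorithms and apply the analysis window beforehand) might look as follows:

```python
import math

def mdct(x):
    """Naive MDCT: 2M (windowed) time samples -> M frequency coefficients.

    Implements X[k] = sum_{l=0}^{2M-1} x[l] * cos((pi/M) * (l + 1/2 + M/2) * (k + 1/2)).
    """
    two_m = len(x)
    if two_m % 2:
        raise ValueError("input must contain an even number (2M) of samples")
    m = two_m // 2
    return [
        sum(x[l] * math.cos(math.pi / m * (l + 0.5 + m / 2.0) * (k + 0.5))
            for l in range(two_m))
        for k in range(m)
    ]
```

For an AAC long-block, the input would be 2048 windowed samples yielding M=1024 coefficients; for a short-block, 256 samples yielding M=128 coefficients.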
(37) Subsequently, the sequence of blocks of M frequency coefficients X[k] is quantized based on a psychoacoustic model. There are various psychoacoustic models used in audio coding, such as the ones described in the standards ISO/IEC 13818-7:2005, Coding of Moving Pictures and Audio, 2005; ISO/IEC 14496-3:2009, Information technology - Coding of audio-visual objects - Part 3: Audio, 2009; or 3GPP, General Audio Codec audio processing functions; Enhanced aacPlus general audio codec; Encoder Specification AAC part, 2004, which are incorporated by reference. The psychoacoustic models typically take into account the fact that the human ear has a different sensitivity at different frequencies. In other words, the sound pressure level (SPL) required for perceiving an audio signal at a particular frequency varies as a function of frequency. This is illustrated in FIG. 6.
(38) In addition, it should be noted that the hearing capacity of the human ear is subject to masking. Masking may be subdivided into spectral masking and temporal masking. Spectral masking indicates that a masker tone at a certain energy level in a certain frequency interval may mask other tones in the direct spectral neighborhood of the frequency interval of the masker tone. This is illustrated in FIG. 6.
(39) By way of example, the psychoacoustic model from the 3GPP standard may be used. This model determines an appropriate psychoacoustic masking threshold by calculating a plurality of spectral energies X_en for a corresponding plurality of frequency bands b. The spectral energy X_en[b] for a subband b (also referred to as frequency band b in the present document, and referred to as scale factor band in the context of HE-AAC) may be determined from the MDCT frequency coefficients X[k] by summing the squared MDCT coefficients, i.e. as

(40) X_en[b] = sum_{k in band b} X[k]^2.

Using a constant offset simulates a worst-case scenario, namely a tonal signal for the whole audio frequency range. In other words, the psychoacoustic model makes no distinction between tonal and non-tonal components; all signal frames are assumed to be tonal. Since this distinction is not performed, the psychoacoustic model is computationally efficient.
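The per-band energy summation can be sketched as follows; the band-edge indices used in the example are arbitrary illustration values, not the HE-AAC scale factor band tables:

```python
def band_energies(mdct_coeffs, band_edges):
    """Spectral energy per band: X_en[b] = sum of squared MDCT coefficients
    whose bin index falls inside band b.

    band_edges: ascending bin indices; band b covers [band_edges[b], band_edges[b+1]).
    """
    energies = []
    for b in range(len(band_edges) - 1):
        lo, hi = band_edges[b], band_edges[b + 1]
        energies.append(sum(c * c for c in mdct_coeffs[lo:hi]))
    return energies
```

With coefficients [1, 2, 3, 4] and edges [0, 2, 4], the two band energies are 1+4=5 and 9+16=25.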
(41) The used offset value corresponds to an SNR (signal-to-noise ratio) value, which should be chosen appropriately to guarantee high audio quality. For standard AAC, a logarithmic SNR value of 29 dB is defined, and the threshold in the subband b is determined as

(42) Thr_sc[b] = X_en[b] / 10^(29/10).
The 3GPP model simulates the auditory system of a human by comparing the threshold Thr_sc[b] in the subband b with a weighted version of the threshold Thr_sc[b-1] or Thr_sc[b+1] of the neighboring subbands b-1, b+1 and by selecting the maximum. The comparison is done using different frequency-dependent weighting coefficients s_h[b] and s_l[b] for the lower neighbor and for the higher neighbor, respectively, in order to simulate the different slopes of the asymmetric masking curve 602. Consequently, a first filtering operation, starting at the lowest subband and approximating a slope of 15 dB/Bark, is given by

Thr_spr[b] = max(Thr_sc[b], s_h[b] · Thr_sc[b-1]),

and a second filtering operation, starting at the highest subband and approximating a slope of 30 dB/Bark, is given by

Thr_spr[b] = max(Thr_spr[b], s_l[b] · Thr_spr[b+1]).
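The two spreading passes can be sketched as follows; the weighting coefficients s_h and s_l are passed in as plain lists here, whereas the standard tabulates frequency-dependent values:

```python
def spread_thresholds(thr_sc, s_h, s_l):
    """Apply the two masking-spread passes across subbands.

    Pass 1 (lowest subband upwards):   thr[b] = max(thr[b], s_h[b] * thr[b-1])
    Pass 2 (highest subband downwards): thr[b] = max(thr[b], s_l[b] * thr[b+1])
    """
    thr = list(thr_sc)
    for b in range(1, len(thr)):              # first pass, starting at the lowest subband
        thr[b] = max(thr[b], s_h[b] * thr[b - 1])
    for b in range(len(thr) - 2, -1, -1):     # second pass, starting at the highest subband
        thr[b] = max(thr[b], s_l[b] * thr[b + 1])
    return thr
```

A single loud band thus raises the thresholds of its neighbors, with the spread decaying geometrically in both directions.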
(43) In order to obtain the overall threshold Thr[b] for the subband b from the calculated masking threshold Thr_spr[b], also the threshold in quiet 601 (referred to as Thr_quiet[b]) should be taken into account. This may be done by selecting the higher of the two masking thresholds for each subband b, such that the more dominant part of the two curves is taken into account. This means that the overall masking threshold may be determined as

Thr[b] = max(Thr_spr[b], Thr_quiet[b]).
(44) Furthermore, in order to make the overall masking threshold Thr[b] more resistant to the problem of pre-echoes, the following additional modification may be applied. When a transient signal occurs, it is likely that there is a sudden increase or drop of energy in some subbands b from one block to another. Such jumps of energy may lead to a sudden increase of the masking threshold Thr[b], which would lead to a sudden reduction of the quantization quality. This could lead to audible errors in the encoded audio signal in the form of pre-echo artifacts. As such, the masking threshold may be smoothed along the time axis by selecting the masking threshold Thr[b] for a current block as a function of the masking threshold Thr_last[b] of a previous block. In particular, the masking threshold Thr[b] for a current block may be determined as

Thr[b] = max(rpmn · Thr_spr[b], min(Thr[b], rpelev · Thr_last[b])),

wherein rpmn and rpelev are appropriate smoothing parameters. This reduction of the masking threshold for transient signals causes higher SMR (signal-to-mask ratio) values, resulting in a better quantization, and ultimately in fewer audible errors in the form of pre-echo artifacts.
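The combination with the threshold in quiet and the temporal smoothing can be sketched as follows; the default values chosen here for rpmn and rpelev are illustrative placeholders, not the standard's constants:

```python
def final_threshold(thr_spr, thr_quiet, thr_last, rpmn=0.01, rpelev=2.0):
    """Per-band overall masking threshold with pre-echo control.

    Thr[b]   = max(Thr_spr[b], Thr_quiet[b])
    smoothed = max(rpmn * Thr_spr[b], min(Thr[b], rpelev * Thr_last[b]))
    """
    out = []
    for spr, quiet, last in zip(thr_spr, thr_quiet, thr_last):
        thr = max(spr, quiet)                         # dominant of the two curves
        out.append(max(rpmn * spr,                    # floor to avoid collapsing to zero
                       min(thr, rpelev * last)))      # cap sudden jumps vs. previous block
    return out
```

A sudden jump of Thr_spr relative to the previous block is thus capped at rpelev times the previous threshold, which limits pre-echo artifacts.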
(45) The masking threshold Thr[b] is used within the quantization and coding unit 303 for quantizing the MDCT coefficients of a block 501. An MDCT coefficient which lies below the masking threshold Thr[b] is quantized and coded less accurately, i.e. fewer bits are invested. The masking threshold Thr[b] can also be used in the context of perceptual processing 356 prior to (or in the context of) the chromagram computation 352, as will be outlined in the present document.
(46) Overall, it may be summarized that the core encoder 412 provides:
(47) a representation of the audio signal 301 in the time-frequency domain, in the form of a sequence of MDCT coefficients (for long-blocks and for short-blocks); and a signal dependent perceptual model in the form of a frequency (subband) dependent masking threshold Thr[b] (for long-blocks and for short-blocks).
(48) This data can be used for the determination of a chromagram 353 of the audio signal 301. For long-blocks (M=1024 samples), the MDCT coefficients of a block typically have a sufficiently high frequency resolution for determining a chroma vector. Since the AAC core codec 412 in an HE-AAC encoder 410 operates at half the sampling frequency, the MDCT transform-domain representations used in HE-AAC have an even better frequency resolution for long-blocks than in the case of AAC without SBR encoding. By way of example, for an audio signal 301 at a sampling rate of 44.1 kHz, the frequency resolution of the MDCT coefficients for a long-block is Δf = 10.77 Hz/bin, which is sufficiently high for determining a chroma vector for most Western popular music. In other words, the frequency resolution of long-blocks of the core encoder of an HE-AAC encoder is sufficiently high to reliably assign the spectral energy to the different tone classes of a chroma vector (see FIG. 1).
(49) On the other hand, for short-blocks (M=128 samples), the frequency resolution is f=86.13 Hz/bin. As neighboring fundamental frequencies (F0s) are spaced less than 86.13 Hz apart up to the 6th octave, the frequency resolution provided by short-blocks is typically not sufficient for the determination of a chroma vector. Nevertheless, it may be desirable to also be able to determine a chroma vector for short-blocks, as the transient audio signal which is typically associated with a sequence of short-blocks may comprise tonal information (e.g. from a xylophone, a glockenspiel, or music of a techno genre). Such tonal information may be important for reliable MIR applications.
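The resolution figures above follow directly from the MDCT length and the halved sampling rate of the core encoder; as a quick check (a minimal sketch, not from the source):

```python
# Frequency resolution of the core-encoder MDCT in HE-AAC.
# The AAC core operates at half the input sampling rate.
fs = 44100.0                           # input sampling rate in Hz
fs_core = fs / 2                       # core encoder rate: 22050 Hz
delta_f_long = fs_core / (2 * 1024)    # long-block, M = 1024 samples
delta_f_short = fs_core / (2 * 128)    # short-block, M = 128 samples
print(round(delta_f_long, 2))          # → 10.77 (Hz/bin)
print(round(delta_f_short, 2))         # → 86.13 (Hz/bin)
```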
(50) In the following, various example schemes for increasing the frequency resolution of a sequence of short-blocks are described. These example schemes have reduced computational complexity compared to the transformation of the original time domain audio signal block into the frequency domain. This means, these example schemes allow the determination of a chroma vector from the sequence of short-blocks at reduced computational complexity (compared to the determination directly from the time domain signal).
(51) As outlined above, an AAC encoder typically selects a sequence of eight short-blocks instead of a single long-block in order to encode a transient audio signal. As such, a sequence of eight MDCT coefficient blocks X.sub.l[k], l=0, . . . , N−1, with N=8 in the case of AAC, is provided. A first scheme for increasing the frequency resolution of short-block spectra may be to concatenate the N frequency coefficient blocks X.sub.0 to X.sub.N−1 of length M.sub.short (=128), and to interleave the frequency coefficients. This short-block interleaving scheme (SIS) rearranges the frequency coefficients according to their time index into a new block X.sub.SIS of length M.sub.long=N·M.sub.short (=1024). This may be done according to
X.sub.SIS[k·N+l]=X.sub.l[k], k∈[0, . . . ,M.sub.short−1], l∈[0, . . . ,N−1]
This interleaving of frequency coefficients increases the number of frequency coefficients, thus increasing the resolution. However, since N low-resolution coefficients of the same frequency, at different points in time, are mapped to N high-resolution coefficients of different frequencies, at the same point in time, an error with a variance of N/2 bins is introduced. Nevertheless, in the case of HE-AAC or AAC, this method allows a spectrum with M.sub.long=1024 coefficients to be estimated by interleaving the coefficients of N=8 short-blocks with a length of M.sub.short=128.
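The SIS rearrangement can be sketched as follows (a minimal illustration; the function name and list-based representation are assumptions, not from the source):

```python
def sis_interleave(short_blocks):
    """Short-block interleaving scheme (SIS): rearrange N short-block
    spectra X_l (each of length M_short) into one pseudo-long-block
    X_SIS of length N * M_short with X_SIS[k*N + l] = X_l[k].

    short_blocks: list of N lists, each holding M_short coefficients.
    """
    n = len(short_blocks)
    m_short = len(short_blocks[0])
    x_sis = [0.0] * (n * m_short)
    for l, block in enumerate(short_blocks):
        for k, coeff in enumerate(block):
            x_sis[k * n + l] = coeff
    return x_sis
```

With N=8 blocks of M.sub.short=128 coefficients each, the result has the long-block length of 1024 coefficients.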
(52) A further scheme for increasing the frequency resolution of a sequence of N short-blocks is based on the adaptive hybrid transform (AHT). The AHT exploits the fact that if a time signal remains relatively constant, its spectrum will typically not change rapidly. The decorrelation of such a spectral signal leads to a compact representation in the low frequency bins. A suitable transform for decorrelating signals is the DCT-II (Discrete Cosine Transform of type II), which approximates the Karhunen-Loève transform (KLT). The KLT is optimal in the sense of decorrelation; however, it is signal dependent and therefore not applicable without high complexity. The following formula of the AHT can be seen as the combination of the above-mentioned SIS and a DCT-II kernel for decorrelating the frequency coefficients of corresponding short-block frequency bins:
(53)
The block of frequency coefficients X.sub.AHT has an increased frequency resolution, with a reduced error variance compared to the SIS. At the same time, the computational complexity of the AHT scheme is lower compared to a complete MDCT of the long-block of audio signal samples.
(54) As such, the AHT may be applied over the N=8 short-blocks of a frame (which is equivalent to a long-block) to estimate a high-resolution long-block spectrum. The quality of the resulting chromagrams thereby benefits from the approximation of a long-block spectrum, instead of relying on a sequence of short-block spectra. It should be noted that, in general, the AHT scheme may be applied to an arbitrary number of blocks, because the DCT-II is a non-overlapping transform. Therefore, it is possible to apply the AHT scheme to subsets of a sequence of short-blocks. This may be beneficial in order to adapt the AHT scheme to the particular conditions of the audio signal. By way of example, a plurality of different stationary segments within a sequence of short-blocks could be distinguished by computing a spectral similarity measure and by segmenting the sequence of short-blocks into different subsets. These subsets can then be processed with the AHT to increase their frequency resolution.
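A sketch of the AHT as described above, i.e. an orthonormal DCT-II applied across the N time-adjacent coefficients of each short-block frequency bin, with the results placed at the interleaved (SIS) positions. The function name and the plain-Python representation are assumptions, not from the source:

```python
import math

def aht(short_blocks):
    """Adaptive hybrid transform sketch: for each frequency bin k,
    decorrelate the N time-adjacent coefficients X_0[k]..X_{N-1}[k]
    with an orthonormal DCT-II and store the results at the
    interleaved positions k*N + l (SIS ordering)."""
    n = len(short_blocks)
    m_short = len(short_blocks[0])
    x_aht = [0.0] * (n * m_short)
    for k in range(m_short):
        samples = [short_blocks[l][k] for l in range(n)]
        for l in range(n):
            acc = sum(samples[j] * math.cos(math.pi * (j + 0.5) * l / n)
                      for j in range(n))
            scale = math.sqrt(1.0 / n) if l == 0 else math.sqrt(2.0 / n)
            x_aht[k * n + l] = scale * acc
    return x_aht
```

For a spectrum that stays constant over the N short-blocks, the energy of each bin is compacted into the l=0 coefficient, which illustrates the decorrelation property the scheme relies on.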
(55) A further scheme for increasing the frequency resolution of a sequence of MDCT coefficient blocks X.sub.l[k], l=0, . . . , N−1 is to use a polyphase description of the underlying MDCT transformation of the sequence of short-blocks and of the MDCT transformation of the long-block. By doing this, a conversion matrix Y can be determined which performs an exact transformation of the sequence of MDCT coefficient blocks X.sub.l[k], l=0, . . . , N−1 (i.e. the sequence of short-blocks) to the MDCT coefficient block for a long-block, i.e.
X.sub.PPC=Y·[X.sub.0, . . . ,X.sub.N−1],
wherein X.sub.PPC is a [3, MN] matrix representing the MDCT coefficients of a long-block and the influence of the two preceding frames, Y is the [MN, MN, 3] conversion matrix (wherein the third dimension of the matrix Y reflects the fact that the coefficients of the matrix Y are 3.sup.rd order polynomials, i.e. the matrix elements are expressions of the form a·z.sup.2+b·z.sup.1+c·z.sup.0, where z represents a delay of one frame), and [X.sub.0, . . . , X.sub.N−1] is a [1, MN] vector formed of the MDCT coefficients of the N short-blocks. N is the number of short-blocks forming a long-block with length N·M, and M is the number of samples within a short-block.
(56) The conversion matrix Y is determined from a synthesis matrix G for transforming the N short-blocks back into the time domain and an analysis matrix H for transforming the time domain samples of a long-block into the frequency domain, i.e. Y=G·H. The conversion matrix Y allows a perfect reconstruction of the long-block MDCT coefficients from the N sets of short-block MDCT coefficients. Because both matrices G and H comprise weighted DCT-IV transform coefficients, and the DCT is an orthogonal transformation, the resulting conversion matrix Y=G·H is sparse: a significant fraction of its coefficients are nearly zero and can be disregarded in the calculation without significantly affecting the conversion accuracy. Typically, it is sufficient to consider a band of q coefficients around the main diagonal. This approach makes the complexity and the accuracy of the conversion from short-blocks to long-blocks scalable, as q can be chosen from 1 to MN. It can be shown that the complexity of the conversion is O(q·M·N·3), compared to the complexity of a long-block MDCT of O((MN).sup.2), or O(M·N·log(M·N)) in a recursive implementation. This means that the conversion using a polyphase conversion matrix Y may be implemented at a lower computational complexity than the recalculation of an MDCT of the long-block.
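The scalable banded approximation can be sketched as follows; the conversion matrix is assumed to be given as a dense list of rows, and the frame-delay (polynomial) dimension of Y is omitted for simplicity. All names are illustrative assumptions:

```python
def banded_apply(y, x, q):
    """Multiply the conversion matrix y (MN x MN) with the stacked
    short-block coefficients x, but only evaluate a band of roughly
    q coefficients around the main diagonal.  Cost is O(q * M * N)
    per frame instead of O((M*N)**2) for the full matrix product."""
    size = len(x)
    half = q // 2
    out = []
    for i in range(size):
        lo = max(0, i - half)
        hi = min(size, i + half + 1)
        out.append(sum(y[i][j] * x[j] for j in range(lo, hi)))
    return out
```

Choosing q between 1 and MN trades conversion accuracy against computational cost, as described above.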
(57) The details regarding the polyphase conversion are described in G. Schuller, M. Gruhne, and T. Friedrich, "Fast audio feature extraction from compressed audio data," IEEE Journal of Selected Topics in Signal Processing, 5(6):1262-1271, October 2011, which is incorporated by reference.
(58) As a result of the polyphase conversion, an estimate of the long-block MDCT coefficients X.sub.PPC is obtained, which provides N times higher frequency resolution than the short-block MDCT coefficients [X.sub.0, . . . , X.sub.N−1]. This means that the estimated long-block MDCT coefficients X.sub.PPC typically have a sufficiently high frequency resolution for the determination of a chroma vector.
(59)
(60)
(61) The different frequency resolution provided by the various short-block to long-block conversion schemes outlined above is also reflected in the quality of the chroma vectors determined from the various estimates of the long-block MDCT coefficients. This is shown in
(62) As such, methods have been described which allow the determination of a chromagram based on the MDCT coefficients provided by an SBR based core encoder (e.g. an AAC core encoder). It has been outlined how the resolution of a sequence of short-block MDCT coefficients can be increased by approximating the corresponding long-block MDCT coefficients. The long-block MDCT coefficients can be determined at reduced computational complexity compared to a recalculation of the long-block MDCT coefficients from the time domain. As such, it is possible to also determine chroma vectors for transient audio signals at reduced computational complexity.
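Once a long-block (or estimated long-block) spectrum with sufficient resolution is available, mapping the spectral energy onto the twelve tone classes can be sketched as follows. This is an illustrative sketch: the A4 = 440 Hz tuning reference, the lower frequency cut-off, and the nearest-semitone assignment are assumptions, not from the source:

```python
import math

def chroma_vector(energies, delta_f, f_ref=440.0):
    """Accumulate per-bin spectral energies into 12 pitch classes
    (C = 0, ..., B = 11).

    energies: energy per MDCT bin (index 0 corresponds to DC)
    delta_f:  frequency resolution in Hz/bin
    f_ref:    tuning reference, A4 = 440 Hz (pitch class 9)
    """
    chroma = [0.0] * 12
    for k, e in enumerate(energies):
        f = k * delta_f
        if f < 27.5:  # skip DC and sub-audio bins (below A0)
            continue
        note = 69 + 12 * math.log2(f / f_ref)  # MIDI-style note number
        chroma[int(round(note)) % 12] += e
    return chroma
```

The quality of this assignment depends directly on delta_f, which is why the short-block to long-block conversion schemes above matter for transient frames.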
(63) In the following, methods for perceptually enhancing chromagrams are described. In particular, methods that make use of the perceptual model provided by an audio encoder are described.
(64) As has already been outlined above, the purpose of the psychoacoustic model in a perceptual and lossy audio encoder is typically to determine how finely certain parts of the spectrum are to be quantized for a given bit rate. In other words, the psychoacoustic model of the encoder provides a rating of the perceptual relevance of every frequency band b. Under the premise that the perceptually relevant parts mainly comprise harmonic content, the application of the masking threshold should increase the quality of the chromagrams. Chromagrams for polyphonic signals should benefit in particular, since noisy parts of the audio signal are disregarded or at least attenuated.
(65) It has already been outlined how a frame-wise (i.e. block-wise) masking threshold Thr[b] may be determined for the frequency band b. The encoder uses this masking threshold by comparing, for every frequency coefficient X[k], the masking threshold Thr[b] with the energy X.sub.en[b] of the audio signal in the frequency band b (also referred to as a scale factor band in the case of HE-AAC) which comprises the frequency index k. Whenever the energy value X.sub.en[b] falls below the masking value, X[k] is disregarded, i.e. X[k]=0 if X.sub.en[b]<Thr[b]. Typically, a coefficient-wise comparison of the frequency coefficients (i.e. energy values) X[k] with the masking threshold Thr[b] of the corresponding frequency band b only provides minor quality benefits over a band-wise comparison within a chord recognition application based on the chromagrams determined according to the methods described in the present document. On the other hand, a coefficient-wise comparison would lead to increased computational complexity. As such, a band-wise comparison using average energy values X.sub.en[b] per frequency band b may be preferable.
(66) Typically, the energy of a frequency band b (also referred to as scale factor band energy) which comprises a harmonic contributor should be higher than the perceptual masking threshold Thr[b]. On the other hand, the energy of a frequency band b which mainly comprises noise should be smaller than the masking threshold Thr[b]. As such, the encoder provides a perceptually motivated, noise reduced version of the frequency coefficients X[k] which can be used to determine a chroma vector for a given frame (and a chromagram for a sequence of frames).
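A band-wise application of the masking threshold, as described above, can be sketched as follows (the list-based representation and the band_of index map are assumptions, not from the source):

```python
def mask_spectrum(coeffs, band_of, band_energy, thr):
    """Set X[k] = 0 whenever the scale factor band energy X_en[b] of
    the band b containing bin k falls below the masking threshold
    Thr[b] (band-wise comparison).

    coeffs:      frequency coefficients X[k]
    band_of:     band_of[k] = index b of the scale factor band of bin k
    band_energy: X_en[b] per band
    thr:         masking threshold Thr[b] per band
    """
    return [0.0 if band_energy[band_of[k]] < thr[band_of[k]] else c
            for k, c in enumerate(coeffs)]
```

The surviving coefficients form the perceptually motivated, noise-reduced spectrum from which the chroma vector is then determined.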
(67) Alternatively, a modified masking threshold may be determined from the data available at the audio encoder. Given the scale factor band energy distribution X.sub.en[b] for a particular block (or frame), a modified masking threshold Thr.sub.constSMR may be determined using a constant SMR (Signal-to-Mask Ratio) for all scale factor bands b, i.e. Thr.sub.constSMR[b]=X.sub.en[b]−SMR. This modified masking threshold can be determined at low computational cost, as it only requires subtraction operations. Furthermore, the modified masking threshold strictly follows the energy of the spectrum, such that the amount of disregarded spectral data can easily be adjusted by adjusting the SMR value of the encoder.
(68) It should be noted that the SMR of a tone may be dependent on the tone amplitude and tone frequency. As such, alternatively to the above mentioned constant SMR, the SMR may be adjusted/modified based on the scale factor band energy X.sub.en[b] and/or the band index b.
(69) Furthermore, it should be noted that the scale factor band energy distribution X.sub.en[b] for a particular block (frame) can be received directly from the audio encoder. The audio encoder typically determines this scale factor band energy distribution X.sub.en[b] in the context of (psychoacoustic) quantization. The method for determining a chroma vector of a frame may receive the already computed scale factor band energy distribution X.sub.en[b] from the audio encoder (instead of computing the energy values) in order to determine the above mentioned masking threshold, thereby reducing the computational complexity of chroma vector determination.
(70) The modified masking threshold may be applied by setting X[k]=0 if X[k]<Thr.sub.constSMR[b]. If it is assumed that there is only one harmonic contributor per scale factor band b, the energy X.sub.en[b] in this band b and the coefficient X[k] of the energy spectrum should have similar values. Therefore, a reduction of X.sub.en[b] by a constant SMR value should yield a modified masking threshold which catches only the harmonic parts of the spectrum. The non-harmonic part of the spectrum should be set to zero. The chroma vector of a frame (and the chromagram of a sequence of frames) may be determined from the modified (i.e. perceptually processed) frequency coefficients.
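The constant-SMR variant can be sketched in the (logarithmic) dB domain, where applying the SMR is a single subtraction per band; the 20 dB default and the helper names are illustrative assumptions, not from the source:

```python
def const_smr_threshold(band_energy_db, smr_db=20.0):
    """Modified masking threshold Thr_constSMR[b] = X_en[b] - SMR,
    with all quantities in dB."""
    return [e - smr_db for e in band_energy_db]

def harmonic_bins(coeffs_db, band_of, band_energy_db, smr_db=20.0):
    """Indices k of coefficients that exceed the modified threshold
    of their scale factor band; the remaining (assumed non-harmonic)
    bins would be set to zero before chroma computation."""
    thr = const_smr_threshold(band_energy_db, smr_db)
    return [k for k, c in enumerate(coeffs_db) if c >= thr[band_of[k]]]
```

Adjusting smr_db directly controls how much of the spectrum is discarded, mirroring the adjustable SMR value described above.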
(71)
(72) In the present document, various methods and systems for determining a chroma vector and/or a chromagram at reduced computational complexity are described. In particular, it is proposed to make use of the time-frequency representation of an audio signal which is provided by audio codecs (such as the HE-AAC codec). In order to provide a continuous chromagram (also for transient parts of the audio signal, where the encoder has switched to short-blocks, whether desirably or not), methods for increasing the frequency resolution of short-block time-frequency representations are described. In addition, it is proposed to make use of the psychoacoustic model provided by the audio codec, in order to improve the perceptual salience of the chromagram.
(73) It should be noted that the description and drawings merely illustrate the principles of the proposed methods and systems. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the invention and are included within its spirit and scope. Furthermore, all examples recited herein are principally intended expressly to be only for pedagogical purposes to aid the reader in understanding the principles of the proposed methods and systems and the concepts contributed by the inventors to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the invention, as well as specific examples thereof, are intended to encompass equivalents thereof.
(74) The methods and systems described in the present document may be implemented as software, firmware and/or hardware. Certain components may e.g. be implemented as software running on a digital signal processor or microprocessor. Other components may e.g. be implemented as hardware and/or as application-specific integrated circuits. The signals encountered in the described methods and systems may be stored on media such as random access memory or optical storage media. They may be transferred via networks, such as radio networks, satellite networks, wireless networks or wireline networks, e.g. the Internet. Typical devices making use of the methods and systems described in the present document are portable electronic devices or other consumer equipment which are used to store and/or render audio signals.