HARMONIC TRANSPOSITION IN AN AUDIO CODING METHOD AND SYSTEM
20230027660 · 2023-01-26
Assignee
Inventors
Cpc classification
G10L19/24
PHYSICS
International classification
G10L19/022
PHYSICS
G10L19/02
PHYSICS
G10L19/24
PHYSICS
Abstract
The present invention relates to transposing signals in time and/or frequency and in particular to coding of audio signals. More particularly, the present invention relates to high frequency reconstruction (HFR) methods including a frequency domain harmonic transposer. A method and system for generating a transposed output signal from an input signal using a transposition factor T is described. The system comprises an analysis window of length L.sub.a, extracting a frame of the input signal, and an analysis transformation unit of order M transforming the samples into M complex coefficients. M is a function of the transposition factor T. The system further comprises a nonlinear processing unit altering the phase of the complex coefficients by using the transposition factor T, a synthesis transformation unit of order M transforming the altered coefficients into M altered samples, and a synthesis window of length L.sub.s, generating a frame of the output signal.
Claims
1. An audio signal processing device for transposing an input audio signal by a transposition factor T to generate an output audio signal, the audio signal processing device comprising one or more components that: extract a frame of L time-domain samples of the input audio signal using an analysis window of length L having the function
2. The audio signal processing device of claim 1, wherein the oversampling factor F is greater or equal to (T+1)/2, and wherein the transposition factor T is an integer greater than 1.
3. The audio signal processing device of claim 1, wherein the altering of the phase comprises multiplying the phase by the transposition factor T.
4. The audio signal processing device of claim 1, wherein the analysis window has a length L with zero padding by additional (F−1)*L zeros.
5. The audio signal processing device of claim 1, wherein the one or more components further: shift the analysis window by an analysis stride along the input audio signal to generate successive frames of the input audio signal; shift successive frames of L time-domain output samples by a synthesis stride; and overlap and add the successive shifted frames of L time-domain output samples to generate the output signal.
6. The audio signal processing device of claim 5, wherein the one or more components further increase the sampling rate of the output signal by the transposition order T to yield a transposed output signal.
7. The audio signal processing device of claim 6, wherein the synthesis stride is T times the analysis stride.
8. A method, performed by an audio signal processing device, for transposing an input audio signal by a transposition factor T to generate an output audio signal, the method comprising: extracting a frame of L time-domain samples of the input audio signal using an analysis window of length L having the function
9. The method of claim 8, wherein transforming the L time-domain samples into M complex frequency-domain coefficients is performing one of a Fourier Transform, a Fast Fourier Transform, a Discrete Fourier Transform, a Wavelet Transform.
10. The method of claim 8, wherein the oversampling factor F is greater or equal to (T+1)/2, and wherein the transposition factor T is an integer greater than 1.
11. The method of claim 8, wherein the input audio signal comprises a low frequency component of an audio signal.
12. A non-transitory computer readable medium comprising instructions for execution on an audio signal processing device, wherein, when executed by the audio signal processing device, the instructions cause the audio signal processing device to perform the method of claim 8.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0043] The present invention will now be described by way of illustrative examples, not limiting the scope or spirit of the invention, with reference to the accompanying drawings, in which:
[0044]-[0054] [Descriptions of the individual figures are missing from this text.]
DETAILED DESCRIPTION
[0055] The below-described embodiments are merely illustrative for the principles of the present invention for Improved Harmonic Transposition. It is understood that modifications and variations of the arrangements and the details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the impending patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.
[0056] In the following, the principle of harmonic transposition in the frequency domain and the proposed improvements as taught by the present invention are outlined. A key component of the harmonic transposition is time stretching by an integer transposition factor T which preserves the frequency of sinusoids. In other words, the harmonic transposition is based on time stretching of the underlying signal by a factor T. The time stretching is performed such that frequencies of sinusoids which compose the input signal are maintained. Such time stretching may be performed using a phase vocoder. The phase vocoder is based on a frequency domain representation furnished by a windowed DFT filter bank with analysis window v.sub.a(n) and synthesis window v.sub.s(n). Such analysis/synthesis transform is also referred to as short-time Fourier Transform (STFT).
[0057] A short-time Fourier transform is performed on a time-domain input signal to obtain a succession of overlapped spectral frames. In order to minimize possible side-band effects, appropriate analysis/synthesis windows, e.g. Gaussian windows, cosine windows, Hamming windows, Hann windows, rectangular windows, Bartlett windows, Blackman windows, and others, should be selected. The time delay at which every spectral frame is picked up from the input signal is referred to as the hop size or stride. The STFT of the input signal is referred to as the analysis stage and leads to a frequency domain representation of the input signal. The frequency domain representation comprises a plurality of subband signals, wherein each subband signal represents a certain frequency component of the input signal.
[0058] The frequency domain representation of the input signal may then be processed in a desired way. For the purpose of time-stretching of the input signal, each subband signal may be time-stretched, e.g. by delaying the subband signal samples. This may be achieved by using a synthesis hop-size which is greater than the analysis hop-size. The time domain signal may be rebuilt by performing an inverse (Fast) Fourier transform on all frames followed by a successive accumulation of the frames. This operation of the synthesis stage is referred to as overlap-add operation. The resulting output signal is a time-stretched version of the input signal comprising the same frequency components as the input signal. In other words, the resulting output signal has the same spectral composition as the input signal, but it is slower than the input signal i.e. its progression is stretched in time.
[0059] The transposition to higher frequencies may then be obtained subsequently, or in an integrated manner, through downsampling of the stretched signals. As a result the transposed signal has the length in time of the initial signal, but comprises frequency components which are shifted upwards by a pre-defined transposition factor.
[0060] In mathematical terms, the phase vocoder may be described as follows. An input signal x(t) is sampled at a sampling rate R to yield the discrete input signal x(n). During the analysis stage, a STFT is determined for the input signal x(n) at particular analysis time instants t.sub.a.sup.k for successive values k. The analysis time instants are preferably selected uniformly through t.sub.a.sup.k=k·Δt.sub.a, where Δt.sub.a is the analysis hop factor or analysis stride. At each of these analysis time instants t.sub.a.sup.k, a Fourier transform is calculated over a windowed portion of the original signal x(n), wherein the analysis window v.sub.a(t) is centered around t.sub.a.sup.k, i.e. v.sub.a(t−t.sub.a.sup.k). This windowed portion of the input signal x(n) is referred to as a frame. The result is the STFT representation of the input signal x(n), which may be denoted as:
$$X(t_a^k,\Omega_m)=\sum_{n=-\infty}^{\infty} v_a(n-t_a^k)\,x(n)\,\exp(-j\Omega_m n)$$
where
$$\Omega_m=\frac{2\pi m}{M}$$
is the center frequency of the m.sup.th subband signal of the STFT analysis and M is the size of the discrete Fourier transform (DFT). In practice, the window function v.sub.a(n) has a limited time span, i.e. it covers only a limited number of samples L, which is typically equal to the size M of the DFT. By consequence, the above sum has a finite number of terms. The subband signals X(t.sub.a.sup.k,Ω.sub.m) are both a function of time, via index k, and frequency, via the subband center frequency Ω.sub.m.
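The analysis stage described above can be sketched in a few lines. This is a minimal illustration, not the implementation of the invention: it assumes the frame is the windowed sum over absolute sample indices n with Ω.sub.m = 2πm/M, uses a naive O(L·M) DFT, and the names `stft_frame` and `hann` are hypothetical helpers introduced only for this example.

```python
import cmath
import math


def stft_frame(x, window, center, M):
    """Return the M complex subband samples of one analysis frame.

    The window (L = len(window) samples) is centered on `center`;
    samples outside the signal are treated as zero.
    """
    L = len(window)
    half = L // 2
    coeffs = []
    for m in range(M):
        omega = 2.0 * math.pi * m / M  # subband center frequency Omega_m
        acc = 0j
        for i in range(L):
            n = center - half + i  # absolute sample index
            if 0 <= n < len(x):
                acc += window[i] * x[n] * cmath.exp(-1j * omega * n)
        coeffs.append(acc)
    return coeffs


def hann(L):
    """Hann-type window sampled at half-integer offsets (an arbitrary choice)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * (i + 0.5) / L) for i in range(L)]


# A cosine located exactly on subband 3 of a 16-point transform.
M = 16
x = [math.cos(2.0 * math.pi * 3 * n / M) for n in range(64)]
X = stft_frame(x, hann(M), center=32, M=M)
strongest = max(range(M // 2 + 1), key=lambda m: abs(X[m]))
print(strongest)
```

Each call yields one spectral frame; calling it at centers k·Δt.sub.a for successive k gives the succession of overlapped frames that the later stages operate on.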
[0061] The synthesis stage may be performed at synthesis time instants t.sub.s.sup.k which are typically uniformly distributed according to t.sub.s.sup.k=k·Δt.sub.s, where Δt.sub.s is the synthesis hop factor or synthesis stride. At each of these synthesis time instants, a short-time signal y.sub.k(n) is obtained by inverse-Fourier-transforming the STFT subband signal Y(t.sub.s.sup.k, Ω.sub.m), which may be identical to X(t.sub.a.sup.k,Ω.sub.m), at the synthesis time instants t.sub.s.sup.k. However, typically the STFT subband signals are modified, e.g. time-stretched and/or phase modulated and/or amplitude modulated, such that the analysis subband signal X(t.sub.a.sup.k,Ω.sub.m) differs from the synthesis subband signal Y(t.sub.s.sup.k,Ω.sub.m). In a preferred embodiment, the STFT subband signals are phase modulated, i.e. the phase of the STFT subband signals is modified. The short-term synthesis signal y.sub.k(n) can be denoted as
$$y_k(n)=\frac{1}{M}\sum_{m=0}^{M-1} Y(t_s^k,\Omega_m)\,\exp(j\Omega_m n)$$
[0062] The short-term signal y.sub.k(n) may be viewed as a component of the overall output signal y(n) comprising the synthesis subband signals Y(t.sub.s.sup.k,Ω.sub.m) for m=0, . . . , M−1, at the synthesis time instant t.sub.s.sup.k. I.e. the short-term signal y.sub.k(n) is the inverse DFT for a specific signal frame. The overall output signal y(n) can be obtained by overlapping and adding windowed short-time signals y.sub.k(n) at all synthesis time instants t.sub.s.sup.k. I.e. the output signal y(n) may be denoted as
$$y(n)=\sum_{k=-\infty}^{\infty} v_s(n-t_s^k)\,y_k(n-t_s^k)$$
where v.sub.s(n−t.sub.s.sup.k) is the synthesis window centered around the synthesis time instant t.sub.s.sup.k. It should be noted that the synthesis window typically has a limited number of samples L, such that the above mentioned sum only comprises a limited number of terms.
[0063] In the following, the implementation of time-stretching in the frequency domain is outlined. A suitable starting point in order to describe aspects of the time stretcher is to consider the case T=1, i.e. the case where the transposition factor T equals 1 and where no stretching occurs. Assuming the analysis time stride Δt.sub.a and the synthesis time stride Δt.sub.s of the DFT filter bank to be equal, i.e. Δt.sub.a=Δt.sub.s=Δt, the combined effect of analysis followed by synthesis is that of an amplitude modulation with the Δt-periodic function
$$K(n)=\sum_{k=-\infty}^{\infty} q(n-k\,\Delta t) \qquad (1)$$
where q(n)=v.sub.a(n)v.sub.s(n) is the point-wise product of the two windows, i.e. the point-wise product of the analysis window and the synthesis window. It is advantageous to choose the windows such that K(n)=1 or another constant value, since then the windowed DFT filter bank achieves perfect reconstruction. If the analysis window v.sub.a(n) is given, and if the analysis window is of sufficiently long duration compared to the stride Δt, one can obtain perfect reconstruction by choosing the synthesis window according to
$$v_s(n)=v_a(n)\left(\sum_{k=-\infty}^{\infty}\bigl(v_a(n-k\,\Delta t)\bigr)^2\right)^{-1} \qquad (2)$$
For T>1, i.e. for a transposition factor greater than 1, a time stretch may be obtained by performing the analysis at stride
$$\Delta t_a=\frac{\Delta t}{T}$$
whereas the synthesis stride is maintained at Δt.sub.s=Δt. In other words, a time stretch by a factor T may be obtained by applying a hop factor or stride at the analysis stage which is T times smaller than the hop factor or stride at the synthesis stage. As can be seen from the formulas provided above, the use of a synthesis stride which is T times greater than the analysis stride will shift the short-term synthesis signals y.sub.k(n) by T times greater intervals in the overlap-add operation. This will eventually result in a time-stretch of the output signal y(n).
[0064] It should be noted that the time stretch by the factor T may further involve a phase multiplication by a factor T between the analysis and the synthesis. In other words, time stretching by a factor T involves phase multiplication by a factor T of the subband signals.
[0065] In the following it is outlined how the above described time-stretching operation may be translated into a harmonic transposition operation. The pitch-scale modification or harmonic transposition may be obtained by performing a sample-rate conversion of the time stretched output signal y(n). For performing a harmonic transposition by a factor T, an output signal y(n) which is a time-stretched version by the factor T of the input signal x(n) may be obtained using the above described phase vocoding method. The harmonic transposition may then be obtained by downsampling the output signal y(n) by a factor T or by converting the sampling rate from R to TR. In other words, instead of interpreting the output signal y(n) as having the same sampling rate as the input signal x(n) but of T times duration, the output signal y(n) may be interpreted as being of the same duration but of T times the sampling rate. The subsequent downsampling of T may then be interpreted as making the output sampling rate equal to the input sampling rate so that the signals eventually may be added. During these operations, care should be taken when downsampling the transposed signal so that no aliasing occurs.
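The effect of the final downsampling step can be checked on a single sinusoid. The sketch below is only an illustration under simplifying assumptions: the time-stretched signal is modeled directly as a cosine of normalized frequency `f` (a hypothetical value), and no anti-aliasing filter is applied because T·f stays well below half the sampling rate.

```python
import math


def downsample(y, T):
    """Keep every T-th sample; the caller must ensure y is suitably band-limited."""
    return y[::T]


f = 0.01  # cycles per sample (hypothetical example value)
T = 3     # transposition factor

# Stand-in for the time-stretched signal: same frequency f, T times as long.
stretched = [math.cos(2.0 * math.pi * f * n) for n in range(300)]
transposed = downsample(stretched, T)

# A cosine at T times the frequency, with the original duration.
reference = [math.cos(2.0 * math.pi * (T * f) * n) for n in range(100)]
max_error = max(abs(a - b) for a, b in zip(transposed, reference))
print(len(transposed), max_error < 1e-9)
```

Keeping every T-th sample of a signal at frequency f yields a signal at frequency T·f with the original number of samples, which is the harmonic transposition described above.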
[0066] When assuming the input signal x(n) to be a sinusoid and when assuming a symmetric analysis window v.sub.a(n), the method of time stretching based on the above described phase vocoder will work perfectly for odd values of T, and it will result in a time stretched version of the input signal x(n) having the same frequency. In combination with a subsequent downsampling, a sinusoid y(n) with a frequency which is T times the frequency of the input signal x(n) will be obtained.
[0067] For even values of T, the time stretching/harmonic transposition method outlined above will be more approximate, since negative valued side lobes of the frequency response of the analysis window v.sub.a(n) will be reproduced with different fidelity by the phase multiplication. The negative side lobes typically come from the fact that most practical windows (or prototype filters) have numerous discrete zeros located on the unit circle, resulting in 180 degree phase shifts. When multiplying the phase angles using even transposition factors the phase shifts are typically translated to 0 (or rather multiples of 360) degrees depending on the transposition factor used. In other words, when using even transposition factors, the phase shifts vanish. This will typically give rise to aliasing in the transposed output signal y(n). A particularly disadvantageous scenario may arise when a sinusoid is located at a frequency corresponding to the top of the first side lobe of the analysis filter. Depending on the rejection of this lobe in the magnitude response, the aliasing will be more or less audible in the output signal. It should be noted that, for even factors T, decreasing the overall stride Δt typically improves the performance of the time stretcher at the expense of a higher computational complexity.
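The vanishing of the 180 degree phase shifts for even T can be illustrated numerically. The following sketch is a toy model, not part of the described system: it takes one coefficient in the main lobe and one in a sign-inverted side lobe of the same sinusoid (the base phase of 0.3 rad is an arbitrary assumption), multiplies both phases by T, and reports their remaining phase difference.

```python
import cmath
import math


def relative_phase_after_transposition(T, base_phase=0.3):
    """Phase difference, wrapped to [-pi, pi], between a main-lobe and a
    sign-inverted side-lobe coefficient after both phases are multiplied by T."""
    main = cmath.exp(1j * base_phase)
    side = cmath.exp(1j * (base_phase + math.pi))  # 180 degree shifted lobe
    main_t = cmath.exp(1j * T * cmath.phase(main))
    side_t = cmath.exp(1j * T * cmath.phase(side))
    return cmath.phase(side_t / main_t)


for T in (2, 3, 4, 5):
    # Even T: difference is about 0, so the sign inversion is lost.
    # Odd T: difference has magnitude pi, so the sign inversion is kept.
    print(T, round(abs(relative_phase_after_transposition(T)), 6))
```

For odd T the side lobe keeps its inverted sign relative to the main lobe, whereas for even T both coefficients end up in phase, which is the mechanism behind the aliasing discussed above.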
[0068] In EP0940015B1/WO98/57436 entitled “Source coding enhancement using spectral band replication” which is incorporated by reference, a method has been described on how to avoid aliasing emerging from a harmonic transposer when using even transposition factors. This method, called relative phase locking, assesses the relative phase difference between adjacent channels, and determines whether a sinusoid is phase inverted in either channel. The detection is performed by using equation (32) of EP0940015B1. The channels detected as phase inverted are corrected after the phase angles are multiplied with the actual transposition factor.
[0069] In the following a novel method for avoiding aliasing when using even and/or odd transposition factors T is described. In contrast to the relative phase locking method of EP0940015B1, this method does not require the detection and correction of phase angles. The novel solution to the above problem makes use of analysis and synthesis transform windows that are not identical. In the perfect reconstruction (PR) case, this corresponds to a bi-orthogonal transform/filter bank rather than an orthogonal transform/filter bank.
[0070] To obtain a bi-orthogonal transform given a certain analysis window v.sub.a(n), the synthesis window v.sub.s(n) is chosen to follow
$$\sum_{i=0}^{L/\Delta t_s-1} v_a(m+\Delta t_s\,i)\,v_s(m+\Delta t_s\,i)=c,\quad 0\le m<\Delta t_s,$$
where c is a constant, Δt.sub.s is the synthesis time stride and L is the window length. If the sequence s(n) is defined as
$$s(m)=\sum_{i=0}^{L/\Delta t_s-1} v_a^2(m+\Delta t_s\,i),\quad 0\le m<\Delta t_s,$$
i.e. v.sub.a(n)=v.sub.s(n) is used for both analysis and synthesis windowing, then the condition for an orthogonal transform is
$$s(m)=c,\quad 0\le m<\Delta t_s.$$
[0071] However, in the following another sequence w(n) is introduced, wherein w(n) is a measure on how much the synthesis window v.sub.s(n) deviates from the analysis window v.sub.a(n), i.e. how much the bi-orthogonal transform differs from the orthogonal case. The sequence w(n) is given by
$$w(n)=\frac{v_s(n)}{v_a(n)},\quad 0\le n<L.$$
The condition for perfect reconstruction is then given by
$$\sum_{i=0}^{L/\Delta t_s-1} v_a^2(m+\Delta t_s\,i)\,w(m+\Delta t_s\,i)=c,\quad 0\le m<\Delta t_s.$$
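For a possible solution, w(n) could be restricted to be periodic with the synthesis time stride Δt.sub.s, i.e. w(n)=w(n+Δt.sub.s·i), ∀i,n. Then, one obtains
$$\sum_{i=0}^{L/\Delta t_s-1} v_a^2(m+\Delta t_s\,i)\,w(m+\Delta t_s\,i)=w(m)\sum_{i=0}^{L/\Delta t_s-1} v_a^2(m+\Delta t_s\,i)=w(m)\,s(m)=c,\quad 0\le m<\Delta t_s.$$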
The condition on the synthesis window v.sub.s(n) is hence
$$v_s(n)=v_a(n)\,\frac{c}{s(n \bmod \Delta t_s)},\quad 0\le n<L.$$
By deriving the synthesis windows v.sub.s(n) as outlined above, a much larger freedom when designing the analysis window v.sub.a(n) is provided. This additional freedom may be used to design a pair of analysis/synthesis windows which does not exhibit aliasing of the transposed signal.
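The derivation above can be sketched as follows. This is an illustrative example under stated assumptions: the synthesis window is taken to be c·v.sub.a(n)/s(n mod Δt.sub.s), where s(m) sums the squared analysis window over all samples spaced by the synthesis stride; the window length is assumed to be a multiple of the stride, and the function names are hypothetical.

```python
import math


def synthesis_window(v_a, stride, c=1.0):
    """Derive a synthesis window from a given analysis window.

    Assumes len(v_a) is a multiple of `stride` and that the analysis
    window has no stride-spaced set of samples that are all zero.
    """
    L = len(v_a)
    if L % stride != 0:
        raise ValueError("window length must be a multiple of the stride")
    # s(m): sum of squared analysis window samples spaced by the stride
    s = [sum(v_a[m + stride * i] ** 2 for i in range(L // stride))
         for m in range(stride)]
    return [c * v_a[n] / s[n % stride] for n in range(L)]


def reconstruction_sums(v_a, v_s, stride):
    """Sum of v_a*v_s over stride-spaced samples; constant for perfect reconstruction."""
    L = len(v_a)
    return [sum(v_a[m + stride * i] * v_s[m + stride * i] for i in range(L // stride))
            for m in range(stride)]


# Example analysis window (an arbitrary smooth choice, not prescribed by the text).
L, stride = 16, 4
v_a = [math.sin(math.pi * (n + 0.5) / L) ** 2 for n in range(L)]
v_s = synthesis_window(v_a, stride)
print(reconstruction_sums(v_a, v_s, stride))
```

Because the synthesis window is computed from the analysis window rather than the other way round, the analysis window can be chosen freely for its frequency response while the pair still satisfies the reconstruction condition.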
[0072] To obtain an analysis/synthesis window pair that suppresses aliasing for even transposition factors, several embodiments will be outlined in the following. According to a first embodiment the windows or prototype filters are made long enough to attenuate the level of the first side lobe in the frequency response below a certain “aliasing” level. The analysis time stride Δt.sub.a will in this case only be a (small) fraction of the window length L. This typically results in smearing of transients, e.g. in percussive signals.
[0073] According to a second embodiment, the analysis window v.sub.a(n) is chosen to have dual zeros on the unit circle. The phase response resulting from a dual zero is a 360 degree phase shift. These phase shifts are retained when the phase angles are multiplied with the transposition factors, regardless if the transposition factors are odd or even. When a proper and smooth analysis filter v.sub.a(n), having dual zeros on the unit circle, is obtained, the synthesis window is obtained from the equations outlined above.
[0074] In an example of the second embodiment, the analysis filter/window v.sub.a(n) is the “squared sine window”, i.e. the sine window
$$v(n)=\sin\!\left(\frac{\pi}{L}\,(n+0.5)\right),\quad 0\le n<L,$$
convolved with itself as v.sub.a(n)=v(n)∗v(n). However, it should be noted that the resulting filter/window v.sub.a(n) will be odd symmetric with length L.sub.a=2L−1, i.e. an odd number of filter/window coefficients. When a filter/window with an even length is more appropriate, in particular an even symmetric filter, the filter may be obtained by first convolving two sine windows of length L. Then, a zero is appended to the end of the resulting filter. Subsequently, the 2L long filter is resampled using linear interpolation to a length L even symmetric filter, which still has dual zeros only on the unit circle.
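The odd-length variant of this window can be sketched directly. The example below assumes the sine window v(n) = sin(π(n+0.5)/L) and builds the analysis window by convolving it with itself; the resampling to an even length is not shown, since its exact interpolation grid is not specified here. The helper `dtft` is a hypothetical name used only to check that the frequency response of the result is the square of that of the sine window, which is why every zero becomes a dual zero.

```python
import cmath
import math


def sine_window(L):
    return [math.sin(math.pi * (n + 0.5) / L) for n in range(L)]


def convolve(a, b):
    """Full linear convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def squared_sine_window(L):
    """Sine window convolved with itself; the result has 2L-1 coefficients."""
    v = sine_window(L)
    return convolve(v, v)


def dtft(seq, omega):
    """Frequency response of a finite sequence at angular frequency omega."""
    return sum(s * cmath.exp(-1j * omega * n) for n, s in enumerate(seq))


L = 8
v_a = squared_sine_window(L)
print(len(v_a))  # 2L-1 coefficients
# The response of v_a equals the squared response of the sine window.
omega = 0.7
print(abs(dtft(v_a, omega) - dtft(sine_window(L), omega) ** 2) < 1e-9)
```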
[0075] Overall, it has been outlined, how a pair of analysis and synthesis windows may be selected such that aliasing in the transposed output signal may be avoided or significantly reduced. The method is particularly relevant when using even transposition factors.
[0076] Another aspect to consider in the context of vocoder based harmonic transposers is phase unwrapping. It should be noted that whereas great care has to be taken related to phase unwrapping issues in general purpose phase vocoders, the harmonic transposer has unambiguously defined phase operations when integer transposition factors T are used. Thus, in preferred embodiments the transposition order T is an integer value. Otherwise, phase unwrapping techniques could be applied, wherein phase unwrapping is a process whereby the phase increment between two consecutive frames is used to estimate the instantaneous frequency of a nearby sinusoid in each channel.
[0077] Yet another aspect to consider, when dealing with the transposition of audio and/or voice signals, is the processing of stationary and/or transient signal sections. Typically, in order to be able to transpose stationary audio signals without intermodulation artifacts, the frequency resolution of the DFT filter bank has to be rather high, and therefore the windows are long compared to transients in the input signals x(n), notably audio and/or voice signals. As a result, the transposer has a poor transient response. However, as will be described in the following, this problem can be solved by a modification of the window design, the transform size and the time stride parameters. Hence, unlike many state of the art methods for phase vocoder transient response enhancement, the proposed solution does not rely on any signal adaptive operation such as transient detection.
[0078] In the following, the harmonic transposition of transient signals using vocoders is outlined. As a starting point, a prototype transient signal, a discrete time Dirac pulse at time instant t=t.sub.0,
$$\delta(t-t_0)=\begin{cases}1, & t=t_0\\ 0, & t\ne t_0\end{cases}$$
is considered. The Fourier transform of such a Dirac pulse has unit magnitude and a linear phase with a slope proportional to t.sub.0:
$$X(\Omega_m)=\sum_{n=-\infty}^{\infty}\delta(n-t_0)\,\exp(-j\Omega_m n)=\exp(-j\Omega_m t_0)$$
[0079] Such Fourier transform can be considered as the analysis stage of the phase vocoder described above, wherein a flat analysis window v.sub.a(n) of infinite duration is used. In order to generate an output signal y(n) which is time-stretched by a factor T, i.e. a Dirac pulse δ(t−Tt.sub.0) at the time instant t=Tt.sub.0, the phase of the analysis subband signals should be multiplied by the factor T in order to obtain the synthesis subband signal Y(Ω.sub.m)=exp(−jΩ.sub.mTt.sub.0) which yields the desired Dirac pulse δ(t−Tt.sub.0) as an output of an inverse Fourier Transform.
[0080] This shows that the operation of phase multiplication of the analysis subband signals by a factor T leads to the desired time-shift of a Dirac pulse, i.e. of a transient input signal. It should be noted that for more realistic transient signals comprising more than one non-zero sample, the further operations of time-stretching of the analysis subband signals by a factor T should be performed. In other words, different hop sizes should be used at the analysis and the synthesis side.
[0081] However, it should be noted that the above considerations refer to an analysis/synthesis stage using analysis and synthesis windows of infinite lengths. Indeed, a theoretical transposer with a window of infinite duration would give the correct stretch of a Dirac pulse δ(t−t.sub.0). For a finite duration windowed analysis, the situation is scrambled by the fact that each analysis block is to be interpreted as one period interval of a periodic signal with period equal to the size of the DFT.
[0082] This is illustrated in
[0083] In a real-world system, where both the analysis and synthesis windows are of finite length, the pulse train actually contains a few pulses only (depending on the transposition factor), one main pulse, i.e. the wanted term, a few pre-pulses and a few post-pulses, i.e. the unwanted terms. The pre- and post-pulses emerge because the DFT is periodic (with L). When a pulse is located within an analysis window, so that the complex phase gets wrapped when multiplied by T (i.e. the pulse is shifted outside the end of the window and wraps back to the beginning), an unwanted pulse emerges. The unwanted pulses may have, or may not have, the same polarity as the input pulse, depending on the location in the analysis window and the transposition factor.
[0084] This can be seen mathematically when transforming the Dirac pulse δ(t−t.sub.0) situated in the interval −L/2≤t.sub.0<L/2 using a DFT with length L centered around t=0,
$$X(\Omega_m)=\sum_{n=-L/2}^{L/2-1}\delta(n-t_0)\,\exp(-j\Omega_m n)=\exp(-j\Omega_m t_0).$$
The analysis subband signals are phase multiplied with a factor T to obtain the synthesis subband signals Y(Ω.sub.m)=exp(−jΩ.sub.mTt.sub.0). Then the inverse DFT is applied to obtain the periodic synthesis signal:
$$y(n)=\frac{1}{L}\sum_{m=-L/2}^{L/2-1}\exp(-j\Omega_m T t_0)\,\exp(j\Omega_m n)=\sum_{k=-\infty}^{\infty}\delta(n-Tt_0+kL),$$
i.e. a Dirac pulse train with period L.
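The wrap-around of a single frame can be reproduced with a small experiment. The sketch below is an illustration only: it transforms a unit pulse at position t0 with a DFT of size M whose time and frequency indices are centered around zero, multiplies the phase of each coefficient by an integer T, and applies the inverse DFT. The function names are hypothetical.

```python
import cmath
import math


def transpose_dirac_frame(t0, T, M):
    """Return {n: amplitude} for n in [-M/2, M/2) after phase multiplication by T."""
    half = M // 2
    idx = range(-half, half)
    # DFT of a unit pulse at t0: unit magnitude, linear phase
    X = [cmath.exp(-2j * math.pi * m * t0 / M) for m in idx]
    # nonlinear processing: multiply the phase of every coefficient by T
    Y = [cmath.exp(1j * T * cmath.phase(Xm)) for Xm in X]
    y = {}
    for n in idx:
        acc = sum(Ym * cmath.exp(2j * math.pi * m * n / M) for m, Ym in zip(idx, Y))
        y[n] = (acc / M).real
    return y


def pulse_position(y):
    return max(y, key=lambda n: abs(y[n]))


M, T = 16, 3
print(pulse_position(transpose_dirac_frame(2, T, M)))  # T*t0 = 6 fits in the frame
print(pulse_position(transpose_dirac_frame(6, T, M)))  # T*t0 = 18 wraps around to 2
```

With t0 = 2 the pulse lands at T·t0 = 6 as desired. With t0 = 6 the target position 18 lies outside the frame, and the periodicity of the DFT places the pulse at 18 − 16 = 2 instead, i.e. ahead of where the stretched pulse belongs.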
[0085] In the example of
[0086] As the analysis and synthesis stage move along the time axis according to the hop factor or time stride Δt, the pulse δ(t−t.sub.0) 112 will have another position relative to the center of the respective analysis window 111. As outlined above, the operation to achieve time-stretching consists in moving the pulse 112 to T times its position relative to the center of the window. As long as this position is within the window 121, this time-stretch operation guarantees that all contributions add up to a single time stretched synthesized pulse δ(t−Tt.sub.0) at t=Tt.sub.0.
[0087] However, a problem occurs for the situation of
[0088] The principle of the solution proposed by the present invention is described in reference to
[0089] It should be noted that in a preferred embodiment the synthesis window and the analysis window have equal “nominal” lengths. However, when using implicit resampling of the output signal by discarding or inserting samples in the frequency bands of the transform or filter bank, the synthesis window size will typically be different from the analysis size, depending on the resampling or transposition factor.
[0090] The minimum value of F, i.e. the minimum frequency domain oversampling factor, can be deduced from
i.e. for any input pulse comprised within the analysis window 311, the undesired image δ(t−Tt.sub.0+FL) at time instant t=Tt.sub.0−FL must be located to the left of the left edge of the synthesis window at
$$t=-\frac{L}{2}.$$
Equivalently, the condition
$$T\,\frac{L}{2}-FL\le-\frac{L}{2}$$
must be met, which leads to the rule
$$F\ge\frac{T+1}{2}. \qquad (3)$$
As can be seen from formula (3), the minimum frequency domain oversampling factor F is a function of the transposition/time-stretching factor T. More specifically, the minimum frequency domain oversampling factor F increases linearly with the transposition/time-stretching factor T.
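The rule can be checked by enumerating all pulse positions inside the analysis window. The sketch below is a simplified index-arithmetic model under the assumption that the minimum oversampling factor is (T+1)/2, as also stated in the claims: a pulse at t0 is moved to T·t0 modulo the transform size M = F·L, and a spurious pulse occurs whenever the wrapped position differs from T·t0 yet falls inside the synthesis window [−L/2, L/2). All function names are hypothetical, and F·L is assumed to be an integer.

```python
from fractions import Fraction


def min_oversampling_factor(T):
    """Minimum frequency domain oversampling factor for transposition factor T."""
    return Fraction(T + 1, 2)


def wrapped_position(t, M):
    """Equivalent index of t in [-M/2, M/2) for an M-periodic frame."""
    half = M // 2
    return (t + half) % M - half


def has_spurious_pulse(T, L, F):
    """True if some pulse inside the analysis window yields a wrapped
    image inside the synthesis window of the same length L."""
    M = int(F * L)  # transform size (assumes F*L is an integer)
    for t0 in range(-L // 2, L // 2):
        target = T * t0
        image = wrapped_position(target, M)
        if image != target and -L // 2 <= image < L // 2:
            return True
    return False


for T in (2, 3, 4):
    F_min = min_oversampling_factor(T)
    print(T, F_min, has_spurious_pulse(T, 8, 1), has_spurious_pulse(T, 8, F_min))
```

Without oversampling (F = 1) every listed T produces a wrapped pulse inside the synthesis window, while at the minimum factor the wrapped images fall outside it and are removed by the synthesis windowing.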
[0091] By repeating the line of thinking above for the case where the analysis and synthesis windows have different lengths one obtains a more general formula. Let L.sub.A and L.sub.S be the lengths of the analysis and synthesis windows, respectively, and let M be the DFT size employed. The rule extending formula (3) is then
$$M\ge\frac{T\,L_A+L_S}{2}. \qquad (4)$$
That this rule indeed is an extension of (3) can be verified by inserting M=FL, and L.sub.A=L.sub.S=L in (4) and dividing by L on both sides of the resulting equation.
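As a small worked example, the general rule can be turned into a helper that returns the smallest admissible transform size. This sketch assumes the rule reads M ≥ (T·L.sub.A + L.sub.S)/2, which reduces to F ≥ (T+1)/2 for M = FL and equal window lengths; the function name is hypothetical, and practical implementations may round M up further, e.g. to a size suited to a fast transform.

```python
def min_dft_size(T, L_a, L_s):
    """Smallest integer M satisfying M >= (T*L_a + L_s) / 2."""
    return -(-(T * L_a + L_s) // 2)  # integer ceiling division


# Equal window lengths L = 1024: M/L reproduces (T+1)/2.
for T in (2, 3, 4):
    M = min_dft_size(T, 1024, 1024)
    print(T, M, M / 1024)
```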
[0092] The above analysis is performed for a rather special model of a transient, i.e. a Dirac pulse. However, the reasoning can be extended to show that when using the above described time-stretching scheme, input signals which have a near flat spectral envelope and which vanish outside a time interval [a, b] will be stretched to output signals which are small outside the interval [Ta,Tb]. It can also be checked by studying spectrograms of real audio and/or speech signals that pre-echoes disappear in the stretched signals when the above described rule for selecting an appropriate frequency domain oversampling factor is respected. A more quantitative analysis also reveals that pre-echoes are still reduced when using frequency domain oversampling factors which are slightly inferior to the value imposed by the condition of formula (3). This is due to the fact that typical window functions v.sub.s(n) are small near their edges, thereby attenuating undesired pre-echoes which are positioned near the edges of the window functions.
[0093] In summary, the present invention teaches a new way to improve the transient response of frequency domain harmonic transposers, or time-stretchers, by introducing an oversampled transform, where the amount of oversampling is a function of the transposition factor chosen.
[0094] In the following, the application of harmonic transposition according to the invention in audio decoders is described in further detail. A common use case for a harmonic transposer is in an audio/speech codec system employing so-called bandwidth extension or high frequency regeneration (HFR). It should be noted that even though reference may be made to audio coding, the described methods and systems are equally applicable to speech coding and in unified speech and audio coding (USAC).
[0095] In such HFR systems the transposer may be used to generate a high frequency signal component from a low frequency signal component provided by the so-called core decoder. The envelope of the high frequency component may be shaped in time and frequency based on side information conveyed in the bit-stream.
[0096]
[0097] As outlined in the context of
[0098] The overall transposition order may be obtained in different ways. A first possibility is to up-sample the decoder output signal by the factor 2 at the entrance to the transposer as pointed out above. In such cases, the time-stretched signal would need to be down-sampled by a factor T, in order to obtain the desired output signal which is frequency transposed by a factor T. A second possibility would be to omit the pre-processing step and to directly perform the time-stretching operations on the core decoder output signal. In such cases, the transposed signals must be down-sampled by a factor T/2 to retain the global up-sampling factor of 2 and in order to achieve frequency transposition by a factor T. In other words, the up-sampling of the core decoder signal may be omitted when performing a down-sampling of the output signal of the transposer 402 of T/2 instead of T. It should be noted, however, that the core signal still needs to be up-sampled in the up-sampler 404 prior to combining the signal with the transposed signal.
[0099] It should also be noted that the transposer 402 may use several different integer transposition factors in order to generate the high frequency component. This is shown in
[0100]
[0101] The altered coefficients or altered subband signals are retransformed into the time domain using the synthesis transformation unit 605. For each set of altered complex coefficients, this yields a frame of altered samples, i.e. a set of M altered samples. Using the synthesis window unit 606, L samples may be extracted from each set of altered samples, thereby yielding a frame of the output signal. Overall, a sequence of frames of the output signal may be generated for the sequence of frames of the input signal. This sequence of frames is shifted with respect to one another by the synthesis stride in the synthesis stride unit 607. The synthesis stride may be T times greater than the analysis stride. The output signal is generated in the overlap-add unit 608, where the shifted frames of the output signal are overlapped and samples at the same time instant are added. By traversing the above system, the input signal may be time-stretched by a factor T, i.e. the output signal may be a time-stretched version of the input signal.
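The chain of windowing, oversampled transform, phase multiplication, inverse transform, synthesis windowing and overlap-add can be sketched end to end. The code below is a minimal illustration under several assumptions that are not taken from the text: a sine window is used for both analysis and synthesis without normalization for perfect reconstruction, F is an integer, a naive DFT is used for brevity, and the final contracting (down-sampling) step is omitted so that the output is the time-stretched signal.

```python
import cmath
import math


def sine_window(L):
    return [math.sin(math.pi * (n + 0.5) / L) for n in range(L)]


def dft(frame, sign):
    """Naive DFT (sign=-1) or unnormalized inverse DFT (sign=+1)."""
    M = len(frame)
    return [sum(frame[n] * cmath.exp(sign * 2j * math.pi * m * n / M) for n in range(M))
            for m in range(M)]


def time_stretch(x, T, L, F, analysis_stride):
    """Time-stretch x by an integer factor T with a transform of size M = F*L."""
    M = F * L
    half = L // 2
    v_a = sine_window(L)
    v_s = sine_window(L)  # assumption: not normalized for perfect reconstruction
    y = [0.0] * (T * len(x) + L)
    for center in range(0, len(x), analysis_stride):
        # analysis window, zero-padded to M samples (frame indices taken modulo M)
        frame = [0.0] * M
        for i in range(L):
            n = center - half + i
            if 0 <= n < len(x):
                frame[(i - half) % M] = v_a[i] * x[n]
        X = dft(frame, -1)
        # nonlinear processing: keep the magnitude, multiply the phase by T
        Y = [abs(c) * cmath.exp(1j * T * cmath.phase(c)) for c in X]
        out = dft(Y, +1)
        # synthesis window extracts L of the M samples; frames are placed
        # with a synthesis stride of T times the analysis stride
        for i in range(L):
            n = T * center - half + i
            if 0 <= n < len(y):
                y[n] += v_s[i] * (out[(i - half) % M] / M).real
    return y


x = [0.0] * 32
x[10] = 1.0  # a single pulse as a prototype transient
for F in (1, 2):
    y = time_stretch(x, T=2, L=8, F=F, analysis_stride=2)
    print(F, [n for n, v in enumerate(y) if abs(v) > 1e-9])
```

In this example a pulse at sample 10 is stretched with T = 2. With F = 2 the output contains a single pulse at sample 20. With F = 1 additional pulses appear at samples 12 and 28, i.e. a pre-echo and a post-echo caused by the circular wrap of the transform.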
[0102] Finally, the output signal may be contracted in time using the contracting unit 609. The contracting unit 609 may perform a sampling rate conversion of order T, i.e. it may increase the sampling rate of the output signal by a factor T, while keeping the number of samples unchanged. This yields a transposed output signal, having the same length in time as the input signal but comprising frequency components which are up-shifted by a factor T with respect to the input signal. The contracting unit 609 may also perform a down-sampling operation by a factor T, i.e. it may retain only every T.sup.th sample while discarding the other samples. This down-sampling operation may also be accompanied by a low pass filter operation. If the overall sampling rate remains unchanged, then the transposed output signal comprises frequency components which are up-shifted by a factor T with respect to the frequency components of the input signal.
[0103] It should be noted that the contracting unit 609 may perform a combination of rate-conversion and down-sampling. By way of example, the sampling rate may be increased by a factor 2. At the same time the signal may be down-sampled by a factor T/2. Overall, such combination of rate-conversion and down-sampling also leads to an output signal which is a harmonic transposition of the input signal by a factor T. In general, it may be stated that the contracting unit 609 performs a combination of rate conversion and/or down-sampling in order to yield a harmonic transposition by the transposition order T. This is particularly useful when performing harmonic transposition of the low bandwidth output of the core audio decoder 401. As outlined above, such low bandwidth output may have been down-sampled by a factor 2 at the encoder and may therefore require up-sampling in the up-sampling unit 404 prior to merging it with the reconstructed high frequency component. Nevertheless, it may be beneficial for reducing computational complexity to perform harmonic transposition in the transposition unit 402 using the “non-up-sampled” low bandwidth output. In such cases, the contracting unit 609 of the transposition unit 402 may perform a rate-conversion of order 2 and thereby implicitly perform the required up-sampling operation of the high frequency component. Consequently, transposed output signals of order T are down-sampled in the contracting unit 609 by the factor T/2.
[0104] In the case of multiple parallel transposers of different transposition orders such as shown in
[0105] As just mentioned, the analysis window may be common to the signals of different transposition factors. When using a common analysis window, an example of the stride of windows 700 applied to the low band signal is depicted in
[0106] An example of the stride of windows applied to the low band signal, e.g. the output signal of the core decoder, is depicted in
[0107] In the synthesis stages, the synthesis strides Δt.sub.s of the synthesis windows are determined as a function of the transposition order T used in the respective transposer. As outlined above, the time-stretch operation also involves time stretching of the subband signals, i.e. time stretching of the suite of frames. This operation may be performed by choosing a synthesis hop factor or synthesis stride Δt.sub.s which is increased over the analysis stride Δt.sub.a by a factor T. Consequently, the synthesis stride Δt.sub.sT for the transposer of order T is given by Δt.sub.sT=TΔt.sub.a.
[0108]
[0109] In the following, the aspect of time alignment of transposed sequences of different transposition factors when using common analysis windows is addressed. In other words, the aspect of aligning the output signals of frequency transposers employing a different transposition order is addressed. When using the methods outlined above, Dirac-functions δ(t−t.sub.0) are time-stretched, i.e. moved along the time axis, by the amount of time given by the applied transposition factor T. In order to convert the time-stretching operation into a frequency shifting operation, a decimation or down-sampling using the same transposition factor T is performed. If such decimation by the transposition factor or transposition order T is performed on the time-stretched Dirac-function δ(t−Tt.sub.0), the down-sampled Dirac pulse will be time aligned with respect to the zero-reference time 710 in the middle of the first analysis window 701. This is illustrated in
[0110] However, when using different orders of transposition T, the decimations will result in different offsets for the zero-reference, unless the zero-reference is aligned with “zero” time of the input signal. Consequently, a time offset adjustment of the decimated transposed signals needs to be performed before they can be summed up in the summing unit 502. As an example, a first transposer of order T=3 and a second transposer of order T=4 are assumed. Furthermore, it is assumed that the output signal of the core decoder is not up-sampled. Then the transposer decimates the third order time-stretched signal by a factor 3/2, and the fourth order time-stretched signal by a factor 2. The second order time-stretched signal, i.e. T=2, will just be interpreted as having a higher sampling frequency compared to the input signal, i.e. a factor 2 higher sampling frequency, effectively making the output signal pitch-shifted by a factor 2.
[0111] It can be shown that in order to align the transposed and down-sampled signals, time offsets of $(T-2)L/4$ need to be applied to the transposed signals before decimation, i.e. for the third and fourth order transpositions, offsets of $L/4$ and $L/2$ have to be applied, respectively. To verify this in a concrete example, the zero-reference for a second order time-stretched signal will be assumed to correspond to time instant or sample $L/2$, i.e. to the zero-reference 710 in the middle of the first analysis window. For the third order transposition without offset, this reference translates into $\frac{L/2}{3/2} = \frac{L}{3}$ due to down-sampling by a factor of 3/2. If the time offset according to the above mentioned rule is added before decimation, the reference will translate into $\frac{L/2 + L/4}{3/2} = \frac{L}{2}$. This means that the reference of the down-sampled transposed signal is aligned with the zero-reference 710. In a similar manner, for the fourth order transposition without offset the zero-reference corresponds to $\frac{L/2}{2} = \frac{L}{4}$, but when using the proposed offset, the reference translates into $\frac{L/2 + L/2}{2} = \frac{L}{2}$, which again is aligned with the 2.sup.nd order zero-reference 710, i.e. the zero-reference for the transposed signal using T=2.
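The alignment argument may be checked numerically with the following sketch. It assumes that the zero-reference lies at sample L/2 (the middle of the analysis window), that the core decoder output is not up-sampled, so that the decimation factor is T/2, and that the offset applied before decimation is (T−2)·L/4, which is the value for which the references coincide. The function name and the window length L=1024 are hypothetical.

```python
from fractions import Fraction

def reference_after_decimation(T, L, use_offset=True):
    """Position of the zero-reference after decimation by T/2 (sketch)."""
    zero_ref = Fraction(L, 2)                  # middle of the analysis window
    # Assumed offset rule, applied before decimation.
    offset = Fraction((T - 2) * L, 4) if use_offset else Fraction(0)
    decimation = Fraction(T, 2)                # core output not up-sampled
    return (zero_ref + offset) / decimation

L = 1024
for T in (2, 3, 4):
    without = reference_after_decimation(T, L, use_offset=False)
    with_offset = reference_after_decimation(T, L)
    print(f"T={T}: without offset {without}, with offset {with_offset}")
```

Exact fractions are used so that the non-integer reference L/3 of the third order case is represented without rounding.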
[0112] Another aspect to be considered when simultaneously using multiple orders of transposition relates to the gains applied to the transposed sequences of different transposition factors. In other words, the aspect of combining the output signals of transposers of different transposition order may be addressed. There are two principles when selecting the gain of the transposed signals, which may be considered under different theoretical approaches. Under the first principle, the transposed signals are supposed to be energy conserving, meaning that the total energy in the low band signal which subsequently is transposed to constitute a factor-T transposed high band signal is preserved. In this case the energy per bandwidth should be reduced by the transposition factor T, since the signal is stretched by the same amount T in frequency. However, sinusoids, which have their energy within an infinitesimally small bandwidth, will retain their energy after transposition. This is due to the fact that in the same way as a Dirac pulse is moved in time by the transposer when time-stretching, i.e. in the same way that the duration in time of the pulse is not changed by the time-stretching operation, a sinusoid is moved in frequency when transposing, i.e. the duration in frequency (in other words the bandwidth) is not changed by the frequency transposing operation. That is, even though the energy per bandwidth is reduced by T, the sinusoid has all its energy in one point in frequency, so that the point-wise energy will be preserved.
[0113] The other option when selecting the gain of the transposed signals is to keep the energy per bandwidth after transposition. In this case, broadband white noise and transients will display a flat frequency response after transposition, while the energy of sinusoids will increase by a factor T.
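The two gain principles may be compared with a small sketch. It expresses both options as an amplitude gain relative to the energy-conserving transposer output. This convention, the function names and the mode labels are assumptions made for illustration, not values prescribed by the present document.

```python
import math

def transposition_gain(T, keep="total_energy"):
    """Amplitude gain relative to the energy-conserving output (sketch)."""
    if keep == "total_energy":
        return 1.0                 # sinusoids retain their energy
    if keep == "energy_per_bandwidth":
        return math.sqrt(T)        # energy of sinusoids increases by a factor T
    raise ValueError("unknown gain principle: " + keep)

def energy_per_bandwidth_factor(T, keep):
    """Change in energy per bandwidth of a broadband signal after transposition."""
    # Stretching by T in frequency divides the energy per bandwidth by T.
    return transposition_gain(T, keep) ** 2 / T

for keep in ("total_energy", "energy_per_bandwidth"):
    print(keep, transposition_gain(3, keep), energy_per_bandwidth_factor(3, keep))
```

With the first option the energy per bandwidth of a broadband signal drops by the factor T, whereas with the second option it stays flat at the cost of sinusoids gaining energy by the factor T.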
[0114] A further aspect of the invention is the choice of analysis and synthesis phase vocoder windows when using common analysis windows. It is beneficial to carefully choose the analysis and synthesis phase vocoder windows, i.e. v.sub.a(n) and v.sub.s(n). The synthesis window v.sub.s(n) should adhere to Formula 2 above in order to allow for perfect reconstruction. In addition, the analysis window v.sub.a(n) should have adequate rejection of the side lobe levels. Otherwise, unwanted “aliasing” terms will typically be audible as interference with the main terms for frequency varying sinusoids. Such unwanted “aliasing” terms may also appear for stationary sinusoids in the case of even transposition factors as mentioned above. The present invention proposes the use of sine windows because of their good side lobe rejection ratio. Hence, the analysis window is proposed to be
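The side lobe behaviour of a sine window may be illustrated with the following sketch. The window form sin(π(n+0.5)/L) is a commonly used definition that is assumed here and is not taken from the present document. The helper `peak_sidelobe_db`, the length L=64 and the comparison against a rectangular window are likewise illustrative assumptions.

```python
import numpy as np

def peak_sidelobe_db(window, mainlobe_halfwidth_bins, pad=16):
    """Peak side lobe level in dB relative to the main lobe peak."""
    L = len(window)
    W = np.abs(np.fft.rfft(window, pad * L))            # zero-padded spectrum
    start = int(round(mainlobe_halfwidth_bins * pad))   # first null of the main lobe
    return 20 * np.log10(W[start:].max() / W[0])

L = 64
sine = np.sin(np.pi * (np.arange(L) + 0.5) / L)   # assumed sine window
rect = np.ones(L)                                 # rectangular window for comparison

sine_db = peak_sidelobe_db(sine, 1.5)   # main lobe of the sine window ends at 1.5 bins
rect_db = peak_sidelobe_db(rect, 1.0)   # main lobe of the rectangular window ends at 1 bin
print(f"sine: {sine_db:.1f} dB, rectangular: {rect_db:.1f} dB")
```

In this comparison the sine window shows a clearly lower peak side lobe level than the rectangular window, at the cost of a wider main lobe.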
[0115] The synthesis window v.sub.s(n) will either be identical to the analysis window v.sub.a(n) or, if the synthesis hop-size Δt.sub.s is not a factor of the analysis window length L, i.e. if the analysis window length L is not an integer multiple of the synthesis hop-size, be given by formula (2) above. By way of example, if L=1024 and Δt.sub.s=384, then 1024/384=2.667 is not an integer. It should be noted that it is also possible to select a pair of bi-orthogonal analysis and synthesis windows as outlined above. This may be beneficial for the reduction of aliasing in the output signal, notably when using even transposition orders T.
[0116] In the following, reference is made to
[0117] The enhanced Spectral Band Replication (eSBR) unit 1001 of the encoder 1000 may comprise high frequency reconstruction components outlined in the present document. In some embodiments, the eSBR unit 1001 may comprise a transposition unit outlined in the context of
[0118] The decoder 1100 shown in
[0119] Furthermore,
[0134]
[0135] In
[0136] Typically, the QMF filter bank 1202 comprises 32 QMF frequency bands. In such cases, the low frequency component 1213 has a bandwidth of f.sub.s/4, where f.sub.s/2 is the sampling frequency of the signal 1213. The high frequency component 1212 typically has a bandwidth of f.sub.s/2 and is filtered through the QMF bank 1203 comprising 64 QMF frequency bands.
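The figures given above imply that both filter banks operate with the same band spacing, which may be verified with a short sketch. The value f.sub.s=48000 Hz and the function name are hypothetical and serve only as an illustration.

```python
from fractions import Fraction

def qmf_band_spacing(bandwidth_hz, n_bands):
    """Width of one QMF band for a uniformly spaced filter bank."""
    return Fraction(bandwidth_hz) / n_bands

fs = 48000  # hypothetical sampling frequency in Hz

# Low frequency component 1213: bandwidth fs/4, 32 bands.
low_spacing = qmf_band_spacing(Fraction(fs, 4), 32)
# High frequency component 1212: bandwidth fs/2, 64 bands.
high_spacing = qmf_band_spacing(Fraction(fs, 2), 64)

print(low_spacing, high_spacing)
```

Both expressions reduce to f.sub.s/128, so the bands of the 32 band filter bank line up with the lower half of the 64 band grid.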
[0137] In the present document, a method for harmonic transposition has been outlined. This method of harmonic transposition is particularly well suited for the transposition of transient signals. It comprises the combination of frequency domain oversampling with harmonic transposition using vocoders. The transposition operation depends on the combination of analysis window, analysis window stride, transform size, synthesis window, synthesis window stride, as well as on phase adjustments of the analysed signal. Through the use of this method, undesired effects, such as pre- and post-echoes, may be avoided. Furthermore, the method does not make use of signal analysis measures, such as transient detection, which typically introduce signal distortions due to discontinuities in the signal processing. In addition, the proposed method has reduced computational complexity. The harmonic transposition method according to the invention may be further improved by an appropriate selection of analysis/synthesis windows, gain values and/or time alignment.