Audio processor and method for processing an audio signal using vertical phase correction
10770083 · 2020-09-08
Assignee
Inventors
CPC classification
G10L19/02
PHYSICS
G10L19/025
PHYSICS
G10L19/22
PHYSICS
International classification
G10L19/025
PHYSICS
G10L19/02
PHYSICS
Abstract
An audio processor for processing an audio signal includes a target phase measure determiner for determining a target phase measure for the audio signal in a time frame, a phase error calculator for calculating a phase error using a phase of the audio signal in the time frame and the target phase measure, and a phase corrector configured for correcting the phase of the audio signal in the time frame using the phase error.
Claims
1. A decoding device for decoding an encoded audio signal, the decoding device comprising: a core decoder configured for decoding the encoded audio signal in a first time frame to obtain a set of subbands of a baseband in the first time frame and for decoding the encoded audio signal in a second time frame to obtain a set of subbands of the baseband in the second time frame; a patcher configured for patching the set of subbands of the baseband in the first time frame, wherein the set of subbands in the first time frame forms a patch, to further subbands in the first time frame, adjacent to the baseband, to achieve a decoded audio signal comprising frequencies higher than the frequencies in the baseband for the first time frame and for patching the set of subbands of the baseband in the second time frame, wherein the set of subbands in the second time frame forms a patch, to further subbands in the second time frame, adjacent to the baseband, to achieve a decoded audio signal comprising frequencies higher than the frequencies in the baseband for the second time frame; an audio processor for processing an audio signal in the first time frame comprising the set of subbands or the further subbands in the first time frame, the audio processor comprising: a target phase measure determiner for determining a target phase measure for the audio signal in a first time frame; a phase error calculator for calculating a phase error using a phase of the audio signal in the first time frame and the target phase measure; and a phase corrector configured for correcting phases of the set of subbands of the patch or of the further subbands according to the target phase measure in the first time frame, and a further audio processor for processing an audio signal in the second time frame comprising the set of subbands or the further subbands in the second time frame, the further audio processor comprising: a further target phase measure determiner for determining a further target phase 
measure for the audio signal in the second time frame; a further phase error calculator for calculating a further phase error using a further phase of the audio signal in the second time frame and the target phase measure; and a further phase corrector configured for correcting phases of the set of subbands of the patch or of the further subbands according to the target phase measure in the second time frame, wherein the further audio processor is configured for receiving a phase derivative over frequency and to correct a transient in the audio signal in the second time frame using the received phase derivative over frequency.
2. The decoding device according to claim 1, wherein the patcher is configured for patching the set of subbands of the baseband, wherein the set of subbands forms a further patch, to further subbands of the time frame, adjacent to the patch; and wherein the audio processor is configured for correcting the phases within the subbands of the further patch; or wherein the patcher is configured for patching a corrected patch to further subbands of the time frame, adjacent to the patch.
3. A method for decoding an encoded audio signal, the method comprising: decoding the encoded audio signal in a first time frame to obtain a set of subbands of a baseband in the first time frame and decoding the encoded audio signal in a second time frame to obtain a set of subbands of the baseband in the second time frame; patching the set of subbands of the baseband in the first time frame, wherein the set of subbands in the first time frame forms a patch, to further subbands in the first time frame, adjacent to the baseband, to achieve a decoded audio signal comprising frequencies higher than the frequencies in the baseband for the first time frame; patching the set of subbands of the baseband in the second time frame, wherein the set of subbands in the second time frame forms a patch, to further subbands in the second time frame, adjacent to the baseband, to achieve a decoded audio signal comprising frequencies higher than the frequencies in the baseband for the second time frame; determining a target phase measure for an audio signal in the first time frame comprising the set of subbands or the further subbands in the first time frame; calculating a phase error using the phase of the audio signal in the first time frame and a target phase measure; and correcting phases of the set of subbands of the patch or of the further subbands according to the target phase measure in the first time frame; and determining a further target phase measure for an audio signal in the second time frame comprising the set of subbands or the further subbands in the second time frame; calculating a further phase error using a further phase of the audio signal in the second time frame and the target phase measure; and correcting phases of the set of subbands of the patch or of the further subbands according to the target phase measure in the second time frame, wherein a phase derivative over frequency is received, and wherein a transient in the audio signal in the second time frame
is corrected using the received phase derivative over frequency.
4. A non-transitory digital storage medium having a computer program stored thereon to perform, when running on a computer, the method for decoding an encoded audio signal, said method comprising: decoding the encoded audio signal in a first time frame to obtain a set of subbands of a baseband in the first time frame and for decoding the encoded audio signal in a second time frame to obtain a set of subbands of the baseband in the second time frame; patching the set of subbands of the baseband in the first time frame, wherein the set of subbands in the first time frame forms a patch, to further subbands in the first time frame, adjacent to the baseband, to achieve a decoded audio signal comprising frequencies higher than the frequencies in the baseband for the first time frame; patching the set of subbands of the baseband in the second time frame, wherein the set of subbands in the second time frame forms a patch, to further subbands in the second time frame, adjacent to the baseband, to achieve a decoded audio signal comprising frequencies higher than the frequencies in the baseband for the second time frame; determining a target phase measure for an audio signal in the first time frame comprising the set of subbands or the further subbands in the first time frame; calculating a phase error using the phase of the audio signal in the first time frame and a target phase measure; and correcting phases of the set of subbands of the patch or of the further subbands according to the target phase measure in the first time frame; and determining a further target phase measure for an audio signal in the second time frame comprising the set of subbands or the further subbands in the second time frame; calculating a further phase error using a further phase of the audio signal in the second time frame and the target phase measure; and correcting phases of the set of subbands of the patch or of the further subbands according to the target phase measure in the second time 
frame, wherein a phase derivative over frequency is received, and wherein a transient in the audio signal in the second time frame is corrected using the received phase derivative over frequency.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Embodiments of the present invention will be detailed subsequently referring to the appended drawings, in which:
(7) time-frequency tiles (i.e., Quadrature Mirror Filter bank bins), each defined by a time frame and a subband;
DETAILED DESCRIPTION OF THE INVENTION
(91) In the following, embodiments of the invention will be described in further detail. Elements shown in the respective figures having the same or a similar functionality will have associated therewith the same reference signs.
(92) Embodiments of the present invention will be described with regard to specific signal processing.
(93) 1 Introduction
(94) Perceptual audio coding has proliferated as a mainstream enabling digital technology for all types of applications that provide audio and multimedia to consumers over transmission or storage channels with limited capacity. Modern perceptual audio codecs are expected to deliver satisfactory audio quality at increasingly low bit rates. In turn, one has to accept certain coding artifacts that are most tolerable for the majority of listeners. Audio Bandwidth Extension (BWE) is a technique to artificially extend the frequency range of an audio coder by spectral translation or transposition of transmitted lowband signal parts into the highband, at the price of introducing certain artifacts.
(95) The finding is that some of these artifacts are related to the change of the phase derivative within the artificially extended highband. One of these artifacts is the alteration of phase derivative over frequency (see also vertical phase coherence) [8]. Preservation of said phase derivative is perceptually important for tonal signals having a pulse-train like time domain waveform and a rather low fundamental frequency. Artifacts related to a change of the vertical phase derivative correspond to a local dispersion of energy in time and are often found in audio signals which have been processed by BWE techniques. Another artifact is the alteration of the phase derivative over time (see also horizontal phase coherence) which is perceptually important for overtone-rich tonal signals of any fundamental frequency. Artifacts related to an alteration of the horizontal phase derivative correspond to a local frequency offset in pitch and are often found in audio signals which have been processed by BWE techniques.
(96) The present invention presents means for readjusting either the vertical or horizontal phase derivative of such signals when this property has been compromised by application of so-called audio bandwidth extension (BWE). Further means are provided to decide if a restoration of the phase derivative is perceptually beneficial and whether adjusting the vertical or horizontal phase derivative is perceptually advantageous.
(97) Bandwidth-extension methods, such as spectral band replication (SBR) [9], are often used in low-bit-rate codecs. They allow transmitting only a relatively narrow low-frequency region alongside parametric information about the higher bands. Since the bit rate of the parametric information is small, a significant improvement in coding efficiency can be obtained.
(98) Typically, the signal for the higher bands is obtained by simply copying it from the transmitted low-frequency region. The processing is usually performed in the complex-modulated quadrature-mirror-filter-bank (QMF) [10] domain, which is also assumed in the following. The copied-up signal is processed by multiplying its magnitude spectrum with suitable gains based on the transmitted parameters. The aim is to obtain a magnitude spectrum similar to that of the original signal. In contrast, the phase spectrum of the copied-up signal is typically not processed at all; instead, the copied-up phase spectrum is used directly.
(99) The perceptual consequences of directly using the copied-up phase spectrum are investigated in the following. Based on the observed effects, two metrics for detecting the perceptually most significant effects are suggested. Moreover, methods for correcting the phase spectrum based on these metrics are suggested. Finally, strategies for minimizing the amount of transmitted parameter values needed for performing the correction are suggested.
(100) The present invention is related to the finding that preservation or restoration of the phase derivative is able to remedy prominent artifacts induced by audio bandwidth extension (BWE) techniques. For instance, typical signals, where the preservation of the phase derivative is important, are tones with rich harmonic overtone content, such as voiced speech, brass instruments or bowed strings.
(101) The present invention further provides means to decide if, for a given signal frame, a restoration of the phase derivative is perceptually beneficial and whether adjusting the vertical or horizontal phase derivative is perceptually advantageous.
(102) The invention teaches an apparatus and a method for phase derivative correction in audio codecs using BWE techniques with the following aspects:
1. Quantification of the importance of phase derivative correction
2. Signal-dependent prioritization of either vertical (frequency) phase derivative correction or horizontal (time) phase derivative correction
3. Signal-dependent switching of correction direction (frequency or time)
4. Dedicated vertical phase derivative correction mode for transients
5. Obtaining stable parameters for a smooth correction
6. Compact side-information transmission format of correction parameters
(103) 2 Presentation of Signals in the QMF Domain
(104) A time-domain signal x(m), where m is discrete time, can be presented in the time-frequency domain, e.g., using a complex-modulated Quadrature Mirror Filter bank (QMF). The resulting signal is X(k, n), where k is the frequency band index and n the temporal frame index. A QMF of 64 bands and a sampling frequency f.sub.s of 48 kHz are assumed for the visualizations and embodiments. Thus, the bandwidth f.sub.BW of each frequency band is 375 Hz and the temporal hop size t.sub.hop is 64/48000 s, i.e., about 1.33 ms.
(105) X(k, n) is a complex signal. Thus, it can also be presented using the magnitude component X.sup.mag(k, n) and the phase component X.sup.pha(k, n), with j being the imaginary unit:
X(k, n)=X.sup.mag(k, n)e.sup.jX.sup.pha(k, n).(1)
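As a minimal sketch (not part of the patent text), the polar decomposition of Eq. (1) can be written in NumPy; the array X below is hypothetical random data standing in for 64 bands by 10 QMF frames:

```python
import numpy as np

# Minimal sketch of Eq. (1): decomposing a complex QMF-domain signal
# X(k, n) into magnitude and phase components and recombining them.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 10)) + 1j * rng.standard_normal((64, 10))

X_mag = np.abs(X)      # X^mag(k, n)
X_pha = np.angle(X)    # X^pha(k, n), in (-pi, pi]

# Eq. (1): X(k, n) = X^mag(k, n) * exp(j * X^pha(k, n))
X_rec = X_mag * np.exp(1j * X_pha)
assert np.allclose(X, X_rec)
```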
(107) The audio signals are presented mostly using X.sup.mag(k, n) and X.sup.pha(k, n) (see the appended figures).
(109) 3 Audio Data
(110) The audio data used to show an effect of a described audio processing are named trombone for an audio signal of a trombone, violin for an audio signal of a violin, and violin+clap for the violin signal with a hand clap added in the middle.
(111) 4 Basic Operation of SBR
(113) Assuming a signal X(k, n), the bandwidth-extended (BWE) signal Z(k, n) is obtained from the input signal X(k, n) by copying up certain parts of the transmitted low-frequency region. An SBR algorithm starts by selecting a frequency region to be transmitted. In this example, the bands from 1 to 7 are selected:
1≤k≤7:X.sub.trans(k, n)=X(k, n).(2)
(114) The number of frequency bands to be transmitted depends on the desired bit rate. The figures and the equations are produced using 7 bands, and from 5 to 11 bands are used for the corresponding audio data. Thus, the cross-over frequencies between the transmitted frequency region and the higher bands range from 1875 to 4125 Hz, respectively. The frequency bands above this region are not transmitted at all; instead, parametric metadata is created for describing them. X.sub.trans(k, n) is coded and transmitted. For the sake of simplicity, it is assumed that the coding does not modify the signal in any way, although the further processing is not limited to this assumed case.
(115) At the receiving end, the transmitted frequency region is directly used for the corresponding frequencies.
(116) For the higher bands, the signal may be created somehow using the transmitted signal. One approach is simply to copy the transmitted signal to higher frequencies. A slightly modified version is used here. First, a baseband signal is selected. It could be the whole transmitted signal, but in this embodiment the first frequency band is omitted. The reason for this is that the phase spectrum was noticed to be irregular for the first band in many cases. Thus, the baseband to be copied up is defined as
1≤k≤6:X.sub.base(k, n)=X.sub.trans(k+1, n).(3)
(117) Other bandwidths can also be used for the transmitted and the baseband signals. Using the baseband signal, raw signals for the higher frequencies are created
Y.sub.raw(k, n, i)=X.sub.base(k, n),(4)
(118) where Y.sub.raw(k, n, i) is the complex QMF signal for the frequency patch i. The raw frequency-patch signals are manipulated according to the transmitted metadata by multiplying them with gains g(k, n, i)
Y(k, n, i)=Y.sub.raw(k, n, i)g(k, n, i).(5)
(119) It should be noted that the gains are real valued, and thus only the magnitude spectrum is affected and thereby adapted to a desired target value. Known approaches show how the gains are obtained. The target phase remains uncorrected in said known approaches.
(120) The final signal to be reproduced is obtained by concatenating the transmitted and the patch signals for seamlessly extending the bandwidth to obtain a BWE signal of the desired bandwidth. In this embodiment, i=7 is assumed.
Z(k, n)=X.sub.trans(k, n),
Z(k+6i+1, n)=Y(k, n, i).(6)
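The copy-up patching of Eqs. (2) to (6) can be sketched as follows, assuming 0-based NumPy indexing (band b here corresponds to band b+1 in the text) and placeholder unit gains; an actual SBR decoder derives g(k, n, i) from the transmitted envelope parameters:

```python
import numpy as np

# Sketch of the copy-up patching of Eqs. (2)-(6), with 0-based NumPy
# indexing (band b here is band b+1 in the text). Unit gains stand in
# for the transmitted envelope parameters of Eq. (5).
rng = np.random.default_rng(1)
K, N = 64, 8
X = rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))

X_trans = X[:7]                        # Eq. (2): bands 1..7 are transmitted
X_base = X_trans[1:]                   # Eq. (3): baseband = bands 2..7

Z = np.zeros((K, N), dtype=complex)
Z[:7] = X_trans                        # Eq. (6), first line
for i in range(1, 8):                  # 7 frequency patches
    g = np.ones(X_base.shape)          # placeholder gains g(k, n, i)
    Y = X_base * g                     # Eqs. (4)-(5)
    Z[7 + 6 * (i - 1): 7 + 6 * i] = Y  # Eq. (6): Z(k+6i+1, n) = Y(k, n, i)

# Bands up to 49 are filled; each patch copies the baseband phase verbatim.
assert np.allclose(Z[7:13], X_base)
```

Because the gains are real valued, only the magnitude of each patch is shaped; the phase of every patch is a verbatim copy of the baseband phase, which is exactly the property the correction methods below address.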
(122) It is assumed that the parametric representation of the higher bands is perfect, i.e., the magnitude spectrum of the reconstructed signal is identical to that of the original signal
Z.sup.mag(k, n)=X.sup.mag(k, n).(7)
(123) However, it should be noted that the phase spectrum is not corrected in any way by the algorithm, so it is not correct even if the algorithm worked perfectly. Therefore, embodiments show how to additionally adapt and correct the phase spectrum of Z(k, n) to a target value such that an improvement of the perceptual quality is obtained. In embodiments, the correction can be performed using three different processing modes, horizontal, vertical and transient. These modes are separately discussed in the following.
(124) Z.sup.mag(k, n) and Z.sup.pha(k, n) are depicted in the appended figures.
(125) 5 Meaning of the Phase Spectrum in the QMF Domain
(126) Often it is thought that the index of the frequency band defines the frequency of a single tonal component, the magnitude defines its level, and the phase defines its timing. However, the bandwidth of a QMF band is relatively large, and the data is oversampled. Thus, the interaction between the time-frequency tiles (i.e., QMF bins) actually defines all of these properties.
(127) A time-domain presentation of a single QMF bin with three different phase values, i.e., X.sup.mag(3,1)=1 and X.sup.pha(3,1)=0, π/2, or π, is depicted in the appended figures.
(128) Considering a case where only one frequency band is non-zero for all temporal frames, i.e.,
∀n:X.sup.mag(3, n)=1.(8)
(129) By changing the phase between the temporal frames with a fixed value Δθ, i.e.,
X.sup.pha(k, n)=X.sup.pha(k, n−1)+Δθ,(9)
(130) a sinusoid is created. The resulting signal (i.e., the time-domain signal after inverse QMF transform) is presented in the appended figures.
(131) Correspondingly, if the phase is selected randomly, the result is narrow-band noise (see the appended figures).
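The relation between a constant per-frame phase increment and the frequency of the synthesized sinusoid can be illustrated numerically. This sketch assumes the 64-band QMF at 48 kHz from Section 2, so one frame hop is 64 samples; the mapping f = Δθ/(2π·t_hop) is an interpretation of Eq. (9) added here, not a formula quoted from the text:

```python
import numpy as np

# Illustrative sketch: with the 64-band QMF at f_s = 48 kHz assumed in
# Section 2, one frame hop is 64 samples. The mapping below, relating a
# constant per-frame phase increment dtheta (Eq. (9)) to the frequency
# offset of the synthesized sinusoid, is an interpretation added here.
f_s = 48000.0
t_hop = 64 / f_s                  # ~1.33 ms per QMF frame

def freq_offset(dtheta):
    """Frequency offset in Hz implied by a per-frame phase increment."""
    return dtheta / (2 * np.pi * t_hop)

# An increment of pi/2 per frame corresponds to 187.5 Hz, i.e. half of
# the 375 Hz QMF band spacing:
print(freq_offset(np.pi / 2))
```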
(133) Considering a case where only one temporal frame is non-zero for all frequency bands, i.e.,
∀k:X.sup.mag(k, 3)=1.(10)
(134) By changing the phase between the frequency bands with a fixed value Δθ, i.e.,
X.sup.pha(k, n)=X.sup.pha(k−1, n)+Δθ,(11)
a transient is created. The resulting signal (i.e., the time-domain signal after inverse QMF transform) is presented in the appended figures.
(135) Correspondingly, if the phase is selected randomly, the result is a short noise burst (see the appended figures).
(137) 6 Measures for Describing Perceptually Relevant Properties of the Phase Spectrum
(138) As discussed in Section 4, the phase spectrum in itself looks quite messy, and it is difficult to see directly what its effect on perception is. Section 5 presented two effects that can be caused by manipulating the phase spectrum in the QMF domain: (a) constant phase change over time produces a sinusoid and the amount of phase change controls the frequency of the sinusoid, and (b) constant phase change over frequency produces a transient and the amount of phase change controls the temporal position of the transient.
(139) The frequency and the temporal position of a partial are obviously significant to human perception, so detecting these properties is potentially useful. They can be estimated by computing the phase derivative over time (PDT)
X.sup.pdt(k, n)=X.sup.pha(k, n+1)−X.sup.pha(k, n)(12)
(140) and by computing the phase derivative over frequency (PDF)
X.sup.pdf(k, n)=X.sup.pha(k+1, n)−X.sup.pha(k, n).(13)
(141) X.sup.pdt(k, n) is related to the frequency and X.sup.pdf(k, n) to the temporal position of a partial. Due to the properties of the QMF analysis (how the phases of the modulators of the adjacent temporal frames match at the position of a transient), π is added to the even temporal frames of X.sup.pdf(k, n) in the figures for visualization purposes in order to produce smooth curves.
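Eqs. (12) and (13) can be sketched directly with NumPy; wrapping the differences to (−π, π] is an implementation choice, since the text treats the derivatives as angular values:

```python
import numpy as np

# Sketch of Eqs. (12)-(13): phase derivative over time (PDT) and over
# frequency (PDF). Wrapping the differences to (-pi, pi] is an
# implementation choice; the text treats the derivatives as angles.
def wrap(a):
    return (a + np.pi) % (2 * np.pi) - np.pi

def pdt(X_pha):
    # X^pdt(k, n) = X^pha(k, n+1) - X^pha(k, n), Eq. (12)
    return wrap(np.diff(X_pha, axis=1))

def pdf(X_pha):
    # X^pdf(k, n) = X^pha(k+1, n) - X^pha(k, n), Eq. (13)
    return wrap(np.diff(X_pha, axis=0))

# A constant per-frame increment of 0.3 rad (cf. Eq. (9)) yields a
# constant PDT of 0.3 in every band:
phases = np.outer(np.ones(5), 0.3 * np.arange(6))  # 5 bands, 6 frames
assert np.allclose(pdt(phases), 0.3)
```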
(142) Next, it is inspected how these measures look for the example signals.
(143) For the trombone, X.sup.pdt is relatively noisy. In contrast, X.sup.pdf appears to have about the same value at all frequencies. In practice, this means that all the harmonic components are aligned in time, producing a transient-like signal. The temporal locations of the transients are determined by the X.sup.pdf values.
(144) The same derivatives can also be computed for the SBR-processed signals Z(k, n) (see the appended figures).
(145) Correspondingly, PDF of the frequency patches is otherwise identical to that of the baseband, but at the cross-over frequencies the PDF is, in practice, random. At the cross-over, the PDF is actually computed between the last and the first phase value of the frequency patch, i.e.,
Z.sup.pdf(7, n)=Z.sup.pha(8, n)−Z.sup.pha(7, n)=Y.sup.pha(1, n, i)−Y.sup.pha(6, n, i)(14)
(146) These values depend on the actual PDF and the cross-over frequency, and they do not match with the values of the original signal.
(147) For the trombone, the PDF values of the copied-up signal are correct apart from the cross-over frequencies. Thus, the temporal locations of most of the harmonics are in the correct places, but the harmonics at the cross-over frequencies are practically at random locations. The perceptual effect of this is discussed in Section 7.
(148) 7 Human Perception of Phase Errors
(149) Sounds can roughly be divided into two categories: harmonic and noise-like signals. The noise-like signals have, by definition, noisy phase properties. Thus, the phase errors caused by SBR are assumed not to be perceptually significant for them. Instead, the focus here is on harmonic signals. Most musical instruments, and also speech, produce a harmonic structure in the signal, i.e., the tone contains strong sinusoidal components spaced in frequency by the fundamental frequency.
(150) Human hearing is often assumed to behave as if it contained a bank of overlapping band-pass filters, referred to as the auditory filters. Thus, the hearing can be assumed to handle complex sounds so that the partial sounds inside the auditory filter are analyzed as one entity. The width of these filters can be approximated to follow the equivalent rectangular bandwidth (ERB) [11], which can be determined according to
ERB=24.7(4.37 f.sub.c+1),(15)
(151) where f.sub.c is the center frequency of the band (in kHz). As discussed in Section 4, the cross-over frequency between the baseband and the SBR patches is around 3 kHz. At these frequencies the ERB is about 350 Hz. The bandwidth of a QMF frequency band is actually relatively close to this, 375 Hz. Hence, the bandwidth of the QMF frequency bands can be assumed to follow ERB at the frequencies of interest.
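Eq. (15) is easy to check numerically; near the ~3 kHz cross-over it indeed yields a bandwidth close to the 375 Hz QMF band spacing (a sketch, with f_c in kHz as in the text):

```python
# Sketch of Eq. (15): equivalent rectangular bandwidth (ERB) in Hz,
# with the center frequency f_c given in kHz as in the text.
def erb(f_c_khz):
    return 24.7 * (4.37 * f_c_khz + 1)

# Near the ~3 kHz cross-over the ERB (~348.5 Hz) is close to the
# 375 Hz bandwidth of a QMF band:
print(round(erb(3.0), 1))
```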
(152) Two properties of a sound that can go wrong due to an erroneous phase spectrum were observed in Section 6: the frequency and the timing of a partial component. Concentrating on the frequency, the question is: can human hearing perceive the frequencies of individual harmonics? If it can, then the frequency offset caused by SBR should be corrected; if not, then correction is not required.
(153) The concept of resolved and unresolved harmonics [12] can be used to clarify this topic. If there is only one harmonic inside the ERB, the harmonic is called resolved. It is typically assumed that human hearing processes resolved harmonics individually and, thus, is sensitive to their frequencies. In practice, changing the frequency of resolved harmonics is perceived as causing inharmonicity.
(154) Correspondingly, if there are multiple harmonics inside the ERB, the harmonics are called unresolved. The human hearing is assumed not to process these harmonics individually, but instead, their joint effect is seen by the auditory system. The result is a periodic signal and the length of the period is determined by the spacing of the harmonics. The pitch perception is related to the length of the period, so human hearing is assumed to be sensitive to it. Nevertheless, if all harmonics inside the frequency patch in SBR are shifted by the same amount, the spacing between the harmonics, and thus the perceived pitch, remains the same. Hence, in the case of unresolved harmonics, human hearing does not perceive frequency offsets as inharmonicity.
(155) Timing-related errors caused by SBR are considered next. By timing, the temporal position, or the phase, of a harmonic component is meant. This should not be confused with the phase of a QMF bin. The perception of timing-related errors was studied in detail in [13]. It was observed that for most signals human hearing is not sensitive to the timing, or the phase, of the harmonic components. However, there are certain signals with which human hearing is very sensitive to the timing of the partials. These signals include, for example, trombone and trumpet sounds and speech. With these signals, a certain phase angle takes place at the same time instant for all harmonics. Neural firing rates of different auditory bands were simulated in [13]. It was found that with these phase-sensitive signals the produced neural firing rate is peaky at all auditory bands and that the peaks are aligned in time. Changing the phase of even a single harmonic can change the peakedness of the neural firing rate with these signals. According to the results of the formal listening test, human hearing is sensitive to this [13]. The produced effects are the perception of an added sinusoidal component or a narrowband noise at the frequencies where the phase was modified.
(156) In addition, it was found that the sensitivity to timing-related effects depends on the fundamental frequency of the harmonic tone [13]. The lower the fundamental frequency, the larger the perceived effects. If the fundamental frequency is above about 800 Hz, the auditory system is not sensitive at all to timing-related effects.
(157) Thus, if the fundamental frequency is low and if the phase of the harmonics is aligned over frequency (which means that the temporal positions of the harmonics are aligned), changes in the timing, or in other words the phase, of the harmonics can be perceived by the human hearing. If the fundamental frequency is high and/or the phase of the harmonics is not aligned over frequency, the human hearing is not sensitive to changes in the timing of the harmonics.
(158) 8 Correction Methods
(159) In Section 7, it was noted that humans are sensitive to errors in the frequencies of resolved harmonics. In addition, humans are sensitive to errors in the temporal positions of the harmonics if the fundamental frequency is low and if the harmonics are aligned over frequency. SBR can cause both of these errors, as discussed in Section 6, so the perceived quality can be improved by correcting them. Methods for doing so are suggested in this section.
(161) 8.1 Correcting Frequency Errors: Horizontal Phase Derivative Correction
(162) As discussed in Section 7, humans can perceive an error in the frequency of a harmonic mostly when there is only one harmonic inside one ERB. Furthermore, the bandwidth of a QMF frequency band can be used to estimate the ERB at the first cross-over. Hence, the frequency has to be corrected only when there is one harmonic inside one frequency band. This is very convenient, since Section 5 showed that, if there is one harmonic per band, the produced PDT values are stable, or slowly changing over time, and can potentially be corrected using a low bit rate.
(164) Embodiments show the phase corrector 70 being configured for correcting subband signals 95 of different subbands of the audio signal 55 within the time frame 75, so that frequencies of corrected subband signals 95 have frequency values being harmonically allocated to a fundamental frequency of the audio signal 55. The fundamental frequency is the lowest frequency occurring in the audio signal 55, or in other words, the first harmonic of the audio signal 55.
(165) Furthermore, the phase corrector 70 is configured for smoothing the deviation 105 for each subband 95 of the plurality of subbands over a previous time frame, the current time frame, and a future time frame 75a to 75c and is configured for reducing rapid changes of the deviation 105 within a subband 95. According to further embodiments, the smoothing is a weighted mean, wherein the phase corrector 70 is configured for calculating the weighted mean over the previous, the current and the future time frames 75a to 75c, weighted by a magnitude of the audio signal 55 in the previous, the current and the future time frame 75a to 75c.
(166) Embodiments implement the previously described processing steps in a vector-based manner. Therefore, the phase corrector 70 is configured for forming a vector of deviations 105, wherein a first element of the vector refers to a first deviation 105a for the first subband 95a of the plurality of subbands and a second element of the vector refers to a second deviation 105b for the second subband 95b of the plurality of subbands from a previous time frame 75a to a current time frame 75b. Furthermore, the phase corrector 70 can apply the vector of deviations 105 to the phases 45 of the audio signal 55, wherein the first element of the vector is applied to a phase 45a of the audio signal 55 in a first subband 95a of a plurality of subbands of the audio signal 55 and the second element of the vector is applied to a phase 45b of the audio signal 55 in a second subband 95b of the plurality of subbands of the audio signal 55.
(167) From another point of view, it can be stated that the whole processing in the audio processor 50 is vector-based, wherein each vector represents a time frame 75, wherein each subband 95 of the plurality of subbands comprises an element of the vector. Further embodiments focus on the target phase measure determiner which is configured for obtaining a fundamental frequency estimate 85b for a current time frame 75b, wherein the target phase measure determiner 65 is configured for calculating a frequency estimate 85 for each subband of the plurality of subbands for the time frame 75 using the fundamental frequency estimate 85 for the time frame 75. Furthermore, the target phase measure determiner 65 may convert the frequency estimates 85 for each subband 95 of the plurality of subbands into a phase derivative over time using a total number of subbands 95 and a sampling frequency of the audio signal 55. For clarification it has to be noted that the output 85 of the target phase measure determiner 65 may be either the frequency estimate or the phase derivative over time, depending on the embodiment. Therefore, in one embodiment the frequency estimate already comprises the right format for further processing in the phase corrector 70, wherein in another embodiment the frequency estimate has to be converted into a suitable format, which may be a phase derivative over time.
(168) Accordingly, the target phase measure determiner 65 may be seen as vector based as well. Therefore, the target phase measure determiner 65 can form a vector of frequency estimates 85 for each subband 95 of the plurality of subbands, wherein the first element of the vector refers to a frequency estimate 85a for a first subband 95a and a second element of the vector refers to a frequency estimate 85b for a second subband 95b. Additionally, the target phase measure determiner 65 can calculate the frequency estimate 85 using multiples of the fundamental frequency, wherein the frequency estimate 85 of the current subband 95 is that multiple of the fundamental frequency which is closest to the center of the subband 95, or wherein the frequency estimate 85 of the current subband is a border frequency of the current subband 95 if none of the multiples of the fundamental frequency are within the current subband 95.
(169) In other words, the suggested algorithm for correcting the errors in the frequencies of the harmonics using the audio processor 50 functions as follows. First, the PDT of the SBR-processed signal is computed: Z.sup.pdt(k, n)=Z.sup.pha(k, n+1)−Z.sup.pha(k, n). The difference between it and a target PDT for the horizontal correction is computed next:
D.sup.pdt(k, n)=Z.sup.pdt(k, n)−Z.sub.th.sup.pdt(k, n).(16a)
(170) At this point the target PDT can be assumed to be equal to the PDT of the input signal
Z.sub.th.sup.pdt(k, n)=X.sup.pdt(k, n).(16b)
(171) Later it will be presented how the target PDT can be obtained with a low bit rate.
(172) This value (i.e. the error value 105) is smoothened over time using a Hann window W(l). A suitable length is, for example, 41 samples in the QMF domain (corresponding to an interval of 55 ms). The smoothing is weighted by the magnitude of the corresponding time-frequency tiles
D.sub.sm.sup.pdt(k, n)=circmean{D.sup.pdt(k, n+l), W(l)Z.sup.mag(k, n+l)}, −20≤l≤20,(17)
(173) where circmean {a, b} denotes computing the circular mean for angular values a weighted by values b. The smoothened error in the PDT D.sub.sm.sup.pdt(k, n) is depicted in
(174) Next, a modulator matrix is created for modifying the phase spectrum in order to obtain the desired PDT
Q.sup.pha(k, n+1)=Q.sup.pha(k, n)−D.sub.sm.sup.pdt(k, n).(18)
(175) The phase spectrum is processed using this matrix
Z.sub.ch.sup.pha(k, n)=Z.sup.pha(k, n)+Q.sup.pha(k, n).(19)
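The chain of Eqs. 16a to 19 can be sketched in NumPy as follows. This is a minimal illustrative sketch, not the patented implementation: the function and variable names are assumptions, and the weighted circular mean is implemented as the angle of the weighted sum of unit phasors.

```python
import numpy as np

def circmean(angles, weights):
    # weighted circular mean: angle of the weighted sum of unit phasors
    return np.angle(np.sum(weights * np.exp(1j * angles)))

def horizontal_correction(z_pha, target_pdt, z_mag, half_win=20):
    # z_pha, z_mag: phase/magnitude spectra, shape (subbands, frames)
    # target_pdt:   target phase derivative over time, same shape
    K, N = z_pha.shape
    z_pdt = np.diff(z_pha, axis=1)              # PDT of the processed signal
    d_pdt = z_pdt - target_pdt[:, : N - 1]      # Eq. 16a: PDT error
    w = np.hanning(2 * half_win + 1)            # Hann window, 41 taps
    d_sm = np.zeros_like(d_pdt)
    for k in range(K):                          # Eq. 17: magnitude-weighted
        for n in range(N - 1):                  # circular smoothing over time
            lo, hi = max(0, n - half_win), min(N - 1, n + half_win + 1)
            ww = w[lo - n + half_win : hi - n + half_win] * z_mag[k, lo:hi]
            d_sm[k, n] = circmean(d_pdt[k, lo:hi], ww)
    q = np.zeros((K, N))                        # Eq. 18: modulator matrix
    for n in range(N - 1):
        q[:, n + 1] = q[:, n] - d_sm[:, n]
    return z_pha + q                            # Eq. 19: corrected phases
```

For a tone whose processed PDT is constant but wrong, the accumulated modulator pulls the per-frame phase increment onto the target value.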
(177) Using X.sup.pdt(k, n) as a target PDT, it would be necessary to transmit the PDT-error values D.sub.sm.sup.pdt(k, n) for each time-frequency tile. A further approach, which calculates the target PDT such that the bandwidth needed for transmission is reduced, is shown in Section 9.
(178) In further embodiments, the audio processor 50 may be part of a decoder 110. Therefore, the decoder 110 for decoding an audio signal 55 may comprise the audio processor 50, a core decoder 115, and a patcher 120. The core decoder 115 is configured for core decoding an audio signal 25 in a time frame 75 with a reduced number of subbands with respect to the audio signal 55. The patcher patches a set of subbands 95 of the core decoded audio signal 25 with a reduced number of subbands, wherein the set of subbands forms a first patch 40a, to further subbands in the time frame 75, adjacent to the reduced number of subbands, to obtain an audio signal 55 with a regular number of subbands. Additionally, the audio processor 50 is configured for correcting the phases 45 within the subbands of the first patch 40a according to a target function 85. The audio processor 50 and the audio signal 55 have been described with respect to
(179) According to further embodiments, the patcher 120 is configured for patching a set of subbands 95 of the audio signal 25, wherein the set of subbands forms a second patch, to further subbands of the time frame, adjacent to the first patch and wherein the audio processor 50 is configured for correcting the phase 45 within the subbands of the second patch. Alternatively, the patcher 120 is configured for patching the corrected first patch to further subbands of the time frame, adjacent to the first patch.
(180) In other words, in the first option the patcher builds an audio signal with a regular number of subbands from the transmitted part of the audio signal and thereafter the phases of each patch of the audio signal are corrected. The second option first corrects the phases of the first patch with respect to the transmitted part of the audio signal and thereafter builds the audio signal with the regular number of subbands with the already corrected first patch.
(181) Further embodiments show the decoder 110 comprising a data stream extractor 130 configured for extracting a fundamental frequency 140 of the current time frame 75 of the audio signal 55 from a data stream 135, wherein the data stream further comprises the encoded audio signal 145 with a reduced number of subbands. Alternatively, the decoder may comprise a fundamental frequency analyzer 150 configured for analyzing the core decoded audio signal 25 in order to calculate the fundamental frequency 140. In other words, options for deriving the fundamental frequency 140 are for example an analysis of the audio signal in the decoder or in the encoder, wherein in the latter case the fundamental frequency may be more accurate at the cost of a higher data rate, since the value has to be transmitted from the encoder to the decoder.
(183) In an alternative embodiment an intelligent gap filling encoder may be used for encoding the audio signal 55. In this case, the core encoder encodes a full bandwidth audio signal, wherein at least one subband of the audio signal is left out. The parameter extractor 165 then extracts parameters for reconstructing the subbands left out from the encoding process of the core encoder 160.
(189) The described methods 2300, 2400 and 2500 may be implemented in a program code of a computer program for performing the methods when the computer program runs on a computer.
(190) 8.2 Correcting Temporal Errors: Vertical Phase Derivative Correction
(191) As discussed previously, humans can perceive an error in the temporal position of a harmonic if the harmonics are synced over frequency and if the fundamental frequency is low. In Section 5 it was shown that the harmonics are synced if the phase derivative over frequency is constant in the QMF domain. Therefore, it is advantageous to have at least one harmonic in each frequency band. Otherwise the empty frequency bands would have random phases and would disturb this measure. Luckily, humans are sensitive to the temporal location of the harmonics only when the fundamental frequency is low (see Section 7). Thus, the phase derivative over frequency can be used as a measure for determining perceptually significant effects due to temporal movements of the harmonics.
(194) Regarding further embodiments, the plurality of subbands 95 is grouped into a baseband 30 and a set of frequency patches 40, the baseband 30 comprising at least one subband 95 of the audio signal 55 and the set of frequency patches 40 comprising the at least one subband 95 of the baseband 30 at a frequency higher than the frequency of the at least one subband in the baseband. It has to be noted that the patching of the audio signal has already been described with respect to
(196) A further embodiment is depicted at the bottom of
(197) In a further embodiment, the audio signal phase derivative calculator 210 is configured for calculating a mean of phase derivatives over frequency 215 for a plurality of subband signals comprising higher frequencies than the baseband signal 30 to detect transients in the subband signal 95. It has to be noted that the transient correction is similar to the vertical phase correction of the audio processor 50 with the difference that the frequencies in the baseband 30 do not reflect the higher frequencies of a transient. Therefore, these frequencies have to be taken into consideration for the phase correction of a transient.
(198) After the initialization step, the phase corrector 70 is configured for recursively updating, based on the frequency patches 40, the further modified patch signal 40 by adding the mean of the phase derivatives over frequency 215, weighted by the subband index of the current subband 95, to the phase of the subband signal with the highest subband index in the previous frequency patch. An advantageous embodiment is a combination of the previously described embodiments, where the phase corrector 70 calculates a weighted mean of the modified patch signal 40 and the further modified patch signal 40 to obtain a combined modified patch signal 40. Therefore, the phase corrector 70 recursively updates, based on the frequency patches 40, a combined modified patch signal 40 by adding the mean of the phase derivatives over frequency 215, weighted by the subband index of the current subband 95, to the phase of the subband signal with the highest subband index in the previous frequency patch of the combined modified patch signal 40. To obtain the combined modified patches 40a, 40b, etc., the switch 220b is shifted to the next position after each recursion, starting at the combined modified patch 40a for the initialization step, switching to the combined modified patch 40b after the first recursion and so on.
(199) Furthermore, the phase corrector 70 may calculate a weighted mean of a patch signal 40 and the modified patch signal 40 using a circular mean of the patch signal 40 in the current frequency patch weighted with a first specific weighting function and the modified patch signal 40 in the current frequency patch weighted with a second specific weighting function.
(200) In order to provide an interoperability between the audio processor 50 and the audio processor 50, the phase corrector 70 may form a vector of phase deviations, wherein the phase deviations are calculated using a combined modified patch signal 40 and the audio signal 55.
(202) The second correction mode is therefore applied on the combined modified patch signal 40 to obtain the modified patch signal 40 for the second time frame 75b. Additionally, the first correction mode is applied on the patches of the audio signal 55 in the second time frame 75b to obtain the patch signal 40. Again, a combination of the patch signal 40 and the modified patch signal 40 results in the combined modified patch signal 40. The processing scheme described for the second time frame is applied to the third time frame 75c and any further time frame of the audio signal 55 accordingly.
(206) According to a further embodiment, the patcher 120 is configured for patching the set of subbands 95 of the audio signal 25, wherein the set of subbands forms a further patch, to further subbands of the time frame, adjacent to the patch, and wherein the audio processor 50 is configured for correcting the phases within the subbands of the further patch. Alternatively, the patcher 120 is configured for patching the corrected patch to further subbands of the time frame adjacent to the patch.
(207) A further embodiment is related to a decoder for decoding an audio signal comprising a transient, wherein the audio processor 50 is configured to correct the phase of the transient. The transient handling is described in other words in section 8.4. Therefore, the decoder 110 comprises a further audio processor 50 for receiving a further phase derivative over frequency and to correct transients in the audio signal 32 using the received phase derivative over frequency. Furthermore, it has to be noted that the decoder 110 of
(213) In other words, the suggested algorithm for correcting the errors in the temporal positions of the harmonics functions as follows. First, a difference between the phase spectra of the target signal and the SBR-processed signal (Z.sub.tv.sup.pha(k, n) and Z.sup.pha(k, n)) is computed
D.sup.pha(k, n)=Z.sup.pha(k, n)−Z.sub.tv.sup.pha(k, n),(20a)
which is depicted in
At this point the target phase spectrum can be assumed to be equal to the phase spectrum of the input signal
Z.sub.tv.sup.pha(k, n)=X.sup.pha(k, n).(20b)
(214) Later it will be presented how the target phase spectrum can be obtained with a low bit rate.
(215) The vertical phase derivative correction is performed using two methods, and the final corrected phase spectrum is obtained as a mix of them.
(216) First, it can be seen that the error is relatively constant inside the frequency patch, and the error jumps to a new value when entering a new frequency patch. This makes sense, since the phase is changing with a constant value over frequency at all frequencies in the original signal. The error is formed at the cross-over and the error remains constant inside the patch. Thus, a single value is enough for correcting the phase error for the whole frequency patch. Furthermore, the phase error of the higher frequency patches can be corrected using this same error value after multiplication with the index number of the frequency patch.
(217) Therefore, the circular mean of the phase error is computed for the first frequency patch
D.sub.avg.sup.pha(n)=circmean{D.sup.pha(k, n)}, 8≤k≤13.(21)
(218) The phase spectrum can be corrected using it
Y.sub.cv1.sup.pha(k, n, i)=Y.sup.pha(k, n, i)−i·D.sub.avg.sup.pha(n).(22)
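As a minimal sketch (illustrative names, not from the text), the raw correction of Eqs. 21 and 22 reduces to a single circular mean per time frame, scaled by the patch index for the higher patches:

```python
import numpy as np

def circmean(angles):
    # unweighted circular mean of angular values
    return np.angle(np.mean(np.exp(1j * np.asarray(angles))))

def vertical_correction_raw(y_pha, d_pha_first_patch):
    # y_pha: phases, shape (subbands_per_patch, num_patches); one
    # circular-mean error of the first patch, scaled by the patch
    # index i, corrects every patch (Eqs. 21-22)
    d_avg = circmean(d_pha_first_patch)       # Eq. 21
    i = np.arange(1, y_pha.shape[1] + 1)      # patch indices 1..I
    return y_pha - i[None, :] * d_avg         # Eq. 22
```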
(219) This raw correction produces an accurate result if the target PDF, e.g. the phase derivative over frequency X.sup.pdf(k, n), is exactly constant at all frequencies. However, as can be seen in
(220) The other correction method begins by computing a mean of the PDF in the baseband
X.sub.avg.sup.pdf(n)=circmean{X.sub.base.sup.pdf(k, n)}.(23)
(221) The phase spectrum can be corrected using this measure by assuming that the phase is changing with this average value, i.e.,
Y.sub.cv2.sup.pha(k, n, 1)=X.sub.base.sup.pha(6, n)+k·X.sub.avg.sup.pdf(n),
Y.sub.cv2.sup.pha(k, n, i)=Y.sub.cv.sup.pha(6, n, i−1)+k·X.sub.avg.sup.pdf(n),(24)
(222) wherein Y.sub.cv.sup.pha is the combined patch signal of the two correction methods.
(223) This correction provides good quality at the cross-overs, but can cause a drift in the PDF towards higher frequencies. In order to avoid this, the two correction methods are combined by computing a weighted circular mean of them
Y.sub.cv.sup.pha(k, n, i)=circmean{Y.sub.cv12.sup.pha(k, n, i, c), W.sub.fc(k, c)},(25)
(224) where c denotes the correction method (Y.sub.cv1.sup.pha or Y.sub.cv2.sup.pha) and W.sub.fc(k, c) is the weighting function
W.sub.fc(k, 1)=[0.2, 0.45, 0.7, 1, 1, 1],
W.sub.fc(k, 2)=[0.8, 0.55, 0.3, 0, 0, 0].(26a)
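A sketch of the combination of Eqs. 25 and 26a, assuming frequency patches of six subbands and the weighting vectors given above (the function name and array shapes are assumptions):

```python
import numpy as np

def combine_corrections(y_cv1, y_cv2):
    # y_cv1, y_cv2: the two corrected phase spectra of one frequency
    # patch, shape (6, num_frames); per-subband weighted circular mean
    w1 = np.array([0.2, 0.45, 0.7, 1.0, 1.0, 1.0])  # W_fc(k, 1)
    w2 = np.array([0.8, 0.55, 0.3, 0.0, 0.0, 0.0])  # W_fc(k, 2)
    s = w1[:, None] * np.exp(1j * y_cv1) + w2[:, None] * np.exp(1j * y_cv2)
    return np.angle(s)                              # Eq. 25
```

The lower subbands of each patch lean on the cross-over-friendly method 2, the upper subbands entirely on the drift-free method 1.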
(225) The resulting phase spectrum Y.sub.cv.sup.pha(k, n, i) suffers neither from discontinuities nor drifting. The error compared to the original spectrum and the PDF of the corrected phase spectrum are depicted in
(226) The corrected phase spectrum Z.sub.cv.sup.pha(k, n) is obtained by concatenating the corrected frequency patches Y.sub.cv.sup.pha(k, n, i). To be compatible with the horizontal-correction mode, the vertical phase correction can also be presented using a modulator matrix (see Eq. 18)
Q.sup.pha(k, n)=Z.sub.cv.sup.pha(k, n)−Z.sup.pha(k, n).(26b)
(227) 8.3 Switching Between Different Phase-Correction Methods
(228) Sections 8.1 and 8.2 showed that SBR-induced phase errors can be corrected by applying PDT correction to the violin and PDF correction to the trombone. However, it was not considered how to know which one of the corrections should be applied to an unknown signal, or if any of them should be applied. This section proposes a method for automatically selecting the correction direction. The correction direction (horizontal/vertical) is decided based on the variation of the phase derivatives of the input signal.
(229) Therefore, in
(230) Furthermore, the variation determiner 275 may be configured for determining a standard deviation measure of a phase derivative over time (PDT) for a plurality of time frames of the audio signal 55 as the variation 290a of the phase in the first variation mode and for determining a standard deviation measure of a phase derivative over frequency (PDF) for a plurality of subbands of the audio signal 55 as the variation 290b of the phase in the second variation mode. Therefore, the variation comparator 280 compares the measure of the phase derivative over time as the first variation 290a and the measure of the phase derivative over frequency as a second variation 290b for time frames of the audio signal.
(231) Embodiments show the variation determiner 275 for determining a circular standard deviation of a phase derivative over time of a current and a plurality of previous frames of the audio signal 55 as the standard deviation measure and for determining a circular standard deviation of a phase derivative over time of a current and a plurality of future frames of the audio signal 55 for a current time frame as the standard deviation measure. Furthermore, the variation determiner 275 calculates, when determining the first variation 290a, a minimum of both circular standard deviations. In a further embodiment, the variation determiner 275 calculates the variation 290a in the first variation mode as a combination of a standard deviation measure for a plurality of subbands 95 in a time frame 75 to form an averaged standard deviation measure of a frequency. The variation comparator 280 is configured for performing the combination of the standard deviation measures by calculating an energy-weighted mean of the standard deviation measures of the plurality of subbands using magnitude values of the subband signal 95 in the current time frame 75 as an energy measure.
(232) In an advantageous embodiment, the variation determiner 275 smoothens the averaged standard deviation measure, when determining the first variation 290a, over the current time frame, a plurality of previous time frames and a plurality of future time frames. The smoothing is weighted according to an energy calculated using corresponding time frames and a windowing function. Furthermore, the variation determiner 275 is configured for smoothing the standard deviation measure, when determining the second variation 290b over the current time frame, a plurality of previous time frames, and a plurality of future time frames 75, wherein the smoothing is weighted according to the energy calculated using corresponding time frames 75 and a windowing function. Therefore, the variation comparator 280 compares the smoothened average standard deviation measure as the first variation 290a determined using the first variation mode and compares the smoothened standard deviation measure as the second variation 290b determined using the second variation mode.
(233) An advantageous embodiment is depicted in
(234) The second processing path comprises a PDF calculator 300b for calculating a phase derivative over frequency 305b from the audio signal 55 or a phase of the audio signal. A circular standard deviation calculator 310b forms a standard deviation measure 335b of the phase derivative over frequency 305b. The standard deviation measure 335b is smoothened by a smoother 340b to form a smoothened standard deviation measure 345b. The smoothened average standard deviation measure 345a and the smoothened standard deviation measure 345b are the first and the second variation, respectively. The variation comparator 280 compares the first and the second variation and the correction data calculator 285 calculates the phase correction data 295 based on the comparison of the first and the second variation.
(235) Further embodiments show the calculator 270 handling three different phase correction modes. A figurative block diagram is shown in
(236) The variation comparator 280 has to determine a suitable correction mode based on the three variations. Based on this decision, the correction data calculator 285 calculates the phase correction data 295 in accordance with a third variation mode if a transient is detected. Furthermore, the correction data calculator 285 calculates the phase correction data 295 in accordance with a first variation mode if an absence of a transient is detected and if the first variation 290a, determined in the first variation mode, is smaller than or equal to the second variation 290b, determined in the second variation mode. Accordingly, the phase correction data 295 is calculated in accordance with the second variation mode if an absence of a transient is detected and if the second variation 290b, determined in the second variation mode, is smaller than the first variation 290a, determined in the first variation mode.
(237) The correction data calculator 285 is further configured for calculating the phase correction data 295 for the third variation mode for a current time frame, one or more previous and one or more future time frames. Accordingly, the correction data calculator 285 is configured for calculating the phase correction data 295 for the second variation mode for a current time frame, one or more previous time frames and one or more future time frames. Furthermore, the correction data calculator 285 is configured for calculating correction data 295 for a horizontal phase correction in the first variation mode, calculating correction data 295 for a vertical phase correction in the second variation mode, and calculating correction data 295 for a transient correction in the third variation mode.
(239) In other words, the PDT of the violin is smooth over time whereas the PDF of the trombone is smooth over frequency. Hence, the standard deviation (STD) of these measures as a measure of the variation can be used to select the appropriate correction method. The STD of the phase derivative over time can be computed as
X.sup.std1(k, n)=circstd{X.sup.pdt(k, n+l)}, −23≤l≤0,
X.sup.std2(k, n)=circstd{X.sup.pdt(k, n+l)}, 0≤l≤23,
X.sup.stdt(k, n)=min{X.sup.std1(k, n), X.sup.std2(k, n)},(27)
(240) and the STD of the phase derivative over frequency as
X.sup.stdf(n)=circstd{X.sup.pdf(k, n)}, 2≤k≤13,(28)
(241) where circstd{ } denotes computing circular STD (the angle values could potentially be weighted by energy in order to avoid high STD due to noisy low-energy bins, or the STD computation could be restricted to bins with sufficient energy). The STDs for the violin and the trombone are shown in
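The circular STD and the past/future minimum of Eq. 27 can be sketched as follows (illustrative names; the optional energy weighting mentioned above is omitted for brevity):

```python
import numpy as np

def circstd(angles, axis=None):
    # circular standard deviation: sqrt(-2*ln R), where R is the mean
    # resultant length of the unit phasors
    r = np.abs(np.mean(np.exp(1j * np.asarray(angles)), axis=axis))
    r = np.clip(r, 1e-12, 1.0)      # guard against log(0) and r > 1
    return np.sqrt(-2.0 * np.log(r))

def pdt_deviation(x_pdt, n, span=23):
    # Eq. 27: STD over past frames, STD over future frames, then the
    # minimum of the two (avoids penalizing a single change point)
    past = circstd(x_pdt[:, max(0, n - span) : n + 1], axis=1)
    future = circstd(x_pdt[:, n : n + span + 1], axis=1)
    return np.minimum(past, future)
```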
(242) The used correction method for each temporal frame is selected based on which of the STDs is lower. For that, X.sup.stdt(k, n) values have to be combined over frequency. The merging is performed by computing an energy-weighted mean for a predefined frequency range
(244) The deviation estimates are smoothened over time in order to have smooth switching, and thus to avoid potential artifacts. The smoothing is performed using a Hann window and it is weighted by the energy of the temporal frame
where W(l) is the window function and X.sup.mag(n)=Σ.sub.k=1.sup.64 X.sup.mag(k, n) is the sum of X.sup.mag(k, n) over frequency. A corresponding equation is used for smoothing X.sup.stdf(n).
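Since the exact smoothing equation is not reproduced above, the following is one plausible NumPy sketch of the energy-weighted Hann smoothing (all names, and the normalized-moving-average form itself, are assumptions):

```python
import numpy as np

def smooth_deviation(std, frame_energy, half_win=10):
    # one plausible form of the energy-weighted Hann smoothing: a
    # normalized moving average with weights W(l) * X^mag(n + l)
    w = np.hanning(2 * half_win + 1)
    out = np.empty_like(std)
    for n in range(len(std)):
        lo, hi = max(0, n - half_win), min(len(std), n + half_win + 1)
        ww = w[lo - n + half_win : hi - n + half_win] * frame_energy[lo:hi]
        out[n] = np.sum(ww * std[lo:hi]) / np.sum(ww)
    return out
```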
(246) The phase-correction method is determined by comparing X.sub.sm.sup.stdt(n) and X.sub.sm.sup.stdf(n). The default method is PDT (horizontal) correction, and if X.sub.sm.sup.stdf(n)<X.sub.sm.sup.stdt(n), PDF (vertical) correction is applied for the interval [n−5, n+5]. If both of the deviations are large, e.g. larger than a predefined threshold value, neither of the correction methods is applied, and bit-rate savings could be made.
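The per-frame decision rule just described can be sketched as follows (illustrative names; the text does not specify the threshold value, so it is left as a parameter, and the [n−5, n+5] interval handling is omitted):

```python
def select_correction(std_t_sm, std_f_sm, upper_bound=None):
    # decision rule: horizontal (PDT) correction is the default;
    # vertical (PDF) correction wins when its smoothed deviation is
    # lower; if both deviations exceed the (unspecified) threshold,
    # neither correction is applied
    if upper_bound is not None and std_t_sm > upper_bound and std_f_sm > upper_bound:
        return "none"
    return "vertical" if std_f_sm < std_t_sm else "horizontal"
```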
(247) 8.4 Transient Handling: Phase Derivative Correction for Transients
(248) The violin signal with a hand clap added in the middle is presented
(249) The solution to the problem is straightforward. First, the transients are detected using a simple energy-based method. The instant energy of mid/high frequencies is compared to a smoothened energy estimate. The instant energy of mid/high frequencies is computed as
(251) The smoothing is performed using a first-order IIR filter
X.sub.sm.sup.magmh(n)=0.1·X.sup.magmh(n)+0.9·X.sub.sm.sup.magmh(n−1).(32)
(252) If X.sup.magmh(n)/X.sub.sm.sup.magmh(n)>α, a transient has been detected. The threshold α can be fine-tuned to detect the desired amount of transients. For example, α=2 can be used. The detected frame is not directly selected to be the transient frame. Instead, the local energy maximum is searched from the surrounding of it. In the current implementation the selected interval is [n−2, n+7]. The temporal frame with the maximum energy inside this interval is selected to be the transient.
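The detection logic of this paragraph can be sketched as follows (a minimal NumPy sketch; the function name, the threshold symbol, and the handling of repeated detections of the same peak are assumptions):

```python
import numpy as np

def detect_transients(energy_mh, alpha=2.0, back=2, ahead=7):
    # energy_mh: per-frame mid/high-band energy X^magmh(n); first-order
    # IIR smoothing (Eq. 32), ratio test against alpha, then the local
    # energy maximum in [n - back, n + ahead] is taken as the transient
    transients = []
    sm = energy_mh[0]
    for n in range(1, len(energy_mh)):
        sm = 0.1 * energy_mh[n] + 0.9 * sm      # Eq. 32
        if sm > 0 and energy_mh[n] / sm > alpha:
            lo = max(0, n - back)
            hi = min(len(energy_mh), n + ahead + 1)
            peak = lo + int(np.argmax(energy_mh[lo:hi]))
            if peak not in transients:
                transients.append(peak)
    return transients
```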
(253) In theory, the vertical correction mode could also be applied for transients. However, in the case of transients, the phase spectrum of the baseband often does not reflect the high frequencies. This can lead to pre- and post-echoes in the processed signal. Thus, slightly modified processing is suggested for the transients.
(254) The average PDF of the transient at high frequencies is computed
X.sub.avghi.sup.pdf(n)=circmean{X.sup.pdf(k, n)}, 11≤k≤36.(33)
(255) The phase spectrum for the transient frame is synthesized using this constant phase change as in Eq. 24, but X.sub.avg.sup.pdf(n) is replaced by X.sub.avghi.sup.pdf(n). The same correction is applied to the temporal frames within the interval [n−2, n+2] (π is added to the PDF of the frames n−1 and n+1 due to the properties of the QMF, see Section 6). This correction already produces a transient at a suitable position, but the shape of the transient is not necessarily as desired, and significant side lobes (i.e., additional transients) can be present due to the considerable temporal overlap of the QMF frames. Hence, the absolute phase angle has to be correct, too. The absolute angle is corrected by computing the mean error between the synthesized and the original phase spectrum. The correction is performed separately for each temporal frame of the transient.
(256) The result of the transient correction is presented in
(257) 9 Compression of the Correction Data
(258) Section 8 showed that the phase errors can be corrected, but the adequate bit rate for the correction was not considered at all. This section suggests methods for representing the correction data with a low bit rate.
(259) 9.1 Compression of the PDT Correction Data: Creating the Target Spectrum for the Horizontal Correction
(260) There are many possible parameters that could be transmitted to enable the PDT correction. However, since D.sub.sm.sup.pdt(k, n) is smoothened over time, it is a potential candidate for low-bit-rate transmission.
(261) First, an adequate update rate for the parameters is discussed. The value was updated only every N frames and linearly interpolated in between. The update interval for good quality is about 40 ms. For certain signals a slightly shorter interval is advantageous and for others a slightly longer one. Formal listening tests would be useful for assessing an optimal update rate. Nevertheless, a relatively long update interval appears to be acceptable.
(262) An adequate angular accuracy for D.sub.sm.sup.pdt(k, n) was also studied. 6 bits (64 possible angle values) is enough for perceptually good quality. Furthermore, transmitting only the change in the value was tested. Often the values appear to change only a little, so uneven quantization can be applied to have more accuracy for small changes. Using this approach, 4 bits (16 possible angle values) was found to provide good quality.
(263) The last thing to consider is an adequate spectral accuracy. As can be seen in
(264) 9.1.1 Using Frequency Estimation for Compressing PDT Correction Data
(265) As discussed in Section 5, the phase derivative over time basically means the frequency of the produced sinusoid. The PDTs of the applied 64-band complex QMF can be transformed to frequencies using the following equation
(267) The produced frequencies are inside the interval f.sub.inter(k)=[f.sub.c(k)−f.sub.BW, f.sub.c(k)+f.sub.BW], where f.sub.c(k) is the center frequency of the frequency band k and f.sub.BW is 375 Hz. The result is shown in
(268) The same plot can be applied to the direct copy-up Z.sup.freq(k, n) and the corrected Z.sub.ch.sup.freq(k, n) SBR (see
(269) Since the frequencies of X.sup.freq(k, n) are spaced by the same amount, the frequencies of all frequency bands can be approximated if the spacing between the frequencies is estimated and transmitted. In the case of harmonic signals, the spacing should be equal to the fundamental frequency of the tone. Thus, only a single value has to be transmitted for representing all frequency bands. In the case of more irregular signals, more values are needed for describing the harmonic behavior. For example, the spacing of the harmonics slightly increases in the case of a piano tone [14]. For simplicity, it is assumed in the following that the harmonics are spaced by the same amount. Nonetheless, this does not limit the generality of the described audio processing.
(270) Thus, the fundamental frequency of the tone is estimated for estimating the frequencies of the harmonics. The estimation of the fundamental frequency is a widely studied topic (e.g., see [14]). Therefore, a simple estimation method was implemented to generate data used for further processing steps. The method basically computes the spacings of the harmonics, and combines the result according to some heuristics (how much energy, how stable the value is over frequency and time, etc.). In any case, the result is a fundamental-frequency estimate X.sup.f.sup.0(n) for each temporal frame.
(271) Here, the fundamental frequency X.sup.f.sup.0(n) is estimated in the encoding stage and transmitted to the decoder.
(272) Alternatively, the fundamental frequency could be estimated in the decoding stage, and no information has to be transmitted. However, better estimates can be expected if the estimation is performed with the original signal in the encoding stage.
(273) The decoder processing begins by obtaining a fundamental-frequency estimate X.sup.f.sup.0(n) for the current time frame.
(274) The frequencies of the harmonics can be obtained by multiplying it with an index vector κ=[1, 2, 3, . . . ]:
X.sup.harm(κ, n)=κ·X.sup.f.sup.0(n).
(275) The result is depicted in
(276) The transmitted parameter of the algorithm is the fundamental frequency X.sup.f.sup.0(n).
(277) The next step of the algorithm is to find a suitable value for each frequency band. This is performed by selecting the value of X.sup.harm(, n) which is closest to the center frequency of each band f.sub.c(k) to reflect that band. If the closest value is outside the possible values of the frequency band (f.sub.inter(k)), the border value of the band is used. The resulting matrix X.sub.eh.sup.freq(k, n) contains a frequency for each time-frequency tile.
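The band-wise selection just described can be sketched as follows, assuming the 64-band QMF at a 48 kHz sampling rate with band centers spaced 375 Hz apart (the center-frequency grid f_c(k) = (k + 0.5)·375 Hz and the function name are assumptions):

```python
import numpy as np

def harmonic_band_frequencies(f0, num_bands=64, fs=48000.0):
    # pick, for each band, the multiple of f0 closest to the band
    # center f_c(k); clamp to f_inter(k) = [f_c(k) - f_BW, f_c(k) + f_BW]
    # (f_BW = 375 Hz) when no harmonic lies inside the band's range
    f_bw = fs / (2 * num_bands)                      # 375 Hz
    f_c = (np.arange(num_bands) + 0.5) * f_bw
    harm = np.maximum(1, np.round(f_c / f0)) * f0    # closest harmonic
    return np.clip(harm, f_c - f_bw, f_c + f_bw)     # X_eh^freq(k, n)
```

For f0 well below a band's width the closest multiple lands inside the band; for f0 far above it (the low bands) the estimate falls back to the band border.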
(278) The final step of the correction-data compression algorithm is to convert the frequency data back to the PDT data
(280) where mod( ) denotes the modulo operator. The actual correction algorithm works as presented in Section 8.1. Z.sub.th.sup.pdt(k, n) in Eq. 16a is replaced by X.sub.eh.sup.pdt(k, n) as the target PDT, and Eqs. 17-19 are used as in Section 8.1. The result of the correction algorithm with compressed correction data is shown in
(281) Embodiments use more accuracy for low frequencies and less for high frequencies, using a total of 12 bits for each value. The resulting bit rate is about 0.5 kbps (without any compression, such as entropy coding). This accuracy produces perceived quality equal to that of no quantization. However, a significantly lower bit rate can probably be used in many cases while still producing good enough perceived quality.
(282) One option for low-bit-rate schemes is to estimate the fundamental frequency in the decoding phase using the transmitted signal. In this case no values have to be transmitted. Another option is to estimate the fundamental frequency using the transmitted signal, compare it to the estimate obtained using the broadband signal, and to transmit only the difference. It can be assumed that this difference could be represented using very low bit rate.
(283) 9.2 Compression of the PDF Correction Data
(284) As discussed in Section 8.2, the adequate data for the PDF correction is the average phase error of the first frequency patch D.sub.avg.sup.pha(n). With knowledge of this value, the correction can be performed for all frequency patches, so only one value per temporal frame needs to be transmitted. However, transmitting even a single value for each temporal frame can yield too high a bit rate.
(285) Inspecting
(286) Hence, the PDF (or the location of a transient) can be transmitted only sparsely in time, and the PDF behavior in between these time instants could be estimated using the knowledge of the fundamental frequency. The PDF correction can be performed using this information. This idea is actually dual to the PDT correction, where the frequencies of the harmonics are assumed to be equally spaced. Here, the same idea is used, but instead, the temporal locations of the transients are assumed to be equally spaced. A method is suggested in the following that is based on detecting the positions of the peaks in the waveform, and using this information, a reference spectrum is created for phase correction.
(287) 9.2.1 Using Peak Detection for Compressing PDF Correction Data
Creating the Target Spectrum for the Vertical Correction
(288) The positions of the peaks have to be estimated for performing successful PDF correction. One solution would be to compute the positions of the peaks using the PDF value, similarly to Eq. 34, and to estimate the positions of the peaks in between using the estimated fundamental frequency. However, this approach would require a relatively stable fundamental-frequency estimate. Embodiments show a simple, fast-to-implement alternative method, which demonstrates that the suggested compression approach is feasible.
(289) A time-domain representation of the trombone signal is shown in
(290) Using the transmitted metadata, a time-domain signal is created, which consists of impulses in the positions of the estimated peaks (see
(291) The waveform of signals having vertical phase coherence is typically peaky and reminiscent of a pulse train. Thus, it is suggested that the target phase spectrum for the vertical correction can be estimated by modeling it as the phase spectrum of a pulse train that has peaks at corresponding positions and a corresponding fundamental frequency.
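Modeling the target phase spectrum as that of a pulse train can be sketched as below. This is an illustrative sketch, not the patent's implementation: peak positions are given as sample indices, a single analysis frame is considered, and the DFT phase of the impulse train serves as the target phase spectrum.

```python
import numpy as np

# Hedged sketch: build a pulse train with unit impulses at the estimated
# peak positions (sample indices) and take the phase of its DFT for one
# analysis frame; this serves as the target phase spectrum.
def pulse_train_phase(peak_positions, frame_start, frame_len):
    frame = np.zeros(frame_len)
    for p in peak_positions:
        if frame_start <= p < frame_start + frame_len:
            frame[p - frame_start] = 1.0   # unit impulse at each peak
    return np.angle(np.fft.rfft(frame))    # target phase per frequency bin
```

For a single impulse, the resulting phase is linear over frequency, which matches the vertical phase coherence this correction aims to restore.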
(292) The position closest to the center of the temporal frame is transmitted for, e.g., every 20th temporal frame (corresponding to an interval of 27 ms). The estimated fundamental frequency, which is transmitted at the same rate, is used to interpolate the peak positions in between the transmitted positions.
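The interpolation of peak positions between two transmitted anchors can be sketched as follows. This is a hedged illustration, not the patent's algorithm: it assumes peaks are equally spaced at intervals of fs/f0 samples, as the text suggests, and all names are hypothetical.

```python
# Hedged sketch: between two transmitted peak positions (sample indices),
# the peaks of a signal with fundamental frequency f0 are assumed equally
# spaced at intervals of fs / f0 samples; intermediate peak positions are
# interpolated between the anchors.
def interpolate_peaks(p_start, p_end, f0, fs):
    period = fs / f0                       # assumed peak spacing in samples
    n = int(round((p_end - p_start) / period))
    if n <= 0:
        return [p_start]
    # distribute n intervals evenly between the two anchors
    step = (p_end - p_start) / n
    return [int(round(p_start + i * step)) for i in range(n + 1)]
```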
(293) Alternatively, the fundamental frequency and the peak positions could be estimated in the decoding stage, and no information has to be transmitted. However, better estimates can be expected if the estimation is performed with the original signal in the encoding stage.
(294) The decoder processing begins by obtaining a fundamental-frequency estimate X.sup.f0(n)
Z.sub.tv.sup.pha(k, n)=X.sub.ev.sup.pha(k, n)  (37)
(295) The suggested method uses the encoding stage to transmit only the estimated peak positions and the fundamental frequencies with the update rate of, e.g., 27 ms. In addition, it should be noted that errors in the vertical phase derivative are perceivable only when the fundamental frequency is relatively low. Thus, the fundamental frequency can be transmitted with a relatively low bit rate.
(296) The result of the correction algorithm with compressed correction data is shown in
(297) 9.3 Compression of the Transient Handling Data
(298) As transients can be assumed to be relatively sparse, this data could be transmitted directly. Embodiments show transmitting six values per transient: one value for the average PDF, and five values for the errors in the absolute phase angle (one value for each temporal frame inside the interval [n−2, n+2]). An alternative is to transmit the position of the transient (i.e. one value) and to estimate the target phase spectrum X.sub.et.sup.pha(k, n) as in the case of the vertical correction.
(299) If the bit rate for the transients needed to be compressed, a similar approach could be used as for the PDF correction (see Section 9.2): simply the position of the transient could be transmitted, i.e., a single value. The target phase spectrum and the target PDF could then be obtained from this location value as in Section 9.2.
(300) Alternatively, the transient position could be estimated in the decoding stage and no information has to be transmitted. However, better estimates can be expected if the estimation is performed with the original signal in the encoding stage.
(301) All of the previously described embodiments may be seen separately from the other embodiments or in a combination of embodiments. Therefore,
(302)
(303)
(304) Accordingly, the decoder 110 comprises a third target spectrum generator 65c, wherein the third target spectrum generator 65c generates a target spectrum for a third time frame of the subband of the audio signal 32 using third correction data 295c. Furthermore, the decoder 110 comprises a third phase corrector 70c for correcting a phase 45 of the subband signal in the time frame of the audio signal 32, determined with a third phase correction algorithm, wherein the correction is performed by reducing a difference between a measure of the time frame of the subband of the audio signal and the target spectrum 85c. The audio subband signal calculator 350 can calculate the audio subband signal for a third time frame different from the first and the second time frames using the phase correction of the third phase corrector.
(305) According to an embodiment, the first phase corrector 70a is configured for storing a phase corrected subband signal 91a of a previous time frame of the audio signal or for receiving a phase corrected subband signal of the previous time frame 375 of the audio signal from a second phase corrector 70b or the third phase corrector 70c. Furthermore, the first phase corrector 70a corrects the phase 45 of the audio signal 32 in a current time frame of the audio subband signal based on the stored or the received phase corrected subband signal of the previous time frame 91a, 375.
(306) Further embodiments show the first phase corrector 70a performing a horizontal phase correction, the second phase corrector 70b performing a vertical phase correction, and the third phase corrector 70c performing a phase correction for transients.
(307) From another point of view,
(308) A second demultiplexer 130 (DEMUX) first divides the received metadata 135 into activation data 365 and correction data 295a-c for the different correction modes. Based on the activation data, the computation of the target spectrum is activated for the right correction mode (the others can be idle). Using the target spectrum, the phase correction is performed on the received BWE signal using the desired correction mode. It should be noted that as the horizontal correction 70a is performed recursively (in other words, dependent on previous signal frames), it also receives the previous correction matrices from the other correction modes 70b, 70c. Finally, the corrected signal, or the unprocessed one, is set to the output based on the activation data.
(309) After having corrected the phase data, the underlying BWE synthesis further downstream is continued, in the case of the current example the SBR synthesis. Variations might exist regarding where exactly the phase correction is inserted into the BWE synthesis signal flow. Advantageously, the phase-derivative correction is done as an initial adjustment on the raw spectral patches having phases Z.sup.pha(k, n), and all additional BWE processing or adjustment steps (in SBR this can be noise addition, inverse filtering, missing sinusoids, etc.) are executed further downstream on the corrected phases Z.sub.c.sup.pha(k, n).
(310)
(311) Many other embodiments can be thought of where the signal processor blocks are switched. For example, the magnitude processor 125 and the block A may be swapped. Therefore, the block A works on the reconstructed audio signal 35, where the magnitude values of the patches have already been corrected. Alternatively, the audio subband signal calculator 350 may be located after the magnitude processor 125 in order to form the corrected audio signal from the phase corrected and the magnitude corrected part of the audio signal.
(312) Furthermore, the decoder 110 comprises a synthesizer 100 for synthesizing the phase and magnitude corrected audio signal to obtain the frequency combined processed audio signal 90. Optionally, since neither the magnitude nor the phase correction is applied on the core decoded audio signal 25, said audio signal may be transmitted directly to the synthesizer 100. Any optional processing block applied in one of the previously described decoders may be applied in this decoder 110 as well.
(313)
(314) According to embodiments, the calculator 270 comprises a set of correction data calculators 285a-c for calculating the phase correction data in accordance with a first variation mode, a second variation mode, or a third variation mode. Furthermore, the calculator 270 determines activation data 365 for activating one correction data calculator of the set of correction data calculators 285a-c. The output signal former 170 forms the output signal comprising the activation data, the parameters, the core encoded audio signal, and the phase correction data.
(315)
(316) Embodiments show the calculator 270 comprising a metadata former 390, which forms a metadata stream 295 comprising the calculated correction data 295a, 295b, or 295c and the activation data 365. The activation data 365 may be transmitted to the decoder if the correction data itself does not comprise sufficient information about the current correction mode. Sufficient information may be, for example, a number of bits used to represent the correction data, which differs between the correction data 295a, the correction data 295b, and the correction data 295c. Furthermore, the output signal former 170 may additionally use the activation data 365, such that the metadata former 390 can be omitted.
(317) From another point of view, the block diagram of
(318) The correction-mode-computation block first computes the correction mode that is applied for each temporal frame. Based on the activation data 365, the correction-data 295a-c computation is activated in the right correction mode (the others can be idle). Finally, a multiplexer (MUX) combines the activation data and the correction data from the different correction modes.
(319) A further multiplexer (not depicted) merges the phase-derivative correction data into the bit stream of the BWE and the perceptual encoder that is being enhanced by the inventive correction.
(320)
(321)
(322) The methods 5800 and 5900 as well as the previously described methods 2300, 2400, 2500, 3400, 3500, 3600 and 4200, may be implemented in a computer program to be performed on a computer.
(323) It has to be noted that the audio signal 55 is used as a general term for an audio signal, especially for the original, i.e. unprocessed, audio signal, the transmitted part of the audio signal X.sub.trans(k, n) 25, the baseband signal X.sub.base(k, n) 30, the processed audio signal 32 comprising higher frequencies when compared to the original audio signal, the reconstructed audio signal 35, the magnitude corrected frequency patch Y(k, n, i) 40, the phase 45 of the audio signal, or the magnitude 47 of the audio signal. Therefore, the different audio signals may be mutually exchanged depending on the context of the embodiment.
(324) Alternative embodiments relate to different filter bank or transform domains used for the inventive time-frequency processing, for example the short time Fourier transform (STFT), a Complex Modified Discrete Cosine Transform (CMDCT), or a Discrete Fourier Transform (DFT) domain. Therefore, specific phase properties related to the transform may be taken into consideration. In detail, if e.g. copy-up coefficients are copied from an even subband index to an odd one or vice versa, i.e. the second subband of the original audio signal is copied to the ninth subband instead of the eighth subband as described in the embodiments, the conjugate complex of the patch may be used for the processing. The same applies to a mirroring of the patches instead of using e.g. the copy-up algorithm, to overcome the reversed order of the phase angles within a patch.
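The parity-dependent conjugation described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patent's implementation: the subband layout, the patch offset, and all names are hypothetical.

```python
import numpy as np

# Hedged sketch: copy-up patching where a baseband subband whose index
# parity differs from its target index parity (even-to-odd or odd-to-even)
# uses the complex conjugate of its coefficients.
def copy_up(X_base, target_offset):
    # X_base: (n_subbands, n_frames) complex subband samples of the baseband
    n_sub = X_base.shape[0]
    X_patch = np.empty_like(X_base)
    for k in range(n_sub):
        target = k + target_offset
        if (k % 2) != (target % 2):        # parity flips: conjugate
            X_patch[k] = np.conj(X_base[k])
        else:                              # parity preserved: copy as-is
            X_patch[k] = X_base[k]
    return X_patch
```

Conjugation negates the phase angle, which compensates the parity-dependent phase behavior of modulated filter banks such as the CMDCT.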
(325) Other embodiments might renounce side information from the encoder and estimate some or all useful correction parameters on the decoder side. Further embodiments might have other underlying BWE patching schemes that for example use different baseband portions, a different number or size of patches, or different transposition techniques, for example spectral mirroring or single side band modulation (SSB). Variations might also exist regarding where exactly the phase correction is inserted into the BWE synthesis signal flow. Furthermore, the smoothing is performed using a sliding Hann window, which may be replaced for better computational efficiency by, e.g., a first-order IIR filter.
(326) The use of state of the art perceptual audio codecs often impairs the phase coherence of the spectral components of an audio signal, especially at low bit rates, where parametric coding techniques like bandwidth extension are applied. This leads to an alteration of the phase derivative of the audio signal. However, in certain signal types the preservation of the phase derivative is important. As a result, the perceptual quality of such sounds is impaired. The present invention readjusts the phase derivative either over frequency (vertical) or over time (horizontal) of such signals if a restoration of the phase derivative is perceptually beneficial. Further, a decision is made whether adjusting the vertical or horizontal phase derivative is perceptually advantageous. The transmission of only very compact side information is needed to control the phase derivative correction processing. Therefore, the invention improves sound quality of perceptual audio coders at moderate side information costs.
(327) In other words, spectral band replication (SBR) can cause errors in the phase spectrum. The human perception of these errors was studied revealing two perceptually significant effects: differences in the frequencies and the temporal positions of the harmonics. The frequency errors appear to be perceivable only when the fundamental frequency is high enough that there is only one harmonic inside an ERB band. Correspondingly, the temporal-position errors appear to be perceivable only if the fundamental frequency is low and if the phases of the harmonics are aligned over frequency.
(328) The frequency errors can be detected by computing the phase derivative over time (PDT). If the PDT values are stable over time, differences in them between the SBR-processed and the original signals should be corrected. This effectively corrects the frequencies of the harmonics, and thus, the perception of inharmonicity is avoided.
(329) The temporal-position errors can be detected by computing the phase derivative over frequency (PDF). If the PDF values are stable over frequency, differences in them between the SBR-processed and the original signals should be corrected. This effectively corrects the temporal positions of the harmonics, and thus, the perception of modulating noises at the cross-over frequencies is avoided.
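The two detection steps described in (328) and (329) both start from the phase derivatives of the time-frequency representation. A minimal sketch of computing both derivatives from a phase matrix (function and argument names are illustrative, not from the patent text):

```python
import numpy as np

def wrap(x):
    # wrap phase differences to [-pi, pi)
    return np.mod(x + np.pi, 2.0 * np.pi) - np.pi

# Hedged sketch: the phase derivative over time (PDT) is the wrapped phase
# difference between successive frames of the same subband; the phase
# derivative over frequency (PDF) is the wrapped difference between
# adjacent subbands of the same frame.
def phase_derivatives(phase):
    # phase: (n_bands, n_frames) phase spectrum of a time-frequency transform
    pdt = wrap(np.diff(phase, axis=1))     # derivative along time
    pdf = wrap(np.diff(phase, axis=0))     # derivative along frequency
    return pdt, pdf
```

Stability of the PDT over time, or of the PDF over frequency, would then trigger the respective correction mode.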
(330) Although the present invention has been described in the context of block diagrams where the blocks represent actual or logical hardware components, the present invention can also be implemented by a computer-implemented method. In the latter case, the blocks represent corresponding method steps where these steps stand for the functionalities performed by corresponding logical or physical hardware blocks.
(331) Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, like for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
(332) The inventive transmitted or encoded signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
(333) Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disc, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
(334) Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
(335) Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may, for example, be stored on a machine readable carrier.
(336) Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
(337) In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
(338) A further embodiment of the inventive method is, therefore, a data carrier (or a non-transitory storage medium such as a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
(339) A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may, for example, be configured to be transferred via a data communication connection, for example, via the internet.
(340) A further embodiment comprises a processing means, for example, a computer or a programmable logic device, configured to, or adapted to, perform one of the methods described herein.
(341) A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
(342) A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
(343) In some embodiments, a programmable logic device (for example, a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are advantageously performed by any hardware apparatus.
(344) While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.