Apparatus and method for center signal scaling and stereophonic enhancement based on a signal-to-downmix ratio
09743215 · 2017-08-22
Assignee
Inventors
- Christian Uhle (Ursensolien, DE)
- Peter Prokein (Erlangen, DE)
- Oliver Hellmuth (Erlangen, DE)
- Sebastian Scharrer (Hersbruck, DE)
- Emanuel Habets (Spardorf, DE)
Cpc classification
H04S3/00
ELECTRICITY
H04S2400/03
ELECTRICITY
H04S3/02
ELECTRICITY
H04S2400/05
ELECTRICITY
International classification
H04S7/00
ELECTRICITY
Abstract
An apparatus for generating a modified audio signal having two or more modified audio channels from an audio input signal comprising two or more audio input channels is provided. The apparatus has an information generator for generating signal-to-downmix information. The information generator is adapted to generate signal information by combining a spectral value of each of the two or more audio input channels in a first way. The information generator is adapted to generate downmix information by combining the spectral value of each of the two or more audio input channels in a second way being different from the first way. Furthermore, the information generator is adapted to combine the signal information and the downmix information to obtain signal-to-downmix information. The apparatus has a signal attenuator for attenuating the two or more audio input channels depending on the signal-to-downmix information to obtain the two or more modified audio channels.
Claims
1. An apparatus for generating a modified audio signal comprising two or more modified audio channels from an audio input signal comprising two or more audio input channels, wherein the apparatus comprises: an information generator for generating signal-to-downmix information, wherein the information generator is adapted to generate signal information by combining a spectral value of each of the two or more audio input channels in a first way, wherein the information generator is adapted to generate downmix information by combining the spectral value of each of the two or more audio input channels in a second way being different from the first way, and wherein the information generator is adapted to combine the signal information and the downmix information to acquire signal-to-downmix information, and a signal attenuator for attenuating the two or more audio input channels depending on the signal-to-downmix information to acquire the two or more modified audio channels, wherein the information generator is configured to generate the signal information Φ.sub.1(m, k) according to the formula:
Φ.sub.1(m,k)=ε{WX(m,k)(WX(m,k)).sup.H}, wherein the information generator is configured to generate the downmix information Φ.sub.2(m, k) according to the formula:
Φ.sub.2(m,k)=ε{VX(m,k)(VX(m,k)).sup.H}, and wherein the information generator is configured to generate a signal-to-downmix ratio as the signal-to-downmix information R.sub.g(m, k, β) according to the formula:
X(m,k)=[X.sub.1(m,k) . . . X.sub.N(m,k)].sup.T, wherein N indicates the number of audio input channels of the audio input signal, wherein m indicates a time index, and wherein k indicates a frequency index, wherein X.sub.1(m, k) indicates the first audio input channel, wherein X.sub.N(m, k) indicates the N-th audio input channel, wherein V indicates a matrix or a vector, wherein W indicates a matrix or a vector, wherein .sup.H indicates the conjugate transpose of a matrix or a vector, wherein ε{•} is an expectation operation, wherein β is a real number with β>0, and wherein tr{ } is the trace of a matrix.
2. The apparatus according to claim 1, wherein V is a row vector of length N whose elements are equal to one and W is the identity matrix of size N×N.
3. The apparatus according to claim 1, wherein V=[1, 1], wherein W=[1, −1] and wherein N=2.
4. The apparatus according to claim 1, wherein the number of the modified audio channels is equal to the number of the audio input channels, or wherein the number of the modified audio channels is smaller than the number of the audio input channels.
5. The apparatus according to claim 1, wherein the information generator is configured to process the spectral value of each of the two or more audio input channels to acquire two or more processed values, and wherein the information generator is configured to combine the two or more processed values to acquire the signal information, and wherein the information generator is configured to combine the spectral value of each of the two or more audio input channels to acquire a combined value, and wherein the information generator is configured to process the combined value to acquire the downmix information.
6. The apparatus according to claim 5, wherein the information generator is configured to process the combined value by determining a power spectral density of the combined value.
7. The apparatus according to claim 6, wherein the information generator is configured to use
s(m,k,β)=Σ.sub.i=1.sup.NΦ.sub.i,i(m,k).sup.β to acquire the signal information, wherein Φ.sub.i,i(m, k) indicates the auto power spectral density of the spectral value of the i-th audio signal channel.
8. The apparatus according to claim 7, wherein the information generator is configured to determine
9. The apparatus according to claim 1, wherein the information generator is configured to process the spectral value of each of the two or more audio input channels by multiplying said spectral value by the complex conjugate of said spectral value to acquire an auto power spectral density of said spectral value for each of the two or more audio input channels.
10. The apparatus according to claim 1, wherein the signal attenuator is adapted to attenuate the two or more audio input channels depending on a gain function G(m, k) according to the formula:
Y(m,k)=G(m,k)X(m,k), wherein the gain function G(m, k) depends on the signal-to-downmix information, and wherein the gain function G(m, k) is a monotonically increasing function of the signal-to-downmix information or a monotonically decreasing function of the signal-to-downmix information, wherein X(m, k) indicates the audio input signal, wherein Y(m, k) indicates the modified audio signal, wherein m indicates a time index, and wherein k indicates a frequency index.
11. The apparatus according to claim 10, wherein the gain function G(m, k) is a first function G.sub.c.sub.
G.sub.c.sub.
wherein
wherein
G.sub.s.sub.
wherein
12. A system comprising: a phase compensator for generating a phase-compensated audio signal comprising two or more phase-compensated audio channels from an unprocessed audio signal comprising two or more unprocessed audio channels, and an apparatus according to claim 1 for receiving the phase compensated audio signal as an audio input signal and for generating a modified audio signal comprising two or more modified audio channels from the audio input signal comprising the two or more phase-compensated audio channels as two or more audio input channels, wherein one of the two or more unprocessed audio channels is a reference channel, wherein the phase compensator is adapted to estimate for each unprocessed audio channel of the two or more unprocessed audio channels which is not the reference channel a phase transfer function between said unprocessed audio channel and the reference channel, and wherein the phase compensator is adapted to generate the phase-compensated audio signal by modifying each unprocessed audio channel of the unprocessed audio channels which is not the reference channel depending on the phase transfer function of said unprocessed audio channel.
13. A method for generating a modified audio signal comprising two or more modified audio channels from an audio input signal comprising two or more audio input channels, wherein the method comprises: generating signal information by combining a spectral value of each of the two or more audio input channels in a first way, generating downmix information by combining the spectral value of each of the two or more audio input channels in a second way being different from the first way, generating signal-to-downmix information by combining the signal information and the downmix information, and attenuating the two or more audio input channels depending on the signal-to-downmix information to acquire the two or more modified audio channels, wherein generating the signal information Φ.sub.1(m, k) is conducted according to the formula:
Φ.sub.1(m,k)=ε{WX(m,k)(WX(m,k)).sup.H}, wherein generating the downmix information Φ.sub.2(m, k) is conducted according to the formula:
Φ.sub.2(m,k)=ε{VX(m,k)(VX(m,k)).sup.H}, and wherein a signal-to-downmix ratio is generated as the signal-to-downmix information R.sub.g(m, k, β) according to the formula
X(m,k)=[X.sub.1(m,k) . . . X.sub.N(m,k)].sup.T, wherein N indicates the number of audio input channels of the audio input signal, wherein m indicates a time index, and wherein k indicates a frequency index, wherein X.sub.1(m, k) indicates the first audio input channel, wherein X.sub.N(m, k) indicates the N-th audio input channel, wherein V indicates a matrix or a vector, wherein W indicates a matrix or a vector, wherein .sup.H indicates the conjugate transpose of a matrix or a vector, wherein ε{•} is an expectation operation, wherein β is a real number with β>0, and wherein tr{ } is the trace of a matrix.
14. A non-transitory computer-readable storage device having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations comprising: generating signal information by combining a spectral value of each of the two or more audio input channels in a first way, generating downmix information by combining the spectral value of each of the two or more audio input channels in a second way being different from the first way, generating signal-to-downmix information by combining the signal information and the downmix information, and attenuating the two or more audio input channels depending on the signal-to-downmix information to acquire the two or more modified audio channels, wherein generating the signal information Φ.sub.1(m, k) is conducted according to the formula:
Φ.sub.1(m,k)=ε{WX(m,k)(WX(m,k)).sup.H}, wherein generating the downmix information Φ.sub.2(m, k) is conducted according to the formula:
Φ.sub.2(m,k)=ε{VX(m,k)(VX(m,k)).sup.H}, and wherein a signal-to-downmix ratio is generated as the signal-to-downmix information R.sub.g(m, k, β) according to the formula
X(m,k)=[X.sub.1(m,k) . . . X.sub.N(m,k)].sup.T, wherein N indicates the number of audio input channels of the audio input signal, wherein m indicates a time index, and wherein k indicates a frequency index, wherein X.sub.1(m, k) indicates the first audio input channel, wherein X.sub.N(m, k) indicates the N-th audio input channel, wherein V indicates a matrix or a vector, wherein W indicates a matrix or a vector, wherein .sup.H indicates the conjugate transpose of a matrix or a vector, wherein ε{•} is an expectation operation, wherein β is a real number with β>0, and wherein tr{ } is the trace of a matrix.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) In the following, embodiments of the present invention are described in more detail with reference to the figures, in which:
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
DETAILED DESCRIPTION OF THE INVENTION
(15)
(16) The apparatus comprises an information generator 110 for generating signal-to-downmix information.
(17) The information generator 110 is adapted to generate signal information by combining a spectral value of each of the two or more audio input channels in a first way. Moreover, the information generator 110 is adapted to generate downmix information by combining the spectral value of each of the two or more audio input channels in a second way being different from the first way.
(18) Furthermore, the information generator 110 is adapted to combine the signal information and the downmix information to obtain signal-to-downmix information. For example, the signal-to-downmix information may be a signal-to-downmix ratio, e.g., a signal-to-downmix value.
(19) Moreover, the apparatus comprises a signal attenuator 120 for attenuating the two or more audio input channels depending on the signal-to-downmix information to obtain the two or more modified audio channels.
(20) According to an embodiment, the information generator may be configured to combine the signal information and the downmix information so that the signal-to-downmix information indicates a ratio of the signal information to the downmix information. For example, the signal information may be a first value and the downmix information may be a second value and the signal-to-downmix information indicates a ratio of the signal value to the downmix value. For example, the signal-to-downmix information may be the first value divided by the second value. Or, for example, if the first value and the second value are logarithmic values, the signal-to-downmix information may be the difference between the first value and the second value.
(21) In the following, the underlying signal model and the concepts are described and analyzed for the case of input signal featuring amplitude difference stereophony.
(22) The rationale is to compute and apply real-valued spectral weights as a function of the diffuseness and the lateral position of direct sources. The processing as demonstrated here is applied in the STFT domain, yet it is not restricted to a particular filterbank. The N channel input signal is denoted by:
x[n]=[x.sub.1[n] . . . x.sub.N[n]].sup.T, (1)
where n denotes the discrete time index. The input signal is assumed to be an additive mixture of direct signals s.sub.i[n] and ambient sounds a.sub.i[n],
(23)
where P is the number of sound sources, d.sub.i,l[n] denotes the impulse responses of the direct paths of the i-th source into the l-th channel of length L.sub.i,l samples, and the ambient signal components are mutually uncorrelated or weakly correlated. In the following description, it is assumed that the signal model corresponds to amplitude difference stereophony, i.e., L.sub.i,l=1, ∀i, l.
(24) The time-frequency domain representation of X[n] is given by:
X(m,k)=[X.sub.1(m,k) . . . X.sub.N(m,k)].sup.T, (3)
with time index m and frequency index k. The output signals are denoted by:
Y(m,k)=[Y.sub.1(m,k) . . . Y.sub.N(m,k)].sup.T, (4)
and are obtained by means of spectral weighting
Y(m,k)=G(m,k)X(m,k), (5)
with real-valued weights G(m, k). Time domain output signals are computed by applying the inverse processing of the filterbank. For the computation of the spectral weights, the sum signal, thereafter denoted as the downmix signal, is computed as:
(25)
The matrix of PSD of the input signal, comprising estimates of the (auto-)PSD on the main diagonal, while off-diagonal elements are estimates of the cross-PSD, is given by:
Φ.sub.i,l(m,k)=ε{X.sub.i(m,k)X.sub.l*(m,k)}, i,l=1 . . . N, (7)
where X* denotes the complex conjugate of X, and ε{•} is the expectation operation with respect to the time dimension. In the presented simulations the expectation values are estimated using single-pole recursive averaging:
Φ.sub.i,l(m,k)=αX.sub.i(m,k)X.sub.l*(m,k)+(1−α)Φ.sub.i,l(m−1,k), (8)
where the filter coefficient α determines the integration time. Furthermore, the quantity R(m, k; β) is defined as:
(26)
where Φ.sub.d(m, k) is the PSD of the downmix signal and β is a parameter which will be addressed in the following. The quantity R(m, k; 1) is the signal-to-downmix ratio (SDR), i.e., the ratio of the total PSD and the PSD of the downmix signal. The power to
(27)
ensures that the range of R(m, k; β) is independent of β.
(28) The information generator 110 may be configured to determine the signal-to-downmix ratio according to Equation (9).
(29) According to Equation (9), the signal information s(m, k, β) that may be determined by the information generator 110 is defined as:
s(m,k,β)=Σ.sub.i=1.sup.NΦ.sub.i,i(m,k).sup.β.
(30) As can be seen above, Φ.sub.i,i(m, k) is defined as Φ.sub.i,i(m, k)=ε{X.sub.i(m, k)X.sub.i*(m, k)}. Thus, to determine the signal information s(m, k, β), the spectral value X.sub.i(m, k) of each of the two or more audio input channels is processed to obtain the processed value Φ.sub.i,i(m, k).sup.β for each of the two or more audio input channels, and the obtained processed values Φ.sub.i,i(m, k).sup.β are then combined, e.g., as in Equation (9) by summing up the obtained processed values Φ.sub.i,i(m, k).sup.β.
(31) Thus, the information generator 110 may be configured to process the spectral value X.sub.i(m, k) of each of the two or more audio input channels to obtain two or more processed values Φ.sub.i,i(m, k).sup.β, and the information generator 110 may be configured to combine the two or more processed values to obtain the signal information s(m, k, β). In more general, the information generator 110 is adapted to generate signal information s(m, k, β) by combining a spectral value X.sub.i(m, k) of each of the two or more audio input channels in a first way.
(32) Moreover, according to Equation (9), the downmix information d (m, k, β) that may be determined by the information generator 110 is defined as:
d(m,k,β)=Φ.sub.d(m,k).sup.β.
To form Φ.sub.d(m, k), at first X.sub.d(m, k) is formed according to the above Equation (6):
(33)
(34) As can be seen, at first, the spectral value X.sub.i(m, k) of each of the two or more audio input channels is combined to obtain a combined value X.sub.d(m, k), e.g., as in Equation (6), by summing up the spectral value X.sub.i(m, k) of each of the two or more audio input channels.
(35) Then, to obtain Φ.sub.d(m, k), the power spectral density of X.sub.d(m, k) is formed, e.g., according to:
Φ.sub.d(m,k)={X.sub.d(m,k)X.sub.d*(m,k)},
and then Φ.sub.d(m, k).sup.β may be determined. More generally speaking, the obtained combined value X.sub.d(m, k) has been processed to obtain the downmix information d(m, k, β)=Φ.sub.d(m, k).sup.β.
(36) Thus, the information generator 110 may be configured to combine the spectral value X.sub.i(m, k) of each of the two or more audio input channels to obtain a combined value, and the information generator 110 may be configured to process the combined value to obtain the downmix information d (m, k, β). In more general, the information generator 110 is adapted to generate downmix information d(m, k, β) by combining the spectral value X.sub.i(m, k) of each of the two or more audio input channels in a second way. The way, how the downmix information is generated (“second way”) differs from the way, how the signal information is generated (“first way”) and thus, the second way is different from the first way.
(37) The information generator 110 is adapted to generate signal information by combining a spectral value of each of the two or more audio input channels in a first way. Moreover, the information generator 110 is adapted to generate downmix information by combining the spectral value of each of the two or more audio input channels in a second way being different from the first way.
(38)
(39) In particular,
(40)
and
(41)
(42)
(43) Due to these properties, appropriate spectral weights for center signal scaling can be computed from the SDR by using monotonically decreasing functions for the extraction of center signals and monotonically increasing functions for the attenuation of center signals.
(44) For the extraction of a center signal, appropriate functions of R(m, k; β) are, for example:
G.sub.c.sub.
and
(45)
where a parameter for controlling the maximum attenuation is introduced.
(46) For the attenuation of the center signal, appropriate functions of R(m, k; β) are, for example,
G.sub.s.sub.
and
(47)
(48)
(49) In particular,
(50) Moreover,
(51) Furthermore,
(52) The effect of the parameter β is shown in
(53) Post-processing of spectral weights: Prior to the spectral weighting, the weights G(m, k; β, γ) can be further processed by means of smoothing operations. Zero phase low-pass filtering along the frequency axis reduces circular convolution artifacts which can occur for example when the zero-padding in the STFT computation is too short or a rectangular synthesis window is applied. Low-pass filtering along the time axis can reduce processing artifacts, especially when the time constant for the PSD estimation is rather small.
(54) In the following, generalized spectral weights are provided.
(55) More general spectral weights are obtained when rewriting Equation (9) as:
(56)
with
Φ.sub.1(m,k)=ε{WX(m,k)(WX(m,k)).sup.H} (17)
Φ.sub.2(m,k)=ε{VX(m,k)(VX(m,k)).sup.H} (18)
where superscript .sup.H denotes the conjugate transpose of a matrix or a vector, and W and V are mixing matrices or mixing (row) vectors.
(57) Here, Φ.sub.1(m, k) may be considered as signal information and Φ.sub.2(m, k) may be considered as downmix information.
(58) For example, Φ.sub.2=Φ.sub.d when V is a vector of length N whose elements are equal to one. Equation (16) is equal to (9) when V is a row vector of length N whose elements are equal to one and W is the identity matrix of size N×N.
(59) The generalized SDR R.sub.g(m, k, β, W, V) covers, for example, the ratio of the PSD of the side signal and of the PSD of the downmix signal, for W=[1, −1], V=[1, 1], and N=2:
(60)
where Φ.sub.s(m, k) is the PSD of the side signal.
(61) According to an embodiment, the information generator 110 is adapted to generate signal information Φ.sub.1(m, k) by combining a spectral value X.sub.i(m, k) of each of the two or more audio input channels in a first way. Moreover, the information generator 110 is adapted to generate downmix information Φ.sub.2(m, k) by combining the spectral value X.sub.i(m, k) of each of the two or more audio input channels in a second way being different from the first way.
(62) In the following, a more general case of mixing models featuring time-of-arrival stereophony is described.
(63) The derivation of the spectral weights described above relies on the assumption that L.sub.i,l=1, ∀i, l, i.e., the direct sound sources are time-aligned between the input channels. When the mixing of the direct source signals is not restricted to amplitude difference stereophony (L.sub.i,l>1), for example when recording with spaced microphones, the downmix of the input signal X.sub.d(m, k) is subject to phase cancellation. Phase cancellation in X.sub.d(m, k) leads to increasing SDR values and consequently to the typical comb-filtering artifacts when applying the spectral weighting as described above.
(64) The notches of the comb-filter correspond to the frequencies:
(65)
for gain functions (12) and (13) and
(66)
for gain functions (14) and (15), where f.sub.s is the sampling frequency, o are odd integers, e are even integers, and d is the delay in samples.
(67) A first approach to solve this problem is to compensate the phase differences resulting from the ICTD prior to the computation of X.sub.d(m, k). Phase difference compensation (PDC) is achieved by estimating the time-variant inter-channel phase transfer function {circumflex over (P)}.sub.i(m, k){circumflex over (P)}.sub.i(m, k)ε[−π π] between the i-th channel and a reference channel denoted by index r:
{circumflex over (P)}.sub.i(m,k)=argX.sub.r(m,k)−argX.sub.i(m,k), iε[1, . . . ,N]\r (20)
where the operator A\B denotes set-theoretic difference of set B and set A, and applying a time-variant allpass compensation filter H.sub.C,i(m, k) to the i-th channel signal:
{tilde over (X)}.sub.i(m,k)=H.sub.C,i(m,k)X.sub.i(m,k). (21)
where the phase transfer function of H.sub.C,i(m, k) is:
argH.sub.C,i(m,k)=−ε{{circumflex over (P)}.sub.i(m,k)}. (22)
(68) The expectation value is estimated using single-pole recursive averaging. It should be noted that phase jumps of 2π occurring at frequencies close to the notch frequencies need to be compensated for prior to the recursive averaging.
(69) The downmix signal is computed according to:
(70)
such that the PDC is only applied for computing X.sub.d and does not affect the phase of the output signal.
(71)
(72) The system comprises a phase compensator 210 for generating a phase-compensated audio signal comprising two or more phase-compensated audio channels from an unprocessed audio signal comprising two or more unprocessed audio channels.
(73) Furthermore, the system comprises an apparatus 220 according to one of the above-described embodiments for receiving the phase compensated audio signal as an audio input signal and for generating a modified audio signal comprising two or more modified audio channels from the audio input signal comprising the two or more phase-compensated audio channels as two or more audio input channels.
(74) One of the two or more unprocessed audio channels is a reference channel. The phase compensator 210 is adapted to estimate for each unprocessed audio channel of the two or more unprocessed audio channels which is not the reference channel a phase transfer function between said unprocessed audio channel and the reference channel. Moreover, the phase compensator 210 is adapted to generate the phase-compensated audio signal by modifying each unprocessed audio channel of the unprocessed audio channels which is not the reference channel depending on the phase transfer function of said unprocessed audio channel.
(75) In the following, intuitive explanations of the control parameters are provided, e.g., a semantic meaning of control parameters.
(76) For the operation of digital audio effects it is advantageous to provide controls with semantically meaningful parameters. The gain functions (12)-(15) are controlled by the parameters α, β and γ. Sound engineers and audio engineers are used to time constants, and specifying α as time constant is intuitive and according to common practice. The effect of the integration time can be experienced best by experimentation. In order to support the operation of the provided concepts, descriptors for the remaining parameters are proposed, namely impact for γ and diffuseness for β.
(77) The parameter impact can be best compared with the order of a filter. By analogy to the roll-off in filtering, the maximum attenuation equals γ 6 dB, for N=2.
(78) The label diffuseness is proposed here to emphasize the fact that then attenuating panned and diffuse sounds, larger values of β result in more leakage of diffuse sounds. A nonlinear mapping of the user parameter β.sub.u, e.g., β=√{square root over (β.sub.u+1)}, with 0≦β.sub.u≦10, is advantageous in a way that it enables a more consistent behavior of the processing as opposed to when modifying β directly (where consistency relates to the effect of a change of the parameter on the result throughout the range of the parameter value).
(79) In the following, computational complexity and memory requirements are briefly discussed.
(80) The computational complexity and memory requirements scale with the number of bands of the filterbank and depend on the implementation of additional post-processing of the spectral weights. A low-cost implementation of the method can be achieved when setting β=1, γε, computing spectral weights according to Equation (12) or (14), and when not applying the PDC filter. The computation of the SDR uses only one cost intensive nonlinear functions per sub-band when βε
. For β=1, only two buffers for the PSD estimation are necessitated, whereas methods making explicit use of the ICC, e.g., as described in C. Avendano and J.-M. Jot, “A frequency-domain approach to multi-channel upmix,” J. Audio Eng. Soc., vol. 52, 2004; D. Jang, J. Hong, H. Jung, and K. Kang, “Center channel separation based on spatial analysis,” in Proc. Int. Conf. Digital Audio Effects (DAFx), 2008; U.S. Pat. No. 7,630,500 B1, issued to P. E. Beckmann, 2009; U.S. Pat. No. 7,894,611 B2, issued to P. E. Beckmann, 2011; and J. Merimaa, M. Goodwin, and J.-M. Jot, “Correlation-based ambience extraction from stereo recordings,” in Proc. Audio Eng. Soc. 123rd Cony., 2007, necessitate at least three buffers.
(81) In the following, the performance of the presented concepts by means of examples is discussed.
(82) First, the processing is applied to an amplitude-panned mixture of 5 instrument recordings (drums, bass, keys, 2 guitars) sampled at 44100 Hz of which an excerpt of 3 seconds length is visualized. Drums, bass, and keys are panned to the center, one guitar is panned to the left channel and the second guitar is panned to the right channel, both with |ICLD|=20 dB. A convolution reverb having stereo impulse responses with an RT60 of about 1.4 seconds per input channel is used to generate ambient signal components. The reverberated signal is added with a direct-to-ambient ratio of about 8 dB after K-weighting as described in International Telecommunication Union, Radiocommunication Assembly, “Algorithms to measure audio programme loudness and true-peak audio level,” Recommendation ITUR BS.1770-2, March 2011, Geneva, Switzerland.
(83)
(84) In particular,
(85)
(86) The time constant for the recursive averaging in the PSD estimation here and in the following is set to 200 ms.
(87)
(88)
(89)
(90) Informal listening over headphones reveals that the attenuation of the signal components is effective. When listening to the extracted center signal, processing artifacts become audible as slight modulations during the notes of guitar 2, similar to pumping in dynamic range compression. It can be noted that the reverberation is reduced and that the attenuation is more effective for low frequencies than for high frequencies. Whether this is caused by the larger direct-to-ambient ratio in the lower frequencies, the frequency content of the sound sources or subjective perception due to unmasking phenomena cannot be answered without a more detailed analysis.
(91) When listening to the output signal where the center is attenuated, the overall sound quality is slightly better when compared to the center extraction result. Processing artifacts are audible as slight movements of the panned sources towards the center when dominant centered sources are active, equivalently to the pumping when extracting the center. The output signal sounds less direct as the result of the increased amount of ambience in the output signal.
(92) To illustrate the PDC filtering,
(93) The two-channel mixture signal is generated by mixing the speech source signals with equal gains to each channel and by adding white noise with an SNR of 10 dB (K-weighted) to the signal.
(94)
(95) The spectral weights in the upper plot are close to 0 dB when speech is active and assume the minimum value in time-frequency regions with low SNR. The second plot shows the spectral weights for an input signal where the first speech signal (
(96) Informal listening shows that the additive noise is largely attenuated. When processing signals without ICTD, the output signals have a bit of an ambient sound characteristic which results presumably from the phase incoherence introduced by the additive noise. When processing signals with ICTD, the first speech signal (
(97) Concepts for scaling the center signal in audio recordings by applying real-valued spectral weights which are computed from monotonic functions of the SDR have been provided. The rationale is that center signal scaling needs to take into account both, the lateral displacement of direct sources and the amount of diffuseness, and that these characteristics are implicitly captured by the SDR. The processing can be controlled by semantically meaningful user parameters and is in comparison to other frequency domain techniques of low computational complexity and memory load. The proposed concepts give good results when processing input signals featuring amplitude difference stereophony, but can be subject to comb-filtering artifacts when the direct sound sources are not time-aligned between the input channels. A first approach to solve this is to compensate for non-zero phase in the inter-channel transfer function.
(98) So far, the concepts of embodiments have been tested by means of informal listening. For typical commercial recordings, the results are of good sound quality but also depend on the desired separation strength.
(99) Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
(100) The inventive decomposed signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
(101) Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed.
(102) Some embodiments according to the invention comprise a non-transitory data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
(103) Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
(104) Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
(105) In many embodiments, parts of the systems and apparatuses are provided in devices including microprocessors. Various embodiments of systems, apparatuses, and methods described herein may be implemented fully or partially in software and/or firmware. This software and/or firmware may take the form of instructions contained in or on a non-transitory computer-readable storage medium. Those instructions then may be read and executed by one or more processors to enable performance of the operations described herein. The instructions may be in any suitable form such as, but not limited to, source code, compiled code, interpreted code, executable code, static code, dynamic code, and the like. Such a computer-readable medium may include any tangible non-transitory medium for storing information in a form readable by one or more computers such as, but not limited to, read only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; a flash memory, etc.
(106) In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
(107) A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.
(108) A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
(109) A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
(110) A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
(111) In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods may be performed by any hardware apparatus.
(112) While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which will be apparent to others skilled in the art and which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is, therefore, intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.