DIRECTIONAL LOUDNESS MAP BASED AUDIO PROCESSING
20210383820 · 2021-12-09
CPC classification
G10L19/02
PHYSICS
H04R1/26
ELECTRICITY
G10L25/18
PHYSICS
G10L19/008
PHYSICS
G10L19/173
PHYSICS
International classification
G10L25/18
PHYSICS
G10L19/02
PHYSICS
H04R1/26
ELECTRICITY
Abstract
An audio analyzer configured to obtain spectral domain representations of two or more input audio signals. Additionally, the audio analyzer is configured to obtain directional information associated with spectral bands of the spectral domain representations and to obtain loudness information associated with different directions as an analysis result. Contributions to the loudness information are determined in dependence on the directional information.
Claims
1.-94. (canceled)
95. An audio analyzer, wherein the audio analyzer is configured to acquire spectral domain representations of two or more input audio signals; wherein the audio analyzer is configured to acquire directional information associated with spectral bands of the spectral domain representations; wherein the audio analyzer is configured to acquire loudness information associated with different directions as an analysis result, wherein contributions to the loudness information are determined in dependence on the directional information.
96. Audio analyzer according to claim 95, wherein the audio analyzer is configured to acquire a plurality of weighted spectral domain representations on the basis of the spectral domain representations of the two or more input audio signals; wherein values of the one or more spectral domain representations are weighted in dependence on the different directions of the audio components in the two or more input audio signals to acquire the plurality of weighted spectral domain representations; wherein the audio analyzer is configured to acquire loudness information associated with the different directions on the basis of the weighted spectral domain representations as the analysis result.
97. Audio analyzer according to claim 95, wherein the audio analyzer is configured to decompose the two or more input audio signals into a short-time Fourier transform domain to acquire two or more transformed audio signals.
98. Audio analyzer according to claim 97, wherein the audio analyzer is configured to group spectral bins of the two or more transformed audio signals to spectral bands of the two or more transformed audio signals; and wherein the audio analyzer is configured to weight the spectral bands using different weights, based on an outer-ear and middle-ear model, to acquire the one or more spectral domain representations of the two or more input audio signals.
99. Audio analyzer according to claim 95, wherein the audio analyzer is configured to determine a direction-dependent weighting per spectral bin and for a plurality of predetermined directions.
100. Audio analyzer according to claim 95, wherein the audio analyzer is configured to determine a direction-dependent weighting using a Gaussian function, such that the direction-dependent weighting decreases with increasing deviation between respective extracted direction values and respective predetermined direction values.
101. Audio analyzer according to claim 100, wherein the audio analyzer is configured to determine panning index values as the extracted direction values; and/or wherein the audio analyzer is configured to determine the extracted direction values in dependence on spectral domain values of the input audio signals.
102. Audio analyzer according to claim 99, wherein the audio analyzer is configured to acquire the direction-dependent weighting Θ.sub.Ψ.sub.0,j(m, k) associated with a predetermined direction, a time designated with a time index m, and a spectral bin designated by a spectral bin index k according to
Θ.sub.Ψ.sub.0,j(m, k)=exp(−(Ψ(m, k)−Ψ.sub.0,j).sup.2/ξ),
wherein ξ is a predetermined value, Ψ(m, k) designates the extracted direction values, and Ψ.sub.0,j is a direction value which designates the predetermined direction.
103. Audio analyzer according to claim 95, wherein the audio analyzer is configured to acquire the weighted spectral domain representations Y.sub.i,b,Ψ.sub.0,j(m, k) according to
Y.sub.i,b,Ψ.sub.0,j(m, k)=X.sub.i,b(m, k)·Θ.sub.Ψ.sub.0,j(m, k),
wherein X.sub.i,b(m, k) designates the spectral domain representations of the input audio signals and Θ.sub.Ψ.sub.0,j(m, k) designates a direction-dependent weighting associated with a predetermined direction Ψ.sub.0,j.
104. Audio analyzer according to claim 95, wherein the audio analyzer is configured to determine an average over a plurality of band loudness values, in order to acquire a combined loudness value; and/or wherein the audio analyzer is configured to acquire band loudness values for a plurality of spectral bands on the basis of a weighted combined spectral domain representation representing a plurality of input audio signals; and wherein the audio analyzer is configured to acquire, as the analysis result, a plurality of combined loudness values on the basis of the acquired band loudness values for a plurality of different directions.
105. Audio analyzer according to claim 104, wherein the audio analyzer is configured to compute a mean of squared spectral values of the weighted combined spectral domain representation over spectral values of a frequency band, and to apply an exponentiation comprising an exponent between 0 and ½ to the mean of squared spectral values, in order to determine the band loudness values; and/or wherein the audio analyzer is configured to acquire the band loudness values L.sub.b,Ψ.sub.0,j(m) according to
L.sub.b,Ψ.sub.0,j(m)=((1/K.sub.b)Σ.sub.k∈b Y.sub.b,Ψ.sub.0,j(m, k).sup.2).sup.1/4,
wherein K.sub.b designates a number of spectral bins in a spectral band b and Y.sub.b,Ψ.sub.0,j(m, k) designates the weighted combined spectral domain representation.
106. Audio analyzer according to claim 95, wherein the audio analyzer is configured to acquire a plurality of combined loudness values L(m, Ψ.sub.0,j) associated with a direction designated with index Ψ.sub.0,j and a time designated with a time index m according to
L(m, Ψ.sub.0,j)=(1/B)Σ.sub.b L.sub.b,Ψ.sub.0,j(m),
wherein B designates a number of spectral bands b and L.sub.b,Ψ.sub.0,j(m) designates band loudness values.
107. The audio analyzer according to claim 95, wherein the audio analyzer is configured to allocate loudness contributions to histogram bins associated with different directions in dependence on the directional information, in order to acquire the analysis result; and/or wherein the audio analyzer is configured to acquire loudness information associated with spectral bins on the basis of the spectral domain representations, and wherein the audio analyzer is configured to add a loudness contribution to one or more histogram bins on the basis of a loudness information associated with a given spectral bin; wherein a selection, to which one or more histogram bins the loudness contribution is made, is based on a determination of the directional information for a given spectral bin; and/or wherein the audio analyzer is configured to add loudness contributions to a plurality of histogram bins on the basis of a loudness information associated with a given spectral bin, such that a largest contribution is added to a histogram bin associated with a direction that corresponds to the directional information associated with the given spectral bin, and such that reduced contributions are added to one or more histogram bins associated with further directions.
108. The audio analyzer according to claim 95, wherein the audio analyzer is configured to acquire directional information on the basis of an analysis of an amplitude panning of audio content; and/or wherein the audio analyzer is configured to acquire directional information on the basis of an analysis of a phase relationship and/or a time delay and/or correlation between audio contents of two or more input audio signals; and/or wherein the audio analyzer is configured to acquire directional information on the basis of an identification of widened sources, and/or wherein the audio analyzer is configured to acquire directional information using a matching of spectral information of an incoming sound and templates associated with head related transfer functions in different directions.
109. An audio similarity evaluator, wherein the audio similarity evaluator is configured to acquire a first loudness information associated with different directions on the basis of a first set of two or more input audio signals, and wherein the audio similarity evaluator is configured to compare the first loudness information with a second loudness information associated with the different panning directions and with a set of two or more reference audio signals, in order to acquire a similarity information describing a similarity between the first set of two or more input audio signals and the set of two or more reference audio signals.
110. An audio similarity evaluator according to claim 109, wherein the audio similarity evaluator is configured to acquire the first loudness information such that the first loudness information comprises a plurality of combined loudness values associated with the first set of two or more input audio signals and associated with respective predetermined directions, wherein the combined loudness values of the first loudness information describe loudness of signal components of the first set of two or more input audio signals associated with the respective predetermined directions; and/or wherein the audio similarity evaluator is configured to acquire the first loudness information such that the first loudness information is associated with combinations of a plurality of weighted spectral domain representations of the first set of two or more input audio signals associated with respective predetermined directions.
111. An audio similarity evaluator according to claim 109, wherein the audio similarity evaluator is configured to determine a difference between the second loudness information and the first loudness information to acquire a residual loudness information; and wherein the audio similarity evaluator is configured to determine a value that quantifies the difference over a plurality of directions.
112. An audio similarity evaluator according to claim 109, wherein the audio similarity evaluator is configured to acquire the first loudness information and/or the second loudness information using an audio analyzer according to claim 95.
113. An audio encoder for encoding an input audio content comprising one or more input audio signals, wherein the audio encoder is configured to provide one or more encoded audio signals on the basis of one or more input audio signals, or one or more signals derived therefrom; wherein the audio encoder is configured to adapt encoding parameters in dependence on one or more directional loudness maps which represent loudness information associated with a plurality of different directions of the one or more signals to be encoded.
114. Audio encoder according to claim 113, wherein the audio encoder is configured to adapt a bit distribution between the one or more signals and/or parameters to be encoded in dependence on contributions of individual directional loudness maps of the one or more signals and/or parameters to be encoded to an overall directional loudness map; and/or wherein the audio encoder is configured to disable encoding of a given one of the signals to be encoded, when contributions of an individual directional loudness map of the given one of the signals to be encoded to an overall directional loudness map is below a threshold; and/or wherein the audio encoder is configured to adapt a quantization precision of the one or more signals to be encoded in dependence on contributions of individual directional loudness maps of the one or more signals to be encoded to an overall directional loudness map.
115. Audio encoder according to claim 113, wherein the audio encoder is configured to quantize spectral domain representations of the one or more input audio signals, or of the one or more signals derived therefrom using one or more quantization parameters, to acquire one or more quantized spectral domain representations; wherein the audio encoder is configured to adjust the one or more quantization parameters in dependence on one or more directional loudness maps which represent loudness information associated with a plurality of different directions of the one or more signals to be quantized, to adapt the provision of the one or more encoded audio signals; and wherein the audio encoder is configured to encode the one or more quantized spectral domain representations, in order to acquire the one or more encoded audio signals.
116. The audio encoder according to claim 115, wherein the audio encoder is configured to adjust the one or more quantization parameters in dependence on contributions of individual directional loudness maps of the one or more signals to be quantized to an overall directional loudness map; and/or wherein the audio encoder is configured to determine an overall directional loudness map on the basis of the input audio signals, such that the overall directional loudness map represents loudness information associated with the different directions of an audio scene represented by the input audio signals; and/or wherein the one or more signals to be quantized are associated with different directions or are associated with different loudspeakers or are associated with different audio objects; and/or wherein the signals to be quantized comprise components of a joint multi-signal coding of two or more input audio signals; and/or wherein the audio encoder is configured to estimate a contribution of a residual signal of the joint multi-signal coding to the overall directional loudness map, and to adjust the one or more quantization parameters in dependence thereon.
117. The audio encoder according to claim 113, wherein the audio encoder is configured to adapt a bit distribution between the one or more signals and/or parameters to be encoded in dependence on an evaluation of a spatial masking between two or more signals to be encoded, wherein the audio encoder is configured to evaluate the spatial masking on the basis of the directional loudness maps associated with the two or more signals to be encoded.
118. The audio encoder according to claim 113, wherein the audio encoder comprises an audio analyzer according to claim 95, wherein the loudness information associated with different directions forms the directional loudness map.
119. The audio encoder according to claim 113, wherein the audio encoder is configured to adapt a noise introduced by the encoder in dependence on the one or more directional loudness maps; and wherein the audio encoder is configured to use a deviation between a directional loudness map, which is associated with a given un-encoded input audio signal, and a directional loudness map achievable by an encoded version of the given input audio signal, as a criterion for the adaptation of the provision of the given encoded audio signal.
120. The audio encoder according to claim 113, wherein the audio encoder is configured to activate and deactivate a joint coding tool in dependence on one or more directional loudness maps which represent loudness information associated with a plurality of different directions of the one or more signals to be encoded; and/or wherein the audio encoder is configured to determine one or more parameters of a joint coding tool in dependence on one or more directional loudness maps which represent loudness information associated with a plurality of different directions of the one or more signals to be encoded.
121. An audio encoder for encoding an input audio content comprising one or more input audio signals, wherein the audio encoder is configured to provide one or more encoded audio signals on the basis of two or more input audio signals, or on the basis of two or more signals derived therefrom, using a joint encoding of two or more signals to be encoded jointly; wherein the audio encoder is configured to select signals to be encoded jointly out of a plurality of candidate signals or out of a plurality of pairs of candidate signals in dependence on directional loudness maps which represent loudness information associated with a plurality of different directions of the candidate signals or of the pairs of candidate signals.
122. The audio encoder according to claim 121, wherein the audio encoder is configured to select signals to be encoded jointly out of a plurality of candidate signals or out of a plurality of pairs of candidate signals in dependence on contributions of individual directional loudness maps of the candidate signals to an overall directional loudness map or in dependence on contributions of directional loudness maps of the pairs of candidate signals to an overall directional loudness map; and/or wherein the audio encoder is configured to determine a contribution of pairs of candidate signals to the overall directional loudness map; and wherein the audio encoder is configured to choose one or more pairs of candidate signals comprising a highest contribution to the overall directional loudness map for a joint encoding, or wherein the audio encoder is configured to choose one or more pairs of candidate signals comprising a contribution to the overall directional loudness map which is larger than a predetermined threshold for a joint encoding; and/or wherein the audio encoder is configured to determine individual directional loudness maps of two or more candidate signals, and wherein the audio encoder is configured to compare the individual directional loudness maps of the two or more candidate signals, and wherein the audio encoder is configured to select two or more of the candidate signals for a joint encoding in dependence on a result of the comparison; and/or wherein the audio encoder is configured to determine an overall directional loudness map using a downmixing of the input audio signals or using a binauralization of the input audio signals.
123. An audio encoder for encoding an input audio content comprising one or more input audio signals, wherein the audio encoder is configured to provide one or more encoded audio signals on the basis of two or more input audio signals, or on the basis of two or more signals derived therefrom; wherein the audio encoder is configured to determine an overall directional loudness map on the basis of the input audio signals, and/or to determine one or more individual directional loudness maps associated with individual input audio signals; and wherein the audio encoder is configured to encode the overall directional loudness map and/or one or more individual directional loudness maps as a side information.
124. The audio encoder according to claim 123, wherein the audio encoder is configured to determine the overall directional loudness map on the basis of the input audio signals such that the overall directional loudness map represents loudness information associated with the different directions of an audio scene represented by the input audio signals; and/or wherein the audio encoder is configured to encode the overall directional loudness map in the form of a set of values associated with different directions; or wherein the audio encoder is configured to encode the overall directional loudness map using a center position value and a slope information; or wherein the audio encoder is configured to encode the overall directional loudness map in the form of a polynomial representation; or wherein the audio encoder is configured to encode the overall directional loudness map in the form of a spline representation; and/or wherein the audio encoder is configured to encode one downmix signal acquired on the basis of a plurality of input audio signals and an overall directional loudness map; or wherein the audio encoder is configured to encode a plurality of signals, and to encode individual directional loudness maps of a plurality of signals which are encoded; or wherein the audio encoder is configured to encode an overall directional loudness map, a plurality of signals and parameters describing contributions of the signals which are encoded to the overall directional loudness map.
125. An audio decoder for decoding an encoded audio content, wherein the audio decoder is configured to receive an encoded representation of one or more audio signals and to provide a decoded representation of the one or more audio signals; wherein the audio decoder is configured to receive an encoded directional loudness map information and to decode the encoded directional loudness map information, to acquire one or more directional loudness maps; and wherein the audio decoder is configured to reconstruct an audio scene using the decoded representation of the one or more audio signals and using the one or more directional loudness maps.
126. The audio decoder according to claim 125, wherein the audio decoder is configured to acquire output signals such that one or more directional loudness maps associated with the output signals approximate or equal one or more target directional loudness maps, wherein the one or more target directional loudness maps are based on the one or more decoded directional loudness maps or are equal to the one or more decoded directional loudness maps.
127. The audio decoder according to claim 125, wherein the audio decoder is configured to receive one encoded downmix signal and an overall directional loudness map; or a plurality of encoded audio signals, and individual directional loudness maps of the plurality of encoded signals; or an overall directional loudness map, a plurality of encoded audio signals and parameters describing contributions of the encoded audio signals to the overall directional loudness map; and wherein the audio decoder is configured to provide the output signals on the basis thereof.
128. A format converter for converting a format of an audio content, which represents an audio scene, from a first format to a second format, wherein the format converter is configured to provide a representation of the audio content in the second format on the basis of the representation of the audio content in the first format; wherein the format converter is configured to adjust a complexity of the format conversion in dependence on contributions of input audio signals of the first format to an overall directional loudness map of the audio scene.
129. The format converter according to claim 128, wherein the format converter is configured to compute or estimate a contribution of a given input audio signal to the overall directional loudness map of the audio scene; and wherein the format converter is configured to decide whether to consider the given input audio signal in the format conversion in dependence on a computation or estimation of the contribution.
130. An audio decoder for decoding an encoded audio content, wherein the audio decoder is configured to receive an encoded representation of one or more audio signals and to provide a decoded representation of the one or more audio signals; wherein the audio decoder is configured to reconstruct an audio scene using the decoded representation of the one or more audio signals; wherein the audio decoder is configured to adjust a decoding complexity in dependence on contributions of encoded signals to an overall directional loudness map of a decoded audio scene.
131. The audio decoder according to claim 130, wherein the audio decoder is configured to receive an encoded directional loudness map information and to decode the encoded directional loudness map information, to acquire the overall directional loudness map and/or one or more directional loudness maps.
132. The audio decoder according to claim 131, wherein the audio decoder is configured to derive the overall directional loudness map from the one or more directional loudness maps.
133. The audio decoder according to claim 130, wherein the audio decoder is configured to compute or estimate a contribution of a given encoded signal to the overall directional loudness map of the decoded audio scene; and wherein the audio decoder is configured to decide whether to decode the given encoded signal in dependence on a computation or estimation of the contribution.
134. A renderer for rendering an audio content, wherein the renderer is configured to reconstruct an audio scene on the basis of one or more input audio signals; wherein the renderer is configured to adjust a rendering complexity in dependence on contributions of the input audio signals to an overall directional loudness map of a rendered audio scene.
135. The renderer according to claim 134, wherein the renderer is configured to compute or estimate a contribution of a given input audio signal to the overall directional loudness map of the audio scene; and wherein the renderer is configured to decide whether to consider the given input audio signal in the rendering in dependence on a computation or estimation of the contribution.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0129] Embodiments of the present invention will be detailed subsequently referring to the appended drawings.
DETAILED DESCRIPTION OF THE INVENTION
[0164] Equal or equivalent elements are elements with equal or equivalent functionality. They are denoted in the following description by equal or equivalent reference numerals even if occurring in different figures.
[0165] In the following description, a plurality of details is set forth to provide a more thorough explanation of embodiments of the present invention. However, it will be apparent to those skilled in the art that embodiments of the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form rather than in detail in order to avoid obscuring embodiments of the present invention. In addition, features of the different embodiments described hereinafter may be combined with each other, unless specifically noted otherwise.
[0167] According to an embodiment, the spectral-domain representations 110.sub.1, 110.sub.2 are fed into a directional information determination 120 to obtain directional information 122, e.g., Ψ(m, k), associated with spectral bands (e.g., spectral bins k in a time frame m) of the spectral-domain representations 110.sub.1, 110.sub.2. The directional information 122 represents, for example, different directions of audio components contained in the two or more input audio signals. Thus, the directional information 122 can be associated with a direction from which a listener will hear a component contained in the two input audio signals. According to an embodiment, the directional information can represent panning indices. Thus, for example, the directional information 122 comprises a first direction indicating a singer in a listening room and further directions corresponding to different music instruments of a band in an audio scene. The directional information 122 is, for example, determined by the audio analyzer 100 by analyzing level ratios between the spectral-domain representations 110.sub.1, 110.sub.2 for all frequency bins or frequency groups (e.g., for all spectral bins k or spectral bands b). Examples for the directional information determination 120 are described below.
[0168] According to an embodiment the audio analyzer 100 is configured to obtain the directional information 122 on the basis of an analysis of an amplitude panning of audio content; and/or on the basis of an analysis of a phase relationship and/or a time delay and/or correlation between audio contents of two or more input audio signals; and/or on the basis of an identification of widened (e.g. decorrelated and/or panned) sources. The audio content can comprise the input audio signals and/or the spectral-domain representations 110 of the input audio signals.
[0169] Based on the directional information 122 and the spectral-domain representations 110.sub.1, 110.sub.2, the audio analyzer 100 is configured to determine contributions 132 (e.g., Y.sub.L,b,Ψ.sub.0,j and Y.sub.R,b,Ψ.sub.0,j) of the spectral-domain representations 110.sub.1, 110.sub.2 to loudness information associated with the different directions.
[0170] According to an embodiment, the extracted direction values Ψ(m, k) are determined in dependence on spectral domain values (e.g., X.sub.L,b(m.sub.0, k.sub.0) as X.sub.1(m, k) and X.sub.R,b(m.sub.0, k.sub.0) as X.sub.2(m, k) in the notation of [13]) of the input audio signals.
[0171] To obtain, as an analysis result, the loudness information 142 (e.g., L(m, Ψ.sub.0,j) for a plurality of different evaluated direction ranges Ψ.sub.0,j (j∈[1; J] for J predetermined directions)) associated with the different directions Ψ.sub.0,j (e.g., predetermined directions), the audio analyzer 100 is configured to combine the contributions 132.sub.1 (e.g., Y.sub.L,b,Ψ.sub.0,j) and 132.sub.2 (e.g., Y.sub.R,b,Ψ.sub.0,j) of the spectral-domain representations 110.sub.1, 110.sub.2.
[0173] According to an embodiment, the first input audio signal 112.sub.1 and/or the second input audio signal 112.sub.2 can represent a time-domain signal which can be converted by a time-domain to spectral-domain conversion 114 to obtain a spectral-domain representation 110 of the respective input audio signal. In other words, the time-domain to spectral-domain conversion 114 can decompose the two or more input audio signals 112.sub.1, 112.sub.2 (e.g., x.sub.L, x.sub.R, x.sub.i) into a short-time Fourier transform (STFT) domain to obtain two or more transformed audio signals 115.sub.1, 115.sub.2 (e.g., X′.sub.L, X′.sub.R, X′.sub.i). If the first input audio signal 112.sub.1 and/or the second input audio signal 112.sub.2 represent a spectral-domain representation 110, the time-domain to spectral-domain conversion 114 can be skipped.
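For illustration only, the time-domain to spectral-domain conversion 114 can be sketched as follows. The Hann window, frame length and hop size are assumed values; the embodiment merely requires some short-time Fourier transform.

```python
import numpy as np

def stft(x, win_len=1024, hop=512):
    # Hann-windowed short-time Fourier transform; yields one spectral-domain
    # representation X(m, k) per time frame m and spectral bin k.
    win = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[m * hop : m * hop + win_len] * win
                       for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

# Two input audio signals, e.g., a left and a right channel (1 s at 48 kHz).
t = np.arange(48000) / 48000.0
x_left = np.sin(2 * np.pi * 440.0 * t)
x_right = 0.5 * np.sin(2 * np.pi * 440.0 * t)
X_left, X_right = stft(x_left), stft(x_right)
```

With 48000 samples, a 1024-sample window and a 512-sample hop, 92 frames of 513 spectral bins each result.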
[0174] Optionally, the input audio signals 112 or the transformed audio signals 115 are processed by an ear model processing 116 to obtain the spectral-domain representations 110 of the respective input audio signals 112.sub.1 and 112.sub.2. Spectral bins of the signal to be processed, e.g., 112 or 115, are grouped into spectral bands, e.g., based on a model for a perception of spectral bands by a human ear, and then the spectral bands can be weighted based on an outer-ear and/or middle-ear model. Thus, with the ear model processing 116, an optimized spectral-domain representation 110 of the input audio signals 112 can be determined.
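The grouping and weighting of the ear model processing 116 might be sketched as below. The band edges and per-band weights are placeholders for illustration and do not reproduce any particular outer-ear/middle-ear model.

```python
import numpy as np

def ear_model(X, band_edges, band_weights):
    # Group spectral bins into bands and weight each band b; the edges and
    # weights passed in are placeholders, not an actual ear model.
    bands = []
    for b in range(len(band_edges) - 1):
        lo, hi = band_edges[b], band_edges[b + 1]
        bands.append(band_weights[b] * X[:, lo:hi])  # keep bin resolution
    return bands

X = np.ones((2, 8))                    # 2 frames, 8 spectral bins
bands = ear_model(X, band_edges=[0, 4, 8], band_weights=[2.0, 0.5])
```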
[0175] According to an embodiment, the spectral-domain representation 110.sub.1 of the first input audio signal 112.sub.1, e.g., X.sub.L,b(m, k), is associated with level information of the first input audio signal 112.sub.1 (e.g., indicated by the index L) and different spectral bands (e.g., indicated by the index b). Per spectral band b the spectral-domain representation 110.sub.1 represents, for example, a level information for time frames m and for all spectral bins k of the respective spectral band b.
[0176] According to an embodiment, the spectral-domain representation 110.sub.2 of the second input audio signal 112.sub.2, e.g., X.sub.R,b(m, k), is associated with level information of the second input audio signal 112.sub.2 (e.g., indicated by the index R) and different spectral bands (e.g., indicated by the index b). Per spectral band b the spectral-domain representation 110.sub.2 represents, for example, a level information for time frames m and for all spectral bins k of the respective spectral band b.
[0177] Based on the spectral-domain representation 110.sub.1 of the first input audio signal 112.sub.1 and the spectral-domain representation 110.sub.2 of the second input audio signal 112.sub.2, a directional information determination 120 can be performed by the audio analyzer 100. With a direction analysis 124, a panning direction information 125, e.g., Ψ(m, k), can be determined. The panning direction information 125 represents, for example, panning indices corresponding to signal components (e.g., signal components of the first input audio signal 112.sub.1 and the second input audio signal 112.sub.2 panned to a certain direction). According to an embodiment, the input audio signals 112 are associated with different directions indicated, for example, by the index L for left and by the index R for right. A panning index defines, for example, a direction between two or more input audio signals 112 or a direction of an individual input audio signal 112. Thus, for example, in the case of a two-channel signal, the panning indices describe directions between the left and the right input audio signal.
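A level-ratio based extraction of the panning direction information 125 could look like the following sketch. The particular mapping (0.0 = hard left, 0.5 = center, 1.0 = hard right) is an assumption for illustration, not necessarily the panning-index definition of [13].

```python
import numpy as np

def panning_index(X_l, X_r, eps=1e-12):
    # Direction value per time-frequency bin from the level ratio of the
    # left and right channel magnitudes; eps avoids division by zero.
    a_l, a_r = np.abs(X_l), np.abs(X_r)
    return a_r / (a_l + a_r + eps)

center = panning_index(np.array([1.0]), np.array([1.0]))  # equal levels
left = panning_index(np.array([1.0]), np.array([0.0]))    # hard-left bin
```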
[0178] According to an embodiment, based on the panning direction information 125, the audio analyzer 100 is configured to perform a scaling factor determination 126 to determine a direction-dependent weighting 127, e.g., Θ.sub.Ψ.sub.0,j(m, k), for a plurality of predetermined directions Ψ.sub.0,j.
[0179] According to an embodiment, the direction-dependent weighting 127 uses a Gaussian function, such that the direction-dependent weighting decreases with an increasing deviation between respective extracted direction values Ψ(m, k) and the respective predetermined direction values Ψ.sub.0,j.
[0180] According to an embodiment, the audio analyzer 100 is configured to obtain the direction-dependent weighting 127 Θ.sub.Ψ.sub.0,j(m, k) according to

Θ.sub.Ψ.sub.0,j(m, k)=exp(−(Ψ(m, k)−Ψ.sub.0,j).sup.2/ξ);
wherein ξ is a predetermined value (which controls, for example, a width of a Gaussian window); wherein Ψ(m, k) designates the extracted direction values associated with a time (or time frame) designated with a time index m, and a spectral bin designated by a spectral bin index k; and wherein Ψ.sub.0,j is a (e.g., predetermined) direction value which designates (or is associated with) a predetermined direction (e.g. having direction index j).
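A minimal sketch of such a Gaussian direction-dependent weighting, assuming that ξ enters the exponent as a direct divisor (the exact parameterization of the window is an assumption):

```python
import math

def direction_weight(psi: float, psi_0: float, xi: float) -> float:
    """Gaussian direction-dependent weighting Theta_{Psi_0,j}(m, k):
    equals 1.0 when the extracted direction psi matches the
    predetermined direction psi_0 and decays as the deviation grows.
    xi controls the window width (parameterization assumed)."""
    return math.exp(-((psi - psi_0) ** 2) / xi)
```

Components matching the predetermined direction pass unchanged; components far from it are attenuated toward zero.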
[0181] According to an embodiment, the audio analyzer 100 is configured to determine by using the directional information determination 120 a directional information comprising the panning direction information 125 and/or the direction-dependent weighting 127. This direction information is, for example, obtained on the basis of an audio content of the two or more input audio signals 112.
[0182] According to an embodiment, the audio analyzer 100 comprises a scaler 134 and/or a combiner 136 for a contributions determination 130. With the scaler 134 the direction-dependent weighting 127 is applied to the one or more spectral-domain representations 110 of the two or more input audio signals 112, in order to obtain weighted spectral-domain representations 135 (e.g., Y.sub.i,b,Ψ.sub.0,j(m, k)).
[0183] According to an embodiment, the scaling factor determination 126 is configured to determine the direction-dependent weighting 127 such that, per predetermined direction, signal components whose extracted direction values Ψ(m, k) deviate from the predetermined direction Ψ.sub.0,j are weighted to have less influence than signal components whose extracted direction values Ψ(m, k) equal the predetermined direction Ψ.sub.0,j. In other words, with the direction-dependent weighting 127 for a first predetermined direction Ψ.sub.0,1, signal components associated with the first predetermined direction Ψ.sub.0,1 are emphasized over signal components associated with other directions in a first weighted spectral-domain representation Y.sub.L,b,Ψ.sub.0,1(m, k).
[0184] According to an embodiment, the audio analyzer 100 is configured to obtain the weighted spectral-domain representations 135 Y.sub.i,b,Ψ.sub.0,j(m, k) according to

Y.sub.i,b,Ψ.sub.0,j(m, k)=Θ.sub.Ψ.sub.0,j(m, k)·X.sub.i,b(m, k);

wherein X.sub.i,b(m, k) designates the spectral-domain representation 110 of the input audio signal 112 having signal index i.
[0185] Additional or alternative functionalities of the scaler 134 are described with regard to
[0186] According to an embodiment, the weighted spectral-domain representations 135.sub.1 of the first input audio signal and the weighted spectral-domain representations 135.sub.2 of the second input audio signal are combined by the combiner 136 to obtain a weighted combined spectral-domain representation 137 Y.sub.DM,b,Ψ.sub.0,j(m, k) associated with the respective predetermined direction Ψ.sub.0,j.
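The scaler 134 and combiner 136 together might be sketched as follows for one spectral band and one predetermined direction; summing the weighted channel spectra is an assumed combination rule, and all names are illustrative:

```python
import math

def weight_and_combine(spectra, psis, psi_0, xi):
    """Sketch of scaler 134 + combiner 136 for one band and one
    predetermined direction psi_0: each channel's spectral value per
    bin is scaled by the Gaussian direction window evaluated at that
    bin's extracted direction, then the weighted channel spectra are
    summed (combination rule assumed) into Y_DM."""
    weights = [math.exp(-((p - psi_0) ** 2) / xi) for p in psis]
    combined = []
    for k, w in enumerate(weights):
        # sum the weighted contributions of all channels at bin k
        combined.append(sum(ch[k] * w for ch in spectra))
    return combined
```

For bins whose extracted direction equals psi_0, the weight is 1 and the channels simply add; off-direction bins are attenuated before combination.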
[0187] Based on the weighted combined spectral-domain representation 137 a loudness information determination 140 is performed to obtain a loudness information 142 as analysis result. According to an embodiment, the loudness information determination 140 comprises a loudness determination in bands 144 and a loudness determination over all bands 146. According to an embodiment, the loudness determination in bands 144 is configured to determine, for each spectral band b, band loudness values 145 on the basis of the weighted combined spectral-domain representation 137. In other words, the loudness determination in bands 144 determines a loudness at each spectral band in dependence on the predetermined directions Ψ.sub.0,j. Thus, the obtained band loudness values 145 no longer depend on single spectral bins k.
[0188] According to an embodiment, the audio analyzer is configured to compute a mean of squared spectral values of the weighted combined spectral-domain representation 137 (e.g., Y.sub.DM,b,Ψ.sub.0,j(m, k)) over the spectral bins k of a spectral band b, in order to obtain the band loudness values 145.
[0189] According to an embodiment, the audio analyzer is configured to obtain the band loudness values 145 L.sub.b,Ψ.sub.0,j(m) according to

L.sub.b,Ψ.sub.0,j(m)=((1/K.sub.b)Σ.sub.k|Y.sub.DM,b,Ψ.sub.0,j(m, k)|.sup.2).sup.0.25;
wherein K.sub.b designates a number of spectral bins in a frequency band having frequency band index b; wherein k is a running variable and designates spectral bins in the frequency band having frequency band index b; wherein b designates a spectral band; and wherein Y.sub.DM,b,Ψ.sub.0,j(m, k) designates the weighted combined spectral-domain representation 137 associated with the predetermined direction Ψ.sub.0,j.
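Assuming the mean-of-squares band loudness described above with a compressive exponent of 0.25 (the exponent models the loudness power law and is an assumption here), the band loudness computation can be sketched as:

```python
def band_loudness(y_dm_band):
    """Band loudness L_{b,Psi_0,j}(m) for one band b: mean of the
    squared spectral magnitudes over the K_b bins of the band,
    compressed by an exponent of 0.25 (exponent assumed)."""
    k_b = len(y_dm_band)  # number of spectral bins K_b in band b
    mean_sq = sum(abs(v) ** 2 for v in y_dm_band) / k_b
    return mean_sq ** 0.25
```

The result depends only on the band index b (and the direction and frame via the input), no longer on single spectral bins k.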
[0190] At the loudness determination over all bands 146 the band loudness values 145 are, for example, averaged over all spectral bands to provide the loudness information 142 dependent on the predetermined direction and on at least one time frame m. According to an embodiment, the loudness information 142 can represent a general loudness caused by the input audio signals 112 in different directions in a listening room. According to an embodiment, the loudness information 142 can be associated with combined loudness values associated with different given or predetermined directions Ψ.sub.0,j.
[0191] According to an embodiment, the audio analyzer is configured to obtain a plurality of combined loudness values L(m, Ψ.sub.0,j) associated with a direction designated with index Ψ.sub.0,j and a time (or time frame) designated with a time index m according to

L(m, Ψ.sub.0,j)=(1/B)Σ.sub.bL.sub.b,Ψ.sub.0,j(m);
wherein B designates a total number of spectral bands b and wherein L.sub.b,Ψ.sub.0,j(m) designates the band loudness values 145 associated with a spectral band having spectral band index b and with the predetermined direction Ψ.sub.0,j.
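The averaging over all spectral bands can be sketched directly; one call yields one entry of the directional loudness map for a given direction and time frame:

```python
def combined_loudness(band_loudness_values):
    """Combined loudness value L(m, Psi_0,j): average of the band
    loudness values L_{b,Psi_0,j}(m) over all B spectral bands."""
    b_total = len(band_loudness_values)  # total number of bands B
    return sum(band_loudness_values) / b_total
```

Evaluating this for every predetermined direction Ψ.sub.0,j and frame m fills the directional loudness map.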
[0192] In
[0195] The audio analyzer 100 shown in
[0196] Based on the time/frequency signals 110 a directional information determination 120 is performed. The directional information determination 120 comprises, for example, a directional analysis 124 and a determination of window functions 126. At a contributions determination unit 130 directional signals 132 are obtained, for example, by decomposing the time/frequency signals 110 into directional signals by applying direction-dependent window functions 127 to the time/frequency signals 110. Based on the directional signals 132 a loudness calculation 140 is performed to obtain the loudness information 142 as an analysis result. The loudness information 142 can comprise a directional loudness map.
[0197] The audio analyzer 100 in
[0199] According to an embodiment, based on the time/frequency signals 110 a loudness calculation 140 is performed to obtain a combined loudness value 145 per time/frequency tile. The combined loudness value 145 is not associated with any directional information. The combined loudness value is, for example, associated with a loudness resulting from a superposition of the input signals 112 in a time/frequency tile.
[0200] Furthermore, the audio analyzer 100 is configured to perform a directional analysis 124 of the time/frequency signals 110 to obtain a directional information 122. According to
[0201] The audio analyzer 100 in
[0202] In
[0203] More details regarding the audio analyzer 100 in
[0205] According to an embodiment, the loudness calculation 140 results in combined loudness values 145, e.g., per time/frequency tile. The combined loudness values 145 are, for example, associated with a combination of the first input audio signal and the second input audio signal (e.g., a combination of the two or more input audio signals).
[0206] Based on the directional information 122 and the combined loudness values 145, the combined loudness values 145 can be accumulated 146 into direction- and time-dependent histogram bins. Thus, for example, all combined loudness values 145 associated with a certain direction are summed. The directional information 122 associates the directions with time/frequency tiles. The accumulation 146 yields a directional loudness histogram, which can represent a loudness information 142 as an analysis result of a herein described audio analyzer.
[0207] It is also possible that time/frequency tiles corresponding to the same direction and/or neighboring directions in a different or neighboring time frame (e.g., in a previous or subsequent time frame) can be associated with the direction in the current time step or time frame. This means, for example, that the directional information 122 comprises direction information per frequency tile (or frequency bin) dependent on time. Thus, for example, the directional information 122 is obtained for multiple time frames or for all time frames.
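The accumulation 146 into direction-dependent histogram bins can be sketched as follows; mapping each tile's direction to an integer bin index is an assumed quantization step, and all names are illustrative:

```python
def directional_histogram(loudness_per_tile, direction_per_tile, n_bins):
    """Accumulation 146 sketch: the combined loudness value of each
    time/frequency tile is summed into the histogram bin of the
    direction associated with that tile (direction-to-bin quantization
    assumed). Returns one row of a directional loudness histogram."""
    hist = [0.0] * n_bins
    for loud, d in zip(loudness_per_tile, direction_per_tile):
        hist[d] += loud  # all loudness of direction d accumulates here
    return hist
```

Running this per time frame (or per group of neighboring frames) yields the direction- and time-dependent histogram described above.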
[0208] More details regarding the histogram approach shown in
[0211] According to an embodiment, a directional information 122 can comprise scaling factors associated with a direction 121 and time/frequency tiles 123 as shown in
[0213] The first loudness information 142.sub.1 and the second loudness information 142.sub.2 can be determined by a loudness information determination 100, which can be performed by the audio similarity evaluator 200. According to an embodiment, the loudness information determination 100 can be performed by an audio analyzer. Thus, for example, the audio similarity evaluator 200 can comprise an audio analyzer or receive the first loudness information 142.sub.1 and/or the second loudness information 142.sub.2 from an external audio analyzer. According to an embodiment, the audio analyzer can comprise features and/or functionalities as described with regard to an audio analyzer in
[0214] According to an embodiment, the set of reference audio signals 112b can represent an ideal set of audio signals for an optimized audio perception by a listener in the listening space.
[0215] According to an embodiment, the first loudness information 142.sub.1 (for example, a vector comprising L.sub.1(m, Ψ.sub.0,1) to L.sub.1(m, Ψ.sub.0,j)) and/or the second loudness information 142.sub.2 (for example, a vector comprising L.sub.2(m, Ψ.sub.0,1) to L.sub.2(m, Ψ.sub.0,j)) can comprise a plurality of combined loudness values associated with the respective input audio signals (e.g., the input audio signals corresponding to the first set of input audio signals 112a or the reference audio signals corresponding to the set of reference audio signals 112b (and associated with respective predetermined directions)). The respective predetermined directions can represent panning indices. Since each input audio signal is, for example, associated with a loudspeaker, the respective predetermined directions can be understood as equally spaced positions between the respective loudspeakers (e.g., between neighboring loudspeakers and/or other pairs of loudspeakers). In other words, the audio similarity evaluator 200 is configured to obtain a direction component (e.g., a herein described first direction) used for obtaining the loudness information 142.sub.1 and/or 142.sub.2 with different directions (e.g., herein described second directions) using metadata representing position information of loudspeakers associated with the input audio signals. The combined loudness values of the first loudness information 142.sub.1 and/or of the second loudness information 142.sub.2 describe the loudness of signal components of the respective set of input audio signals 112a and 112b associated with the respective predetermined directions. The first loudness information 142.sub.1 and/or the second loudness information 142.sub.2 is associated with combinations of a plurality of weighted spectral-domain representations associated with the respective predetermined direction.
[0216] The audio similarity evaluator 200 is configured to compare the first loudness information 142.sub.1 with the second loudness information 142.sub.2 in order to obtain a similarity information 210 describing a similarity between the first set of two or more input audio signals 112a and the set of two or more reference audio signals 112b. This can be performed by a loudness information comparison unit 220. The similarity information 210 can indicate a quality of the first set of input audio signals 112a. To further improve the prediction of a perception of the first set of input audio signals 112a based on the similarity information 210, only a subset of frequency bands in the first loudness information 142.sub.1 and/or in the second loudness information 142.sub.2 can be considered. According to an embodiment, the first loudness information 142.sub.1 and/or the second loudness information 142.sub.2 is only determined for frequency bands with frequencies of 1.5 kHz and above. Thus, the compared loudness information 142.sub.1 and 142.sub.2 can be optimized based on the sensitivity of the human auditory system. Thus, the loudness information comparison unit 220 is configured to compare loudness information 142.sub.1 and 142.sub.2, which comprise only loudness values of relevant frequency bands. Relevant frequency bands can be associated with frequency bands corresponding to a (e.g., human ear) sensitivity higher than a predetermined threshold for predetermined level differences.
[0217] To obtain the similarity information 210, e.g., a difference between the second loudness information 142.sub.2 and the first loudness information 142.sub.1 is calculated.
[0218] This difference can represent a residual loudness information and can already define the similarity information 210. Alternatively, the residual loudness information is processed further to obtain the similarity information 210. According to an embodiment, the audio similarity evaluator 200 is configured to determine a value that quantifies the difference over a plurality of directions. This value can be a single scalar value representing the similarity information 210. To obtain the scalar value, the loudness information comparison unit 220 can be configured to calculate the difference for parts or a complete duration of the first set of input audio signals 112a and/or the set of reference audio signals 112b and then average the obtained residual loudness information over all panning directions (e.g., the different directions with which the first loudness information 142.sub.1 and/or the second loudness information 142.sub.2 is associated) and over time, producing a single number termed model output variable (MOV).
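The averaging over panning directions and time can be sketched as follows; using absolute differences is an assumption, since the text only specifies that a difference is computed and averaged into a single number:

```python
def model_output_variable(dlm_ref, dlm_test):
    """Comparison 220 sketch: the residual between a reference and a
    test directional loudness map (each indexed [frame][direction]) is
    averaged over all panning directions and all frames into a single
    number, the model output variable (MOV). Absolute differences are
    an assumed distance; the averaging itself follows the text."""
    acc, n = 0.0, 0
    for row_ref, row_test in zip(dlm_ref, dlm_test):
        for a, b in zip(row_ref, row_test):
            acc += abs(a - b)
            n += 1
    return acc / n
```

Identical maps yield an MOV of 0; larger values indicate a larger perceptual deviation between the signal sets.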
[0220] According to an embodiment, in a next step audio components of the stereo signals 112a and 112b can be analyzed for their directional information. Different panning directions 125 can be predetermined and can be combined with a window width 128 to obtain direction-dependent weightings 127.sub.1 to 127.sub.7. Based on the direction-dependent weighting 127 and the spectral-domain representation 110a and/or 110b of the respective stereo input signal 112a and/or 112b a panning index directional decomposition 130 can be performed to obtain contributions 132a and/or 132b. According to an embodiment, the contributions 132a and/or 132b are then, for example, processed by a loudness calculation 144 to obtain a loudness 145a and/or 145b per frequency band and panning direction. According to an embodiment, an ERB-wise frequency averaging 146 (ERB=equivalent rectangular bandwidth) is performed on the loudness signals 145a and/or 145b to obtain directional loudness maps 142a and/or 142b for a loudness information comparison 220. The loudness information comparison 220 is, for example, configured to calculate a distance measure based on the two directional loudness maps 142a and 142b. The distance measure can represent a directional loudness map comprising differences between the two directional loudness maps 142a and 142b. According to an embodiment, a single number termed model output variable (MOV) can be obtained as the similarity information 210 by averaging the distance measure over all panning directions and time.
[0223] The audio encoder 300 is configured to adapt 340 encoding parameters in dependence on one or more directional loudness maps 142 (e.g., L.sub.i(m, Ψ.sub.0,j) for a plurality of different Ψ.sub.0), which represent loudness information associated with a plurality of different directions (e.g., predetermined directions or directions of the one or more signals 112 to be encoded). According to an embodiment, the encoding parameters comprise quantization parameters and/or other encoding parameters, like a bit distribution and/or parameters relating to a disabling/enabling of the encoding 310.
[0224] According to an embodiment, the audio encoder 300 is configured to perform a loudness information determination 100 to obtain the directional loudness map 142 based on the input audio signal 112, or based on the processed input audio signal 110. Thus, for example, the audio encoder 300 can comprise an audio analyzer 100 as described with regard to
[0225] According to an embodiment, the audio encoder 300 can receive only one input audio signal 112. In this case, the directional loudness map 142 comprises, for example, loudness values for only one direction. According to an embodiment, the directional loudness map 142 can comprise loudness values equaling zero for directions differing from a direction associated with the input audio signal 112. In the case of only one input audio signal 112 the audio encoder 300 can decide based on the directional loudness map 142 if the adaptation 340 of the encoding parameters should be performed. Thus, for example, the adaptation 340 of the encoding parameters can comprise a setting of the encoding parameters to standard encoding parameters for mono signals.
[0226] If the audio encoder 300 receives a stereo signal or a multi-channel signal as the input audio signal 112, the directional loudness map 142 can comprise loudness values for different directions (e.g., differing from zero). In case of a stereo input audio signal the audio encoder 300 obtains, for example, one directional loudness map 142 associated with the two input audio signals 112. In case of a multi-channel input audio signal 112 the audio encoder 300 obtains, for example, one or more directional loudness maps 142 based on the input audio signals 112. If a multi-channel signal 112 is encoded by the audio encoder 300, e.g., an overall directional loudness map 142, based on all channel signals and/or directional loudness maps, and/or one or more directional loudness maps 142, based on signal pairs of the multi-channel input audio signal 112, can be obtained by the loudness information determination 100. Thus, for example, the audio encoder 300 can be configured to perform the adaptation 340 of the encoding parameters in dependence on contributions of individual directional loudness maps 142, for example, of signal pairs, a mid-signal, a side-signal, a downmix signal, a difference signal and/or of groups of three or more signals, to an overall directional loudness map 142, for example, associated with multiple input audio signals, e.g., associated with all signals of the multi-channel input audio signal 112 or a processed multi-channel input audio signal 110.
[0227] The loudness information determination 100 as described with regard to
[0229] According to an embodiment, the input audio content 112 can be directly encoded 310 or optionally processed 330 before. As already described above, the audio encoder 300 can be configured to determine a spectral-domain representation 110 of one or more input audio signals of the input audio content 112 by the processing 330. Alternatively, the processing 330 can comprise further processing steps to derive one or more signals of the input audio content 112, which can undergo a time-domain to spectral-domain conversion to receive the spectral-domain representations 110. According to an embodiment, the signals derived by the processing 330 can comprise, for example, a mid-signal or downmix signal and side-signal or difference signal.
[0230] According to an embodiment, the signals of the input audio content 112 or the spectral-domain representations 110 can undergo a quantization by the quantizer 312. The quantizer 312 uses, for example, one or more quantization parameters to obtain one or more quantized spectral-domain representations 313. This one or more quantized spectral-domain representations 313 can be encoded by the coding unit 314, in order to obtain the one or more encoded audio signals of the encoded audio content 320.
[0231] To optimize the encoding 310 by the audio encoder 300, the audio encoder 300 can be configured to adapt 342 quantization parameters. The quantization parameters, for example, comprise scale factors or parameters describing which quantization accuracies or quantization steps should be applied to which spectral bins of frequency bands of the one or more signals to be quantized. According to an embodiment, the quantization parameters describe, for example, an allocation of bits to different signals to be quantized and/or to different frequency bands. The adaptation 342 of the quantization parameters can be understood as an adaptation of a quantization precision and/or an adaptation of noise introduced by the encoder 300 and/or as an adaptation of a bit distribution between the one or more signals 112/110 and/or parameters to be encoded by the audio encoder 300. In other words, the audio encoder 300 is configured to adjust the one or more quantization parameters in order to adapt the bit distribution, to adapt the quantization precision, and/or to adapt the noise. Additionally the quantization parameters and/or the coding parameters can be encoded 310 by the audio encoder.
[0232] According to an embodiment, the adaptation 340 of encoding parameters, like the adaptation 342 of the quantization parameters and the adaptation 344 of the coding parameters, can be performed in dependence on the one or more directional loudness maps 142, which represent loudness information associated with the plurality of different directions (e.g., panning directions) of the one or more signals 112/110 to be quantized. To be more accurate, the adaptation 340 can be performed in dependence on contributions of individual directional loudness maps 142 of the one or more signals to be encoded to an overall directional loudness map 142. This can be performed as described with regard to
[0233] According to an embodiment, the audio encoder 300 is configured to determine the overall directional loudness map on the basis of the input audio signals 112, or the spectral-domain representations 110, such that the overall directional loudness map represents loudness information associated with different directions, for example, of audio components, of an audio scene represented by the input audio content 112. Alternatively, the overall directional loudness map can represent loudness information associated with different directions of an audio scene to be represented, for example, after a decoder-sided rendering. According to an embodiment, the different directions can be obtained by a loudness information determination 100 possibly in combination with knowledge or side information regarding positions of loudspeakers and/or knowledge or side information describing positions of audio objects. This knowledge or side information can be obtained based on the one or more signals 112/110 to be quantized, since these signals 112/110 are, for example, associated in a fixed, non-signal-dependent manner, with different directions or with different loudspeakers, or with different audio objects. A signal is, for example, associated with a certain channel, which can be interpreted as a direction of the different directions (e.g., of the herein described first directions). According to an embodiment, audio objects of the one or more signals are panned to different directions or rendered at different directions, which can be obtained by the loudness information determination 100 as an object rendering information. This knowledge or side information can be obtained by the loudness information determination 100 for groups of two or more input audio signals of the input audio content 112 or the spectral-domain representations 110.
[0234] According to an embodiment, the signals 112/110 to be quantized can comprise components, for example, a mid-signal and a side-signal of a mid-side stereo coding, of a joint multi-signal coding of two or more input audio signals 112. Thus, the audio encoder 300 is configured to estimate the aforementioned contributions of directional loudness maps 142 of one or more residual signals of the joint multi-signal coding to the overall directional loudness map 142, and to adjust the one or more encoding parameter 340 in dependence thereof.
[0235] According to an embodiment, the audio encoder 300 is configured to adapt the bit distribution between the one or more signals 112/110 and/or parameters to be encoded, and/or to adapt the quantization precision of the one or more signals 112/110 to be encoded, and/or to adapt the noise introduced by the encoder 300, individually for different spectral bins or individually for different frequency bands. This means, for example, that the adaptation 342 of the quantization parameters is performed such that the encoding 310 is improved for individual spectral bins or individual different frequency bands.
[0236] According to an embodiment, the audio encoder 300 is configured to adapt the bit distribution between the one or more signals 112/110 and/or the parameters to be encoded in dependence on an evaluation of a spatial masking between two or more signals to be encoded. The audio encoder is, for example, configured to evaluate the spatial masking on the basis of the directional loudness maps 142 associated with the two or more signals 112/110 to be encoded. Additionally or alternatively, the audio encoder is configured to evaluate the spatial masking or a masking effect of a loudness contribution associated with a first direction of a first signal to be encoded onto a loudness contribution associated with a second direction, which is different from the first direction, of a second signal to be encoded. According to an embodiment, the loudness contribution associated with the first direction can, for example, represent a loudness information of an audio object or audio component of the signals of the input audio content, and the loudness contribution associated with the second direction can represent, for example, a loudness information associated with another audio object or audio component of the signals of the input audio content. Dependent on the loudness information of the loudness contributions associated with the first direction and the second direction, and depending on the angular distance between the first direction and the second direction, the masking effect or the spatial masking can be evaluated. According to an embodiment, the masking effect decreases with an increasing angular difference between the first direction and the second direction. Similarly, a temporal masking can be evaluated.
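A possible sketch of the direction-dependent masking evaluation; the exponential spread model and the spread constant are assumptions, as the text only states that the masking effect decreases with increasing angular difference:

```python
import math

def masking_contribution(masker_loudness, angle_masker, angle_maskee,
                         spread=0.5):
    """Spatial-masking sketch: the masking that a loudness contribution
    at angle_masker exerts on a contribution at angle_maskee decreases
    monotonically with their angular difference. The exponential decay
    and the spread constant are modeling assumptions."""
    delta = abs(angle_masker - angle_maskee)
    return masker_loudness * math.exp(-delta / spread)
```

A bit-allocation stage could, for instance, spend fewer bits on directions whose loudness lies below the summed masking contributions of nearby, louder directions.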
[0237] According to an embodiment, the adaptation 342 of the quantization parameters can be performed by the audio encoder 300 in order to adapt the noise introduced by the encoder 300 based on a directional loudness map achievable by an encoded version 320 of the input audio content 112. Thus, the audio encoder 300 is, for example, configured to use a deviation between a directional loudness map 142, which is associated with a given un-encoded input audio signal 112/110 (or two or more input audio signals), and a directional loudness map achievable by an encoded version 320 of the given input audio signal 112/110 (or two or more input audio signals), as a criterion for an adaptation of the provision of the given encoded audio signal or audio signals of the encoded audio content 320. This deviation can represent a quality of the encoding 310 of the encoder 300. Thus, the encoder 300 can be configured to adapt 340 the encoding parameters such that the deviation is below a certain threshold. In this manner, the feedback loop 322 is realized to improve the encoding 310 by the audio encoder 300 based on directional loudness maps 142 of the encoded audio content 320 and directional loudness maps 142 of the un-encoded input audio content 112 or of the un-encoded spectral-domain representations 110. According to an embodiment, in the feedback loop 322 the encoded audio content 320 is decoded to perform a loudness information determination 100 based on decoded audio signals. Alternatively, it is also possible that the directional loudness maps 142 of the encoded audio content 320 are obtained (e.g., predicted) by a feed-forward estimation realized by a neural network.
[0238] According to an embodiment, the audio encoder is configured to adjust the one or more quantization parameters by the adaptation 342 to adapt a provision of the one or more encoded audio signals of the encoded audio content 320.
[0239] According to an embodiment, the adaptation 340 of encoding parameters can be performed in order to disable or enable the encoding 310 and/or to activate and deactivate a joint coding tool, which is, for example, used by the coding unit 314. This is, for example, performed by the adaptation 344 of the coding parameters. According to an embodiment, the adaptation 344 of the coding parameters can depend on the same considerations as the adaptation 342 of the quantization parameters. Thus, according to an embodiment, the audio encoder 300 is configured to disable the encoding 310 of a given one of the signals to be encoded, e.g., of a residual signal, when contributions of an individual directional loudness map 142 of the given one of the signals to be encoded (or, e.g., when contributions of a directional loudness map 142 of a pair of signals to be encoded or of a group of three or more signals to be encoded) to an overall directional loudness map are below a threshold. Thus, the audio encoder 300 is configured to effectively encode 310 only relevant information.
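The contribution-based enable/disable decision might be sketched as follows; forming the overall map as the per-direction sum of the individual maps, and measuring a contribution as a summed-loudness ratio, are assumptions:

```python
def signals_to_encode(individual_dlms, threshold):
    """Sketch of the contribution-based decision: a signal's encoding
    is disabled when the contribution of its individual directional
    loudness map to the overall map falls below a threshold. The
    overall map is taken as the per-direction sum of the individual
    maps, and the contribution as a ratio of summed loudness (both
    assumptions). Returns the indices of signals kept for encoding."""
    overall = [sum(col) for col in zip(*individual_dlms)]
    total = sum(overall)
    keep = []
    for idx, dlm in enumerate(individual_dlms):
        contribution = sum(dlm) / total if total > 0 else 0.0
        if contribution >= threshold:
            keep.append(idx)  # perceptually relevant: encode it
    return keep
```

Signals whose directional loudness is negligible relative to the overall scene (e.g., a near-silent residual) are then skipped, saving bits for the relevant information.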
[0240] According to an embodiment, the joint coding tool of the coding unit 314 is, for example, configured to jointly encode two or more of the input audio signals 112, or signals 110 derived therefrom, for example, to make an M/S (mid/side-signal) on/off decision. The adaptation 344 of the coding parameters can be performed such that the joint coding tool is activated or deactivated in dependence on one or more directional loudness maps 142, which represent loudness information associated with a plurality of different directions of the one or more signals 112/110 to be encoded. Alternatively or additionally, the audio encoder 300 can be configured to determine one or more parameters of a joint coding tool as coding parameters in dependence on the one or more directional loudness maps 142. Thus, with the adaptation 344 of the coding parameters, for example, a smoothing of frequency-dependent prediction factors can be controlled, for example, to set parameters of an “intensity stereo” joint coding tool.
[0241] According to an embodiment, the quantization parameters and/or the coding parameters can be understood as control parameters, which can control the provision of the one or more encoded audio signals 320. Thus, the audio encoder 300 is configured to determine or estimate an influence of a variation of the one or more control parameters onto a directional loudness map 142 of one or more encoded signals 320, and to adjust the one or more control parameters in dependence on the determination or estimation of the influence. This can be realized by the feedback loop 322 and/or by a feed forward as described above.
[0243] The audio encoder 300 is configured to select 350 signals to be encoded jointly 310 out of a plurality of candidate signals 110, or out of a plurality of pairs of candidate signals 110 in dependence on directional loudness maps 142. The directional loudness maps 142 represent loudness information associated with a plurality of different directions, e.g., panning directions, of the candidate signals 110 or of the pairs of candidate signals 110 and/or predetermined directions.
[0244] According to an embodiment, the directional loudness maps 142 can be calculated by the loudness information determination 100 as described herein. Thus, the loudness information determination 100 can be implemented as described with regard to the audio encoder 300 described in
[0245] If the input audio content 112 comprises only one input audio signal, this signal is selected by the signal selection 350 to be encoded by the audio encoder 300, for example, using an entropy encoding to provide one encoded audio signal as the encoded audio content 320. In this case, for example, the audio encoder is configured to disable the joint encoding 310 and to switch to an encoding of only one signal.
[0246] If the input audio content 112 comprises two input audio signals 112.sub.1 and 112.sub.2, which can be described as X.sub.1 and X.sub.2, both signals 112.sub.1 and 112.sub.2 are selected 350 by the audio encoder 300 for the joint encoding 310 to provide one or more encoded signals in the encoded audio content 320. Thus, the encoded audio content 320 optionally comprises a mid-signal and a side-signal, or a downmix signal and a difference signal, or only one of these four signals.
[0247] If the input audio content 112 comprises three or more input audio signals, the signal selection 350 is based on the directional loudness maps 142 of the candidate signals 110. According to an embodiment, the audio encoder 300 is configured to use the signal selection 350 to select one signal pair out of the plurality of candidate signals 110, for which, according to the directional loudness maps 142, an efficient audio encoding and a high-quality audio output can be realized. Alternatively or additionally, it is also possible that the signal selection 350 selects three or more signals of the candidate signals 110 to be encoded jointly 310. Alternatively or additionally, it is possible that the audio encoder 300 uses the signal selection 350 to select more than one signal pair or group of signals for a joint encoding 310. The selection 350 of the signals 352 to be encoded can depend on contributions of individual directional loudness maps 142 of a combination of two or more signals to an overall directional loudness map. According to an embodiment, the overall directional loudness map is associated with multiple selected input audio signals or with each signal of the input audio content 112. How this signal selection 350 can be performed by the audio encoder 300 is exemplarily described in
[0248] Thus, the audio encoder 300 is configured to provide one or more encoded, for example, quantized and then losslessly encoded, audio signals, for example, encoded spectral-domain representations, on the basis of two or more input audio signals 112.sub.1, 112.sub.2, or on the basis of two or more signals 110.sub.1, 110.sub.2 derived therefrom, using the joint encoding 310 of two or more signals 352 to be encoded jointly.
[0249] According to an embodiment, the audio encoder 300 is, for example, configured to determine individual directional loudness maps 142 of two or more candidate signals, and compare the individual directional loudness maps 142 of the two or more candidate signals. Additionally the audio encoder is, for example, configured to select two or more of the candidate signals for a joint encoding in dependence on a result of the comparison, for example, such that candidate signals, individual loudness maps of which comprise a maximum similarity or a similarity which is higher than a similarity threshold, are selected for a joint encoding. With this optimized selection, a very efficient encoding can be realized since the high similarity of the signals to be encoded jointly can result in an encoding using only few bits. This means, for example, that a downmix signal or a residual signal of the chosen candidate pair can be efficiently encoded jointly.
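The similarity-driven pair selection described above could be sketched as follows. The normalized-correlation similarity measure, and all function and variable names, are illustrative assumptions; the embodiment only requires some comparison of the individual directional loudness maps, with the most similar candidates chosen for joint encoding.

```python
import numpy as np
from itertools import combinations

def select_joint_pair(dir_maps):
    """Return the index pair of candidate signals whose individual
    directional loudness maps are most similar (normalized correlation),
    together with the similarity value."""
    def similarity(a, b):
        a, b = a.ravel(), b.ravel()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom > 0.0 else 0.0

    best_pair, best_sim = None, -1.0
    # compare every pair of candidate maps and keep the most similar one
    for i, j in combinations(range(len(dir_maps)), 2):
        s = similarity(dir_maps[i], dir_maps[j])
        if s > best_sim:
            best_pair, best_sim = (i, j), s
    return best_pair, best_sim
```

A similarity threshold, as mentioned above, could be applied to `best_sim` instead of always taking the maximum.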
[0253] According to an embodiment, each directional loudness map 142 represents loudness information associated with different directions. The different directions are indicated in
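As a minimal illustration of such a map, the sketch below accumulates per-tile loudness into a small number of direction bins using a panning index derived from a stereo pair. The power-based loudness measure, the number of direction bins, and the function name are simplifying assumptions for illustration, not the exact computation used by the analyzer.

```python
import numpy as np

def directional_loudness_map(spec_l, spec_r, n_directions=5, eps=1e-12):
    """Accumulate per-tile loudness (here simply power, a simplification)
    into direction bins selected by a panning index; spec_l/spec_r are
    magnitude spectrograms (frames x bins) of a stereo signal pair."""
    power = spec_l ** 2 + spec_r ** 2
    # panning index in [-1, 1]: -1 = fully left, +1 = fully right
    pan = (spec_r ** 2 - spec_l ** 2) / (power + eps)
    # quantize each time/frequency tile's direction to one of n_directions bins
    dir_idx = np.clip(((pan + 1.0) / 2.0 * n_directions).astype(int),
                      0, n_directions - 1)
    dir_map = np.zeros((power.shape[0], n_directions))
    for d in range(n_directions):
        # sum the loudness of all tiles assigned to direction bin d
        dir_map[:, d] = np.where(dir_idx == d, power, 0.0).sum(axis=1)
    return dir_map
```

Each row of the result is then a loudness-over-direction vector for one time frame.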
[0254] According to an embodiment, the signal selection 350 is performed such that contributions of pairs of candidate signals to the overall directional loudness map 142b are determined. A relation between the overall directional loudness map 142b and the directional loudness maps 142a.sub.1 to 142a.sub.3 of the pairs of candidate signals can be described by the formula
DirLoudMap(1,2,3)=a*DirLoudMap(1,2)+b*DirLoudMap(2,3)+c*DirLoudMap(1,3).
[0255] The contribution as determined by the audio encoder using the signal selection can be represented by the factors a, b and c.
[0256] According to an embodiment, the audio encoder is configured to choose one or more pairs of candidate signals 112.sub.1 to 112.sub.3 having a highest contribution to the overall directional loudness map 142b for a joint encoding. This means, for example, that the pair of candidate signals is chosen by the signal selection 350, which is associated with the highest factor of the factors a, b and c.
[0257] Alternatively, the audio encoder is configured to choose one or more pairs of candidate signals 112.sub.1 to 112.sub.3 having a contribution to the overall directional loudness map 142b, which is larger than a predetermined threshold for a joint encoding. This means, for example, that a predetermined threshold is chosen and that each factor a, b, c is compared with the predetermined threshold to select each signal pair associated with a factor larger than the predetermined threshold.
[0258] According to an embodiment, the contributions can be in a range of 0% to 100%, which means, for example, that the factors a, b and c lie in a range from 0 to 1. A contribution of 100% is, for example, associated with a directional loudness map 142a equaling exactly the overall directional loudness map 142b. According to an embodiment, the predetermined threshold depends on how many input audio signals are included in the input audio content. According to an embodiment, the predetermined threshold can be defined as a contribution of at least 35% or of at least 50% or of at least 60% or of at least 75%.
[0259] According to an embodiment, the predetermined threshold depends on how many signals have to be selected by the signal selection 350 for the joint encoding. If, for example, at least two signal pairs have to be selected, two signal pairs can be selected, which are associated with directional loudness maps 142a having the highest contribution to the overall directional loudness map 142b. This means, for example, that the signal pair with the highest contribution and with the second highest contribution are selected 350.
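The contributions a, b, c relating the pair maps to the overall map could, for example, be estimated by a least-squares fit and then compared with the predetermined threshold. The fitting method and all names are assumptions for illustration; the text above only defines the factors and the threshold comparison, not how the factors are computed.

```python
import numpy as np

def pair_contributions(pair_maps, overall_map, threshold=0.35):
    """Fit factors so that overall ~= a*map_1 + b*map_2 + ... and select
    the pairs whose factor exceeds the predetermined threshold."""
    # one column per pair map, flattened over directions (and bands)
    A = np.stack([m.ravel() for m in pair_maps], axis=1)
    factors, *_ = np.linalg.lstsq(A, overall_map.ravel(), rcond=None)
    selected = [k for k, f in enumerate(factors) if f > threshold]
    return factors, selected
```

Choosing the single pair with the largest factor, as in paragraph [0256], would simply be `int(np.argmax(factors))`.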
[0260] It is advantageous to base the selection of the signals to be encoded by the audio encoder on directional loudness maps 142, since a comparison of directional loudness maps can indicate a quality of a perception of the encoded audio signals by a listener. According to an embodiment, the signal selection 350 is performed by the audio encoder such that the signal pair or the signal pairs whose directional loudness map 142a is most similar to the overall directional loudness map 142b are selected. This can result in a similar perception of the selected candidate pair or candidate pairs compared to a perception of all input audio signals. Thus, the quality of the encoded audio content can be improved.
[0262] The audio encoder 300 is configured to determine 100 an overall directional loudness map on the basis of the input audio signals 112 and/or to determine 100 one or more individual directional loudness maps 142 associated with individual input audio signals 112. The overall directional loudness map can be represented by L(m, φ.sub.0,i) and the individual directional loudness maps can be represented by L.sub.i(m, φ.sub.0,i). According to an embodiment, the overall directional loudness map can represent a target directional loudness map of a scene. In other words, the overall directional loudness map can be associated with a desired directional loudness map for a combination of the encoded audio signals. Additionally or alternatively, it is possible that directional loudness maps L(m, φ.sub.0,i) of signal pairs or of groups of three or more signals can be determined 100 by the audio encoder 300.
[0263] The audio encoder 300 is configured to encode 310 the overall directional loudness map 142 and/or one or more individual directional loudness maps 142 and/or one or more directional loudness maps of signal pairs or groups of three or more input audio signals 112 as side information. Thus, the encoded audio content 320 comprises the encoded audio signals and the encoded directional loudness maps. According to an embodiment, the encoding 310 can depend on one or more directional loudness maps 142, whereby it is advantageous to also encode these directional loudness maps 142 to enable a high quality decoding of the encoded audio content 320. With the directional loudness maps 142 as encoded side information, an originally intended quality characteristic (e.g., to be achievable by the encoding 310 and/or by an audio decoder) is provided by the encoded audio content 320.
[0264] According to an embodiment, the audio encoder 300 is configured to determine 100 the overall directional loudness map L(m, φ.sub.0,i) on the basis of the input audio signals 112 such that the overall directional loudness map represents loudness information associated with the different directions, for example, of audio components, of an audio scene represented by the input audio signals 112. Alternatively, the overall directional loudness map L(m, φ.sub.0,i) represents loudness information associated with the different directions, for example, of audio components, of an audio scene to be represented, for example, after a decoder-sided rendering by the input audio signals. The loudness information determination 100 can be performed by the audio encoder 300 optionally in combination with knowledge or side information regarding positions of loudspeakers and/or knowledge or side information describing positions of audio objects in the input audio signals 112.
[0265] According to an embodiment, the loudness information determination 100 can be implemented as described with regard to the other audio encoders 300 described herein.
[0266] The audio encoder 300 is, for example, configured to encode 310 the overall directional loudness map L(m, φ.sub.0,i) in the form of a set of values, for example, scalar values, associated with different directions. According to an embodiment, the values are additionally associated with a plurality of frequency bins or frequency bands. Each value or values at discrete directions of the overall directional loudness map can be encoded. This means, for example, that each value of a color matrix as shown in
[0267] Alternatively, the audio encoder 300 is, for example, configured to encode the overall directional loudness map L(m, φ.sub.0,i) using a center position value and a slope information. The center position value describes, for example, an angle or a direction at which a maximum of the overall directional loudness map for a given frequency band or frequency bin, or for a plurality of frequency bins or frequency bands is located. The slope information represents, for example, one or more scalar values describing slopes of the values of the overall directional loudness map in angle direction. The scalar values of the slope information are, for example, values of the overall directional loudness map for directions neighboring the center position value. The center position value can represent a scalar value of a loudness information and/or a scalar value of a direction corresponding to the loudness value.
[0268] Alternatively, the audio encoder is, for example, configured to encode the overall directional loudness map L(m, φ.sub.0,i) in the form of a polynomial representation or in the form of a spline representation.
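A toy version of the center-position/slope encoding of paragraph [0267] might look like the sketch below. Treating the "slope information" as the map values at the directions neighbouring the maximum is one of the interpretations mentioned above; all names are illustrative assumptions.

```python
import numpy as np

def encode_center_slope(dir_map_band):
    """Compact parameterization of one frequency band of a directional
    loudness map: index and value of the loudest direction, plus the
    neighbouring map values as a simple slope descriptor."""
    center = int(np.argmax(dir_map_band))          # direction of the maximum
    peak = float(dir_map_band[center])             # loudness at that direction
    # slope information: values at the neighbouring directions (0 at the edges)
    left = float(dir_map_band[center - 1]) if center > 0 else 0.0
    right = float(dir_map_band[center + 1]) if center < len(dir_map_band) - 1 else 0.0
    return {"center": center, "peak": peak, "neighbours": (left, right)}
```

Compared with transmitting every directional value, this reduces one band of the map to four numbers, at the cost of discarding the detailed shape.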
[0269] According to an embodiment, the above-described encoding possibilities 310 for the overall directional loudness map L(m, φ.sub.0,i) can also be applied for the individual directional loudness maps L.sub.i(m, φ.sub.0,i) and/or for directional loudness maps associated with signal pairs or groups of three or more signals.
[0270] According to an embodiment, the audio encoder 300 is configured to encode one downmix signal obtained on the basis of a plurality of input audio signals 112 and an overall directional loudness map L(m, φ.sub.0,i). Optionally also a contribution of a directional loudness map, associated with the downmix signal, to the overall directional loudness map is, for example, encoded as side information.
[0271] Alternatively, the audio encoder 300 is, for example, configured to encode 310 a plurality of signals, for example, the input audio signals 112 or the signals 110 derived therefrom, and to encode 310 individual loudness maps L.sub.i(m, φ.sub.0,i) of the plurality of signals 112/110 which are encoded 310 (e.g., of individual signals, of signal pairs or of groups of three or more signals). The encoded plurality of signals and the encoded individual directional loudness maps are, for example, transmitted in, or included in, the encoded audio representation 320.
[0272] According to an alternative embodiment, the audio encoder 300 is configured to encode 310 the overall directional loudness map L(m, φ.sub.0,i), a plurality of signals, for example, the input audio signals 112 or the signals 110 derived therefrom, and parameters describing contributions, for example, relative contributions of the signals, which are encoded to the overall directional loudness map. According to an embodiment, the parameters can be represented by the parameters a, b and c as described in
[0273] According to an embodiment, an audio encoder can comprise or combine individual features and/or functionalities as described with regard to one or more of the audio encoders 300 described in
[0275] The audio decoder 400 is configured to receive the encoded representation 422 of one or more audio signals and to provide a decoded representation 412 of the one or more audio signals. Furthermore, the audio decoder 400 is configured to receive the encoded directional loudness map information 424 and to decode 410 the encoded directional loudness map information 424, to obtain one or more decoded directional loudness maps 414. The decoded directional loudness maps 414 can comprise features and/or functionalities as described with regard to the above-described directional loudness maps 142.
[0276] According to an embodiment, the decoding 410 can be performed by the audio decoder 400 using an AAC-like decoding or using a decoding of entropy-encoded spectral values, or using a decoding of entropy-encoded loudness values.
[0277] The audio decoder 400 is configured to reconstruct 430 an audio scene using the decoded representation 412 of the one or more audio signals and using the one or more directional loudness maps 414. Based on the reconstruction 430, a decoded audio content 432, like a multi-channel-representation, can be determined by the audio decoder 400.
[0278] According to an embodiment, the directional loudness map 414 can represent a target directional loudness map to be achievable by the decoded audio content 432. Thus, with the directional loudness map 414 the reconstruction of the audio scene 430 can be optimized to result in a high-quality perception of a listener of the decoded audio content 432. This is based on the idea that the directional loudness map 414 can indicate a desired perception for the listener.
[0280] According to an embodiment, the one or more directional loudness maps associated with the output signals 432 can be determined by the audio decoder 400. The audio decoder 400 comprises, for example, an audio analyzer to determine the one or more directional loudness maps associated with the output signals 432, or is configured to receive from an external audio analyzer 100 the one or more directional loudness maps associated with the output signals 432.
[0281] According to an embodiment, the audio decoder 400 is configured to compare the one or more directional loudness maps associated with the output signals 432 and the decoded directional loudness maps 414; or compare the one or more directional loudness maps associated with the output signals 432 with a directional loudness map derived from the decoded directional loudness map 414, and to adapt 440 the decoding parameters or the reconstruction 430 based on this comparison. According to an embodiment, the audio decoder 400 is configured to adapt 440 the decoding parameters or to adapt the reconstruction 430 such that a deviation between the one or more directional loudness maps associated with the output signals 432 and the one or more target directional loudness maps is below a predetermined threshold. This can represent a feedback loop, whereby the decoding 410 and/or the reconstruction 430 is adapted such that the one or more directional loudness maps associated with the output signals 432 approximate the one or more target directional loudness maps by at least 75% or by at least 80%, or by at least 85%, or by at least 90%, or by at least 95%.
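The feedback loop of paragraph [0281] could be sketched as follows. The simple "move the map toward the target" update stands in for adapting the real decoding or reconstruction parameters, and the relative-norm deviation measure is an assumption; the text only requires that the deviation fall below a predetermined threshold (e.g., a 25% deviation corresponds to at least 75% approximation).

```python
import numpy as np

def adapt_until_close(output_map, target_map, max_deviation=0.25, max_iters=20):
    """Iteratively adjust the output directional loudness map toward the
    target map until the relative deviation drops below max_deviation."""
    current = np.asarray(output_map, dtype=float)
    deviation = np.linalg.norm(current - target_map) / np.linalg.norm(target_map)
    for _ in range(max_iters):
        if deviation < max_deviation:
            break
        # stand-in for adapting decoding/reconstruction parameters:
        # move the current map halfway toward the target map
        current = current + 0.5 * (target_map - current)
        deviation = np.linalg.norm(current - target_map) / np.linalg.norm(target_map)
    return current, deviation
```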
[0282] According to an embodiment, the audio decoder 400 is configured to receive one encoded downmix signal as the encoded representation 422 of the one or more audio signals and an overall directional loudness map as the encoded directional loudness map information 424. The encoded downmix signal is, for example, obtained on the basis of a plurality of input audio signals. Alternatively, the audio decoder 400 is configured to receive a plurality of encoded audio signals as the encoded representation 422 of the one or more audio signals and individual directional loudness maps of the plurality of encoded signals as the encoded directional loudness map information 424. The encoded audio signal represents, for example, input audio signals encoded by an encoder or signals derived from the input audio signals encoded by the encoder. Alternatively, the audio decoder 400 is configured to receive an overall directional loudness map as the encoded directional loudness map information 424, a plurality of encoded audio signals as the encoded representation 422 of the one or more audio signals, and additionally parameters describing contributions of the encoded audio signals to the overall directional loudness map. Thus, the encoded audio content 420 can additionally comprise the parameters, and the audio decoder 400 can be configured to use these parameters to improve the adaptation 440 of the decoding parameters, and/or to improve the reconstruction 430 of the audio scene.
[0283] The audio decoder 400 is configured to provide the output signals 432 on the basis of any of the aforementioned variants of the encoded audio content 420.
[0285] The first format may, for example, comprise a first number of channels or input audio signals and a side information or a spatial side information adapted to the first number of channels or input audio signals. The second format may, for example, comprise a second number of channels or output audio signals, which may be different from the first number of channels or input audio signals, and a side information or a spatial side information adapted to the second number of channels or output audio signals. The audio content 520 in the first format comprises, for example, one or more audio signals, one or more downmix signals, one or more residual signals, one or more mid signals, one or more side signals and/or one or more different signals.
[0286] The format converter 500 is configured to adjust 540 a complexity of the format conversion 510 in dependence on contributions of input audio signals of the first format to an overall directional loudness map 142 of the audio scene. The audio content 520 comprises, for example, the input audio signals of the first format. The contributions can directly represent contributions of the input audio signals of the first format to the overall directional loudness map 142 of the audio scene or can represent contributions of individual directional loudness maps of the input audio signals of the first format to the overall directional loudness map 142 or can represent contributions of directional loudness maps of pairs of the input audio signals of the first format to the overall directional loudness map 142. According to an embodiment, the contributions can be calculated by the format converter 500 as described in
[0287] The audio content 520 in the first format can comprise directional loudness map information of the input audio signals in the first format. Based on the directional loudness map information, the format converter 500 is, for example, configured to obtain the overall directional loudness map 142 and/or one or more directional loudness maps. The one or more directional loudness maps can represent directional loudness maps of each input audio signal in the first format and/or directional loudness maps of groups or pairs of signals in the first format. The format converter 500 is, for example, configured to derive the overall directional loudness map 142 from the one or more directional loudness maps or directional loudness map information.
[0288] The complexity adjustment 540 is, for example, performed such that it controls whether one or more of the input audio signals of the first format, which contribute to the directional loudness map below a threshold, can be skipped. In other words, the format converter 500 is, for example, configured to compute or estimate a contribution of a given input audio signal to the overall directional loudness map 142 of the audio scene and to decide whether to consider the given input audio signal in the format conversion 510 in dependence on the computation or estimation of the contribution. The computed or estimated contribution is, for example, compared with a predetermined absolute or relative threshold value by the format converter 500.
[0289] The contributions of the input audio signals of the first format to the overall directional loudness map 142 can indicate a relevance of the respective input audio signal for a quality of a perception of the audio content 530 in the second format. Thus, for example, only audio signals in the first format with high relevance undergo the format conversion 510. This can result in a high quality audio content 530 in the second format.
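The contribution-based skipping could be sketched like this. Measuring a signal's contribution as the energy ratio of its individual directional loudness map to the overall map is an assumed measure, since the text leaves the exact contribution computation open; the function name and threshold value are likewise illustrative.

```python
import numpy as np

def signals_to_convert(signal_maps, overall_map, rel_threshold=0.1):
    """Return the indices of input signals whose individual directional
    loudness maps contribute at least rel_threshold of the overall map's
    energy; signals below the threshold are skipped in the conversion."""
    overall_energy = float(np.sum(overall_map))
    keep = []
    for k, m in enumerate(signal_maps):
        contribution = float(np.sum(m)) / overall_energy if overall_energy > 0 else 0.0
        if contribution >= rel_threshold:
            keep.append(k)
    return keep
```

The same idea underlies the decoding and rendering complexity adjustments described further below: low-contribution signals are simply left out of the expensive processing step.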
[0291] The decoding complexity adjustment 440 can be performed by the audio decoder 400 similar to the complexity adjustment 540 of the format converter 500 in
[0292] According to an embodiment, the audio decoder 400 is configured to receive an encoded directional loudness map information, for example, extracted from the encoded audio content 420. The encoded directional loudness map information can be decoded 410 by the audio decoder 400 to determine a decoded directional loudness information 414. Based on the decoded directional loudness information 414 an overall directional loudness map of the one or more audio signals of the encoded audio content 420 and/or one or more individual directional loudness maps of the one or more audio signals of the encoded audio content 420 can be obtained. The overall directional loudness map of the one or more audio signals of the encoded audio content 420 is, for example, derived from the one or more individual directional loudness maps.
[0293] The overall directional loudness map 142 of the decoded audio scene 434 can be calculated by a directional loudness map determination 100, which can be optionally performed by the audio decoder 400. According to an embodiment, the audio decoder 400 comprises an audio analyzer as described with regard to
[0294] According to an embodiment, the audio decoder 400 is configured to compute or estimate a contribution of a given encoded signal to the overall directional loudness map 142 of the decoded audio scene and to decide whether to decode 410 the given encoded signal in dependence on the computation or estimation of the contribution. Thus, for example, the overall directional loudness map of the one or more audio signals of the encoded audio content 420 can be compared with the overall directional loudness map of the decoded audio scene 434. The determination of the contributions can be performed as described above (e.g., as described with respect to
[0295] Alternatively the audio decoder 400 is configured to compute or estimate a contribution of a given encoded signal to the decoded overall directional loudness map 414 of an encoded audio scene and to decide whether to decode 410 the given encoded signal in dependence on the computation or estimation of the contribution.
[0296] The complexity adjustment 440 is, for example, performed such that it controls whether one or more of the encoded representations of the one or more input audio signals, which contribute to the directional loudness map below a threshold, can be skipped.
[0297] Additionally or alternatively, the decoding complexity adjustment 440 can be configured to adapt decoding parameters based on the contributions.
[0298] Additionally or alternatively, the decoding complexity adjustment 440 can be configured to compare decoded directional loudness maps 414 with the overall directional loudness map of the decoded audio scene 434 (e.g., the overall directional loudness map of the decoded audio scene 434 is the target directional loudness map) to adapt decoding parameters.
[0300] According to an embodiment, for the reconstruction 640 of the audio scene the renderer 600 is configured to analyze the one or more input audio signals 622 to optimize a rendering to obtain a desired audio scene. Thus, for example, the renderer 600 is configured to modify a spatial arrangement of audio objects of the audio content 620. This means, for example, that the renderer 600 can reconstruct 640 a new audio scene. The new audio scene comprises, for example, rearranged audio objects compared to an original audio scene of the audio content 620. This means, for example, that a guitarist and/or a singer and/or other audio objects are positioned in the new audio scene at different spatial locations than in the original audio scene.
[0301] Additionally or alternatively, the number of audio channels, or a relationship between the audio channels, is adapted by the rendering of the audio renderer 600. Thus, for example, the renderer 600 can render an audio content 620 comprising a multichannel signal to, for example, a two-channel signal. This is, for example, desirable if only two loudspeakers are available for a representation of the audio content 620.
[0302] According to an embodiment, the rendering is performed by the renderer 600 such that the new audio scene shows only minor deviations with respect to the original audio scene.
[0303] The renderer 600 is configured to adjust 650 a rendering complexity in dependence on contributions of the input audio signals 622 to an overall directional loudness map 142 of a rendered audio scene 642. According to an embodiment, the rendered audio scene 642 can represent the new audio scene described above. According to an embodiment, the audio content 620 can comprise the overall directional loudness map 142 as side information. This overall directional loudness map 142 received as side information by the renderer 600 can indicate a desired audio scene for the rendered audio content 630. Alternatively, a directional loudness map determination 100 can determine the overall directional loudness map 142 based on the rendered audio scene received from the reconstruction unit 640. According to an embodiment, the renderer 600 can comprise the directional loudness map determination 100 or receive the overall directional loudness map 142 of an external directional loudness map determination 100. According to an embodiment, the directional loudness map determination 100 can be performed by an audio analyzer as described above.
[0304] According to an embodiment, the adjustment 650 of the rendering complexity is, for example, performed by skipping one or more of the input audio signals 622. The input audio signals 622 to be skipped are, for example, signals which contribute to the directional loudness map 142 below a threshold. Thus, only relevant input audio signals are rendered by the audio renderer 600.
[0305] According to an embodiment, the renderer 600 is configured to compute or estimate a contribution of a given input audio signal 622 to the overall directional loudness map 142 of the audio scene, e.g., of the rendered audio scene 642. Furthermore, the renderer 600 is configured to decide whether to consider the given input audio signal in the rendering in dependence on a computation or estimation of the contribution. Thus, for example, the computed or estimated contribution is compared with a predetermined absolute or relative threshold value.
Remarks:
[0315] In the following, different inventive embodiments and aspects will be described in a chapter “Objective assessment of spatial audio quality using directional loudness maps”, in a chapter “Use of directional loudness for audio coding and objective quality measurement”, in a chapter “Directional loudness for audio coding”, in a chapter “Generic steps for computing a directional loudness map (DirLoudMap)”, in a chapter “Example: Recovering directional signals with windowing/selection function derived from panning index” and in a chapter “Embodiments of Different forms of calculating the loudness maps using generalized criterion functions”.
[0316] Also, further embodiments will be defined by the enclosed claims.
[0317] It should be noted that any embodiments as defined by the claims can be supplemented by any of the details (features and functionalities) described in the above mentioned chapters.
[0318] Also, the embodiments described in the above mentioned chapters can be used individually, and can also be supplemented by any of the features in another chapter, or by any feature included in the claims.
[0319] Also, it should be noted that individual aspects described herein can be used individually or in combination. Thus, details can be added to each of said individual aspects without adding details to another one of said aspects.
[0320] It should also be noted that the present disclosure describes, explicitly or implicitly, features usable in an audio encoder (apparatus for providing an encoded representation of an input audio signal) and in an audio decoder (apparatus for providing a decoded representation of an audio signal on the basis of an encoded representation). Thus, any of the features described herein can be used in the context of an audio encoder and in the context of an audio decoder.
[0321] Moreover, features and functionalities disclosed herein relating to a method can also be used in an apparatus (configured to perform such functionality). Furthermore, any features and functionalities disclosed herein with respect to an apparatus can also be used in a corresponding method. In other words, the methods disclosed herein can be supplemented by any of the features and functionalities described with respect to the apparatuses.
[0322] Also, any of the features and functionalities described herein can be implemented in hardware or in software, or using a combination of hardware and software, as will be described in the section “implementation alternatives”.
Implementation Alternatives:
[0323] Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, like for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
[0324] Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
[0325] Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
[0326] Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
[0327] Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
[0328] In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
[0329] A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
[0330] A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
[0331] A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
[0332] A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
[0333] A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
[0334] In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, however, the methods may be performed by any hardware apparatus.
[0335] The apparatus described herein may be implemented using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
[0336] The apparatus described herein, or any components of the apparatus described herein, may be implemented at least partially in hardware and/or in software.
[0337] The methods described herein may be performed using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
[0338] The methods described herein, or any components of the apparatus described herein, may be performed at least partially by hardware and/or by software.
[0339] The above described embodiments are merely illustrative of the principles of the present invention. It is understood that modifications and variations of the arrangements and the details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the impending patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.
Objective Assessment of Spatial Audio Quality Using Directional Loudness Maps
Abstract
[0340] This work introduces a feature extracted, for example, from stereophonic/binaural audio signals serving as a measurement of perceived quality degradation in processed spatial auditory scenes. The feature can be based on a simplified model assuming a stereo mix created by directional signals positioned using amplitude level panning techniques. We calculate, for example, the associated loudness in the stereo image for each directional signal in the Short-Time Fourier Transform (STFT) domain to compare a reference signal and a deteriorated version and derive a distortion measure aiming to describe the perceived degradation scores reported in listening tests.
[0341] The measure was tested on an extensive listening test database with stereo signals processed by state-of-the-art perceptual audio codecs using non-waveform-preserving techniques such as bandwidth extension and joint stereo coding, known for presenting a challenge to existing quality predictors [1], [2]. Results suggest that the derived distortion measure can be incorporated as an extension to existing automated perceptual quality assessment algorithms for improving prediction on spatially coded audio signals.
Index Terms—Spatial Audio, Objective Quality Assessment, PEAQ, Panning Index.
1. Introduction
[0342] We propose a simple feature aiming to describe the deterioration in the perceived auditory stereo image, for example, based on the change in loudness at regions that share a common panning index [13]. That is, for example, regions in time and frequency of a binaural signal that share the same intensity level ratio between left and right channels, therefore corresponding to a given perceived direction in the horizontal plane of the auditory image.
[0343] The use of directional loudness measurements in the context of auditory scene analysis for audio rendering of complex virtual environments is also proposed in [14], whereas the current work is focused on overall spatial audio coding quality objective assessment.
[0344] The perceived stereo image distortion can be reflected as changes in a directional loudness map whose granularity corresponds to the number of panning index values to be evaluated, which is a parameter.
2. Method
[0345] According to an embodiment, the reference signal (REF) and the signal under test (SUT) are processed in parallel in order to extract features that aim to describe—when compared—the perceived auditory quality degradation caused by the operations carried out in order to produce the SUT.
[0346] Both binaural signals can be processed first by a peripheral ear model block. Each input signal is, for example, decomposed into the STFT domain using a Hann window of block size M=1024 samples and overlap of M/2, giving a time resolution of 21 ms at a sampling rate of Fs=48 kHz. The frequency bins of the transformed signal are then, for example, grouped to account for the frequency selectivity of the human cochlea following the ERB scale [15] in a total of B=20 frequency bin subsets or bands. Each band can then be weighted by a value derived from the combined linear transfer function that models the outer and middle ear as explained in [3].
[0347] The peripheral model then outputs signals X.sub.i,b(m, k) for each time frame m, frequency bin k, channel i={L, R} and frequency group b∈{0, . . . , B−1}, with different widths K.sub.b expressed in frequency bins.
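The peripheral decomposition described above can be sketched in Python as follows. This is a minimal sketch, not the model of [3]: the ERB-rate formula of Glasberg and Moore is assumed for the band grouping, the outer/middle-ear weighting is omitted, and the function names `stft` and `erb_band_edges` are illustrative.

```python
import numpy as np

FS = 48000          # sampling rate (Hz)
M = 1024            # STFT block size, Hann window, overlap M/2
B = 20              # number of ERB-spaced frequency groups (bands)

def stft(x, M=M):
    """Hann-windowed STFT with 50% overlap; returns (frames, bins)."""
    w = np.hanning(M)
    hop = M // 2
    n_frames = 1 + (len(x) - M) // hop
    return np.stack([np.fft.rfft(w * x[i*hop:i*hop+M])
                     for i in range(n_frames)])

def erb_band_edges(fs=FS, M=M, n_bands=B):
    """Group FFT bins into n_bands of equal width on the ERB-rate scale
    (assumed Glasberg/Moore formula); returns n_bands+1 bin indices."""
    f = np.fft.rfftfreq(M, 1.0 / fs)
    erb = 21.4 * np.log10(1 + 0.00437 * f)      # Hz -> ERB-rate
    edges = np.linspace(erb[1], erb[-1], n_bands + 1)
    return np.searchsorted(erb, edges)
```

With these defaults, one second of audio at 48 kHz yields 92 frames of 513 bins, grouped into 20 bands of growing width K.sub.b.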
2.1. Directional Loudness Calculation (e.g., performed by an herein described audio analyzer and/or audio similarity evaluator)
[0348] According to an embodiment, the directional loudness calculation can be performed for different directions, such that, for example, the given panning direction Ψ.sub.0 can be interpreted as Ψ.sub.0,j with jϵ[1;J]. The following concept is based on the method presented in [13], where a similarity measure between the left and right channels of a binaural signal in the STFT domain can be used to extract time and frequency regions occupied by each source in a stereophonic recording based on their designated panning coefficients during the mixing process.
[0349] Given the output of the peripheral model X.sub.i,b(m, k), a time-frequency (T/F) tile Y.sub.i,b,Ψ.sub.0(m, k) associated with a panning direction Ψ.sub.0 can, for example, be recovered by applying a windowing function Θ.sub.Ψ.sub.0(m, k) to the spectrum:

Y.sub.i,b,Ψ.sub.0(m, k)=X.sub.i,b(m, k)·Θ.sub.Ψ.sub.0(m, k)  (1)

The recovered signal will have the T/F components of the input that correspond to a panning direction Ψ.sub.0 within a tolerance value. The windowing function can be defined as a Gaussian window centered at the desired panning direction:

Θ.sub.Ψ.sub.0(m, k)=exp(−(Ψ(m, k)−Ψ.sub.0).sup.2/(2ξ.sup.2))  (2)

where Ψ(m, k) is the panning index as calculated in [13] with a defined support of [−1,1] corresponding to signals panned fully to the left or to the right, respectively. From the recovered tiles, the loudness associated with each band and panning direction can, for example, be computed as

L.sub.b,Ψ.sub.0(m)=[(1/K.sub.b)Σ.sub.k|Y.sub.DM,b,Ψ.sub.0(m, k)|.sup.2].sup.0.25  (3)

where Y.sub.DM is the sum signal of channels i={L, R}. The loudness is then averaged, for example, over all ERB bands to provide a directional loudness map defined over the panning domain Ψ.sub.0∈[−1,1] over time frame m:

L.sub.Ψ.sub.0(m)=(1/B)Σ.sub.b L.sub.b,Ψ.sub.0(m)  (4)
[0350] For further refinement, Equation 4 can be calculated considering only a subset of the ERB bands corresponding to frequency regions of 1.5 kHz and above, to account for the sensitivity of the human auditory system to level differences in this region, according to the duplex theory [17]. According to an embodiment, bands b∈{7, . . . , 19} are used, corresponding to frequencies from 1.34 kHz to F.sub.S/2.
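The directional loudness calculation above can be sketched in NumPy as follows. This is a sketch under stated assumptions: a simple level-ratio panning index stands in for the index of [13], the Gaussian width `xi` is an assumed value, and `band_slices` would come from the ERB grouping of the peripheral model (e.g. only bands 7 to 19).

```python
import numpy as np

def directional_loudness_map(XL, XR, band_slices, psi_grid, xi=0.06):
    """Directional loudness map over (frame, panning direction).

    XL, XR      : complex STFT tiles, shape (frames, bins)
    band_slices : list of slices, one ERB band each
    psi_grid    : panning directions psi0 in [-1, 1] to evaluate
    xi          : Gaussian selection-window width (assumed value)
    """
    eps = 1e-12
    aL, aR = np.abs(XL), np.abs(XR)
    # simple level-ratio panning index with support [-1, 1]
    # (stand-in for the panning index of [13])
    psi = (aL - aR) / (aL + aR + eps)
    XDM = XL + XR                          # downmix (sum) signal Y_DM
    maps = []
    for psi0 in psi_grid:
        theta = np.exp(-(psi - psi0)**2 / (2 * xi**2))  # Gaussian window
        band_loud = []
        for sl in band_slices:
            Y = XDM[:, sl] * theta[:, sl]               # windowed tile
            e = np.mean(np.abs(Y)**2, axis=1)           # mean energy per band
            band_loud.append(e**0.25)                   # Zwicker-type exponent
        maps.append(np.mean(band_loud, axis=0))         # average over bands
    return np.stack(maps, axis=1)          # shape (frames, directions)
```

A signal panned fully to one side produces a map whose maximum sits at the corresponding edge of the psi grid.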
[0351] In a further step, the directional loudness maps for the duration of the reference signal and the SUT are, for example, subtracted, and the absolute value of the residual is then averaged over all panning directions and time, producing a single number termed Model Output Variable (MOV), following the terminology in [3]. This number, effectively expressing the distortion between the directional loudness maps of reference and SUT, is expected to be a predictor of the associated subjective quality degradation reported in listening tests.
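The MOV described here reduces to a mean absolute difference between the two maps; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def directional_loudness_distortion(dlm_ref, dlm_sut):
    """Model Output Variable (MOV): absolute difference between the
    directional loudness maps of reference and signal under test,
    averaged over all panning directions and time frames."""
    return float(np.mean(np.abs(dlm_ref - dlm_sut)))
```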
3. Experiment Description
[0353] In order to test and validate the usefulness of the proposed MOV, a regression experiment similar to the one in [18] was carried out in which MOVs were calculated for reference and SUT pairs in a database and compared to their respective subjective quality scores from a listening test. The prediction performance of the system making use of this MOV is evaluated in terms of correlation against subjective data (R), absolute error score (AES), and number of outliers (v), as described in [3].
[0354] The database used for the experiment corresponds to a part of the Unified Speech and Audio Coding (USAC) Verification Test [19] Set 2, which contains stereo signals coded at bitrates ranging from 16 to 24 kbps using joint stereo [12] and bandwidth extension tools along with their quality score on the MUSHRA scale. Speech items were excluded since the proposed MOV is not expected to describe the main cause of distortion on speech signals. A total of 88 items (e.g., average length 8 seconds) remained in the database for the experiment.
[0355] To account for possible monaural/timbral distortions in the database, the output of an implementation of the standard PEAQ (Advanced Version), termed Objective Difference Grade (ODG), and that of POLQA, termed Mean Opinion Score (MOS), were taken as additional MOVs complementing the directional loudness distortion (DirLoudDist; e.g., DLD) described in the previous section. All MOVs can be normalized and adapted to give a score of 0 for best quality and 1 for worst possible quality. Listening test scores were scaled accordingly.
[0356] One random fraction of the available content of the database (60%, 53 items) was reserved for training a regression model using Multivariate Adaptive Regression Splines (MARS) [8], mapping the MOVs to the items' subjective scores. The remaining 35 items were used for testing the performance of the trained regression model. In order to remove the influence of the training procedure from the overall MOV performance analysis, the training/testing cycle was, for example, carried out 500 times with randomized training/test items, and the mean values of R, AES, and v were taken as performance measures.
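The repeated training/testing procedure can be sketched as below. A plain least-squares fit is used as a hypothetical stand-in for the MARS model of [8] (so the reported figures are not reproduced), and `repeated_regression_eval` is an illustrative name; only the mean test correlation R is computed here.

```python
import numpy as np

def repeated_regression_eval(movs, scores, n_cycles=500,
                             train_frac=0.6, seed=0):
    """Mean test-set correlation R over randomized train/test cycles.

    movs   : array (items, features), e.g. MOS, ODG, DirLoudDist
    scores : array (items,) of subjective scores
    """
    rng = np.random.default_rng(seed)
    n = len(scores)
    n_train = int(round(train_frac * n))
    rs = []
    for _ in range(n_cycles):
        idx = rng.permutation(n)
        tr, te = idx[:n_train], idx[n_train:]
        A = np.c_[movs[tr], np.ones(len(tr))]    # design matrix + bias
        w, *_ = np.linalg.lstsq(A, scores[tr], rcond=None)
        pred = np.c_[movs[te], np.ones(len(te))] @ w
        rs.append(np.corrcoef(pred, scores[te])[0, 1])
    return float(np.mean(rs))
```

On synthetic data with an exact linear relation, the sketch recovers R close to 1, which only validates the evaluation loop, not the MOVs themselves.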
4. Results and Discussion
TABLE 1. Mean performance values for 500 training/testing cycles of the regression model with different sets of MOVs. CHOI represents the 3 binaural MOVs as calculated in [20]; EITDD corresponds to the high-frequency envelope ITD distortion MOV as calculated in [1]; SEO corresponds to the 4 binaural MOVs from [1], including EITDD; DirLoudDist is the proposed MOV. The number in parentheses is the total number of MOVs used.

MOV Set (N)                    R      AES    ν
MOS + ODG (2)                  0.77   2.63   12
MOS + ODG + CHOI (5)           0.77   2.39   11
MOS + ODG + EITDD (3)          0.82   2.0    11
MOS + ODG + SEO (6)            0.88   1.65    7
MOS + ODG + DirLoudDist (3)    0.88   1.69    8
[0358] Table 1 shows the mean performance values (correlation, absolute error score, number of outliers) for the experiment described in Section 3. In addition to the proposed MOV, the methods for objective evaluation of spatially coded audio signals proposed in [20] and [1] were also tested for comparison. Both compared implementations make use of the classical inter-aural cue distortions mentioned in the introduction: IACC distortion (IACCD), ILD distortion (ILDD), and ITDD.
[0359] As mentioned, the baseline performance is given by ODG and MOS, which each achieve R=0.66 separately but present a combined performance of R=0.77, as shown in Table 1. This confirms that the features are complementary in the evaluation of monaural distortions.
[0360] Considering the work of Choi et al. [20], the addition of the three binaural distortions (CHOI in Table 1) to the two monaural quality indicators (making five joint MOVs) does not provide any further gain in prediction performance on the dataset used.
[0361] In [1], some further optional model refinements were made to the mentioned features in terms of lateral plane localization and cue distortion detectability. In addition, a novel MOV that considers high frequency envelope inter-aural time difference distortions (EITDD) [21] was, for example, incorporated. The set of these four binaural MOVs (marked as SEO in Table 1) plus the two monaural descriptors (6 MOVs in total) significantly improves the system performance for the current data set.
[0362] The improvement contributed by EITDD suggests that time-frequency energy envelopes as used in joint stereo techniques [12] represent a salient aspect of overall quality perception.
[0363] However, the presented MOV based on directional loudness map distortions (DirLoudDist) correlates even better with the perceived quality degradation than EITDD, reaching performance figures similar to the combination of all the binaural MOVs of [1], while adding only one MOV to the two monaural quality descriptors instead of four. Using fewer features for the same performance reduces the risk of over-fitting and indicates their higher perceptual relevance.
[0364] A maximum mean correlation against subjective scores for the database of 0.88 shows that there is still room for improvement.
[0365] According to an embodiment, the proposed feature is based on a herein described model that assumes a simplified description of stereo signals in which auditory objects are only localized in the lateral plane by means of ILDs, which is usually the case in studio-produced audio content [13]. For ITD distortions usually present when coding multi-microphone recordings or more natural sounds, the model needs to be either extended or complemented by a suitable ITD distortion measure.
5. Conclusions and Future Work
[0366] According to an embodiment, a distortion metric was introduced describing changes in a representation of the auditory scene based on the loudness of events corresponding to a given panning direction. The significant increase in performance with respect to the monaural-only quality prediction shows the effectiveness of the proposed method. The approach also suggests a possible alternative or complement in quality measurement for low bitrate spatial audio coding, where established distortion measurements based on classical binaural cues do not perform satisfactorily, possibly due to the non-waveform-preserving nature of the audio processing involved.
[0367] The performance measurements show that there are still areas for improvement towards a more complete model that also includes auditory distortions based on effects other than channel level differences. Future work also includes studying how the model can describe temporal instabilities/modulations in the stereo image as reported in [12] in contrast to static distortions.
REFERENCES
[0368] [1] Jeong-Hun Seo, Sang Bae Chon, Keong-Mo Sung, and Inyong Choi, “Perceptual objective quality evaluation method for high quality multichannel audio codecs,” J. Audio Eng. Soc, vol. 61, no. 7/8, pp. 535-545, 2013.
[0369] [2] M. Schafer, M. Bahram, and P. Vary, “An extension of the PEAQ measure by a binaural hearing model,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 8164-8168.
[0370] [3] ITU-R Rec. BS.1387, “Method for objective measurements of perceived audio quality,” International Telecommunication Union, Geneva, Switzerland, 2001.
[0371] [4] ITU-T Rec. P.863, “Perceptual objective listening quality assessment,” International Telecommunication Union, Geneva, Switzerland, 2014.
[0372] [5] Sven Kämpf, Judith Liebetrau, Sebastian Schneider, and Thomas Sporer, “Standardization of PEAQ-MC: Extension of ITU-R BS.1387-1 to Multichannel Audio,” in Audio Engineering Society Conference: 40th International Conference: Spatial Audio: Sense the Sound of Space, October 2010.
[0373] [6] K. Ulovec and M. Smutny, “Perceived audio quality analysis in digital audio broadcasting plus system based on PEAQ,” Radioengineering, vol. 27, pp. 342-352, April 2018.
[0374] [7] C. Faller and F. Baumgarte, “Binaural cue coding-Part II: Schemes and applications,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 520-531, November 2003.
[0375] [8] Jan-Hendrik Fleßner, Rainer Huber, and Stephan D. Ewert, “Assessment and prediction of binaural aspects of audio quality,” J. Audio Eng. Soc, vol. 65, no. 11, pp. 929-942, 2017.
[0376] [9] Marko Takanen and Gaëtan Lorho, “A binaural auditory model for the evaluation of reproduced stereophonic sound,” in Audio Engineering Society Conference: 45th International Conference: Applications of Time-Frequency Processing in Audio, March 2012.
[0377] [10] Robert Conetta, Tim Brookes, Francis Rumsey, Slawomir Zielinski, Martin Dewhirst, Philip Jackson, Søren Bech, David Meares, and Sunish George, “Spatial audio quality perception (part 2): A linear regression model,” J. Audio Eng. Soc, vol. 62, no. 12, pp. 847-860, 2015.
[0378] [11] ITU-R Rec. BS.1534-3, “Method for the subjective assessment of intermediate quality levels of coding systems,” International Telecommunication Union, Geneva, Switzerland, October 2015.
[0379] [12] Frank Baumgarte and Christof Faller, “Why binaural cue coding is better than intensity stereo coding,” in Audio Engineering Society Convention 112, April 2002.
[0380] [13] C. Avendano, “Frequency-domain source identification and manipulation in stereo mixes for enhancement, suppression and re-panning applications,” in 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 2003, pp. 55-58.
[0381] [14] Nicolas Tsingos, Emmanuel Gallo, and George Drettakis, “Perceptual audio rendering of complex virtual environments,” in ACM SIGGRAPH 2004 Papers, New York, N.Y., USA, 2004, SIGGRAPH '04, pp. 249-258, ACM.
[0382] [15] B. C. J. Moore and B. R. Glasberg, “A revision of Zwicker's loudness model,” Acustica United with Acta Acustica, vol. 82, no. 2, pp. 335-345, 1996.
[0383] [16] E. Zwicker, “Über psychologische und methodische Grundlagen der Lautheit [On the psychological and methodological bases of loudness],” Acustica, vol. 8, pp. 237-258, 1958.
[0384] [17] Ewan A. Macpherson and John C. Middlebrooks, “Listener weighting of cues for lateral angle: The duplex theory of sound localization revisited,” The Journal of the Acoustical Society of America, vol. 111, no. 5, pp. 2219-2236, 2002.
[0385] [18] Pablo Delgado, Jürgen Herre, Armin Taghipour, and Nadja Schinkel-Bielefeld, “Energy aware modeling of interchannel level difference distortion impact on spatial audio perception,” in Audio Engineering Society Conference: 2018 AES International Conference on Spatial Reproduction—Aesthetics and Science, July 2018.
[0386] [19] ISO/IEC JTC1/SC29/WG11, “USAC verification test report N12232,” International Organisation for Standardisation, 2011.
[0387] [20] Inyong Choi, Barbara G. Shinn-Cunningham, Sang Bae Chon, and Koeng-Mo Sung, “Objective measurement of perceived auditory quality in multichannel audio compression coding systems,” J. Audio Eng. Soc, vol. 56, no. 1/2, pp. 3-17, 2008.
[0388] [21] E. R. Hafter and Raymond Dye, “Detection of interaural differences of time in trains of high-frequency clicks as a function of interclick interval and number,” The Journal of the Acoustical Society of America, vol. 73, pp. 644-651, 1983.
Use of Directional Loudness for Audio Coding and Objective Quality Measurement
[0389] Please see the chapter “objective assessment of spatial audio quality using directional loudness maps” for further descriptions.
Description: (e.g., description of
[0390] A feature extracted from, for example, stereophonic/binaural audio signals in the spatial (stereo) auditory scene is presented. The feature is, for example, based on a simplified model of a stereo mix that extracts panning directions of events in the stereo image. The associated loudness in the stereo image for each panning direction in the Short-Time Fourier Transform (STFT) domain can be calculated. The feature is optionally computed for the reference and the coded signal and then compared to derive a distortion measure aiming to describe the perceived degradation score reported in a listening test. Results show improved robustness against low bitrate, non-waveform-preserving parametric techniques such as joint stereo and bandwidth extension when compared to existing methods. It can be integrated into standardized objective quality assessment measurement systems such as PEAQ or POLQA (PEAQ=Objective Measurements of Perceived Audio Quality; POLQA=Perceptual Objective Listening Quality Analysis).
Terminology
[0391] Signal: e.g., a stereophonic signal representing objects, downmixes, residuals, etc.
[0392] Directional Loudness Map (DirLoudMap): e.g., derived from each signal. Represents, for example, the loudness in the T/F (time/frequency) domain associated with each panning direction in the auditory scene. It can be derived from more than two signals by using binaural rendering (HRTF (head-related transfer function)/BRIR (binaural room impulse response)).
Applications (Embodiments)
[0393] 1. Automatic evaluation of quality (embodiment 1):
[0394] As described in the chapter “objective assessment of spatial audio quality using directional loudness maps”.
[0395] 2. Directional loudness-based bit distribution in the audio encoder (embodiment 2), based on the ratio (contribution) of the individual signals' DirLoudMaps to the overall DirLoudMap.
[0396] optional variation 1 (independent stereo pairs): audio signals as loudspeakers or objects.
[0397] optional variation 2 (downmix/residual pairs): contribution of the downmix signal's DirLoudMap and the residual's DirLoudMap to the overall DirLoudMap; the “amount of contribution” in the auditory scene serves as a bit distribution criterion.
[0398] 1. An audio encoder performing joint coding of two or more channels, resulting, for example, in one or more downmix and residual signals, in which the contribution of each residual signal to the overall directional loudness map is determined, e.g. from a fixed decoding rule (e.g. M/S stereo) or by estimating the inverse joint coding process from the joint coding parameters (e.g. rotation in MCT). Based on the residual signal's contribution to the overall DirLoudMap, the bit rate distribution between downmix and residual signal is adapted, e.g. by controlling the quantization precision of the signals, or by directly discarding residual signals whose contribution is below a threshold. Possible criteria for “contribution” are e.g. the average ratio or the ratio in the direction of maximum relative contribution.
[0399] Problem: combination and contribution estimation of the individual DirLoudMaps into the resulting/total loudness map.
[0400] 3.
(embodiment 3) For the decoder side, directional loudness can help the decoder make an informed decision on:
[0401] complexity scaling/format conversion: each audio signal can be included in or excluded from the decoding process based on its contribution to the overall DirLoudMap (transmitted as a separate parameter or estimated from other parameters), thereby changing the rendering complexity for different applications/format conversions. This enables decoding with reduced complexity when only limited resources are available (e.g. a multichannel signal rendered on a mobile device).
[0402] As the resulting DirLoudMap may depend on the target reproduction setup, this ensures that the most important/salient signals for the individual scenario are reproduced, an advantage over non-spatially informed approaches such as a simple signal/object priority level.
[0403] 4. For joint coding decision (embodiment 4) (e.g., description of
Directional Loudness for Audio Coding
Introduction and Definitions
DirLoudMap=Directional Loudness Map
Embodiment for Computing a DirLoudMap:
[0419] a) Perform t/f decomposition (+grouping into critical bands (CBs)) (e.g. by filter bank, STFT, . . . )
[0420] b) run a directional analysis function for each t/f tile
[0421] c) enter/accumulate the result of b) into a DirLoudMap histogram
optionally (if needed by the application):
[0422] d) summarize the output over CBs to provide a broadband DirLoudMap
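Steps b) to d) can be sketched as follows, assuming step a) has already produced t/f tiles per critical band. The level-difference direction function and the quarter-power loudness are illustrative assumptions, not the only choices the text allows, and `dirloudmap_histogram` is an illustrative name.

```python
import numpy as np

def dirloudmap_histogram(XL, XR, n_dirs=21):
    """Steps b)-d): per-tile direction analysis, accumulation into a
    DirLoudMap histogram over direction bins, then a broadband summary.
    XL, XR are t/f tile magnitudes, shape (bands, frames)."""
    eps = 1e-12
    aL, aR = np.abs(XL), np.abs(XR)
    d = (aL - aR) / (aL + aR + eps)          # step b): direction in [-1, 1]
    loud = (aL**2 + aR**2) ** 0.25           # simple per-tile loudness
    edges = np.linspace(-1, 1, n_dirs + 1)
    bins = np.clip(np.digitize(d, edges) - 1, 0, n_dirs - 1)
    hist = np.zeros((XL.shape[0], n_dirs))   # one histogram per CB
    for b in range(XL.shape[0]):
        np.add.at(hist[b], bins[b], loud[b]) # step c): accumulate
    return hist, hist.sum(axis=0)            # step d): broadband map
```

A fully left-panned input accumulates all of its loudness into the direction bin at the corresponding edge of the map.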
Embodiment of Level of DirLoudMap/directional analysis function:
[0423] Level 1 (optional): maps contribution directions according to the spatial reproduction position of the signals (channels/objects); no knowledge about the signal content is exploited. Uses a directional analysis function considering only the reproduction direction of the channel/object +/− a spreading window (this can be wide band, i.e. the same for all frequencies).
[0424] Level 2 (optional): maps contribution directions according to the spatial reproduction position of the signals (channels/objects) plus a *dynamic* function of the content of the channel/object signals (the directional analysis function) of different levels of sophistication.
[0425] This allows to identify:
[0426] optionally L2a) panned phantom sources (->panning index) [level], or
optionally L2b) level- and time-delay panned phantom sources [level and time], or
optionally L2c) widened (decorrelated) panned phantom sources (even more advanced)
Applications for Perceptual Audio Coding
[0427] Embodiment A) masking of each channel/object—no joint coding tools->target: controlling coder quantization noise (such that original and coded/decoded DirLoudMap deviate by less than a certain threshold, i.e. target criterion in DirLoudMap domain)
Embodiment B) masking of each channel/object—joint coding tools (e.g. M/S+prediction, MCT)
->target: controlling coder quantization noise in tool-processed signals (e.g. M or rotated “sum” signal) to meet target criterion in DirLoudMap domain
Example for B)
[0428] 1) calculate the overall DirLoudMap from, for example, all signals
[0429] 2) apply the joint coding tools
[0430] 3) determine the contribution of the tool-processed signals (e.g. “sum” and “residual”) to the DirLoudMap, taking the decoding function into account (e.g. panning by rotation/prediction)
[0431] 4) control quantization by
[0432] a) considering the influence of quantization noise on the DirLoudMap
[0433] b) considering the impact of quantizing signal parts to zero on the DirLoudMap
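Steps 1) and 3) can be sketched for an M/S pair as below, under stated assumptions: `dirloudmap_fn` is a hypothetical helper returning a directional loudness map for a channel pair, the fixed decoding rule is L=M+S, R=M−S, and the relative change of the map when the residual is zeroed serves as one possible "contribution" criterion (the text also names the average ratio).

```python
import numpy as np

def residual_contribution(L, R, dirloudmap_fn):
    """Estimate the residual (side) signal's contribution to the
    overall DirLoudMap for an M/S pair (sketch of embodiment B)."""
    M = 0.5 * (L + R)                 # downmix ("sum") signal
    S = 0.5 * (L - R)                 # residual ("side") signal, unused
    full = dirloudmap_fn(L, R)        # map with the residual present
    no_res = dirloudmap_fn(M, M)      # decode with residual zeroed out
    # contribution = relative change of the map when S is discarded
    return float(np.sum(np.abs(full - no_res)) / (np.sum(full) + 1e-12))

def keep_residual(L, R, dirloudmap_fn, threshold=0.05):
    """Discard the residual when its map contribution is below threshold."""
    return residual_contribution(L, R, dirloudmap_fn) >= threshold
```

For identical channels the residual contributes nothing and can be discarded; for anti-phase channels the whole map depends on it.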
Embodiment C) controlling application (e.g. MS on/off) and/or parameters (e.g., prediction factor) of joint coding tools
target: controlling encoder/decoder parameters of joint coding tools to meet target criterion in DirLoudMap domain
Examples for C)
[0434] control the M/S on/off decision based on the DirLoudMap
[0435] control the smoothing of frequency-dependent prediction factors based on the influence of varying the parameters on the DirLoudMap
[0436] (for cheaper differential coding of the parameters)
[0437] (=control of the trade-off between side info and prediction accuracy)
Embodiment D) determine parameters (on/off, ILD, . . . ) of *parametric* joint coding tools (e.g. intensity stereo)
->target: controlling the parameters of the parametric joint coding tool to meet a target criterion in the DirLoudMap domain
Embodiment E) Parametric encoder/decoder system transmitting the DirLoudMap as side information (rather than traditional spatial cues, e.g. ILD, ITD/IPD, ICC, . . . )
[0438] ->Encoder determines parameters based on analyzing the DirLoudMap, generates downmix signal(s) and (bit stream) parameters, e.g., the overall DirLoudMap+the contribution of each signal to the DirLoudMap
[0439] ->Decoder synthesizes the transmitted DirLoudMap by appropriate means
Embodiment F) Decoder/Renderer/FormatConverter complexity reduction
[0440] Determine the contribution of each signal to the overall DirLoudMap (possibly based on transmitted side info) to determine the “importance” of each signal. In applications with restricted computational capability, skip decoding/rendering of signals whose contribution to the DirLoudMap is below a threshold.
Generic Steps for Computing a Directional Loudness Map (DirLoudMap)
[0441] This is, for example, valid for any implementation: (e.g., description of
[0446] For several (e.g. each) frequency bands (loop):
[0447] b) compute, for example, a directional analysis function on the t/f tiles of the several audio input channels -> result: direction d (e.g. direction Ψ(m, k) or panning direction Ψ.sub.0,j)
[0448] c) compute, for example, a loudness on the t/f tiles of the several audio input channels
[0449] -> result: loudness L
[0450] The loudness computation could be simply the energy or, more sophisticated, the energy raised to an exponent (Zwicker model: alpha=0.25-0.27)
[0451] d.a) for example, enter/accumulate the loudness contribution L into the DirLoudMap under direction d
[0452] Optional: spreading (panning index: windowing) of the L contributions between adjacent directions
end for
optionally (if needed by the application): calculate a broadband DirLoudMap
[0453] d.b) summarize the DirLoudMap over several (e.g. all) frequency bands to provide a broadband DirLoudMap, indicating sound ‘activity’ as a function of direction/space
Example: Recovering Directional Signals with Windowing/Selection Function Derived from Panning Index (e.g., Description of
[0454] Left (see
[0455] The criterion function is, for example, arbitrarily defined as: Ψ=level.sub.l/level.sub.r.
[0456] The criterion is, for example, “panning direction according to level”, e.g. the level of each of several FFT bins.
[0457] a) From the criterion function we can extract a windowing/weighting function that selects the adequate frequency bins/spectral groups/components and recovers the directional signals. The input spectra (e.g. L and R) will thus be multiplied by different window functions Θ (one window function per panning direction Ψ.sub.0).
[0458] b) From the criterion function we have different directions associated with different values of Ψ (i.e. level ratios between L and R).
[0459] For recovering signals using method a)
Example 1) Panning direction center, Ψ.sub.0=1 (only keep bars that have the relationship Ψ=Ψ.sub.0=1). This is the directional signal (see
Example 2) Panning direction slightly to the left, Ψ.sub.0=4/2 (only keep bars that have the relationship Ψ=Ψ.sub.0=4/2). This is the directional signal (see
Example 3) Panning direction slightly to the right, Ψ.sub.0=3/4 (only keep bars that have the relationship Ψ=Ψ.sub.0=3/4). This is the directional signal (see
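Method a) from the examples above can be sketched as a hard bin selection. The tolerance parameter is an assumption for numerical robustness; the text's own examples imply exact equality Ψ=Ψ.sub.0:

```python
import numpy as np

def recover_directional_signal(L_spec, R_spec, psi_0, tol=1e-6):
    """Sketch of method a): build a binary window Theta that keeps only the
    bins whose level ratio Psi = level_l/level_r matches the panning
    direction Psi_0, then apply it to both input spectra."""
    L = np.asarray(L_spec, dtype=float)
    R = np.asarray(R_spec, dtype=float)
    psi = L / np.maximum(R, 1e-12)                      # criterion per bin
    theta = (np.abs(psi - psi_0) <= tol).astype(float)  # selection window
    return L * theta, R * theta                         # directional signal parts
```

For Example 1 (Ψ.sub.0=1), only bins where L and R have equal level survive the windowing, yielding the center-panned directional signal.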
[0460] A criterion function can be arbitrarily defined as the level of each DFT bin, the energy per DFT bin group (critical band), or the loudness per critical band. There can be different criteria for different applications.
Weighting (Optional)
[0461] Note: not to be confused with outer ear/middle ear (peripheral model) transfer function weighting, which weights, for example, critical bands.
[0462] Weighting: optionally, instead of taking the exact value of Ψ.sub.0, use a tolerance range and weight less importantly the values that deviate from Ψ.sub.0, i.e., “take all bars that obey a relationship of 4/3 and pass them with weight 1; values that are near, weight them with less than 1”. For this, the Gaussian function could be used. In the above examples, the directional signals would then have more bins, not weighted with 1, but with lower values.
[0463] Motivation: weighting enables a “smoother” transition between different directional signals, separation is not so abrupt since there is some “leaking” amongst the different directional signals.
[0464] For Example 3), it can look something like shown in
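The optional Gaussian weighting can be sketched as follows; the width parameter sigma is an assumption standing in for the tolerance range, which the text does not quantify:

```python
import numpy as np

def gaussian_window(psi, psi_0, sigma=0.2):
    """Sketch of the optional soft weighting: bins whose criterion value psi
    equals psi_0 get weight 1, and values that deviate from psi_0 get
    smoothly smaller weights (a Gaussian tolerance range instead of the
    hard selection used in method a))."""
    psi = np.asarray(psi, dtype=float)
    return np.exp(-0.5 * ((psi - psi_0) / sigma) ** 2)
```

Replacing the binary window Θ with these weights produces the “smoother” transition between directional signals described above, since neighboring directions now share leaked bin energy.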
Embodiments of Different Forms of Calculating the Loudness Maps Using Generalized Criterion Functions
Option 1: Panning Index Approach (See FIG. 3a and FIG. 3b):
[0465] For (all) different Ψ.sub.0, a “value” map for this function in time can be assembled. A so-called “directional loudness map” could be constructed either by:
[0466] Example 1) using a criterion function of “panning direction according to level of individual FFT bins”, so that directional signals are, for example, composed of individual DFT bins; then, for example, calculating the energy in each critical band (DFT bin group) for each directional signal, and then raising these energies per critical band to an exponent of 0.25 or similar → similar to the chapter “Objective assessment of spatial audio quality using directional loudness maps”. [0467] Example 2) Instead of windowing the amplitude spectrum, one can window the loudness spectrum. The directional signals will then be in the loudness domain already. [0468] Example 3) using directly a criterion function of “panning direction according to loudness of each critical band”.
Then the directional signals will be composed of chunks of whole critical bands that obey the values given by Ψ.sub.0. [0469] For example, for Ψ.sub.0=4/3 the directional signal could be:
Y=1*critical_band_1+0.2*critical_band_2+0.001*critical_band_3. [0470] Different combinations apply for other panning directions/directional signals. Note that, in the case of the use of weighting, different panning directions could contain the same critical bands, but most likely with different weight values. If weighting is not applied, the directional signals are mutually exclusive.
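Example 1 of Option 1 (energy per critical band of a recovered directional signal, raised to an exponent of about 0.25) can be sketched as follows. The band-edge representation is an assumption for illustration:

```python
import numpy as np

def band_loudness(directional_spec, band_edges, alpha=0.25):
    """Sketch of Option 1, Example 1: for a directional signal composed of
    individual DFT bins, compute the energy in each critical band (DFT bin
    group) and raise it to an exponent of about 0.25, yielding one loudness
    value per critical band."""
    x = np.asarray(directional_spec, dtype=float)
    loud = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        energy = np.sum(x[lo:hi] ** 2)   # energy per DFT bin group (critical band)
        loud.append(energy ** alpha)     # Zwicker-style compression exponent
    return np.asarray(loud)
```

Applying this to each directional signal, one per Ψ.sub.0, assembles the per-direction, per-band values of the directional loudness map.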
Option 2: Histogram Approach (See FIG. 4b):
[0471] This is a more general description of the overall directional loudness. It does not necessarily make use of the panning index (i.e., one does not need to recover “directional signals” by windowing the spectrum in order to calculate the loudness). The overall loudness of the frequency spectrum is “distributed” according to the “analyzed direction” in the corresponding frequency region. The direction analysis can be level-difference based, time-difference based, or of another form.
[0472] For each time frame (see
[0473] The resolution of the histogram H.sub.Ψ will be given, for example, by the amount of values given to the set of Ψ.sub.0. This is, for example, the amount of bins available for grouping occurrences of Ψ.sub.0 when evaluating Ψ within a time frame. Values are, for example, accumulated and smoothed over time, possibly with a “forgetting factor” α:
H.sub.Ψ(n)=αH.sub.Ψ(n−1)+(1−α)H.sub.Ψ,new(n)
with H.sub.Ψ,new(n) being, for example, the histogram accumulated within the current time frame.
[0474] Where n is the time frame index.
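The histogram approach with temporal smoothing can be sketched as follows; the nearest-bin grouping and the exponential smoothing form H(n) = αH(n−1) + (1−α)H_new(n) are assumptions consistent with the forgetting-factor description above:

```python
import numpy as np

def update_histogram(H_prev, psi_frame, psi_bins, loudness_frame, alpha=0.9):
    """Sketch of the histogram approach (Option 2): distribute the loudness
    of each frequency region of the current frame over direction bins
    according to its analyzed direction psi, then smooth over time with the
    forgetting factor alpha."""
    H_cur = np.zeros(len(psi_bins))
    for psi, loud in zip(psi_frame, loudness_frame):
        idx = int(np.argmin(np.abs(np.asarray(psi_bins) - psi)))
        H_cur[idx] += loud                  # accumulate occurrence, weighted by loudness
    # exponential smoothing over time frames n (assumed form)
    return alpha * np.asarray(H_prev) + (1.0 - alpha) * H_cur
```

A larger alpha keeps more of the previous frames' histogram, producing a slowly varying directional loudness description; alpha=0 would use only the current frame.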
[0475] While this invention has been described in terms of several advantageous embodiments, there are alterations, permutations, and equivalents, which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.