Sound Field Related Rendering
20220303710 · 2022-09-22
Inventors
Cpc classification
H04S2420/07
ELECTRICITY
H04S2400/11
ELECTRICITY
H04S2420/11
ELECTRICITY
G10L19/008
PHYSICS
H04R2203/12
ELECTRICITY
International classification
H04S7/00
ELECTRICITY
G10L19/008
PHYSICS
Abstract
An apparatus for spatial audio reproduction including circuitry configured to: obtain at least one focus parameter configured to define a focus shape; process a spatial audio signal that represents an audio scene to generate a processed spatial audio signal that represents a modified audio scene, so as to control relative emphasis in, at least in part, a portion of the spatial audio signal in the focus shape relative to at least in part other portions of the spatial audio signals outside the focus shape; and output the processed spatial audio signal, wherein the modified audio scene enables the relative emphasis in, at least in part, the portion of the spatial audio signal in the focus shape relative to at least in part other portions of the spatial audio signals outside the focus shape.
Claims
1. An apparatus comprising at least one processor and at least one non-transitory memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain at least one focus parameter configured to define a focus shape; process a spatial audio signal that represents an audio scene to generate a processed spatial audio signal that represents a modified audio scene, so as to control relative emphasis in, at least in part, a portion of the spatial audio signal in the focus shape relative to at least in part other portions of the spatial audio signals outside the focus shape; and output the processed spatial audio signal, wherein the modified audio scene enables the relative emphasis in, at least in part, the portion of the spatial audio signal in the focus shape relative to at least in part other portions of the spatial audio signals outside the focus shape.
2. The apparatus according to claim 1, wherein at least one focus parameter is further configured to define a focus amount, and the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to process the spatial audio signal so as to control relative emphasis in, at least in part, a portion of the spatial audio signal in the focus shape according to the focus amount.
3. The apparatus according to claim 1, wherein the processed spatial audio signal is configured to cause the apparatus to: increase relative emphasis in, at least in part, the portion of the spatial audio signal in the focus shape relative to at least in part other portions of the spatial audio signals outside the focus shape; or decrease relative emphasis in, at least in part, the portion of the spatial audio signal in the focus shape relative to at least in part other portions of the spatial audio signal outside the focus shape.
4. (canceled)
5. The apparatus according to claim 2, wherein the processed spatial audio signal is configured to cause the apparatus to increase or decrease a relative sound level in, at least in part, the portion of the spatial audio signal in the focus shape relative to at least in part other portions of the spatial audio signal outside the focus shape.
6. The apparatus according to claim 1, wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus: obtain reproduction control information to control at least one aspect of outputting the processed spatial audio signal, and wherein the apparatus is caused to output the processed spatial audio signal further causes the apparatus to one of: process the processed spatial audio signal that represents the modified audio scene to generate an output spatial audio signal in accordance with the reproduction control information; or process the spatial audio signal in accordance with the reproduction control information before processing the spatial audio signal that represents the modified audio scene and output the processed spatial audio signal as the output spatial audio signal.
7. The apparatus according to claim 2, wherein the spatial audio signal and the processed spatial audio signal comprise respective Ambisonic signals and wherein the processed spatial audio signal is configured to cause the apparatus, for one or more frequency sub-bands, to: convert the Ambisonic signals associated with the spatial audio signal to a set of beam signals in a defined pattern; generate a set of modified beam signals based on the set of beam signals, the focus shape and the focus amount; and convert the modified beam signals to generate the modified Ambisonic signals associated with the processed spatial audio signal.
8. The apparatus according to claim 7, wherein the defined pattern comprises a defined number of beams which are spaced over a plane or over a volume.
9. The apparatus according to claim 7, wherein the spatial audio signal and the processed spatial audio signal comprise at least one of: respective higher order Ambisonic signals; or a subset of Ambisonic signal components of an order.
10. (canceled)
11. The apparatus according to claim 2, wherein the spatial audio signal and the processed spatial audio signal comprise respective parametric spatial audio signals, wherein a parametric spatial audio signal comprises one or more audio channels and spatial metadata, wherein the spatial metadata comprises a respective direction indication, an energy ratio parameter, and a distance indication for a plurality of frequency sub bands, wherein the processed spatial audio signal is configured to cause the apparatus to: compute, for one or more frequency sub-bands, spectral adjustment factors based on the spatial metadata, the focus shape and focus amount; apply the spectral adjustment factors for the one or more frequency sub-bands of the one or more audio channels to generate one or more processed audio channels; compute respective modified energy ratio parameters associated with the one or more frequency sub-bands of the processed spatial audio signal based on the focus shape, the focus amount and at least a part of the spatial metadata; and compose the processed spatial audio signal comprising the one or more processed audio channels, the modified energy ratio parameters, and the spatial metadata other than the energy ratio parameters.
12. The apparatus according to claim 2, wherein the spatial audio signal and the processed spatial audio signal comprise multi-channel loudspeaker channels and/or audio object channels, wherein the processed spatial audio signal is configured to cause the apparatus to: compute gain adjustment factors based on the respective audio channel direction indication, the focus shape and focus amount; apply the gain adjustment factors to the respective audio channels; and compose the processed spatial audio signal comprising the one or more processed multichannel loudspeaker audio channels and/or the one or more processed audio object channels.
13. The apparatus according to claim 12, wherein the multi-channel loudspeaker channels and/or audio object channels further comprises respective audio channel distance indication, and wherein the computed gain adjustment factors are further based on the audio channel distance indication.
14. The apparatus according to claim 12, wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to determine a default respective audio channel distance, and wherein the computed gain adjustment factors are further based on the audio channel distance.
15. The apparatus according to claim 1, wherein the at least one focus parameter configured to define a focus shape comprises at least one of: a focus direction; a focus width; a focus height; a focus radius; a focus distance; a focus depth; a focus range; a focus diameter; or a focus shape characterizer.
16. The apparatus according to claim 1, wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to obtain a focus input from a sensor arrangement that comprises at least one direction sensor and at least one user input, wherein the focus input comprises: an indication of a focus direction for the focus shape based on the at least one direction sensor direction; and an indication of a focus width based on the at least one user input.
17. The apparatus according to claim 2, wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to obtain a focus input comprising at least one user input, and wherein the focus input further comprises an indication of the focus amount based on the at least one user input.
18. (canceled)
19. A method comprising: obtaining at least one focus parameter configured to define a focus shape; processing a spatial audio signal that represents an audio scene to generate a processed spatial audio signal that represents a modified audio scene, so as to control relative emphasis in, at least in part, a portion of the spatial audio signal in the focus shape relative to at least in part other portions of the spatial audio signal outside the focus shape; and outputting the processed spatial audio signal, wherein the modified audio scene enables the relative emphasis in, at least in part, the portion of the spatial audio signal in the focus shape relative to at least in part other portions of the spatial audio signal outside the focus shape.
20.-21. (canceled)
22. The apparatus according to claim 5, wherein the processed spatial audio signal is configured to cause the apparatus to increase or decrease the relative sound level according to the focus amount.
23. The method according to claim 19, wherein at least one focus parameter comprises defining a focus amount and processing the spatial audio signal comprises controlling relative emphasis in, at least in part, the portion of the spatial audio signal in the focus shape according to the focus amount.
24. The method according to claim 19, wherein processing the spatial audio signal comprises: increasing relative emphasis in, at least in part, the portion of the spatial audio signal in the focus shape relative to at least in part other portions of the spatial audio signals outside the focus shape; or decreasing relative emphasis in, at least in part, the portion of the spatial audio signal in the focus shape relative to at least in part other portions of the spatial audio signal outside the focus shape.
25. The method according to claim 23, wherein processing the spatial audio signal comprises at least one of: increasing or decreasing a relative sound level in, at least in part, the portion of the spatial audio signal in the focus shape relative to at least in part other portions of the spatial audio signal outside the focus shape; or increasing or decreasing the relative sound level according to the focus amount.
Description
SUMMARY OF THE FIGURES
[0069] For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
[0070]-[0085] (Figure descriptions are not reproduced in this text.)
EMBODIMENTS OF THE APPLICATION
[0086] The following describes in further detail suitable apparatus and possible mechanisms for the provision of efficient rendering and playback of spatial audio signals.
[0087] Previous spatial audio signal playback examples allow the user to control the focus direction and the focus amount. However, in some situations, such control of the focus direction and amount may not be sufficient, and it may be desirable to provide the user with a control interface for controlling the shape of the focus. In a sound field, there may be a number of different features, such as multiple dominant sound sources in certain viewing directions as well as ambient sounds. Some users may prefer to hear certain features of the sound field whereas others may prefer to hear alternative features, depending on which viewing direction is desirable. It is understood that such playback audio is dependent on one or more preferences and can be configured based on user-related preferences. The desired performance of the playback apparatus is to configure playback of the spatial sound so that the focus on various shapes or areas (e.g., narrow, wide, shallow, deep, near, far) can be controlled.
[0088] As an example, there may be audio content of interest within a sector (or a cone or another spatial span or range) rather than simply in one direction. Specifically it may be useful to control the spatial span of the focus. The
[0089] For example at first focus to all sources in the theatre play by keeping the focus sector relatively wide (as shown in
[0090] As another example, the desired or interesting audio content may be at a certain distance (with respect to the listener or with respect to another position). For example there may be an undesired or uninteresting audio source at a certain distance in a certain direction and a desired or an interesting audio source at another distance in the same direction (or nearly the same direction). This is shown in
[0091] Hence, the embodiments as discussed herein attempt to provide control of the focus shape (in addition to the focus direction and amount). The concept as discussed with respect to the embodiments described herein relates to spatial audio reproduction in media playback with multiple viewing directions by providing control of the audio focus shape where the audio scene over the controlled audio focus shape changes but the signal format can remain the same.
[0092] The embodiments provide at least one focus shape parameter corresponding to a selectable direction by adjusting any one (or a combination of two or more, or all) of the following parameters corresponding to the selected direction: focus width; focus height; focus radius; focus distance; and focus depth. This parameter set in some embodiments comprises parameters which define any arbitrary shape. The spatial audio signal processing can in some embodiments be performed by: obtaining spatial audio signals associated with the media with multiple viewing directions; obtaining the focus direction and amount parameters; obtaining at least one focus shape parameter; modifying the spatial audio signals to have the desired focus characteristics; and reproducing the modified spatial audio signals (with headphones or loudspeakers).
[0093] The obtained spatial audio signals may, for example, be: Ambisonic signals; loudspeaker signals; parametric spatial audio formats such as a set of audio channels and the associated spatial metadata.
[0094] The focus shape may in some embodiments depend on which parameters are available. For example, in the case of having only direction, width, and height, the shape may be an ellipsoid cone-type volume. As another example, in the case of having only distance and depth, the focus shape may be a hollow sphere. In the case of not having width/height and/or depth, they may be considered to have some default value. Moreover, in some embodiments, an arbitrary focus shape may be used.
[0095] The focus amount may in some embodiments determine the ‘degree’ or how much to focus. For example the focus may be from 0% to 100%, where 0% means keeping the original sound scene unmodified, and 100% means focusing maximally on the desired spatial shape.
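As an illustration of how such a parameter set could be held in software, the following Python sketch groups the focus parameters named above into one structure. The class name `FocusParameters`, its field names, the use of radians, and the helper `shape_kind` are hypothetical and are not part of the described embodiments; the shape labels restate the two examples given above, and the "bounded region" label is an added assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FocusParameters:
    """Hypothetical container for focus parameters (all names are assumptions)."""
    direction_azimuth: float = 0.0      # focus direction, radians
    amount: float = 0.0                 # 0.0 = unmodified scene, 1.0 = maximal focus
    width: Optional[float] = None       # angular width, radians
    height: Optional[float] = None      # angular height, radians
    distance: Optional[float] = None    # focus distance
    depth: Optional[float] = None       # focus depth

    def __post_init__(self) -> None:
        # The text describes the focus amount as 0% to 100%; here it is mapped to 0..1.
        if not 0.0 <= self.amount <= 1.0:
            raise ValueError("focus amount must lie in [0, 1]")

    def shape_kind(self) -> str:
        """Return a label for the focus shape implied by the available parameters."""
        has_angular = self.width is not None and self.height is not None
        has_radial = self.distance is not None and self.depth is not None
        if has_angular and not has_radial:
            return "ellipsoid cone"     # only direction, width and height available
        if has_radial and not has_angular:
            return "hollow sphere"      # only distance and depth available
        if has_angular and has_radial:
            return "bounded region"     # assumption: angular and radial limits combined
        return "default"                # missing values fall back to default values
```

For example, `FocusParameters(amount=0.5, width=0.3, height=0.2).shape_kind()` yields the cone-type label, while supplying only `distance` and `depth` yields the hollow-sphere label.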
[0096] In some embodiments different users may want to have different focus characteristics and the original spatial audio signals may be individually modified and reproduced for each user, based on their individual preferences.
[0097] In the illustration of
[0098] Typically, the input audio signal and the audio signal with a focused sound component are provided in the same predefined spatial format, whereas the output audio signal may be provided in the same spatial format as applied for the input audio signal (and the audio signal with a focused sound component) or a different predefined spatial format may be employed for the output audio signal. The spatial audio format of the output audio signal is selected in view of the characteristics of the sound reproduction hardware applied for playback for the output audio signal.
[0099] In general, the input audio signal may be provided in a first predetermined spatial audio format and the output audio signal may be provided in a second predetermined spatial audio format. Non-limiting examples of spatial audio formats suitable for use as the first and/or second spatial audio format include Ambisonics, surround loudspeaker signals according to a predefined loudspeaker configuration, and a predefined parametric spatial audio format. More detailed non-limiting examples of usage of these spatial audio formats in the framework of the spatial audio processing arrangement 250 as the first and/or second spatial audio format are provided later in this disclosure.
[0100] The spatial audio processing arrangement 250 is typically applied to process the input spatial audio signal 200 as a sequence of input frames into a respective sequence of output frames, each input (output) frame including a respective segment of digital audio signal for each channel of the input (output) spatial audio signal, provided as a respective time series of input (output) samples at a predefined sampling frequency. In some embodiments the input signal to the spatial audio processing arrangement 250 can be an encoded form, for example AAC, or AAC+ embedded metadata. In such embodiments the encoded audio input can be initially decoded. Similarly in some embodiments, the output from the spatial audio processing arrangement 250 could be encoded in any suitable manner.
[0101] In a typical example, the spatial audio processing arrangement 250 employs a fixed predefined frame length such that each frame comprises respective L samples for each channel of the input spatial audio signal, which at the predefined sampling frequency maps to a corresponding duration in time. As an example in this regard, the fixed frame length may be 20 milliseconds (ms), which at a sampling frequency of 8, 16, 32 or 48 kHz results in a frame of L=160, L=320, L=640 and L=960 samples per channel, respectively. The frames may be non-overlapping or they may be partially overlapping, depending on whether the processors apply filter banks and how these filter banks are configured. These values, however, serve as non-limiting examples, and frame lengths and/or sampling frequencies different from these examples may be employed instead, depending e.g. on the desired audio bandwidth, on the desired framing delay and/or on the available processing capacity.
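The relationship between frame duration, sampling frequency and the frame length L can be checked with a short Python sketch. The function name `samples_per_frame` is hypothetical and is used only for this illustration.

```python
def samples_per_frame(frame_ms: float, sample_rate_hz: int) -> int:
    """Number of samples per channel in one frame of the given duration."""
    return round(frame_ms * sample_rate_hz / 1000)


# Reproduce the 20 ms examples from the text.
for rate in (8000, 16000, 32000, 48000):
    print(rate, samples_per_frame(20, rate))
```

The loop prints 160, 320, 640 and 960 samples for the four sampling frequencies, matching the values of L listed above.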
[0102] In the spatial audio processing arrangement 250, the focus refers to a user-selectable spatial region of interest. The focus may be, for example, a certain direction, distance, radius or arc of the audio scene in general. In another example, the focus may be a region in which a (directional) sound source of interest is currently positioned. In the former scenario, the user-selectable focus typically denotes a region that stays constant or changes infrequently since the focus is predominantly in a specific spatial region, whereas in the latter scenario the user-selected focus may change more frequently since the focus is set to a certain sound source that may (or may not) change its position/shape/size in the audio scene over time. In an example, the focus may be defined as an azimuth angle that defines the spatial direction of interest with respect to a first predefined reference direction and/or as an elevation angle that defines the spatial direction of interest with respect to a second predefined reference direction and/or as a distance and/or radius and/or shape parameter.
[0103] The functionality described in the foregoing with references to components of the spatial audio processing arrangement 250 may be provided, for example, in accordance with a method 260 illustrated by a flowchart depicted in
[0104] The method 260 may be varied in a plurality of ways, for example in accordance with examples pertaining to respective functionality of components of the spatial audio processing arrangement 250 provided in the foregoing and in the following.
[0105] In some embodiments the input to the spatial audio processing arrangement 250 is Ambisonic signals. The apparatus can be configured to receive (and the method can be applied to) Ambisonic signals of any order. However, as the first-order Ambisonic (FOA) signal is fairly broad in terms of spatial selectivity (specifically, first-degree directivity), fine control of the focus shape is better exemplified with higher-order Ambisonics (HOA), which have higher spatial selectivity.
[0106] In particular, in the following examples the method and apparatus are configured to receive 3rd-order Ambisonic audio signals.
[0107] 3rd-order Ambisonic audio signals have 16 beam pattern signals in total (in 3D). However, for simplicity, the following example considers only those 7 Ambisonic components (in other words, audio signals) that are more “horizontal”, as shown in
[0108] With respect to
[0109] The input to the focus processor 550 in this example, as described above, is a subset of the 3rd-order Ambisonic signal, for example the subsets 309 and 311. The 3rd-order Ambisonic signal x_HOA(t) 500 is also referred to in the following as HOA for simplicity. A signal x(t), where t is the discrete sample index, arriving from horizontal azimuth θ can be represented as a HOA signal by:
x_HOA(t)=a(θ)x(t),
[0110] where a(θ) is the vector of Ambisonic weights for azimuth θ. As seen in this equation, the selected subset of the Ambisonic patterns can be defined with these very simple mathematical expressions in the horizontal plane.
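For illustration only, one commonly used form of the seven horizontal weights is the set of circular harmonics up to order 3. The ordering and the unit normalization below are assumptions and are not taken from this text; an actual Ambisonic convention may scale or order the components differently.

```latex
\mathbf{x}_{HOA}(t) = \mathbf{a}(\theta)\,x(t), \qquad
\mathbf{a}(\theta) =
\begin{bmatrix}
1 \\ \cos\theta \\ \sin\theta \\ \cos 2\theta \\ \sin 2\theta \\ \cos 3\theta \\ \sin 3\theta
\end{bmatrix}
```

Under this assumption, each component is a simple trigonometric function of the azimuth, which is why the horizontal subset is convenient for a worked example.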
[0111] In some embodiments the focus processor 550 comprises a matrix processor 501. The matrix processor 501 is configured in some embodiments to convert the Ambisonic (HOA) signals 500 (corresponding to Ambisonic or spherical harmonic patterns) to a set of beam signals (corresponding to beam patterns) in 7 evenly spaced horizontal directions. This in some embodiments may be represented by a transformation matrix T(θ_f), where θ_f is the focus direction 502 parameter:
x_c(t)=T(θ_f)x_HOA(t).
Note that the transformation includes processing based on the focus direction θ_f 502 parameter such that the first pattern is aligned to the focus direction and the other patterns are aligned to other, symmetrically spaced directions.
[0112] For example, when θ_f=20 degrees, the beam patterns corresponding to the transformed signals x_c(t) 504 and the beam patterns corresponding to the original HOA signals are shown in
[0113] The focus processor 550 may further comprise a spatial beams (based on focus parameters) processor 503. The spatial beams processor 503 is configured to receive the transformed Ambisonic signals x_c(t) 504 from the matrix processor 501 and furthermore receive the focus amount and width focus parameters 508.
[0114] The spatial beams processor 503 is then configured to modify the spatial beam signals x_c(t) 504 to generate processed or modified spatial beam signals x′_c(t) 506 based on the focus amount and shape parameters 508. The processed or modified spatial beam signals x′_c(t) 506 can then be output to a further matrix processor 505. The spatial beams processor 503 is configured to implement various processing methods based on the types of focus shape parameters. In this example embodiment the focus parameters are focus direction, focus width, and focus amount. The focus amount can be determined as a value a ranging from 0 to 1, where 1 denotes the maximum focus. The focus width θ_w (determined as the angle from the focus direction to the edge of the focus arc) is also a variable or controllable parameter. The modified spatial beam signals can be generated by
x′_c(t)=I(θ_w, a)x_c(t),
[0115] where I(θ_w, a) is a diagonal matrix with its diagonal elements determined as i(θ_w, a), where
[0116] It should be noted that the beams x_c(t) are in this example formulated in such a manner that the first beam points towards the focus direction, the second beam towards the focus direction+p, and so on. As a result, when applying the matrix I(θ_w, a), the beams farther away from the focus direction will be attenuated depending on the focus width parameter.
[0117] The focus processor 550 comprises a further matrix processor 505. The further matrix processor 505 is configured to receive the processed or modified spatial beam signals x′_c(t) 506 and the focus direction 502 and to inverse transform the result to generate the focus-processed HOA signals. The transformation matrix T(θ_f) is invertible, and therefore the inversion processing can be expressed as
x′_HOA(t)=T^{−1}(θ_f)x′_c(t),
[0118] where x′_HOA(t) is the focus-processed HOA output 510.
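The three stages described above (transform to beams, apply a diagonal gain matrix, inverse transform) can be sketched in plain Python for a single sample. This is a minimal sketch under stated assumptions, not the embodiment itself: the weight vector in `ambi_weights` assumes unnormalized circular harmonics, `transform_matrix` assumes beams with unit response in their own direction and zero response in the other six beam directions, and `beam_gains` is a placeholder (unity inside the focus arc, 1 − a outside) because the diagonal elements i(θ_w, a) are not reproduced in this text. All function names are hypothetical.

```python
import math

N_BEAMS = 7  # seven evenly spaced horizontal beams, as in the text


def ambi_weights(theta):
    """Assumed horizontal weights a(theta): circular harmonics up to order 3."""
    return [1.0,
            math.cos(theta), math.sin(theta),
            math.cos(2 * theta), math.sin(2 * theta),
            math.cos(3 * theta), math.sin(3 * theta)]


def beam_directions(theta_f):
    """Beam directions; the first beam points at the focus direction."""
    return [theta_f + 2.0 * math.pi * n / N_BEAMS for n in range(N_BEAMS)]


def transform_matrix(theta_f):
    """Hypothetical T(theta_f): row n extracts the beam pointing at direction n."""
    rows = []
    for phi in beam_directions(theta_f):
        w = ambi_weights(phi)
        rows.append([w[0] / N_BEAMS] + [2.0 * v / N_BEAMS for v in w[1:]])
    return rows


def inverse_transform_matrix(theta_f):
    """Inverse of the hypothetical T: its columns are a(phi_n) for each beam direction."""
    cols = [ambi_weights(phi) for phi in beam_directions(theta_f)]
    return [[cols[n][m] for n in range(N_BEAMS)] for m in range(N_BEAMS)]


def beam_gains(theta_w, a):
    """Placeholder diagonal of I(theta_w, a): 1 inside the focus arc, 1 - a outside."""
    gains = []
    for n in range(N_BEAMS):
        offset = 2.0 * math.pi * n / N_BEAMS
        offset = min(offset, 2.0 * math.pi - offset)  # angular distance to focus direction
        gains.append(1.0 if offset <= theta_w else 1.0 - a)
    return gains


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def focus_hoa_sample(x_hoa, theta_f, theta_w, a):
    """Apply the focus to one 7-component HOA sample."""
    x_c = matvec(transform_matrix(theta_f), x_hoa)                 # HOA -> beams
    x_c_mod = [g * x for g, x in zip(beam_gains(theta_w, a), x_c)]  # diagonal gains
    return matvec(inverse_transform_matrix(theta_f), x_c_mod)      # beams -> HOA
```

With a=0 the sample passes through unchanged. With a=1, a source arriving exactly from the focus direction is preserved, while a source arriving exactly from a beam direction outside the focus arc is removed under this placeholder gain.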
[0119] With respect to
[0120] In the above examples, HOA processing of only a set of more “horizontal” beam pattern signals was shown. It would be understood that these operations can be extended to 3D, using a set of beam patterns in 3D.
[0121] With respect to
[0122] The initial operation is receiving the HOA audio signals (and the focus parameters such as direction, width, amount or other control information) as shown in
[0123] The next operation is the transforming of the HOA audio signals into beam signals as shown in
[0124] Having transformed the HOA audio signals into beam signals then the next operation is one of spatial beams processing as shown in
[0125] The processed beam audio signals are then inverse transformed back into a HOA format as shown in
[0126] The processed HOA audio signals are then output as shown in
[0127] With respect to
[0128] Services) audio stream, which can be decoded and demultiplexed to the form of spatial metadata and audio channels. A typical number of audio channels in such a parametric spatial audio stream is two; however, in some embodiments any number of audio channels can be used.
[0129] In these examples the parametric information comprises depth/distance information, which may be implemented in 6-degrees of freedom (6DOF) reproduction. In 6DOF, the distance metadata is used (along with the other metadata) to determine how the sound energy and direction should change as a function of user movement.
[0130] Therefore in this example each spatial metadata direction parameter is associated both with a direct-to-total energy ratio and a distance parameter. The estimation of distance parameters in context of parametric spatial audio capture has been detailed in earlier applications such as GB patent applications GB1710093.4 and GB1710085.0 and is not explored further for clarity reasons.
[0131] The focus processor 850 configured to receive parametric (in this case 6DOF-enabled) spatial audio 800 is configured to use the focus parameters (which in these examples are focus direction, amount, distance, and radius) to determine how much the direct and ambient components of the parametric spatial audio signal should be attenuated or emphasized to enable the focus effect.
[0132] In the following example the method (and the formulas) are expressed without any variation over time; however, it should be understood that all the parameters may vary over time.
[0133] In some embodiments the focus processor comprises a ratio modifier and spectral adjustment factor determiner 801 which is configured to receive the focus parameters 808 and additionally the spatial metadata consisting of directions 802, distances 822, and direct-to-total energy ratios 804 in frequency bands.
[0134] The ratio modifier and spectral adjustment factor determiner is configured to implement the focus shape as a sphere in 3D space. First, the focus direction and distance are converted to a Cartesian coordinate system (3×1 y-z-x vector f) by
[0135] Similarly, at each frequency band k, the spatial metadata directions and distances are converted into the Cartesian coordinate system (3×1 y-z-x vector m(k)) by
[0136] The units of the spatial metadata distance and focus distance parameters should be the same (e.g., both in meters, or in any other scale). A mutual distance value d(k) between f and m(k) may be formulated simply as:
d(k)=|f−m(k)|,
[0137] which here means the length of the vector (f−m(k)).
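The conversion to Cartesian coordinates and the mutual distance d(k) can be sketched as follows. The conversion formulas are not reproduced in this text, so the axis convention in `to_cartesian` (x front, y left, z up, angles in radians) is an assumption, and both function names are hypothetical. The mutual distance does not depend on the ordering of the vector components, provided the same convention is used for f and m(k).

```python
import math


def to_cartesian(azimuth, elevation, distance):
    """Convert direction and distance to (x, y, z); the axis convention is an assumption."""
    return (distance * math.cos(elevation) * math.cos(azimuth),
            distance * math.cos(elevation) * math.sin(azimuth),
            distance * math.sin(elevation))


def mutual_distance(f, m):
    """Length of the vector (f - m), i.e. d(k) = |f - m(k)|."""
    return math.dist(f, m)


# Focus point 1 unit in front; metadata point 1 unit to the side (90 degrees azimuth).
f = to_cartesian(0.0, 0.0, 1.0)
m_k = to_cartesian(math.pi / 2, 0.0, 1.0)
d_k = mutual_distance(f, m_k)  # equals sqrt(2) up to rounding
```

Both distances must be given in the same units for d(k) to be meaningful, as noted above.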
[0138] The mutual distance value d(k) is then utilized in a gain function along with the focus amount parameter a, which ranges from 0 to 1, and the focus radius parameter d_r (in the same units as d(k)). When focus is performed, an example gain formula is
[0139] where c is a gain constant for the focus, for example a value of 4.
[0140] In practice, it may be desirable to smooth the above functions such that the focus gain function smoothly transitions from a high value at the focus area to a low value at the non-focused area.
[0141] Then a new direct portion value D(k) of the parametric spatial audio signal can be formulated as
D(k)=r(k)*f(k),
[0142] where r(k) is the direct-to-total energy ratio value at band k. A new ambient portion value A(k) can be formulated as
A(k)=(1−r(k))*(1−a).
[0143] The spectral correction factor s(k) that is output 812 to a spectral adjustment processor 803 is then formulated based on the overall modification of the sound energy, in other words,
s(k)=\sqrt{D(k)+A(k)}.
[0144] A new modified direct-to-total energy ratio parameter r′(k) is then formulated to replace r(k) in the spatial metadata:
r′(k)=D(k)/(D(k)+A(k)).
[0145] In the numerically undetermined case D(k)=A(k)=0, r′(k) can be set to zero.
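The per-band computation can be sketched in Python as below. The focus gain formula f(k) is not reproduced in this text, so `focus_gain` is a hypothetical placeholder: it blends from unity towards the gain constant c inside the focus radius and towards zero outside it, as the focus amount a goes from 0 to 1. The modified ratio is computed as the direct share of the modified total energy, which is what a direct-to-total energy ratio denotes. Function names are assumptions.

```python
import math


def focus_gain(d, d_r, a, c=4.0):
    """Placeholder direct gain f(k); the actual formula is not reproduced in the text."""
    if d <= d_r:
        return (1.0 - a) + a * c   # inside the focus sphere: emphasize
    return 1.0 - a                 # outside the focus sphere: attenuate


def adjust_band(r, d, d_r, a, c=4.0):
    """Return (s, r_new) for one band from ratio r, mutual distance d and focus parameters."""
    f = focus_gain(d, d_r, a, c)
    direct = r * f                         # D(k) = r(k) * f(k)
    ambient = (1.0 - r) * (1.0 - a)        # A(k) = (1 - r(k)) * (1 - a)
    total = direct + ambient
    s = math.sqrt(total)                   # s(k) = sqrt(D(k) + A(k))
    r_new = direct / total if total > 0.0 else 0.0  # zero in the undetermined case
    return s, r_new
```

With a=0 the band is left untouched (s=1 and the ratio is unchanged). With a=1 and a source outside the focus radius, both portions vanish, so s=0 and the ratio is set to zero.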
[0146] The direction and distance parameters of the spatial metadata may in some embodiments be left unmodified by the ratio modifier and spectral adjustment factor determiner 801, and the modified and unmodified metadata are provided as the output 810.
[0147] The focus processor 850 may comprise a spectral adjustment processor 803. The spectral adjustment processor 803 may be configured to receive the audio signals 806 and the spectral adjustment factors 812. The audio signals can in some embodiments be in a time-frequency representation, or alternatively they are first transformed to the time-frequency domain for the spectral adjustment processing. The output 814 can also be in the time-frequency domain, or it can be inverse transformed to the time domain before being output. The domain of the input and output depends on the implementation.
[0148] The spectral adjustment processor 803 may be configured to perform the spectral adjustment by multiplying, for each band k, the frequency bins (of the time-frequency transform) of all channels within the band k by the spectral adjustment factor s(k). The multiplication (i.e., spectral correction) may be smoothed over time to avoid processing artefacts.
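A minimal Python sketch of this step for one frame is given below. The band layout (`band_edges` as start and stop bin indices), the one-pole smoothing with coefficient `alpha`, and the function name are assumptions made for the illustration; the text does not specify how the smoothing over time is done.

```python
def apply_spectral_adjustment(bins, band_edges, s, prev_s=None, alpha=0.8):
    """Scale the frequency bins of every channel by the per-band factor s(k).

    bins:       list of channels, each a list of complex bins for one frame
    band_edges: list of (start, stop) bin index pairs, one per band k (assumed layout)
    s:          list of spectral adjustment factors s(k) for this frame
    prev_s:     smoothed factors from the previous frame, or None for the first frame
    alpha:      hypothetical smoothing coefficient (0 = no smoothing)
    Returns the adjusted bins and the smoothed factors to carry to the next frame.
    """
    if prev_s is None:
        smoothed = list(s)
    else:
        smoothed = [alpha * p + (1.0 - alpha) * cur for p, cur in zip(prev_s, s)]
    adjusted = [list(channel) for channel in bins]   # copy, leave the input untouched
    for k, (start, stop) in enumerate(band_edges):
        for channel in adjusted:
            for b in range(start, stop):
                channel[b] *= smoothed[k]
    return adjusted, smoothed
```

The same factor is applied to all channels within a band, so the inter-channel relationships within that band are preserved.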
[0149] In other words, the processor is configured to modify the spectrum of the signal and the spatial metadata such that the procedure results in a parametric spatial audio signal that has been modified according to the focus parameters (in this case: focus direction, amount, distance, radius).
[0150] With respect to
[0151] The initial operation is receiving the parametric spatial audio signals (and focus parameters or other control information) as shown in
[0152] The next operation is the modifying of the parametric metadata and generating the spectral adjustment factors as shown in
[0153] The next operation is making a spectral adjustment to the audio signals as shown in
[0154] With respect to
[0155] For audio objects which have a direction and a distance (i.e., a position), the focus gain determiner 901 can utilize the same implementation processing as expressed in context of the parametric audio processing to determine the direct-gain f(k) 912 based on the spatial metadata and the focus parameters. In these embodiments there is no filter bank. In other words, there is only one frequency band k.
[0156] The focus processor furthermore may comprise a focus gain processor (for each channel) 903. The focus gain processor 903 is configured to receive the focus gains f(k) 912 for each audio channel and the audio signals 906. The focus gains 912 can then be applied to the corresponding audio channel signals 906 (and in some embodiments furthermore be temporally smoothed). The output from the focus gain processor 903 may be a focus-processed audio channel audio signal 914.
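One possible way to apply the per-channel focus gains with temporal smoothing is sketched below. The linear ramp from the previous frame's gain to the current gain is an assumed smoothing choice, and the function name is hypothetical.

```python
import numpy as np

def apply_focus_gains(frame, gains, prev_gains):
    """Apply one focus gain per channel to a time-domain frame.

    frame      : array, shape (channels, samples)
    gains      : focus gain for each channel in this frame
    prev_gains : focus gain for each channel in the previous frame
    The gain is interpolated linearly across the frame (an assumed smoothing
    scheme) so that gain changes do not produce audible steps.
    """
    frame = np.asarray(frame, dtype=float)
    gains = np.asarray(gains, dtype=float)
    prev_gains = np.asarray(prev_gains, dtype=float)
    n = frame.shape[1]
    ramp = np.linspace(0.0, 1.0, n, endpoint=False)  # 0 at the first sample
    per_sample = prev_gains[:, None] + (gains - prev_gains)[:, None] * ramp[None, :]
    return frame * per_sample
```

Because the ramp excludes its end point, the next frame starts exactly at the new gain, which keeps the gain trajectory continuous across frame boundaries.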
[0157] In these examples the channel directional/positional information 902 is unaltered and also provided as a channel directional/positional information output 910.
[0158] In some embodiments when the input audio channels do not have distance information (e.g., the input is loudspeaker or object sound with only directions but not distance) one option to handle such audio channels is to determine a fixed default distance for such signals and apply the same formula to determine f(k).
[0159] In some embodiments determining the focus gain f(k) 912 for such audio channels may be based on the angular difference between the focus direction and the direction of the audio channel. In some embodiments this may first determine a focus width θ.sub.w. For example as shown in
Then the angle θ.sub.a between the focus direction and the direction of the audio channel is determined (for each audio channel individually). A formula similar to that discussed above can then be used to determine f(k), where d.sub.r is replaced by θ.sub.w and d(k) is replaced by θ.sub.a (when determining the focus gain for the audio channels without the distance information). In some embodiments, when the focus radius is larger than the focus distance, the asin function above is not defined, and a large value (e.g., π) can be used for the focus width θ.sub.w.
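The two angular quantities can be sketched as follows. The arcsine of the ratio of focus radius to focus distance is an assumed form of the focus width, inferred from the remark that the function is undefined when the radius exceeds the distance; the function names and the azimuth/elevation convention are likewise illustrative assumptions.

```python
import math
import numpy as np

def focus_width(focus_distance, focus_radius):
    """Focus width theta_w in radians (assumed form: asin(radius / distance)).

    Falls back to pi when asin is undefined, i.e. when the focus radius is
    larger than the focus distance (or the distance is not positive).
    """
    if focus_distance <= 0.0 or focus_radius > focus_distance:
        return math.pi
    return math.asin(focus_radius / focus_distance)

def direction_vector(azimuth, elevation):
    """Unit vector for a direction given in radians."""
    return np.array([
        math.cos(elevation) * math.cos(azimuth),
        math.cos(elevation) * math.sin(azimuth),
        math.sin(elevation),
    ])

def angle_between(focus_dir, channel_dir):
    """Angle theta_a in radians between two (azimuth, elevation) directions."""
    dot = float(np.dot(direction_vector(*focus_dir), direction_vector(*channel_dir)))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding errors
```

The resulting θ.sub.w and θ.sub.a would then take the places of d.sub.r and d(k) in the gain formula, which is not reproduced in this sketch.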
[0160] With respect to
[0161] The initial operation is receiving the multichannel/object audio signals (and focus parameters or other control information and channel information such as directions/distances) as shown in
[0162] The next operation is generating the focus gain factors as shown in
[0163] The next operation is applying a focus gain to each channel audio signal as shown in
[0164] The processed audio signals and the unmodified channel directions (and distances) can then be output as shown in
[0165] With respect to
[0166] In these examples the reproduction processor may comprise an Ambisonic rotation matrix processor 1101. The Ambisonic rotation matrix processor 1101 is configured to receive the Ambisonic signal with focus processing 1100 and the view direction 1102. The Ambisonic rotation matrix processor 1101 is configured to generate a rotation matrix based on the view direction parameter 1102. This may in some embodiments use any suitable method, such as those applied in head-tracked Ambisonic binauralization (more generally, such rotation of spherical harmonics is used in many fields besides audio). The rotation matrix may then be applied to the Ambisonic audio signals.
[0167] The result is rotated Ambisonic signals with added focus 1104, which are output to an Ambisonic to binaural filter 1103.
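For illustration only, the rotation step can be sketched for the simplest case of first-order Ambisonics and a view direction consisting of a yaw angle. The ACN channel order (W, Y, Z, X), the convention that azimuth increases counterclockwise, and the function names are assumptions; higher orders and full three-axis rotation require larger spherical harmonic rotation matrices that are not shown here.

```python
import numpy as np

def foa_yaw_rotation(angle):
    """4x4 matrix rotating a first-order Ambisonic scene about the vertical axis.

    Channel order is assumed to be ACN (W, Y, Z, X); `angle` is in radians and
    positive angles move sources counterclockwise (towards the left).
    """
    c, s = np.cos(angle), np.sin(angle)
    rot = np.eye(4)
    rot[1, 1], rot[1, 3] = c, s    # Y' =  c*Y + s*X
    rot[3, 1], rot[3, 3] = -s, c   # X' = -s*Y + c*X
    return rot                      # W and Z are unchanged by a yaw rotation

def rotate_for_view(ambisonic, view_yaw):
    """Counter-rotate the scene by the listener's yaw.

    ambisonic : array, shape (4, samples), channels in ACN order
    view_yaw  : head yaw in radians (positive = head turned to the left)
    """
    return foa_yaw_rotation(-view_yaw) @ ambisonic
```

With these conventions, a source encoded straight ahead appears on the listener's right after the head has turned 90 degrees to the left.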
[0168] The Ambisonic to binaural filter 1103 is configured to receive the rotated Ambisonic signals with added focus 1104. The Ambisonic to binaural filter 1103 may comprise a pre-formulated 2×K matrix of finite impulse response (FIR) filters that are applied to the K Ambisonic signals to generate the 2 binaural signals 1106. The FIR filters may have been generated by least-squares optimization methods with respect to a set of head-related impulse responses (HRIRs). An example of such a design procedure is to transform the HRIR data set to frequency bins (for example by FFT) to obtain the HRTF data set, and to determine for each frequency bin a complex-valued processing matrix that in a least-squares sense approximates the available HRTF data set at the data points of the HRTF data set. When the complex-valued matrices have been determined in this way for all frequency bins, the result can be inverse transformed (for example by inverse FFT) to time-domain FIR filters. The FIR filters may also be windowed, for example by using a Hann window.
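The per-bin least-squares step of such a design procedure might be sketched as below. The array layouts, the function name and the use of `numpy.linalg.lstsq` are assumptions made for illustration; the spherical harmonic matrix is taken as an input, and the subsequent inverse FFT and windowing are not shown.

```python
import numpy as np

def design_binaural_matrices(hrtf, sh):
    """Fit one complex 2xK matrix per frequency bin in the least-squares sense.

    hrtf : complex array, shape (bins, 2, P), HRTFs for P measured directions
    sh   : real array, shape (K, P), Ambisonic (spherical harmonic) gains of
           the same P directions
    Returns an array of shape (bins, 2, K) such that mats[b] @ sh approximates
    hrtf[b] at the measured directions.
    """
    n_bins, k = hrtf.shape[0], sh.shape[0]
    mats = np.empty((n_bins, 2, k), dtype=complex)
    design = sh.T.astype(complex)  # shape (P, K)
    for b in range(n_bins):
        # Solve design @ M.T ~= hrtf[b].T for M.T, shape (K, 2)
        solution, *_ = np.linalg.lstsq(design, hrtf[b].T, rcond=None)
        mats[b] = solution.T
    return mats
```

When the number of measured directions P is at least K and the directions are well distributed, the fit is overdetermined and the matrices are uniquely defined.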
[0169] There are many known methods which may be used to render an Ambisonic signal to loudspeaker output. One example may be a linear decoding of the Ambisonic signals to a target loudspeaker configuration. This may be applied when the order of the Ambisonic signals is sufficiently high, for example, at least 3.sup.rd order, but preferably 4.sup.th order. In a specific example of such linear decoding an Ambisonic decoding matrix may be designed that, when applied to the Ambisonic signals (corresponding to Ambisonic beam patterns), generates loudspeaker signals corresponding to beam patterns that in a least-squares sense approximate the vector-base amplitude panning (VBAP) beam patterns suitable for the target loudspeaker configuration. Processing the Ambisonic signals with such a designed Ambisonic decoding matrix generates the loudspeaker sound output. In such embodiments the reproduction processor is configured to receive information regarding the loudspeaker configuration.
[0170] With respect to
[0171] The initial operation is receiving the focus processed Ambisonic audio signals (and the view directions) as shown in
[0172] The next operation is generating a rotation matrix based on the view direction as shown in
[0173] The next operation is applying the rotation matrix to the Ambisonic audio signals to generate rotated Ambisonic audio signals with focus processing as shown in
[0174] Then the next operation is converting the Ambisonic audio signals to a suitable audio output format, for example a binaural format (or a multichannel audio format) as shown in
[0175] The audio signals in the output format are then output as shown in
[0176] With respect to
[0177] In some embodiments the reproduction processor comprises a filter bank 1201 configured to receive the audio signals of the audio channels 1200 and transform them to frequency bands (unless the input is already in a suitable time-frequency domain). Examples of suitable filter banks include the short-time Fourier transform (STFT) and the complex quadrature mirror filter (QMF) bank.
[0178] The time-frequency audio signals 1202 can be output to a parametric binaural synthesizer 1203.
[0179] In some embodiments the reproduction processor comprises a parametric binaural synthesizer 1203 configured to receive the time-frequency audio signals 1202 and the modified (and unmodified) metadata 1204 and also the view direction 1206 (or suitable reproduction related control or tracking information). In context of 6DOF reproduction, the user position may be provided along with the view direction parameter.
[0180] The parametric binaural synthesizer 1203 may be configured to implement any suitable known parametric spatial synthesis method configured to generate a binaural audio signal (in frequency bands) 1208, since the focus modification has taken place already for the signals and the metadata before the parametric binauralization block. The binauralized time-frequency audio signals 1208 can then be passed to an inverse filter bank 1205. The embodiments may further feature the reproduction processor comprising an inverse filter bank 1205 configured to receive the binauralized time-frequency audio signals 1208 and apply the inverse of the forward filter bank, thus generating a time domain binauralized audio signal 1210 with the focus characteristics suitable for reproduction by headphones (not shown in
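A forward and inverse filter bank pair of the STFT type can be sketched as follows for a single channel. The square-root Hann window, the 50% overlap, the zero padding and the function names are assumed design choices, selected here because they give perfect reconstruction when no processing is applied between the two transforms.

```python
import numpy as np

def _sqrt_hann(n_fft):
    # Periodic Hann window; its square root is used for analysis and synthesis
    return np.sqrt(0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(n_fft) / n_fft))

def stft(x, n_fft=512):
    """Forward filter bank: returns complex frames, shape (frames, n_fft//2 + 1)."""
    hop = n_fft // 2
    win = _sqrt_hann(n_fft)
    n_frames = int(np.ceil(len(x) / hop)) + 1
    padded = np.zeros((n_frames + 1) * hop)
    padded[hop:hop + len(x)] = x  # pad so every sample lies in two frames
    frames = np.stack([padded[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, length, n_fft=512):
    """Inverse filter bank: overlap-add of windowed inverse FFT frames."""
    hop = n_fft // 2
    win = _sqrt_hann(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * win
    out = np.zeros((spec.shape[0] + 1) * hop)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
    return out[hop:hop + length]
```

Any band-wise processing, such as the parametric binaural synthesis, would operate on the frames returned by `stft` before `istft` is called.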
[0181] In some embodiments the binaural audio signal output is replaced by loudspeaker channel audio signals generated from the parametric spatial audio signals using suitable loudspeaker synthesis methods. Any suitable approach may be used, for example one where the view direction parameter is replaced with information on the positions of the loudspeakers, and the binaural processor is replaced with a loudspeaker processor, based on suitable known methods.
[0182] With respect to
[0183] The initial operation is receiving the focus processed parametric spatial audio signals (and the view directions or other reproduction related control or tracking information) as shown in
[0184] The next operation is time-frequency converting the audio signals as shown in
[0185] The next operation is applying a parametric binaural (or loudspeaker channel format) processor based on the time-frequency converted audio signals, the metadata and viewing direction (or other information) as shown in
[0186] The audio signals in the output format are then output as shown in
[0187] Considering a loudspeaker output for the reproduction processor when the audio signal is in a form of multichannel audio and focus processor 950 in
[0188] In some embodiments the conversion from the first loudspeaker configuration to the second loudspeaker configuration may be implemented using any suitable amplitude panning technique. For example an amplitude panning technique may comprise deriving an N-by-M matrix of amplitude panning gains that defines the conversion from the M channels of the first loudspeaker configuration to the N channels of the second loudspeaker configuration, and then using the matrix to multiply the channels of an intermediate spatial audio signal provided as a multi-channel loudspeaker signal according to the first loudspeaker configuration. The intermediate spatial audio signal may be understood to be similar to the audio signal with a focused sound component 204 as shown in
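One way to derive such an N-by-M gain matrix is sketched below for horizontal-only layouts, using pairwise two-dimensional amplitude panning in the style of VBAP. This is a simplified illustration under stated assumptions: the function names are hypothetical, loudspeakers are described by azimuth only, and a source loudspeaker that is not enclosed by any adjacent target pair is left with zero gains.

```python
import numpy as np

def unit(az_deg):
    """Unit vector in the horizontal plane for an azimuth in degrees."""
    a = np.radians(az_deg)
    return np.array([np.cos(a), np.sin(a)])

def panning_matrix(src_az_deg, dst_az_deg):
    """N-by-M gains: column m pans source loudspeaker m onto the target layout."""
    n, m = len(dst_az_deg), len(src_az_deg)
    order = np.argsort(np.mod(dst_az_deg, 360.0))  # target speakers around the circle
    gains = np.zeros((n, m))
    for col, az in enumerate(src_az_deg):
        target = unit(az)
        for i in range(n):
            a, b = order[i], order[(i + 1) % n]  # adjacent target pair
            base = np.column_stack([unit(dst_az_deg[a]), unit(dst_az_deg[b])])
            if abs(np.linalg.det(base)) < 1e-9:
                continue  # opposite or coincident speakers: pair unusable
            g = np.linalg.solve(base, target)
            if np.all(g >= -1e-9):  # direction lies between the pair
                g = np.clip(g, 0.0, None)
                gains[[a, b], col] = g / np.linalg.norm(g)  # energy normalization
                break
    return gains

def convert_layout(signals, gains):
    """Multiply the M-channel signal (M, samples) by the N-by-M gain matrix."""
    return gains @ signals
```

For example, `panning_matrix([30, 0, -30], [45, 135, -135, -45])` yields a 4-by-3 matrix in which the centre channel is shared equally by the two front loudspeakers of the target layout.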
[0189] For binaural output any suitable binauralization of a multi-channel loudspeaker signal format (and/or objects) may be implemented. For example a typical binauralization may comprise processing the audio channels with head-related transfer functions (HRTFs) and adding synthetic room reverberation to generate an auditory impression of a listening room. The distance+directional (i.e., positional) information of the audio object sounds can be utilized for the 6DOF reproduction with user movement, by adopting the principles outlined for example in GB patent application GB1710085.0.
[0190] An example apparatus suitable for implementation is shown in
[0191] An audio bitstream obtainer 1423 is configured to obtain an audio bitstream 1424, for example by receiving it or retrieving it from storage. In some embodiments the mobile device comprises a decoder 1425 configured to receive compressed audio and decode it. An example of the decoder is an AAC decoder in the case of AAC-encoded audio. The resulting decoded (for example Ambisonic where the example implements the examples as shown in
[0192] The mobile phone 1401 receives controller data 1400 (for example via Bluetooth) from an external controller at a controller data receiver 1411 and passes that data to the focus parameter (from controller data) determiner 1421. The focus parameter (from controller data) determiner 1421 determines the focus parameters, for example based on the orientation of the controller device and/or button events. The focus parameters can comprise any kind of combination of the proposed focus parameters (e.g., focus direction, focus amount, focus height, and focus width). The focus parameters 1422 are forwarded to the focus processor 1427.
[0193] Based on the Ambisonic audio signals and focus parameters a focus processor 1427 is configured to create modified Ambisonic signals 1428 that have desired focus characteristics. These modified Ambisonic signals 1428 are forwarded to the Ambisonic to binaural processor 1429. The Ambisonic to binaural processor 1429 also is configured to receive head orientation information 1404 from the orientation tracker 1413 of the mobile phone 1401.
[0194] Based on the modified Ambisonic signals 1428 and the head orientation information 1404, the Ambisonic to binaural processor 1429 is configured to create head-tracked binaural signals 1430 which can be output from the mobile phone and played back using, e.g., headphones.
[0195]
[0196] In some embodiments the focus amount can be controlled using Focus amount buttons (shown in
[0197] In some embodiments the focus shape can be determined by drawing the desired shape with a controller (e.g., with the one depicted in
[0198] In some embodiments, the focus controller as shown in
[0199] In an example scene, there are two sources of interest, for example talkers. The user then points at each of these sources and clicks “select focus direction”, and the visual display then indicates for the user that these sources (which are not only auditory sources but also visual sources at certain directions and distances) have been selected for audio focus. Then the user selects the focus amount and focus radius parameters, where the focus radius indicates how far from the sources of interest auditory events are still included within the determined focus shape. During control adjustment, the focus radius could be indicated as visual spheres around the visual sources of interest.
[0200] The visual field may react to user movement, but also the sources may move within the scene, and the source positions are tracked, typically visually. Therefore, the focus shape, which in this case may be represented by two spheres in the 3D space, then changes its overall shape adaptively by moving those spheres.
[0201] In other words, a complex focus shape that also includes depth focus is obtained. Then, depending on the spatial audio format, that focus shape can be either accurately reproduced (in a condition where the spatial audio has reliable distance information), or approximated otherwise, for example as was exemplified above.
[0202] In some embodiments, it may be desirable to further specify the focus processing, for example by determining a desired frequency range or spectral property of the focused signal. In particular, it may be useful to emphasize the focused audio spectrum at the speech frequency range to improve the intelligibility, for example by attenuating low frequency content (for example, below 200 Hz), and the high-frequency content (for example, above 8 kHz), thus leaving a particularly useful frequency range related to speech.
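As a hedged illustration of such spectral shaping, the sketch below builds per-bin weights that leave the range from 200 Hz to 8 kHz untouched and attenuate the remaining bins. The attenuation depth of 12 dB, the hard band edges and the function name are assumptions; an actual implementation might prefer smooth transitions.

```python
import numpy as np

def speech_band_weights(n_fft, sample_rate, lo_hz=200.0, hi_hz=8000.0, atten_db=-12.0):
    """Per-bin weights for the focused signal.

    Bins between lo_hz and hi_hz keep unit gain; bins outside that range are
    attenuated by atten_db (an assumed value).
    Returns (bin centre frequencies in Hz, weights), both of length n_fft//2 + 1.
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    weights = np.full(freqs.shape, 10.0 ** (atten_db / 20.0))
    weights[(freqs >= lo_hz) & (freqs <= hi_hz)] = 1.0
    return freqs, weights
```

The weights would be multiplied onto the frequency bins of the focused signal component in the time-frequency domain.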
[0203] It is understood that the focus-processed signal may be further processed with any known audio processing techniques, such as automatic gain control or enhancement techniques (e.g. bandwidth extension, noise suppression).
[0204] In some further embodiments, the focus parameters (including the direction, the amount and at least one focus shape parameter) are generated by a content creator, and the parameters are sent alongside the spatial audio signal. For example the scene may be a VR video/audio recording of an unplugged music concert near the stage. The content creator may assume that the typical remote listener wishes to determine a focus arc that spans towards the stage, and also to the sides for room acoustic effect, but removes the direct sounds from the audience (behind the VR camera main direction) at least to some degree. Therefore, a focus parameter track is added to the stream, and it can be set as the default rendering mode. However, the audience sounds are nevertheless present in the stream, and some users may prefer to discard the focus processing and enable the full sound scene including the audience sounds to be reproduced.
[0205] In other words, instead of user needing to select the direction and shape of the focus, a potentially dynamic focus parameter pre-set can be selected. The pre-set may have been fine-tuned by the content creator to well follow the show, for example, such that the focusing is turned off at the end of each song, to play back the applause to the listener. The content creator can generate some expected preference profiles as the focus parameters. The approach is beneficial since only one spatial audio signal needs to be conveyed, but different preference profiles can be added. A legacy player not enabled with focus may decode the Ambisonic signal without focus procedures.
[0206] In some further embodiments, the focus shape is controlled along with a visual zoom in the video with multiple viewing directions. The visual zoom can be conceptualized as the user controlling a set of virtual binoculars in the panoramic or 360 or 3D video. In such a use case, when the visual zoom feature is enabled (for example at least 1.5× zoom is set), then the audio focus of the spatial audio signal can also be enabled. Since the user is then clearly interested in that particular direction, the focus amount can be set to a high value, for example 80%, and the focus width can be set to correspond to the arc of the visual view in the virtual binoculars. In other words, the focus width gets smaller when the visual zoom is increased. Because the focus amount is set to 80% rather than to the maximum, the user can still hear to some degree the remaining spatial sound at the appropriate directions. In that way, the user hears the occurrence of interesting new content, and knows to turn off the visual zoom and to look toward the new direction of interest. The zoom processing may also be used in the context of audio codecs that allow such processing. An example of such a codec could be, e.g., MPEG-I.
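The coupling between visual zoom and audio focus could be sketched as follows. The unzoomed field of view of 90 degrees, the pinhole-style relation between zoom factor and visible arc, and all names are assumptions made for illustration; only the 1.5× threshold and the 80% focus amount come from the example above.

```python
import math

def focus_from_zoom(zoom, base_fov_deg=90.0, zoom_threshold=1.5, focus_amount=0.8):
    """Map a visual zoom factor to audio focus parameters.

    Returns None while the zoom is below the threshold (focus disabled).
    Otherwise returns the focus amount and a focus width equal to the visible
    arc, modelled here as 2 * atan(tan(fov / 2) / zoom).
    """
    if zoom < zoom_threshold:
        return None
    half_fov = math.radians(base_fov_deg) / 2.0
    width = 2.0 * math.atan(math.tan(half_fov) / zoom)
    return {"amount": focus_amount, "width_rad": width}
```

Under this model the focus width shrinks monotonically as the zoom factor grows, matching the behaviour described for the virtual binoculars.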
[0207] A user in such embodiments as described above may control the focus shape in a versatile way using the present invention.
[0208] An example processing output based on the implementation described for higher-order Ambisonics (HOA) signals is shown in
[0209] With respect to
[0210] In some embodiments the device 1700 comprises at least one processor or central processing unit 1707. The processor 1707 can be configured to execute various program codes, such as the methods described herein.
[0211] In some embodiments the device 1700 comprises a memory 1711. In some embodiments the at least one processor 1707 is coupled to the memory 1711. The memory 1711 can be any suitable storage means. In some embodiments the memory 1711 comprises a program code section for storing program codes implementable upon the processor 1707. Furthermore in some embodiments the memory 1711 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1707 whenever needed via the memory-processor coupling.
[0212] In some embodiments the device 1700 comprises a user interface 1705. The user interface 1705 can be coupled in some embodiments to the processor 1707. In some embodiments the processor 1707 can control the operation of the user interface 1705 and receive inputs from the user interface 1705. In some embodiments the user interface 1705 can enable a user to input commands to the device 1700, for example via a keypad. In some embodiments the user interface 1705 can enable the user to obtain information from the device 1700. For example the user interface 1705 may comprise a display configured to display information from the device 1700 to the user. The user interface 1705 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1700 and further displaying information to the user of the device 1700.
[0213] In some embodiments the device 1700 comprises an input/output port 1709. The input/output port 1709 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1707 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
[0214] The transceiver can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
[0215] The transceiver input/output port 1709 may be configured to receive the signals and in some embodiments obtain the focus parameters as described herein.
[0216] In some embodiments the device 1700 may be employed to generate a suitable audio signal using the processor 1707 executing suitable code. The input/output port 1709 may be coupled to any suitable audio output for example to a multichannel speaker system and/or headphones (which may be a headtracked or a non-tracked headphones) or similar.
[0217] In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
[0218] The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
[0219] The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
[0220] Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
[0221] Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, Calif. automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or “fab” for fabrication.
[0222] The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.