System and method for handling digital content
10856093 · 2020-12-01
Assignee
Inventors
Cpc classification
H04S2400/15
ELECTRICITY
H04S2420/13
ELECTRICITY
H04S7/302
ELECTRICITY
H04R5/04
ELECTRICITY
H04S7/30
ELECTRICITY
H04R3/02
ELECTRICITY
H04S7/301
ELECTRICITY
H04S3/008
ELECTRICITY
H04S2400/01
ELECTRICITY
International classification
H04S3/00
ELECTRICITY
H04R5/04
ELECTRICITY
H04S7/00
ELECTRICITY
H04R3/02
ELECTRICITY
Abstract
The invention refers to a system for handling digital content including an input interface, a calculator, and an output interface. The input interface receives digital content and includes a plurality of input channels. At least one input channel receives digital content from a sensor or a group of sensors belonging to a recording session. The calculator provides output digital content by adapting received digital content to a reproduction session in which the output digital content is to be reproduced. The output interface outputs the output digital content and includes a plurality of output channels, wherein at least one output channel outputs the output digital content to an actuator or a group of actuators belonging to the reproduction session. Further, the input interface, the calculator, and the output interface are connected with each other via a network. The input interface is configured to receive digital content via Ni input channels, where the number Ni is based on a user interaction, and/or the output interface is configured to output the output digital content via No output channels, where the number No is based on a user interaction. The invention further refers to a corresponding method.
Claims
1. A system for handling digital content, wherein the system comprises an input interface, a calculator, and an output interface, and wherein the system is a platform for ad hoc multichannel audio capturing and rendering, wherein the input interface is configured to wirelessly receive the digital content, wherein the input interface comprises a plurality of Ni input channels, wherein a group of input channels is configured to receive the digital content from a group of sensors belonging to a recording session, wherein the calculator is configured to provide an output digital content by adapting the digital content to a reproduction session in which the output digital content is to be reproduced, wherein the output interface is configured to output the output digital content, wherein the output interface comprises a plurality of No output channels, wherein a group of output channels is configured to wirelessly output the output digital content to a group of actuators belonging to the reproduction session, wherein the input interface, the calculator, and the output interface are connected with each other via a network, wherein the number Ni is based on a user interaction, and wherein the number No is based on a user interaction, wherein the system is configured to allow associating the digital content with the recording session or the output digital content with the reproduction session, wherein the recording session is associated with the group of sensors and wherein the reproduction session is associated with the group of actuators, wherein the system is configured to handle the digital content belonging to the recording session, and wherein the system is configured to handle the digital content belonging to the reproduction session, wherein the system is configured to initialize a time synchronization routine for the group of sensors associated with the recording session, so that the sensors of the group of sensors are time synchronized, wherein the system is 
configured to initialize a time synchronization routine for the group of actuators associated with the reproduction session, so that the actuators of the group of actuators are time synchronized, wherein the system is configured to initialize a localization routine for the group of sensors providing information about locations of the sensors of the group of sensors, and wherein the system is configured to initialize a localization routine for the group of actuators providing information about locations of the actuators of the group of actuators.
2. The system of claim 1, wherein a central unit comprising the input interface, the calculator, and the output interface is configured to use the input channels for the time synchronization of the group of sensors by providing a common clock signal for the group of sensors, and wherein the central unit is configured to use the input channels for triggering the group of sensors to submit information about their locations to the central unit.
3. The system of claim 1, wherein the calculator is configured to provide a modified content by adapting the digital content to a reproduction session neutral format based on the information about the locations of the group of sensors, and wherein the calculator is configured to adapt the modified content, being the reproduction session neutral digital content, to the reproduction session based on the information about the locations of the group of actuators.
4. The system of claim 1, wherein the locations of the sensors of the group of sensors are time-variant, and wherein the system is configured to run an algorithm for automatic synchronization and localization of the recording session.
5. The system of claim 1, wherein the platform is configured to combine different recording sessions and different kinds of the digital content with different reproduction sessions, and wherein the platform is configured to personalize the recording session and the reproduction session concerning the numbers of the groups of sensors and actuators and the positions of the groups of sensors and the groups of actuators, respectively.
6. The system of claim 1, wherein the calculator is configured to provide a temporally coded content by performing a temporal coding on the digital content to obtain a temporally compressed format, wherein the temporal coding comprises recording a time stamp track in addition to an actual audio signal for each sensor of the group of sensors, wherein the time stamp is acquired from a globally provided clock signal or from a session local network clock.
7. The system of claim 1, wherein the system comprises a user interface for allowing a user an access to the system, wherein the user interface is web-based, and wherein the user interface is configured to allow the user to initiate at least one of the following sessions: a registering session comprising registering the user or changing a user registration or de-registering the user, a login/logout session comprising a login of the user or a logout of the user, a sharing session sharing a session, the recording session comprising recording the digital content or uploading the digital content, the reproduction session comprising outputting the output digital content or reproducing the output digital content, and a duplex session comprising a combination of the recording session and the reproduction session.
8. The system of claim 1, wherein the group of sensors in the recording session comprises a number of smartphones, and wherein the group of actuators in the reproduction session comprises a number of smartphones, and wherein the digital content is transmitted via mobile phone connections.
9. The system of claim 1, wherein the system is configured to initialize a calibration routine for the group of sensors associated with the recording session for providing calibration data for the group of sensors, and to initialize a calibration routine for the group of actuators associated with the reproduction session for providing calibration data for the group of actuators.
10. The system of claim 8, wherein the calculator is configured to provide the output digital content based on the digital content and based on transfer functions associated with the group of sensors belonging to the recording session by decomposing a wave field of the specified recording session into mutually statistically independent components, where the mutually statistically independent components are projections onto basis functions, where the basis functions are based on normal vectors and the transfer functions, and where the normal vectors are based on a curve calculated based on locations associated with the group of sensors belonging to the recording session, or wherein the calculator is configured to provide the output digital content based on the digital content and based on transfer functions associated with the group of actuators belonging to the reproduction session by decomposing a wave field of the recording session into mutually statistically independent components, where the mutually statistically independent components are projections onto basis functions, where the basis functions are based on normal vectors and the transfer functions, and where the normal vectors are based on a curve calculated based on locations associated with the group of actuators belonging to the reproduction session.
11. The system of claim 8, wherein the calculator is configured to divide the transfer functions in a time domain into early reflection parts and late reflection parts.
12. The system of claim 1, wherein the calculator is configured to provide a signal description for the digital content based on locations associated with the group of actuators of the reproduction session, where the signal description is given by decomposing the digital content into spatially independent signals that sum up to the signal of an omnidirectional sensor, and where the spatially independent signals comprise corresponding looking directions towards the actuators of the group of actuators and spatial nulls into directions different from the looking directions.
13. The system of claim 1, wherein the system is configured to handle the digital content in full duplex, wherein a duplex session comprises a combination of the recording session and the reproduction session, and wherein the calculator is configured to perform a multichannel acoustic echo control in order to reduce echoes resulting from couplings between the group of sensors associated with the recording session and the group of actuators associated with the reproduction session.
14. A method for handling digital content, comprising: receiving the digital content by an input interface, wherein the input interface comprises a plurality of Ni input channels, wherein the input channels are configured to receive the digital content from a group of sensors belonging to a recording session; wherein the method comprises operating a platform for ad hoc multichannel audio capturing and rendering, providing an output digital content by adapting the digital content to a reproduction session in which the output digital content is to be reproduced, and outputting the output digital content by an output interface, wherein the output interface comprises a plurality of No output channels, wherein the output channels are configured to output the output digital content to a group of actuators belonging to the reproduction session, wherein the digital content and/or the output digital content is transferred via a wireless network, and wherein the number Ni is based on a user interaction, and wherein the number No is based on a user interaction, wherein the method allows associating the digital content with the recording session or the output digital content with the reproduction session, wherein the recording session is associated with the group of sensors and wherein the reproduction session is associated with the group of actuators, wherein the method handles the digital content belonging to the recording session, and wherein the method handles the digital content belonging to the reproduction session, wherein the method initializes a time synchronization routine for the group of sensors associated with the recording session, so that the sensors of the group of sensors are time synchronized, wherein the method initializes a time synchronization routine for the group of actuators associated with the reproduction session, so that the actuators of the group of actuators are time synchronized, wherein the method initializes a localization routine for the group of sensors providing information about locations of the sensors of the group of sensors, and wherein the method initializes a localization routine for the group of actuators providing information about locations of the actuators of the group of actuators.
15. A non-transitory digital storage medium having a computer program stored thereon to perform, when said computer program is run by a computer, the method for handling digital content, the method comprising: receiving the digital content by an input interface, wherein the input interface comprises a plurality of Ni input channels, wherein the input channels are configured to receive the digital content from a group of sensors belonging to a recording session, wherein the method comprises operating a platform for ad hoc multichannel audio capturing and rendering, providing an output digital content by adapting the digital content to a reproduction session in which the output digital content is to be reproduced, and outputting the output digital content by an output interface, wherein the output interface comprises a plurality of No output channels, wherein the output channels are configured to output the output digital content to a group of actuators belonging to the reproduction session, wherein the digital content and the output digital content are transferred via a wireless network, and wherein the number Ni is based on a user interaction, and wherein the number No is based on a user interaction, wherein the method allows associating the digital content with the recording session or the output digital content with the reproduction session, wherein the recording session is associated with the group of sensors and wherein the reproduction session is associated with the group of actuators, wherein the method handles the digital content belonging to the recording session, and wherein the method handles the digital content belonging to the reproduction session, wherein the method initializes a time synchronization routine for the group of sensors associated with the recording session, so that the sensors of the group of sensors are time synchronized, wherein the method initializes a time synchronization routine for the group of actuators associated with the reproduction session, so that the actuators of the group of actuators are time synchronized, wherein the method initializes a localization routine for the group of sensors providing information about locations of the sensors of the group of sensors, and wherein the method initializes a localization routine for the group of actuators providing information about locations of the actuators of the group of actuators.
16. The system of claim 12, wherein the actuators of the group of actuators are spatially surrounded by the sensors, and wherein the spatial nulls correspond to sectors of quiet zones or are based on at least one focused virtual sink with a directivity pattern achieved by a superposition of focused multipole sources according to a wave field synthesis or according to a time reversal cavity.
17. The system of claim 1, wherein the positions associated with the sensors of the group of sensors of the recording session and the positions associated with the actuators of the group of actuators of the reproduction session, respectively, coincide within a given tolerance level, and wherein the calculator is configured to provide the output digital content so that the actuators reproduce the digital content recorded by the sensors with coinciding positions, or wherein the positions associated with the group of sensors of the recording session and associated with the group of actuators of the reproduction session, respectively, coincide up to a spatial shift, and wherein the calculator is configured to provide the output digital content based on a compensation of the spatial shift.
18. The system of claim 1, wherein the calculator is configured to provide the output digital content by performing an inverse modeling for the digital content by calculating a system inversing a room acoustic of a reproduction room of a recording session, or wherein the calculator is configured to provide the output digital content by adapting the digital content to a virtual reproduction array and by extrapolating the adapted digital content to positions associated with the group of actuators of the reproduction session, or wherein the calculator is configured to provide the output digital content based on the digital content by placing virtual sources either randomly or according to data associated with the number No of output channels.
19. The system of claim 1, wherein the time synchronization routine for the recording session is performed such that each sensor of the group of sensors of the recording session acquires a common clock signal, and wherein the time synchronization routine for the reproduction session is performed such that each actuator of the group of actuators of the reproduction session acquires a common clock signal for the actuators.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Embodiments of the present invention will be detailed subsequently referring to the appended drawings.
DETAILED DESCRIPTION OF THE INVENTION
(11) The audio signals are recorded by three sensors in the form of microphones: M1, M2, and M3. The sensors M1, M2, M3 are individual nodes and belong to a recording session. The sensors belong in one embodiment to smartphones.
(12) In a reproduction session a consuming user U is interested in hearing the audio signals.
(13) For this purpose, four loudspeakers L1, L2, L3, and L4 serve in this embodiment for reproducing or replaying the audio signals stemming from the two sources S1, S2.
(14) As there are in the recording session three microphones M1, M2, M3 located in front of the signal sources S1, S2 and as there are in the reproduction session four loudspeakers L1, L2, L3, L4 arranged around the user U, a suitable adaptation of the recorded content to the reproduction scenario is advisable. This is done by the system 1.
(15) The system 1 also helps to connect different recording and reproduction sessions which are separated by space and time. This is done by the feature that the recording session (or more precisely the used sensors M1, M2, M3), the reproduction session (or more precisely the associated actuators L1, L2, L3, L4), and a central unit CU for taking care of the digital content are connected to each other by a network, which is here realized by the internet. Hence, the drawn lines just indicate possible connections.
(16) The possibility to consume digital content in a reproduction session at any given time after a recording session has happened is enabled by a data storage 5 comprised here by the central unit CU for storing the recorded digital data and the output digital data based on the original digital data. The data storage 5 allows in the shown embodiment to store the received digital content in connection with a time stamp.
(17) The system 1 comprises an input interface 2 which allows to input digital content or data to the calculator 3 and here to the central unit CU. There is a network between the input interface 2, the calculator 3 and the output interface 4 which is here indicated by direct connections.
(18) The data refers to: digital data or information stemming from the sensors M1, M2, M3; information about the actuators L1, L2, L3, L4; data provided by a user interface UI; and data belonging to different modalities such as video data, haptic/touch data, or olfactory data.
(19) The shown input interface 2 comprises for the input of the respective data six input channels: I1, I2, I3, II, ID and IM.
(20) Three input channels I1, I2, and I3 are associated with the individual sensors M1, M2, and M3.
(21) One input channel II allows the user interface UI to input data. This data refers, for example, to selections by a user, to initializing sessions by the user or to uploading pre-recorded data. The pre-recorded or offline recorded data is recorded e.g. in advance of the current recording session or in a different recording session. The user adds, on the recording side of the system, the pre-recorded data to the recording session or to a reproduction session. Associating the different data with a recording or reproduction session causes the calculator 3 to handle the data jointly in at least one step while performing the adaptation of the recording data to the output content to be used in a reproduction session.
(22) The fifth input channel ID allows the input of the information about the actuators L1, L2, L3, L4 used for the reproduction.
(23) The sixth input channel IM serves for the input of data belonging to different modalities such as video data, haptic/touch data, or olfactory data.
(24) At least some input channels I1, I2, I3, II, ID, IM allow in the shown embodiment not only to receive data but also to send or output data, e.g. for starting a routine in the connected components or nodes M1, M2, M3, L1, L2, L3, L4 or sending request signals and so on.
(25) In an embodiment, the input channels I1, I2, I3 connected with the sensors M1, M2, M3 allow to initiate a calibration of the sensors M1, M2, M3, i.e. to identify the characteristics of the respective sensor M1, M2, M3. In an embodiment, the calibration data are stored on the respective sensor M1, M2, M3 and are used directly by it for adjusting the recorded digital content. In a different embodiment, the calibration data is submitted to the central unit CU.
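As a minimal sketch of the local use of calibration data, a sensor might correct its recorded samples before upload. The function name and the frequency-flat gain model are assumptions for illustration, not the patented implementation:

```python
def apply_calibration(samples, calibration_gain):
    """Adjust recorded samples by a per-sensor gain, a simplified,
    frequency-flat stand-in for the identified sensor characteristics."""
    return [s * calibration_gain for s in samples]

# The calibration data stored on the sensor is applied locally before
# the corrected content is submitted to the central unit CU.
recorded = [0.1, -0.2, 0.3]
corrected = apply_calibration(recorded, calibration_gain=2.0)
```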
(26) The number Ni of input channels I1, I2, I3 actually used for the input of the audio data belonging to a recording session is set by a user. This implies that the input interface 2 offers input channels and the user decides how many channels are needed for a recording session. The user sets the number Ni of input channels using, in the shown embodiment, the user interface UI.
(27) Further, the interface 2 is not limited to one location or to one area but can be distributed via its input channels I1, I2, I3, II, IM, ID to very different places.
(28) The input interface 2 is connected to a central unit CU. The central unit CU is in one embodiment a computer and is in a different embodiment realized in a cloud. The shown central unit CU comprises a part of a calculator 3 which adapts the digital content stemming from the recording session to the requirements and possibilities of the reproduction session.
(29) The calculator 3 comprises, according to the shown embodiment, three different types of subunits C1.i, C2, and C3.i. The index i of the types of subunits C1 and C3 refers to the associated unit or node in the shown embodiment.
(30) One type of subunit C1.i (here: C1.1, C1.2, C1.3) belongs to the different sensors M1, M2, M3. A different subunit C2 belongs to the central unit CU and a third type of subunit C3.i (here: C3.1, C3.2, C3.3, C3.4) is part of the reproduction session and is associated with the loudspeakers L1, L2, L3, L4.
(31) The three different types of subunits C1 or C1.i, C2, C3 or C3.i help to adapt the digital content from the recording session to the reproducing session while providing modified content.
(32) The modified content is in one embodiment the output digital content to be output to and reproduced in the reproduction session.
(33) In a different embodiment, the modified content describes the recorded content or the reproduction in a neutral or abstract format. Hence, the modified content is in this embodiment a kind of intermediate step of adapting the digital content from the given parameters of the recording scenario via a neutral description to the constraints of the reproduction scenario.
(34) The subunits C1.1, C1.2, C1.3 of the type C1 belonging to the sensors M1, M2, M3 convert the digital content of the microphones M1, M2, M3 from a recording session specific and, thus, sensor specific format into a neutral format. This neutral or mediating format refers, for example, to an ideal sensor detecting signals with equal intensity from all directions. Alternatively or additionally, the neutral format refers to an ideal recording situation. Generally, the neutral format lacks all references to the given recording session.
(35) The subunits are here part of the system. In a different embodiment, the subunits are merely connected to the system but still perform the involved processing steps.
(36) The subunits C1 have access to information about the locations of the respective sensor M1, M2, M3 and use this information for calculating the recording session neutral digital content which is here submitted via respective input channels I1, I2, I3 to the central unit CU.
(37) Further processing of the digital content is performed by a subunit C2 belonging to the central unit CU. This is for example the combination of digital content from different sensors or the combination with off-line recorded data etc.
(38) The three sensors M1, M2, M3 allow an online recording of the two sound sources S1, S2. The digital content recorded by the three microphones M1, M2, M3 is buffered and uploaded to the central unit CU which is in one embodiment a server. The buffer size is chosen e.g. in dependence on network bandwidth and the desired recording quality (Bit depth and sampling frequency). For a higher quality a smaller buffer size is used.
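The buffer-size trade-off described above can be sketched as follows. The latency budget and the concrete formula are illustrative assumptions; the point is only that a higher quality (more bits per second of audio) yields a smaller buffer:

```python
def choose_buffer_size(bandwidth_bps, bit_depth, sampling_rate_hz,
                       max_latency_s=0.5):
    """Pick an upload buffer size in samples so that shipping one
    buffer over the available network bandwidth stays within a
    latency budget. Higher bit depth (quality) -> smaller buffer."""
    audio_bitrate = bit_depth * sampling_rate_hz
    # Effective throughput is limited by the slower of network and source.
    samples = int(min(bandwidth_bps, audio_bitrate) * max_latency_s
                  / bit_depth)
    return max(samples, 1)
```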
(39) The central unit CU also uses the input channels I1, I2, I3 for a time synchronization of the sensors M1, M2, M3 by providing a common clock signal for the sensors M1, M2, M3. Further, the central unit CU uses the input channels I1, I2, I3 for triggering the connected sensors M1, M2, M3 to submit information about their location to the central unit CU and to the subunit C2 of calculator 3.
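The distribution of a common clock signal could, for example, rest on an NTP-style round-trip exchange over the input channels. The four-timestamp scheme below is an assumed realization; the text only states that the central unit provides a common clock signal:

```python
def estimate_clock_offset(t_send, t_recv_remote, t_reply_remote, t_back):
    """NTP-style estimate of the offset between the central unit's
    clock and a sensor's local clock from one request/reply round trip.
    t_send/t_back are central-unit timestamps; t_recv_remote and
    t_reply_remote are taken on the sensor."""
    offset = ((t_recv_remote - t_send) + (t_reply_remote - t_back)) / 2.0
    delay = (t_back - t_send) - (t_reply_remote - t_recv_remote)
    return offset, delay
```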
(40) The subunit C2, belonging to the central unit CU of the shown embodiment, allows analyzing pre-recorded or offline recorded data uploaded by the user for the respective recording session. The uploaded data is e.g. analyzed with respect to statistical independence, e.g. using interchannel correlation based measures, to determine whether the uploaded channels are data of separated sources or a multichannel mixture signal. This allows recording digital content independently and merging the content later on.
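An interchannel correlation based measure of the kind mentioned above might look like this minimal sketch; the decision threshold is an assumption for illustration:

```python
def channel_correlation(x, y):
    """Normalized interchannel correlation coefficient in [-1, 1]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def looks_independent(x, y, threshold=0.3):
    """Heuristic: low correlation suggests separated sources rather
    than a multichannel mixture signal."""
    return abs(channel_correlation(x, y)) < threshold
```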
(41) In the central unit CU, the digital content (alternatively named input digital content or received digital content) and the output digital content are stored in a data storage 5. The output digital content is calculated by the calculator 3 and the central unit CU. Relevant for the reproduction session is the output digital content.
(42) The output digital content is transmitted via an output interface 4 to the reproduction session. This is still done via a network (e.g. via the internet) in which the system 1 is embedded or to which the system 1 is at least partially connected. The output interface 4 comprises output channels from which four channels O1, O2, O3, O4 are used in the shown embodiment to output the output digital data to four loudspeakers L1, L2, L3, L4. The number No of output channels used is based on a user input. The loudspeakers L1, L2, L3, L4 surround a consuming user U.
(43) Especially, it is possible for users to choose the number of input channels Ni needed for a recording session as well as the number of output channels No to be used for a reproduction session.
(44) The loudspeakers L1, L2, L3, L4 are connected to associated output channels O1, O2, O3, O4 and to subunits C3.1, C3.2, C3.3, C3.4. The subunits of the type C3 are either a part of the loudspeakers (L1 and C3.1; L3 and C3.3) or are separate additional components (C3.2 and L2; C3.4 and L4).
(45) The subunits C3.1, C3.2, C3.3, C3.4 belonging to type C3 provide output digital content for their associated loudspeakers L1, L2, L3, L4 taking information about the loudspeakers L1, L2, L3, L4 and especially their locations into consideration. The locations of the loudspeakers L1, L2, L3, L4 may refer to their absolute positions as well as to their relative positions and also to their positions relative to the consuming user U.
(46) The user interface UI allows in the shown embodiment a user to choose the number Ni of input channels for a recording session, i.e. the number of used sensors, and the number No of output channels for the reproduction session, i.e. the number of loudspeakers used.
(47) Additionally, the user interface UI allows a user to initiate different kinds of sessions:
(48) A kind of session allows steps concerning the registration of a user. Hence, in such a session a user can register, change its registration or even de-register.
(49) In a different kind of session, a user logs in or out.
(50) Still another session comprises sharing a session. This implies that e.g. two users participate in a session. This is, for example, a recording session. By sharing a recording session, different users can record digital content without the need to do this at the same time or at the same location.
(51) Each started session can be joined by other registered members of the platform, or by the same member with a different device, upon invitation or by an accepted join-request (granted knocking). Each registered device in a session is called a node. A node optionally has a set of sensors (e.g., microphones) and/or actuators (e.g., loudspeakers) and communicates, according to its number of input and output channels, with its channel peers and the server.
(52) A special session to be initiated is a recording session as discussed above, comprising recording digital content and/or uploading digital content. Also of special interest is a reproduction session (also discussed above) comprising outputting output digital content and/or reproducing output digital content. Finally, both sessions are combined in a duplex session.
(53) In a different embodiment, the user interface UI (which can also be named user front end) provides at a developer level the integration of plugins for further processing the raw sensor (e.g., microphone) data. Different plugins are: synchronizing signals, continuous location tracking of the capturing devices and optionally their directivity patterns.
(54) The recording user front-end provides at a developer level the integration of plugins for the further processing of the raw sensor (e.g., microphone) data. The plugins have to be licensed by the platform operating community and are provided centrally by the operator. The platform natively provides as input for licensed plugins: synchronized signals, continuous location tracking of the capturing devices, and optionally their directivity patterns.
(55) The data storage 5 of the shown embodiment stores the digital content in a temporal as well as spatially coded format.
(56) The received digital content is in an embodiment stored in a temporally compressed format such as Ogg Vorbis, Opus or FLAC. An embodiment especially referring to audio signals includes recording a time stamp track in addition to the actual audio signal for each microphone M1, M2, M3. The time stamp is in one embodiment acquired from a globally provided clock signal and in a different embodiment from a session local network clock.
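A toy illustration of recording a time stamp track in addition to the actual audio frames; the JSON container and function name are assumptions for illustration (the audio itself would be stored in a format such as Ogg Vorbis, Opus or FLAC):

```python
import json
import time

def record_with_timestamps(frames, clock=time.monotonic):
    """Store a time stamp track alongside the actual audio frames so
    that recordings from different microphones can be aligned later.
    `clock` stands in for a globally provided clock signal or a
    session local network clock."""
    track = [{"t": clock(), "samples": frame} for frame in frames]
    return json.dumps(track)
```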
(57) Also, spatial coding is used in an embodiment. The goals of the spatial coding are twofold: 1. Transforming the data such that the multiple channels in the new representation are mutually statistically independent or at least to be less dependent on each other than before the transformation. This is done, for example, in order to reduce redundancy. 2. Enabling to project the given recording setup (according to the distribution of sensor positions) to a (possibly different) reproduction setup (according to the distribution of actuator positions).
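For the first goal, a statistically motivated transform can be sketched for the two-channel case: a rotation that diagonalizes the channel covariance, so the new channels are mutually uncorrelated. This is a minimal stand-in for the general multichannel coding scheme, not the patented transform:

```python
import math

def decorrelate_two_channels(x, y):
    """Rotate a channel pair so the new channels are uncorrelated
    (a 2x2 principal component analysis)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    xc = [a - mx for a in x]
    yc = [b - my for b in y]
    sxx = sum(a * a for a in xc) / n
    syy = sum(b * b for b in yc) / n
    sxy = sum(a * b for a, b in zip(xc, yc)) / n
    # Rotation angle that diagonalizes the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    u = [c * a + s * b for a, b in zip(xc, yc)]
    v = [-s * a + c * b for a, b in zip(xc, yc)]
    return u, v
```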
(58) Here, different cases are realized by different embodiments. As detailed below, one embodiment is based on a statistically optimal spatial coding. Moreover, there are also realizations by embodiments based on deterministic approaches as detailed below. It has to be considered that the statistically optimal coding scheme can also be understood as a general scheme for spatial coding which includes the deterministic ones as special cases.
(59) An embodiment for the adaptation of the recorded data to the requirements of the reproduction session will be explained in the following.
(60) The calculator 3 performs the adaptation. The sensors M1, M2, M3 and actuators L1, L2, L3, L4 are referred to as nodes, which here include just one device each. Accordingly, the steps are used for recording as well as for reproduction sessions. Further, in the example just the location, or more precisely the information about the location, of the node is considered. In this case, by sharing a recording and/or reproduction session, the assignment between the nodes and M1, M2, M3, L1, L2, L3, L4 is initiated.
(61) The calculator 3 adapts the digital content belonging to a session by calculating a centroid of an array of the nodes belonging to the session using the location information. Afterwards, all nodes are excluded from further considerations when they are farther away from the calculated centroid than a given threshold. The other nodes located closer to the centroid are kept and form a set of remaining nodes. Thus, in an embodiment the relevant nodes from the given nodes of a recording or reproduction session are identified based on their positions. Relevant in an embodiment are nodes that are close to a joint or common position. For the remaining nodes, convex polygons are calculated. In one embodiment, the convex polygons are calculated by applying a modified incremental convex hull algorithm.
(62) This is followed by a selection of the calculated convex polygon having the highest number of nodes. The selected convex polygon forms a main array and is associated with nodes. These nodes belong to the remaining nodes and are the nodes that allow forming a convex polygon with the highest number of nodes. These associated nodes are clustered with respect to their location.
(63) In an embodiment, the calculator 3 clusters the nodes into subarrays depending on their distance to their respective centroid. Then, the selected calculated convex polygon described above is calculated for the individual subarrays.
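The node-selection step described above can be sketched as follows (a minimal pure-Python illustration; the function name, 2-D positions, and the fixed threshold are assumptions, and the subsequent convex-hull computation is omitted):

```python
import math

def select_nodes(positions, threshold):
    """Keep only nodes within `threshold` of the array centroid.

    positions: list of (x, y) node locations; threshold: max distance.
    Returns the list of remaining (x, y) positions.
    """
    cx = sum(p[0] for p in positions) / len(positions)
    cy = sum(p[1] for p in positions) / len(positions)
    return [p for p in positions
            if math.hypot(p[0] - cx, p[1] - cy) <= threshold]

# Example: one outlier far from the cluster is excluded.
nodes = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
remaining = select_nodes(nodes, threshold=5.0)
```

The remaining nodes would then be passed to the convex-hull stage.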
(64) In an embodiment, convex and smooth polygons are used in order to calculate the normal vectors.
(65) The foregoing is used by the calculator 3 to calculate normal vectors for the nodes that are associated with the selected calculated convex polygon, i.e. with the main array. The nodes mentioned in the following are the nodes of the polygon.
(66) The calculator 3 performs the following steps using the different subunits C1, C2, C3:
(67) step 1: sorting locations of the nodes with respect to their inter-distances.
(68) step 2: calculating a closed Bezier curve to interpolate between the nodes of the polygon in a sorted order.
(69) step 3: calculating a derivative of the Bezier curve.
(70) step 4: calculating vectors between the nodes and the Bezier curve after excluding a node at which the Bezier curve starts and ends.
(71) step 5: calculating a scalar product between the calculated vectors of step 4 and the derivative of the Bezier curve calculated in step 3.
(72) step 6: determining a normal vector of a node as a vector between the respective node and the Bezier curve by minimizing the sum of the scalar product of step 5 and a squared Euclidean norm.
(73) step 7: repeating steps 2 to 6 with the Bezier curve starting at another node in order to determine the normal vector of the previously excluded node.
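The seven steps above yield outward normal vectors along a smooth closed curve through the nodes. As an illustrative stand-in, the following sketch derives per-node outward normals of a convex polygon directly from its adjacent edges, a deterministic simplification that skips the Bezier interpolation and the minimization of step 6:

```python
import math

def polygon_normals(vertices):
    """Outward unit normals per vertex of a convex polygon.

    vertices: (x, y) tuples in counter-clockwise order.
    Each normal is the normalized sum of the two adjacent edge normals.
    """
    n = len(vertices)
    normals = []
    for i in range(n):
        px, py = vertices[i - 1]          # previous vertex
        cx, cy = vertices[i]              # current vertex
        nx_, ny_ = vertices[(i + 1) % n]  # next vertex
        # Outward normal of an edge (dx, dy) in CCW order is (dy, -dx).
        e1 = (cy - py, -(cx - px))        # normal of incoming edge
        e2 = (ny_ - cy, -(nx_ - cx))      # normal of outgoing edge
        sx, sy = e1[0] + e2[0], e1[1] + e2[1]
        norm = math.hypot(sx, sy)
        normals.append((sx / norm, sy / norm))
    return normals

# Unit square in CCW order: corner normals point diagonally outward.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
norms = polygon_normals(square)
```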
(74) As already mentioned, having determined the normal vectors according to the previous steps, the loudspeaker and microphone signals are preprocessed according to a spatiotemporal coding scheme in an embodiment.
(75) In an embodiment, the loudspeaker and microphone signals are preprocessed either at the central unit CU or here the subunit C2 (e.g. a server) or locally (using the subunits C1.1, C1.2, C1.3, C3.1, C3.2, C3.3, C3.4) in a different embodiment. Hence, the nodes allow in some embodiments to perform processing steps. Processing is done according to the following steps:
(76) 1. The nodes of the recording (microphones M1, M2, M3) and synthesis parts (loudspeakers L1, L2, L3, L4) are clustered according to the aforementioned approach and convex hulls for both sides, i.e. for the recording and the reproduction session are determined. The convex hulls surround the relevant recording and reproduction areas, respectively.
(77) 2. At the recording side, the relative transfer functions between each two microphones are determined. This is done, for example, via measurements. In one embodiment, each node comprises at least one sensor and one actuator, thus, enabling measurements of the transfer functions.
(78) Optionally, the transfer functions are approximated by the transfer functions between a loudspeaker of one node and the microphone of another by assuming that the microphone and loudspeaker of one node are spatially so close that they can be considered as being colocated. In an embodiment, the nodes are realized by smartphones comprising microphones and loudspeakers. For such devices like smartphones, it can be assumed that the microphones and loudspeakers are located at the same position.
(79) The relative transfer function describing the acoustic path from one node to itself is measured by calculating the acoustic path of one node's loudspeaker to its microphone.
(80) Each transfer function is divided in the time domain into early and late reflection parts, resulting in two FIR filters of the same length L. The division is motivated by the characteristic structure of acoustic room impulse responses. Typically, the early reflections are a set of discrete reflections whose density increases until the late reflection part, in which individual reflections can no longer be discriminated and/or perceived.
(81) When these two parts are modelled by two separate FIR filters, the late reflection part contains leading zeros in the time domain so that it can be realized by a filter of the same length as the one modelling the early reflection part.
(82) The separation is done, e.g., using the approach presented in [Stewart et al.].
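The split itself can be sketched as follows (the fixed split index is an assumption; [Stewart et al.] describe how the transition point can be chosen):

```python
def split_impulse_response(h, split):
    """Split a room impulse response into early and late FIR filters.

    h: impulse response samples; split: sample index of the transition.
    The late part keeps its leading zeros so both filters share len(h).
    """
    early = list(h[:split]) + [0.0] * (len(h) - split)
    late = [0.0] * split + list(h[split:])
    return early, late

h = [1.0, 0.5, 0.25, 0.1, 0.05, 0.02]
early, late = split_impulse_response(h, split=3)
# early + late reconstructs h sample by sample
```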
(83) The separated transfer functions between microphones i and j are written according to an embodiment in a convolution matrix (Sylvester matrix H_ij) form and ordered in a block-Sylvester matrix, such that two block-Sylvester matrices are obtained: one for the early reflections and one for the late reflections.
(84) For the early reflections:

(85)

H°_early = [ H°_11^early . . . H°_1P^early ; . . . ; H°_P1^early . . . H°_PP^early ]  (1)

where each block is the Sylvester (convolution) matrix of the early-reflection FIR filter h_ij^early of length L between microphones i and j:

H°_ij^early = [ h_ij^early(0) 0 . . . 0 ; h_ij^early(1) h_ij^early(0) . . . 0 ; . . . ; 0 . . . 0 h_ij^early(L-1) ]  (2)

(86) The notation with a circle (°) was used to distinguish the formula with the Sylvester matrices from a more compact calculation to be given in the following.

(87) Similarly, for the late reflections:

(88)

H°_late = [ H°_11^late . . . H°_1P^late ; . . . ; H°_P1^late . . . H°_PP^late ]  (3)

with components similar to those given in equation (2).
(89) Further, a dictionary is defined as

(90)

D(k) = [ e^(-j k_n^T x_p) ], p=1, . . . , P; n=1, . . . , N  (4)

(91) In the dictionary, x_p denotes the position of each localized node and k_n denotes a wave vector with the magnitude k = ω/c, with ω denoting a radial frequency.
(92) The dictionary is based in this embodiment on the locations of the relevant nodes and the calculated normal vectors of the respective session (either recording or reproduction session). It allows describing the digital content (here, for example, either the recorded audio signals, i.e. the sensor/microphone signals, or the output signals of the actuators/loudspeakers) by a transform-domain representation.
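A minimal sketch of such a plane-wave dictionary (the matrix label D, the sign of the exponent, and the 2-D positions are assumptions):

```python
import cmath
import math

def plane_wave_dictionary(positions, directions, omega, c=343.0):
    """Dictionary with entries exp(-j k_n . x_p).

    positions: node positions x_p as (x, y) tuples.
    directions: unit direction vectors of the plane-wave wave vectors.
    omega: radial frequency; c: speed of sound in m/s.
    Returns a list of rows (one per node); columns index plane waves.
    """
    k = omega / c  # wave number magnitude k = omega / c
    return [[cmath.exp(-1j * k * (d[0] * p[0] + d[1] * p[1]))
             for d in directions]
            for p in positions]

pos = [(0.0, 0.0), (0.5, 0.0)]
dirs = [(1.0, 0.0), (0.0, 1.0)]
D = plane_wave_dictionary(pos, dirs, omega=2 * math.pi * 343.0)  # k = 2*pi
```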
(93) For a microphone signal Y at a frequency k captured by the given distributed microphone array, it can be written:

Y(k) = D(k) Y'(k)  (5)

(94) There, Y' denotes the transform-domain representation of the microphone signal.
(95) It is known that the Discrete Fourier Transform-Matrix (DFT-Matrix) diagonalizes so-called circulant matrices. This means that the DFT-Matrix is composed of the eigenvectors of circulant matrices. This relationship for circulant matrices also holds approximately for matrices with Toeplitz structure (if they are large).
(96) A Sylvester matrix (e.g., formula (2)) is a special case of a Toeplitz matrix. Moreover, it is known that the corresponding diagonal matrix contains the frequency-domain values on its main diagonal. Hence, the matrix with the late reflections H_late is transformed into the frequency domain after zero padding and by a multiplication with a block-diagonal matrix with the DFT (Discrete Fourier Transformation) matrices on its main diagonal from one side and the Hermitian transpose of this block-diagonal matrix from the other side.
(97) Equivalently, for computational efficiency, the FFT (Fast Fourier Transform) is applied to the individual filters after zero padding. The resulting vectors are set as the diagonals of the submatrices in the complete blockwise diagonalized relative transfer function matrix Ȟ_late.
(98) Additionally, Ȟ_late is decomposed into a set of compact matrices H_late(k) which contain the elements of each frequency bin k. Thus, H_late(k) contains the k-th values on the diagonals of the submatrices of Ȟ_late.
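The assembly of the per-bin matrices can be sketched as follows (a naive DFT stands in for the FFT; the filter contents and sizes are illustrative):

```python
import cmath

def dft(x):
    """Naive DFT; in practice an FFT would be used instead."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N))
            for k in range(N)]

def per_bin_matrices(filters, nfft):
    """filters[i][j]: late-reflection FIR from node j to node i.

    Zero-pads each filter to nfft, transforms it, and returns a list
    indexed by frequency bin k; entry k is the compact matrix holding
    the k-th spectral value of every filter (the H_late(k) of the text).
    """
    P = len(filters)
    spectra = [[dft(list(filters[i][j]) +
                    [0.0] * (nfft - len(filters[i][j])))
                for j in range(P)] for i in range(P)]
    return [[[spectra[i][j][k] for j in range(P)] for i in range(P)]
            for k in range(nfft)]

filters = [[[1.0, 0.0], [0.0, 0.5]],
           [[0.0, 0.5], [1.0, 0.0]]]
H = per_bin_matrices(filters, nfft=4)
```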
(99) By taking the locations of the nodes into consideration, a dictionary matrix is constructed that relates a spatially subsampled (just spatially discrete sampling points of the wave fields are given by the respective nodes) loudspeaker signal in the frequency domain to a representation in a spatiotemporal transform-domain.
(100) This representation is chosen such that the late reverberations of the relative transfer functions are sparse, for example, a dictionary of plane waves as provided by equation (4) is used.
(101) Using the normal vectors calculated as described above, a set of plane waves Y_des,OP is defined with the aim to reconstruct the given array structure.
(102) The direction of the wave vector of each plane wave is determined by one normal vector obtained from a previous step. These plane waves are then set as the diagonal of a diagonal matrix Θ(k).
(103) A matrix D^+(k) is calculated as an estimator minimizing the cost function

J = ||vec{D^+(k)H(k)}||_1 + ||H(k) - D(k)D^+(k)H(k)||_F^2,  (6)

(104) where H(k) = H_late(k). The cost function is given in a frequency-selective form, so that D^+ = D^+(k) with the respective frequency bin k. The minimization is achieved, for example, as shown in [Helwani et al. 2014].
(105) A filter matrix W' is obtained by solving the linear system

Θ(k) = W'(k) D^+(k) H(k)  (7)

(106) The spatial filters for preprocessing the microphone signals for the frequency bin k are then obtained by:

W(k) = D(k) W'(k)  (8)
(107) The filters for the early reflections are used to create a beamformer for each node, for a selected subset of the nodes or for virtual nodes that are obtained by interpolating the relative transfer functions with a suitable interpolation kernel such as the Green's function for sound propagation in free-field.
(108) The beamformer is designed to exhibit spatial zeros in the directions of the other nodes, a subset of the other nodes or interpolated virtual nodes.
(109) These beamformers B are obtained by solving the following linear system in the time or frequency domain:
F = H_early W_early  (9)
(110) In this formula, F is a block diagonal matrix, whose diagonal elements are column vectors representing a pure delay filter.
(111) The inversion can be approximated by setting the subcolumns of W_early as the time-reversed versions of the FIR filters represented in H_early and by applying a spatial window. To understand the role of the window, it is helpful to note that the calculation of W_early can be done column-wise. Each column calculates prefilters for all nodes to obtain (or, for the reproduction session, to reproduce) an independent signal for one node. The window penalizes the nodes in a frequency-dependent manner by multiplying the node signal with a value between 0 and 1 according to the value of the scalar product of its normal vector with the normal vector of the desired independent node. Low values have a high penalty, while the highest penalty is a multiplication with zero. The lower the frequency, the lower is the penalization for the nodes.
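The windowing rule can be sketched as follows; the exact frequency dependence is not specified above, so the linear blend below is an assumption:

```python
def spatial_window(normal, desired_normal, freq, f_max):
    """Frequency-dependent penalty weight in [0, 1] for one node.

    normal, desired_normal: unit normal vectors as (x, y) tuples.
    freq, f_max: current and maximum frequency in Hz.
    Alignment is the scalar product of the two normals, clamped to
    [0, 1]; low frequencies are penalized less (illustrative blend).
    """
    align = normal[0] * desired_normal[0] + normal[1] * desired_normal[1]
    align = max(0.0, min(1.0, align))
    blend = freq / f_max  # 0 at DC: no penalty; 1 at f_max: full penalty
    return (1.0 - blend) * 1.0 + blend * align

# A node facing away from the desired direction is fully suppressed
# at the highest frequency but left untouched at DC.
w_hi = spatial_window((-1.0, 0.0), (1.0, 0.0), freq=8000.0, f_max=8000.0)
w_lo = spatial_window((-1.0, 0.0), (1.0, 0.0), freq=0.0, f_max=8000.0)
```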
(112) In a different and more advantageous embodiment, the inversion is done in the frequency domain by solving the system:

F(k) = H_early(k) W_early(k)  (10)
(113) Finally, the prefilters of the early and late reflection parts are merged to a common filter. One possible embodiment of merging the filter parts is given by the following calculation:
H^-1 = (I + W_early H_late)^-1 W_early  (11)
(114) An alternative embodiment of merging the filter parts is given by the calculation:
H^-1 = (W H_early + I)^-1 W  (12)
(115) Here, I denotes the identity matrix.
(116) The calculation (11) can be understood according to the following consideration for a microphone signal y and an excitation x from loudspeakers at the same positions of the microphones or in their near proximities:
H_early^-1 (H_early + H_late) x = H_early^-1 y,  (13)

(I + H_early^-1 H_late) x = H_early^-1 y.  (14)
(117) Further, H_early^-1 is approximated with W_early and H_late^-1 is approximated with W.
(118) Equation (12) is obtained in an analogous way by replacing H_early^-1 on both sides of (13) by W.
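In the single-channel case all matrices reduce to scalars per frequency bin, and the merge of the early and late prefilters can be checked directly (a toy special case, not the general matrix computation):

```python
def merge_filters_scalar(w_early, h_late):
    """Scalar special case of merging early and late prefilters.

    The matrix merge (I + W_early H_late)^-1 W_early reduces, for
    scalars per frequency bin, to w_early / (1 + w_early * h_late).
    """
    return w_early / (1.0 + w_early * h_late)

# With w_early = 1 / h_early, the merged filter inverts the complete
# response h = h_early + h_late:
h_early, h_late = 2.0, 0.5
w = merge_filters_scalar(1.0 / h_early, h_late)
# w * (h_early + h_late) should be 1
```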
(119) 3. Similarly, the relative transfer functions for the reproduction session are determined and preprocessing filters represented in a matrix B are calculated. The steps for determining the transform matrix for the digital content and output digital content, i.e. concerning the recording and reproduction session, respectively, are identical.
(120) 4. The actual remixing is performed in an embodiment by prefiltering the microphone signals and by multiplying the output with the inverse of the discretized free-field Green's function. The function is used as a multiple-input/output FIR matrix representing the sound propagation between the positions of the microphones and loudspeakers after overlaying the two array geometries (one for the recording session and one for the reproduction session) in one plane with coinciding centroids and at a rotation angle determined by the user or chosen randomly.
(121) The Green's function G describes the undisturbed or free-field propagation from the sources (here the locations of the sensors) in the recording room to the sinks (here the actuator locations) in the reproduction room.
(122) Performing the inversion of the Green's function matrix incorporates a predelay in the forward filters representing the Green's function especially in the case where the position of a recording node after the overlay process lies within the chosen convex hull at the reproduction side.
(123) The loudspeaker signals are obtained by convolving the filtered microphone signals with the inverse of the Green's function calculated previously and then with the calculated beamformer inverse of the relative transfer function as described in the last step.
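The free-field Green's function matrix between sensor and actuator positions can be sketched as follows (2-D geometry and the monopole 1/(4*pi*r) convention are assumptions):

```python
import cmath
import math

def greens_matrix(sources, sinks, omega, c=343.0):
    """Free-field Green's function between sources and sinks.

    G[m][n] = exp(-j k r) / (4 pi r) with r the distance between
    source n and sink m; positions are (x, y) tuples.
    """
    k = omega / c
    G = []
    for s in sinks:
        row = []
        for q in sources:
            r = math.hypot(s[0] - q[0], s[1] - q[1])
            row.append(cmath.exp(-1j * k * r) / (4 * math.pi * r))
        G.append(row)
    return G

mics = [(0.0, 0.0)]
spks = [(1.0, 0.0), (2.0, 0.0)]
G = greens_matrix(mics, spks, omega=2 * math.pi * 343.0)  # k = 2*pi
```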
(124) If the position of the microphone in a recording is unknown but the recording is compatible with a legacy format such as stereo, 5.1, 22.2, etc., the microphones corresponding to each recording channel are treated as virtual microphones set at the positions recommended by the corresponding standard.
(125) 5. For the reproduction session, several subarrays are involved e.g. in the synthesis of a prefiltered microphone signal according to the previously presented steps.
(126) Subarrays allow to reduce the complexity of the calculations. In an embodiment, using subarrays is based on the embodiment in which the nodes contain more than one sensor and/or more than one actuator.
(127) The previously described embodiment of spatial coding can be regarded as a statistically optimal realization according to the cost function (6). Alternatively, a simplified deterministic spatial coding can be used in an embodiment.
(128) Here, different cases are realized by different embodiments:
(129) Case a
(130) The original native channels, i.e. the original digital content is kept by a lossless spatial coding. In an embodiment, each of these channels is then coded temporally.
(131) Case b
(132) Case b.1: If the rendering setup (i.e. the location of the loudspeakers or actuators of the reproduction session) is known at the capturing time of the recording session, then a signal description, i.e. a description of the digital content, is given by decomposing the signal into spatially independent signals that sum up to an omnidirectional microphone signal. Spatially independent implies creating a beam pattern having a look direction towards one loudspeaker and exhibiting spatial nulls in the directions of the other beamformers. The level of each beam is normalized such that summing up the signals results in an omnidirectional signal. If the position of the loudspeakers is unknown and the multichannel recording is given by Q signals, optimally, Q beams each with Q-1 spatial nulls are created. Filtering the microphone signals with those constrained beamformers gives Q independent spatial signals that correspond ideally to a localized independent source.
(133) Case b.2: If the rendering loudspeaker setup is located within the area surrounded by the recording microphone array, then the spatial nulls (with regard to the direction of arrival (DOA), i.e. the angle) correspond to sectors of quiet zones according to [Helwani et al., 2013], or a focused virtual sink with a directivity pattern is synthesized, which can be achieved by a superposition of focused multipole sources according to the WFS (wave field synthesis) theory and time reversal cavity [Fink]. These sectors of quiet zones are centered around the center of gravity of the area enclosed by the microphone array.
(134) Case b.3.1: If the two manifolds of the recording session and the reproduction session approximately coincide according to a predefined region of tolerance, each loudspeaker plays back the sound recorded by the corresponding microphone.
(135) Case b.3.2: If the manifolds defined by the sensors and the actuator distribution are approximately the same up to a certain shift, then this shift is compensated by the reproduction filter.
(136) Case b.4: Inverse modeling by calculating a system that inverts the room acoustics of the reproduction room, in a frequency-selective manner and by assuming free-field propagation unless the acoustics of the reproduction room are known.
(137) Case c
(138) In the more general case, if the setup of the reproduction session is not known at the capturing time of the recording session, a virtual reproduction array is assumed and the scheme according to case b is applied. From this virtual array, the wave field is then extrapolated to the actual loudspeaker positions in the reproduction room using WFS [Spors] techniques to synthesize virtual focused sound sources. Hereby the elements of the virtual loudspeaker array are treated as new sound sources.
(139) Case d
(140) The spatial codec imports multichannel audio signals without metadata by placing virtual sources either randomly for each channel or according to a lookup table that maps a certain channel number, e.g., 6 channels, to a legacy multichannel setup such as 5.1; 2 channels are treated as stereo with 2 virtual sources such that a listener at the centroid of the array has the impression of two sources at -30° and +30°.
(141) In a further embodiment a reduction of the number of channels is performed.
(142) In one version, a principal component analysis (PCA) or an independent component analysis (ICA) is performed across the channels after the beam forming stage in order to reduce the number of channels. The temporal delays between the individual channels are compensated before the (memoryless) PCA is applied [Hyvarinen]. Delay compensations and PCA are calculated in a block-by-block manner and saved in a separate data stream. The above mentioned temporal coding is then applied to each of the resulting channels of the beam former outputs or the optional PCA outputs.
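The reduction step can be sketched as follows (pure Python; power iteration stands in for a full PCA library, and only the dominant eigenpair is extracted):

```python
def covariance(channels):
    """Sample covariance matrix of zero-mean channel signals."""
    n = len(channels[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in channels]
            for ci in channels]

def dominant_eigenpair(C, iters=200):
    """Largest eigenvalue/eigenvector of a symmetric matrix
    via power iteration."""
    size = len(C)
    v = [1.0] * size
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(size)) for i in range(size)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    lam = sum(v[i] * sum(C[i][j] * v[j] for j in range(size))
              for i in range(size))
    return lam, v

# Two fully correlated channels collapse onto a single eigenchannel.
ch1 = [1.0, -1.0, 1.0, -1.0]
ch2 = [1.0, -1.0, 1.0, -1.0]
lam, v = dominant_eigenpair(covariance([ch1, ch2]))
```

Eigenvalues below the chosen threshold would then be dropped, reducing the channel count.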
(143) Other embodiments for the remixing are based on the following remixing techniques in the case that the digital content refers to audio signals:
(144) In case of Higher Order Ambisonics (HOA) [Daniel] order j-to-k with j>k: A spatial band stop is applied on the first k coefficients of the spherical harmonics to obtain a lower-order ambisonics signal which can be played back with a lower number of loudspeakers. Here, j is the number of input channels and k the number of output channels of a remixing step.
(145) In the case of k>j, compressed sensing regularization is applied (analogously to the criterion (6)) on the regularity of the sound field (sparsity of the total variation) [Candès].
(146) In the case of N-to-Binaural, i.e. in the case of reducing N input channels to a reproduction using earphones:
(147) For allowing a consuming user U to listen to a multichannel recorded signal as digital content with an arbitrary number of microphones as sensors located at random known locations, a virtual array of loudspeakers (vL1, vL2, vL3) emulated with a dataset of Head-Related Transfer Functions (HRTF) is used to create a virtual sink at the position of the real microphones.
(148) The signal as digital content is convolved with the focusing operator first and then with the set of HRTFs as shown in
(149) The position of the focused sinks is related to the position of the recording microphone.
(150) Hence in one embodiment, the HRTFs are prefiltered by the focusing operator which is, for example, modelled as a SIMO (Single Input/Multiple Output) FIR (Finite Impulse Response) filter with N as the number of the HRTF pairs (e.g., two filters for the left and right ears at each degree of the unit circle) and the length L as resulting from the Kirchhoff-Helmholtz integral.
(151) The multichannel output is convolved with the HRTF pairs resulting in a MIMO (Multiple Input Multiple Output) system of N inputs and two outputs with a filter length determined by the HRTF length.
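The N-inputs-to-two-ears convolution can be sketched as follows (the impulse-like HRTFs are toy values; real HRTF datasets are measured):

```python
def convolve(x, h):
    """Linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def binaural_mix(channels, hrtf_pairs):
    """MIMO N-to-2 system: convolve channel n with its HRTF pair, sum.

    channels: list of N signals; hrtf_pairs: list of (left, right) FIRs.
    Returns the (left_ear, right_ear) signals.
    """
    length = max(len(c) + len(h[0]) - 1
                 for c, h in zip(channels, hrtf_pairs))
    left = [0.0] * length
    right = [0.0] * length
    for c, (hl, hr) in zip(channels, hrtf_pairs):
        for k, v in enumerate(convolve(c, hl)):
            left[k] += v
        for k, v in enumerate(convolve(c, hr)):
            right[k] += v
    return left, right

# One impulse channel routed fully to the left ear.
L, R = binaural_mix([[1.0, 0.0]], [([1.0, 0.5], [0.0, 0.0])])
```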
(152) Different application cases are possible:
(153) N-to-M with N separated input signals:
(154) In this case the separated input channels are considered as point sources of a synthetic sound field. For the synthesis, higher order ambisonics, wave field synthesis or panning techniques are used.
(155) 5.1 Surround-to-M:
(156) A 5.1 file is rendered by synthesizing a sound field with six sources at the recommended locations of the loudspeakers in a 5.1 specification.
(157) In one embodiment, the adaptation of the digital content recorded in a recording session to the reproduction in a reproduction session happens by the following steps:
(158) For the recording, a given number Q of smartphones are used as sensors. These are placed randomly in a capturing room or recording scenario. The sound sources are surrounding the microphones and no sound source is in an area enclosed by the sensors.
(159) The recording session is started, in which the sensors/microphones/smartphones as capturing devices are synchronized by acquiring a common clock signal. The devices perform a localization algorithm and send their (relative) locations to the central unit as metadata as well as GPS data (absolute locations).
(160) The spatial sound scene coding is performed targeting a virtual circular loudspeaker array with a number Q' of elements surrounding the smartphones, wherein Q' <= Q. Accordingly, Q' beamformers each having (Q'-1) nulls are created with the null-steering technique [Brandstein, Ward: Microphone Arrays].
(161) The microphone signals are filtered with the designed beamformers and a channel reduction procedure is initialized based on a PCA technique [Hyvarinen] with a heuristically defined threshold allowing to reduce the number of channels by ignoring eigenvalues lower than this threshold. Hence, the PCA provides a downmix matrix with Q' columns and D <= Q' rows.
(162) The filtered signals are multiplied with the downmix matrix resulting in D eigenchannels. These D channels are temporally coded using, for example, Ogg Vorbis. The eigenvectors forming the downmix matrix are stored as metadata. All metadata are compressed using e.g. a lossless coding scheme such as a Huffman codec. This is done by the calculator 3 which is partially located, for example, via subunits C1.i (i=1, . . . , 4) at the individual sensors Mi (i=1, . . . , 4).
(163) Reproduction of the digital content recorded in the recording session is done with P loudspeakers that can be accurately localized and start a reproduction session as described above.
(164) The P (here P=4) loudspeakers L1, L2, L3, L4 receive the D (here also D=4) channels from the central unit CU, which can also be named the platform, and upmix the eigenchannels according to the downmix matrix stored in the metadata. The upmix matrix is the pseudoinverse of the downmix matrix. Accordingly, the calculator 3 comprises subunits C3.i (i=1, . . . , 4) located within the reproduction session adapting the reproduction-session-neutral modified content to the current reproduction session.
(165) The array then synthesizes, according to the location of the loudspeakers L1, L2, L3, L4 as actuators and according to the description in the reproduction session, virtual sources at the positions of the virtual loudspeakers assumed during the recording session.
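The upmix step can be sketched as follows; for a downmix matrix with orthonormal rows (eigenvectors), the pseudoinverse is simply the transpose (a hypothetical 2-to-1 example):

```python
import math

def transpose(M):
    return [list(row) for row in zip(*M)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Downmix matrix with one orthonormal row (D = 1 eigenchannel from
# Q' = 2 beamformer outputs).
s = 1.0 / math.sqrt(2.0)
downmix = [[s, s]]

# Downmix one frame of two identical channel samples, then upmix
# via the transpose (the pseudoinverse for orthonormal rows).
frame = [1.0, 1.0]
eigen = matvec(downmix, frame)            # eigenchannel sample(s)
restored = matvec(transpose(downmix), eigen)
```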
(166)
(167) A duplex communication system is a point-to-point system allowing parties to communicate with each other. In a full duplex system, both parties can communicate with each other simultaneously.
(168) Here, just one party with one user is shown. In the duplex session, the user is a signal source S1 for a recording session and also a consuming user U for the reproduction session. Hence, a duplex session is a combination of these two different sessions.
(169) With regard to the recording session, the audio signals of the user as a content source S1 are recorded by a microphone as sensor M1. The resulting digital content is submitted via the input channel I1 of the input interface 2 to the central unit CU. The digital content is received by the central unit CU and is used by the calculator 3 for providing output digital content. This output digital content is output at the other (not shown) side of the central unit CU connected with the other communication party.
(170) In the shown embodiment, the calculator 3 is completely integrated within the central unit CU and performs here all calculations for adapting the recorded data to the reproduction session.
(171) At the same time, the user is a consuming user U listening to the audio signals provided by the two actuators L1, L2. The actuators L1, L2 are connected to the two output channels O1, O2 of the output interface 4.
(172) If a duplex session is started, the nodes (here: the two loudspeakers L1, L2 and the microphone M1) provide information about their electroacoustical I/O interfaces and about their locations or about the location of the content source S1 and the consuming user U. Optionally, they allow a calibration, for example, initiated by the central unit CU.
(173) In the shown embodiment, the data storage is omitted as real-time communication is desired.
(174) In an embodiment, a multichannel acoustic echo control such as, for example, described in [Buchner, Helwani 2013] is implemented. In one embodiment, this is done centrally at the calculator 3. In a different embodiment, this is performed in a distributed manner on the nodes L1, L2, M1.
(175) In
(176) Here, four microphones M1, M2, M3, M4 record audio signals stemming from three sources S1, S2, S3. The respective audio signals are transmitted as digital content using the input interface 2 to the calculator 3. The calculated output digital content comprising audio signals appropriate to the reproduction session is output via the output interface 4 to nine loudspeakers L1 . . . L9. This shows that the calculator 3 has to adapt the digital content recorded by four microphones to the requirements of a reproduction session using nine loudspeakers. In the reproduction session a wave field is generated by applying the output digital content with different amplitudes and different phases to the individual loudspeakers L1 . . . L9.
(177) Due to the ad-hoc setups, the array geometries on the recording and/or reproduction side are not known in advance, and typically the setup on the reproduction side will differ from the setup on the recording side. Hence, the transmission is performed in the shown embodiment in a neutral format that is independent of the array geometries and, ideally, also independent of the local acoustics in the reproduction room. The calculations for the transmission are performed by the calculator 3 and are here summarized by three steps performed e.g. by different subunits or only by a server as a central unit: W^(rec), G, and W^(repro).
(178) On the recording side, the filter matrix W^(rec) produces the spatially neutral format from the sensor array data, i.e. from the recorded digital content.
(179) Using the neutral format, the data are transmitted (note that in one embodiment a temporal coding is additionally applied to each component of the neutral format) and processed by the filter matrix G. Specifically, for reproducing the signals on the reproduction side by placing (recorded) source signals at specific geometrical positions, the matrix G is the free-field Green's function.
(180) Finally, the filter matrix W^(repro) creates the driving signals of the loudspeakers by taking into account the actual locations of the loudspeakers and the acoustics of the reproduction room.
(181) The calculation steps of the two transformation matrices W^(rec) and W^(repro) are analogous and are described below. Without loss of generality, only the steps for the reproduction side are described in the following.
(182) As a special case, the block diagram of
(183) The overall goal of the embodiment is a decomposition of the wave field into mutually statistically independent components, where these signal components are projections onto certain basis functions.
(184) The number of mutually independent components does not have to be the same as the number of identified normal vectors (based on the convex hulls). If the number of components is greater than the number of normal vectors, then the possibility is given of using linear combinations of multiple components. This allows for interpolations in order to obtain higher-resolution results.
(185) It follows a summary of steps to calculate an equalization filter matrix W, shown exemplarily for the reproduction side, i.e., W=W^(repro).
1. Measure the acoustic impulse responses between the nodes of the distributed reproduction system. In one embodiment, a close proximity of loudspeaker and corresponding microphone is assumed within each of the nodes so that they can be considered as being colocated. The impulse responses from each of the nodes to itself are also measured (relative transfer function). In total this gives a whole matrix of impulse responses.
2. Localize the relative geometric positions of the nodes of the reproduction system.
3. Based on the result of step 2, calculate the convex hull (e.g. Bezier curve) through the nodes and calculate the normal vectors (in one embodiment according to the above described seven steps).
4. For equalization of the reproduction room and normalization of the loudspeaker array geometry:
(186) Each transfer function is divided in the time domain into early and late reflection parts, i.e., H=H.sup.early+H.sup.late. An equivalent formulation using convolution matrices is given by equations (1) through (3). 4.1. To estimate the equalization filter based on the late reflections: 4.1.1. Calculate the frequency-domain representation of the late-reflection part of the measured impulse response matrix, H.sup.late(k), where k denotes the number of the frequency bin. 4.1.2. Define matrix according to equation (4) using the positions of the nodes and the normal vectors (steps 2 and 3 above). The elements of can be regarded as plane waves which will be used as basis vectors in the following steps. The vectors x.sub.i are positon vectors of the nodes, i.e. of the sensors and/or actuators, and are, thus, spatial sampling points. The vectors k.sub.i are wave vectors having directions of the normal vectors of the convex hull. 4.1.3. By minimizing the cost function (6), the matrix .sup.+ is obtained from and from H.sub.late(k). This optimization reconstructs a set of plane waves from the spatial sampling points. Due to the I.sub.1 norm in (6), the matrix .sup.+ will be optimized in such a way that the vector vec(.sup.+H.sub.late(k)) describes the minimum number of plane waves (sparseness constraint). Hence, the system H.sub.late(k) is represented in a lower-dimensional transform domain by decomposing it in a statistically optimal way into plain wave components. 4.1.4. The equalization filter W(k) in the compressed domain in obtained by solving equation (7) for W(k), e.g., using the Moore-Penrose pseudoinverse. Here, (k) is a diagonal matrix containing plain waves according to the array normal vectors from above as the target. 4.1.5. The equalization filter W(k)=W.sub.late(k) in the original (higher-dimensional) domain is obtained from W(k) according to equation (8). 4.2. 
To estimate the equalization filter based on the early reflections: Solve equation (9) for the equalization filter W.sub.early. This calculation is performed in the frequency domain according to equation (10). 4.3. The overall equalization filter is obtained by merging the early and the late reflection parts according to equation (11) or equation (12).
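The late-reflection branch (steps 4.1.1 through 4.1.5) can be sketched in a few lines. This is a minimal illustration only: the sparse, l.sub.1-regularized decomposition of equation (6) is replaced here by an ordinary Moore-Penrose pseudoinverse, the exact forms of equations (4), (7) and (8) are assumed rather than reproduced, and all function and variable names are illustrative.

```python
import numpy as np

def plane_wave_basis(positions, wave_vectors):
    """Matrix of plane waves e^(-j k.x) sampled at the node positions.

    Rows correspond to the spatial sampling points x_i (node positions);
    columns correspond to wave vectors k_i, whose directions would be
    taken from the normal vectors of the convex hull.
    """
    return np.exp(-1j * positions @ wave_vectors.T)

def equalization_filter_late(H_late, positions, wave_vectors, target):
    """Sketch of steps 4.1.1-4.1.5 for a single frequency bin k.

    H_late -- late-reflection transfer matrix at this bin
    target -- diagonal matrix of target plane waves
    NOTE: the specification obtains the pseudoinverse of the basis by
    minimizing an l1-regularized cost (sparseness constraint);
    np.linalg.pinv is a non-sparse stand-in for that step.
    """
    basis = plane_wave_basis(positions, wave_vectors)
    basis_pinv = np.linalg.pinv(basis)
    # Represent the system in the lower-dimensional transform domain.
    H_compressed = basis_pinv @ H_late @ basis
    # Solve for the compressed-domain filter (cf. equation (7)).
    W_compressed = np.linalg.pinv(H_compressed) @ target
    # Transform back to the original (higher-dimensional) domain (cf. equation (8)).
    return basis @ W_compressed @ basis_pinv
```

For an ideal system (H.sup.late equal to the identity and an identity target), this sketch returns the identity filter, as one would expect.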
(187) Treating the late reflection part separately is based on the discovery that this renders the calculations more stable.
(188) The arrows between the filter matrices W.sup.(rec), W.sup.(repro) and G indicate that information about calculated or predefined locations is passed on to the subsequent step. This means that the information about the calculated locations of the calculated virtual audio objects is used in the step of calculating the virtual microphone signals, and that the information about the predefined locations of the virtual microphones is used for obtaining the filter matrix W.sup.(repro) for generating the audio signals to be reproduced within the reproduction session.
(189) In
(190) For the adaptation of the recorded audio signals to the reproduction session, two filter matrices W.sup.(rec) and W.sup.(repro) and a Green's function G are calculated as explained above. The arrows in the drawing indicate which units of the shown embodiment provide the matrices W.sup.(rec), W.sup.(repro) and the function G.
(191) The central unit CU of the shown embodiment, comprising the calculator 3 for providing the output digital content as well as the input interface 2 and the output interface 4, is here realized as a server. The network connecting the input interface 2, the calculator 3, and the output interface 4 can be realized, at least partially, directly via a hardware connection (e.g. cables) within the server or, e.g., via distributed elements connected by a wireless network.
(192) The central unit CU provides various input interface channels I1, I2, I3 and various output interface channels O1, O2, O3, O4. A user at the recording session and a user at the reproduction session determine the number of actually needed channels for the respective session.
(193) At the recording session, three sensors (here microphones) M1, M2, M3 are used for recording audio signals from two signal sources S1, S2. Two sensors M2 and M3 submit their respective signals to the third sensor M1, which in the shown embodiment is enabled to process the audio signals based on the filter matrix W.sup.(rec) of the recording session. Hence, in this embodiment, the preprocessing of the recorded signals is not performed by each sensor individually but by one sensor. This allows, for example, the use of sensors of different sophistication for the recording. The preprocessing of the recorded signals using the filter matrix W.sup.(rec) provides digital content to be transmitted to the input interface 2 in a recording-session-neutral format.
(194) In one embodiment, this is done by calculating (for example based on the positions of the sensors M1, M2, M3 and/or their recording characteristics and/or their respective transfer functions) audio objects as sources of calculated audio signals that together provide a wave field identical or similar to the wave field given within the recording session and recorded by the sensors. These calculated audio signals are less dependent on each other than the recorded audio signals. In an embodiment, mutually independent objects are aimed for.
(195) Hence, in an embodiment, the preprocessing at the side of the recording session provides digital content for processed audio signals recorded in the recording session. In an additional embodiment, the digital content also comprises metadata describing the positions of the calculated virtual audio objects. The processed audio signals of the digital content are the recorded audio signals in a neutral format, meaning that the dependency on the constraints of the given recording session is reduced. In an embodiment, the digital content is provided based on transfer functions of the sensors M1, M2, M3. In a further embodiment, the transfer functions are used based on the above-discussed splitting into late and early reflections.
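As a rough sketch of this recording-side preprocessing, the filter matrix W.sup.(rec) can be applied per frequency bin and the result bundled with the object-position metadata. The array shapes and the dictionary layout are assumptions for illustration, not part of the specification.

```python
import numpy as np

def encode_to_objects(mic_spectra, W_rec):
    """Apply the recording-side filter matrix per frequency bin.

    mic_spectra -- (K, N_mic) array: one row of microphone spectra per bin
    W_rec       -- (K, N_obj, N_mic) array: one filter matrix per bin
    Returns the (K, N_obj) spectra of the calculated virtual audio objects.
    """
    return np.einsum('kom,km->ko', W_rec, mic_spectra)

def make_digital_content(obj_spectra, obj_positions):
    """Bundle the calculated audio signals (neutral format) with metadata
    describing the positions of the calculated virtual audio objects."""
    return {"signals": obj_spectra,
            "metadata": {"object_positions": obj_positions}}
```

With an identity filter per bin, the object spectra simply equal the microphone spectra, which makes the shape conventions easy to check.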
(196) The digital content is submitted to the three input channels I1,I2,I3 of the input interface 2 of the server, for example, via the internet. In a different or additional embodiment, the digital content is submitted via any phone or mobile phone connection.
(197) The calculator 3 receives the digital content comprising the calculated audio signals and, as metadata, the information about the positions of the calculated virtual audio objects.
(198) The calculator 3 of the central unit CU calculates, based on the digital content and using a filter matrix (which in one embodiment is the Green's function G), signals for virtual microphones that are located at predefined or set locations. In one embodiment, the virtual microphones are positioned such that they surround the positions of the sensors and/or the positions of the calculated virtual audio objects. In an embodiment, they are located on a circle.
(199) Thus, the calculator 3 receives the calculated audio signals that are dependent on the positions of the calculated virtual audio objects. Based on these signals, the calculator 3 provides virtual microphone signals for virtual microphones. The output digital content comprises these virtual microphone signals and, in one embodiment, the positions of the virtual microphones as metadata. In a different embodiment, the positions are known to the receiving actuators or any other element receiving data from the output interface 4, so that the positions do not have to be transmitted. The virtual microphone signals for the virtual microphones are independent of any constraint of the recording and the reproduction session, especially independent of the locations of the respective nodes (sensors or actuators) and the respective transfer functions. The virtual microphone signals are output via the output channels O1, O2, O3, O4 of the output interface 4.
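A minimal sketch of this step, assuming the free-field Green's function e^(-jkr)/(4*pi*r) as the filter matrix and virtual microphones evenly spaced on a circle; the speed of sound, the geometry, and all names are illustrative assumptions rather than values from the specification.

```python
import numpy as np

C_SOUND = 343.0  # assumed speed of sound in m/s

def circle_positions(n, radius):
    """n virtual microphones evenly spaced on a circle in the z=0 plane."""
    phi = 2 * np.pi * np.arange(n) / n
    return np.stack([radius * np.cos(phi),
                     radius * np.sin(phi),
                     np.zeros(n)], axis=1)

def greens_matrix(mic_pos, obj_pos, f):
    """Free-field 3-D Green's function between each virtual audio object
    and each virtual microphone at frequency f (positions must not
    coincide, since the function diverges for r = 0).
    mic_pos: (M, 3), obj_pos: (O, 3); returns an (M, O) matrix."""
    k = 2 * np.pi * f / C_SOUND
    r = np.linalg.norm(mic_pos[:, None, :] - obj_pos[None, :, :], axis=-1)
    return np.exp(-1j * k * r) / (4 * np.pi * r)

def virtual_mic_signals(obj_spectra, mic_pos, obj_pos, f):
    """Virtual microphone spectra at one frequency: G applied to the
    spectra of the calculated virtual audio objects."""
    return greens_matrix(mic_pos, obj_pos, f) @ obj_spectra
```

For a single object at the center of the circle, all virtual microphones are equidistant from it, so the magnitudes of their signals coincide.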
(200) On the receiving side of the output digital content (i.e. at the reproduction side), the output digital content is received by one actuator L1 that adapts it to the requirements of the given reproduction session. The adaptation of the output digital content to the number and locations of the actuators is done using the filter matrix W.sup.(repro). In order to gather the information about the actuators L1, L2, L3, L4, each actuator is provided with a microphone. The microphones allow, e.g., obtaining information about the output characteristics, the positions, and the transfer functions of the actuators.
(201) The system 1 consists of a server as a central unit CU. Sensors M1, M2, M3 record audio signals from signal sources S1, S2 and (here realized by one sensor) provide digital data comprising calculated audio signals describing calculated virtual audio objects located at calculated positions. The calculator 3 provides, based on the received digital content, the output digital content with signals for virtual microphones, wherein the signals for the virtual microphones generate a wave field comparable to that associated with the calculated audio signals of the calculated virtual audio objects. This output digital content is afterwards adapted to the parameters and situations of the reproduction session.
(202) The adaptation of the recorded audio signals from the conditions of the recording session to the conditions of the reproduction session thus comprises three large blocks with different types of transformations:
(203) First, transforming the recorded signals into calculated audio signals of calculated virtual audio objects located at calculated positions (this is done using the filter matrix W.sup.(rec)). Second, transforming the calculated audio signals into virtual microphone signals for virtual microphones located at set positions (this is done using the Green's function as an example of the filter matrix G).
(204) Third, transforming the virtual microphone signals for the virtual microphones into the signals that are to be reproduced by the actually given reproduction session (for this, the filter matrix W.sup.(repro) is used).
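The three blocks chain together as three matrix products per frequency bin, as the following hedged sketch illustrates (matrix shapes and names are assumptions for illustration):

```python
import numpy as np

def adapt_recording_to_reproduction(mic_spec, W_rec, G, W_repro):
    """Chain the three transformation blocks at one frequency bin:
    recorded microphones -> calculated virtual audio objects
    -> virtual microphones -> reproduction actuators.
    All matrices are assumed to be given for the bin in question."""
    obj_spec = W_rec @ mic_spec    # block 1: recording-side filter W^(rec)
    vmic_spec = G @ obj_spec       # block 2: Green's-function filter G
    return W_repro @ vmic_spec     # block 3: reproduction-side filter W^(repro)
```

With identity matrices for all three blocks, the reproduction signals equal the recorded spectra, which makes the chaining convention easy to verify.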
(205) As mentioned above, the calculator 3 comprises in an embodiment different sub units. The embodiment of
(206) Some examples about where which steps are performed are given by
(207) In
(208) In
(209) The embodiment of
(210) Finally, the embodiment
(212) The audio signals from various sources (having unknown or even varying locations within the recording session) are recorded by three sensors M1, M2, M3. The sensors M1, M2, M3 are located at different positions and have their respective transfer functions. The transfer functions depend on their recording characteristics and on their location within the recording area, i.e. the room in which the recording is done (here indicated by the wall on the top and on the right side; the other walls may be far away).
(213) The recorded audio signals are encoded by providing calculated audio signals that here describe four calculated virtual audio objects cAO1, cAO2, cAO3, cAO4. For the evaluation in this embodiment, a curve describing a convex hull is calculated that is based on the locations of the sensors M1, M2, M3 and surrounds at least the relevant recording area. In an embodiment, sensors that are too far from a center of the sensors are neglected (i.e. treated as less relevant). The calculated audio signals are independent of the locations of the sensors M1, M2, M3 but refer to the locations of the calculated virtual audio objects cAO1, cAO2, cAO3, cAO4. Nevertheless, these calculated audio signals are less statistically dependent on each other than the recorded audio signals. This is achieved by ensuring in the calculations that each calculated virtual audio object emits signals in just one direction and not in other directions. In a further embodiment, the transfer functions are also considered by dividing them into an early and a late reflection part. Both parts are used for generating FIR filters (see above).
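The time-domain split of an impulse response into early and late reflection parts can be sketched as follows. The fixed 50 ms threshold is a common heuristic for the transition to late reverberation, not a value taken from the specification (which may, e.g., rely on statistical measures of early reflections instead).

```python
import numpy as np

def split_early_late(h, fs, t_split=0.05):
    """Split a room impulse response h (sampled at fs Hz) into early and
    late reflection parts at a fixed time threshold t_split in seconds.
    Returns (h_early, h_late) with h == h_early + h_late, so the two
    parts can be turned into separate FIR filters."""
    n = int(round(t_split * fs))
    h_early = np.zeros_like(h)
    h_late = np.zeros_like(h)
    h_early[:n] = h[:n]   # direct sound and early reflections
    h_late[n:] = h[n:]    # late reverberation tail
    return h_early, h_late
```

By construction the two parts sum back to the original impulse response, mirroring H=H.sup.early+H.sup.late above.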
(214) The transfer of the recorded audio signals with their dependency on the locations of the sensors M1, M2, M3 to the calculated audio signals associated with locations of calculated virtual audio objects cAO1, cAO2, cAO3, cAO4 is summarized by the filter matrix W.sup.(rec) for the recording session. The calculated audio signals are a neutral format of the audio signals and are neutral with regard to the setting of the recording session.
(215) In a following step, the calculated audio signals belonging to the calculated virtual audio objects cAO1, cAO2, cAO3, cAO4 are used for calculating virtual microphone signals for (here six) virtual microphones vM1, vM2, vM3, vM4, vM5, vM6. The virtual microphones vM1, vM2, vM3, vM4, vM5, vM6 are, in the shown embodiment, located on a circle. The calculation for obtaining the signals to be received by the virtual microphones is done, in one embodiment, using the Green's function G as a filter matrix.
(216) In the next step, the virtual microphone signals are used for providing the reproduction signals to be reproduced by the actuators (here shown in
(217) The system and the connected nodes (sensors, actuators) can also be described as a combination of an encoding and a decoding apparatus. Here, encoding comprises processing the recorded signals in such a way that the signals are given in a form independent of the parameters of the recording session, e.g. in a neutral format. The decoding on the other hand comprises adapting encoded signals to the parameters of the reproduction session.
(218) An encoder apparatus (or encoding apparatus) 100 shown in
(219) A filter provider 101 is configured to calculate a signal filter W.sup.(rec) that is based on the locations of the sensors used in the recording session for recording the audio signals 99 and, in this embodiment, on the transfer functions of the sensors, which take the surroundings of the recording session into account. The signal filter W.sup.(rec) refers to the calculated virtual audio objects, which in an embodiment are mutually statistically independent as they emit audio signals in just one direction. This signal filter W.sup.(rec) is applied by the filter applicator 102 to the audio signals 99. The resulting calculated audio signals 991 are the signals which, when emitted by the calculated virtual audio objects, provide the same wave field as that given by the recorded audio signals 99. Further, the filter provider 101 also provides the locations of the calculated virtual audio objects.
(220) Hence, the audio signals 99 that are dependent on the locations of the sensors (and here also on the transfer functions) are transformed into calculated audio signals 991 that describe the virtual audio objects positioned at the calculated locations but are less statistically dependent on each other and, in one embodiment, in particular mutually independent of each other.
(221) In a next step, a virtual microphone processor 103 provides virtual microphone signals for the virtual microphones that are located at set or pre-defined positions. This is done using a filter matrix G, which in an embodiment is the Green's function. Thus, the virtual microphone processor 103 calculates, based on a given number of virtual microphones and their respective pre-known or set positions, the virtual microphone signals that cause the wave field experienced with the calculated audio signals 991. These virtual microphone signals are used for the output of the encoded audio signals 992. The encoded audio signals 992 comprise, in an embodiment, also metadata about the locations of the virtual microphones. In a different embodiment, this information can be omitted because the locations of the virtual microphones are well known to the decoder 200, e.g. via a predefinition.
(222) A decoder apparatus (or decoding apparatus) 200 receives the encoded audio signals 992. A filter provider 201 provides a signal filter W.sup.(repro) that is based on the locations of the actuators to be used for the reproduction of the decoded audio signals 990 and on the locations associated with the encoded audio signals 992 (here, these are the locations of the virtual microphones). The information about the locations is either part of metadata comprised by the encoded audio signals 992 or is known to the decoder apparatus 200 (this especially refers to the shown case in which the encoded audio signals 992 belong to virtual microphones). Based on the location information, the filter provider 201 provides the signal filter W.sup.(repro) that helps to adapt the encoded audio signals 992 to the conditions of the reproduction session. The actual calculation is, in one embodiment, as outlined above.
(223) In the embodiment of
(224) The embodiment shown in
(225) Although some aspects have been described in the context of a system or apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding system/apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, like, for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
(226) The inventive transmitted or encoded signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
(227) Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disc, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
(228) Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
(229) Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may, for example, be stored on a machine readable carrier.
(230) Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
(231) In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
(232) A further embodiment of the inventive method is, therefore, a data carrier (or a non-transitory storage medium such as a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
(233) A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may, for example, be configured to be transferred via a data communication connection, for example, via the internet.
(234) A further embodiment comprises a processing means, for example, a computer or a programmable logic device, configured to, or adapted to, perform one of the methods described herein.
(235) A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
(236) A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
(237) In some embodiments, a programmable logic device (for example, a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are performed by any hardware apparatus.
(238) While this invention has been described in terms of several advantageous embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.