METHOD FOR PROCESSING AUDIO INPUT DATA AND A DEVICE THEREOF
20240276171 · 2024-08-15
Inventors
CPC classification
H04S7/305
ELECTRICITY
H04S2400/09
ELECTRICITY
G10L21/02
PHYSICS
G10L21/0264
PHYSICS
International classification
Abstract
A computer-implemented method for processing audio input data into processed audio data by using an audio device comprising a microphone, a processor device and a memory holding a plurality of neural networks is presented. The plurality of neural networks are associated with different room types, wherein each room type is associated with one or more reference room acoustic metrics. The method comprises obtaining, by the microphone, room response data, wherein the room response data reflects room acoustics of a room in which the audio device is placed, determining, by using the processor device, the one or more room acoustic metrics based on the room response data, and selecting, by using the processor device, a matching neural network among the plurality of neural networks by comparing the one or more room acoustic metrics with the one or more reference room acoustic metrics associated with the different room types associated with the plurality of neural networks.
Claims
1. A computer-implemented method for processing audio input data into processed audio data by using an audio device comprising a microphone, a processor device and a memory holding a plurality of neural networks, wherein the plurality of neural networks are associated with different room types, wherein each room type is associated with one or more reference room acoustic metrics, said method comprising: obtaining, by the microphone, room response data, wherein the room response data reflects room acoustics of a room in which the audio device is placed, determining, by using the processor device, the one or more room acoustic metrics based on the room response data, selecting, by using the processor device, a matching neural network among the plurality of neural networks by comparing the one or more room acoustic metrics with the one or more reference room acoustic metrics associated with the different room types associated with the plurality of neural networks, and processing the audio input data captured by the microphone in combination with the speech data received into the processed audio data by using the matching neural network.
2. The method according to claim 1, wherein the one or more room acoustic metrics comprise reverberation time for a given frequency band or a set of frequency bands, such as RT60, a Direct-To-Reverberant Ratio (DRR) and/or Early Decay Time (EDT).
3. The method according to claim 1, wherein the plurality of neural networks comprise a generally trained neural network, and the generally trained neural network is selected as the matching neural network in case no matching neural network is found by comparing the one or more room acoustic metrics with the one or more reference room acoustic metrics.
4. The method according to claim 1, wherein the plurality of neural networks have been trained with different loss functions, wherein the different loss functions differ in terms of trade-offs between different distortion types.
5. The method according to claim 1, wherein the audio input data and the processed audio data are multi-channel audio data.
6. The method according to claim 1, wherein the audio device comprises an output transducer, said method further comprising: obtaining speech data originating from a far-end room via a data communication device, wherein the audio device is placed in a near-end room, generating sound by using the output transducer using the speech data received, wherein the room response data captured by the microphone is based on sound generated by the output transducer using the speech data.
7. The method according to claim 1, further comprising: transferring the processed audio data to a far-end device placed in the far-end room, wherein the far-end device is provided with an output transducer arranged to generate sound based on the processed audio data.
8. An audio device comprising: a microphone configured to obtain room response data, wherein the room response data reflects room acoustics of a room in which the audio device is placed, a memory holding a plurality of neural networks, wherein the plurality of neural networks are associated with different room types, wherein each room type is associated with one or more reference room acoustic metrics, a processor device configured to determine the one or more room acoustic metrics based on the room response data, and to select a matching neural network among the plurality of neural networks by comparing the one or more room acoustic metrics with the one or more reference room acoustic metrics associated with the different room types associated with the plurality of neural networks, wherein the processor device is arranged to process the audio input data captured by the microphone in combination with the speech data received into the processed audio data by using the matching neural network.
9. The audio device according to claim 8, wherein the one or more room acoustic metrics comprise reverberation time for a given frequency band, such as RT60, a Direct-To-Reverberant Ratio (DRR) and/or Early Decay Time (EDT).
10. The audio device according to claim 8, wherein the plurality of neural networks comprise a generally trained neural network, and the generally trained neural network is selected as the matching neural network in case no matching neural network is found by comparing the one or more room acoustic metrics with the one or more reference room acoustic metrics.
11. The audio device according to claim 8, wherein the plurality of neural networks have been trained with different loss functions, wherein the different loss functions differ in terms of trade-offs between different distortion types.
12. The audio device according to claim 8, further comprising a data communication device arranged to receive speech data from a far-end room, an output transducer arranged to generate sound based on the speech data received, wherein the room response data captured by the microphone is based on sound generated by the output transducer using the speech data.
13. A non-transitory computer-readable storage medium storing one or more programs configured to be executed by one or more processor devices of an audio device, the one or more programs comprising instructions for performing the method according to claim 1.
Description
BRIEF DESCRIPTION OF DRAWINGS
[0068] The above and other features and advantages will become readily apparent to those skilled in the art by the following detailed description of exemplary embodiments thereof with reference to the attached drawings, in which:
DETAILED DESCRIPTION
[0073] Various embodiments are described hereinafter with reference to the figures. Like reference numerals refer to like elements throughout. Like elements will, thus, not be described in detail with respect to the description of each figure. It should also be noted that the figures are only intended to facilitate the description of the embodiments. They are not intended as an exhaustive description of the claimed invention or as a limitation on the scope of the claimed invention. In addition, an illustrated embodiment need not have all the aspects or advantages shown. An aspect or an advantage described in conjunction with a particular embodiment is not necessarily limited to that embodiment and can be practiced in any other embodiment even if not so illustrated, or if not so explicitly described.
[0075] To be able to associate the room with the matching neural network 102c, room response data 106 can be provided to the audio device 100. The room response data 106 may be data captured by a microphone of the audio device 100 itself, or it may be audio data captured by a microphone of another audio device, e.g. a stand-alone microphone communicatively connected to the audio device 100. The room response data 106 can, as the name suggests, comprise data reflecting the room acoustics of the room in which the audio device 100 is placed.
[0076] The room response data 106 may be obtained in response to sound generated by the audio device 100 itself for the purpose of determining room acoustic metrics. For instance, a test signal, i.e. a sound having a well-defined frequency, amplitude and duration, may be output from the audio device 100, and the room response data 106 may constitute reverberation and/or echo etc. originating from the test signal. In other words, the room response data 106 may then take the form of the test signal as modulated by the room acoustics, and the room acoustics may be determined by determining this modulation of the test signal. Another possibility is that the room response data 106 is formed in response to sound produced by the audio device 100 during a conference call, that is, sound produced based on data generated by a microphone in a far-end device. In a similar manner to the test signal, the signal received from the far-end is a known signal; by comparing the far-end signal before it is output by the loudspeaker with the same signal after it has been modulated by the room acoustics, the room acoustics may be determined.
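The comparison of a known signal with its room-modulated counterpart, as described above, can be sketched as a regularized frequency-domain deconvolution. The function name and the regularization constant below are illustrative assumptions, not taken from this disclosure:

```python
import numpy as np

def estimate_impulse_response(test_signal, recorded, eps=1e-8):
    """Estimate a room impulse response by regularized frequency-domain
    deconvolution of the recorded (room-modulated) signal with the known
    test signal. Names and the eps value are illustrative only."""
    n = len(test_signal) + len(recorded) - 1
    X = np.fft.rfft(test_signal, n)
    Y = np.fft.rfft(recorded, n)
    # Wiener-style regularization avoids division by near-zero bins.
    H = Y * np.conj(X) / (np.abs(X) ** 2 + eps)
    return np.fft.irfft(H, n)
```

The same sketch applies whether the known signal is a dedicated test signal or far-end speech played back during a call, provided the clean reference is available.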
[0077] Once the room response data 106 has been received, one or more room acoustic metrics may be determined by using a processor device 108 of the audio device 100. Generally, the room acoustic metrics may be any metrics that provide information on how sound waves in the room are affected by the room itself as well as by objects placed in the room. By way of example, the one or more room acoustic metrics may comprise reverberation time for a given frequency band, e.g. RT60, a Direct-To-Reverberant Ratio (DRR) and/or Early Decay Time (EDT). The one or more room acoustic metrics may also comprise a room impulse response (RIR) or a room transfer function (RTF). If the room impulse response is determined, the RT60 and other room acoustic metrics may be determined based on the room impulse response.
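Deriving RT60 from a determined room impulse response, as mentioned above, is commonly done with Schroeder backward integration. The sketch below is a minimal illustration of that standard technique; real implementations typically band-filter the impulse response first, and the fit interval chosen here is an assumption:

```python
import numpy as np

def rt60_from_impulse_response(rir, fs, decay_db=60.0):
    """Estimate RT60 from a room impulse response via Schroeder backward
    integration, fitting the -5 dB to -25 dB decay and extrapolating to
    -60 dB. A minimal sketch, not the method claimed in this disclosure."""
    energy = np.asarray(rir, dtype=float) ** 2
    edc = np.cumsum(energy[::-1])[::-1]          # Schroeder energy decay curve
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)
    i0 = np.argmax(edc_db <= -5.0)               # start of fit region
    i1 = np.argmax(edc_db <= -25.0)              # end of fit region
    t = np.arange(len(energy)) / fs
    slope, _ = np.polyfit(t[i0:i1], edc_db[i0:i1], 1)
    return -decay_db / slope
```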
[0078] Based on the room acoustic metrics determined, the matching neural network 102c may be selected. This may be done by having each neural network 102a-d associated with one or more reference room acoustic metrics, i.e. with different room types; to find the matching network 102c, the determined room acoustic metrics can be compared to the different reference room acoustic metrics. Once a neural network having matching room acoustic metrics has been found among the neural networks 102a-d held in a memory 110 of the audio device 100, this neural network may be assigned as the matching network 102c. Different criteria may be used for the comparison. For instance, in case there are several room acoustic metrics, these may be weighted differently. Even though it is herein disclosed that one of the neural networks 102a-d is selected, it is also an option, even though not illustrated, that several matching neural networks are selected. In a setup in which several matching neural networks are used, the audio data formed by the matching neural networks may be combined in a later step before being transferred to e.g. a speaker. When stating that a neural network has matching room acoustic metrics, it may be understood that the neural network is associated with one or more room acoustic metrics which are within a certain tolerance threshold of the determined one or more room acoustic metrics. The tolerance threshold may be determined via empirical data or may be set by an audio engineer during tuning of the audio device.
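The weighted comparison with a tolerance threshold described above could be sketched as a nearest-profile search. The profile names, metric values, weights and tolerance below are purely illustrative assumptions:

```python
import math

# Hypothetical room profiles: each candidate network is described by
# reference metrics. Names and numbers are illustrative, not from the text.
ROOM_PROFILES = {
    "small_office": {"rt60": 0.3, "drr_db": 8.0},
    "conference":   {"rt60": 0.6, "drr_db": 3.0},
    "open_plan":    {"rt60": 0.9, "drr_db": 0.0},
}

def select_network(measured, weights, tolerance=1.0):
    """Return the profile with the smallest weighted metric distance to the
    measured metrics, provided it lies within `tolerance`; None signals a
    fallback to a generic network or a classical suppressor."""
    best, best_dist = None, math.inf
    for name, ref in ROOM_PROFILES.items():
        dist = math.sqrt(sum(
            weights[m] * (measured[m] - ref[m]) ** 2 for m in ref))
        if dist < best_dist:
            best, best_dist = name, dist
    return best if best_dist <= tolerance else None
```

Per-metric weights let, for example, RT60 dominate the decision while DRR only nudges it, mirroring the differently weighted metrics mentioned above.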
[0079] As illustrated, one of the neural networks 102a-d may be a generic neural network 102d that can be used if none of the other neural networks 102a-c matches the room acoustic metrics determined. The generic neural network 102d may be trained on data originating from all sorts of room types, while the other neural networks 102a-c may each be trained for a specific room type. As an alternative, instead of using the generic neural network 102d in case none of the other neural networks 102a-c is found to match, a non-neural-network approach can be used. Put differently, a classical approach for suppressing echo and/or reverberation can be used. By way of example, an echo and/or reverberation suppression device using pre-programmed steps and/or thresholds may be employed.
[0080] By way of example, a first neural network 102a may be associated with a one-person room of approximately 2 square meters and a second neural network 102b may be associated with an eight-person conference room of approximately 20 square meters. In addition to the size of the rooms, the room types may also differ in how crowded they are with people, in how crowded they are with objects (e.g. furniture), in acoustic absorption, etc. For instance, the second neural network 102b may be associated with the eight-person conference room with four persons in the room and the audio device 100 placed on a table approximately in the center of the room. The third neural network 102c may on the other hand be associated with the same eight-person conference room but with eight persons in the room and with the audio device placed close to a wall of the room instead of in the center. Thus, the term room type should in this context be understood from an acoustics perspective. More particularly, any room environment providing a different type of acoustics is to be considered a specific room type. The level of detail, that is, how many different neural networks are provided, may depend on available storage space in the memory as well as on the availability of data. More particularly, if a sufficient amount of data is available for a certain room type, training a neural network for this room type is made possible. To handle room types not matching any of the available room types, the generic neural network 102d may be provided as explained above.
[0081] Regarding the alternative, discussed above, to using the generic neural network 102d in case no matching neural network can be found, a traditional approach, above referred to as the classical approach, i.e. a non-machine-learning approach, may be used. An advantage of such an arrangement is that room types not covered by any of the available neural networks 102a-d, due to a lack of training data, can be handled by a number of pre-determined routines that may provide a more reliable output than a poorly trained neural network. An example of such traditional processing is Residual Echo Suppression (RES). More particularly, the traditional processing may use so-called non-linear, harmonic-distortion RES algorithms as described in the article "Nonlinear residual acoustic echo suppression for high levels of harmonic distortion" by Bendersky et al. at the University of Buenos Aires.
[0082] The room acoustic metrics used for selecting the matching neural network 102c may be a single metric or a combination of metrics. For example, the room acoustic metric may be RT60, i.e. Reverberation Time 60, herein defined as the time required for the sound in a room to decay over a specific dynamic range, usually taken to be 60 dB. The RT60 or other similar acoustic descriptors may also be estimated based on a determined impulse response. By having the different neural networks 102a-c, apart from the generic neural network 102d, associated with different RT60 ranges, it is made possible, once the RT60 measure has been determined for the room of the audio device 100, to find the matching neural network 102c.
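The single-metric case above reduces to a range lookup: each network owns an RT60 interval and the measured value indexes into it. The range boundaries and network names here are illustrative assumptions:

```python
# Hypothetical RT60 ranges per network; the boundaries (in seconds) and
# names are illustrative only, not values from this disclosure.
RT60_RANGES = [
    ("net_dry",    0.0, 0.4),
    ("net_medium", 0.4, 0.8),
    ("net_live",   0.8, 1.5),
]

def network_for_rt60(rt60, default="net_generic"):
    """Pick the network whose RT60 range contains the measured value,
    falling back to a generic network otherwise."""
    for name, lo, hi in RT60_RANGES:
        if lo <= rt60 < hi:
            return name
    return default
```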
[0083] Compensating for the room acoustics can be done in different ways. As illustrated in
[0085] As illustrated by dashed lines, speech originating from a person speaking in the far-end room 302 may be transferred from a far-end data communication device 316 via a data communication network 312, e.g. a mobile data communication network, to a near-end data communication device 314. Far-end speech data, formed by a microphone of the far-end device 306 capturing the speech, may be processed by a far-end digital signal processor (DSP) 320 before being transferred to the near-end device 304, and, after the speech data has been received by the near-end data communication device 314, it may be processed by a near-end DSP 318. The processing between the near-end DSP 318 and the far-end DSP 320 may comprise steps of encoding and decoding and/or steps of enhancing the signal.
[0086] As illustrated, the speech data may be transferred to and output from the speaker, i.e. the output transducer 202, of the near-end device 304. In addition, the far-end speech data may be transferred to an impulse response estimator 322. The far-end speech data output from the output transducer 202 may be picked up by the microphone 200 to obtain modulated far-end speech data, the modulated far-end speech data being the far-end speech data as modulated by the acoustics of the near-end room. The modulated far-end speech data may be passed to the estimator 322. The estimator 322 may then use the far-end speech data and the modulated far-end speech data to determine one or more room acoustic metrics, such as an impulse response. The estimator 322 may comprise a linear echo canceller for determining an impulse response. The linear echo canceller may be configured to estimate the impulse response from the far-end speech data and the corresponding microphone signal. In some embodiments the linear echo canceller may be configured to perform linear echo cancellation, while the selected neural network may be configured to perform residual echo cancellation. The linear echo canceller may comprise an adaptive filter configured for linear echo cancellation by a normalized least-mean-square method. The impulse response estimated by the impulse response estimator 322 may be transferred to a selector 324. The selector 324 can be configured to select the matching neural network 102c, based on the determined one or more room acoustic metrics, among the neural networks 102a-d, also referred to as neural network models.
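The normalized least-mean-square adaptive filter mentioned above can be sketched as follows: the filter weights converge toward the room impulse response while the error output is the echo-cancelled microphone signal. Function name, filter length and step size are illustrative assumptions:

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, filter_len=32, mu=0.5, eps=1e-6):
    """Sketch of an NLMS adaptive filter used as a linear echo canceller:
    `w` adapts toward the room impulse response and `err` is the microphone
    signal with the estimated linear echo removed. Illustrative only."""
    w = np.zeros(filter_len)
    err = np.zeros(len(mic))
    x_buf = np.zeros(filter_len)              # most recent far-end samples
    for n in range(len(mic)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = far_end[n]
        y_hat = w @ x_buf                     # estimated echo
        err[n] = mic[n] - y_hat
        # NLMS update, normalized by the input power in the buffer.
        w += mu * err[n] * x_buf / (x_buf @ x_buf + eps)
    return w, err
```

The converged weights `w` can then serve the dual role described above: as an impulse-response estimate fed to the selector, and as the linear stage ahead of a neural residual-echo suppressor.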
[0087] Once the matching neural network 102c has been selected, based on output provided by the selector 324, near-end speech data, illustrated by solid lines, obtained by the microphone 200 of the near-end device 304, is processed by using the matching neural network 102c. As illustrated, the far-end speech data captured in the far-end room 302 may also be used as input to the matching neural network 102c. An advantage of this is that the echo suppression may be further improved.
[0088] Even though the example presented above refers to speech data, it should be noted that the approach is not limited to speech data, but can be used for any type of audio data, and also for speech data combined with other types of audio data. It should also be noted that, even though not illustrated, the far-end device 306 may also be equipped with the neural networks 102a-d such that echo suppression can be achieved in both directions.
[0090] Optionally, speech data originating from the far-end room 302 may be obtained 408 via the data communication device 314. The speech data obtained can be used for generating sound by using the output transducer 202. As an effect, the room response data captured by the microphone may be based on sound generated by the output transducer. This, in turn, provides that the selection 406 of the matching neural network 102c may be based on the room response data 106 in combination with the speech data received.
[0091] Optionally, the audio input data 104 captured by the microphone 200 in combination with the speech data received may be processed 412 into the processed audio data by using the matching neural network 102c.
[0092] Optionally, the processed audio data may be transferred 414 to a far-end device placed in the far-end room, wherein the far-end device may be provided with an output transducer arranged to generate sound based on the processed audio data.
[0093] Although particular features have been shown and described, it will be understood that they are not intended to limit the claimed invention, and it will be obvious to those skilled in the art that various changes and modifications may be made without departing from the scope of the claimed invention. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The claimed invention is intended to cover all alternatives, modifications and equivalents.
LIST OF REFERENCES
[0094] 100 audio device
[0095] 102a-d neural networks
[0096] 104 audio input data
[0097] 106 room response data
[0098] 108 processor device
[0099] 110 memory
[0100] 200 microphone
[0101] 202 output transducer/speaker
[0102] 204 echo suppression device
[0103] 300 near-end room
[0104] 302 far-end room
[0105] 304 near-end device
[0106] 306 far-end device
[0107] 312 data communication network
[0108] 314 near-end data communication device
[0109] 316 far-end data communication device
[0110] 318 near-end DSP
[0111] 320 far-end DSP
[0112] 322 impulse response estimator
[0113] 324 selector