EXTRACTION OF AN AUDIO OBJECT
20220383894 · 2022-12-01
CPC classification: G10L21/0308 (Physics); H04S2400/11 (Electricity); H04S7/30 (Electricity)
Abstract
A method for extracting at least one audio object from at least two audio input signals, each of which contains the audio object. The second audio input signal is synchronized with the first audio input signal, thereby obtaining a synchronized second audio input signal. The audio object is extracted by applying at least one trained model to the first audio input signal and to the synchronized second audio input signal, and the audio object is output. Further, the step of synchronizing the second audio input signal with the first audio input signal includes the steps of: generating audio signals; analytically calculating a correlation between the audio signals to obtain a correlation vector; optimizing the correlation vector to obtain a synchronization vector; and determining the synchronized second audio input signal using the synchronization vector.
Claims
1. A method for extracting at least one audio object from at least two audio input signals, each of the at least two audio input signals comprising the audio object, the method comprising: synchronizing a second audio input signal with a first audio input signal, thereby obtaining a synchronized second audio input signal; extracting the audio object by applying at least one trained model to the first audio input signal and to the synchronized second audio input signal; and outputting the audio object, wherein the step of synchronizing the second audio input signal with the first audio input signal comprises: generating audio signals by applying a first trained operator to the audio input signals; analytically calculating a correlation between the audio signals, thereby obtaining a correlation vector; optimizing the correlation vector using a second trained operator, thereby obtaining a synchronization vector; and determining the synchronized second audio input signal using the synchronization vector.
2. The method according to claim 1, wherein the first trained operator comprises a trained transformation of the audio input signals into a feature domain.
3. The method according to claim 1, wherein the second trained operator comprises at least one normalization of the correlation vector.
4. The method according to claim 1, wherein the second trained operator comprises an iterative method having a finite number of iteration steps, and wherein a synchronization vector is determined in each iteration step.
5. The method according to claim 4, wherein the number of iteration steps of the second trained operator is defined on the user side.
6. The method according to claim 4, wherein, in each iteration step of the second trained operator, a stretched convolution of the audio signal with at least part of the synchronization vector takes place.
7. The method according to claim 4, wherein, in each iteration step, a normalization of the synchronization vector and/or a stretched convolution of the synchronized audio input signal with the synchronization vector takes place.
8. The method according to claim 1, wherein the second trained operator provides for the determination of at least one acoustic model function.
9. The method according to claim 1, wherein the trained model of extracting the audio object provides for at least one transformation of the first audio input signal and the synchronized second audio input signal, in each case in a higher-dimensional representation domain.
10. The method according to claim 1, wherein the trained model of extracting the audio object provides for the application of at least one learned filter mask to the first audio input signal and to the synchronized second audio input signal.
11. The method according to claim 9, wherein the trained model of extracting the audio object provides for at least one transformation of the audio object into the time domain of the audio input signals.
12. The method according to claim 1, wherein the steps of synchronizing and/or extracting and/or outputting the audio object are assigned to a single neural network.
13. The method according to claim 12, wherein the neural network is trained with target training data, the target training data comprising audio input signals and corresponding predefined audio objects, the method comprising the following training steps: forward propagating the neural network with the target training data while obtaining an ascertained audio object; determining an error vector between the ascertained audio object and the predefined audio object; and changing parameters of the neural network by backward propagating the neural network with the error vector if a quality parameter of the error vector exceeds a predefined value.
14. The method according to claim 1, wherein the method is configured to run continuously.
15. The method according to claim 1, wherein the audio input signals are in each case parts of audio signals which are continuously read in and have predefined temporal lengths.
16. The method according to claim 1, wherein the method is configured such that the latency of the method is at most 100 ms, at most 80 ms, or at most 40 ms.
17. A system for extracting an audio object from at least two audio input signals, the system comprising a control unit configured to carry out the method according to claim 1.
18. The system according to claim 17, further comprising: a first microphone for receiving the first audio input signal; and a second microphone for receiving the second audio input signal, the first and second microphone being connectable to the system such that the audio input signals of the microphones are transmitted to the control unit.
19. The system according to claim 17, wherein the system is a component of a mixing console.
20. A computer program having program code, which computer program is configured to carry out the steps of the method according to claim 1 when the computer program is executed on a computer or a corresponding computing unit or on a control unit of a system.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0024] The present invention will become more fully understood from the detailed description given hereinbelow and the accompanying drawings which are given by way of illustration only, and thus, are not limitive of the present invention, and wherein:
DETAILED DESCRIPTION
[0032] The sound 12 is recorded by two microphones 13, 14, each of which generates an audio input signal a1, a2, so that both audio input signals a1, a2 contain the sound 12. Because the microphones 13, 14 are at different distances from the sound 12, the sound 12 appears at different positions in time in the audio input signals a1, a2. In addition, the audio input signals a1, a2 differ from one another due to the acoustic properties of the surroundings and therefore also contain undesired components, caused for example by the propagation paths of the sound to the microphones 13, 14, for example in the form of reverberation and/or suppressed frequencies; these components are referred to within the meaning of the invention as background noise. Within the meaning of the invention, a first acoustic model function M1 reproduces the acoustic influences of the surroundings and the recording characteristics of the microphone 13 on the audio input signal a1 recorded by the first microphone 13. In this respect, the audio input signal a1 mathematically corresponds to a convolution of the sound 12 with the first acoustic model function M1. This applies analogously to a second acoustic model function M2 and to the recorded audio input signal a2 of the second microphone 14.
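The signal model of paragraph [0032] can be illustrated with a minimal numpy sketch. This is our own illustration, not code from the patent: the sound is convolved with two toy acoustic model functions M1, M2 whose direct-path delays differ, so the sound appears at different positions in time in a1 and a2.

```python
import numpy as np

# Illustrative sketch (assumed values, not from the patent): each
# microphone signal is the source sound convolved with an acoustic
# model function (impulse response) for its propagation path.
rng = np.random.default_rng(0)

sound = rng.standard_normal(256)   # the audio object, e.g. a kick sound

# Toy acoustic model functions M1, M2: a direct path plus a delayed,
# attenuated reflection (reverberation); the direct-path delay differs
# per microphone (0 samples vs. 4 samples).
M1 = np.zeros(32); M1[0] = 1.0; M1[10] = 0.30
M2 = np.zeros(32); M2[4] = 0.9; M2[20] = 0.25

a1 = np.convolve(sound, M1)        # audio input signal of microphone 13
a2 = np.convolve(sound, M2)        # audio input signal of microphone 14

# The differing direct-path delays put the sound at different positions
# in time; a cross-correlation peak reveals the offset that the
# synchronization step later compensates.
lag = int(np.argmax(np.correlate(a2, a1, mode="full"))) - (len(a1) - 1)
```

With these toy impulse responses, the correlation peak lands at the 4-sample direct-path delay difference between the microphones.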
[0033] The microphones 13, 14 are connected to the mixing console 10a, so that the audio input signals a1, a2 are transmitted to a control unit 15 of the system 10. The control unit 15 evaluates the audio input signals a1, a2 and, using the method according to the invention, extracts the sound 12 from them and outputs it for further use. The control unit 15 for extracting the audio object 11 is a microcontroller and/or a program code block of a corresponding computer program. The control unit 15 comprises a trained neural network which is in particular forward propagated with the audio input signals a1, a2. The neural network is trained to extract the specific audio object 11, i.e. in the present case the sound 12, from the audio input signals a1, a2 and in particular to separate it from the background noise components of the audio input signals a1, a2. In substance, the effects of the acoustic model functions M1, M2 on the sound 12 in the audio input signals a1, a2 are compensated for.
[0035] According to
[0036] The method steps of synchronizing V1, of extracting V2 the sound 12 and of outputting V3 said sound are assigned to a single, trained neural network, so that the method is designed as an end-to-end method. As a result, it is trained as a whole and runs automatically and continuously, wherein the extraction of the sound takes place in real time, i.e. with a maximum latency of 40 ms.
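The continuous, real-time operation described above can be sketched as follows. The sample rate and the chunking scheme are our assumptions (the text only specifies the latency bound): the continuously read audio signals are split into parts of predefined temporal length short enough to respect the 40 ms latency.

```python
# Minimal sketch (our assumptions: 48 kHz sample rate, one part per
# latency window) of reading a continuous stream in fixed-length parts.
SAMPLE_RATE = 48_000                       # Hz, assumed
MAX_LATENCY_S = 0.040                      # 40 ms real-time bound from the text
CHUNK = int(SAMPLE_RATE * MAX_LATENCY_S)   # samples per part

def stream_in_parts(signal, chunk=CHUNK):
    """Yield consecutive fixed-length parts of a continuously read signal."""
    for start in range(0, len(signal) - chunk + 1, chunk):
        yield signal[start:start + chunk]
```

Each yielded part would then run through the synchronization and extraction steps before the next part arrives.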
[0038] In the second method step V5 of
[0039] The calculation V5 results in a cross-correlation vector k which is shown as a model in
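The analytic calculation V5 of the cross-correlation vector k between the feature-domain signals m1, m2 can be sketched as follows. This is a hedged illustration: the names m1, m2, k follow the text, but the concrete normalization is our assumption, since this step involves no learned parameters.

```python
import numpy as np

# Sketch of step V5: analytically compute the cross-correlation vector k
# between the audio signals m1 and m2 (feature-domain signals produced by
# the first trained operator). The energy normalization is assumed.
def cross_correlation_vector(m1, m2):
    k = np.correlate(m2, m1, mode="full")
    norm = np.linalg.norm(m1) * np.linalg.norm(m2)
    return k / norm if norm > 0 else k

m1 = np.array([0.0, 1.0, 0.0, 0.0])
m2 = np.array([0.0, 0.0, 1.0, 0.0])   # m2 is m1 delayed by one sample
k = cross_correlation_vector(m1, m2)
lag = int(np.argmax(k)) - (len(m1) - 1)
```

The position of the maximum of k encodes the time offset between the two signals, which the subsequent optimization steps refine into the synchronization vector.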
[0040] In the fourth method step in
[0043] The factor d_i corresponds to the extent of the limitation of the cross-correlation vector for iteration step i, the summation taking place over the range ±d_i. This process is repeated until the number of iteration steps I specified on the user side has been carried out. Finally, a stretched convolution V9 of the audio signal m2 with the last calculated synchronization vector s_i takes place, whereupon the synchronized second audio input signal a2′ is calculated and output V7. Calculating the synchronization vector s on the basis of the partial range of the parameters ascertained in the previous iteration step reduces the complexity of the calculations, which accelerates the runtime of the method without impairing its accuracy.
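The stretched convolution V9 corresponds to what is commonly called a dilated convolution. A minimal numpy sketch, with the dilation factor standing in illustratively for the per-iteration limitation factor d_i (the exact relationship is our assumption):

```python
import numpy as np

# Sketch of a "stretched" (dilated) convolution as used in step V9:
# the kernel taps are spaced `dilation` samples apart before convolving.
def dilated_convolve(signal, kernel, dilation):
    """Convolve `signal` with `kernel` whose taps are spaced `dilation` apart."""
    # Insert dilation-1 zeros between kernel taps, then convolve normally.
    stretched = np.zeros((len(kernel) - 1) * dilation + 1)
    stretched[::dilation] = kernel
    return np.convolve(signal, stretched, mode="same")

m2 = np.arange(8, dtype=float)
s_i = np.array([0.5, 0.5])                       # toy synchronization vector
a2_sync = dilated_convolve(m2, s_i, dilation=2)  # toy synchronized signal
```

Stretching the kernel enlarges its temporal reach without adding taps, which is consistent with the runtime argument made above.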
[0045] In the second method step V11, the separation of the audio object 11 from the audio input signals a1, a2′ takes place by applying a second trained model of the neural network to the audio input signals a1, a2′. The parameters of the second trained model were also optimized by the previous training and are in particular dependent on the first trained model of the preceding method step V10. As a result of this method step V11, the audio object 11 is obtained from the audio input signals a1, a2′ and is still in the higher-dimensional representation domain.
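The separation in step V11 can be read together with claim 10, which provides for a learned filter mask. A hedged sketch of mask application in the higher-dimensional representation domain (the mask values here are fixed placeholders; in the method they would come from the second trained model):

```python
import numpy as np

# Sketch of a learned filter mask (claim 10 / step V11): a mask with
# values in [0, 1] is applied elementwise in the representation domain
# to keep the audio object and suppress background noise components.
def apply_mask(features, mask):
    return features * np.clip(mask, 0.0, 1.0)

features = np.array([[1.0, 2.0],
                     [3.0, 4.0]])    # toy feature-domain representation
mask = np.array([[1.0, 0.0],
                 [0.5, 1.0]])        # placeholder for a learned mask
obj = apply_mask(features, mask)     # audio object, still in the feature domain
```

A transformation back into the time domain of the audio input signals (claim 11) would then follow in the next step.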
[0046] In the third method step V12 of
[0047] To ensure that the neural network can reliably extract the audio object 11 from the audio input signals a1, a2, it must be trained before use. This is done, for example, by the training steps V13 to V19 described below, which are shown in a schematic flow chart in
[0048] Predefined audio objects 16 are generated V14 using predefined algorithms for specified audio input signals a1, a2. The predefined audio objects 16 are always of the same type, so that the method is specifically trained with regard to one type of audio objects 16. The generated audio input signals a1, a2 run through the method according to the invention according to
[0049] If the quality parameter exceeds the predefined value, the termination criterion is not met and the gradient of the error vector P is determined in the next method step V18 and backward propagated through the neural network, so that all parameters of the neural network are adjusted. The training method V13 is then repeated with further data sets until the error vector P reaches a sufficiently good value and the query V17 shows that the termination criterion has been met. Then the training process V13 is completed V19 and the method can be applied to real data. Ideally, the audio objects 11 used as predefined audio objects 16 in the training phase are those that are also to be ascertained in the application of the method, for example kick sounds 12 from soccer balls, which sounds have already been recorded.
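The training steps V13 to V19 can be summarized in a simplified numpy sketch. As an assumption for illustration, a toy linear model stands in for the neural network, and mean squared error stands in for the quality parameter of the error vector P:

```python
import numpy as np

# Simplified sketch of the training loop V13-V19 (toy linear model in
# place of the neural network; MSE assumed as the quality parameter).
rng = np.random.default_rng(1)
X = rng.standard_normal((64, 4))           # specified training inputs
w_true = np.array([0.5, -1.0, 2.0, 0.3])
target = X @ w_true                        # predefined audio object 16

w = np.zeros(4)                            # trainable parameters
threshold, lr = 1e-6, 0.1
for step in range(1000):
    ascertained = X @ w                    # forward propagation (V15)
    P = ascertained - target               # error vector P (V16)
    quality = np.mean(P ** 2)              # quality parameter of P
    if quality <= threshold:               # termination criterion (V17)
        break
    w -= lr * (X.T @ P) / len(X)           # gradient step: backward propagation (V18)
```

The loop mirrors the described flow: propagate forward, measure the error vector against the predefined audio object, and adjust parameters via the gradient until the termination criterion is met.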
[0050] The invention being thus described, it will be obvious that the same may be varied in many ways. Such variations are not to be regarded as a departure from the spirit and scope of the invention, and all such modifications as would be obvious to one skilled in the art are to be included within the scope of the following claims.