EXTRACTION OF AN AUDIO OBJECT
20220383894 · 2022-12-01
CPC classification: G10L21/0308 (Physics); H04S2400/11 (Electricity); H04S7/30 (Electricity)
Abstract
A method for extracting at least one audio object from at least two audio input signals, each of which contains the audio object. The second audio input signal is synchronized with the first audio input signal, thereby obtaining a synchronized second audio input signal. The audio object is extracted by applying at least one trained model to the first audio input signal and to the synchronized second audio input signal, and the audio object is output. Further, the step of synchronizing the second audio input signal with the first audio input signal includes the steps of: generating audio signals; analytically calculating a correlation between the audio signals to obtain a correlation vector; optimizing the correlation vector to obtain a synchronization vector; and determining the synchronized second audio input signal using the synchronization vector.
Claims
1. A method for extracting at least one audio object from at least two audio input signals, each of the at least two audio input signals comprising the audio object, the method comprising: synchronizing a second audio input signal with a first audio input signal, thereby obtaining a synchronized second audio input signal; extracting the audio object by applying at least one trained model to the first audio input signal and to the synchronized second audio input signal; and outputting the audio object, wherein the step of synchronizing the second audio input signal with the first audio input signal comprises: generating audio signals by applying a first trained operator to the audio input signals; analytically calculating a correlation between the audio signals, thereby obtaining a correlation vector; optimizing the correlation vector using a second trained operator, thereby obtaining a synchronization vector; and determining the synchronized second audio input signal using the synchronization vector.
2. The method according to claim 1, wherein the first trained operator comprises a trained transformation of the audio input signals into a feature domain.
3. The method according to claim 1, wherein the second trained operator comprises at least one normalization of the correlation vector.
4. The method according to claim 1, wherein the second trained operator comprises an iterative method having a finite number of iteration steps, and wherein a synchronization vector is determined in each iteration step.
5. The method according to claim 4, wherein the number of iteration steps of the second trained operator is defined on the user side.
6. The method according to claim 4, wherein, in each iteration step of the second trained operator, a stretched convolution of the audio signal with at least part of the synchronization vector takes place.
7. The method according to claim 4, wherein, in each iteration step, a normalization of the synchronization vector and/or a stretched convolution of the synchronized audio input signal with the synchronization vector takes place.
8. The method according to claim 1, wherein the second trained operator provides for the determination of at least one acoustic model function.
9. The method according to claim 1, wherein the trained model of extracting the audio object provides for at least one transformation of the first audio input signal and the synchronized second audio input signal, in each case in a higher-dimensional representation domain.
10. The method according to claim 1, wherein the trained model of extracting the audio object provides for the application of at least one learned filter mask to the first audio input signal and to the synchronized second audio input signal.
11. The method according to claim 9, wherein the trained model of extracting the audio object provides for at least one transformation of the audio object into the time domain of the audio input signals.
12. The method according to claim 1, wherein the steps of synchronizing and/or extracting and/or outputting the audio object are assigned to a single neural network.
13. The method according to claim 12, wherein the neural network is trained with target training data, the target training data comprising audio input signals and corresponding predefined audio objects, the method comprising the following training steps: forward propagating the neural network with the target training data while obtaining an ascertained audio object; determining an error vector between the ascertained audio object and the predefined audio object; and changing parameters of the neural network by backward propagating the neural network with the error vector if a quality parameter of the error vector exceeds a predefined value.
14. The method according to claim 1, wherein the method is configured to run continuously.
15. The method according to claim 1, wherein the audio input signals are in each case parts of audio signals which are continuously read in and have predefined temporal lengths.
16. The method according to claim 1, wherein the method is configured such that the latency of the method is at most 100 ms, at most 80 ms, or at most 40 ms.
17. A system for extracting an audio object from at least two audio input signals, the system comprising a control unit configured to carry out the method according to claim 1.
18. The system according to claim 17, further comprising: a first microphone for receiving the first audio input signal; and a second microphone for receiving the second audio input signal, the first and second microphone being connectable to the system such that the audio input signals of the microphones are transmitted to the control unit.
19. The system according to claim 17, wherein the system is a component of a mixing console.
20. A computer program having program code, which computer program is configured to carry out the steps of the method according to claim 1 when the computer program is executed on a computer or a corresponding computing unit or on a control unit of a system.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0024] The present invention will become more fully understood from the detailed description given hereinbelow and the accompanying drawings which are given by way of illustration only, and thus, are not limitive of the present invention, and wherein:
DETAILED DESCRIPTION
[0032] The sound 12 is recorded by two microphones 13, 14, each of which generates an audio input signal a1, a2, so that both audio input signals a1, a2 contain the sound 12. Because the microphones 13, 14 are at different distances from the sound 12, the sound 12 appears at different positions in time in the audio input signals a1, a2. In addition, the audio input signals a1, a2 differ from one another due to the acoustic properties of the surroundings and therefore also contain undesired components, caused for example by the propagation paths of the sound to the microphones 13, 14, for example in the form of reverberation and/or suppressed frequencies; these components are referred to within the meaning of the invention as background noise. Within the meaning of the invention, a first acoustic model function M1 reproduces the acoustic influences of the surroundings and the recording characteristics of the microphone 13 on the audio input signal a1 recorded by the first microphone 13. In this respect, the audio input signal a1 mathematically corresponds to a convolution of the sound 12 with the first acoustic model function M1. This applies analogously to a second acoustic model function M2 and to the recorded audio input signal a2 of the second microphone 14.
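The signal model of paragraph [0032] can be illustrated with a minimal numpy sketch. This is our own illustration, not code from the patent: the sound is convolved with two toy acoustic model functions M1, M2 whose direct-path delays differ, so the sound appears at different positions in time in a1 and a2.

```python
import numpy as np

# Illustrative sketch (assumed values, not from the patent): each
# microphone signal is the source sound convolved with an acoustic
# model function (impulse response) for its propagation path.
rng = np.random.default_rng(0)

sound = rng.standard_normal(256)   # the audio object, e.g. a kick sound

# Toy acoustic model functions M1, M2: a direct path plus a delayed,
# attenuated reflection (reverberation); the direct-path delay differs
# per microphone (0 samples vs. 4 samples).
M1 = np.zeros(32); M1[0] = 1.0; M1[10] = 0.30
M2 = np.zeros(32); M2[4] = 0.9; M2[20] = 0.25

a1 = np.convolve(sound, M1)        # audio input signal of microphone 13
a2 = np.convolve(sound, M2)        # audio input signal of microphone 14

# The differing direct-path delays put the sound at different positions
# in time; a cross-correlation peak reveals the offset that the
# synchronization step later compensates.
lag = int(np.argmax(np.correlate(a2, a1, mode="full"))) - (len(a1) - 1)
```

With these toy impulse responses, the correlation peak lands at the 4-sample direct-path delay difference between the microphones.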
[0033] The microphones 13, 14 are connected to the mixing console 10a, so that the audio input signals a1, a2 are transmitted to a control unit 15 of the system 10. The control unit 15 evaluates the audio input signals a1, a2 and, using the method according to the invention, extracts the sound 12 from them and outputs it for further use. The control unit 15 for extracting the audio object 11 is a microcontroller and/or a program code block of a corresponding computer program. The control unit 15 comprises a trained neural network which is in particular forward propagated with the audio input signals a1, a2. The neural network is trained to extract the specific audio object 11, i.e. in the present case the sound 12, from the audio input signals a1, a2 and in particular to separate it from the background noise components of the audio input signals a1, a2. In substance, the effects of the acoustic model functions M1, M2 on the sound 12 in the audio input signals a1, a2 are compensated for.
[0035] According to
[0036] The method steps of synchronizing V1, of extracting V2 the sound 12 and of outputting V3 said sound are assigned to a single, trained neural network, so that the method is designed as an end-to-end method. As a result, it is trained as a whole and runs automatically and continuously, wherein the extraction of the sound takes place in real time, i.e. with a maximum latency of 40 ms.
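The continuous, real-time operation described above can be sketched as follows. The sample rate and the chunking scheme are our assumptions (the text only specifies the latency bound): the continuously read audio signals are split into parts of predefined temporal length short enough to respect the 40 ms latency.

```python
# Minimal sketch (our assumptions: 48 kHz sample rate, one part per
# latency window) of reading a continuous stream in fixed-length parts.
SAMPLE_RATE = 48_000                       # Hz, assumed
MAX_LATENCY_S = 0.040                      # 40 ms real-time bound from the text
CHUNK = int(SAMPLE_RATE * MAX_LATENCY_S)   # samples per part

def stream_in_parts(signal, chunk=CHUNK):
    """Yield consecutive fixed-length parts of a continuously read signal."""
    for start in range(0, len(signal) - chunk + 1, chunk):
        yield signal[start:start + chunk]
```

Each yielded part would then run through the synchronization and extraction steps before the next part arrives.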
[0038] In the second method step V5 of
[0039] The calculation V5 results in a cross-correlation vector k which is shown as a model in
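The analytic calculation V5 of the cross-correlation vector k between the feature-domain signals m1, m2 can be sketched as follows. This is a hedged illustration: the names m1, m2, k follow the text, but the concrete normalization is our assumption, since this step involves no learned parameters.

```python
import numpy as np

# Sketch of step V5: analytically compute the cross-correlation vector k
# between the audio signals m1 and m2 (feature-domain signals produced by
# the first trained operator). The energy normalization is assumed.
def cross_correlation_vector(m1, m2):
    k = np.correlate(m2, m1, mode="full")
    norm = np.linalg.norm(m1) * np.linalg.norm(m2)
    return k / norm if norm > 0 else k

m1 = np.array([0.0, 1.0, 0.0, 0.0])
m2 = np.array([0.0, 0.0, 1.0, 0.0])   # m2 is m1 delayed by one sample
k = cross_correlation_vector(m1, m2)
lag = int(np.argmax(k)) - (len(m1) - 1)
```

The position of the maximum of k encodes the time offset between the two signals, which the subsequent optimization steps refine into the synchronization vector.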
[0040] In the fourth method step in
[0043] The factor d_i corresponds to the extent of the limitation of the cross-correlation vector for iteration step i, the summation taking place over the range ±d_i. This process is repeated until the number of iteration steps I specified on the user side has been carried out. Finally, a stretched convolution V9 of the audio signal m2 with the last calculated synchronization vector s_i takes place, whereupon the synchronized second audio input signal a2′ is calculated and output V7. Calculating the synchronization vector s on the basis of the partial range of the parameters ascertained in the previous iteration step reduces the complexity of the calculations, which accelerates the runtime of the method without impairing its accuracy.
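The stretched convolution V9 corresponds to what is commonly called a dilated convolution. A minimal numpy sketch, with the dilation factor standing in illustratively for the per-iteration limitation factor d_i (the exact relationship is our assumption):

```python
import numpy as np

# Sketch of a "stretched" (dilated) convolution as used in step V9:
# the kernel taps are spaced `dilation` samples apart before convolving.
def dilated_convolve(signal, kernel, dilation):
    """Convolve `signal` with `kernel` whose taps are spaced `dilation` apart."""
    # Insert dilation-1 zeros between kernel taps, then convolve normally.
    stretched = np.zeros((len(kernel) - 1) * dilation + 1)
    stretched[::dilation] = kernel
    return np.convolve(signal, stretched, mode="same")

m2 = np.arange(8, dtype=float)
s_i = np.array([0.5, 0.5])                       # toy synchronization vector
a2_sync = dilated_convolve(m2, s_i, dilation=2)  # toy synchronized signal
```

Stretching the kernel enlarges its temporal reach without adding taps, which is consistent with the runtime argument made above.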
[0045] In the second method step V11, the separation of the audio object 11 from the audio input signals a1, a2′ takes place by applying a second trained model of the neural network to the audio input signals a1, a2′. The parameters of the second trained model were also optimized by the previous training and are in particular dependent on the first trained model of the preceding method step V10. As a result of this method step V11, the audio object 11 is obtained from the audio input signals a1, a2′ and is still in the higher-dimensional representation domain.
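The separation in step V11 can be read together with claim 10, which provides for a learned filter mask. A hedged sketch of mask application in the higher-dimensional representation domain (the mask values here are fixed placeholders; in the method they would come from the second trained model):

```python
import numpy as np

# Sketch of a learned filter mask (claim 10 / step V11): a mask with
# values in [0, 1] is applied elementwise in the representation domain
# to keep the audio object and suppress background noise components.
def apply_mask(features, mask):
    return features * np.clip(mask, 0.0, 1.0)

features = np.array([[1.0, 2.0],
                     [3.0, 4.0]])    # toy feature-domain representation
mask = np.array([[1.0, 0.0],
                 [0.5, 1.0]])        # placeholder for a learned mask
obj = apply_mask(features, mask)     # audio object, still in the feature domain
```

A transformation back into the time domain of the audio input signals (claim 11) would then follow in the next step.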
[0046] In the third method step V12 of
[0047] To ensure that the neural network can reliably extract the audio object 11 from the audio input signals a1, a2, it must be trained before use. This is done, for example, by the training steps V13 to V19 described below, which are shown in a schematic flow chart in
[0048] Predefined audio objects 16 are generated V14 using predefined algorithms for specified audio input signals a1, a2. The predefined audio objects 16 are always of the same type, so that the method is specifically trained with regard to one type of audio objects 16. The generated audio input signals a1, a2 run through the method according to the invention according to
[0049] If the quality parameter exceeds the predefined value, the termination criterion is not met and the gradient of the error vector P is determined in the next method step V18 and backward propagated through the neural network, so that all parameters of the neural network are adjusted. The training method V13 is then repeated with further data sets until the error vector P reaches a sufficiently good value and the query V17 shows that the termination criterion has been met. Then the training process V13 is completed V19 and the method can be applied to real data. Ideally, the audio objects 11 used as predefined audio objects 16 in the training phase are those that are also to be ascertained in the application of the method, for example kick sounds 12 from soccer balls, which sounds have already been recorded.
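The training steps V13 to V19 can be summarized in a simplified numpy sketch. As an assumption for illustration, a toy linear model stands in for the neural network, and mean squared error stands in for the quality parameter of the error vector P:

```python
import numpy as np

# Simplified sketch of the training loop V13-V19 (toy linear model in
# place of the neural network; MSE assumed as the quality parameter).
rng = np.random.default_rng(1)
X = rng.standard_normal((64, 4))           # specified training inputs
w_true = np.array([0.5, -1.0, 2.0, 0.3])
target = X @ w_true                        # predefined audio object 16

w = np.zeros(4)                            # trainable parameters
threshold, lr = 1e-6, 0.1
for step in range(1000):
    ascertained = X @ w                    # forward propagation (V15)
    P = ascertained - target               # error vector P (V16)
    quality = np.mean(P ** 2)              # quality parameter of P
    if quality <= threshold:               # termination criterion (V17)
        break
    w -= lr * (X.T @ P) / len(X)           # gradient step: backward propagation (V18)
```

The loop mirrors the described flow: propagate forward, measure the error vector against the predefined audio object, and adjust parameters via the gradient until the termination criterion is met.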
[0050] The invention being thus described, it will be obvious that the same may be varied in many ways. Such variations are not to be regarded as a departure from the spirit and scope of the invention, and all such modifications as would be obvious to one skilled in the art are to be included within the scope of the following claims.