AUDIO CROPPING
20210358514 · 2021-11-18
Inventors
- David Anthony Betts (Cambridgeshire, GB)
- James Peter McTavish (Oxford, GB)
- Joe Patrick Lynas (Cambridgeshire, GB)
Cpc classification
H04S2420/07
ELECTRICITY
H04S2400/15
ELECTRICITY
G06F2203/04808
PHYSICS
H04S7/30
ELECTRICITY
G06F3/0484
PHYSICS
H04R2430/03
ELECTRICITY
International classification
G06F3/0484
PHYSICS
H04N5/262
ELECTRICITY
Abstract
A method of cropping a portion of an audio signal captured from a plurality of spatially separated audio sources in a scene, the method comprising: capturing the audio signal with one or more recording devices; separating the audio signal into a plurality of components each associated with one or more of the plurality of audio sources; selecting a spatial region in the scene; determining which of the plurality of components are associated with an audio source positioned outside of the selected spatial region; and cropping the plurality of components associated with an audio source positioned outside of the selected spatial region out of the audio signal.
Claims
1. A method of cropping a portion of an audio signal captured from a plurality of spatially separated audio sources in a scene, the method comprising: capturing the audio signal with one or more recording devices; separating the audio signal into a plurality of components associated with one or more of the plurality of audio sources; selecting a spatial region in the scene; determining which of the plurality of components are associated with an audio source positioned outside of the selected spatial region; and cropping the plurality of components associated with an audio source positioned outside of the selected spatial region out of the audio signal.
2. The method of claim 1, wherein selecting the spatial region in the scene comprises: on a display of a user device, displaying a spatial representation of the scene; and with a user interface of the user device, selecting the spatial region on the displayed spatial representation.
3. The method of claim 2, comprising: with an image capture device, capturing image data of the scene, and constructing the spatial representation of the scene from the captured image data.
4. The method of claim 1, wherein separating the audio signal into a plurality of components comprises performing blind source separation on the captured audio signal.
5. The method of claim 4, wherein performing said blind source separation comprises: converting the captured audio signal to time-frequency domain data comprising a plurality of frames for a plurality of times and frequencies, and constructing a multi-channel filter to operate on said time-frequency data frames to separate the plurality of components by source by calculating a set of filter coefficients corresponding to each source.
6. The method of claim 5 wherein said cropping comprises: selecting and applying the set of filter coefficients of the multi-channel filter corresponding to those sources determined to be inside of the selected spatial region.
7. The method of claim 1, wherein determining which of the plurality of components are associated with an audio source positioned outside of the selected spatial region comprises, for each of the plurality of components: calculating phase correlations across a set of possible directions of arrival of the component at the one or more recording devices; calculating a probability of observing the calculated phase correlations; and determining the source is outside of the selected spatial region when the calculated probability is below the threshold, and/or determining the source is inside of the selected spatial region when the calculated probability is at or above the threshold.
8. The method of claim 7, wherein calculating the probability of observing the calculated phase correlations comprises: ranking the phase correlations, comparing the rankings to rankings of phase correlations in training data.
9. The method of claim 3, wherein the set of possible directions of arrival comprises a set of vectors in a 3D coordinate system of the spatial representation of the scene.
10. The method of claim 7, wherein calculating the probability of observing the calculated phase correlations comprises calculating:
p.sub.m(d.sub.m∈D|r.sub.11 . . . r.sub.NF) where, for each component associated with an audio source m, p.sub.m is the probability that d, a direction in a 3D coordinate system of the scene indicating a direction of arrival of the component at the one or more recording devices, is an element of the set D, the set of directions in the selected spatial region, given the phase correlation rank r.sub.11 . . . r.sub.NF of the direction d over N directions over F frequencies.
11. The method of claim 5, comprising calculating said phase correlations using a phase transformation method.
12. The method of claim 11, wherein the phase transformation method comprises a steered response power phase transformation (SRP-PHAT) method.
13. The method of claim 3, comprising cropping out of the image data regions outside of the selected spatial region.
14. The method of claim 1, comprising: recognising in the cropped audio signal a speech component, isolating the speech component, and outputting with a playback device the speech component.
15. The method of claim 1, comprising: identifying in the cropped audio signal the component of the plurality of components having the highest volume, and outputting with a playback device the component having the highest volume.
16. An audio-visual system comprising: one or more recording devices configured to capture an audio signal from a plurality of spatially separated audio sources in a scene, a user device comprising a display configured to display a spatial representation of the scene and a user interface for selecting a spatial region on the displayed spatial representation, and one or more processors configured to: separate the audio signal into a plurality of components associated with one of the plurality of audio sources; determine which of the plurality of components are associated with an audio source positioned outside of the spatial region selected with the user interface; and crop the plurality of components associated with an audio source positioned outside of the selected spatial region out of the audio signal.
17. The audio-visual system of claim 16, wherein the user device comprises said one or more recording devices.
18. The audio-visual system of claim 16, comprising one or more image capture devices configured to capture image data of the scene, and wherein the one or more processors are configured to construct the spatial representation of the scene from the captured image data.
19. The audio-visual system of claim 18, wherein the user device comprises said one or more image capture devices.
20. The audio-visual system of claim 18, wherein the one or more processors are configured to crop out of the captured image data of the scene regions outside of the selected spatial region.
21. The audio-visual system of claim 20, comprising one or more playback devices configured to output the cropped audio signal and the cropped image data.
22. The audio-visual system of claim 16, wherein said user device is a mobile device.
23. The audio-visual system of claim 16, wherein the audio-visual system comprises a video teleconferencing system.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0224] These and other aspects of the invention will now be further described by way of example only, with reference to the accompanying figures in which:
[0225]
[0226]
[0227]
[0228]
[0229]
[0230]
[0231]
[0232]
[0233]
[0234]
DETAILED DESCRIPTION
[0235] As described above,
[0236] In this simplified example, the sources S1, S2, S3 and S4 are respectively at coordinates (1, 8), (7, 7), (7, 2) and (2, 3). If the 64 unique directions in the scene are numbered row by row, starting at 1 for coordinate (1, 1) and ending at 64 for coordinate (8, 8), these coordinates correspond to directions d.sub.57, d.sub.55, d.sub.15 and d.sub.18 respectively.
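The numbering of directions on the grid can be sketched as follows. This is a minimal illustration only, assuming the row-by-row numbering implied by coordinate (1, 1) being direction 1, coordinate (8, 8) being direction 64, and the second row (1, 2) to (8, 2) being directions 9 to 16; the function names `direction_index` and `direction_coordinate` are hypothetical.

```python
from __future__ import annotations

GRID_WIDTH = 8  # the simplified scene is an 8 x 8 grid of directions


def direction_index(x: int, y: int, width: int = GRID_WIDTH) -> int:
    """Map a 1-based grid coordinate (x, y) to a 1-based direction number.

    Assumes directions are numbered along each row in turn, so that
    (1, 1) -> 1 and (width, width) -> width * width.
    """
    if not (1 <= x <= width and 1 <= y <= width):
        raise ValueError("coordinate lies outside the grid")
    return (y - 1) * width + x


def direction_coordinate(index: int, width: int = GRID_WIDTH) -> tuple[int, int]:
    """Inverse mapping: 1-based direction number back to (x, y)."""
    if not 1 <= index <= width * width:
        raise ValueError("direction number lies outside the grid")
    row, col = divmod(index - 1, width)
    return col + 1, row + 1


# The corners of the grid and the ends of the second row.
print(direction_index(1, 1), direction_index(8, 8))  # 1 64
print(direction_index(1, 2), direction_index(8, 2))  # 9 16
```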
[0237] The recording device 10 captures an audio signal which is separated into its constituent components using blind source separation as described above.
[0238] The phase correlation values Φ.sub.f(S.sub.mf,d) for each of the directions d.sub.1 . . . 64 in the grid are calculated (that is, the phase correlation values for directions at coordinates (1, 1) all the way through to (8, 8) are calculated) and, in this simplified example, have the distribution shown in the heat map of
[0239] Assume a user has selected a spatial region indicated by box 201 shown in
[0240] To determine which of the plurality of components are associated with a source positioned outside of the selected spatial region and which are associated with a source positioned inside the selected spatial region, the phase correlation values Φ.sub.f(S.sub.mf,d) are ranked in a list, for example by high to low or low to high phase correlation value or some other order. Thus the directions or coordinates corresponding to a peak will be ranked 1.sup.st, the immediately surrounding directions or coordinates will be ranked 2.sup.nd to 9.sup.th and the other directions or coordinates will be ranked 10.sup.th to 64.sup.th.
[0241] To help illustrate this, let's consider the example of the 2.sup.nd row of directions, that is, the row of squares indicated by coordinates (1, 2) to (8, 2) corresponding to directions d.sub.9 . . . 16 using the above index notation. A simplified distribution of ranks may be as follows:
[0242] Coordinate (7,2) or direction d.sub.15 has a peak phase correlation value of 1 so is ranked 1.sup.st. Coordinates (1,2), (2,2), (3,2), (6,2), (8,2) or directions d.sub.9, 10, 11, 14, 16 have phase correlation values of 0.5 so will be ranked among the directions in the scene having the ranks 2.sup.nd to 9.sup.th. Note that the rank is the index into any valid sorting of the phase correlations such that each phase correlation is less than or equal to the next. This means identical phase correlation values will be assigned consecutive ranks. In practice, however, noise effects make identical phase correlations highly unlikely. Coordinates (4, 2), (5, 2) or directions d.sub.12, 13 have phase correlation values of 0 so are ranked among the directions in the scene having the ranks 10.sup.th to 64.sup.th.
[0243] Thus, using the index 1 . . . N notation to indicate directions d.sub.1 . . . 64 we might have by way of illustrative example a rank distribution for directions (r.sub.9 . . . r.sub.16) of:
r.sub.9: 2.sup.nd
r.sub.10: 3.sup.rd
r.sub.11: 4.sup.th
r.sub.12: 21.sup.st
r.sub.13: 57.sup.th
r.sub.14: 9.sup.th
r.sub.15: 1.sup.st
r.sub.16: 7.sup.th
[0244] We can follow the same process for all the directions d.sub.1 . . . 64 in the scene to obtain the rank distribution (r.sub.1 . . . r.sub.64). If we do the same for all observed frequencies f we obtain the rank distribution (r.sub.11 . . . r.sub.64F) which is the rank distribution across all frequencies and across all directions.
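The ranking step can be sketched with NumPy as below. This is an illustrative sketch rather than the described implementation: it assumes the phase correlation values are held in an array of shape (N, F) (directions by frequencies), ranks from high to low so that the peak is ranked 1st, and breaks ties by direction order so that identical values receive consecutive ranks. The function name `rank_phase_correlations` is hypothetical.

```python
import numpy as np


def rank_phase_correlations(phi: np.ndarray) -> np.ndarray:
    """Rank directions within each frequency; rank 1 = highest value.

    phi has shape (N, F): N directions by F frequencies. The result has
    the same shape and holds the rank r_nf of direction n at frequency f.
    """
    phi = np.asarray(phi, dtype=float)
    n_directions = phi.shape[0]
    # Stable sort on the negated values: high to low, and tied values
    # keep their direction order, so they get consecutive ranks.
    order = np.argsort(-phi, axis=0, kind="stable")
    ranks = np.empty_like(order)
    # The direction appearing k-th in the ordering receives rank k + 1.
    positions = np.arange(1, n_directions + 1)[:, None]
    np.put_along_axis(ranks, order, positions, axis=0)
    return ranks


# Second-row values from the simplified example, ranked among themselves
# only (the text ranks them among all 64 directions), at one frequency.
row_2 = np.array([0.5, 0.5, 0.5, 0.0, 0.0, 0.5, 1.0, 0.5])[:, None]
print(rank_phase_correlations(row_2).ravel())  # [2 3 4 7 8 5 1 6]
```

Stacking one such column per observed frequency gives the full rank distribution over all directions and frequencies.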
[0245] We can now compare our current, observed rank distribution with rank distributions from previously captured training data where the source positions or directions were known. This allows us to estimate the probability of observing the rank distribution (r.sub.11 . . . r.sub.64F) we actually observed, given each of the many possible source positions or directions (described above as states s) in the training data. That is, by comparing the observed rank distribution to training data rank distributions, we can estimate the probabilities: p.sub.m(r.sub.11 . . . r.sub.64F|s)
[0246] As previously described, by using Bayes rule on the now known probabilities p.sub.m(r.sub.11 . . . r.sub.64F|s) we obtain p.sub.m(s.sub.d|r.sub.11 . . . r.sub.64F) which allows us to evaluate the expression for the probability that a source m comes from our selected audio crop region 201:
[0247] We then do the same for all the sources m. We can then apply a threshold to the calculated probabilities p.sub.m(d.sub.m∈D|r.sub.11 . . . r.sub.64F) to determine whether or not a source m in direction d is in the set of directions D inside the selected spatial region.
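The Bayes step and the thresholding can be sketched as follows. This is a hedged illustration under stated assumptions: the likelihoods p(r|s) for each candidate direction are taken as already estimated from training data, the prior over directions is supplied by the caller, and the probability of a source lying in the selected region is assumed to be the sum of the posterior over the directions in that region. The names `region_probability` and `is_inside`, and the default threshold of 0.5, are hypothetical.

```python
import numpy as np


def region_probability(likelihood, prior, region) -> float:
    """Probability that a source lies in the selected set of directions.

    likelihood[d] is p(r | s_d), the probability of the observed rank
    distribution given a source in direction d (0-based index);
    prior[d] is p(s_d); region lists the 0-based selected directions.
    """
    likelihood = np.asarray(likelihood, dtype=float)
    prior = np.asarray(prior, dtype=float)
    joint = likelihood * prior
    evidence = joint.sum()
    if evidence <= 0:
        raise ValueError("observed ranks have zero probability under the model")
    posterior = joint / evidence  # Bayes' rule: p(s_d | r)
    # Assumption: sum the posterior over the directions in the region.
    return float(posterior[list(region)].sum())


def is_inside(probability: float, threshold: float = 0.5) -> bool:
    """Inside when the probability is at or above the threshold."""
    return probability >= threshold


# Three candidate directions with a uniform prior; the region holds the
# last two of them.
p = region_probability([0.1, 0.6, 0.3], [1 / 3, 1 / 3, 1 / 3], region=[1, 2])
print(is_inside(p))  # True
```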
[0248] Once we know which sources are and are not inside the selected spatial region, we can simply remove from the audio signal the component or components associated with the sources outside that region. It is this removal (based on position in the scene) which is referred to herein as cropping. The audio signal can then be passed to a playback device or be processed further.
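Once each component has been classified, the cropping itself reduces to discarding components. A minimal sketch, assuming the separated components are available as time-domain signals in an array of shape (M, T) and that the cropped signal is formed by summing the retained components (the function name `crop_audio` is hypothetical):

```python
import numpy as np


def crop_audio(components: np.ndarray, inside: np.ndarray) -> np.ndarray:
    """Remove components whose source lies outside the selected region.

    components has shape (M, T): M separated components of T samples.
    inside is a boolean array of shape (M,), True where the component's
    source was judged to be inside the region. Returns the (T,) sum of
    the retained components (all zeros if nothing is retained).
    """
    components = np.asarray(components, dtype=float)
    inside = np.asarray(inside, dtype=bool)
    if inside.shape != (components.shape[0],):
        raise ValueError("need exactly one inside/outside flag per component")
    return components[inside].sum(axis=0)


# Three components; only the first and third sources are inside the region.
parts = np.array([[1.0, 1.0, 1.0], [5.0, 5.0, 5.0], [0.5, 0.0, -0.5]])
print(crop_audio(parts, np.array([True, False, True])))  # [1.5 1.  0.5]
```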
[0249]
[0250] The user device 303 is provided with a user interface for selecting a spatial region on the displayed spatial representation 305. The user interface may be a touchscreen interface allowing the user to drag a shape around a desired spatial region. Alternatively, any other known input interface and/or software may be used to allow a user to select a spatial region on the spatial representation and to associate the selection with the corresponding spatial region in the scene. To indicate the selection, a visual representation of the selection 307, for example an outline around the selected region, may be rendered on the display.
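One way to associate a selection drawn on the display with directions in the scene is sketched below. It is illustrative only and rests on several assumptions not stated in the text: the display shows the whole scene, the scene is divided into the 8 x 8 grid of the simplified example with direction 1 at the top-left cell, and a cell counts as selected when its centre lies inside the dragged rectangle. The function name `directions_in_selection` is hypothetical.

```python
def directions_in_selection(left, top, right, bottom,
                            display_w, display_h, grid=8):
    """Return the 1-based direction numbers covered by a pixel rectangle.

    (left, top, right, bottom) is the dragged rectangle in display
    pixels. A grid cell is selected when its centre falls inside it.
    """
    cell_w = display_w / grid
    cell_h = display_h / grid
    selected = set()
    for row in range(grid):
        for col in range(grid):
            centre_x = (col + 0.5) * cell_w
            centre_y = (row + 0.5) * cell_h
            if left <= centre_x <= right and top <= centre_y <= bottom:
                selected.add(row * grid + col + 1)  # row-by-row numbering
    return selected


# A rectangle dragged over the right-hand edge of an 800 x 800 display.
print(sorted(directions_in_selection(600, 100, 800, 300, 800, 800)))
# [15, 16, 23, 24]
```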
[0251]
[0252] It is further envisaged that the user may apply the crop image/video 320 and crop audio 323 functionalities simultaneously. When cropping image/video, the user may also be asked whether they would like to crop the audio to match their image/video cropping selection.
[0253] In the example of
[0254] As described in connection with the method above, the one or more processors of the audio-visual system 300 are accordingly configured to: separate the captured audio signal into a plurality of components associated with one of the plurality of audio sources 301a-301d; determine which of the plurality of components are associated with an audio source positioned outside of the spatial region selected with the user interface; and crop the plurality of components associated with an audio source positioned outside of the selected spatial region out of the audio signal.
[0255] The one or more processors may further be configured to crop out of the captured image data any regions outside of the user's selection. Thus the user may, in one step, crop both the image data and the audio data thereby providing significant advantages over known methods.
[0256] Finally, the audio-visual system 300 comprises one or more playback devices such as a display and/or speakers (not shown) configured to output the cropped audio signal and/or the cropped image/video data.
[0257]
[0258] It is assumed that the acoustic scene is quasi-static and thus the filter coefficient determiner 311 and spatial filter 310 can operate in parallel. The latency is then determined by the main acoustic path, and depends upon the group delay of the filter coefficients, the latency of the spatial filter implementation, and the input/output transmission delays. Many different types of spatial filter may be used. For example, one low latency filter implementation is to use a direct convolution; a more computationally efficient alternative is described in Gardner, W G (1995), “Efficient Convolution without Input-output Delay”, Journal of the Audio Engineering Society, 43 (3), 127-136.
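A direct-convolution spatial filter can be sketched as below. This is a simplified, offline illustration, assuming a filter-and-sum structure in which each microphone channel is convolved with its own set of time-domain coefficients and the results are added; a real-time implementation would process the input block by block. The function name `spatial_filter` is hypothetical.

```python
import numpy as np


def spatial_filter(mics: np.ndarray, coeffs: np.ndarray) -> np.ndarray:
    """Filter each microphone channel and sum the results.

    mics has shape (C, T): C microphone channels of T samples.
    coeffs has shape (C, L): one L-tap FIR filter per channel.
    Returns the (T,) causal output, truncated to the input length.
    """
    mics = np.asarray(mics, dtype=float)
    coeffs = np.asarray(coeffs, dtype=float)
    if mics.shape[0] != coeffs.shape[0]:
        raise ValueError("need one set of filter coefficients per channel")
    n_samples = mics.shape[1]
    out = np.zeros(n_samples)
    for channel, taps in zip(mics, coeffs):
        # Direct (time-domain) convolution, keeping the first T samples.
        out += np.convolve(channel, taps)[:n_samples]
    return out


# Two microphones carrying unit impulses at different times.
x = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
h = np.array([[1.0, 2.0], [3.0, 4.0]])
print(spatial_filter(x, h))  # [1. 5. 4. 0.]
```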
[0259] The skilled person will recognise that the signal processing illustrated in the architecture of
[0260] An example spatial filter 310 for the architecture of
[0261]
[0262] The Discrete Fourier Transform (DFT) is a method of transforming a block of data between a time domain representation and a frequency domain representation. The STFT is an invertible method where overlapping time domain frames are transformed using the DFT to a time-frequency domain. The STFT is used to apply the filtering in the time-frequency domain; in embodiments when processing each audio channel, each channel in a frame is transformed independently using a DFT. Optionally the spatial filtering could also be applied in the time-frequency domain, but this incurs a processing latency and thus more preferably the filter coefficients are determined in the time-frequency domain and then inverse transformed back into the time domain. The time domain convolution maps to frequency domain multiplication.
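The statement that time domain convolution maps to frequency domain multiplication can be checked numerically. The sketch below uses NumPy's FFT routines and zero-pads both signals to the full output length so that the circular convolution implied by the DFT equals the linear convolution; the function name `fft_convolve` is hypothetical.

```python
import numpy as np


def fft_convolve(x, h) -> np.ndarray:
    """Linear convolution computed by multiplying DFTs.

    Both inputs are zero-padded to len(x) + len(h) - 1 samples so that
    the frequency-domain product corresponds to linear convolution.
    """
    n = len(x) + len(h) - 1
    spectrum = np.fft.rfft(x, n) * np.fft.rfft(h, n)
    return np.fft.irfft(spectrum, n)


rng = np.random.default_rng(0)
signal = rng.standard_normal(64)
taps = rng.standard_normal(8)
# The two routes agree to within floating-point rounding.
print(np.allclose(fft_convolve(signal, taps), np.convolve(signal, taps)))  # True
```

The same relationship is what allows filter coefficients determined in the time-frequency domain to be inverse transformed and applied as a time domain convolution.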
[0263] As shown in
[0264] The frame weights are determined by a source characterisation module 504. In implementations this determines frame weights according to one or more of the previously described heuristics. This may operate in the time domain or (more preferably) in the time-frequency domain as illustrated. More particularly, in implementations this may implement a multiple-source direction of arrival (DOA) estimation procedure. The skilled person will be aware of many suitable procedures including, for example, an MVDR (Minimum Variance Distortionless Response) beamformer, the MUSIC (Multiple Signal Classification) procedure, or a Fourier method (finding the peaks in the angular Fourier spectrum of the acoustic signals, obtained from a combination of the sensor array response and observations X).
[0265] The output data from such a procedure may comprise time series data indicating source activity (amplitude) and source direction. The time or time-frequency domain data or output of such a procedure may also be used to identify: a frame containing an impulsive event; and/or frames with less than a threshold sound level; and/or to classify a sound, for example as air conditioning or the like.
[0266] Referring now to
[0267] The blind source separation module 508 provides a set of demixing matrices as an output, defining frequency domain filter coefficients Wf. In embodiments these are provided to module 509 which removes the scaling ambiguity as previously described, providing filter coefficients for a source s at all the microphones (or reduced set of microphones).
[0268] The user selects 510 a spatial region as has been described above. In some implementations, this may comprise mapping the directions d in the visual representation on which the user has made their selection to the directions d for which the phase correlations Φ.sub.f(S.sub.mf,d) have been calculated. Thus, the user's selection in the visual representation corresponds to making a selection of directions d from the whole set of directions in the scene. With the selection of d now made, the probability:
p.sub.m(d.sub.m∈D|r.sub.11 . . . r.sub.NF)=Σ.sub.d∈D p.sub.m(s.sub.d|r.sub.11 . . . r.sub.NF)
where D is the set of selected directions, may be calculated for each of the sources. The filter coefficients Wf(s) for selecting those sources that meet (for example exceed) the chosen probability threshold are then output by the filter determiner module for use by the spatial filter after conversion back into the time domain. When these coefficients are used by the spatial filter, this has the effect of removing (i.e. cropping) those components of the input audio signal that are from sources outside the selected spatial region.
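The selection of filter coefficients and their conversion back into the time domain can be sketched as follows. This is an assumption-laden illustration, not the described implementation: the demixing coefficients are taken to be stored as a complex array of shape (F, M, C) over the non-negative DFT frequencies, the retained sources' coefficients are summed into a single filter per microphone (valid because the filtering is linear), and steps such as removal of the scaling ambiguity are omitted. The function name `crop_filter` and the default threshold are hypothetical.

```python
import numpy as np


def crop_filter(W: np.ndarray, probabilities, threshold: float = 0.5) -> np.ndarray:
    """Build time-domain filter taps that keep only the selected sources.

    W has shape (F, M, C): F non-negative DFT frequencies, M sources and
    C microphones. probabilities holds, per source, the probability of
    it lying inside the selected region. Returns real taps of shape
    (C, L) with L = 2 * (F - 1).
    """
    W = np.asarray(W, dtype=complex)
    keep = np.asarray(probabilities, dtype=float) >= threshold
    if keep.shape != (W.shape[1],):
        raise ValueError("need exactly one probability per source")
    # Sum the coefficients of the retained sources: shape (F, C).
    combined = W[:, keep, :].sum(axis=1)
    # Inverse transform along the frequency axis, then put channels first.
    return np.fft.irfft(combined, axis=0).T


# Toy case: source 0 is simply microphone 0 and source 1 is microphone 1.
W = np.zeros((3, 2, 2), dtype=complex)
W[:, 0, 0] = 1.0
W[:, 1, 1] = 1.0
taps = crop_filter(W, probabilities=[0.9, 0.1])
print(taps.shape)  # (2, 4)
```

In this toy case the resulting taps pass microphone 0 unchanged and silence microphone 1, which removes the source judged to be outside the region.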
[0269] In some implementations, if a priori knowledge of the source directions is available, an additional layer of user selection may optionally be introduced by a source selection module 511 that operates on a pseudo inverse of the demixing matrix, using the microphone phase responses to choose a specific source s based on that a priori knowledge. That is, the source may be selected 512 by the user, for example by the user indicating a direction of the desired source, based on said a priori knowledge of the source direction.
[0270]
[0271]
[0272] No doubt many other effective alternatives will occur to the skilled person. It will be understood that the invention is not limited to the described embodiments and encompasses modifications apparent to those skilled in the art lying within the scope of the claims appended hereto. It should further be noted that the invention also encompasses any combination of embodiments described herein, for example an embodiment may combine the features of any one or more of the independent and/or dependent claims.
[0273] For example, where the audio signal is described as being separated into a plurality of components associated with one or more of the plurality of audio sources, it is envisaged that in an ideal situation there may be a one-to-one correspondence whereby each of the plurality of components is associated with one of the audio sources. However, in practice, for example if the scene is more complex and contains many audio sources, BSS may be unable to distinguish the sources perfectly and may inadvertently merge sources. It is therefore also envisaged that in some scenarios each component may be associated with multiple sources. Finally, in some cases, for example where there are not many audio sources, some of the components may mostly contain ambient noise.
[0274] For example, the audio and/or video cropping may occur in real time or near real time as the audio and/or video data is being recorded by or streamed to the user device, or it may instead be performed after the recording has taken place as part of an editing process.