AUDIO CROPPING

20210358514 · 2021-11-18

    Abstract

    A method of cropping a portion of an audio signal captured from a plurality of spatially separated audio sources in a scene, the method comprising: capturing the audio signal with one or more recording devices; separating the audio signal into a plurality of components each associated with one or more of the plurality of audio sources; selecting a spatial region in the scene; determining which of the plurality of components are associated with an audio source positioned outside of the selected spatial region; and cropping the plurality of components associated with an audio source positioned outside of the selected spatial region out of the audio signal.

    Claims

    1. A method of cropping a portion of an audio signal captured from a plurality of spatially separated audio sources in a scene, the method comprising: capturing the audio signal with one or more recording devices; separating the audio signal into a plurality of components associated with one or more of the plurality of audio sources; selecting a spatial region in the scene; determining which of the plurality of components are associated with an audio source positioned outside of the selected spatial region; and cropping the plurality of components associated with an audio source positioned outside of the selected spatial region out of the audio signal.

    2. The method of claim 1, wherein selecting the spatial region in the scene comprises: on a display of a user device, displaying a spatial representation of the scene; and with a user interface of the user device, selecting the spatial region on the displayed spatial representation.

    3. The method of claim 2, comprising: with an image capture device, capturing image data of the scene, and constructing the spatial representation of the scene from the captured image data.

    4. The method of claim 1, wherein separating the audio signal into a plurality of components comprises performing blind source separation on the captured audio signal.

    5. The method of claim 4, wherein performing said blind source separation comprises: converting the captured audio signal to time-frequency domain data comprising a plurality of frames for a plurality of times and frequencies, and constructing a multi-channel filter to operate on said time-frequency data frames to separate the plurality of components by source by calculating a set of filter coefficients corresponding to each source.

    6. The method of claim 5 wherein said cropping comprises: selecting and applying the set of filter coefficients of the multi-channel filter corresponding to those sources determined to be inside of the selected spatial region.

    7. The method of claim 1, wherein determining which of the plurality of components are associated with an audio source positioned outside of the selected spatial region comprises, for each of the plurality of components: calculating phase correlations across a set of possible directions of arrival of the component at the one or more recording devices; calculating a probability of observing the calculated phase correlations; and determining the source is outside of the selected spatial region when the calculated probability is below a threshold, and/or determining the source is inside of the selected spatial region when the calculated probability is at or above the threshold.

    8. The method of claim 7, wherein calculating the probability of observing the calculated phase correlations comprises: ranking the phase correlations, and comparing the rankings to rankings of phase correlations in training data.

    9. The method of claim 3, wherein the set of possible directions of arrival comprises a set of vectors in a 3D coordinate system of the spatial representation of the scene.

    10. The method of claim 7, wherein calculating the probability of observing the calculated phase correlations comprises calculating:
    $p_m(d_m \in \mathcal{D} \mid r_{11} \ldots r_{NF})$ where, for each component associated with an audio source m, $p_m$ is the probability that $d_m$, a direction in a 3D coordinate system of the scene indicating a direction of arrival of the component at the one or more recording devices, is an element of the set $\mathcal{D}$, the set of directions in the selected spatial region, given the phase correlation ranks $r_{11} \ldots r_{NF}$ of the directions over N directions and F frequencies.

    11. The method of claim 5, comprising calculating said phase correlations using a phase transformation method.

    12. The method of claim 11, wherein the phase transformation method comprises a steered response power phase transformation (SRP-PHAT) method.

    13. The method of claim 3, comprising cropping out of the image data regions outside of the selected spatial region.

    14. The method of claim 1, comprising: recognising in the cropped audio signal a speech component, isolating the speech component, and outputting with a playback device the speech component.

    15. The method of claim 1, comprising: identifying in the cropped audio signal the component of the plurality of components having the highest volume, and outputting with a playback device the component having the highest volume.

    16. An audio-visual system comprising: one or more recording devices configured to capture an audio signal from a plurality of spatially separated audio sources in a scene, a user device comprising a display configured to display a spatial representation of the scene and a user interface for selecting a spatial region on the displayed spatial representation, and one or more processors configured to: separate the audio signal into a plurality of components associated with one of the plurality of audio sources; determine which of the plurality of components are associated with an audio source positioned outside of the spatial region selected with the user interface; and crop the plurality of components associated with an audio source positioned outside of the selected spatial region out of the audio signal.

    17. The audio-visual system of claim 16, wherein the user device comprises said one or more recording devices.

    18. The audio-visual system of claim 16, comprising one or more image capture devices configured to capture image data of the scene, and wherein the one or more processors are configured to construct the spatial representation of the scene from the captured image data.

    19. The audio-visual system of claim 18, wherein the user device comprises said one or more image capture devices.

    20. The audio-visual system of claim 18, wherein the one or more processors are configured to crop out of the captured image data of the scene regions outside of the selected spatial region.

    21. The audio-visual system of claim 20, comprising one or more playback devices configured to output the cropped audio signal and the cropped image data.

    22. The audio-visual system of claim 16, wherein said user device is a mobile device.

    23. The audio-visual system of claim 16, wherein the audio-visual system comprises a video teleconferencing system.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0224] These and other aspects of the invention will now be further described by way of example only, with reference to the accompanying figures in which:

    [0225] FIG. 1 is an illustration of an example acoustic scene.

    [0226] FIG. 2 is a spatial representation of the acoustic scene of FIG. 1.

    [0227] FIG. 3a is an audio-visual system according to the present disclosure.

    [0228] FIG. 3b illustratively shows an example user interface of an audio-visual system according to the present disclosure.

    [0229] FIG. 3c illustratively shows example architecture of an audio-visual system according to the present disclosure.

    [0230] FIG. 4 illustratively shows example architecture of a spatial filter according to the present disclosure.

    [0231] FIG. 5a illustratively shows an example frame buffer management implementation according to the present disclosure.

    [0232] FIG. 5b illustratively shows an example implementation of a frequency domain filter coefficient determiner according to the present disclosure.

    [0233] FIG. 6 illustratively shows an example of a general purpose computing system programmed to implement an audio-visual system of the present disclosure.

    [0234] FIG. 7 is a flowchart of a method according to the present disclosure.

    DETAILED DESCRIPTION

    [0235] As described above, FIG. 1 is an illustration of an example acoustic scene comprising four sources S1-S4 with respective audio channels H1-H4 to a recording device such as a microphone array 10 comprising (in this example) 8 microphones. FIG. 2 is a spatial representation of the acoustic scene of FIG. 1 onto which a simplified heat map of phase correlation values Φ.sub.f(S.sub.mf,d) has been overlaid. The spatial representation in this example comprises a 2D coordinate system made up of an 8×8 grid of squares indicated by coordinates (1, 1) to (8, 8), making up a grid of 64 unique directions d.sub.1 . . . 64. In practice, it is envisaged that a spatial representation may comprise a 3D or other coordinate system. Each square is associated with a direction d.sub.1 . . . 64 relative to the position of the recording device 10.

    [0236] In this simplified example, the sources S1, S2, S3 and S4 are respectively at coordinates (1, 8), (7, 7), (7, 2) and (2, 3) which, if the 64 unique directions in the scene are numbered row by row starting at 1 for coordinate (1, 1) and ending at 64 for coordinate (8, 8), correspond respectively to directions d.sub.57, d.sub.55, d.sub.15 and d.sub.18.
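
    By way of illustration only, a row-by-row numbering of the grid in which coordinate (1, 1) is direction 1 and coordinate (8, 8) is direction 64 may be sketched as follows. The function names are hypothetical, and the row-major convention is an assumption chosen so that the second row of squares maps to directions 9 to 16.

```python
GRID_WIDTH = 8  # the 8x8 grid of the simplified example


def direction_index(x, y, width=GRID_WIDTH):
    """Map a 1-based grid coordinate (x, y) to a 1-based direction index."""
    return (y - 1) * width + x


def grid_coordinate(index, width=GRID_WIDTH):
    """Inverse mapping: 1-based direction index back to the coordinate (x, y)."""
    row, column = divmod(index - 1, width)
    return (column + 1, row + 1)


# Directions of the second row of squares, coordinates (1, 2) to (8, 2).
second_row = [direction_index(x, 2) for x in range(1, GRID_WIDTH + 1)]

# Direction of the square at coordinate (7, 2).
peak_direction = direction_index(7, 2)
```

    Under this convention the square at coordinate (7, 2) is direction 15, and the inverse mapping recovers the coordinate from the index.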

    [0237] The recording device 10 captures an audio signal which is separated into its constituent components using blind source separation as described above.

    [0238] The phase correlation values Φ.sub.f(S.sub.mf,d) for each of the directions d.sub.1 . . . 64 in the grid are calculated (that is, the phase correlation values for directions at coordinates (1, 1) all the way through to (8, 8) are calculated) and, in this simplified example, have the distribution shown in the heat map of FIG. 2. The grid squares that contain the sources contain a phase correlation peak, whereas grid squares far from the sources have low phase correlation values. For the sake of example only, let's say the peak has a phase correlation value of 1, the surrounding directions have a value of 0.5 and the other directions have a value of 0. It is of course noted that, in practice, the distribution of phase correlations Φ.sub.f(S.sub.mf,d) is unlikely to be as simple as is shown in FIG. 2 and is instead difficult to analyse for the reasons given above.
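
    The exact form of the phase correlation is given earlier in the disclosure and is not reproduced here. Purely as a hedged sketch, a phase-transform (PHAT) style steered response for a single frequency may be computed as below, assuming a far-field plane wave, a linear array with hypothetical 4 cm spacing, a speed of sound of 343 m/s, and candidate directions expressed as angles in a plane. None of these values or function names come from the disclosure.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # metres per second (assumed)


def steering_phase(mic_xy, angle_rad, freq_hz):
    """Phase of a far-field plane wave from `angle_rad` at a microphone, relative to the array origin."""
    ux, uy = math.cos(angle_rad), math.sin(angle_rad)
    advance_s = (mic_xy[0] * ux + mic_xy[1] * uy) / SPEED_OF_SOUND
    return 2.0 * math.pi * freq_hz * advance_s


def phase_correlation(observation, mics, angle_rad, freq_hz):
    """PHAT-style steered response for one frequency, normalised to lie in [0, 1]."""
    total = 0j
    for x_m, mic in zip(observation, mics):
        if abs(x_m) == 0.0:
            continue  # a silent channel carries no phase information
        unit = x_m / abs(x_m)  # phase transform: discard magnitude, keep phase
        total += unit * cmath.exp(-1j * steering_phase(mic, angle_rad, freq_hz))
    return abs(total) / len(mics)


# Hypothetical 8-microphone linear array and a single source at 60 degrees.
mics = [(0.04 * i, 0.0) for i in range(8)]
freq_hz = 1000.0
true_angle = math.radians(60.0)
observation = [0.7 * cmath.exp(1j * steering_phase(m, true_angle, freq_hz)) for m in mics]

candidates = [math.radians(a) for a in range(0, 181, 5)]
scores = [phase_correlation(observation, mics, a, freq_hz) for a in candidates]
best_angle_deg = round(math.degrees(candidates[scores.index(max(scores))]))
```

    In this idealised, noise-free case the response peaks at the candidate direction matching the source and falls away elsewhere, which is the behaviour the simplified heat map is intended to convey.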

    [0239] Assume a user has selected a spatial region indicated by box 201 shown in FIG. 2. In this example, the selected region contains only source S3, so the components associated with sources S1, S2 and S4 are to be cropped out.

    [0240] To determine which of the plurality of components are associated with a source positioned outside of the selected spatial region and which are associated with a source positioned inside the selected spatial region, phase correlation values Φ.sub.f(S.sub.mf,d) are ranked in a list, for example by high to low or low to high phase correlation value or some other order. Thus the directions or coordinates corresponding to a peak will be ranked 1.sup.st, the immediately surrounding directions or coordinates will be ranked 2.sup.nd to 9.sup.th and the other directions or coordinates will be ranked 10.sup.th to 64.sup.th.

    [0241] To help illustrate this, let's consider the example of the 2.sup.nd row of directions, that is, the row of squares indicated by coordinates (1, 2) to (8, 2) corresponding to directions d.sub.9 . . . 16 using the above index notation. A simplified distribution of ranks may be as follows:

    [0242] Coordinate (7,2) or direction d.sub.15 has a peak phase correlation value of 1 so is ranked 1.sup.st. Coordinates (1,2), (2,2), (3,2), (6,2), (8,2) or directions d.sub.9, 10, 11, 14, 16 have phase correlation values of 0.5 so will be ranked among the directions in the scene having the ranks 2.sup.nd to 9.sup.th. Note that the rank is the index into any valid sorting of the phase correlations such that each phase correlation is less than or equal to the next. This means identical phase correlation values will be assigned consecutive ranks. In practice, however, noise effects make identical phase correlations highly unlikely. Coordinates (4, 2), (5, 2) or directions d.sub.12, 13 have phase correlation values of 0 so are ranked among the directions in the scene having the ranks 10.sup.th to 64.sup.th.

    [0243] Thus, using the index 1 . . . N notation to indicate directions d.sub.1 . . . 64 we might have by way of illustrative example a rank distribution for directions (r.sub.9 . . . r.sub.16) of:

    r.sub.9: 2.sup.nd
    r.sub.10: 3.sup.rd
    r.sub.11: 4.sup.th
    r.sub.12: 21.sup.st
    r.sub.13: 57.sup.th
    r.sub.14: 9.sup.th
    r.sub.15: 1.sup.st
    r.sub.16: 7.sup.th
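
    The ranking step may be illustrated with the following minimal sketch. It assumes a high-to-low ordering (so that a peak is ranked 1st), resolves ties by assigning consecutive ranks, and uses a hypothetical heat map for a single component with one peak at coordinate (7, 2); the ranks listed above are illustrative and are not reproduced by it.

```python
def rank_directions(phase_correlations):
    """Return 1-based ranks, highest value ranked 1st; equal values get consecutive ranks."""
    # sorted() is stable, so tied directions keep their index order.
    order = sorted(range(len(phase_correlations)), key=lambda i: -phase_correlations[i])
    ranks = [0] * len(phase_correlations)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def simplified_heat_map(peak_x, peak_y, width=8):
    """Values of 1 at the peak, 0.5 in the adjacent squares, 0 elsewhere (row-by-row order)."""
    values = []
    for y in range(1, width + 1):
        for x in range(1, width + 1):
            distance = max(abs(x - peak_x), abs(y - peak_y))
            values.append(1.0 if distance == 0 else 0.5 if distance == 1 else 0.0)
    return values


values = simplified_heat_map(7, 2)
ranks = rank_directions(values)
peak_rank = ranks[15 - 1]  # direction 15 is coordinate (7, 2)
neighbour_ranks = sorted(r for r, v in zip(ranks, values) if v == 0.5)
```

    With this heat map the peak direction is ranked 1st, its eight neighbours occupy ranks 2 to 9, and the remaining directions occupy ranks 10 to 64.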

    [0244] We can follow the same process for all the directions d.sub.1 . . . 64 in the scene to obtain the rank distribution (r.sub.1 . . . r.sub.64). If we do the same for all observed frequencies f we obtain the rank distribution (r.sub.11 . . . r.sub.64F) which is the rank distribution across all frequencies and across all directions.

    [0245] We can now compare our current, observed rank distribution with rank distributions from previously captured training data where the source positions or directions were known. This allows us to estimate the probability of observing the rank distribution (r.sub.11 . . . r.sub.64F) we actually observed, given each of the many possible source positions or directions (described above as states s) in the training data. That is, by comparing the observed rank distribution to training data rank distributions, we can estimate the probabilities: p.sub.m(r.sub.11 . . . r.sub.64F|s)

    [0246] As previously described, by using Bayes' rule on the now known probabilities p.sub.m(r.sub.11 . . . r.sub.64F|s) we obtain p.sub.m(s.sub.d|r.sub.11 . . . r.sub.64F) which allows us to evaluate the expression for the probability that a source m comes from our selected audio crop region 201:

    [00033] $$p_m(d_m \in \mathcal{D} \mid r_{11} \ldots r_{64F}) = \sum_{d \in \mathcal{D}} p_m(s_d \mid r_{11} \ldots r_{64F})$$

    [0247] We then do the same for all the sources m. We can then apply a threshold to the calculated probabilities $p_m(d_m \in \mathcal{D} \mid r_{11} \ldots r_{64F})$ to determine whether or not a source m in direction d is in the set of directions $\mathcal{D}$ inside the selected spatial region.
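
    As a hedged sketch of the Bayes' rule and thresholding steps, the following assumes that a likelihood of the observed ranks has already been estimated for each of the 64 directions, that the prior over directions is uniform, and that the threshold is 0.5. The likelihood values, the selected set of directions and the function names are all hypothetical.

```python
def posterior_over_directions(likelihoods, priors=None):
    """Bayes' rule: p(s_d | ranks) from p(ranks | s_d) and a prior over directions."""
    n = len(likelihoods)
    if priors is None:
        priors = [1.0 / n] * n  # assumed uniform prior
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    evidence = sum(joint)
    if evidence == 0.0:
        raise ValueError("observed ranks have zero probability under every direction")
    return [j / evidence for j in joint]


def probability_in_region(posterior, region_directions):
    """Sum the posterior over the 1-based direction indices in the selected region."""
    return sum(posterior[d - 1] for d in region_directions)


def is_inside(posterior, region_directions, threshold=0.5):
    """Inside when the probability is at or above the threshold, outside otherwise."""
    return probability_in_region(posterior, region_directions) >= threshold


# Hypothetical likelihoods for one component, concentrated around direction 15.
likelihoods = [0.001] * 64
likelihoods[15 - 1] = 0.6
for d in (6, 7, 8, 14, 16, 22, 23, 24):
    likelihoods[d - 1] = 0.05

region = {14, 15, 16, 22, 23, 24}  # hypothetical selected set of directions
posterior = posterior_over_directions(likelihoods)
p_inside = probability_in_region(posterior, region)
keep_component = is_inside(posterior, region)
```

    Here the probability mass inside the region is roughly 0.81, so the component would be retained; a component whose likelihood is concentrated elsewhere in the grid would fall below the threshold and be cropped.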

    [0248] When we know which sources are or aren't inside the selected spatial region, we can simply remove from the audio signal the component or components associated with the sources outside the selected spatial region, and it is this removal (based on position in the scene) which is referred to herein as cropping. The audio signal can then be passed to a playback device or be processed further.

    [0249] FIG. 3a is an illustration of an audio-visual system 300 according to the present disclosure. The audio-visual system 300 comprises one or more recording devices 301, for example a microphone array such as that which may be found in a user device 303 such as a smartphone or tablet or other mobile device, configured to capture an audio signal from a plurality of spatially separated audio sources 302a-302d in a scene. It is also envisaged that the one or more recording devices 301 may be separate to the user device 303 and instead communicatively coupled to it. The audio-visual system 300 accordingly comprises said user device 303 comprising a display 304 configured to display a spatial representation 305 of the scene. The spatial representation may, for example, be a graphical rendering of the scene constructed from a video or image of the scene captured by an image capture device 306. As with the one or more recording devices 301, the one or more image capture devices 306 may either be part of the user device 303 or separate thereto; for example, in the case of a smartphone, the on-board camera may be used or an accessory camera may instead be connected to it. In the case of a videoconferencing system, the image capture device 306 may be a webcam or other such accessory device. The audio-visual system 300 may accordingly comprise one or more processors configured to construct the spatial representation of the scene from the captured image data.

    [0250] The user device 303 is provided with a user interface for selecting a spatial region on the displayed spatial representation 305. The user interface may be a touchscreen interface allowing the user to drag a shape around a desired spatial region. Alternatively, any other known input interface and/or software may be used to allow a user to select a spatial region on the spatial representation and to associate the selection with the corresponding spatial region in the scene. To indicate the selection, a visual representation of the selection 307, for example an outline around the selected region, may be rendered on the display.

    [0251] FIG. 3b illustratively shows an example user interface 317 of an audio-visual system 300 according to the present disclosure. The user device 303 in this example is a smartphone recording a video of a user 318. The visual representation of the captured video data is displayed on the display of the user device 303. The user interface 317 is provided with a number of known recording and editing options, for example, record video 319, crop image/video 320, timer options 321 and sound options 322. However, in addition to the known editing options and in accordance with the present disclosure, a new crop audio button 323 is provided which prompts the user to drag a selection over the spatial region 3 they would like to keep the audio from. Once the selection 324 has been made, any audio signal components associated with audio sources arriving at the one or more recording devices from directions outside of the selected spatial region may be cropped out of the audio signal in the manner as described above.

    [0252] It is further envisaged that the user may apply the crop image/video 320 and crop audio 323 functionalities at the same time. The user may also be prompted when cropping image/video if they would also like to crop audio to match their image/video cropping selection.

    [0253] In the example of FIG. 3b, the user selection is made using a touchscreen and well known pinch or pull zoom gestures commonly used by image/video cropping methods to allow a user to make a selection, for example of a rectangular area on the display, although other shapes and gesture or selection methods are also envisaged. For example, it is envisaged the user may simply tap a region on the display which will select a spatial region of a predetermined size. For example, the user might tap on the mouth or face of a person in the scene and that would select a correspondingly sized region around that person's mouth.
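
    One possible way of associating such a selection with directions in the scene is sketched below. It assumes, for illustration only, that the display maps linearly onto an 8×8 grid of directions with grid row 1 at the top of the display, and that directions are numbered row by row. The display size, tap size and function names are hypothetical.

```python
def pixel_to_cell(px, py, display_w, display_h, grid=8):
    """Map a display pixel to the 1-based grid coordinate (x, y) that contains it."""
    x = min(grid, int(px * grid / display_w) + 1)
    y = min(grid, int(py * grid / display_h) + 1)
    return x, y


def directions_in_rectangle(left, top, right, bottom, display_w, display_h, grid=8):
    """Set of 1-based direction indices whose grid squares the rectangle overlaps."""
    x0, y0 = pixel_to_cell(left, top, display_w, display_h, grid)
    x1, y1 = pixel_to_cell(right, bottom, display_w, display_h, grid)
    return {(y - 1) * grid + x for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)}


def directions_for_tap(px, py, display_w, display_h, half_size_px, grid=8):
    """A tap selects a square region of predetermined size centred on the tap point."""
    return directions_in_rectangle(
        max(0, px - half_size_px), max(0, py - half_size_px),
        min(display_w - 1, px + half_size_px), min(display_h - 1, py + half_size_px),
        display_w, display_h, grid,
    )


# Hypothetical 800x800 pixel display: a dragged rectangle and a single tap.
dragged = directions_in_rectangle(500, 100, 799, 299, 800, 800)
tapped = directions_for_tap(650, 150, 800, 800, half_size_px=40)
```

    In this sketch the dragged rectangle covers six grid squares, while the tap selects only the square under the user's finger; a larger predetermined tap size would select neighbouring squares as well.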

    [0254] As described in connection with the method above, the one or more processors of the audio-visual system 300 are accordingly configured to: separate the captured audio signal into a plurality of components associated with one of the plurality of audio sources 302a-302d; determine which of the plurality of components are associated with an audio source positioned outside of the spatial region selected with the user interface; and crop the plurality of components associated with an audio source positioned outside of the selected spatial region out of the audio signal.

    [0255] The one or more processors may further be configured to crop out of the captured image data any regions outside of the user's selection. Thus the user may, in one step, crop both the image data and the audio data thereby providing significant advantages over known methods.

    [0256] Finally, the audio-visual system 300 comprises one or more playback devices such as a display and/or speakers (not shown) configured to output the cropped audio signal and/or the cropped image/video data.

    [0257] FIG. 3c illustrates example architecture of the audio-visual system 300 of the present disclosure. As described above, the audio-visual system 300 comprises one or more recording devices 302, for example a microphone array with microphones 302a-n, coupled to a multi-channel analogue-to-digital converter 308. This provides a digitised multi-channel audio output 309 to a spatial filter 310 which may be implemented as a multi-channel linear convolutional filter, and to a filter coefficient determiner 311. The filter coefficient determiner 311 determines coefficients of a demixing filter which are applied by spatial filter 310 to extract audio from one (or more) sources for a demixed audio output 314. The filter determiner 311 accepts user input, for example the above described user specified spatial region, and has an output 313 comprising demixing filter coefficients for the selected spatial region. The demixed audio 314 is provided to a digital-to-analogue converter 315 which provides a time domain audio output 316, for example to speakers or headphones or the like, or for storage/further processing (for example speech recognition), communication (for example over a wired or wireless network such as a mobile phone network and/or the Internet), or other uses.

    [0258] It is assumed that the acoustic scene is quasi-static and thus the filter coefficient determiner 311 and spatial filter 310 can operate in parallel. The latency is then determined by the main acoustic path, and depends upon the group delay of the filter coefficients, the latency of the spatial filter implementation, and the input/output transmission delays. Many different types of spatial filter may be used. For example, one low latency filter implementation is to use a direct convolution; a more computationally efficient alternative is described in Gardner, W G (1995), “Efficient Convolution without Input-Output Delay”, Journal of the Audio Engineering Society, 43 (3), 127-136.

    [0259] The skilled person will recognise that the signal processing illustrated in the architecture of FIG. 3c may be implemented in many different ways. For example the filter determiner, in some implementations with the above described user interface, and/or spatial filter and/or DAC 315 may be implemented on a general purpose computing device such as a mobile phone, tablet, laptop or personal computer. In embodiments the microphone array and ADC 308 may comprise part of such a general purpose computing device. Alternatively some or all of the architecture of FIG. 3c may be implemented on a dedicated device such as dedicated hardware (for example an ASIC), and/or using a digital signal processor (DSP). A dedicated approach may reduce the latency on the main acoustic path which is otherwise associated with input/output to/from a general purpose computing device, but this may be traded against the convenience of use of a general purpose device.

    [0260] An example spatial filter 310 for the architecture of FIG. 3c is shown in FIG. 4. The illustrated example shows a multi-channel linear discrete convolution filter in which the output is the sum of the audio input channels convolved with their respective filter co-efficients, as described in eq(1) in the background section above. In embodiments a multi-channel output such as a stereo output is provided. For a stereo output either the spatial filter output may be copied to all the output channels or more preferably, as shown in FIG. 3a, a separate spatial filter is provided for each output channel. This latter approach is advantageous as it can approximate the source as heard by each ear (since the microphones are spaced apart from one another). This can lead to a more natural sounding output which still retains some spatial cues from the source.
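
    A minimal sketch of such a multi-channel linear convolution filter, using direct time-domain convolution, is given below. It assumes equal-length input channels and one finite impulse response per channel; the two-channel input and the coefficient values are hypothetical.

```python
def convolve(signal, coefficients):
    """Direct causal convolution of one channel with its filter, truncated to the input length."""
    output = [0.0] * len(signal)
    for n in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(coefficients):
            if n - k >= 0:
                acc += w * signal[n - k]
        output[n] = acc
    return output


def spatial_filter(channels, filters):
    """Sum over channels of each input channel convolved with its own coefficients."""
    if len(channels) != len(filters):
        raise ValueError("one set of filter coefficients is required per input channel")
    output = [0.0] * len(channels[0])
    for signal, coefficients in zip(channels, filters):
        for n, value in enumerate(convolve(signal, coefficients)):
            output[n] += value
    return output


# Hypothetical two-channel input (unit impulses) and two short filters.
channels = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
filters = [[0.5, 0.25], [1.0, -1.0]]
output = spatial_filter(channels, filters)
```

    For a stereo output with a separate spatial filter per output channel, the same function would simply be called once per output channel with that channel's set of coefficients.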

    [0261] FIG. 5a shows an example implementation of frame buffer management according to the present disclosure. FIG. 5a also illustrates time-frequency and frequency-time domain conversions (not shown in FIG. 2) for the frequency domain filter coefficient determiner 311 of FIG. 3c. In some implementations, each audio channel may be provided with an STFT (Short Time Fourier Transform) module 501a-n each configured to perform a succession of overlapping discrete Fourier transforms on an audio channel to generate a time sequence of spectra. Transformation of filter coefficients back into the time domain may be performed by a set of inverse discrete Fourier transforms 502.

    [0262] The Discrete Fourier Transform (DFT) is a method of transforming a block of data between a time domain representation and a frequency domain representation. The STFT is an invertible method where overlapping time domain frames are transformed using the DFT to a time-frequency domain. The STFT is used to apply the filtering in the time-frequency domain; in embodiments when processing each audio channel, each channel in a frame is transformed independently using a DFT. Optionally the spatial filtering could also be applied in the time-frequency domain, but this incurs a processing latency and thus more preferably the filter coefficients are determined in the time-frequency domain and then inverse transformed back into the time domain. The time domain convolution maps to frequency domain multiplication.
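
    The correspondence between time domain convolution and frequency domain multiplication can be checked numerically, as in the following sketch. It uses a naive discrete Fourier transform for clarity rather than speed, and zero-pads both sequences so that the circular convolution implied by the DFT equals the linear convolution. The signal and coefficient values are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def linear_convolution(a, b):
    """Direct time-domain linear convolution."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


def convolution_via_dft(a, b):
    """Zero-pad, multiply the spectra bin by bin, then transform back."""
    n = len(a) + len(b) - 1
    a_pad = list(a) + [0.0] * (n - len(a))
    b_pad = list(b) + [0.0] * (n - len(b))
    product = [p * q for p, q in zip(dft(a_pad), dft(b_pad))]
    return [value.real for value in dft(product, inverse=True)]


signal = [1.0, 2.0, 3.0, 4.0]
coefficients = [0.5, -0.25, 0.125]
time_domain = linear_convolution(signal, coefficients)
frequency_domain = convolution_via_dft(signal, coefficients)
```

    The two results agree to within floating point rounding, which is why filter coefficients determined in the time-frequency domain can be inverse transformed and applied as a time domain convolution.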

    [0263] As shown in FIG. 5a, a frame buffer system 500 is provided comprising a T×F frame buffer X for each microphone 1 . . . M. These store time-frequency data frames in association with frame weight/probability data as previously described. In embodiments the microphone STFT data are interleaved so that there is one frame buffer containing M×F×T STFT points. A frame buffer manager 503 operates under control of the procedure to read the stored weights for frame selection/weighting. In implementations, the frame buffer manager 503 also controls one or more pointers to identify one or more location(s) at which new (incoming) data is written into a buffer, for example to overwrite relatively less important frames with relatively more important frames. Optionally frame buffer system 500 may comprise two sets of frame buffers (for each microphone), one to accumulate new data whilst previously accumulated data from a second buffer is processed; then the second buffer can be updated. In embodiments a frame buffer may be relatively large: for example, single precision floating point STFT data at 16 kHz with 50% overlap between frames translates to approximately 8 MB of data per microphone per minute of frame buffer. However the system may accumulate new frames in a temporary buffer while calculating the filter coefficients and at the beginning of the next update cycle update the frame buffer from this temporary buffer (so that a complete duplicate frame buffer is not required).
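
    The quoted buffer size can be reproduced with a short calculation. The sketch below assumes a frame length of 512 samples, a one-sided spectrum for real-valued input, and 8 bytes per complex single precision value; these are assumptions made for illustration and the result is almost independent of the frame length chosen.

```python
def stft_bytes_per_minute(sample_rate_hz=16000, frame_length=512, overlap=0.5, bytes_per_complex=8):
    """Approximate STFT storage per microphone per minute of audio."""
    hop = int(frame_length * (1.0 - overlap))          # samples between frame starts
    frames_per_minute = 60.0 * sample_rate_hz / hop
    bins_per_frame = frame_length // 2 + 1             # one-sided spectrum of real input
    return frames_per_minute * bins_per_frame * bytes_per_complex


size_mb = stft_bytes_per_minute() / 1e6
```

    With these assumptions each minute of audio occupies about 7.7 MB per microphone, consistent with the figure of approximately 8 MB given above.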

    [0264] The frame weights are determined by a source characterisation module 504. In implementations this determines frame weights according to one or more of the previously described heuristics. This may operate in the time domain or (more preferably) in the time-frequency domain as illustrated. More particularly, in implementations this may implement a multiple-source direction of arrival (DOA) estimation procedure. The skilled person will be aware of many suitable procedures including, for example, an MVDR (Minimum Variance Distortionless Response) beamformer, the MUSIC (Multiple Signal Classification) procedure, or a Fourier method (finding the peaks in the angular Fourier spectrum of the acoustic signals, obtained from a combination of the sensor array response and observations X).

    [0265] The output data from such a procedure may comprise time series data indicating source activity (amplitude) and source direction. The time or time-frequency domain data or output of such a procedure may also be used to identify: a frame containing an impulsive event; and/or frames with less than a threshold sound level; and/or to classify a sound, for example as air conditioning or the like.

    [0266] Referring now to FIG. 5b, this shows modules of an example implementation of a frequency domain filter coefficient determiner 311 for use in implementations of the disclosure. The modules of FIG. 5b operate according to the procedure as previously described. Thus the filter coefficient determination system receives digitised audio data from the multiple audio channels in a time-frequency representation, from the STFT modules 501a-n of FIG. 5a, defining the previously described observations xf. This is provided to an optional dimension reduction module 505 which reduces the effective number of audio channels according to a dimension reduction matrix Df. The dimension reduction matrix may be determined (module 506) either in response to user input defining the number of sources to demix or in response to a determination by the system of the number of sources to demix, step 507. The procedure may determine the number of sources based upon prior knowledge, or a DOA-type technique, or, for example, on some heuristic measure of the output or, say, based on user feedback on the quality of demixed output. In a simple implementation the dimension reduction matrix may simply discard some of the audio input channels but in other approaches the input channels can be mapped to a reduced number of channels, for example using PCA as previously outlined. The complete or reduced set of audio channels is provided to a blind source separation module 508 which implements a procedure as previously described to perform importance weighted/stochastic gradient-based blind source separation. As illustrated by the dashed lines, optionally dimension reduction may effectively be part of the blind source separation 508.

    [0267] The blind source separation module 508 provides a set of demixing matrices as an output, defining frequency domain filter coefficients Wf. In embodiments these are provided to module 509 which removes the scaling ambiguity as previously described, providing filter coefficients for a source s at all the microphones (or reduced set of microphones).

    [0268] The user selects 510 a spatial region as has been described above. In some implementations, this may comprise mapping the directions d in the visual representation on which the user has made their selection to the directions d for which the phase correlations Φ.sub.f (S.sub.mf,d) have been calculated. Thus, the user's selection in the visual representation corresponds to making a selection of directions d from the whole set of directions in the scene. With the selection of d now made, the probability of:


    $$p_m(d_m \in \mathcal{D} \mid r_{11} \ldots r_{NF}) = \sum_{d \in \mathcal{D}} p_m(s_d \mid r_{11} \ldots r_{NF})$$

    may be calculated for each of the sources. The filter coefficients Wf(s) for selecting those sources that meet (for example exceed) the chosen probability threshold are then output by the filter determiner module for use by the spatial filter after conversion back into the time domain. When these coefficients are used by the spatial filter, this has the effect of removing (i.e. cropping) those components of the input audio signal that are from sources outside the selected spatial region.
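
    As an illustrative sketch only, selecting the filter coefficients of the retained sources can be viewed, for a single frequency, as summing the corresponding rows of the demixing matrix. The sketch assumes the scaling ambiguity has already been removed so that the rows are directly comparable, and treats each row as the coefficients for one source across the microphones. The matrix, observation and inside/outside flags are hypothetical.

```python
def crop_filter(demixing_rows, inside_flags):
    """Combine the demixing rows of the sources judged to be inside the selected region."""
    combined = [0j] * len(demixing_rows[0])
    for row, keep in zip(demixing_rows, inside_flags):
        if keep:
            combined = [c + w for c, w in zip(combined, row)]
    return combined


def apply_filter(row, observation):
    """Apply one row of per-microphone coefficients to one frequency-domain observation."""
    return sum(w * x for w, x in zip(row, observation))


# Hypothetical 3-source, 3-microphone demixing matrix for one frequency.
W_f = [
    [1 + 0j, 0j, -1j],
    [0.5 + 0j, 0.5 + 0j, 0j],
    [0j, 1j, 1 + 0j],
]
x_f = [1 + 1j, 2 + 0j, -1j]   # hypothetical observation at the microphones
inside = [False, True, True]  # outcome of the probability threshold per source

cropped = apply_filter(crop_filter(W_f, inside), x_f)
kept_sum = sum(apply_filter(row, x_f) for row, keep in zip(W_f, inside) if keep)
```

    Because the filtering is linear, applying the combined coefficients gives the same output as separating every source and then summing only the retained ones, so the cropped signal can be produced by a single spatial filter.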

    [0269] In some implementations, if a priori knowledge of the source directions is available, an additional layer of user selection may optionally be introduced by a source selection module 511 that operates on a pseudo inverse of the demixing matrix, using the microphone phase responses to choose a specific source s based on that a priori knowledge. That is, the source may be selected 512 by the user, for example by the user indicating a direction of the desired source, based on said a priori knowledge of the source direction.

    [0270] FIG. 6 shows an example of a general purpose computing system 600 programmed to implement a system as described above to improve audibility of an audio signal by blind source separation according to an embodiment of the invention. Thus the computing system comprises a processor 602, coupled to working memory 604, program memory 606, and to storage 608, such as a hard disk. Program memory 606 comprises code to implement embodiments of the invention, for example operating system code, time to frequency domain conversion code, frequency to time domain conversion code, source characterisation code, frame buffer management code, (optional) dimension reduction code, blind source separation code, scaling/permutation code, source selection code, and spatial (time domain) filter code. Working memory 604/storage 608 stores data for the above-described procedure, and also implements the frame buffer(s). Processor 602 is also coupled to a user interface 612, to a network/communications interface 612, and to an (analogue or digital) audio data input/output module 614. The skilled person will recognise that audio module 614 is optional since the audio data may alternatively be obtained, for example, via network/communications interface 612 or from storage 608.

    [0271] FIG. 7 is a flowchart of steps for performing a method 700 according to the present disclosure. The method 700 comprises: capturing 701 the audio signal with one or more recording devices; separating 702 the audio signal into a plurality of components each associated with one of the plurality of audio sources; selecting 703 a spatial region in the scene; determining 704 which of the plurality of components are associated with an audio source positioned outside of the selected spatial region; and cropping 705 the plurality of components associated with an audio source positioned outside of the selected spatial region out of the audio signal.

    [0272] No doubt many other effective alternatives will occur to the skilled person. It will be understood that the invention is not limited to the described embodiments and encompasses modifications apparent to those skilled in the art lying within the scope of the claims appended hereto. It should further be noted that the invention also encompasses any combination of embodiments described herein, for example an embodiment may combine the features of any one or more of the independent and/or dependent claims.

    [0273] For example, where the audio signal is described as being separated into a plurality of components associated with one or more of the plurality of audio sources, it is envisaged that in an ideal situation, there may be a one-to-one correspondence whereby each of the plurality of components is associated with one of the audio sources. However, in practice, for example if the scene is more complex and contains many audio sources, BSS may be unable to distinguish the sources perfectly and may inadvertently merge sources, and thus it is also envisaged that in some scenarios, each component may be associated with multiple sources. Finally, in some cases, for example where there are not many audio sources, some of the components may mostly contain ambient noise.

    [0274] For example, the audio and/or video cropping may occur in real time or near real time as the audio and/or video data is being recorded by or streamed to the user device, or it may instead be performed after the recording has taken place as part of an editing process.