Processing and visualising audio signals
11538473 · 2022-12-27
Assignee
Inventors
CPC classification
G10L15/22
PHYSICS
G10L17/26
PHYSICS
G06F3/167
PHYSICS
International classification
G10L15/22
PHYSICS
G10L17/26
PHYSICS
Abstract
The present disclosure relates to methods, computer programs, and computer-readable media for processing a voice audio signal. A method includes receiving, at an electronic device, a voice audio signal, identifying spoken phrases within the voice audio signal based on the detection of voice activity or inactivity, dividing the voice audio signal into a plurality of segments based on the identified spoken phrases, and in accordance with a determination that a selected segment of the plurality of segments has a duration, T.sub.seg, longer than a threshold duration, T.sub.thresh, identifying a most likely location of a breath in the audio associated with the selected segment and dividing the selected segment into sub-segments based on the identified most likely location of a breath.
Claims
1. A method of processing a voice audio signal, the method comprising: receiving, at an electronic device, a voice audio signal; identifying spoken phrases within the voice audio signal based on the detection of voice activity or inactivity; dividing the voice audio signal into a plurality of segments based on the identified spoken phrases; and in accordance with a determination that a selected segment of the plurality of segments has a duration T.sub.seg longer than a threshold duration T.sub.thresh, wherein T.sub.thresh is selected based on a maximum duration of speech before a breath is taken: identifying a most likely location of a breath in the audio associated with the selected segment; and dividing the selected segment into sub-segments based on the identified most likely location of a breath.
2. The method of claim 1, wherein identifying a most likely location of a breath comprises searching the audio associated with the selected segment between a first time limit and a second time limit for the most likely location of a breath.
3. The method of claim 1, wherein dividing the voice audio signal into a plurality of segments comprises determining a start time and an end time in the voice audio signal for each segment; and wherein the determination that a selected segment has a duration longer than the threshold duration is performed after determining the start time of the selected segment but before determining the end time of the selected segment.
4. The method of claim 3, wherein the first time limit is a predefined minimum phrase duration, T.sub.min, after the start time of the selected segment.
5. The method of claim 4, wherein the second time limit is a predefined maximum phrase duration, T.sub.max, after the start time of the selected segment.
6. The method of claim 5, wherein the threshold duration, T.sub.thresh, equals the sum of the minimum phrase duration, T.sub.min, and the maximum phrase duration, T.sub.max.
7. The method of claim 5, wherein the threshold duration is a first threshold duration, and wherein the method further comprises: after determining an end time of the selected segment, and in accordance with a determination that the duration of the selected segment does not exceed the first threshold duration: determining if the duration of the selected segment exceeds a second threshold duration, wherein the second threshold duration is less than the first threshold duration; and if the segment duration exceeds the second threshold duration: searching the audio associated with the selected segment between the first time limit and a third time limit for a most likely location of a breath; and dividing the selected segment into sub-segments based on the identified most likely location of the breath; wherein the third time limit equals the duration of the selected segment, T.sub.dur, minus the minimum phrase duration, T.sub.min.
8. The method of claim 7, wherein the second threshold duration equals the maximum phrase duration, T.sub.max.
9. The method of claim 5, wherein the predefined maximum phrase duration, T.sub.max, is in the range 5-10 seconds or 7-9 seconds.
10. The method of claim 4, wherein the predefined minimum phrase duration, T.sub.min, is in the range 0.3-2 seconds or 0.5-1.5 seconds.
11. The method of claim 1, wherein identifying a most likely location of a breath in the audio associated with the selected segment comprises identifying a minimum signal energy in the audio associated with the selected segment.
12. The method of claim 11, wherein identifying a minimum signal energy comprises: searching the audio associated with the selected segment for the minimum signal energy within a moving time window.
13. The method of claim 12, wherein the duration of the time window is in the range 250-500 ms.
14. The method of claim 1, further comprising applying a low-pass filter to the audio associated with the selected segment prior to identifying a most likely location of a breath.
15. The method of claim 1, further comprising displaying, on a display of the electronic device, a visual representation of the segments and sub-segments.
16. A computer program stored on a non-transitory, processor readable storage the computer program configured to: receive, at an electronic device, a voice audio signal; identify spoken phrases within the voice audio signal based on the detection of voice activity or inactivity; divide the voice audio signal into a plurality of segments based on the identified spoken phrases; and in accordance with a determination that a selected segment of the plurality of segments has a duration T.sub.seg longer than a threshold duration T.sub.thresh, wherein T.sub.thresh is selected based on a maximum duration of speech before a breath is taken: identify a most likely location of a breath in the audio associated with the selected segment; and divide the selected segment into sub-segments based on the identified most likely location of a breath.
17. A non-transitory, computer readable medium comprising instructions which, when executed by a computer, cause the computer to; receive, at an electronic device, a voice audio signal; identify spoken phrases within the voice audio signal based on the detection of voice activity or inactivity; divide the voice audio signal into a plurality of segments based on the identified spoken phrases; and in accordance with a determination that a selected segment of the plurality of segments has a duration T.sub.seg longer than a threshold duration T.sub.thresh, wherein T.sub.thresh is selected based on a maximum duration of speech before a breath is taken: identify a most likely location of a breath in the audio associated with the selected segment; and divide the selected segment into sub-segments based on the identified most likely location of a breath.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) By way of example only, certain embodiments of the invention shall now be described by reference to the accompanying drawings, in which:
DETAILED DESCRIPTION
(7) User interface 100 comprises three user-selectable buttons: a settings button 101; a start recording button 102; and a transfer button 103. The transfer button 103 may be used to transfer the recorded audio to a complementary application, for example running on a different electronic device. User interface 100 also comprises a list of previous recordings 104, which can be selected to open the recording. Recording of a voice audio signal is started by pressing the record button 102.
(8) Audio is captured by the electronic device and passed as raw data to a voice activity detector (VAD). An exemplary voice activity detector is described in reference [3], which is incorporated herein by reference.
(9) The VAD analyses the incoming audio signal to identify complete phrases within the spoken words based on identifying pauses in the speaker's speech. The VAD identifies a segment of the audio signal corresponding to a phrase by determining a segment start and end time. The VAD then delivers the segment start and end times to a user interface module. The user interface module determines what to display in user interface 100 and user interface 200, described below. The start and end times may be delivered in real time, or near-real time, during a live audio recording. A short delay in delivering the times may be caused by various post-processing stages in the VAD.
(10) As discussed in more detail below, the present invention identifies sub-segments in the audio signal. This may be achieved by intercepting the segment information from the VAD, and determining sub-segment start and end times where necessary to avoid overly long segments. These original and sub-segment start times can then be delivered to the user interface module to be displayed to the user. Although such additional processing may add a further delay, for example of a few seconds, the delay does not look out of place because the user is accustomed to a delay between the end of a phrase and the display showing the end of the block.
(11) When recording is started the display switches to the recording screen user interface 200 shown in
(12) When the user interface module receives an end of segment time, it finishes the current block; and when it receives a start of segment time it starts a new block. When these times are generated by some embodiments of the present invention, they are delivered one straight after the other, and drawn in the specified place in the display. Note that the start and end of blocks are not drawn where the cursor is, as they are always in the past due to the processing delay in the VAD.
(13) Once the cursor has reached the right hand side of the screen, the audio display moves continually to the left, keeping the cursor on the right hand side of the screen.
(14) The user can annotate the display in a number of ways. They can insert a section break using the button 203, colour a block or series of blocks using the colour buttons 204, or colour a whole section using the section colouring buttons 205. The colours can be used to identify certain categories of phrases or sections. A key to the meaning of the colours can be changed by the user in the settings menu. The user can also add text into the notes pane 206. There may be a number of other annotation options, for example including taking a photo, loading up slides, scribbling and/or adding a pre-existing photo.
(15) Once complete, the recording can be saved, and/or transferred to another computing device. For example, where the recording has been made using a mobile phone, the completed audio may be transferred to a computer.
(16) As discussed above, it can be difficult for the VAD to determine the location of a pause in the speech, leading to the misidentification of phrases, especially in noisy environments such as lecture theatres. Speech segmentation for such ambient-speech visualisation (ASV) environments requires a sensitive voice activity detector able to distinguish a speaker, at a distance, from all the background noise of the room. The main aims of an ASV-VAD are:
(17) 1. The segments should correspond to perceived phrases as closely as possible. It is important that speech phrases and the gaps between them are visualised in an intuitive way. This not only helps the auditory-to-visual link, but is also important for editing.
(18) 2. False negatives (missed phrases) should be avoided, if necessary in favour of false positives (blocks of non-speech noise). False positives are acceptable as long as something obvious has caused them; otherwise they appear as spurious malfunctions of the system.
(19) However, in noisy environments, these two aims come into direct conflict with one another. To achieve aim (2), a VAD which can reliably detect all speech in the presence of noise is required, so the detection threshold needs to be low. To achieve aim (1), the VAD also needs to reliably detect all the pauses between spoken phrases, or the phrases get joined together. But a VAD cannot do both, especially in acoustic environments where other people are talking in the background, such as talks at exhibitions. Its detection threshold will either be low, in which case it will tend to miss pauses, or high, in which case it will tend to miss speech. The problem of joining phrases together is greatly exacerbated by the necessary use of "Hangover", where the VAD stays on after the end of speech to allow for quiet speech that is below the level of the noise.
(20) There have been many decades of research into VADs; however, none of this research helps solve this conflict, because for all other applications of a VAD only the second aim is important, and it does not matter if the pauses between phrases are missed. It is only a problem for speech visualisation, where a strong correspondence between spoken phrases and the visualised equivalent is required.
(21) It is noted that the real-time ASV described herein contrasts with the audio file visualisation system for voice messages described in U.S. Pat. No. 6,055,495 (A). In that system the main concern was to reduce the number of pauses, as too many were being found using a simple VAD on clean audio, and being a non-real-time system it could do this by processing the whole file at once and selecting only the longest or deepest pauses. But an ASV system rarely displays too many pauses because of the highly sensitive VAD required, and instead it is the opposite problem of too few pauses that needs solving.
(22) The present invention seeks to overcome such problems, by splitting overly long segments into shorter sub-segments based on the most likely location of a breath in the audio signal. The invention may be implemented in a notetaking software application as an extra component to the visualization system, which operates on the output of the VAD. This component ensures that even with the most sensitive VADs, sufficient pauses are found between segments to give an effective audio to visual correspondence for perception and for editing.
(23) The invention uses the fact that humans need to breathe whilst speaking, and so after a certain duration of a segment, it is safe to assume that somewhere in that segment there must be a breath.
(25) At step 301, a voice audio signal is received at an electronic device. The voice audio signal may be a live recording, as described above. Alternatively, receiving the audio signal may comprise receiving a pre-recorded audio signal, such as a recording that has already been completed and saved to the electronic device for later processing.
(26) At step 302, spoken phrases are identified within the voice audio signal. The spoken phrases may be identified by a voice activity detector (VAD), as described above. In particular, spoken phrases may be identified by detecting voice activity or inactivity in the voice audio signal, such as by detecting pauses in the speaker's speech.
(27) At step 303, the voice audio signal is divided into a plurality of segments based on the identified spoken phrases. Dividing the voice audio signal into a plurality of segments comprises determining (for example by the VAD) a start time and an end time in the voice audio signal for each segment.
(28) At step 304, it is determined whether a selected segment of the plurality of segments has a duration, T.sub.seg, longer than a threshold duration, T.sub.thresh. The threshold duration can be selected based on a maximum expected duration of speech before a breath is taken. This determining step may be performed before the end time of the selected segment has been identified—i.e. the processing can be performed on speech being currently recorded.
(29) At step 305, a most likely location of a breath in the audio associated with the selected segment is identified. This may comprise identifying a signal energy minimum. It has been found that this identification may be done easily and effectively by finding the point in the selected segment where the signal energy measured over a certain time window is lowest (e.g. looking at the average energy in that time window or sum of signal energy across the time window). For example the time window may be moved through the audio in increments that are smaller than the time window itself, such as increments that are 10% or less the duration of the time window. As sometimes it is not background noise but the speaker breathing noisily that causes the pauses to be missed, the signal can advantageously be low-pass filtered first to help remove the breath noise. The time window for measuring the signal energy may for example be in the range 250-500 ms.
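The moving-window search described in paragraph (29) can be sketched in Python as follows. This is a minimal illustration only, not part of the claimed embodiments: the function names, the one-pole filter coefficient, and the default parameter values are assumptions chosen to sit within the ranges given above (window of 250-500 ms, hop increments of 10% or less of the window).

```python
def low_pass(samples, alpha=0.1):
    """Simple one-pole low-pass filter, used to attenuate breath noise
    before searching for the quietest point (coefficient is illustrative)."""
    out, y = [], 0.0
    for s in samples:
        y += alpha * (s - y)
        out.append(y)
    return out


def most_likely_breath_index(samples, sample_rate, window_s=0.3, hop_frac=0.1):
    """Return the start index of the lowest-energy window in `samples`.

    A window of `window_s` seconds slides through the audio in hops that
    are a fraction of its own length; the window with the smallest summed
    energy is taken as the most likely breath location.
    """
    window = max(1, int(window_s * sample_rate))
    hop = max(1, int(window * hop_frac))
    best_start, best_energy = 0, float("inf")
    for start in range(0, max(1, len(samples) - window + 1), hop):
        energy = sum(s * s for s in samples[start:start + window])
        if energy < best_energy:
            best_start, best_energy = start, energy
    return best_start
```

On a signal with a quiet stretch in the middle, the returned index falls inside that stretch; summed energy over the window is equivalent here to comparing average energy, since the window length is fixed.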
(30) At step 306, the selected segment is divided into sub-segments based on the identified most likely location of a breath. In particular, the start and end times of the sub-segments can be determined. The start and end times may then be passed to a user interface module, as described above, for visualisation on a user interface. These sub-segments may appear on the user interface as a gap between blocks representing segments. For example, the gap 207 in user interface 200 of
(31) The steps 301-306 are then repeated for each identified segment, so that the full voice audio input is divided into segments with no segment having a duration greater than the threshold duration.
(32) The threshold duration may be selected based on the maximum time before the speaker would be expected to take a breath. In particular, a maximum phrase duration T.sub.max may be defined. T.sub.max represents the longest time an average person may be expected to speak for without taking a breath. T.sub.max may particularly be in the range 7-9 seconds.
(33) In some embodiments, the threshold duration may be selected to equal (or approximately equal) the maximum phrase duration T.sub.max. In such embodiments, when segmenting the audio signal into phrases, the method monitors the duration of a speech segment, and when a segment exceeds T.sub.max, it searches the audio of that segment for the most likely place a breath would have been taken and splits the segment by inserting a pause at that point.
(34) Since it is very unusual for a speaker to breathe immediately after speaking a short utterance, the search may be restricted away from the ends of the segment. This prevents very short segments being produced by the splitting. To this end, a minimum phrase duration, T.sub.min, may also be defined. T.sub.min is the shortest utterance an average person may be expected to make that is immediately followed by a breath. T.sub.min may particularly be in the range 0.5-1.5 seconds.
(35) The method 300 may particularly be performed when the final duration of the selected segment is not yet known. In such cases, if T.sub.max was used as the search threshold and the final duration of the segment before splitting ends up being less than T.sub.max+T.sub.min, there is a risk that the second of the sub-segments would have a duration less than T.sub.min and so would likely not match up with a complete phrase. To avoid this problem, an alternative threshold and alternative search time limits may be used.
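The risk described in paragraph (35) can be illustrated with hypothetical numbers. This is a worked example only; the specific values are assumptions chosen from the ranges given above.

```python
# Hypothetical values within the ranges given above.
T_min, T_max = 1.0, 8.0      # seconds

# Live recording: the open segment has just exceeded T_max, so a split
# is triggered and the breath search places the pause near T_max.
split_at = 7.5               # assumed breath location, within [T_min, T_max]

# The speaker then stops almost immediately after the split point.
T_dur = 8.2                  # final segment duration, less than T_max + T_min

second_sub = T_dur - split_at
assert T_dur < T_max + T_min  # the condition that creates the risk
assert second_sub < T_min     # second sub-segment too short to be a phrase
```

Here the second sub-segment lasts about 0.7 s, below the minimum phrase duration, which is exactly the case the alternative threshold is designed to avoid.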
(36) Steps 401-403 of method 400 match the corresponding steps of method 300. At step 404, it is determined if a selected segment duration, T.sub.seg, is longer than a first threshold duration, T.sub.thresh1, similar to step 304 in method 300.
(37) If it is determined that the duration is longer than the first threshold, the method 400 proceeds to steps 405 and 406, similar to steps 305 and 306 of method 300, in which a most likely location of a breath is identified and the selected segment is divided into sub-segments.
(38) In the particular embodiment illustrated in
(39) However, comparing the duration to T.sub.thresh1=T.sub.min+T.sub.max means that if the selected segment ends with a duration T.sub.dur greater than the maximum phrase duration, but less than the first threshold, then a search for a breath would not be performed. This would allow segments to be defined with final durations longer than the expected maximum time a person would speak without taking a breath—and so the segments may not match up to actual phrases in the speech.
(40) To overcome this, the method 400 compares the final duration of the selected segment to a second threshold duration, and if the duration is too long, divides the selected segment into sub-segments.
(41) As shown in
(42) If it is determined that the final duration of the selected segment does not exceed the second threshold duration, no action is taken to shorten the selected segment, and the conventionally determined start and end times of the selected segment are recorded. For example, the start and end times of the selected segment may be passed to a user interface module, to be represented on a user interface as a phrase block.
(43) On the other hand, if it is determined that the final duration of the selected segment does exceed the second threshold duration, the method proceeds to step 408.
(44) At step 408, a most likely location of a breath in the audio associated with the selected segment is identified. This process may be similar to that described above in relation to step 305 of method 300. Unlike at step 405, the final duration of the selected segment is now known. The search of the audio for the most likely location of the breath can now be performed in a time window from T.sub.min after the start time of the selected segment to T.sub.min before the end time of the selected segment. This time window ensures that the sub-segments are not shorter than the minimum phrase duration.
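The step-408 search on a completed segment can be sketched in Python as follows, restricting the search window to between T.sub.min after the start and T.sub.min before the end so that neither sub-segment can be shorter than the minimum phrase duration. The function name and parameter defaults are illustrative assumptions, not part of the claimed embodiments.

```python
def split_finalised_segment(samples, sample_rate, t_min=1.0,
                            window_s=0.3, hop_frac=0.1):
    """Split a completed segment at the lowest-energy window found
    between t_min after its start and t_min before its end, returning
    the two sub-segments as (first_part, second_part)."""
    margin = int(t_min * sample_rate)          # keep t_min clear of each end
    window = max(1, int(window_s * sample_rate))
    hop = max(1, int(window * hop_frac))
    lo, hi = margin, len(samples) - margin - window
    best, best_energy = lo, float("inf")
    for start in range(lo, hi + 1, hop):
        energy = sum(s * s for s in samples[start:start + window])
        if energy < best_energy:
            best, best_energy = start, energy
    split = best + window // 2                 # split mid-way through the quiet window
    return samples[:split], samples[split:]
```

Because the search range excludes the first and last `t_min` seconds, each returned sub-segment is guaranteed to last at least the minimum phrase duration.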
(45) The method then proceeds to step 409, in which the selected segment is divided into sub-segments based on the identified most likely location of a breath. In particular, the start and end times of the sub-segments can be determined, and may be passed to a user interface module for display on a user interface, as described above.
(46) The steps 401-409 are then repeated for each identified segment, so that the full voice audio input is divided into segments with no completed segment having a duration greater than the second threshold duration. The steps 407-409, in which the final duration of a completed segment is compared to the second threshold, may be performed at the same time as steps 404-406 are being performed on a subsequent, not yet completed segment.
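The two-threshold decision flow of method 400 can be summarised in a short Python sketch. This is illustrative only: the threshold values follow the embodiment above (T.sub.thresh1 = T.sub.min + T.sub.max, second threshold equal to T.sub.max), while the function name and the tuple-based return convention are assumptions.

```python
def handle_segment(t_seg, segment_finished, t_min=1.0, t_max=8.0):
    """Decide which breath search, if any, to run on a segment.

    Returns (action, search_from, search_to), where the search limits
    are offsets in seconds from the segment start time.
    """
    t_thresh1 = t_min + t_max   # first threshold, applied while recording
    t_thresh2 = t_max           # second threshold, applied once finished
    if not segment_finished:
        if t_seg > t_thresh1:
            # Steps 404-406: search between T_min and T_max.
            return ("search", t_min, t_max)
        return ("wait", None, None)
    if t_seg > t_thresh2:
        # Steps 407-409: search between T_min and T_dur - T_min.
        return ("search", t_min, t_seg - t_min)
    # Segment is short enough; keep its start and end times as determined.
    return ("keep", None, None)
```

An open segment of 8.5 s is left alone (it has not yet exceeded 9.0 s), but a finished segment of the same length triggers a search, matching the rationale in paragraphs (38)-(41).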
(47) By performing the methods 300 and 400 described above, segmentation of a voice audio signal may more accurately reflect the phrases originally spoken by the speaker.
(48) Although described above in terms of methods, it will be appreciated that the invention may be embodied in a computer program configured to perform any of the methods described above, or in a computer readable medium comprising instructions which, when executed, perform any of the methods described above.
(49) Although the invention has been described above with reference to one or more preferred embodiments, it will be appreciated that various changes or modifications may be made without departing from the scope of the invention as defined in the appended claims.
LIST OF REFERENCES
(50) [1] "Capturing, Structuring and Representing Ubiquitous Audio", Hindus, Schmandt and Horner, ACM Transactions on Information Systems, Vol. 11, No. 4, October 1993, pages 376-400. https://www.media.mit.edu/speech/papers/1993/hindus_ACM93_ubiquitious_audio.pdf
[2] "Voice Annotation and Editing in a Workstation Environment", Ades and Swinehart, Xerox Corporation publication, 1986. http://bitsavers.trailing-edge.com/pdf/xerox/parc/techReports/CSL-86-3_Voice_Annotation_and_Editing_in_a_Workstation_Environment.pdf
[3] "A Model based Voice Activity Detector for Noisy Environments", Kaavya Sriskandaraja, Vidhyasaharan Sethu, Phu Ngoc Le, Eliathamby Ambikairajah, Interspeech 2015. https://pdfs.semanticscholar.org/e493/1be6673cb6067db1c855636bea6c8ab170a5.pdf