METHOD AND SYSTEM FOR TIME AND FEATURE MODIFICATION OF SIGNALS
20230032838 · 2023-02-02
Inventors
CPC classification
G10H2250/455 · G10H2210/066 · G11B27/10 · G10H2210/091 · G10H2240/141 (PHYSICS)
International classification
Abstract
The application relates to a computer implemented method and system for modifying at least one feature of an input audio signal based on features in a guide audio signal. The method comprises: determining matchable and unmatchable sections of the guide and input audio signals; generating a time-alignment path for modifying the at least one feature of the input audio signal in the matchable sections of the input audio signal based on corresponding features in the matchable sections of the guide audio signal; and, based on the time-alignment path, modifying the at least one feature in the matchable sections of the input audio signal.
Claims
1. A computer implemented method for modifying one or more features of an input audio signal based on one or more features in a guide audio signal, the method comprising: comparing one or more audio features in the guide audio signal with one or more corresponding features in the input audio signal to determine a matchability signal, the matchability signal indicating matchable sections of the input audio signal having audio features that can be modified to match the guide audio signal, and unmatchable sections of the input audio signal having audio features that cannot be modified to match the guide audio signal; and modifying one or more audio features in the input audio signal in dependence on the matchability signal.
2. The computer implemented method of claim 1, comprising: generating a time-alignment path mapping the timing of one or more features of the input audio signal to the corresponding features in the guide audio signal; and, based on the time-alignment path, modifying the one or more features in the matchable sections of the input audio signal.
3. The computer implemented method of claim 2, comprising: based on the matchability signal, modifying the matchable sections of the input audio signal such that one or more features in the input audio signal are matched to the guide audio signal, and/or, selectively modifying the timing features of the unmatchable sections of the input audio signal.
4. The computer implemented method of claim 2, wherein the determining of matchable and unmatchable sections of the two audio signals further comprises: dividing the guide audio signal and the input audio signal into a plurality of time frames; analyzing the plurality of time frames in the guide audio signal and the input audio signal, and classifying the plurality of time frames into time frames with a sufficient signal level and time frames with an insufficient signal level; and determining that time frames of the guide audio signal or the input audio signal that have an insufficient signal level are unmatchable sections.
5. The computer implemented method of claim 4, wherein the determination of whether a frame has an insufficient signal level comprises: determining the noise floor and/or dynamic range of the guide audio signal and the input audio signal in each respective time frame of the plurality of time frames, and determining a signal threshold based on the determined noise floor and dynamic range; and comparing the signal in each time frame against the signal threshold.
6. The computer implemented method of claim 5, wherein analyzing and classifying the time frames of the guide audio signal and the input audio signal into frames with sufficient and insufficient signal levels comprises classifying each frame as being voiced or unvoiced, with reference to the noise floor and/or dynamic range of the guide audio signal and the input audio signal, and a voiced threshold.
7. The computer implemented method of claim 1, wherein determining the matchability signal comprises dividing the guide audio signal and the input audio signal into a plurality of time frames, and labeling each of the plurality of frames with an identifier indicating whether the individual frame has been determined as matchable or unmatchable.
8. The computer implemented method of claim 1, wherein determining the matchability signal comprises: dividing the guide audio signal and the input audio signal into a plurality of time frames, and labeling each of the plurality of frames with one of a plurality of matchability values indicating the degree to which the input audio signal can be modified to match with the guide audio signal.
9. The computer implemented method of claim 8, comprising: depending on the matchability value assigned to each time frame, selecting one of a plurality of different predetermined types of audio processing to apply to the time frame of the input audio signal.
10. The computer implemented method according to claim 1, wherein the one or more features of the input audio signal to be modified include one or more of: the timing of the audio signal, pitch, loudness, frequency, spectral energy, and/or amplitude.
11. The computer implemented method according to claim 1, wherein the matchability signal is determined based on one or more of: the timing of the audio signal, pitch, loudness, frequency, spectral energy, and/or amplitude.
12. A system for modifying one or more features of an input audio signal based on one or more features in a guide audio signal, the system comprising a processor configured to: compare one or more audio features in the guide audio signal with one or more corresponding features in the input audio signal to determine a matchability signal, the matchability signal indicating matchable sections of the input audio signal having audio features that can be modified to match the guide audio signal, and unmatchable sections of the input audio signal having audio features that cannot be modified to match the guide audio signal; modify one or more audio features in the input audio signal in dependence on the matchability signal.
13. The system of claim 12, wherein the processor is configured to: generate a time-alignment path mapping the timing of one or more features of the input audio signal to the corresponding features in the guide audio signal; and, based on the time-alignment path, modify the one or more features in the matchable sections of the input audio signal.
14. The system of claim 13, wherein the processor is configured to: based on the matchability signal, modify the matchable sections of the input audio signal such that one or more features in the input audio signal are matched to the guide audio signal, and/or selectively modify the timing features of the unmatchable sections of the input audio signal.
15. The system of claim 14, wherein the determining of matchable and unmatchable sections of the two audio signals further comprises: dividing the guide audio signal and the input audio signal into a plurality of time frames; analyzing the plurality of time frames in the guide audio signal and the input audio signal, and classifying the plurality of time frames into time frames with a sufficient signal level and time frames with an insufficient signal level; and determining that time frames of the guide audio signal or the input audio signal that have an insufficient signal level are unmatchable sections.
16. The system of claim 15, wherein the determination of whether a frame has an insufficient signal level comprises: determining the noise floor and/or dynamic range of the guide audio signal and the input audio signal in each respective time frame of the plurality of time frames, and determining a signal threshold based on the determined noise floor and dynamic range; and comparing the signal in each time frame against the signal threshold.
17. The system of claim 16, wherein analyzing and classifying the time frames of the guide audio signal and the input audio signal into frames with sufficient and insufficient signal levels comprises classifying each frame as being voiced or unvoiced, with reference to the noise floor and/or dynamic range of the guide audio signal and the input audio signal, and a voiced threshold.
18. The system of claim 12, wherein determining the matchability signal comprises dividing the guide audio signal and the input audio signal into a plurality of time frames, and labeling each of the plurality of frames with an identifier indicating whether the individual frame has been determined as matchable or unmatchable.
19. The system of claim 12, wherein the processor is configured to: divide the guide audio signal and the input audio signal into a plurality of time frames, and label each of the plurality of frames with one of a plurality of matchability values indicating the degree to which the input audio signal can be modified to match with the guide audio signal.
20. The system of claim 19, wherein the processor is configured to: depending on the matchability value assigned to each time frame, select one of a plurality of different predetermined types of audio processing to apply to the time frame of the input audio signal.
21. The system of claim 12, wherein the one or more features of the input audio signal to be modified include one or more of: the timing of the audio signal, pitch, loudness, frequency, spectral energy, and/or amplitude.
22. The system of claim 12, wherein the matchability signal is determined based on one or more of: the timing of the audio signal, pitch, loudness, frequency, spectral energy, and/or amplitude.
Description
DESCRIPTION OF THE DRAWINGS
[0038] Example embodiments of the invention will now be described by way of example and with reference to the drawings in which:
EXAMPLE EMBODIMENTS
[0056] Before explaining how the new modifications are incorporated into the alignment process, the prior art processing methodology is described in further detail with reference to
[0058] The digitized Guide signal g(nR) is passed to feature analysis block 230, where it is sampled at sample rate 1/R (where 1/R is typically 44,100 Hz) and where it undergoes speech parameter (or feature) measurement processing (e.g. spectral energy). The variable n is the sample index. The feature analysis block 230 can be implemented as a digital processor (e.g. an N-band digital filter bank) that provides as output a sequence of feature vectors f.sub.G(jT) at frame rate 1/T, where 1/T is typically 100 Hz.
[0059] The digitized Dub audio signal d(nR) also undergoes measurement in a second but identical feature measurement process 240 to create a sequence of feature vectors f.sub.D(kT). The variables j and k are data frame numbers for the Guide and Dub “tracks” respectively, starting at 0, and T is the analysis interval. The signals of interest do not necessarily start at j=0 or k=0 in the Guide and Dub tracks; they often start elsewhere.
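By way of illustration only, the frame-based feature measurement of blocks 230 and 240 may be sketched as follows. The frame length, hop size, band count and the naive DFT are illustrative assumptions, not part of the described system; a practical implementation would use an optimized N-band filter bank.

```python
import math

def frame_features(x, frame_len=64, hop=16, n_bands=4):
    """Split x into frames of frame_len samples every hop samples and
    return one log band-energy feature vector per frame (a sketch of
    the feature vectors f_G(jT) / f_D(kT))."""
    feats = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        half = frame_len // 2
        # naive DFT magnitudes for bins 0..half-1 (illustrative only)
        mags = []
        for k in range(half):
            re = sum(frame[n] * math.cos(2 * math.pi * k * n / frame_len)
                     for n in range(frame_len))
            im = -sum(frame[n] * math.sin(2 * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mags.append(math.hypot(re, im))
        # sum magnitude-squared into n_bands equal bands, take log energy
        band = half // n_bands
        vec = [math.log10(1e-9 + sum(m * m for m in mags[b * band:(b + 1) * band]))
               for b in range(n_bands)]
        feats.append(vec)
    return feats
```

Each returned vector corresponds to one analysis frame at rate 1/T; a low-frequency input concentrates its energy in the lowest band.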
[0060] As illustrated in
[0061] From the feature analysis blocks 230 and 240, the Guide feature vectors f.sub.G(jT) and Dub feature vectors f.sub.D(kT) are then passed to time-alignment processor 250 to determine a time-warping function w(jT). As will be described later, the time warping function w(jT) provides a time-warping path (time alignment path) between the vectorised dub audio signal and the vectorised guide audio signal, in order to bring the dub audio signal into alignment with the guide audio signal. The time warping path is (on a frame-by-frame basis) the sequence of dub parameter vectors that best matches the fixed sequence of guide parameter vectors by allowing dub frames to be repeated or omitted. In this regard, a frame can be understood as a sequence of digits in the respective vectors f.sub.G(jT) and f.sub.D(kT) having a notional starting point and length in time T.
[0063] An example time warping path (time alignment path) is shown in
[0064] Some parameters of the vectors f.sub.G(jT) and f.sub.D(kT) may also be used in the time alignment processor 250 to classify successive regions of the new dialogue signal into speech and silence, and from that process to produce speech/silence classification data c(jT). The classification data c(jT) may be used in the time alignment processor 250 to determine the time warping function w(jT), but in this prior art example has no other function. The w(jT) data is then used in a signal editing processor 280 for generating the corresponding editing instructions for the Dub audio signal d(nR), so that the editing can be carried out in a process in which periods of silence or audio are lengthened or shortened to provide the required alignment. Other feature classification data may also be used in the signal editing processor 280, including variables such as pitch data MP(kT) which is described later in
[0066] The time alignment algorithm creates a number of paths starting from different origins in parallel, with each path produced describing the best sequence from the starting to the final point. What counts as the “best” is determined by creating a “dissimilarity Score” for every path being explored for aligning the two signals, based on the measured feature vector comparisons at frames (j,k) incorporated along that path. The path with the lowest dissimilarity Score is deemed the optimal one and defines w(jT). A computer process known as Dynamic Programming (DP), as described in J. S. Bridle (1982), may be utilized for generating w(jT), although other means, including machine learning or other optimization algorithms, could be used.
[0067] The key aspect of using DP is that the best step to a new end point at j=j.sub.e, k=k.sub.e is found by starting at the new end point and searching backwards to at most three previous best path ends at j=j.sub.e−1, k=k.sub.e−a (where a=0,1, or 2), and connecting the new end point to the path which generates the best (i.e. lowest) dissimilarity Score.
[0068] Two main principles are used in determining w(jT): (1) the optimum set of values of j for the whole range of j from 1 to J is also optimum for any small part of the range of j; and (2) the optimum set of values of k corresponding to the values of j from j.sub.s to any value j.sub.e for which there is a corresponding k.sub.s and k.sub.e depends only on the values of k from k.sub.s to k.sub.e. The subscripts ‘s’ and ‘e’ represent ‘start’ and ‘end’, respectively. This process is illustrated in
[0069] Several data arrays are used to hold the required data, an example of which is shown in
[0070] Using the computed distances and the array of previous scores, the algorithm (also known as ‘ZIP’) computes a new best score independently for each of the L+2 new endpoints using the DP equation and, at the same time, saves in a two-dimensional array of path elements named PATH the corresponding index of k which provided each best step. The index indicates the frame index from which the best step was made, therefore each index is actually a pointer to the previous frame's path end. Successive pointers generate a path which can be traced back to its origin. The PATH array holds a multiplicity of such strings of pointers. The SCORE, DIST and PATH arrays are shown in
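As a concrete illustration of the DP recursion and the SCORE/PATH bookkeeping described above, the following sketch aligns two sequences of scalar features, with the three backward steps a=0, 1, 2 and a trace-back through the stored path pointers. The scalar distance measure and the step-penalty value are assumptions for the sketch; the actual ZIP algorithm operates on multi-band feature vectors.

```python
def align(guide, dub, penalty=0.5):
    """Minimal DP time-alignment sketch: returns w, where w[j] is the
    dub frame matched to guide frame j."""
    J, K = len(guide), len(dub)
    INF = float("inf")
    score = [[INF] * K for _ in range(J)]   # SCORE array
    back = [[0] * K for _ in range(J)]      # PATH array of step pointers
    for k in range(K):                      # paths may start at any dub frame
        score[0][k] = abs(guide[0] - dub[k])
    for j in range(1, J):
        for k in range(K):
            best, best_a = INF, 0
            # a=1 advances one dub frame; a=0 repeats and a=2 omits a
            # frame, each incurring a time-stretching penalty
            for a in (0, 1, 2):
                if k - a < 0:
                    continue
                prev = score[j - 1][k - a] + (0 if a == 1 else penalty)
                if prev < best:
                    best, best_a = prev, a
            score[j][k] = best + abs(guide[j] - dub[k])
            back[j][k] = best_a
    # trace the lowest-score path back from the final endpoint
    k = K - 1
    w = [0] * J
    for j in range(J - 1, -1, -1):
        w[j] = k
        if j > 0:
            k -= back[j][k]
    return w
```

For example, aligning a guide of four frames against a dub with one extra leading frame skips that frame and then advances one-for-one.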
[0071] Having described the prior art with reference to
[0072] Embodiments of the invention provide the means to align the matchable sections of the entire timespan of a Guide and Dub audio signal, while leaving the unmatchable sections largely unaltered. This removes the need for the manual identification and selection of matchable signal ranges described above. In addition, it allows the operator to process longer durations of two signals without regard to where the gaps in silence and other major discrepancies arise. This makes the task of audio alignment and feature transfer simpler, faster, more reliable and more accurate. The purpose of this improvement is to position the start and end of the sections of the Guide's processing based on where continuous runs of the Dub's significant content are positioned in time.
[0073] In some embodiments of this invention, in order to demarcate the start and end of the matchable sections correctly and accurately, a pre-processing classification may be provided to run over the entire range of both the Guide and Dub audio signals, to determine the most likely positions of the boundaries between any matchable and unmatchable sections of the two signals and to generate a time-varying signal indicating the approximate “matchability”. This may be followed by a modified Time Alignment algorithm to time-align not only the matchable time-dependent features but also to incorporate the alignment of the unmatchable ranges to optimize the matchable ranges. The modified Time Alignment algorithm receives the initial approximate boundary information for each pair of matching Guide/Dub sections, as well as the frame-by-frame matchability signal data, to use when computing the alignment path for the entire selected signal. The ‘matchable’ features may be defined as roughly similar features. The subsequent time- and feature-editing processing of the Dub audio signals may be carried out as described in the prior signal modification methods in
[0074] It is also proposed that other improvements could result from having further levels or categories of signal matching, allowing for more refined modifications to the warp path creation and feature processing. Signal matching information could also inform more intelligent control of the signal modification processing based on the parameters being compared, such as pitch, loudness or other extractable features. These other types of discrepancies can be measured before a decision process selects what extra processing needs to be applied (e.g. if one decides singing is not a double but a harmony, one could apply different types of pitch correction if required). However, this again will need time-alignment information before one can measure and determine the actual discrepancies at the relevant corresponding locations in the signals, and so the solution to this matter will be of a similar nature to defining unmatchable areas.
[0077] This set of processing is referred to as pre-processing, which includes calculating the matchable ranges (
[0078] The process of calculating the matchable ranges is shown in
[0081] In some embodiments, once the matchable ranges have been created as described in Flowcharts 1 to 3, an extended time range in the Guide may be found. This may be an important step for setting the start of the Guide signal range to time-align based on the Dub's “Voiced” classification range. Adding this step may provide a wider range over which a following time-alignment algorithm can compare the Guide to the Dub audio signals, thereby giving more possible and plausible solutions from which to choose the optimum time alignment path. This may be achieved by extending an amount of time prior to and after the start and end boundaries, respectively, of the most likely mistiming sections between the guide and dub segments.
[0082]
t.sub.G(a)=t.sub.D(a)−T.sub.x to t.sub.G(b)=t.sub.D(b)+T.sub.x
t.sub.G(c)=t.sub.D(c)−T.sub.x to t.sub.G(d)=t.sub.D(d)+T.sub.x
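The boundary extension expressed by the equations above may be sketched as follows; times are in seconds, and the extension margin T.sub.x and the clamping to the signal extent are the only assumptions added here.

```python
def extend_ranges(dub_ranges, t_x, total_duration):
    """Widen each (start, end) dub range by t_x on both sides to obtain
    the corresponding Guide search range, clamped to the signal extent."""
    return [(max(0.0, start - t_x), min(total_duration, end + t_x))
            for start, end in dub_ranges]
```

For instance, with T.sub.x = 0.5 s, a dub range (1.0, 2.0) yields a guide search range (0.5, 2.5).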
[0083] After the computing of the matchable ranges in block 500, the obtained matchable signals M(kT) may be input to a modified Time Alignment process 550, shown in
[0084] In some embodiments, the Time Alignment process 550 may be modified based on the time alignment algorithm described in
[0085] The matchable signal allows the alignment algorithm to work as normal when the Guide and Dub frames' features at j and k are matchable. In this case, the dissimilarity (DIST) measures d(j,k) between the vectors f.sub.G(jT) and f.sub.D(kT) may be used in computing the local path SCORE S.sub.p(j,k), and the final output time warping path w(jT) should successfully match timing features because the SCORE of dissimilarity is the lowest (meaning that the matched features are the most similar). However, when the two signals are not matchable at a given (j,k), the distance measure d(j,k) is disregarded in computing the local path SCORE because the measurement is irrelevant. The normal time-stretching penalties for a=0 and a=2 generally will lead in this case to creating time warping paths increasing by one frame of k for each frame of j (i.e., a linear or a non-linear time warped alignment path, which may leave the dissimilar parts of the signal unaltered).
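The matchability gating described in this paragraph can be sketched by modifying the local distance term of the DP recursion: where either frame is flagged unmatchable, the dissimilarity is disregarded and only the step penalties shape the path, favouring a one-frame-per-frame advance. Scalar features and the penalty value are assumptions for the sketch.

```python
def modified_align(guide, dub, match_g, match_d, penalty=0.5):
    """DP alignment in which d(j,k) is disregarded whenever either the
    guide frame j or the dub frame k is flagged unmatchable."""
    J, K = len(guide), len(dub)
    INF = float("inf")

    def dist(j, k):
        if not (match_g[j] and match_d[k]):
            return 0.0                      # unmatchable: measurement irrelevant
        return abs(guide[j] - dub[k])

    score = [[INF] * K for _ in range(J)]
    back = [[0] * K for _ in range(J)]
    for k in range(K):
        score[0][k] = dist(0, k)
    for j in range(1, J):
        for k in range(K):
            best, best_a = INF, 0
            for a in (0, 1, 2):             # a=0 repeats, a=1 advances, a=2 omits
                if k - a < 0:
                    continue
                prev = score[j - 1][k - a] + (0 if a == 1 else penalty)
                if prev < best:
                    best, best_a = prev, a
            score[j][k] = best + dist(j, k)
            back[j][k] = best_a
    k = K - 1
    w = [0] * J
    for j in range(J - 1, -1, -1):
        w[j] = k
        if j > 0:
            k -= back[j][k]
    return w
```

With all frames unmatchable, the penalties alone produce the linear one-frame-per-frame path, leaving the dissimilar region effectively unaltered.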
[0086] In some embodiments, the determined w(jT) may be further input to an editing processor 280 and a following signal modification processor 290. Although the processes in 280 and 290 are the same as in the prior art described in
[0087] However, in some other embodiments, with more detailed analysis of the feature vectors f.sub.G(jT) and f.sub.D(kT), further use of detecting and measuring additional discrepancies in corresponding aligned features such as pitch, loudness, phonetic or other classifiable measures could be incorporated into
[0088] In
[0091] This method differs from the time-aligned feature processing described in U.S. Pat. No. 7,825,321B2 in that no measurement of degrees of matchability between the signals was made, nor a selection of different processes to apply depending on the specified level of matching.
[0093] In a first step (step 401) of
[0094] The next consideration is in a step defining the appropriate set of different criteria to apply when determining the matchability of the specific feature in the guide audio signal and the dub audio signal, and the corresponding values of measurement ranges to map into a multi-level matchability signal. A simple first example of this process is given for modifying the Dub's level to match or follow the Guide's level.
[0095] Signal Levels of the Guide and Dub can be measured frame by frame (in dB for example) as in blocks 260 and 270 in
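A minimal sketch of the frame-by-frame level measurement performed in blocks 260 and 270; the RMS definition, the full-scale reference of 1.0 and the silence floor are assumptions added for the illustration.

```python
import math

def frame_level_db(frame):
    """RMS level of one frame of samples, in dB relative to full scale 1.0."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20.0 * math.log10(max(rms, 1e-10))   # floor avoids log(0) on silence
```

A full-scale frame measures 0 dB and a half-scale frame roughly -6 dB, so frame levels of the Guide and Dub can be compared directly in dB.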
[0100] Using pitch as a second example, a threshold condition of “Maximum Difference in Pitch of 50 cents” may be chosen and set in step 402 (it will be appreciated that a cent is a logarithmic unit of measure for musical intervals, with one semitone being equal to 100 cents). With this example it is possible to define four levels of matchability of pitch, MP(kT), between the guide audio signal and the dub audio signal as follows:
[0101] MP(kT)=0: Always uneditable.
[0102] MP(kT)=1: Tunable (to a specific scale, for example).
[0103] MP(kT)=2: Tunable and possibly matchable to the Guide pitch, where the difference is greater than a Maximum Difference in Pitch threshold, P.sub.TH.
[0104] MP(kT)=3: Matchable to the Guide (with difference<P.sub.TH).
[0105] These are illustrated in
[0106] Returning to step 403 of
[0107] If the pitch of the dub audio signal is determined not to be valid in step 406, then the matchability level for the frame is set to be zero in step 407, based on the matchability levels discussed above. However, if the pitch of the dub signal is determined in step 406 to be a valid pitch, that is within an expected reasonable range, then the method proceeds to compare for each frame the pitch measurement of the audio dub signal to that of the time-mapped guide. This process begins in step 408, which follows after the yes branch of decision step 406.
[0108] In step 408, the value of the time mapped guide audio signal, j′=w′(kT) is determined for the present frame k in the dub audio signal, and in step 409 it is determined whether the pitch of the guide signal is a valid pitch. Again, this is determined with reference to an expected range of pitches for the guide signal. A pitch may be invalid if there is no pitch in the guide or its pitch is outside a normal frequency range. If the pitch of the guide audio signal is determined to be invalid, then the MP(kT) is set to have a value of 1.
[0109] Assuming both the pitch of the dub audio signal and the pitch of the guide audio signal are determined to be valid pitches for comparison, a comparison of the actual pitch values of both the dub and guide signals is calculated in step 410 to determine whether the pitch of the two signals is within the threshold of 50 cents or outside it. In step 411, the pitch of the dub signal in the present frame k is therefore compared with the pitch of the time-mapped guide signal. If the two pitches are not within the threshold of 50 cents, then the matchability value of MP(kT) is set to be 2. If the two pitches are within the threshold of 50 cents, then the matchability value of MP(kT) is set to be 3. In step 412, the frame value of the dub signal is then incremented and the process returns to step 404. Thus, the process repeats until the Dub's last frame is reached, to determine whether the matchability found is 0, 1, 2 or 3 as described above. This process produces a matchability signal or vector MP(kT) representing the frame-by-frame matchability of the pitch in both the dub and time-mapped guide audio signals.
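The per-frame classification of steps 406 to 411 can be sketched as follows. The valid-pitch frequency range is an assumption, and None stands for "no pitch detected" in the frame; the cent computation follows the standard definition of 1200 cents per octave.

```python
import math

def pitch_matchability(p_dub, p_guide, p_th_cents=50.0, valid=(50.0, 1500.0)):
    """Return the MP level (0-3) for one frame, given the Dub pitch and the
    time-mapped Guide pitch in Hz (None when no pitch is detected)."""
    def is_valid(p):
        return p is not None and valid[0] <= p <= valid[1]
    if not is_valid(p_dub):
        return 0                 # MP=0: Dub has no usable pitch (uneditable)
    if not is_valid(p_guide):
        return 1                 # MP=1: tunable only, no Guide pitch to match
    diff_cents = abs(1200.0 * math.log2(p_dub / p_guide))
    return 3 if diff_cents <= p_th_cents else 2
```

For example, 220 Hz against 224 Hz differs by about 31 cents (MP=3), while 220 Hz against 233 Hz differs by roughly a semitone (MP=2).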
[0110] As an example of determining the multi-level matchability,
[0111] At regions a, e and h, the Dub signal does not have a pitch signal which means P.sub.D(kT) is not a valid pitch, so MP(kT)=0. At region b, P.sub.D(kT) is a valid pitch but P′.sub.G(kT) does not exist. Therefore, this gives rise to MP(kT)=1. At regions c and f, P.sub.D(kT) is a valid pitch and P′.sub.G(kT) exists. The difference P.sub.DIFF between P.sub.D(kT) and P′.sub.G(kT) is calculated at step 410 in
[0112] Once the dub signal has been processed, and the end frame k has been reached (yes, in step 404), control passes to step 405 where the signal MP(kT) is filtered. This process is described in
[0113] Once the raw Matchability signal MP(kT) has been computed for the entire Dub signal, the processing described in
[0114] The process ‘C’ starts at a step 506, which determines if the length of the “Tunable and Matchable” range L.sub.TMR at the Range Count C.sub.TM is smaller than R.sub.MLT. If yes, this indicates that a small gap exists in the “Tunable and Matchable” range, but it may be ‘smoothed’ into the “Tunable and Matchable” range and considered as a matchable range. This is confirmed at a step 507, which determines if the “Tunable and Matchable” range can be joined to a matchable range. If yes, this recognized small gap is reclassified as ‘Matchable’ at a step 508. Returning to steps 506 and 507, if the length of the “Tunable and Matchable” range L.sub.TMR at the Range Count C.sub.TM is NOT smaller than R.sub.MLT, or the “Tunable and Matchable” range can NOT be joined to a matchable range, the Range Count C.sub.TM value is then incremented (C.sub.TM=C.sub.TM+1) at a step 509 and the processing returns to the step 505.
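The gap-smoothing pass of steps 506 to 509 can be sketched as a reclassification of short MP=2 runs that sit between MP=3 ranges; the minimum-length threshold (standing in for R.sub.MLT) is expressed in frames here, and the run-scanning formulation is an assumption of the sketch.

```python
def smooth_matchability(mp, min_len=3):
    """Reclassify runs of MP==2 shorter than min_len frames as MP==3
    when they are bounded on both sides by MP==3 frames."""
    out = list(mp)
    i, n = 0, len(out)
    while i < n:
        if out[i] == 2:
            j = i
            while j < n and out[j] == 2:        # find the end of the run
                j += 1
            short = (j - i) < min_len
            joinable = i > 0 and j < n and out[i - 1] == 3 and out[j] == 3
            if short and joinable:               # steps 506/507 both satisfied
                for t in range(i, j):
                    out[t] = 3                   # step 508: reclassify as Matchable
            i = j
        else:
            i += 1
    return out
```

A single MP=2 frame between two MP=3 ranges is absorbed, while a run at least min_len frames long, or one not bounded by matchable frames, is left as it stands.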
[0115] The final multilevel pitch matchability signal MP(kT) then feeds block 1280 in
[0116] a) Processing of the Dub's pitch to more closely match the Guide's pitch, as described in U.S. Pat. No. 7,825,321B2 and executed commercially in VocAlign Ultra and Revoice Pro from Synchro Arts, when MP(kT)=3 or 2 (using further information from the neighbouring frames' pitch and other data to assess the likelihood of needing modification). In U.S. Pat. No. 7,825,321B2, it is taught that if the Dub pitch at (kT) is within a defined pitch difference threshold, no pitch adjustment is applied;
[0117] b) Use well-known tuning methods to tune the Dub Pitch to a chromatic or musical scale reference grid when MP(kT)=1; and/or
[0118] c) Do nothing, when MP(kT)=0 for example.
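The selection among branches a) to c) above amounts to a per-frame dispatch on the MP(kT) value; a sketch follows, in which the branch names are illustrative labels rather than part of the described system.

```python
def select_pitch_processing(mp):
    """Choose the processing branch a)-c) for a frame from its MP(kT) value."""
    if mp in (2, 3):
        return "match_guide_pitch"   # branch a): pull Dub pitch toward the Guide
    if mp == 1:
        return "tune_to_scale"       # branch b): conventional scale-based tuning
    return "no_processing"           # branch c): leave the frame unaltered
```

The dispatch is applied frame by frame over the filtered MP(kT) signal fed to block 1280.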
[0119] The block diagrams and flowcharts discussed above are intended to illustrate the operation of an example implementation of systems, methods, and computer program products according to various embodiments of the present invention. Each block in the flowchart or block diagram may represent a separate module comprising one or more executable computer instructions, or a portion of an instruction, for implementing the logical function specified in the block. The order of blocks in the diagram is only intended to be illustrative of the example. In alternative implementations, the logical functions illustrated in particular blocks may occur out of the order noted in the figures. For example, two blocks shown adjacent to one another may be carried out simultaneously or, depending on the functionality, in the reverse order. Each block in the flowchart may be implemented in software, hardware or a combination of software and hardware.
[0120] Computing devices on which the invention is implemented may include a desktop computer, a laptop computer, a tablet computer, a personal digital assistant, a mobile telephone, a smartphone, an internet enabled television, an internet enabled television receiver, an internet enabled games console or a portable games device. In all embodiments, it will be understood that the control logic represented by the flowcharts and block diagrams will be stored in memory implemented in or accessible by the computer. This logic stored in a memory of the computing device can be loaded into a real time memory of the processor for real time operation.
[0121] Although described separately, the features of the embodiments outlined above may be combined in different ways where appropriate. Various modifications to the embodiments described above are possible and will occur to those skilled in the art without departing from the scope of the invention which is defined by the following claims.