System and method for improving singing voice separation from monaural music recordings
10354632 · 2019-07-16
CPC classification (PHYSICS): G10H2210/056, G10H1/366, G10H2250/101, G10H2210/066
Abstract
There is provided a post-processing technique or method for separation algorithms that separate vocals from monaural music recordings. The method comprises detecting traces of pitched instruments in a magnitude spectrogram of a separated voice using the Hough transform and removing the detected traces of pitched instruments using median filtering to improve the quality of the separated voice and to form a new separated music signal. The method further comprises applying adaptive median filtering techniques to remove the identified Hough regions from the vocal spectrogram, producing separated pitched-instrument harmonics and new vocals, and adding the separated pitched-instrument harmonics to a music signal separated using any separation algorithm to form the new separated music signal.
Claims
1. A method for improving singing voice separation from monaural music recordings, the method comprising: using Hough transform to detect traces of pitched instruments in a magnitude spectrogram of a voice separated from the monaural music recordings; and removing the detected traces of pitched instruments using adaptive median filtering to improve a quality of the voice separated from the monaural music recordings and to form a new separated music signal.
2. The method of improving singing voice separation of claim 1, wherein the method further comprises: generating the magnitude spectrogram of a mixture signal, wherein the mixture signal is a segment of the monaural music recording; converting the magnitude spectrogram to a grey scale image; applying a plurality of binarization steps to the grey scale image to generate a final binary image; applying Hough transform to the final binary image; identifying horizontal ridges represented by Hough lines and calculating variable frequency bands of the identified horizontal ridges; calculating rectangular regions denoted here as Hough regions; generating a vocal spectrogram from vocal signals separated using a reference separation algorithm; applying adaptive median filtering techniques to remove the identified Hough regions from the vocal spectrogram producing separated pitched instruments harmonics and a new vocal signal; adding the separated pitched instruments harmonics to a music signal separated using the reference separation algorithm to form the new separated music signal.
3. The method of claim 2 wherein the binarization steps are performed through a combination of global and local thresholding techniques followed by extraction of peaks inside time frames.
4. The method of claim 1 wherein the method works as a post processing step that when applied to a separation algorithm, improves separation quality.
5. A system for improving singing voice separation from monaural music recordings, the system comprising a microprocessor for: using Hough transform to detect traces of pitched instruments in a magnitude spectrogram of a voice separated from the monaural music recordings; and removing the detected traces of pitched instruments using adaptive median filtering to improve a quality of the voice separated from the monaural music recordings and to form a new separated music signal.
6. The system for improving singing voice separation of claim 5, wherein the system further comprises: generating the magnitude spectrogram of a mixture signal, wherein the mixture signal is a segment of the monaural music recording; converting the magnitude spectrogram to a grey scale image; applying a number of binarization steps to the grey scale image to generate a final binary image; implementing Hough transform to the final binary image; identifying horizontal ridges represented by Hough lines and calculating variable frequency bands of the identified horizontal ridges; calculating rectangular regions denoted here as Hough regions; generating a vocal spectrogram from vocal signals separated using a separation algorithm; applying adaptive median filtering techniques to remove the identified Hough regions from the vocal spectrogram producing separated pitched instruments harmonics and new vocals harmonics; adding the separated pitched instruments harmonics to a music signal separated using the separation algorithm to form the new separated music signal.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The subject matter that is regarded as the invention is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other aspects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings.
DETAILED DESCRIPTION OF THE INVENTION
(9) The aspects of the method or system for improving singing voice separation from monaural music recordings according to the present invention will be described in conjunction with the accompanying drawings.
(10) The proposed post-processing system makes use of both a mixture signal and the vocals separated by any reference separation algorithm. Firstly, the magnitude spectrogram of the mixture signal is used to generate a binary image that is necessary for operation of the Hough transform. Secondly, the Hough transform is applied on the binary image, generating a plurality of horizontal lines that represent pitched-instrument harmonics. The bandwidth of each of these instrument harmonics is then determined to form rectangular regions denoted as Hough regions. Finally, the formed Hough regions are removed from the magnitude spectrogram of the vocals separated by the reference separation algorithm using an adaptive median filtering technique. The removed pitched-instrument harmonics are then added to the instruments separated by the reference separation algorithm.
(11) In accordance with an embodiment of the present invention, the first step includes calculating a complex spectrogram from the mixture signal s using a window size and an overlap ratio that are suitable for this procedure and independent of the parameters used in the reference separation algorithm. Following this, the magnitude spectrogram S is obtained as an I×J matrix where the value at the i-th row and j-th column is represented using Cartesian coordinates as S(x, y), where x=j and y=i. Then the magnitude spectrogram S is converted to a grey-scale image G_1(x, y) whose scale is [0, 1]. This is followed by a number of binarization steps.
(12) A new grey-level image G_2(x, y) is obtained using a global threshold T_g as shown in equation (1):
(13) G_2(x, y) = G_1(x, y) if G_1(x, y) ≥ T_g, and G_2(x, y) = 0 otherwise  (1)
Following this step, Bernsen local thresholding is applied on the new grey-level image G_2(x, y) to get a first binary image B_1(x, y) as denoted by equations (2) and (3):
(14) T(x, y) = (g_low(x, y) + g_high(x, y)) / 2  (2)
B_1(x, y) = 1 if G_2(x, y) > T(x, y), and B_1(x, y) = 0 otherwise  (3)
wherein g_low(x, y) and g_high(x, y) are the minimum and maximum grey-level values within a rectangular M×N window centred at the point (x, y). An example of the binary image B_1(x, y) obtained by global and local thresholding is shown in the accompanying drawings.
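The two thresholding stages above can be sketched as follows. This is a minimal Python/NumPy illustration rather than the patent's MATLAB implementation; the function names (`global_threshold`, `bernsen_binarize`), the toy image, and the exact behaviour of equation (1) (keeping grey levels at or above the global threshold and zeroing the rest) are assumptions of the sketch.

```python
import numpy as np
from scipy.ndimage import minimum_filter, maximum_filter

def global_threshold(G1, Tg=0.1):
    """Equation (1), as assumed here: keep grey levels at or above Tg, zero the rest."""
    return np.where(G1 >= Tg, G1, 0.0)

def bernsen_binarize(G2, win=(71, 71)):
    """Equations (2)-(3): Bernsen local thresholding with an M x N window.

    The local threshold at each pixel is the mean of the minimum and maximum
    grey levels inside the window centred at that pixel.
    """
    g_low = minimum_filter(G2, size=win)
    g_high = maximum_filter(G2, size=win)
    T_local = (g_low + g_high) / 2.0
    return (G2 > T_local).astype(np.uint8)

# Toy spectrogram image: a bright horizontal ridge over a dim background.
G1 = np.full((32, 32), 0.05)
G1[16, :] = 0.9
B1 = bernsen_binarize(global_threshold(G1, Tg=0.1), win=(7, 7))
```

With the dim background below the global threshold, only the ridge row survives both stages, leaving a one-pixel-thick horizontal line in B1.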
(15) On applying the Hough transform on the binary image B_1(x, y), a plurality of horizontal lines was generated inside many of the vocal segments. In order to overcome this problem, a representation that emphasizes the horizontal nature of pitched-instrument harmonics was required. For this purpose, B_1 is used as a mask applied on the magnitude spectrogram S to generate a new magnitude spectrogram S_1.
S_1 = B_1 ⊙ S  (4)
wherein ⊙ represents element-wise multiplication. Following this step, the matrix S_1 is represented as a row of J column vectors representing the spectra of all J time frames. The same representation is assumed for the final binary image B_2.
S_1 = [s_1, s_2, . . . , s_j, . . . , s_J]  (5)
B_2 = [b_1, b_2, . . . , b_j, . . . , b_J]  (6)
Peaks of the magnitude spectrum for each column s_j are then calculated using the findpeaks function of MATLAB. Each of these peaks sets a value of 1 at the corresponding location in the column vector b_j of the new binary image B_2, while all other values are set to 0.
(16) However, a ridge that is only one pixel wide may not be detected reliably by the Hough transform; therefore, for each peak, the neighbouring frequency bin with the larger magnitude is also set to 1, as detailed in the following algorithm.
(17) Input: The spectrogram S_1 with I rows (frequency bins) and J columns (time frames)
(18) Output: The final binary image B_2
(19) B_2 ← all-zeros I×J matrix
for each column j ∈ {1 . . . J}
(20) f = locations of all K peaks in s_j
    for each location f_k
        b_j(f_k) = 1
        if s_j(f_k + 1) > s_j(f_k − 1)
            b_j(f_k + 1) = 1
        else
            b_j(f_k − 1) = 1
        end if
(21)    end for
(22) end for
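The peak-extraction algorithm above can be sketched in Python, with `scipy.signal.find_peaks` standing in for MATLAB's `findpeaks`; the function name `peaks_binary_image` and the toy spectrogram are assumptions of this sketch.

```python
import numpy as np
from scipy.signal import find_peaks

def peaks_binary_image(S1, min_distance=1):
    """Build B2 from S1: mark each per-frame spectral peak and its larger neighbour.

    For every time frame (column) s_j, each peak location f_k sets b_j(f_k) = 1,
    and the neighbouring bin with the larger magnitude is also set to 1,
    yielding two-pixel-wide ridges as in the algorithm above.
    """
    I, J = S1.shape
    B2 = np.zeros((I, J), dtype=np.uint8)
    for j in range(J):
        s_j = S1[:, j]
        locs, _ = find_peaks(s_j, distance=min_distance)
        for f_k in locs:
            B2[f_k, j] = 1
            if f_k + 1 < I and s_j[f_k + 1] > s_j[f_k - 1]:
                B2[f_k + 1, j] = 1
            else:
                B2[f_k - 1, j] = 1
    return B2

# Toy column with a single peak at bin 5 whose upper neighbour is the larger one.
S1 = np.zeros((10, 1))
S1[4:7, 0] = [0.2, 1.0, 0.4]
B2 = peaks_binary_image(S1)
```

Here the peak at bin 5 and its larger neighbour at bin 6 are the only pixels set in B2.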
(23) Following this step, locations of pitched instruments harmonics that appear as horizontal ridges in the mixture magnitude spectrogram are identified. This process is conducted in two steps. Initially, Hough transform is applied on the binary image B.sub.2 generated from the mixture magnitude spectrogram S to obtain the plurality of horizontal lines. Subsequent to this, variable frequency bands of these horizontal ridges are calculated using the lowest point between neighboring horizontal ridges, resulting in Hough transform regions.
(24) The Hough transform is based on the fact that a line in the Cartesian coordinate system (image space) can be mapped onto a point in the rho-theta space (Hough space) using the parametric representation of a line; hence, a point in the Hough space represents a line in the image space.
ρ = x cos θ + y sin θ  (7)
Conversely, if ρ and θ are treated as the variables in the equation above, then each pixel (x, y) in the image is represented by a sinusoidal curve in the ρ-θ space. In order to find the values of ρ and θ corresponding to a specific line in the image (the x-y plane), equation (7) is used to draw the sinusoidal curve for each point on the line. Hence, given a binary image that consists of one line, if the sinusoidal curve for every non-zero point in the image is graphed, then the actual ρ and θ coordinates of the line will be reinforced by all graphed sinusoidal curves on the ρ-θ plane. This is a single Hough peak.
(25) An image with multiple lines will generate multiple peaks in Hough space. In an embodiment of the present invention, in order to obtain the horizontal lines from the binary image B_2, the hough function in MATLAB is used to construct the Hough space, followed by the houghpeaks function to generate the peaks in the Hough space. Further, line segments are extracted using the houghlines function, and only horizontal lines with a certain minimum length are maintained. The result is a set of Q horizontal lines wherein each line l^q is defined by the left and right points (x_1, y_0) and (x_2, y_0) respectively.
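Because a horizontal line y = y_0 maps to the single Hough-space point (ρ, θ) = (y_0, 90°), Hough voting restricted to purely horizontal lines reduces to counting non-zero pixels per row. The following Python sketch exploits that simplification; unlike MATLAB's hough/houghpeaks/houghlines chain it is not gap-aware, and the function name and toy image are assumptions.

```python
import numpy as np

def horizontal_hough_lines(B2, min_length=10):
    """Minimal Hough-style detection of horizontal lines in a binary image.

    With rho = x*cos(theta) + y*sin(theta), a horizontal line y = y0 receives
    all of its votes at (rho, theta) = (y0, 90 deg), so the accumulator value
    for that cell equals the number of non-zero pixels in row y0. Each detected
    row is reported with its leftmost and rightmost non-zero column as (x1, x2, y0).
    """
    lines = []
    for y0, row in enumerate(B2):
        xs = np.flatnonzero(row)
        if xs.size >= min_length:  # analogous to keeping only long Hough lines
            lines.append((int(xs[0]), int(xs[-1]), y0))
    return lines

# Toy binary image: one long horizontal ridge and one short blob.
B2 = np.zeros((20, 40), dtype=np.uint8)
B2[8, 5:30] = 1      # long ridge, kept
B2[14, 10:13] = 1    # too short, ignored
lines = horizontal_hough_lines(B2, min_length=10)
```

Note this counts total pixels per row rather than the longest contiguous run, so a broken ridge whose fragments sum past the threshold would also be reported; the MATLAB chain handles gaps and endpoints more carefully.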
(26) The next step involves estimation of the variable frequency bands. The variable frequency band of each horizontal ridge represented by a Hough line is estimated using the y-coordinate of the point that has the lowest magnitude spectrum value between two adjacent ridges. The following algorithm details how the lower frequency y_1 and the upper frequency y_2 are obtained for each line (denoted by l for simplicity), i.e., how the frequency band of a horizontal ridge represented by a horizontal line is estimated.
(27)
Inputs: The magnitude spectrogram S and a single Hough line l defined by {x_1, x_2, y_0}
Output: The line frequency band {y_1, y_2}
1. Calculate x_0 = (x_1 + x_2)/2
2. Starting from (x_0, y_0), decrease y gradually in search of (x_0, y_1) such that:
   i.  S(x_0, y − 1) ≤ S(x_0, y) for all y ∈ (y_1, y_0]
   ii. S(x_0, y_1 − 1) > S(x_0, y_1)
3. Similarly, starting from (x_0, y_0), increase y gradually in search of (x_0, y_2) such that:
   i.  S(x_0, y + 1) ≤ S(x_0, y) for all y ∈ [y_0, y_2)
   ii. S(x_0, y_2 + 1) > S(x_0, y_2)
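A minimal Python rendering of the band-estimation algorithm above; the function name `line_frequency_band` and the toy spectrogram are assumptions of this sketch.

```python
import numpy as np

def line_frequency_band(S, x1, x2, y0):
    """Estimate the frequency band {y1, y2} of a horizontal ridge.

    Starting from the line's midpoint column x0, walk down (and up) in frequency
    while the magnitude keeps decreasing; the last bin before the magnitude
    rises again is the band edge, i.e. the lowest point between adjacent ridges.
    """
    x0 = (x1 + x2) // 2
    col = S[:, x0]
    y1 = y0
    while y1 - 1 >= 0 and col[y1 - 1] <= col[y1]:
        y1 -= 1                      # step 2 of the algorithm
    y2 = y0
    while y2 + 1 < len(col) and col[y2 + 1] <= col[y2]:
        y2 += 1                      # step 3 of the algorithm
    return y1, y2

# Toy spectrogram: ridge at bin 5, valleys at bins 3 and 7, neighbouring ridges beyond.
S = np.tile(np.array([0.5, 0.4, 0.3, 0.1, 0.6, 1.0, 0.5, 0.2, 0.7, 0.9])[:, None], (1, 8))
y1, y2 = line_frequency_band(S, x1=0, x2=7, y0=5)
```

The walk stops at the local minima on either side of the ridge, so the band spans the valley-to-valley extent of the harmonic.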
(28) Following this step is the adaptive median filtering technique. Up to this point, a rectangular region r^q = {x_1^q, x_2^q, y_1^q, y_2^q} has been calculated around each horizontal line l^q, representing the q-th harmonic segment that presumably belongs to a pitched instrument in the mixture spectrogram. It is now required to remove these regions from the vocals separated by the reference separation algorithm to refine them further from the pitched instruments. Initially, the complex spectrogram 𝒮_v of the separated vocal signal s_v is calculated using the same window size and overlap ratio that were used to calculate the mixture complex spectrogram 𝒮. In order to remove the Hough regions from the magnitude spectrogram S_v, an adaptive median filtering technique is used, as depicted in the accompanying drawings.
(29)
H^q = MD_h{S_v, r^q, d_h}  (8)
V^q = MD_v{S_v, r^q, d_v^q}  (9)
wherein MD_h is the horizontal median filter with a fixed length d_h, applied to each frequency slice in the region r^q of the magnitude spectrogram S_v, and MD_v is the vertical median filter with an adaptive length d_v^q applied to each time frame in the region r^q. In order to ensure complete removal of the rectangular region from the separated voice, d_h was set to 0.1 sec. On the other hand, d_v^q changes according to the bandwidth of the rectangular region and is calculated as
d_v^q = y_2^q − y_1^q  (10)
The pitched-instrument-enhanced spectrogram H is formed as an all-zeros I×J matrix except at the Hough regions r^q, where it equals H^q. On the other hand, the vocals-enhanced spectrogram V is an all-ones I×J matrix except at the Hough regions r^q, where it equals V^q.
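The two median filters of equations (8) and (9) can be sketched with `scipy.ndimage.median_filter`. The function name `region_median_filters`, the region layout, and the toy spectrogram are assumptions of this sketch, and the filter lengths here are expressed in bins rather than seconds.

```python
import numpy as np
from scipy.ndimage import median_filter

def region_median_filters(Sv, region, d_h):
    """Equations (8)-(10): horizontal and vertical median filtering of a Hough region.

    MD_h smooths each frequency slice with a fixed-length horizontal median,
    capturing the sustained pitched-instrument harmonic, while MD_v smooths each
    time frame with a vertical median whose length d_v = y2 - y1 adapts to the
    region bandwidth, capturing the broadband vocal energy.
    """
    x1, x2, y1, y2 = region
    patch = Sv[y1:y2 + 1, x1:x2 + 1]
    d_v = max(y2 - y1, 1)                       # equation (10), at least 1 bin
    Hq = median_filter(patch, size=(1, d_h))    # equation (8): along time
    Vq = median_filter(patch, size=(d_v, 1))    # equation (9): along frequency
    return Hq, Vq

# Toy vocal spectrogram: a steady horizontal harmonic plus one vertical vocal burst.
Sv = np.full((12, 30), 0.1)
Sv[6, :] = 1.0          # pitched-instrument harmonic
Sv[:, 15] = 0.8         # vocal event crossing the region
Hq, Vq = region_median_filters(Sv, region=(0, 29, 4, 8), d_h=9)
```

In Hq the sustained harmonic row survives while the brief vocal burst is suppressed; in Vq the vocal column survives while the harmonic is suppressed, which is exactly the split the Wiener masks exploit next.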
(30) Secondly, Wiener filter masks M_H and M_V are generated from H and V as denoted in equations (11) and (12), wherein the square operation is applied element-wise:
(31) M_H = H² / (H² + V²)  (11)
M_V = V² / (H² + V²)  (12)
These generated Wiener filter masks are then multiplied (element-wise) by the original complex spectrogram 𝒮_v of the separated vocals to produce the complex spectrograms of the pitched instruments and the voice, Ĥ and V̂ respectively, as given in equations (13) and (14):
Ĥ = 𝒮_v ⊙ M_H  (13)
V̂ = 𝒮_v ⊙ M_V  (14)
These complex spectrograms Ĥ and V̂ are then inverted back to the time domain to yield the separated pitched-instrument harmonics and the new vocal waveforms h and v respectively. The former is added to the music signal s_m separated by the reference algorithm to form the new separated music signal m:
m = s_m + h  (15)
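The Wiener masking of equations (11) through (14) can be sketched as follows, assuming the standard Wiener mask form M_H = H²/(H² + V²). A small constant is added to the denominator to avoid division by zero outside active regions, and the final inverse STFT back to waveforms (e.g. via scipy.signal.istft with the analysis window parameters) is omitted; the function name and toy values are assumptions.

```python
import numpy as np

def wiener_split(Sv_complex, H, V):
    """Equations (11)-(14): split the complex vocal spectrogram with Wiener masks.

    The squared magnitudes of the enhanced spectrograms are compared element-wise;
    the instrument mask M_H and vocal mask M_V then divide the complex spectrogram
    into pitched-instrument and refined-vocal parts.
    """
    eps = 1e-12                       # guard against 0/0 where H = V = 0
    M_H = H**2 / (H**2 + V**2 + eps)  # equation (11)
    M_V = V**2 / (H**2 + V**2 + eps)  # equation (12)
    H_hat = Sv_complex * M_H          # equation (13)
    V_hat = Sv_complex * M_V          # equation (14)
    return H_hat, V_hat

# Toy 1x2 spectrogram: first bin dominated by an instrument, second by the voice.
Sv_complex = np.array([[1.0 + 1.0j, 2.0 + 0.0j]])
H = np.array([[3.0, 0.0]])
V = np.array([[1.0, 5.0]])
H_hat, V_hat = wiener_split(Sv_complex, H, V)
```

Because the two masks sum to one at every bin, the split is energy-conserving: H_hat + V_hat reconstructs the original complex spectrogram (up to eps).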
(32) In order to demonstrate the effect of using the system in accordance with the present invention, the diagonal median filtering algorithm was used as the reference separation algorithm, along with a song clip from the MIR-1K data set.
(34) The MIR-1K dataset was used to evaluate the effectiveness of the proposed system. The voice and music signals were linearly mixed with equal energy to generate the mixture signal. The mixture signal and the vocals separated from the reference separation algorithm were converted to spectrograms with a window size of 2048 samples and 25% overlap. In order to obtain the binary image, the spectrogram image is divided into smaller overlapping regions. Each region has a time span of 1 sec and a frequency span of 400 Hz. The overlap between regions was 20% along both the time and frequency axes. For each region, the first binary image was calculated using a global threshold of T_g=0.1. The second binary image was calculated with Bernsen local thresholding using a rectangular neighborhood of 71×71 pixels. The third binary image was calculated from peaks per frame, where the minimum peak-to-peak distance was 20 Hz. The final binary image was built from the binary images of the overlapping regions using the OR operator.
(35) Hough lines are calculated from small overlapping regions as well. Each region had a time span of 1 sec and a frequency span of 400 Hz with 20% overlap. Hough horizontal lines were calculated only for frequencies above 825 Hz, since below this frequency the vocal formants in many cases had long horizontal parts that resemble pitched-instrument harmonics and were thus mistakenly classified as pitched instruments. For each region, the number of Hough peaks was 40, and only Hough lines with a minimum length of 10 pixels (0.16 sec.) were considered. Overlapping Hough lines from different regions were combined together before being used to generate Hough regions.
(36) On an experimental basis, the diagonal median filtering algorithm with all its parameters was initially used as the reference separation algorithm.
(37) Additionally, the global normalized source-to-distortion ratio (GNSDR) was used to measure the overall quality of the voice and music separated by the different reference algorithms before and after using the present system. The GNSDR is a common measure used in many separation algorithms. It is defined as the average of the NSDR of all clips weighted by their lengths w_n:
(38) GNSDR = ( Σ_{n=1}^{N} w_n NSDR(ŝ_n, x_n, s_n) ) / ( Σ_{n=1}^{N} w_n )
wherein ŝ, x, and s denote the estimated source, the input mixture, and the target source, respectively. The normalized source-to-distortion ratio (NSDR) is the improvement of the SDR between the mixture x and the estimated source ŝ:
NSDR(ŝ, x, s) = SDR(ŝ, s) − SDR(x, s)  (27)
and wherein SDR is the standard source-to-distortion ratio calculated for each estimated source against its target source.
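The NSDR and GNSDR measures can be sketched in Python. Since the exact SDR formula is not reproduced here, a simple reference-to-residual energy ratio is used as a stand-in assumption (the literature typically uses the BSS-Eval SDR); the function names and the toy signals are likewise assumptions of this sketch.

```python
import numpy as np

def sdr(est, ref):
    """Stand-in source-to-distortion ratio: energy of the reference over the
    energy of the residual. This is an assumption for illustration only."""
    err = ref - est
    return 10 * np.log10(np.sum(ref**2) / np.sum(err**2))

def nsdr(est, mix, ref):
    """Equation (27): NSDR is the SDR improvement over using the raw mixture."""
    return sdr(est, ref) - sdr(mix, ref)

def gnsdr(clips):
    """GNSDR: average NSDR over all clips weighted by their lengths w_n."""
    weights = np.array([len(ref) for est, mix, ref in clips], dtype=float)
    nsdrs = np.array([nsdr(est, mix, ref) for est, mix, ref in clips])
    return np.sum(weights * nsdrs) / np.sum(weights)

# Toy example: the "estimate" is closer to the reference than the mixture is.
t = np.linspace(0, 1, 1000)
ref = np.sin(2 * np.pi * 5 * t)
mix = ref + 0.5 * np.sin(2 * np.pi * 50 * t)
est = ref + 0.1 * np.sin(2 * np.pi * 50 * t)
score = gnsdr([(est, mix, ref)])
```

Because the residual amplitude drops from 0.5 to 0.1, the reference terms cancel and the NSDR equals 10 log10(25) ≈ 13.98 dB regardless of the chosen SDR normalization.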
(40) The table below shows the results for several reference separation algorithms, namely: the diagonal median filtering (DMF) algorithm, the harmonic-percussive with sparsity constraints (HPSC) algorithm, robust principal component analysis (RPCA), adaptive REPET (REPET+), two-stage NMF with local discontinuity (2NMF-LD), and deep recurrent neural networks (DRNN). The following table displays GNSDR improvements for the various reference algorithms:
(41)
  Reference      Voice     Voice     Music     Music
  Algorithm      before    after     before    after
  DMF+H          4.7075    4.9663    4.7293    4.9505
  HPSC+H         4.2036    4.3933    3.9979    4.1631
  RPCA+H         3.4590    3.6732    2.7167    3.1141
  REPET+         2.8485    3.2546    2.3699    3.0282
  2NMF-LD+H      2.2816    2.6146    2.9514    3.4494
  DRNN           6.1940    6.2318    6.2006    6.2679
(42) A high-pass filter with a cut-off frequency of 120 Hz was used as a post-processing step for most separation algorithms, except for adaptive REPET (REPET+), where it did not improve results, and for the deep recurrent neural networks (DRNN), since DRNN is a supervised (trained) approach and does not require a high-pass filter. The clips used in training the DRNN were also removed from the testing dataset. Additionally, since the greatest improvement shown by the first experiment was in the voice SIR, the singing voice global source-to-interference ratio (GSIR), which is the weighted mean of the voice SIR of all clips, was also calculated. The following table displays voice GSIR improvements for the various reference algorithms:
(43)
  Reference      Voice     Voice
  Algorithm      before    after
  DMF+H          10.2083   11.4141
  HPSC+H          7.1059    7.6443
  RPCA+H          8.6360    9.2991
  2NMF-LD+H       7.7299    8.8735
  REPET+          5.2733    6.0682
  DRNN           13.1780   13.6295
(44) Results show that the present system improves the quality of separation for all reference algorithms used, even for the supervised system (DRNN), which is an indication of its wide applicability. Further, the results suggest that the diagonal median filtering approach, when combined with the Hough transform based system, has the best separation quality among all blind or unsupervised separation algorithms.
(45) Many changes, modifications, variations and other uses and applications of the subject invention will become apparent to those skilled in the art after considering this specification and the accompanying drawings, which disclose the preferred embodiments thereof. All such changes, modifications, variations and other uses and applications, which do not depart from the spirit and scope of the invention, are deemed to be covered by the invention, which is to be limited only by the claims which follow.