METHOD AND SYSTEM FOR VOICE SEPARATION BASED ON DEGENERATE UNMIXING ESTIMATION TECHNIQUE
20220139415 · 2022-05-05
Assignee
Inventors
- Xiangru Bi (Shanghai, CN)
- Guoxia Zhang (Shanghai, CN)
- Youye XIE (Shanghai, CN)
- Qingshan ZHANG (Shanghai, CN)
Cpc classification
G10L21/0308
PHYSICS
G10L21/06
PHYSICS
International classification
G10L21/0308
PHYSICS
Abstract
The present disclosure provides a method and a system for voice separation based on the DUET algorithm. The method comprises receiving signals from microphones; performing a Fourier transform on the received signals; calculating a relative attenuation parameter and a relative delay parameter for each data point; selecting a clustering range for the relative delay parameters based on a distance between the microphones and a sampling frequency of the microphones; clustering the data points within the clustering range for the relative delay parameters into subsets; and performing an inverse Fourier transform on each of the subsets. The present disclosure thereby provides an efficient and intelligent solution for deploying DUET in software and/or hardware.
Claims
1. A method for voice separation based on a degenerate unmixing estimation technique (DUET), the method comprising receiving signals from microphones; performing a Fourier transform on the received signals; calculating a relative attenuation parameter and a relative delay parameter for a corresponding data point; selecting a clustering range for the relative delay parameter based on a distance between the microphones and a sampling frequency of the microphones, clustering the data points within the clustering range for the relative delay parameter into subsets, and performing an inverse Fourier transform on each of the subsets.
2. The method of claim 1, wherein the selecting the clustering range for the relative delay parameter is further based on a maximum frequency in a voice.
3. The method of claim 1, further comprising setting the cluster range of the relative attenuation parameter as a constant.
4. The method of claim 1, wherein the clustering range for the relative delay parameter is given by:
5. The method of claim 1, further comprising generating a synchronous sound by a speaker to synchronize the received signals.
6. The method of claim 5, further comprising filtering out the synchronous sound from the received signals.
7. The method of claim 5, wherein the synchronous sound is generated once or periodically.
8. The method of claim 5, wherein the synchronous sound is ultrasonic sound.
9. The method of claim 1, when
10. A system for voice separation based on a degenerate unmixing estimation technique (DUET), the system comprising a sound recording module configured to store signals received from the microphones; a processor configured to perform a Fourier transform on the received signals; calculate a relative attenuation parameter and a relative delay parameter for a corresponding data point; select a clustering range for the relative delay parameter based on a distance between the microphones and a sampling frequency of the microphones, cluster the data points within the clustering range for the relative delay parameter into subsets, and perform an inverse Fourier transform on each of the subsets.
11. The system of claim 10, wherein the processor is further configured to select the clustering range for the relative delay parameter based on a maximum frequency in a voice.
12. The system of claim 10, wherein the processor is further configured to set the clustering range of the relative attenuation parameter as a constant.
13. The system of claim 10, wherein the clustering range for the relative delay parameter is given by:
14. The system of claim 10, further comprising a speaker configured to generate a synchronous signal for synchronizing the signals received from the microphones.
15. The system of claim 14, further comprising a synchronizing and filtering module configured to synchronize the signals received from the microphones with the synchronous signal and to filter out the synchronous signal from the received signals.
16. The system of claim 14, wherein the synchronous sound is generated once or periodically.
17. The system of claim 10, wherein the system is implemented in a head unit of a vehicle.
18. The system of claim 10, when
19. A non-transitory computer-readable storage medium including instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of: performing a Fourier transform on signals received from microphones; calculating a relative attenuation parameter and a relative delay parameter for a corresponding data point; selecting a clustering range for the relative delay parameter based on a distance between the microphones and a sampling frequency of the microphones, clustering the data points within the clustering range for the relative delay parameter into subsets, and performing an inverse Fourier transform on each of the subsets.
20. The computer-readable storage medium of claim 19, wherein the selecting the clustering range for the relative delay parameter is further based on a maximum frequency in a voice.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0019] The disclosure can be better understood with reference to the following drawings and description. The components in the drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the disclosure. Moreover, in the figures, like reference numerals designate corresponding parts throughout the different views.
[0020]
[0021]
[0022]
[0023]
[0024]
[0025]
DETAILED DESCRIPTION OF EMBODIMENTS
[0026] Hereinafter, the preferred embodiment of the present disclosure will be described in more detail with reference to the accompanying drawings. In the following description of the present disclosure, a detailed description of known functions and configurations incorporated herein will be omitted when it may make the subject matter of the present disclosure rather unclear.
[0027] The present disclosure provides a method and a system for voice separation based on DUET.
[0028] As shown in
[0029] The received signals from microphone 1 and microphone 2 are inputted in the DUET module (not shown in
[0030] First, a Fourier transform (e.g., a short-time Fourier transform or a windowed Fourier transform) is performed on the received signals to produce a large number of time-frequency data points (step S110).
[0031] In order to partition the time-frequency data points, a relative delay parameter and a relative attenuation parameter are calculated for each data point, where the relative delay parameter is related to the difference between the arrival times from a source to the two microphones, and the relative attenuation parameter corresponds to the ratio of the attenuations of the paths between a source and the two microphones (step S120). The relative delay-attenuation pair corresponding to one source should differ from that corresponding to another source, and thus the time-frequency points may be partitioned according to their relative delay-attenuation pairs. That is to say, the data points within the clustering ranges of the relative attenuation and relative delay parameters may be clustered into several subsets (step S130). Finally, the inverse Fourier transform (e.g., the inverse short-time Fourier transform) may be performed on each subset to output the separated signals corresponding to the different sources (step S140).
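By way of a non-limiting illustration, the calculation of step S120 for a single time-frequency data point may be sketched in Python as follows. The sketch assumes the common DUET mixing model in which the coefficient at the second microphone equals the coefficient at the first microphone multiplied by a·e.sup.−iωδ. The function name `duet_parameters` and the symmetric attenuation form a − 1/a are assumptions that are not stated in the present disclosure.

```python
import cmath
import math

def duet_parameters(x1, x2, omega):
    """Return (symmetric attenuation, relative delay in seconds) for one
    time-frequency point.

    x1, x2 : complex transform coefficients of microphone 1 and microphone 2
    omega  : angular frequency of the point in rad/s (must be non-zero)
    The caller is expected to skip points where x1 is zero.
    """
    ratio = x2 / x1                      # assumed model: a * exp(-i*omega*delta)
    a = abs(ratio)                       # ratio of the path attenuations
    alpha = a - 1.0 / a                  # symmetric attenuation (assumed form)
    delta = -cmath.phase(ratio) / omega  # unique only while |omega*delta| <= pi
    return alpha, delta

# Synthetic point: the source reaches microphone 2 attenuated by 0.9
# and 0.1 ms later than microphone 1, at a frequency of 500 Hz.
omega = 2 * math.pi * 500.0
x1 = complex(0.3, -0.4)
x2 = 0.9 * cmath.exp(-1j * omega * 1e-4) * x1
alpha, delta = duet_parameters(x1, x2, omega)
```

With the symmetric form, equal attenuation on both paths maps to zero, which is consistent with a constant attenuation range that is centered on zero.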
[0032] The clustering ranges for the relative attenuation and relative delay parameters are selected intelligently in step S120.
[0033] Since the relative attenuation is normally small given the small relative delay required by DUET, the range of the relative attenuation may simply be set as a constant, e.g., [−0.7, 0.7] or [−1.0, 1.0]. If the two microphones are placed close enough to each other (e.g., around 15 centimeters apart), the relative attenuation may be substantially determined by the distance therebetween.
[0034] As to the relative delay, there is a range such that the relative delay can be uniquely determined whenever the signal's true relative delay lies within that range. Such a range is called an effective range in the present disclosure.
[0035] In order to clarify the process of determining the effective range for the relative delay, the following parameters are defined:
[0036] f.sub.s (unit: Hz): sampling frequency of the microphones;
[0037] f (unit: Hz): frequency of the continuous voice signal;
[0038] f.sub.MAX (unit: Hz): the maximum frequency in the voice;
[0039] ω (unit: rad/s): frequency of the continuous voice signal (ω=2πf);
[0040] δ (unit: second): relative delay between signals received by two microphones;
[0041] n (unit: sampling point): relative delay between signals received by two microphones in terms of sampling points;
[0042] d (unit: meter): microphones separation distance;
[0043] c (unit: m/s): speed of the sound.
[0044] If the voice is human speech, f is the frequency of the continuous speech signal; f.sub.MAX is the maximum frequency in the speech; and ω is the frequency of the continuous speech signal with the unit rad/s.
[0045] The relative delay is represented by the phase factor e.sup.−iωδ, which has the property that e.sup.−iωδ=e.sup.−i(ωδ+2π). Therefore, ωδ can be uniquely determined only when |ωδ|≤π. If |ωδ|>π, a wrong delay would be returned; this phenomenon is called the phase wrap effect.
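The phase wrap effect may be illustrated numerically. The following Python sketch recovers a delay from the phase factor alone, once with |ωδ| below π and once above it. The function name `delay_from_phase` and the chosen frequency and delays are illustrative assumptions.

```python
import cmath
import math

def delay_from_phase(f_hz, true_delay_s):
    """Recover a delay (seconds) using only the phase factor exp(-i*omega*delta)."""
    omega = 2 * math.pi * f_hz
    phase_factor = cmath.exp(-1j * omega * true_delay_s)
    # cmath.phase returns a value in [-pi, pi], so any |omega*delta| > pi
    # is folded back into that interval before the division.
    return -cmath.phase(phase_factor) / omega

inside = delay_from_phase(1000.0, 0.0002)   # |omega*delta| = 0.4*pi, no wrap
outside = delay_from_phase(1000.0, 0.0007)  # |omega*delta| = 1.4*pi, wrapped
```

In the second case the recovered value differs from the true delay by one full period 1/f, i.e., 0.0007 s is returned as approximately −0.0003 s.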
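[0046] It is assumed that the microphones are synchronized. Then, since ω=2πf, the condition |ωδ|≤π gives the effective range of the relative delay for a signal with frequency f as
$$-\frac{1}{2f} \le \delta \le \frac{1}{2f}.$$
[0047] And the intersection of the effective ranges of all frequencies in the speech is
$$-\frac{1}{2 f_{MAX}} \le \delta \le \frac{1}{2 f_{MAX}}.$$
[0048] When the continuous signals are discretized with the sampling frequency f.sub.s, the effective range in terms of sampling points becomes
$$-\frac{f_s}{2 f_{MAX}} \le n \le \frac{f_s}{2 f_{MAX}}.$$
[0049] Thus, if the relative delay of the speech from any direction with maximum frequency f.sub.MAX is to lie inside the effective range, the largest relative delay d/c must not exceed 1/(2f.sub.MAX), and a critical point of d is determined as follows:
$$d = \frac{c}{2 f_{MAX}}.$$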
[0050] The maximum frequency f.sub.MAX may be determined by measurement or may be preset based on the frequency range of the sound of interest.
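[0051] When
$$d \le \frac{c}{2 f_{MAX}},$$
the effective range is at least as large as the largest relative delay between those two microphones, which provides
$$\frac{f_s d}{c} \le \frac{f_s}{2 f_{MAX}}.$$
[0052] When
$$d \le \frac{c}{2 f_{MAX}},$$
the relative delay of a signal from any direction satisfies
$$|n| \le \frac{f_s d}{c}.$$
[0053] Therefore, when
$$d \le \frac{c}{2 f_{MAX}},$$
the selected range is
$$\left[-\frac{f_s d}{c},\ \frac{f_s d}{c}\right].$$
Within the range, there is no phase wrap effect, and no signal of interest would lie outside this range for the synchronized microphones. That is to say, if d is small enough, the selected range of the relative delay for the synchronized microphones is bounded by the largest physically possible relative delay, f.sub.s d/c sampling points.
[0054] When
$$d > \frac{c}{2 f_{MAX}},$$
the effective range is smaller than the largest relative delay between those two microphones:
$$\frac{f_s}{2 f_{MAX}} < \frac{f_s d}{c}.$$
[0055] In this case, the selected range for the relative delay is
$$\left[-\frac{f_s}{2 f_{MAX}},\ \frac{f_s}{2 f_{MAX}}\right].$$
There is no phase wrap effect when the true relative delay lies within this range. Since the effective range is smaller than the largest relative delay between those two microphones, it is possible that there is a signal whose relative delay lies outside the effective range.
If so, the phase wrap effect would occur and its relative delay may spread across the axis (see
[0056] Therefore, the clustering range for the relative delay parameters for the synchronized microphones in terms of the sampling point is given by:
$$\text{range} = \begin{cases} \left[-\dfrac{f_s d}{c},\ \dfrac{f_s d}{c}\right], & d \le \dfrac{c}{2 f_{MAX}} \\[2ex] \left[-\dfrac{f_s}{2 f_{MAX}},\ \dfrac{f_s}{2 f_{MAX}}\right], & d > \dfrac{c}{2 f_{MAX}} \end{cases}$$
[0057] For non-synchronized microphones, the selected range would be
$$\text{range} = \begin{cases} \left[-\dfrac{f_s d}{c} - n_0,\ \dfrac{f_s d}{c} + n_0\right], & d \le \dfrac{c}{2 f_{MAX}} \\[2ex] \left[-\dfrac{f_s}{2 f_{MAX}} - n_0,\ \dfrac{f_s}{2 f_{MAX}} + n_0\right], & d > \dfrac{c}{2 f_{MAX}} \end{cases}$$
where n.sub.0 is the measured largest synchronization error of the system in terms of the sampling points.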
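The selection of the clustering range for the relative delay may be sketched in Python as follows. This is a minimal sketch under stated assumptions: the half-width of the range is taken as f.sub.s d/c sampling points when d does not exceed c/(2f.sub.MAX) and as f.sub.s/(2f.sub.MAX) otherwise, and the range is widened by n.sub.0 on each side for non-synchronized microphones. The function name `delay_clustering_range` and the default speed of sound of 343 m/s (air at roughly 20 degrees Celsius) are assumptions.

```python
def delay_clustering_range(d, fs, f_max, c=343.0, n0=0.0):
    """Return (low, high) bounds, in sampling points, of the clustering
    range for the relative delay.

    d     : microphone separation distance in meters
    fs    : sampling frequency of the microphones in Hz
    f_max : maximum frequency in the voice in Hz
    c     : speed of sound in m/s (assumed default)
    n0    : largest synchronization error in sampling points (0 if synchronized)
    """
    if d <= 0 or fs <= 0 or f_max <= 0 or c <= 0 or n0 < 0:
        raise ValueError("d, fs, f_max, c must be positive and n0 non-negative")
    d_critical = c / (2.0 * f_max)   # spacing at which d/c equals 1/(2*f_max)
    if d <= d_critical:
        half_width = fs * d / c      # largest physically possible delay
    else:
        half_width = fs / (2.0 * f_max)  # limit set by the phase wrap effect
    half_width += n0                 # allowance for synchronization error
    return -half_width, half_width

# Example: 15 cm spacing, 44.1 kHz sampling, 1100 Hz maximum voice frequency.
low, high = delay_clustering_range(0.15, 44100.0, 1100.0)
```

Under these assumed values the critical spacing is about 0.156 m, so a spacing of 0.15 m falls in the first case and the range is about ±19.3 sampling points.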
[0058]
[0059] As shown in
[0060] If the relative delay of the speech marked by the cross is moved beyond the clustering range (for example, the person corresponding to the subset marked by the cross walks away), the phase wrap effect would occur as shown in
[0061] The method in the aforesaid embodiments of the present disclosure may realize the voice separation. The method may select a clustering range automatically based on the system settings. During the voice separation, there is either no phase wrap effect or the phase wrap effect is negligible, and any data points outside the range may be excluded from the clustering. This ensures the recovery and accuracy of the voice separation and makes the computation more efficient.
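For illustration only, the clustering of the data points within the clustering ranges may be sketched in Python as follows. The present disclosure does not name a particular clustering algorithm. The sketch assumes that one center per source is already known (for example, from peaks of a histogram of the parameter pairs) and assigns each in-range point to the nearest center. The function name `cluster_points` and the range-normalized distance are assumptions.

```python
def cluster_points(points, centers, alpha_range, delay_range):
    """Group time-frequency points by nearest center.

    points      : list of (attenuation, delay_in_samples) pairs
    centers     : list of (attenuation, delay_in_samples) pairs, one per source
    alpha_range : (low, high) clustering range of the relative attenuation
    delay_range : (low, high) clustering range of the relative delay
    Returns one list of point indices per center. Points outside either
    range are not assigned to any subset.
    """
    a_lo, a_hi = alpha_range
    n_lo, n_hi = delay_range
    a_span = a_hi - a_lo
    n_span = n_hi - n_lo
    subsets = [[] for _ in centers]
    for idx, (alpha, n) in enumerate(points):
        if not (a_lo <= alpha <= a_hi and n_lo <= n <= n_hi):
            continue  # outside the clustering range
        # Squared distance with each axis scaled by the width of its range.
        dists = [((alpha - ca) / a_span) ** 2 + ((n - cn) / n_span) ** 2
                 for ca, cn in centers]
        subsets[dists.index(min(dists))].append(idx)
    return subsets

points = [(0.1, 5.0), (-0.2, -8.0), (0.15, 4.0), (0.0, 40.0)]
centers = [(0.1, 5.0), (-0.2, -8.0)]
subsets = cluster_points(points, centers, (-0.7, 0.7), (-19.3, 19.3))
```

Each returned subset would then select the time-frequency points passed to the inverse Fourier transform for the corresponding source.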
[0062]
[0063] One or more of microphones 318 may be considered as a part of the system 300 or may be considered as being separate from the system 300. The number of microphones as shown in
[0064] The system includes a DUET module 312 for performing the voice separation and a memory 314 for recording the signals received from the microphones. The DUET module 312 may be implemented by hardware, software, or any combination thereof, such as a software program executed by a processor. If the system 300 is included in a vehicle, the DUET module 312 or even the system 300 may be realized by, or be a part of, the head unit of the vehicle.
[0065] The DUET module 312 may perform the processes in the dotted block as shown in
[0066] The system does not require manual adjustment of the clustering range, and may be implemented with relatively low cost and relatively low complexity. In addition, the system may be adapted to various scenarios, such as a vehicle cabin, an office, a home, a shopping mall, a kiosk, a station, etc.
[0067] For illustrative purposes, the embodiment is described by taking a vehicle as an example hereinafter.
[0068] As shown in
[0069] In the present embodiment, the maximum frequency in the speech f.sub.MAX is set to 1100 Hz since the human voice frequency is usually within 85 to 1100 Hz. The speed of sound c may be determined based on the ambient temperature and humidity. The sampling frequency of the microphones f.sub.s is known, such as 32 kHz, 44.1 kHz, etc. The largest synchronization error of the microphones in terms of sampling points, n.sub.0, may be measured automatically. After the time synchronization of the microphones, the largest synchronization error n.sub.0 may be very small or even equal to zero (see the embodiment with reference to
[0070] As shown in
[0071] In order to reduce or even remove the synchronization error of the microphones, the two microphones are controlled to start recording at the same time. However, the software instructions to open the microphones may not be executed simultaneously, and the system time is accurate only to the millisecond level, which is far coarser than the sampling interval of the microphones. The present disclosure provides a new system to achieve time synchronization of the microphones, which is illustratively shown in
[0072]
[0073] The system 500 further includes a speaker 505 to generate a synchronous sound under the control of the synchronous sound generating module 507. The synchronous sound may be a trigger synchronous sound, which is emitted once after the microphones start recording the sound. Alternatively, the synchronous sound may be a periodic synchronous sound. In addition, the synchronous sound may be inaudible to humans, such as an ultrasonic sound. The synchronous sound may be an impulse signal to facilitate identification. The speaker 505 may be placed at a point on the line which is perpendicular to the line between microphone 1 and microphone 2 and passes through the midpoint between those two microphones, so that the speaker is equidistant from those two microphones.
[0074] The mixtures received from the microphones may include the synchronous sound, speech 1 and speech 2, and are stored in the sound recording module 509. The sound synchronizing and filtering module 511 detects the synchronous signal in the mixtures so as to synchronize the two mixtures. Then, the sound synchronizing and filtering module 511 removes the synchronous sound from the two mixtures. The synchronous sound may be removed by a filter or an appropriate algorithm.
[0075] According to the present embodiment, time synchronization may achieve microsecond-level accuracy. For example, if the recording frequency is 44.1 kHz, the accuracy of time synchronization may be less than ten microseconds.
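The synchronization of the two mixtures may be sketched in Python as follows. This is a minimal sketch, assuming that the synchronous sound is an impulse that forms the largest-magnitude sample in each recording and that the speaker is equidistant from the two microphones, so that the impulse reaches both at the same instant. The function name `align_on_sync` is an assumption, and the removal of the synchronous sound is not shown.

```python
def align_on_sync(rec1, rec2):
    """Trim two recordings so that the synchronous impulse falls on the
    same sample index in both.

    rec1, rec2 : sequences of samples from microphone 1 and microphone 2
    Returns (aligned1, aligned2, offset), where offset is the number of
    samples by which the impulse appears later in rec2 than in rec1.
    """
    # Locate the impulse as the sample with the largest magnitude (assumption).
    i1 = max(range(len(rec1)), key=lambda i: abs(rec1[i]))
    i2 = max(range(len(rec2)), key=lambda i: abs(rec2[i]))
    lead = min(i1, i2)            # samples kept before the impulse
    a = rec1[i1 - lead:]
    b = rec2[i2 - lead:]
    n = min(len(a), len(b))       # keep only the overlapping part
    return a[:n], b[:n], i2 - i1
```

For example, if microphone 2 starts recording two samples earlier than microphone 1, the impulse appears two samples later in its recording, and the first two samples of that recording are dropped.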
[0076] The synchronized signals are inputted into DUET module 513 for voice separation. The DUET module 513 is the same as the DUET module 312 as shown in
[0077]
[0078] As shown in
[0079] The method and the system in the aforesaid embodiments of the present disclosure may realize the synchronization of the microphones, and thus improve the accuracy and the efficiency of the DUET algorithm with relatively low cost.
[0080] It will be understood by persons skilled in the art, that one or more units, processes or sub-processes described in connection with
[0081] With regard to the processes, systems, methods, heuristics, etc., described herein, it should be understood that, although the steps of such processes, etc., have been described as occurring according to a certain ordered sequence, such processes could be practiced with the described steps performed in an order other than the order described herein. It further should be understood that certain steps could be performed simultaneously, that other steps could be added, or that certain steps described herein could be omitted. In other words, the descriptions of processes herein are provided for the purpose of illustrating certain embodiments, and should in no way be construed so as to limit the claims.
[0082] To clarify the use in the pending claims and to hereby provide notice to the public, the phrases “at least one of <A>, <B>, . . . and <N>” or “at least one of <A>, <B>, . . . <N>, or combinations thereof” are defined by the Applicant in the broadest sense, superseding any other implied definitions herebefore or hereinafter unless expressly asserted by the Applicant to the contrary, to mean one or more elements selected from the group comprising A, B, . . . and N, that is to say, any combination of one or more of the elements A, B, . . . or N including any one element alone or in combination with one or more of the other elements which may also include, in combination, additional elements not listed.
[0083] While various embodiments of the disclosure have been described, it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible that are within the scope of the disclosure. Accordingly, the disclosure is not to be restricted except in light of the attached claims and their equivalents.