Audio correction apparatus, and audio correction method thereof
09646625 · 2017-05-09
Assignee
- Samsung Electronics Co., Ltd. (Suwon-Si, Gyeonggi-Do, KR)
- Seoul National University R&DB Foundation (Seoul, KR)
Inventors
- Sang-Bae Chon (Suwon-si, KR)
- Kyo-gu Lee (Seoul, KR)
- Doo-yong Sung (Seoul, KR)
- Hoon Heo (Suwon-si, KR)
- Sun-min Kim (Suwon-si, KR)
- Jeong-su Kim (Yongin-si, KR)
- Sang-mo Son (Suwon-si, KR)
CPC classification
- G10H1/366 (PHYSICS)
- G10H2210/066 (PHYSICS)
- G10H2210/051 (PHYSICS)
- G10H2250/631 (PHYSICS)
International classification
Abstract
An audio correction apparatus and an audio correction method. The audio correction method includes: receiving audio data, which may include a song sung by a user and/or a sound made by a musical instrument; detecting onset information by analyzing harmonic components of the received audio data; detecting pitch information of the received audio data based on the detected onset information; comparing the audio data with reference audio data and aligning the two based on the detected onset information and the detected pitch information; and correcting the aligned audio data to match the reference audio data.
Claims
1. An audio correction method comprising: receiving audio data; cepstral analyzing the received audio data; analyzing harmonic components of the cepstral-analyzed audio data; generating a detection function based on cepstral coefficients of the analyzed harmonic components; detecting onset information in the received audio data based on the generated detection function; detecting pitch information of the received audio data based on the detected onset information; aligning the received audio data with reference audio data based on the detected onset information and the detected pitch information; and correcting the aligned audio data to match the reference audio data.
2. The audio correction method of claim 1, wherein the detecting the onset information comprises: selecting a harmonic component of a current frame using a pitch component of a previous frame; calculating said cepstral coefficients with respect to the harmonic components using the selected harmonic component of the current frame and the harmonic component of the previous frame; generating the detection function by calculating a sum of the calculated cepstral coefficients of the plurality of harmonic components; extracting an onset candidate group by detecting a peak of the generated detection function; and detecting the onset information by removing a plurality of adjacent onsets from the extracted onset candidate group.
3. The audio correction method of claim 2, wherein the calculating the cepstral coefficients comprises: determining whether the previous frame has the harmonic component; in response to the determining yielding that the harmonic component of the previous frame exists, calculating a high cepstral coefficient; and in response to the determining yielding that no harmonic component of the previous frame exists, calculating a low cepstral coefficient.
4. The audio correction method of claim 1, wherein the detecting the pitch information comprises detecting the pitch information between the detected onset components using a correntropy pitch detection method.
5. The audio correction method of claim 1, wherein the aligning the received audio data with the reference audio data comprises: comparing the received audio data with the reference audio data; and aligning the received audio data with the reference audio data using a dynamic time warping method.
6. The audio correction method of claim 5, wherein the aligning the received audio data with the reference audio data comprises: calculating an onset correction ratio and a pitch correction ratio of the received audio data to correspond to the reference audio data.
7. The audio correction method of claim 6, wherein the correcting the aligned audio data to match the reference audio data comprises correcting the aligned audio data based on the calculated onset correction ratio and the pitch correction ratio.
8. The audio correction method of claim 1, wherein the correcting the aligned audio data comprises correcting the aligned audio data by preserving a formant of the received audio data using a synchronized overlap add (SOLA) method.
9. The audio correction method of claim 1, wherein the detecting the onset information further comprises calculating the cepstral coefficients with respect to the analyzed harmonic components using a harmonic component of the previous frame and generating the detection function based on the calculated cepstral coefficients.
10. The audio correction method of claim 9, wherein the detecting the onset information in the received audio data further comprises: extracting an onset candidate group based on the calculated cepstral coefficients; and detecting the onset information by removing a plurality of adjacent onsets from the extracted onset candidate group, wherein the onset comprises one of a point in the received audio data where a musical note starts and a point where a vowel starts in a song, and wherein the onset information comprises at least one onset in a current audio frame.
11. An audio correction apparatus comprising: an inputter configured to receive audio data; an onset detector configured to detect onset information in the received audio data by analyzing harmonic components of the audio data; a pitch detector configured to detect pitch information of the audio data based on the detected onset information; an aligner configured to align the audio data with reference audio data based on the onset information and the pitch information; and a corrector configured to correct the audio data, aligned with the reference audio data by the aligner, to match the reference audio data, wherein the onset detector is configured to detect the onset information by cepstral analyzing the audio data, by analyzing the harmonic components of the cepstral-analyzed audio data, and by generating a detection function based on cepstral coefficients of the analyzed harmonic components.
12. The audio correction apparatus of claim 11, wherein the onset detector comprises: a selector configured to select a harmonic component of a current frame using a pitch component of a previous frame; a coefficient calculator configured to calculate the cepstral coefficients of the harmonic components using the selected harmonic component of the current frame and the harmonic component of the previous frame; a function generator configured to generate the detection function by calculating a sum of the cepstral coefficients of the plurality of harmonic components calculated by the coefficient calculator; an onset candidate group extractor configured to extract an onset candidate group by detecting a peak of the detection function generated by the function generator; and an onset information detector configured to detect the onset information by removing a plurality of adjacent onsets from the onset candidate group extracted by the onset candidate group extractor.
13. The audio correction apparatus of claim 12, further comprising: a harmonic component determiner configured to determine whether the previous frame has the harmonic component, wherein, in response to the harmonic component determiner determining that the harmonic component of the previous frame exists, the coefficient calculator is configured to calculate a high cepstral coefficient, and wherein, in response to the harmonic component determiner determining that no harmonic component of the previous frame exists, the coefficient calculator is configured to calculate a low cepstral coefficient.
14. The audio correction apparatus of claim 11, wherein the pitch detector is configured to detect the pitch information between the detected onset components using a correntropy pitch detection method.
15. The audio correction apparatus of claim 11, wherein the aligner is configured to: compare the audio data with the reference audio data, and align the compared audio data with the reference audio data using a dynamic time warping method.
16. A non-transitory computer readable medium storing executable instructions, which in response to being executed by a processor, cause the processor to perform the following operations comprising: receiving audio data; detecting onset information by analyzing harmonic components of the received audio data; detecting pitch information of the received audio data based on the detected onset information; comparing the received audio data with reference audio data; aligning the received audio data with the reference audio data based on the detected onset information and the detected pitch information; and correcting the aligned audio data to match the reference audio data, wherein the processor detects the onset information based on selecting one of the analyzed harmonic components of the received audio data for a current frame based on a pitch component of a previous frame.
17. An audio correction method comprising: receiving audio data; detecting onset information in the received audio data by analyzing harmonic components of the received audio data; detecting pitch information of the received audio data based on the detected onset information; aligning the received audio data with reference audio data based on the detected onset information and the detected pitch information; and correcting the aligned audio data to match the reference audio data, wherein the detecting the onset information for a current frame is based on selecting one of the analyzed harmonic components for the current frame based on a pitch component of a previous frame.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) These and/or other aspects will become apparent and more readily appreciated from the following description of the exemplary embodiments, taken in conjunction with the accompanying drawings.
DETAILED DESCRIPTIONS OF EXEMPLARY EMBODIMENTS
(10) Hereinafter, exemplary embodiments will be explained in detail with reference to the accompanying drawings.
(11) First, the audio correction apparatus receives an input of audio data (in operation S110). According to an exemplary embodiment, the audio data may be data which includes a song which is sung by a person or a sound which is made by a musical instrument.
(12) The audio correction apparatus may detect onset information by analyzing harmonic components (in operation S120). An onset generally refers to a point where a musical note starts. In a human voice, however, the onset may not be clearly defined, as in glissandos, portamenti, and slurs. Therefore, according to an exemplary embodiment, an onset included in a song which is sung by a person may refer to a point where a vowel starts.
(13) In particular, the audio correction apparatus may detect the onset information using a Harmonic Cepstrum Regularity (HCR) method. The HCR method detects onset information by performing cepstral analysis with respect to audio data and analyzing harmonic components of the cepstral-analyzed audio data.
(14) The method for the audio correction apparatus to detect the onset information by analyzing the harmonic components according to an exemplary embodiment will be explained in detail with reference to
(15) First, the audio correction apparatus performs cepstral analysis with respect to the input audio data (in operation S121). Specifically, the audio correction apparatus may perform a pre-process such as pre-emphasis with respect to the input audio data. In addition, the audio correction apparatus performs fast Fourier transform (FFT) with respect to the input audio data. In addition, the audio correction apparatus may calculate the logarithm of the transformed audio data, and may perform the cepstral analysis by performing discrete cosine transform (DCT) with respect to the audio data.
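The cepstral analysis described above may be sketched as follows. The 0.97 pre-emphasis factor, the Hann window, and the 1024-sample frame length are illustrative assumptions not specified in the text:

```python
import numpy as np

def real_cepstrum(frame, pre_emphasis=0.97):
    """Cepstral analysis as described: pre-emphasis, FFT, logarithm, then DCT."""
    # Pre-emphasis: x[n] - a * x[n-1] boosts high frequencies
    emphasized = np.append(frame[0], frame[1:] - pre_emphasis * frame[:-1])
    # Window and fast Fourier transform
    windowed = emphasized * np.hanning(len(emphasized))
    spectrum = np.abs(np.fft.rfft(windowed))
    # Logarithm of the magnitude spectrum (small floor avoids log(0))
    log_spec = np.log(spectrum + 1e-10)
    # Discrete cosine transform (DCT-II), written out explicitly
    n = len(log_spec)
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    dct_matrix = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return dct_matrix @ log_spec

frame = np.sin(2 * np.pi * 220 * np.arange(1024) / 8000)
cep = real_cepstrum(frame)
print(cep.shape)  # one cepstral coefficient per quefrency bin
```

In a harmonic frame, the cepstrum shows pronounced coefficients at the quefrency of the pitch period and its multiples, which is what the harmonic-component analysis in the following operations exploits.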
(16) In addition, the audio correction apparatus selects a harmonic component of a current frame (in operation S122). Specifically, the audio correction apparatus may detect pitch information of a previous frame and select a harmonic quefrency which is a harmonic component of a current frame using the pitch information of the previous frame.
(17) In addition, the audio correction apparatus calculates a cepstral coefficient with respect to a plurality of harmonic components using the harmonic component of the current frame and the harmonic component of the previous frame (in operation S123). According to an exemplary embodiment, when there is a harmonic component of a previous frame, the audio correction apparatus calculates a high cepstral coefficient, and, when there is no harmonic component of a previous frame, the audio correction apparatus may calculate a low cepstral coefficient.
(18) In addition, the audio correction apparatus generates a detection function by calculating a sum of the cepstral coefficients for the plurality of harmonic components (in operation S124). Specifically, the audio correction apparatus receives an input of audio data including a voice signal, as shown in
(19) In addition, the audio correction apparatus extracts an onset candidate group by detecting the peak of the generated detection function (in operation S125). Specifically, when another harmonic component appears in the middle of existence of harmonic components, that is, at a point where an onset occurs, the cepstral coefficient abruptly changes. Therefore, the audio correction apparatus may extract a peak point where the detection function, which is the sum of the cepstral coefficients of the plurality of harmonic components, is abruptly changed. According to an exemplary embodiment, the extracted peak point may be set as the onset candidate group.
(20) In addition, the audio correction apparatus detects onset information from the onset candidate group (in operation S126). Specifically, from among the onset candidates extracted in operation S125, a plurality of candidates may be extracted from adjacent sections. Candidates extracted from adjacent sections may correspond to spurious onsets which occur when the human voice trembles or when other noises come in. Therefore, the audio correction apparatus may keep only one onset candidate from among the plurality of onset candidates in the adjacent sections, remove the others, and detect the remaining candidate as onset information.
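Operations S125 and S126 — peak picking on the detection function and removal of adjacent onsets — may be sketched as follows. The threshold rule (mean plus one standard deviation) and the 50 ms minimum gap are illustrative assumptions:

```python
import numpy as np

def detect_onsets(detection_fn, frame_rate, threshold=None, min_gap_s=0.05):
    """Pick onsets from an HCR-style detection function."""
    d = np.asarray(detection_fn, dtype=float)
    if threshold is None:
        threshold = d.mean() + d.std()
    # Onset candidate group: local peaks above the threshold
    peaks = [i for i in range(1, len(d) - 1)
             if d[i] > d[i - 1] and d[i] >= d[i + 1] and d[i] > threshold]
    # Remove adjacent onsets: keep only the strongest peak of each cluster
    min_gap = int(min_gap_s * frame_rate)
    onsets = []
    for p in peaks:
        if onsets and p - onsets[-1] < min_gap:
            if d[p] > d[onsets[-1]]:
                onsets[-1] = p  # replace the weaker neighbour
        else:
            onsets.append(p)
    return onsets

d = np.zeros(100)
d[10], d[12], d[50] = 1.0, 2.0, 1.5
print(detect_onsets(d, frame_rate=100))  # → [12, 50]: the weaker adjacent peak is dropped
```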
(21) By detecting the onset through the cepstral analysis, as described above, according to an exemplary embodiment, an exact onset can be detected from audio data in which onsets are not clearly distinguished like in a song which is sung by a person or a sound which is made by a string instrument.
(22) Table 1 presented below shows a result of detecting an onset using the HCR method, according to an exemplary embodiment:
(23) TABLE 1

  Source     Precision   Recall   F-measure
  Male 1     0.57        0.87     0.68
  Male 2     0.69        0.92     0.79
  Male 3     0.62        1.00     0.76
  Male 4     0.60        0.90     0.72
  Male 5     0.67        0.91     0.77
  Female 1   0.46        0.87     0.60
  Female 2   0.63        0.79     0.70
(24) As described above, the F-measures for the various sources are calculated as 0.60-0.79. Considering that the F-measures obtained by various related-art algorithms are 0.19-0.56, an onset can be detected more accurately using the HCR method according to an exemplary embodiment.
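The F-measure reported in Table 1 is the harmonic mean of precision and recall. Small mismatches with the printed values (e.g. 0.69 vs. 0.68 for Male 1) come from precision and recall themselves being rounded before publication:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Male 1 from Table 1:
print(round(f_measure(0.57, 0.87), 2))  # → 0.69
```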
(25) Referring back to
(26) In an exemplary embodiment, the audio correction apparatus divides a signal between the onsets (in operation S131). Specifically, the audio correction apparatus may divide a signal between the plurality of onsets based on the onset detected in operation S120.
(27) In addition, the audio correction apparatus may perform gammatone filtering with respect to the input signal (in operation S132). Specifically, the audio correction apparatus applies 64 gammatone filters to the input signal. In an exemplary embodiment, the center frequencies of the plurality of gammatone filters are spaced at equal intervals, and the bandwidth of each filter is set between 80 Hz and 400 Hz.
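A gammatone filterbank of the kind described may be sketched as follows. The impulse-response form t^(n-1)·e^(−2πbt)·cos(2πf_c t) is the standard gammatone; the 80 Hz-4 kHz frequency range and the bandwidth rule (a tenth of the center frequency, clipped to the 80-400 Hz range stated above) are assumptions, since the text does not specify them:

```python
import numpy as np

def gammatone_ir(fc, bw, fs, duration=0.05, order=4):
    """Gammatone impulse response t^(n-1) * exp(-2*pi*bw*t) * cos(2*pi*fc*t)."""
    t = np.arange(int(duration * fs)) / fs
    g = t ** (order - 1) * np.exp(-2 * np.pi * bw * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))

def gammatone_filterbank(x, fs, n_filters=64, f_lo=80.0, f_hi=4000.0):
    """Apply a bank of gammatone filters with equally spaced center frequencies."""
    centers = np.linspace(f_lo, f_hi, n_filters)
    bands = np.clip(centers / 10.0, 80.0, 400.0)  # bandwidths kept within 80-400 Hz
    return [np.convolve(x, gammatone_ir(fc, bw, fs), mode="same")
            for fc, bw in zip(centers, bands)]

fs = 8000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
bands_out = gammatone_filterbank(x, fs, n_filters=8)  # 8 filters for brevity
print(len(bands_out), len(bands_out[0]))  # → 8 8000
```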
(28) In addition, the audio correction apparatus generates a correntropy function with respect to the input signal (in operation S133). Correntropy generally captures higher-order statistics than the related-art auto-correlation. Therefore, according to an exemplary embodiment, when a human voice is corrected, the frequency resolution is higher than with the related-art auto-correlation. The audio correction apparatus may obtain a correntropy function as shown in Equation 1 presented below:
V(t,s) = E[k(x(t), x(s))]   (Equation 1)

where x(t) and x(s) indicate the input signal at times t and s, respectively.
(29) In this case, k(·,·) may be a kernel function which has a positive value and a symmetric characteristic. According to an exemplary embodiment, the kernel function may be a Gaussian kernel. The correntropy function with the Gaussian kernel substituted, and the Gaussian kernel itself, may be expressed by Equations 2 and 3 presented below:
(30) V(t,s) = E[G_σ(x(t) − x(s))]   (Equation 2)

G_σ(u) = (1/(σ√(2π))) exp(−u²/(2σ²))   (Equation 3)
(31) In addition, the audio correction apparatus detects the peak of the correntropy function (in operation S134). Specifically, when the correntropy is calculated, the audio correction apparatus may obtain a higher frequency resolution with respect to the input audio data than with auto-correlation, and thus detect a sharper peak at the frequency of the corresponding signal. According to an exemplary embodiment, the audio correction apparatus may determine a frequency whose peak is greater than or equal to a predetermined threshold value from among the calculated peaks as a pitch of the input voice signal. More specifically,
(32) In addition, the audio correction apparatus may detect a pitch sequence based on the detected pitch (in operation S135). Specifically, the audio correction apparatus may detect pitch information with respect to the plurality of onsets and may detect a pitch sequence for every onset.
(33) In the above-described exemplary embodiment, the pitch is detected using the correntropy pitch detection method. However, this is merely an example and not by way of a limitation, and the pitch of the audio data may be detected using other methods (for example, the auto-correlation method).
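The correntropy pitch detection of operations S133-S135 may be sketched as follows, estimating the pitch as the lag that maximizes V(lag) = E[G_σ(x(t) − x(t+lag))]. The kernel width σ and the search range are illustrative assumptions; the range is kept narrow here so the sketch does not confuse the pitch with its subharmonics:

```python
import numpy as np

def correntropy_pitch(x, fs, f_min=150.0, f_max=400.0, sigma=0.2):
    """Pitch = fs / (lag maximizing the correntropy with a Gaussian kernel)."""
    lag_lo = int(fs / f_max)
    lag_hi = int(fs / f_min)
    best_lag, best_v = lag_lo, -np.inf
    for lag in range(lag_lo, lag_hi + 1):
        diff = x[:-lag] - x[lag:]
        # Gaussian kernel: near 1 wherever x(t) and x(t+lag) agree
        v = np.mean(np.exp(-diff ** 2 / (2 * sigma ** 2)))
        if v > best_v:
            best_lag, best_v = lag, v
    return fs / best_lag

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t)
print(correntropy_pitch(x, fs))  # → 200.0
```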
(34) Referring back to
(35) In particular, the audio correction apparatus may align the audio data with the reference audio data using a dynamic time warping (DTW) method. The dynamic time warping method is an algorithm for finding an optimum warping path by comparing the similarity between two sequences.
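The dynamic time warping just described may be sketched as follows: a cumulative cost matrix is filled and the optimum warping path is recovered by backtracking. The absolute-difference local cost is an illustrative assumption:

```python
import numpy as np

def dtw_path(x, y):
    """Dynamic time warping: cumulative cost plus backtracked optimal path."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])  # local cost
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the corner to recover the warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = {(i - 1, j - 1): cost[i - 1, j - 1],
                 (i - 1, j): cost[i - 1, j],
                 (i, j - 1): cost[i, j - 1]}
        i, j = min(steps, key=steps.get)
    return cost[n, m], path[::-1]

dist, path = dtw_path([1, 2, 3], [1, 2, 2, 3])
print(dist, path)  # → 0.0: the repeated 2 is absorbed by the warping
```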
(36) Specifically, the audio correction apparatus may detect sequence X with respect to the audio data input using operations S120 and S130, as shown in
(37) In particular, according to an exemplary embodiment, the audio correction apparatus may detect an optimum path for pitch information, as shown with a dotted line in
(38) According to an exemplary embodiment, the audio correction apparatus may calculate an onset correction ratio and a pitch correction ratio of the audio data with respect to the reference audio data while calculating the optimum path. The onset correction ratio may be a ratio for correcting the length of time of the input audio data (time stretching ratio), and the pitch correction ratio may be a ratio for correcting the frequency of the input audio data (pitch shifting ratio).
(39) Referring back to
(40) In particular, the audio correction apparatus may correct the onset information of the audio data using a phase vocoder. Specifically, the phase vocoder may correct the onset information of the audio data through analysis, modification, and synthesis. In an exemplary embodiment, the onset information correction in the phase vocoder may stretch or reduce the time of the input audio data by differently setting an analysis hopsize and a synthesis hopsize.
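The analysis/synthesis-hopsize mechanism described above may be sketched as a minimal phase vocoder. The frame length, hopsizes, and Hann window are illustrative assumptions:

```python
import numpy as np

def pvoc_stretch(x, rate, n_fft=1024, hop_a=256):
    """Time-stretch x by `rate` via different analysis and synthesis hopsizes."""
    hop_s = int(round(hop_a * rate))          # synthesis hopsize
    win = np.hanning(n_fft)
    n_frames = (len(x) - n_fft) // hop_a
    # Expected phase advance per analysis hop for each rfft bin
    omega = 2 * np.pi * np.arange(n_fft // 2 + 1) * hop_a / n_fft
    out = np.zeros(n_frames * hop_s + n_fft)
    prev_phase = acc_phase = None
    for i in range(n_frames):
        spec = np.fft.rfft(x[i * hop_a:i * hop_a + n_fft] * win)   # analysis
        mag, phase = np.abs(spec), np.angle(spec)
        if prev_phase is None:
            acc_phase = phase.copy()
        else:
            # Deviation from the expected advance, wrapped to [-pi, pi]
            dphi = phase - prev_phase - omega
            dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
            acc_phase = acc_phase + (omega + dphi) * (hop_s / hop_a)  # modification
        prev_phase = phase
        out[i * hop_s:i * hop_s + n_fft] += np.fft.irfft(mag * np.exp(1j * acc_phase)) * win
    return out

fs = 8000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
y = pvoc_stretch(x, 1.5)
print(len(y) / len(x))  # roughly 1.4: the signal is lengthened
```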
(41) In addition, the audio correction apparatus may correct the pitch information of the audio data using the phase vocoder. According to an exemplary embodiment, the audio correction apparatus may correct the pitch information of the audio data using a change in the pitch which occurs when a time scale is changed through re-sampling. Specifically, the audio correction apparatus performs time stretching 152 with respect to the input audio data 151, as shown in
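The re-sampling step that produces the pitch change may be sketched as follows: resampling by a factor r scales the pitch by r while scaling the duration by 1/r, which is why the full method pairs it with time stretching. Linear interpolation is an illustrative choice of resampler:

```python
import numpy as np

def resample(x, factor):
    """Resample by linear interpolation; pitch scales by `factor`,
    duration scales by 1/factor."""
    n_out = int(len(x) / factor)
    idx = np.linspace(0, len(x) - 1, n_out)
    return np.interp(idx, np.arange(len(x)), x)

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 100 * t)   # 100 Hz over 1 s
y = resample(x, 2.0)              # same cycles squeezed into 0.5 s -> 200 Hz
peak = np.argmax(np.abs(np.fft.rfft(y))) * fs / len(y)
print(peak)  # ≈ 200.0
```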
(42) In addition, when the audio correction apparatus corrects the pitch through re-sampling, the input audio data may be multiplied in advance by an alignment coefficient P, which is predetermined to maintain the formant even after re-sampling, in order to prevent the formant from being changed. The alignment coefficient P may be calculated by Equation 4 presented below:
(43)
(44) In this case, A(k) is a formant envelope.
(45) In addition, in the case of a general phase vocoder, distortion such as ringing may be caused. This problem arises from phase discontinuity along the time axis, which occurs when phase discontinuity along the frequency axis is corrected. To solve this problem, according to an exemplary embodiment, the audio correction apparatus may correct the audio data by preserving the formant of the audio data using a synchronized overlap add (SOLA) algorithm. Specifically, the audio correction apparatus may perform phase vocoding with respect to some initial frames, and then may remove the discontinuity which occurs on the time axis by synchronizing the input audio data with the data which has undergone the phase vocoding.
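The synchronization step at the heart of SOLA — shifting the new segment within a small search range so its overlap with the existing output is maximally correlated, then crossfading — may be sketched as follows. The overlap and search lengths are illustrative assumptions:

```python
import numpy as np

def sola_splice(a, b, overlap=256, search=64):
    """Splice b onto a: shift b within +/- search samples to maximize
    cross-correlation over the overlap, then crossfade."""
    tail = a[-overlap:]
    best_k, best_c = 0, -np.inf
    for k in range(-search, search + 1):
        seg = b[search + k: search + k + overlap]
        c = np.dot(tail, seg)           # cross-correlation at this shift
        if c > best_c:
            best_k, best_c = k, c
    b_aligned = b[search + best_k:]
    fade = np.linspace(0.0, 1.0, overlap)
    mixed = tail * (1 - fade) + b_aligned[:overlap] * fade  # crossfade
    return np.concatenate([a[:-overlap], mixed, b_aligned[overlap:]])

rng = np.random.default_rng(0)
x = rng.standard_normal(4000)
a, b = x[:1000], x[800:]  # b overlaps a with an offset the splice must find
y = sola_splice(a, b, overlap=128, search=64)
print(np.allclose(y, x))  # → True: the splice recovers the original alignment
```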
(46) According to the above-described audio correction method of an exemplary embodiment, the onset can be detected from the audio data in which the onsets are not clearly distinguished, such as a song which is sung by a person or a sound of a string instrument, and thus, the audio data can be corrected more exactly or precisely.
(47) Hereinafter, an audio correction apparatus 800 according to an exemplary embodiment will be explained in detail with reference to
(48) The inputter 810 receives an input of audio data. According to an exemplary embodiment, the audio data may be a song which is sung by a person or a sound of a string instrument. The inputter 810 may be, for example, a microphone configured to detect audio signals.
(49) The onset detector 820 may detect an onset by analyzing harmonic components of the input audio data. Specifically, the onset detector 820 may detect onset information by performing cepstral analysis with respect to the audio data and then analyzing the harmonic components of the cepstral-analyzed audio data. In particular, the onset detector 820 performs cepstral analysis with respect to the audio data as shown in
(50) The pitch detector 830 detects pitch information of the audio data based on the detected onset information. According to an exemplary embodiment, the pitch detector 830 may detect pitch information between the onset components using a correntropy pitch detection method. However, this is merely an example and not by way of a limitation, and the pitch information may be detected using other methods.
(51) The aligner 840 compares the input audio data and reference audio data and aligns the input audio data with reference audio data based on the detected onset information and pitch information. In this case, the aligner 840 may compare the input audio data and the reference audio data and align the input audio data with the reference audio data using a dynamic time warping method. According to an exemplary embodiment, the aligner 840 may calculate an onset correction ratio and a pitch correction ratio of the input audio data with respect to the reference audio data.
(52) The corrector 850 may correct the input audio data aligned with the reference audio data to match the reference audio data. In particular, the corrector 850 may correct the input audio data according to the calculated onset correction ratio and pitch correction ratio. In addition, the corrector 850 may correct the input audio data using an SOLA algorithm to prevent a change of a formant which may be caused when the onset and pitch are corrected. In an exemplary embodiment, the onset detector 820, the pitch detector 830, the aligner 840, and the corrector 850 may be implemented by a hardware processor or a combination of processors. The corrected input audio data may be output via speakers (not shown).
(53) The above-described audio correction apparatus 800 can detect the onset from the audio data in which the onsets are not clearly distinguished, such as a song which is sung by a person or a sound of a string instrument, and thus can correct the audio data more exactly and/or precisely.
(54) In particular, when the audio correction apparatus 800 is implemented by using a user terminal such as a smartphone, exemplary embodiments may be applicable to various scenarios. For example, the user may select a song that the user wants to sing. The audio correction apparatus 800 obtains reference MIDI data of the song selected by the user. When a record button is selected by the user, the audio correction apparatus 800 displays a score and guides the user to sing the song more exactly or precisely, i.e., more closely to how it should be sung. When the recording of the user's song is completed, the audio correction apparatus 800 corrects the user's song, according to an exemplary embodiment described above with reference to
(55) The audio correction method of the audio correction apparatus 800 according to the above-described various exemplary embodiments may be implemented as a program and provided to the audio correction apparatus 800. In particular, the program implementing the audio correction method may be stored in a non-transitory computer readable medium and provided for use by the apparatus.
(56) The non-transitory computer readable medium refers to a medium that stores data semi-permanently rather than storing data for a very short time, such as a register, a cache, and a memory, and is readable by an apparatus. Specifically, the above-described various applications or programs may be stored in a non-transitory computer readable medium such as a compact disc (CD), a digital versatile disk (DVD), a hard disk, a Blu-ray disk, a universal serial bus (USB), a memory card, and a read only memory (ROM), and may be provided for use by a device.
(57) The foregoing exemplary embodiments are merely exemplary and are not to be construed as limiting the present inventive concept. The exemplary embodiments can be readily applied to other types of apparatuses. Also, the description of the exemplary embodiments is intended to be illustrative, and not to limit the scope of the claims, and many alternatives, modifications, and variations will be apparent to those skilled in the art.