VOICE CONVERSION DEVICE, VOICE CONVERSION METHOD, AND VOICE CONVERSION PROGRAM
20230086642 · 2023-03-23
Inventors
- Shinnosuke Takamichi (Tokyo, JP)
- Yuki SAITO (Tokyo, JP)
- Takaaki Saeki (Tokyo, JP)
- Hiroshi Saruwatari (Tokyo, JP)
CPC classification
G10L25/18
PHYSICS
International classification
G10L13/033
PHYSICS
G10L25/18
PHYSICS
Abstract
The present invention provides a voice conversion apparatus and the like using a differential spectrum method which is capable of implementing both high voice quality and real-time performance even in a wide band. A voice conversion apparatus 10 includes: an acquiring unit 11 configured to acquire a signal of a voice of a subject; a dividing unit 12 configured to divide the signal into sub-band signals corresponding to a plurality of frequency bands; a converting unit configured to convert one or a plurality of sub-band signals corresponding to one or a plurality of lower frequency bands, out of the sub-band signals corresponding to the plurality of frequency bands; and a synthesizing unit 16 configured to generate a synthesized voice by synthesizing the one or plurality of sub-band signals after conversion and the remaining sub-band signals that are not converted.
Claims
1. A voice conversion apparatus comprising: an acquiring unit configured to acquire a signal of a voice of a subject; a dividing unit configured to divide the signal into sub-band signals corresponding to a plurality of frequency bands; a converting unit configured to convert one or a plurality of sub-band signals corresponding to one or a plurality of lower frequency bands, out of the sub-band signals corresponding to the plurality of frequency bands; and a synthesizing unit configured to generate a synthesized voice by synthesizing the one or plurality of sub-band signals after conversion and a remaining sub-band signal that is not converted.
2. The voice conversion apparatus according to claim 1, wherein a sampling frequency of the signal is 44.1 kHz or more, and the one or plurality of sub-band signals corresponding to the one or plurality of lower frequency bands include at least sub-band signals corresponding to 2 kHz to 4 kHz frequency bands.
3. The voice conversion apparatus according to claim 1 or 2, wherein the converting unit further comprises: a filter calculating unit configured to calculate a spectrum of a filter by converting a feature value indicating a tone of voice of the one or plurality of sub-band signals corresponding to the one or plurality of lower frequency bands using a learned conversion model, and multiplying the feature value after conversion by a learned lifter; a shortened filter calculating unit configured to calculate a shortened filter by performing inverse Fourier transform on the spectrum of the filter, and applying a predetermined window function thereto; and a generating unit configured to generate a converted voice of the one or plurality of sub-band signals corresponding to the one or plurality of lower frequency bands by multiplying the spectrum of the signal by the spectrum determined by performing Fourier transform on the shortened filter, and performing inverse transform thereon.
4. The voice conversion apparatus according to claim 3, further comprising a learning unit configured to calculate a feature value indicating a tone of the converted voice by multiplying the spectrum of the one or plurality of sub-band signals corresponding to the one or plurality of lower frequency bands by the spectrum determined by performing Fourier transform on the shortened filter, updating a parameter of the conversion model and the lifter so as to minimize an error between the feature value and a feature value indicating a tone of a target voice, and generating the learned conversion model and the learned lifter.
5. The voice conversion apparatus according to claim 4, wherein the conversion model is constructed by a neural network, and the learning unit updates the parameter by an error back propagation method, and generates the learned conversion model and the learned lifter.
6. A voice conversion method executed by a processor included in a voice conversion apparatus, comprising steps of: acquiring a signal of a voice of a subject; dividing the signal into sub-band signals corresponding to a plurality of frequency bands; converting one or a plurality of sub-band signals corresponding to one or a plurality of lower frequency bands, out of the sub-band signals corresponding to the plurality of frequency bands; and generating a synthesized voice by synthesizing the one or plurality of sub-band signals after the conversion and a remaining sub-band signal that is not converted.
7. A voice conversion program that causes a processor included in the voice conversion apparatus to function as: an acquiring unit configured to acquire a signal of a voice of a subject; a dividing unit configured to divide the signal into sub-band signals corresponding to a plurality of frequency bands; a converting unit configured to convert one or a plurality of sub-band signals corresponding to one or a plurality of lower frequency bands, out of the sub-band signals corresponding to the plurality of frequency bands; and a synthesizing unit configured to generate a synthesized voice by synthesizing the one or plurality of sub-band signals after conversion and a remaining sub-band signal that is not converted.
Description
BRIEF DESCRIPTION OF DRAWINGS
DESCRIPTION OF EMBODIMENTS
[0031] Embodiments of the present invention will be described with reference to the accompanying drawings. In each drawing, composing elements denoted with the same reference sign have an identical or similar configuration.
[0033] The acquiring unit 11 acquires a signal of a voice of a subject. The acquiring unit 11 acquires the voice of the subject, which has been converted into an electric signal by a microphone 20, for a predetermined period.
[0034] The dividing unit 12 divides a signal of a voice in a single frequency band (also called “full band signal” or “wide band signal”) acquired by the acquiring unit 11 into sub-band signals corresponding to a plurality of frequency bands. Specifically, the dividing unit 12 divides a band of the voice of the conversion source speaker by the sub-band multi-rate processing.
[0035] The dividing unit 12 divides a band of the voice of the subject into N number of sub-band signals, modulates each of the N number of sub-band signals to generate base band signals of N number of sub-bands, and shifts frequency. For example, as indicated in the following Expression (1), the dividing unit 12 may generate a base band signal x.sub.n(t) of the n-th sub-band, from the signal x(t) of the voice of the subject in the t (1≤t≤T) th frame, out of the total number of frames T in a predetermined period.
x.sub.n(t)=x(t)W.sub.N.sup.−t(n-1/2) Expression (1)
[0036] Here n=1, 2, . . . , N, and W.sub.N=exp(j2π/2N), for example.
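The modulation of Expression (1) may be sketched as follows. This is an illustrative numpy sketch; the function name, the 1-based frame indexing, and the array shapes are assumptions for illustration, not part of the embodiment.

```python
import numpy as np

# Illustrative sketch of Expression (1): modulate the full band signal x(t)
# to obtain the base band signal x_n(t) of the n-th sub-band.
# W_N = exp(j*2*pi/(2N)) as in paragraph [0036].
def modulate_to_baseband(x, n, N):
    t = np.arange(1, len(x) + 1)            # frame index t (1 <= t <= T)
    W_N = np.exp(1j * 2 * np.pi / (2 * N))
    return x * W_N ** (-t * (n - 0.5))      # x_n(t) = x(t) W_N^(-t(n-1/2))
```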
[0037] The dividing unit 12 may limit the base band signal x.sub.n(t), which is the base band signal of the n-th sub-band, to a predetermined band (e.g. [−π/2N, π/2N]) by applying a low pass filter f(t), which is common to the full band (that is, common to the N number of sub-bands). For example, a signal obtained by limiting the band of the base band signal x.sub.n(t) of the n-th sub-band to the predetermined band is given by the following Expression (2).
x.sub.n,pp(t)=f(t)*x.sub.n(t) Expression (2)
[0038] Here * is a convolution operator. The signal x.sub.n,pp(t) is acquired as a complex value.
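The band limitation of Expression (2) can be sketched as below. The moving-average taps stand in for the low pass filter f(t), whose actual design is not specified in this text.

```python
import numpy as np

# Illustrative sketch of Expression (2): x_{n,pp}(t) = f(t) * x_n(t).
# The low pass filter f(t) is common to all N sub-bands; the 5-tap
# moving average below is only a placeholder for illustration.
def band_limit(x_n, f=None):
    if f is None:
        f = np.ones(5) / 5.0
    return np.convolve(f, x_n, mode="same")  # complex valued in general
```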
[0039] The dividing unit 12 also converts the signal x.sub.n,pp(t), which is acquired as the complex value, into a real value x.sub.n,SSB(t). For example, the dividing unit 12 may acquire the real value x.sub.n,SSB(t) by the following Expression (3) using the single sideband (SSB) modulation method.
x.sub.n,SSB(t)=x.sub.n,pp(t)W.sub.N.sup.t/2+x.sub.n,pp*(t)W.sub.N.sup.−t/2 Expression (3)
[0040] Here ⋅* indicates a complex conjugate.
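Expression (3), the SSB modulation that turns the complex band-limited signal into a real one, can be sketched as follows (an illustrative numpy sketch; names and indexing are assumptions):

```python
import numpy as np

# Illustrative sketch of Expression (3): single sideband (SSB) modulation,
# x_{n,SSB}(t) = x_{n,pp}(t) W_N^{t/2} + x*_{n,pp}(t) W_N^{-t/2}.
# The two terms are complex conjugates, so the sum is real up to rounding.
def ssb_to_real(x_pp, N):
    t = np.arange(1, len(x_pp) + 1)
    W_N = np.exp(1j * 2 * np.pi / (2 * N))
    x_ssb = x_pp * W_N ** (t / 2) + np.conj(x_pp) * W_N ** (-t / 2)
    return x_ssb.real
```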
[0041] The dividing unit 12 also generates the n-th sub-band signal x.sub.n(k) by decimating the real value x.sub.n,SSB(t) at a decimation rate M. The n-th sub-band signal x.sub.n(k) is given by the following Expression (4), for example.
x.sub.n(k)=x.sub.n,SSB(kM) Expression (4)
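The decimation of Expression (4) amounts to keeping every M-th sample, which can be sketched as:

```python
import numpy as np

# Illustrative sketch of Expression (4): decimate the real signal at rate M,
# i.e. keep every M-th sample: x_n(k) = x_{n,SSB}(kM).
def decimate(x_ssb, M):
    return x_ssb[::M]
```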
[0042] Out of the N number of sub-band signals generated by the dividing unit 12, one or a plurality of sub-band signals corresponding to one or a plurality of low frequency bands are called “lower frequency sub-band signals”, and one or a plurality of sub-band signals corresponding to one or a plurality of higher frequency bands, other than the lower frequency sub-band signals, are called “higher frequency sub-band signals”. The lower frequency sub-band signals may also be called a “sub-band signal in a low frequency band”, a “low band sub-band signal”, a “low frequency sub-band signal” or the like. In the same manner, the higher frequency sub-band signals may also be called a “sub-band signal in a high frequency band”, a “high band sub-band signal”, a “high frequency band sub-band signal” or the like.
[0043] The filter calculating unit 13 converts the feature value, which expresses the tone of voice, of the lower frequency sub-band signals using a learned conversion model 13a, and multiplies the feature value after the conversion by a learned lifter 13b, so as to calculate a spectrum of a filter (also called “differential filter”). Here the feature value that expresses the tone of voice may be a mel-frequency cepstrum of the voice. By using the mel-frequency cepstrum for the feature value, the tone of voice of the subject can be appropriately captured.
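As an illustration of extracting such a cepstral feature, the sketch below computes a low-order real cepstrum of one frame. The plain (non-mel) cepstrum, the frame length, and the number of coefficients are simplifying assumptions, not the embodiment's exact feature.

```python
import numpy as np

# Illustrative sketch: low-order real cepstrum of a voice frame as a tone
# feature. The mel-frequency warp used in the embodiment is omitted here.
def real_cepstrum(frame, n_coef=20):
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)  # small offset avoids log(0)
    return np.fft.irfft(log_mag)[:n_coef]       # keep low-order coefficients
```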
[0044] The filter calculating unit 13 calculates a low-order (e.g. orders 10 to 100) real cepstrum series C.sub.t.sup.(X) from the complex spectral series F.sub.t.sup.(X) determined by performing Fourier transform on the lower frequency sub-band signals in the t (1≤t≤T) th frame in a predetermined period. Then the filter calculating unit 13 converts the real cepstrum series C.sub.t.sup.(X) using the learned conversion model 13a, so as to calculate the feature value after conversion C.sub.t.sup.(D).
[0045] Further, the filter calculating unit 13 multiplies the feature value after conversion C.sub.t.sup.(D) by the learned lifter 13b, so as to calculate the spectrum of the filter. Specifically, the filter calculating unit 13 calculates the product uC.sub.t.sup.(D) (where u is the learned lifter 13b), performs Fourier transform thereon, and applies the exponential function (exp), whereby the complex spectral series F.sub.t.sup.(D) of the filter is calculated.
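The liftering step may be sketched as follows, using one common cepstrum-to-spectrum convention (the lifter and cepstrum multiplied elementwise, transformed, then exponentiated). The transform convention, FFT size, and function names are assumptions for illustration.

```python
import numpy as np

# Illustrative sketch of paragraph [0045]: the lifter u weights the converted
# cepstrum C_t^(D); transforming u*C_t^(D) and applying exp yields the
# complex spectral series F_t^(D) of the differential filter.
def filter_spectrum(c_d, u, n_fft=64):
    liftered = np.zeros(n_fft)
    liftered[: len(c_d)] = u * c_d       # u C_t^(D), zero-padded to FFT size
    return np.exp(np.fft.fft(liftered))  # exp of log-spectrum -> F_t^(D)
```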
[0046] The value of the learned lifter 13b used by the voice conversion apparatus 10 according to the present embodiment is a value determined by a later mentioned learning processing. In the learning processing, the value of the lifter 13b is updated along with the parameters of the conversion model 13a, and is determined such that the target voice is better reproduced by the synthesized voice.
[0047] The shortened filter calculating unit 14 performs inverse Fourier transform on the complex spectral series F.sub.t.sup.(D) of the filter, and applies a predetermined window function thereto, so as to calculate the shortened filter. Specifically, the shortened filter calculating unit 14 performs inverse Fourier transform on the complex spectral series F.sub.t.sup.(D) of the filter, so as to determine a value f.sub.t.sup.(D) in the time domain (also called a “differential filter” in the time domain). For example, as indicated in Expression (5), the shortened filter calculating unit 14 calculates the complex spectral series F.sub.t.sup.(l) of the shortened filter, of which tap length is l, by truncating the value f.sub.t.sup.(D) with a window function w that is 1 before the time l and 0 after the time l, and performing Fourier transform thereon.
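The truncation described above can be sketched as follows (a minimal numpy sketch with a rectangular window; Expression (5) itself is not reproduced in this text):

```python
import numpy as np

# Illustrative sketch of the shortened filter: go to the time domain,
# keep only the first l taps (window = 1 before time l, 0 after), and
# return to the frequency domain to obtain F_t^(l).
def shortened_filter(F_d, l):
    f_d = np.fft.ifft(F_d)       # time-domain differential filter f_t^(D)
    w = np.zeros(len(f_d))
    w[:l] = 1.0                  # rectangular truncation window
    return np.fft.fft(f_d * w)   # F_t^(l)
```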
[0048] In Expression (5), N denotes the number of frequency bins, T denotes the total number of frames in a predetermined period, and l denotes the tap length (l-th frame).
[0049] The generating unit 15 multiplies the spectrum of the lower frequency band sub-band signal by the spectrum generated by performing Fourier transform on the shortened filter, and performs inverse Fourier transform thereon, so as to generate a converted voice. The generating unit 15 calculates a product F.sub.t.sup.(Y) of the spectrum F.sub.t.sup.(l) generated by performing Fourier transform on the shortened filter and the spectrum F.sub.t.sup.(X) of the lower frequency band sub-band signal, and performs inverse Fourier transform on the spectrum F.sub.t.sup.(Y), so as to generate the converted voice of the lower frequency band sub-band signal. The filter calculating unit 13, the shortened filter calculating unit 14 and the generating unit 15 may be called a “converting unit”.
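The frequency-domain filtering performed by the generating unit 15 can be sketched per frame as follows (function names and frame length are illustrative):

```python
import numpy as np

# Illustrative sketch of paragraph [0049]: filter one frame of the lower
# frequency band sub-band signal in the frequency domain.
def convert_frame(x_frame, F_l):
    F_x = np.fft.fft(x_frame)     # F_t^(X)
    F_y = F_x * F_l               # F_t^(Y) = F_t^(l) F_t^(X)
    return np.fft.ifft(F_y).real  # converted voice frame
```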
[0050] The synthesizing unit 16 synthesizes: the signals of the converted voice of the lower frequency sub-band signals generated by the generating unit 15 (that is, sub-band signals after conversion); and the higher frequency sub-band signals separated by the dividing unit 12 (that is, the remaining sub-band signals that are not converted).
[0051] The synthesizing unit 16 upsamples the n (1≤n≤N) th sub-band signal X.sub.n(t) at the decimation rate M, for example, as indicated in Expression (6), and acquires the real value X.sub.n,SSB(t) of the converted voice signal. The n-th sub-band signal X.sub.n(t) is a signal of the converted voice generated by converting the lower frequency band sub-band signal x.sub.n(k) generated by the dividing unit 12, or the same signal as the higher frequency band sub-band signal x.sub.n(k) generated by the dividing unit 12 (an unconverted signal). For example, in the case of assigning the index n to the plurality of sub-bands in the full band in ascending order from the lower frequency band, the sub-band signals of a predetermined number of sub-bands (e.g. 1) counting from n=1, such as X.sub.1(t), are signals of the converted voice after the lower frequency band sub-band signal x.sub.1(k) is converted. On the other hand, the sub-band signals X.sub.2(t), X.sub.3(t), . . . , X.sub.N(t) of n=2, 3, . . . , N may be the same signals as the higher frequency sub-band signals x.sub.2(k), x.sub.3(k), . . . , x.sub.N(k) (unconverted signals).
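Since Expression (6) is not reproduced in this text, the upsampling step is sketched below assuming plain zero-insertion upsampling at rate M:

```python
import numpy as np

# Illustrative sketch of the upsampling in paragraph [0051]: insert M-1
# zeros between samples (assumed form of Expression (6)).
def upsample(x_n, M):
    out = np.zeros(len(x_n) * M)
    out[::M] = x_n
    return out
```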
[0052] Further, in order to avoid aliasing, the synthesizing unit 16 frequency-shifts the real value X.sub.n,SSB(t) to the base band, limits the band using the low pass filter g(t), and acquires the complex value X.sub.n,pp(t), for example, as indicated in Expression (7).
X.sub.n,pp(t)=g(t)*(X.sub.n,SSB(t)W.sub.N.sup.−t/2) Expression (7)
[0053] Furthermore, the synthesizing unit 16 acquires the converted voice X(t) in full band, for example, as indicated in Expression (8).
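Since Expression (8) is not reproduced in this text, the full band synthesis is sketched below assuming that the converted voice X(t) is the sum of the N processed sub-band signals:

```python
import numpy as np

# Illustrative sketch of paragraph [0053]: full band synthesis as the sum
# of the N processed sub-band signals (assumed form of Expression (8)).
def synthesize_fullband(subband_signals):
    return np.sum(np.asarray(subband_signals), axis=0)
```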
[0054] The learning unit 17 calculates the feature value expressing the tone of the converted voice by multiplying the spectrum of the lower frequency band sub-band signal by the spectrum determined by performing Fourier transform on the shortened filter, updates the parameters of the conversion model and the lifter so as to minimize error between this feature value and the feature value expressing the tone of the target voice, and generates the learned conversion model and the learned lifter thereby. In the present embodiment, the conversion model 13a is constructed by a neural network. For example, the conversion model 13a may be constructed by a multi-layer perceptron (MLP) and a feedforward neural network, and may use a gated linear unit constituted of a Sigmoid function and a tanh function as the activation functions of a hidden layer, and apply batch normalization before each activation function.
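The gated linear unit mentioned above, combining a tanh branch with a sigmoid gate, can be sketched as follows; the weight matrices W_a and W_b are illustrative, and the batch normalization applied before each activation in the embodiment is omitted.

```python
import numpy as np

# Illustrative sketch of a gated linear unit (GLU): tanh branch
# multiplied elementwise by a sigmoid gate, as in paragraph [0054].
def glu(x, W_a, W_b):
    gate = 1.0 / (1.0 + np.exp(-(x @ W_b)))  # sigmoid gate
    return np.tanh(x @ W_a) * gate
```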
[0055] The learning unit 17 calculates the spectrum F.sub.t.sup.(l) generated by performing Fourier transform on the shortened filter, using the conversion model 13a and the lifter 13b of which parameters are not yet determined, calculates the spectrum F.sub.t.sup.(Y) by multiplying the spectrum F.sub.t.sup.(X) of the lower frequency band sub-band signal by the spectrum F.sub.t.sup.(l), and calculates the mel-frequency cepstrum C.sub.t.sup.(Y) as the feature value. Then the learning unit 17 calculates the error between the calculated cepstrum C.sub.t.sup.(Y) and the cepstrum C.sub.t.sup.(T) of the target voice, which is learning data, by L.sub.t=(C.sub.t.sup.(T)−C.sub.t.sup.(Y)).sup.T(C.sub.t.sup.(T)−C.sub.t.sup.(Y))/T. Hereafter the value of √L.sub.t is called the root mean squared error (RMSE).
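The error of paragraph [0055] can be sketched directly from its definition (vector dimensions are illustrative):

```python
import numpy as np

# Illustrative sketch of the loss in paragraph [0055]:
# L_t = (C_t^(T) - C_t^(Y))^T (C_t^(T) - C_t^(Y)) / T, with RMSE = sqrt(L_t).
def cepstral_loss(c_target, c_converted, T):
    d = c_target - c_converted
    L_t = (d @ d) / T
    return L_t, np.sqrt(L_t)
```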
[0056] The learning unit 17 performs partial differentiation of the error L.sub.t=(C.sub.t.sup.(T)−C.sub.t.sup.(Y)).sup.T(C.sub.t.sup.(T)−C.sub.t.sup.(Y))/T with respect to the parameters of the conversion model and the lifter, and updates the parameters of the conversion model and the lifter by the error back propagation method. The learning processing may be performed using adaptive moment estimation (Adam), for example. By generating the learned conversion model 13a and the learned lifter 13b in this manner, the influence of cutting the filter to generate the shortened filter is suppressed, and high quality voice conversion can be performed even with a shortened filter.
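As a toy illustration of the joint update, the sketch below performs a single gradient-descent step on the lifter alone for a simplified squared-error loss. The embodiment back-propagates through the whole filtering pipeline and may use Adam; this sketch only illustrates the direction of the parameter change and is not the patented procedure.

```python
import numpy as np

# Illustrative single gradient step on the lifter u for the simplified
# loss ||u * c_d - c_target||^2 (an assumption for illustration only).
def update_lifter(u, c_d, c_target, lr=0.1):
    grad = 2.0 * (u * c_d - c_target) * c_d  # d/du of the squared error
    return u - lr * grad
```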
[0057] With the voice conversion apparatus 10 according to the present embodiment, for the lower frequency sub-band signals generated by dividing a signal of a voice of a subject into a plurality of sub-band signals, the feature value is converted using the learned conversion model 13a, and the shortened filter is calculated using the learned lifter 13b. Therefore, even for wideband voice quality conversion, a drop in the modeling performance due to the random fluctuation in the higher frequency band can be prevented, and the effect of improving the quality of the converted voice by band expansion can be properly acquired. Further, an increase in the calculation volume caused by band expansion can be lessened by learning the lifter 13b only for the lower frequency sub-band signals. Therefore voice conversion based on the differential spectrum method, which is capable of implementing both high voice quality and real-time performance, can be performed in wideband voice quality conversion.
[0059] The CPU 10a is a control unit that performs control in executing programs stored in the RAM 10b or ROM 10c, and performs the arithmetic operation and processing of data. The CPU 10a is also an arithmetic unit that executes a program (the voice conversion program) which calculates a plurality of feature values related to the voice of the subject, converts the plurality of feature values into a plurality of converted feature values corresponding to the target voice, and generates a synthesized voice based on the plurality of converted feature values. The CPU 10a receives various data from the input unit 10e and the communication unit 10d, and displays the arithmetic operation result of the data on the display unit 10f, or stores the result in the RAM 10b.
[0060] The RAM 10b is a storage unit in which data can be overwritten, and may be a semiconductor storage element, for example. The RAM 10b may store programs executed by the CPU 10a and such data as voice of the subject and the target voice. These are examples, and data other than these data may be stored in RAM 10b, or part of these data may not be stored therein.
[0061] The ROM 10c is a storage device from which data can be read, and may be constituted of a semiconductor storage element, for example. The ROM 10c may store a voice conversion program and data that cannot be overwritten.
[0062] The communication unit 10d is an interface to connect the voice conversion apparatus 10 to another apparatus. The communication unit 10d may be connected to a communication network, such as the Internet.
[0063] The input unit 10e is for receiving data inputted by the user, and may include a keyboard and a touch panel, for example.
[0064] The display unit 10f is for visually displaying an operation result by the CPU 10a, and may be constructed by a liquid crystal display (LCD). The display unit 10f may display a waveform of the voice of the subject, or display a waveform of a synthesized voice.
[0065] The voice conversion program may be stored in and provided via a computer-readable storage medium, such as the RAM 10b and ROM 10c, or may be provided via a communication network connected to the communication unit 10d. In the voice conversion apparatus 10, the various operations described above are implemented by the CPU 10a executing the voice conversion program.
[0068] The generating unit 15 of the voice conversion apparatus 10 applies a shortened filter, calculated by the shortened filter calculating unit 14, to the spectrum of the lower frequency band sub-band signal (0 to 8 kHz), out of the three sub-band signals generated by the dividing unit 12, so as to generate a converted voice. The voice conversion apparatus 10, on the other hand, does not use the shortened filter for the two higher frequency sub-band signals (8 to 16 kHz, 16 to 24 kHz), and leaves these signals unconverted.
[0069] The synthesizing unit 16 of the voice conversion apparatus 10 resynthesizes the converted voice of the lower frequency band sub-band signal (0 to 8 kHz) and two unconverted higher frequency sub-band signals (8 to 16 kHz and 16 to 24 kHz), so as to generate a full band synthesized voice. The synthesizing unit 16 outputs the generated synthesized voice (sub-band decoding).
[0071] The voice conversion apparatus 10 multiplies the converted feature value C.sub.t.sup.(D) by a learned lifter 13b (u), and performs Fourier transform thereon, so as to calculate the complex spectral series F.sub.t.sup.(D) of the filter.
[0072] Then the voice conversion apparatus 10 determines the value f.sub.t.sup.(D) in the time domain by performing inverse Fourier transform on the complex spectral series F.sub.t.sup.(D) of the filter, applies a window function which cuts off (performs truncation) the value f.sub.t.sup.(D), being 1 before the time l and 0 after the time l, so as to determine f.sub.t.sup.(l), and performs Fourier transform on f.sub.t.sup.(l), whereby the complex spectral series F.sub.t.sup.(l) of the shortened filter is calculated.
[0073] The voice conversion apparatus 10 multiplies the spectrum F.sub.t.sup.(X) of the lower frequency band sub-band signal by the complex spectral series F.sub.t.sup.(l) of the shortened filter calculated like this, so as to calculate the spectrum F.sub.t.sup.(Y) of the converted voice. The voice conversion apparatus 10 generates the converted voice C.sub.t.sup.(Y) by performing inverse Fourier transform on the spectrum F.sub.t.sup.(Y) of the converted voice.
[0074] In the case of performing the learning processing of the conversion model 13a and the lifter 13b, the real cepstrum series C.sub.t.sup.(Y) is calculated from the spectrum F.sub.t.sup.(Y) of the converted voice, and the error from the cepstrum C.sub.t.sup.(T) of the target voice, which is learning data, is calculated by L.sub.t=(C.sub.t.sup.(T)−C.sub.t.sup.(Y)).sup.T(C.sub.t.sup.(T)−C.sub.t.sup.(Y))/T. Then the parameters of the conversion model 13a and the lifter 13b are updated by the error back propagation method.
[0081] In this way, it is evaluated that the synthesized voice generated by the voice conversion apparatus 10 according to the present embodiment sounds more natural than the synthesized voice generated by an apparatus according to the conventional method. The p value related to this evaluation is smaller than 10.sup.−10.
[0083] The voice conversion apparatus 10 divides the signal of the voice of the subject (full band signal) acquired in S101 into a plurality of sub-band signals (S102). Further, the voice conversion apparatus 10 initializes the index n of the sub-band to a predetermined value (e.g. 1).
[0084] The voice conversion apparatus 10 determines whether the sub-band signal of the sub-band #n (sub-band signal #n) is a lower frequency band sub-band signal or not (S103). If the sub-band signal #n is not a lower frequency band sub-band signal (if this signal is a higher frequency band sub-band signal) (S103: No), this operation advances to S109, skipping steps S104 to S108.
[0085] If the sub-band signal #n is a lower frequency band sub-band signal (S103: Yes), the voice conversion apparatus 10 performs Fourier transform on this sub-band signal #n, and calculates the mel-frequency cepstrum (feature value) (S104), then the feature value is converted using the learned conversion model 13a (S105).
[0086] Further, the voice conversion apparatus 10 multiplies the feature value after conversion by the learned lifter 13b to calculate the spectrum of the filter (S106), performs inverse Fourier transform on the spectrum of the filter, and applies a predetermined window function thereto, whereby the shortened filter is calculated (S107).
[0087] Then the voice conversion apparatus 10 multiplies the spectrum of the sub-band signal #n by the spectrum determined by performing Fourier transform on the shortened filter, and performs inverse Fourier transform thereon, so as to generate the converted voice of the sub-band signal #n (S108).
[0088] The voice conversion apparatus 10 increments the index n of the sub-band (S109), and determines whether the incremented n is larger than the total number N of the sub-bands (S110). If the incremented n is the total number N of the sub-bands or less (S110: No), this operation returns to S103.
[0089] If the n incremented in S109 is larger than the total number N of the sub-bands (S110: Yes), the voice conversion apparatus 10 generates a full band converted voice by synthesizing the N number of sub-band signals, and outputs the generated full band converted voice from the speaker 30 (S111).
[0090] In the case where the voice conversion processing is not ended (S112: No), the voice conversion apparatus 10 executes the processing steps S101 to S111 again. In the case where the voice conversion processing is ended (S112: Yes), on the other hand, the voice conversion apparatus 10 ends the processing.
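The per-sub-band branch of steps S103 to S110 can be sketched as follows; the conversion step itself is a placeholder here, and function names are illustrative.

```python
import numpy as np

# Illustrative sketch of the loop S103-S110: only the lower frequency band
# sub-band signals are converted; higher ones pass through unchanged.
def convert(x_n):
    return x_n  # placeholder for the differential-filter conversion (S104-S108)

def process_subbands(subbands, n_lower=1):
    out = []
    for n, x_n in enumerate(subbands, start=1):
        out.append(convert(x_n) if n <= n_lower else x_n)  # S103 branch
    return out
```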
[0092] The voice conversion apparatus 10 divides the signal of the voice of the subject (full band signal) acquired in S201 into a plurality of sub-band signals (S202). Further, the voice conversion apparatus 10 initializes the index n of the sub-band to a predetermined value (e.g. 1).
[0093] The voice conversion apparatus 10 determines whether the sub-band signal of the sub-band #n (sub-band signal #n) is a lower frequency band sub-band signal (S203). If the sub-band signal #n is not a lower frequency band sub-band signal (if this signal is a higher frequency band sub-band signal) (S203: No), this operation advances to S212, skipping steps S204 to S211.
[0094] If the sub-band signal #n is a lower frequency band sub-band signal (S203: Yes), the voice conversion apparatus 10 performs Fourier transform on this sub-band signal #n, calculates the mel-frequency cepstrum (feature value) (S204), and then converts the feature value using the conversion model 13a which is in the learning step (S205).
[0095] Further, the voice conversion apparatus 10 multiplies the feature value after conversion by the lifter 13b which is in the learning step, to calculate the spectrum of the filter (S206), performs inverse Fourier transform on the spectrum of the filter, and applies a predetermined window function thereto, whereby the shortened filter is calculated (S207).
[0096] Then the voice conversion apparatus 10 multiplies the spectrum of the sub-band signal #n by the spectrum determined by performing Fourier transform on the shortened filter to perform inverse Fourier transform, so as to generate the converted voice of the sub-band signal #n (S208).
[0097] Then the voice conversion apparatus 10 calculates the mel-frequency cepstrum (feature value) of the converted voice of the sub-band signal #n (S209), and calculates errors between the feature value of the synthesized voice and the feature value of the target voice (S210). Then the voice conversion apparatus 10 updates the parameters of the conversion model 13a and the lifter 13b by the error back propagation method (S211).
[0098] The voice conversion apparatus 10 increments the index n of the sub-band (S212), and determines whether the incremented n is larger than the total number N of the sub-bands (S213). If the incremented n is the total number N of the sub-bands or less (S213: No), this operation returns to S203. If the n incremented in S212 is larger than the total number N of the sub-bands (S213: Yes), the voice conversion apparatus 10 determines whether the learning end condition is satisfied (S214).
[0099] If the learning end condition is not satisfied (S214: No), the voice conversion apparatus 10 executes the processing steps S201 to S213 again. If the learning end condition is satisfied (S214: Yes), on the other hand, the voice conversion apparatus 10 ends the processing. The learning end condition may be “an error between the feature value of the synthesized voice and the feature value of the target voice is a predetermined value or less”, or “a number of epochs of the learning processing reached a predetermined number of times”, for example.
[0100] As described above, according to the voice conversion apparatus 10 of the present embodiment, only one or more lower frequency sub-band signals, out of the plurality of sub-band signals generated by dividing the full band signal of the voice of the subject, are converted, whereby the influence of random fluctuation in the higher frequency band can be reduced, and the calculation volume required for conversion can be reduced. Therefore, even in a wide band, voice conversion based on the differential spectrum method, which is capable of implementing both high voice quality and real-time performance, can be performed.
[0101] The above described embodiments are for assisting in understanding of the present invention, and are not intended to limit an interpretation of the present invention. Each composing element, disposition, material, condition, shape, size and the like of the embodiments are not limited to the described examples, but may be properly changed. Composing elements described in different embodiments may be partially replaced or combined with each other.
REFERENCE SIGNS LIST
[0102] 10 Voice conversion apparatus
[0103] 10a CPU
[0104] 10b RAM
[0105] 10c ROM
[0106] 10d Communication unit
[0107] 10e Input unit
[0108] 10f Display unit
[0109] 11 Acquiring unit
[0110] 12 Dividing unit
[0111] 13 Filter calculating unit
[0112] 13a Conversion model
[0113] 13b Lifter
[0114] 14 Shortened filter calculating unit
[0115] 15 Generating unit
[0116] 16 Synthesizing unit
[0117] 17 Learning unit
[0118] 20 Microphone
[0119] 30 Speaker