LEARNING APPARATUS, ESTIMATION APPARATUS, METHODS AND PROGRAMS FOR THE SAME
20230013385 · 2023-01-19
Assignee
Inventors
- Yuki KITAGISHI (Tokyo, JP)
- Takeshi Mori (Tokyo, JP)
- Hosana KAMIYAMA (Tokyo, JP)
- Atsushi Ando (Tokyo, JP)
- Satoshi KOBASHIKAWA (Tokyo, JP)
CPC classification
- G10L17/26 (PHYSICS)
- G10L17/02 (PHYSICS)
International classification
- G10L17/02 (PHYSICS)
Abstract
A learning apparatus includes: a speaker vector learning unit configured to learn a speaker vector extraction parameter λ based on one or more items of learning speech voice data in a speaker vector voice database; a non-speaker-individuality sound model learning unit configured to create a probability distribution model using a frequency component of one or more items of non-speaker-individuality sound data in a non-speaker-individuality sound database and calculate internal parameters μ and Σ of the probability distribution model; and an age level estimation model learning unit configured to extract a speaker vector from voice data in an age level estimation model-learning voice database using the speaker vector extraction parameter λ, calculate a non-speaker-individuality sound likelihood vector from voice data in the age level estimation model-learning voice database using the internal parameters μ and Σ, and learn, with input of the speaker vector and the non-speaker-individuality sound likelihood vector, a parameter Ω of an age level estimation model that outputs an estimated value of an age level of a corresponding speaker.
Claims
1. A learning apparatus comprising a processor configured to execute a method comprising: learning a speaker vector extraction parameter λ based on one or more items of learning speech voice data in a speaker vector voice database; creating a probability distribution model using a frequency component of one or more items of non-speaker-individuality sound data in a non-speaker-individuality sound database; calculating internal parameters μ and Σ of the probability distribution model; extracting a speaker vector from voice data in an age level estimation model-learning voice database using the speaker vector extraction parameter λ; calculating a non-speaker-individuality sound likelihood vector from voice data in the age level estimation model-learning voice database using the internal parameters μ and Σ; and learning, with input of a speaker vector and a non-speaker-individuality sound likelihood vector, a parameter Ω of an age level estimation model that outputs an estimated value of an age level of a corresponding speaker.
2. An estimation apparatus comprising a processor configured to execute a method comprising: extracting a speaker vector V(x(unk)) from speech data to be estimated using a speaker vector extraction parameter λ; calculating a non-speaker-individuality sound likelihood vector P(freq(x(unk))) from the speech data to be estimated, using internal parameters μ and Σ; determining posterior probability from the speaker vector V(x(unk)) and the non-speaker-individuality sound likelihood vector P(freq(x(unk))) using a parameter Ω, wherein a combination of the speaker vector extraction parameter λ, the internal parameters μ and Σ, and the parameter Ω, is based on a learnt age level estimation model; determining a dimension that maximizes the posterior probability; and using an age level corresponding to the dimension as an estimation result.
3. A computer implemented method for learning, comprising: learning a speaker vector extraction parameter λ based on one or more items of learning speech voice data in a speaker vector voice database; creating a probability distribution model using a frequency component of one or more items of non-speaker-individuality sound data in a non-speaker-individuality sound database; calculating internal parameters μ and Σ of the probability distribution model; extracting a speaker vector from voice data in an age level estimation model-learning voice database using the speaker vector extraction parameter λ; calculating a non-speaker-individuality sound likelihood vector from voice data in the age level estimation model-learning voice database using the internal parameters μ and Σ; and learning, with input of a speaker vector and a non-speaker-individuality sound likelihood vector, a parameter Ω of an age level estimation model that outputs an estimated value of an age level of a corresponding speaker.
4. The computer implemented method according to claim 3, further comprising: extracting a speaker vector V(x(unk)) from speech data to be estimated using the speaker vector extraction parameter λ; calculating a non-speaker-individuality sound likelihood vector P(freq(x(unk))) from the speech data to be estimated, using the internal parameters μ and Σ; determining posterior probability from the speaker vector V(x(unk)) and the non-speaker-individuality sound likelihood vector P(freq(x(unk))) using the parameter Ω; determining a dimension that maximizes the posterior probability; and using an age level corresponding to the dimension as an estimation result.
5. (canceled)
6. The learning apparatus according to claim 1, wherein the age level estimation model uses machine learning based at least on a neural network.
7. The learning apparatus according to claim 1, wherein the non-speaker-individuality sound data include data associated with a water sound produced in part based on an amount and viscosity of saliva in an oral cavity, an amount of saliva secretion, and a continuous speech duration.
8. The learning apparatus according to claim 1, wherein the age level estimation model estimates an age level of a speaker speaking a speech, and wherein the speech data includes data associated with the speech spoken by the speaker.
9. The estimation apparatus according to claim 2, wherein the age level estimation model uses machine learning based at least on a neural network.
10. The estimation apparatus according to claim 2, wherein the non-speaker-individuality sound data include data associated with a water sound produced in part based on an amount and viscosity of saliva in an oral cavity, an amount of saliva secretion, and a continuous speech duration.
11. The estimation apparatus according to claim 2, wherein the age level estimation model estimates an age level of a speaker speaking a speech, and wherein the speech data includes data associated with the speech spoken by the speaker.
12. The computer implemented method according to claim 3, wherein the age level estimation model uses machine learning based at least on a neural network.
13. The computer implemented method according to claim 3, wherein the non-speaker-individuality sound data include data associated with a water sound produced in part based on an amount and viscosity of saliva in an oral cavity, an amount of saliva secretion, and a continuous speech duration.
14. The computer implemented method according to claim 3, wherein the age level estimation model estimates an age level of a speaker speaking a speech, and wherein the speech data includes data associated with the speech spoken by the speaker.
15. The computer implemented method according to claim 4, wherein the age level estimation model uses machine learning based at least on a neural network.
16. The computer implemented method according to claim 4, wherein the non-speaker-individuality sound data include data associated with a water sound produced in part based on an amount and viscosity of saliva in an oral cavity, an amount of saliva secretion, and a continuous speech duration.
17. The computer implemented method according to claim 4, wherein the age level estimation model estimates an age level of a speaker speaking a speech, and wherein the speech data includes data associated with the speech spoken by the speaker.
Description
BRIEF DESCRIPTION OF DRAWINGS
DESCRIPTION OF EMBODIMENTS
[0022] An embodiment of the present invention will be described below. Note that in the drawings used in the following description, components having the same functions or steps that perform the same processes are denoted by the same reference numerals as the corresponding components or processes, and redundant description thereof will be omitted. In the following description, processes performed for each individual element of a vector or a matrix are applied to all the elements of the vector or the matrix unless otherwise noted.
Point of First Embodiment
[0023] The point of the first embodiment is to estimate speakers' age levels more accurately by capturing non-speaker-individuality sounds that occur characteristically in certain age groups during speech and that conventional age level estimation techniques, which are based on speaker vectors, cannot fully capture, and by using those non-speaker-individuality sounds jointly with speaker vectors.
First Embodiment
[0025] The estimation system includes a learning apparatus 100 and an estimation apparatus 200.
[0027] The learning apparatus 100 includes a database storage unit 110, a speaker vector learning unit 120, a non-speaker-individuality sound model learning unit 130, and an age level estimation model learning unit 140.
[0028] The learning apparatus 100 accepts input of speech voice data x(i) and x(k) for learning and non-speaker-individuality sound data z(j) for learning and stores the data in the database storage unit 110 prior to learning. Using information from the database storage unit 110, the learning apparatus 100 learns a speaker vector extraction parameter λ, internal parameters μ and Σ of a probability distribution model, and a parameter Ω of an age level estimation model and outputs the learned parameters λ, μ, Σ, and Ω.
[0030] The estimation apparatus 200 includes a speaker vector extraction unit 210, a non-speaker-individuality sound frequency vector estimation unit 220, and an age level estimation unit 230.
[0031] Prior to age level estimation, the estimation apparatus 200 receives the parameters λ, μ, Σ, and Ω learned in advance.
[0032] The estimation apparatus 200 accepts input of speech voice data x(unk) to be estimated, estimates the age level of the speaker of the speech voice data x(unk), and outputs an estimation result age(x(unk)).
[0033] The learning apparatus 100 and the estimation apparatus 200 are, for example, special apparatuses each made up of a known or special-purpose computer equipped with a central processing unit (CPU) and a main storage device (RAM: Random Access Memory) and loaded with a special program. The learning apparatus 100 and the estimation apparatus 200 execute respective processes, for example, under the control of the central processing unit. Data input to the learning apparatus 100 and the estimation apparatus 200 as well as data obtained by the respective processes are, for example, stored in the main storage device, read into the central processing unit from the main storage device as required, and used for other processes. The processing units of the learning apparatus 100 and the estimation apparatus 200 may be at least partly made up of hardware such as integrated circuits. The storage units of the learning apparatus 100 and the estimation apparatus 200 can each be made up, for example, of a main storage device such as a Random Access Memory (RAM) or middleware such as a relational database or a key-value store. However, the storage units do not necessarily have to be provided inside the learning apparatus 100 or the estimation apparatus 200; each storage unit may be provided externally to the learning apparatus 100 or the estimation apparatus 200 by being made up of an auxiliary storage device such as a hard disk, an optical disk, or a semiconductor memory element such as a flash memory.
[0034] First, processes of components of the learning apparatus 100 will be described.
Database Storage Unit 110
[0035] The database storage unit 110 stores a speaker vector voice database containing the speech voice data x(i) for learning, a non-speaker-individuality sound database containing the non-speaker-individuality sound data z(j) for learning, and an age level estimation model learning database containing the speech voice data x(k) and speaker age data age(k) for learning. Hereinafter databases will be referred to as DBs.
Speaker Vector Voice DB
Non-Speaker-Individuality Sound DB
Age Level Estimation Model Learning DB
[0039] Because one speaker can produce a plurality of speeches, a plurality of items of speech voice data can be associated with an identical speaker number in the DB. For example, the bit rate of each item of voice data is similar to the bit rate of data in the speaker vector voice DB.
Speaker Vector Learning Unit 120
[0040] The speaker vector learning unit 120 fetches all the learning speech voice data x(i) from the speaker vector voice DB, learns the speaker vector extraction parameter λ based on the fetched learning speech voice data x(i) (i=0, 1, . . . , L) (S120), and outputs the learned speaker vector extraction parameter λ.
[0041] For example, the speaker vector learning unit 120 calculates a feature value for use to find a speaker vector, from the learning speech voice data x(i) and learns the speaker vector extraction parameter λ using the feature value. Note that the speaker vector extraction parameter λ is a parameter used to extract the speaker vector from the feature value calculated from the speech voice data.
[0042] For example, known techniques are used as the feature value and the extraction technique for extracting the speaker vector. For example, an i-vector, an x-vector, or the like is used as the feature value.
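As a rough illustration of this extraction step, the sketch below derives a fixed-length embedding by averaging per-frame log-power spectra. This is only a hedged stand-in for a trained i-vector or x-vector extractor, which is a far more involved learned model; the frame length, FFT size, and embedding dimensionality here are illustrative assumptions, not values from this document.

```python
import numpy as np

def frame_signal(x, frame_len, shift):
    """Split a 1-D waveform into overlapping frames."""
    n_frames = 1 + (len(x) - frame_len) // shift
    return np.stack([x[i * shift: i * shift + frame_len] for i in range(n_frames)])

def toy_speaker_vector(x, frame_len=160, shift=80, dim=64):
    """Average log-power spectra over frames into a fixed-length embedding;
    a crude stand-in for a trained i-vector/x-vector extractor."""
    frames = frame_signal(x, frame_len, shift)
    spec = np.abs(np.fft.rfft(frames * np.hanning(frame_len), n=2 * dim - 2)) ** 2
    return np.log(spec + 1e-10).mean(axis=0)  # shape (dim,)

rng = np.random.default_rng(0)
v = toy_speaker_vector(rng.standard_normal(16000))  # 1 s of audio at 16 kHz
print(v.shape)
```

A real extractor would additionally be trained on the speaker vector voice DB so that the embedding discriminates between speakers; the averaging here only fixes the output shape.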
Non-Speaker-Individuality Sound Model Learning Unit 130
[0043] The non-speaker-individuality sound model learning unit 130 fetches all the non-speaker-individuality sound data z(j) from the non-speaker-individuality sound DB, creates a probability distribution model using frequency components of the fetched non-speaker-individuality sound data z(j), calculates internal parameters μ and Σ of the probability distribution model (S130), and outputs the internal parameters μ and Σ.
[0044] For example, first the non-speaker-individuality sound model learning unit 130 calculates the frequency components from the non-speaker-individuality sound data z(j). To calculate a spectrogram, the non-speaker-individuality sound model learning unit 130 applies, for example, band-pass filtering in a range of 200 Hz to 3.7 kHz to each item of non-speaker-individuality sound data z(j), and then calculates the frequency components. For example, the frequency components are 512-dimensional and range from 200 Hz to 3.7 kHz. The non-speaker-individuality sound model learning unit 130 calculates frequency components freq(z(j)).sub.t with a frame length of 10 ms and a shift width of 5 ms from the non-speaker-individuality sound data z(j), where t is a frame number.
[0045] Next, the non-speaker-individuality sound model learning unit 130 creates a probability distribution model using the frequency components freq(z(j)).sub.t of all the frames calculated from respective items of the non-speaker-individuality sound data z(j). For example, if a Gaussian Mixture Model (GMM) is used, parameters μ and Σ of a 512-dimensional probability distribution model capable of calculating non-speaker-individuality sound likelihood p(freq(z(j)).sub.t) such as shown below are found.
[0046] The parameters μ and Σ can be found from all the frequency components freq(z(j)).sub.t using the following expression.
[0047] N is the sum total of all the frames of the non-speaker-individuality sound data used for learning. Regarding the non-speaker-individuality sound data z(j), a concatenation of all the frames of non-speaker-individuality sound likelihood p(freq(z(j)).sub.t) results in a non-speaker-individuality sound likelihood vector P(freq(z(j))).
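The frame-wise likelihood computation above can be sketched as follows, under stated assumptions: a 16 kHz sampling rate (not given in the text), a single diagonal-covariance Gaussian in place of a full GMM, and a feature dimensionality determined by the FFT size rather than the 512 dimensions mentioned above.

```python
import numpy as np

FS = 16_000                      # assumed sampling rate (not specified in the text)
FRAME = int(0.010 * FS)          # 10 ms frame length -> 160 samples
SHIFT = int(0.005 * FS)          # 5 ms shift -> 80 samples
NFFT = 1024                      # FFT size; the band below keeps ~224 bins, not 512

def freq_components(x):
    """Per-frame log-power spectrum restricted to the 200 Hz-3.7 kHz band."""
    n = 1 + (len(x) - FRAME) // SHIFT
    frames = np.stack([x[i * SHIFT: i * SHIFT + FRAME] for i in range(n)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(FRAME), n=NFFT)) ** 2
    freqs = np.fft.rfftfreq(NFFT, d=1 / FS)
    band = (freqs >= 200) & (freqs <= 3700)
    return np.log(spec[:, band] + 1e-10)

def fit_gaussian(feats):
    """Maximum-likelihood mean and (diagonal) variance over all N frames."""
    return feats.mean(axis=0), feats.var(axis=0) + 1e-6

def loglik_vector(x, mu, var):
    """Concatenation of per-frame Gaussian log-likelihoods, i.e. P(freq(x))."""
    f = freq_components(x)
    return -0.5 * (np.log(2 * np.pi * var) + (f - mu) ** 2 / var).sum(axis=1)

rng = np.random.default_rng(0)
z = rng.standard_normal(2 * FS)   # stand-in for non-speaker-individuality sound data
feats = freq_components(z)
mu, var = fit_gaussian(feats)
P = loglik_vector(z, mu, var)     # one log-likelihood per frame
print(feats.shape, P.shape)
```

With a true GMM, `fit_gaussian` would be replaced by EM over mixture weights, means, and covariances, but the concatenation of per-frame likelihoods into the vector P(freq(z(j))) is the same.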
Age Level Estimation Model Learning Unit 140
[0048] The age level estimation model learning unit 140 fetches all the speech voice data x(k) for learning and speaker age data age(k) from the age level estimation model learning DB. Besides, the age level estimation model learning unit 140 receives the learned speaker vector extraction parameter λ and the internal parameters μ and Σ.
[0049] The age level estimation model learning unit 140 extracts speaker vectors V(x(k)) from the speech voice data x(k) for learning using the learned speaker vector extraction parameter λ.
[0050] The age level estimation model learning unit 140 calculates non-speaker-individuality sound likelihood vectors P(freq(x(k))) from the speech voice data x(k) for learning using the learned internal parameters μ and Σ.
[0051] Using the speaker vectors V(x(k)), the non-speaker-individuality sound likelihood vectors P(freq(x(k))), and the corresponding speaker age data age(k), the age level estimation model learning unit 140 learns the parameter Ω of the age level estimation model (S140), and outputs the learned parameter Ω. Note that the age level estimation model accepts input of a speaker vector and a non-speaker-individuality sound likelihood vector and outputs an estimated value of the age level of the corresponding speaker.
[0052] Learning of the age level estimation model uses machine learning based on neural networks, SVMs, or the like. As an input feature, a one-dimensional feature vector FEAT(x(k)) resulting from combining the speaker vector V(x(k)) and the non-speaker-individuality sound likelihood vector P(freq(x(k))) is used. Using the speaker's age level data age(k) as the target (output value) for FEAT(x(k)), the age level estimation model learning unit 140 learns the parameter Ω of the age level estimation model, updating the parameter Ω repeatedly in such a way as to minimize estimation errors. For example, a classification problem of classifying speakers' age levels into four classes C[C.sub.1=Child, C.sub.2=Young, C.sub.3=Adult, and C.sub.4=Senior] is set up. As a classifier for this problem, for example, a neural network that accepts input of the feature vectors FEAT(x(k)) and outputs posterior probabilities p(C.sub.i|age(k)) of the respective classes is suitable. When the model is a neural network, a typical neural-network learning method (error back-propagation) is used to update the weights.
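A minimal sketch of this four-class setup is given below, substituting a linear-softmax classifier for the neural network described above; the class names follow the text, while the feature dimensionality and the training data are synthetic assumptions made purely for illustration.

```python
import numpy as np

CLASSES = ["Child", "Young", "Adult", "Senior"]   # C_1..C_4 from the text

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_age_model(feat, labels, lr=0.5, epochs=300):
    """Learn Omega for a linear-softmax age-level classifier
    (a minimal stand-in for the neural network in the text)."""
    X = np.hstack([feat, np.ones((len(feat), 1))])  # append a bias column
    Y = np.eye(len(CLASSES))[labels]                # one-hot targets
    omega = np.zeros((X.shape[1], len(CLASSES)))
    for _ in range(epochs):
        p = softmax(X @ omega)
        omega -= lr * X.T @ (p - Y) / len(X)        # cross-entropy gradient step
    return omega

def estimate_age(feat_row, omega):
    """Argmax over class posteriors for a single FEAT vector."""
    x = np.append(feat_row, 1.0)[None, :]
    return CLASSES[int(np.argmax(softmax(x @ omega)))]

# Toy check: two well-separated clusters standing in for FEAT(x(k)) vectors.
rng = np.random.default_rng(0)
feat = np.vstack([rng.normal(0.0, 0.1, (20, 8)), rng.normal(3.0, 0.1, (20, 8))])
labels = np.array([0] * 20 + [3] * 20)              # Child vs. Senior examples
omega = train_age_model(feat, labels)
print(estimate_age(feat[0], omega), estimate_age(feat[-1], omega))
```

The repeated gradient step plays the role of the error back-propagation updates mentioned above; a neural network would insert hidden layers between the input FEAT vector and the softmax output.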
[0053] Next, processes of components of the estimation apparatus 200 will be described using
Speaker Vector Extraction Unit 210
[0054] Prior to an age level estimation process, the speaker vector extraction unit 210 receives a learned speaker vector extraction parameter λ.
[0055] The speaker vector extraction unit 210 accepts input of speech data x(unk) to be estimated, extracts a speaker vector V(x(unk)) from the speech data x(unk) by a technique similar to that of the age level estimation model learning unit 140 using the learned speaker vector extraction parameter λ (S210), and outputs the extracted speaker vector V(x(unk)). Note that x(unk) is data not used in the learning process and if the learning process is assumed to be a development process, the data x(unk) is the data given in an actual use scene.
Non-Speaker-Individuality Sound Frequency Vector Estimation Unit 220
[0056] Prior to the age level estimation process, the non-speaker-individuality sound frequency vector estimation unit 220 receives the learned internal parameters μ and Σ.
[0057] The non-speaker-individuality sound frequency vector estimation unit 220 accepts input of speech data x(unk) to be estimated, calculates a non-speaker-individuality sound likelihood vector P(freq(x(unk))) from the speech data x(unk) to be estimated, using the internal parameters μ and Σ of the probability distribution model by a technique similar to that of the age level estimation model learning unit 140 (S220), and outputs the calculated non-speaker-individuality sound likelihood vector P(freq(x(unk))).
Age Level Estimation Unit 230
[0058] The age level estimation unit 230 combines the speaker vector V(x(unk)) and the non-speaker-individuality sound likelihood vector P(freq(x(unk))) into a one-dimensional feature vector FEAT(x(unk)) and finds a posterior probability using a learned parameter Ω. For example, if a classification problem of classifying age levels into four classes is set up, the posterior probability is formulated as follows.
p(C.sub.i|age(x(unk)))=FEAT(x(unk))Ω [Math. 3]
[0059] Next, as indicated by the following expression, the age level estimation unit 230 finds a dimension that maximizes the posterior probability p(C.sub.i|age(x(unk))) and outputs the age level corresponding to the dimension as an estimation result age(x(unk)) (S230).
age(x(unk))=argmax(p(C.sub.i|age(x(unk)))) [Math. 4]
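The decision rule of [Math. 4] reduces to an argmax over the posterior vector. A minimal sketch, with the four classes taken from the text and the posterior values invented for illustration:

```python
import numpy as np

CLASSES = ["Child", "Young", "Adult", "Senior"]  # C_1..C_4 from the text

def decide_age_level(posterior):
    """Return the age level whose posterior probability is largest (Math. 4)."""
    return CLASSES[int(np.argmax(posterior))]

print(decide_age_level(np.array([0.1, 0.2, 0.6, 0.1])))  # -> Adult
```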
Effect
[0060] The above configuration makes it possible to estimate speaker ages with higher accuracy than conventional age level estimation techniques that are based solely on speaker vectors.
Other Variations
[0061] The present invention is not limited to the above embodiment and variation. For example, the various processes described above may be performed not only in time series in the order described above, but also in parallel or separately, as required or depending on the processing power of the apparatus that performs the processes. Besides, various changes may be made as required without departing from the gist of the present invention.
Program and Recording Medium
[0062] The various processes described above can be implemented by loading a program that executes the steps of the method described above into a recording unit 2020 of a computer shown in
[0063] The program describing process details can be recorded on a computer-readable recording medium. Examples of the computer-readable recording medium include a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory.
[0064] The program can be distributed, for example, by selling, assigning, or lending a portable recording medium such as a DVD or a CD-ROM on which the program has been recorded. Furthermore, the program can be distributed by storing the program in a storage device of a server computer and transferring the program from the server computer to other computers through a network.
[0065] A computer that executes the program first stores the program in its own storage device, for example, by acquiring the program recorded on a portable recording medium or transferred from a server computer. Then, in performing a process, the computer reads the program out of its own storage device and performs the process according to the read program. As another execution mode of the program, the computer may read the program directly from a portable recording medium and perform a process according to the program, or each time a program is transferred to the computer from a server computer, the computer may perform a process sequentially according to the received program. Alternatively, the process may be performed by a so-called Application Service Provider (ASP) service, whereby a server computer transfers no program to the computer and achieves processing functions solely via program execution instructions and result acquisition. Note that the programs according to the present mode include information that is equivalent to a program and is used for processing by an electronic computer (e.g., data that is not provided as direct instructions to the computer but that prescribes processing of the computer).
[0066] Although, according to the present mode, the present apparatus is implemented through execution of a predetermined program on a computer, at least part of the process details may be implemented in hardware.