SPEECH EMBEDDING APPARATUS, AND METHOD
20230109177 · 2023-04-06
Assignee
Inventors
Cpc classification
G10L17/02
PHYSICS
International classification
Abstract
A frame processor 81 calculates, from a first sequence of feature vectors, a second sequence of frame-level feature vectors. A posterior estimator 82 calculates posterior probabilities for each vector included in the second sequence to a cluster. A statistics calculator 83 calculates a sufficient statistic used for extracting an i-vector by using the second sequence, the posterior probabilities, a mean vector of each cluster calculated at the time of learning of the frame processor 81 and the posterior estimator 82, and a global covariance matrix calculated based on the mean vector.
Claims
1. A speech embedding apparatus comprising: a frame processor which calculates, from a first sequence of feature vectors, a second sequence of frame-level feature vectors; a posterior estimator which calculates posterior probabilities for each vector included in the second sequence to a cluster; and a statistics calculator which calculates a sufficient statistic used for extracting an i-vector by using the second sequence, the posterior probabilities, a mean vector of each cluster calculated at the time of learning of the frame processor and the posterior estimator, and a global covariance matrix calculated based on the mean vector.
2. The speech embedding apparatus according to claim 1, wherein, the frame processor calculates the second sequence by implementing a neural network including multiple layers learnt in advance.
3. The speech embedding apparatus according to claim 2, wherein, the neural network includes time-delay neural network layers, convolutional neural network layers, recurrent neural network layers, their variants or their combination.
4. The speech embedding apparatus according to any one of claims 1 to 3, wherein, the time resolution of the second sequence is the same as the time resolution of the first sequence or larger.
5. The speech embedding apparatus according to any one of claims 1 to 4, wherein, the posterior estimator calculates the posterior probabilities using the values calculated from fully connected layers of a neural network learnt in advance.
6. The speech embedding apparatus according to any one of claims 1 to 5, wherein, the statistics calculator calculates a zero-order statistic and a first-order statistic as the sufficient statistic.
7. The speech embedding apparatus according to any one of claims 1 to 6, further comprising an i-vector extractor which extracts an i-vector using the calculated sufficient statistic.
8. A speech embedding method comprising: calculating, from a first sequence of feature vectors, a second sequence of frame-level feature vectors; calculating posterior probabilities for each vector included in the second sequence to a cluster; and calculating a sufficient statistic used for extracting an i-vector by using the second sequence, the posterior probabilities, a calculated mean vector of each cluster, and a global covariance matrix calculated based on the mean vector.
9. The speech embedding method according to claim 8, wherein, the second sequence is calculated by implementing a neural network including multiple layers learnt in advance.
10. A non-transitory computer readable recording medium storing a speech embedding program, when executed by a processor, that performs a method for: calculating, from a first sequence of feature vectors, a second sequence of frame-level feature vectors; calculating posterior probabilities for each vector included in the second sequence to a cluster; and calculating a sufficient statistic used for extracting an i-vector by using the second sequence, the posterior probabilities, a calculated mean vector of each cluster, and a global covariance matrix calculated based on the mean vector.
11. The non-transitory computer readable recording medium according to claim 10, wherein, the second sequence is calculated by implementing a neural network including multiple layers learnt in advance.
Description
BRIEF DESCRIPTION OF DRAWINGS
[0033]
[0034] It depicts an exemplary block diagram illustrating the structure of an exemplary embodiment of a speech embedding apparatus according to the present invention.
[0035]
[0036] It depicts an explanatory diagram illustrating an example of a process of extracting an i-vector.
[0037]
[0038] It depicts a flowchart illustrating the process of the exemplary embodiment of the speech embedding apparatus according to the present invention.
[0039]
[0040] It depicts a block diagram illustrating an outline of the speech embedding apparatus according to the present invention.
[0041]
[0042] It depicts a schematic block diagram illustrating a configuration of a computer according to at least one of the exemplary embodiments.
[0043]
[0044] It depicts an explanatory example illustrating a general extraction example of the i-vector.
DESCRIPTION OF EMBODIMENTS
[0045] The following describes an exemplary embodiment of the present invention with reference to drawings.
[0046]
[0047] The frame processor 10 receives a sequence of feature vectors o.sub.t={o.sub.1, o.sub.2, . . . , o.sub.[tau]} as shown in
[0048] Then, the frame processor 10 calculates a sequence of frame-level feature vectors x.sub.t={x.sub.1, x.sub.2, . . . , x.sub.[kappa]} from the received sequence of feature vectors o.sub.t. In the following description, the received feature vector sequence o.sub.t is referred to as a first sequence, and the calculated frame-level feature vector sequence x.sub.t is referred to as a second sequence.
[0049] The frame processor 10 may calculate the second sequence (that is, the sequence of frame-level feature vectors) x.sub.t by implementing, for example, a neural network including multiple layers learnt in advance. The learning method of the frame processor 10 will be described later. When the neural network implemented by the frame processor 10 is described as f.sub.NeuralNet, the second sequence x.sub.t is calculated, for example, by Equation 6 described below.
[Math. 4]
x_t = f_{\mathrm{NeuralNet}}(o_t)   (Equation 6)
[0050] The form of the neural network implemented by the frame processor 10 is arbitrary. The neural network may consist of TDNN layers, convolutional neural network (CNN) layers, recurrent neural network (RNN) layers, their variants, or their combination.
[0051] In the present exemplary embodiment, the time resolution of the second sequence may be the same as the time resolution of the first sequence or larger; that is, the second sequence contains at most as many frames as the first sequence ([kappa]<=[tau]).
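As an illustration of the relation [kappa]<=[tau], the following is a minimal sketch of a frame processor, assuming a single time-delay layer with a ReLU activation and no padding. The function name frame_processor, the parameters context and stride, and the random weights are hypothetical stand-ins for the trained network f.sub.NeuralNet; they are not taken from the embodiment.

```python
import numpy as np

def frame_processor(o, weights, bias, context=2, stride=1):
    """Toy stand-in for f_NeuralNet: one time-delay layer followed by ReLU.

    o:       (tau, d_in) first sequence of feature vectors
    weights: ((2 * context + 1) * d_in, d_out) hypothetical layer weights
    bias:    (d_out,)
    Returns the second sequence x with shape (kappa, d_out), kappa <= tau.
    """
    tau, d_in = o.shape
    width = 2 * context + 1
    # Only windows that fit entirely inside the sequence are used (no padding),
    # so the output is shorter than the input whenever context > 0.
    starts = range(0, tau - width + 1, stride)
    windows = np.stack([o[s:s + width].reshape(-1) for s in starts])
    return np.maximum(windows @ weights + bias, 0.0)

rng = np.random.default_rng(0)
tau, d_in, d_out = 20, 4, 8
o = rng.standard_normal((tau, d_in))            # first sequence
W = rng.standard_normal((5 * d_in, d_out)) * 0.1
b = np.zeros(d_out)
x = frame_processor(o, W, b, context=2, stride=1)
kappa = x.shape[0]                              # length of the second sequence
```

With context=0 and stride=1 the two sequences have the same length; a wider context or a larger stride shortens the second sequence.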
[0052] The posterior estimator 20 calculates posterior probabilities for each element x.sub.t included in the second sequence x.sub.[kappa] to a cluster. The cluster is generated when the frame processor 10 and the posterior estimator 20 are learnt. Hereinafter, the number of clusters is denoted as C, and the posterior probability of the element x.sub.t with respect to the cluster c is denoted as [gamma].sub.c,t.
[0053] The posterior estimator 20 may calculate the posterior probabilities by implementing, for example, a neural network learnt in advance. The learning method of the posterior estimator 20 will be described later. When the neural network implemented by the posterior estimator 20 is described as g.sub.NeuralNet, the posterior probabilities are calculated, for example, by Equation 7 described below. In Equation 7, {v.sub.c, b.sub.c}.sub.c=1.sup.C is a fully connected layer implementation of an affine transformation.
[0054] As described above, the posterior estimator 20 may calculate the posterior probabilities [gamma].sub.c,t for the c-th cluster of the feature vector (sequence of the feature vector) x.sub.t using the values calculated from the fully connected layers of the neural network learnt in advance.
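Equation 7 is not reproduced in this text. The following is a minimal sketch of one way the posterior probabilities could be computed, assuming a softmax over the outputs of the affine transformation {v.sub.c, b.sub.c}, in the style of NetVLAD soft assignment. The softmax form, the function name posterior_estimator, and the use of tanh as a stand-in for g.sub.NeuralNet are assumptions.

```python
import numpy as np

def posterior_estimator(x, V, b, g=np.tanh):
    """Assumed softmax posterior over C clusters.

    x: (kappa, D) second sequence
    V: (C, D) rows are the vectors v_c of the fully connected layer
    b: (C,) biases b_c
    g: stand-in for g_NeuralNet (hypothetical; tanh chosen for illustration)
    Returns gamma with shape (kappa, C); each row sums to 1.
    """
    logits = g(x) @ V.T + b
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
kappa, D, C = 16, 8, 4
x = rng.standard_normal((kappa, D))
V = rng.standard_normal((C, D))
b = np.zeros(C)
gamma = posterior_estimator(x, V, b)
```

Passing g=lambda z: z recovers the identity mapping that paragraph [0060] attributes to the NetVLAD framework.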
[0055] The storage unit 30 stores a set {[mu].sub.c}.sub.c=1.sup.C of the averages [mu].sub.c of the clusters c, and a global covariance matrix [Sigma] calculated based on the average [mu].sub.c of each cluster c. Here, the average [mu].sub.c is the mean vector of the cluster c, that is, the centroid of the c-th cluster. The global covariance matrix [Sigma] is a single covariance matrix shared by all clusters. The mean vector of each cluster is calculated at the time of learning of the frame processor 10 and the posterior estimator 20.
[0056] In the following description, the information comprising the set {[mu].sub.c}.sub.c=1.sup.C of the averages [mu].sub.c of the clusters c and the global covariance matrix [Sigma] may be described as a Dictionary (corresponding to Dictionary 31 in
[0057] Here, a method of learning the frame processor 10, the posterior estimator 20, and the Dictionary (that is, {[mu].sub.c}.sub.c=1.sup.C and [Sigma]) stored in the storage unit 30 according to the present exemplary embodiment will be described. The frame processor 10, the posterior estimator 20, and the Dictionary are trained jointly to maximize speaker discrimination in advance.
[0058] The frame processor 10 and the posterior estimator 20 are implemented by a neural network or the like, and the Dictionary learnt together with them is used for a sufficient statistic calculation process described later. Therefore, a configuration including the frame processor 10, the posterior estimator 20, and the Dictionary 31 may be referred to as a deep-structured front-end (corresponding to Deep-structured front-end 200 in
[0059] The learning method of the deep-structured front-end is not particularly limited. For example, the frame processor 10, the posterior estimator 20, and the Dictionary may be trained jointly as in the NetVLAD framework disclosed in Non Patent Literature 4. In particular, the frame processor 10, the posterior estimator 20, and the Dictionary may be trained to minimize classification loss following the step as disclosed in Non Patent Literature 4.
[0060] Note that the posterior estimator 20 of the present exemplary embodiment uses the neural network g.sub.NeuralNet(x.sub.t), while the NetVLAD framework disclosed in Non Patent Literature 4 uses the identity function (g.sub.NeuralNet(x.sub.t)=x.sub.t). Furthermore, in the NetVLAD framework disclosed in Non Patent Literature 4, a covariance matrix is not used, but in the present exemplary embodiment, the Dictionary includes the mean vectors and a global covariance matrix.
[0061] The empirical estimate of the global covariance matrix is calculated from the second sequences x.sub.[kappa]. Here, it is assumed that all sequences have the same length [kappa] and there are N sequences in the training set. In this case, the covariance matrix [Sigma] may be calculated, for example, by Equation 8 described below.
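Equation 8 is not reproduced in this text. The following is a minimal sketch of one plausible empirical estimate, assuming the global covariance is the posterior-weighted scatter of the frames around the cluster centroids, averaged over all N sequences and all [kappa] frames. This specific form and the function name global_covariance are assumptions, not the formula of the embodiment.

```python
import numpy as np

def global_covariance(X, G, mu):
    """Assumed estimate: posterior-weighted scatter around the centroids.

    X:  (N, kappa, D) second sequences of the training set
    G:  (N, kappa, C) posterior probabilities gamma_{c,t} per sequence
    mu: (C, D) mean vectors of the clusters
    Returns a (D, D) covariance matrix shared by all clusters.
    """
    N, kappa, _ = X.shape
    diff = X[:, :, None, :] - mu[None, None, :, :]        # (N, kappa, C, D)
    # Sum gamma * (x - mu_c)(x - mu_c)^T over sequences, frames, and clusters.
    scatter = np.einsum('ntc,ntcd,ntce->de', G, diff, diff)
    return scatter / (N * kappa)

rng = np.random.default_rng(0)
N_seq, kappa, C, D = 5, 30, 4, 3
X = rng.standard_normal((N_seq, kappa, D))
G = rng.dirichlet(np.ones(C), size=(N_seq, kappa))        # rows sum to 1
mu = rng.standard_normal((C, D))
Sigma = global_covariance(X, G, mu)
```

Because the posteriors of each frame sum to one, dividing by N times [kappa] makes the result a weighted average of outer products.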
[0062] The statistics calculator 40 uses the second sequence x.sub.[kappa], the posterior probability [gamma].sub.c,t, the mean vector [mu].sub.c of each cluster, and the global covariance matrix [Sigma] to calculate a sufficient statistic used for extracting an i-vector. Specifically, the statistics calculator 40 calculates the zero-order statistic and the first-order statistic as the sufficient statistic. The statistics calculator 40 may calculate the zero-order statistic and the first-order statistic, for example, by Equations 9 and 10 described below.
[Math. 7]
N_c = \sum_{t=1}^{\kappa} \gamma_{c,t}   (Equation 9)
F_c = \Sigma^{-1/2} \left[ \sum_{t=1}^{\kappa} \gamma_{c,t} (x_t - \mu_c) \right]   (Equation 10)
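The zero-order and first-order statistics can be sketched as follows. This is an illustrative sketch that assumes both sums run over the [kappa] frames of the second sequence and that the whitening uses the shared global covariance matrix [Sigma]. The function names and the random example data are hypothetical.

```python
import numpy as np

def inv_sqrt_psd(Sigma):
    """Symmetric inverse square root of a positive definite matrix."""
    w, U = np.linalg.eigh(Sigma)
    return (U / np.sqrt(w)) @ U.T

def sufficient_statistics(x, gamma, mu, Sigma):
    """Zero-order (Equation 9) and first-order (Equation 10) statistics.

    x: (kappa, D), gamma: (kappa, C), mu: (C, D), Sigma: (D, D)
    Returns N with shape (C,) and F with shape (C, D).
    """
    N = gamma.sum(axis=0)
    # sum_t gamma_{c,t} (x_t - mu_c) = (gamma^T x)_c - N_c * mu_c
    centered = gamma.T @ x - N[:, None] * mu
    # Sigma^{-1/2} is symmetric, so right-multiplying the rows is equivalent
    # to left-multiplying each column vector.
    F = centered @ inv_sqrt_psd(Sigma)
    return N, F

rng = np.random.default_rng(0)
kappa, C, D = 50, 4, 3
x = rng.standard_normal((kappa, D))
gamma = rng.dirichlet(np.ones(C), size=kappa)   # rows sum to 1
mu = rng.standard_normal((C, D))
A = rng.standard_normal((D, D))
Sigma = A @ A.T + np.eye(D)                     # positive definite example
N, F = sufficient_statistics(x, gamma, mu, Sigma)
```

Rewriting the centered sum as a matrix product avoids an explicit loop over frames and clusters.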
[0063] The i-vector extractor 50 extracts the i-vector based on the calculated sufficient statistics. Specifically, the i-vector extractor 50 extracts the i-vector using the total variability matrix {T.sub.c}.sub.c=1.sup.C of the c-th cluster as a parameter. For example, the i-vector extractor 50 may extract the i-vector using the zero-order statistic and the first-order statistic according to Equations 11 and 12 shown below.
[Math. 8]
\phi = L^{-1} \left[ \sum_{c=1}^{C} T_c^{\mathsf{T}} F_c \right]   (Equation 11)
L^{-1} = \left[ \sum_{c=1}^{C} N_c T_c^{\mathsf{T}} T_c + I \right]^{-1}   (Equation 12)
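Equations 11 and 12 can be sketched as follows, assuming the total variability matrices are stacked in one array of shape (C, D, R), where R is the i-vector dimension. The function name extract_ivector, the array layout, and the random example data are hypothetical.

```python
import numpy as np

def extract_ivector(N, F, T):
    """i-vector from the zero-order and first-order statistics.

    N: (C,) zero-order statistics
    F: (C, D) first-order statistics
    T: (C, D, R) total variability matrices T_c
    Returns phi with shape (R,).
    """
    R = T.shape[2]
    # Equation 12 (before inversion): L = sum_c N_c T_c^T T_c + I
    L = np.einsum('c,cdr,cds->rs', N, T, T) + np.eye(R)
    # Bracketed term of Equation 11: sum_c T_c^T F_c
    rhs = np.einsum('cdr,cd->r', T, F)
    # Solving the linear system avoids forming L^{-1} explicitly.
    return np.linalg.solve(L, rhs)

rng = np.random.default_rng(0)
C, D, R = 4, 3, 2
N = rng.uniform(1.0, 10.0, size=C)
F = rng.standard_normal((C, D))
T = rng.standard_normal((C, D, R))
phi = extract_ivector(N, F, T)
```

Since L is symmetric positive definite, a Cholesky-based solver would also work; np.linalg.solve is used here only for brevity.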
[0064] The total variability matrix of the cluster in the present exemplary embodiment corresponds to a total variability matrix of a generative Gaussian. Note that the training mechanism may follow the standard i-vector mechanism as disclosed in Non Patent Literature 1, for example. In the present exemplary embodiment, since the i-vector is extracted using the neural network technology, the extracted i-vector can also be called a neural i-vector.
[0065] The probabilistic model generator 60 generates a probabilistic model. By sampling from this probabilistic model, new data can be generated. Let [phi] be the (neural) i-vector. The probabilistic model generator 60 may form the probabilistic model as shown in Equation 13 shown below.
[Math. 9]
p(x_t \mid \phi) = \sum_{c=1}^{C} \omega_c \, \mathcal{N}(x_t \mid \mu_c + \Sigma^{1/2} T_c \phi, \Sigma)   (Equation 13)
[0066] where
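Sampling from the model of Equation 13 can be sketched as follows. This is an illustrative sketch that assumes [omega].sub.c are mixture weights that sum to one (their definition is not reproduced in this text) and that the total variability matrices are stacked in an array of shape (C, D, R). The function name sample_frames and the example data are hypothetical.

```python
import numpy as np

def sample_frames(phi, mu, Sigma, T, omega, n, rng):
    """Draw n frames from the Gaussian mixture of Equation 13.

    phi: (R,), mu: (C, D), Sigma: (D, D), T: (C, D, R), omega: (C,)
    Returns the samples with shape (n, D) and the chosen component indices.
    """
    C, D = mu.shape
    w, U = np.linalg.eigh(Sigma)
    sqrt_Sigma = (U * np.sqrt(w)) @ U.T            # symmetric square root
    # Component means: mu_c + Sigma^{1/2} T_c phi
    means = mu + np.einsum('de,cer,r->cd', sqrt_Sigma, T, phi)
    comps = rng.choice(C, size=n, p=omega)         # pick a component per frame
    noise = rng.multivariate_normal(np.zeros(D), Sigma, size=n)
    return means[comps] + noise, comps

rng = np.random.default_rng(0)
C, D, R, n = 3, 2, 2, 100
mu = rng.standard_normal((C, D))
A = rng.standard_normal((D, D))
Sigma = A @ A.T + np.eye(D)                        # positive definite example
T = rng.standard_normal((C, D, R))
phi = rng.standard_normal(R)
omega = np.full(C, 1.0 / C)                        # assumed uniform weights
samples, comps = sample_frames(phi, mu, Sigma, T, omega, n, rng)
```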
[0067] The frame processor 10, the posterior estimator 20, the statistics calculator 40, the i-vector extractor 50 and the probabilistic model generator 60 are implemented by a CPU of a computer operating according to a program (speech embedding program). For example, the program may be stored in the storage unit 30, with the CPU reading the program and, according to the program, operating as the frame processor 10, the posterior estimator 20, the statistics calculator 40, the i-vector extractor 50 and the probabilistic model generator 60. The functions of the speech embedding apparatus 100 may be provided in the form of SaaS (Software as a Service).
[0068] The frame processor 10, the posterior estimator 20, the statistics calculator 40, the i-vector extractor 50 and the probabilistic model generator 60 may each be implemented by dedicated hardware. All or part of the components of each device may be implemented by general-purpose or dedicated circuitry, processors, or combinations thereof. They may be configured with a single chip, or configured with a plurality of chips connected via a bus. All or part of the components of each device may be implemented by a combination of the above-mentioned circuitry or the like and program.
[0069] In the case where all or part of the components of each device is implemented by a plurality of information processing devices, circuitry, or the like, the plurality of information processing devices, circuitry, or the like may be centralized or distributed. For example, the information processing devices, circuitry, or the like may be implemented in a form in which they are connected via a communication network, such as a client-and-server system or a cloud computing system.
[0070] Next, an operation example of the speech embedding apparatus according to the present exemplary embodiment will be described.
[0071] The frame processor 10 calculates the second sequence x.sub.[kappa] from the first sequence o.sub.[tau] (Step S11). The posterior estimator 20 calculates the posterior probabilities [gamma].sub.c,t for each element x.sub.t included in the second sequence x.sub.[kappa] to a cluster c (Step S12). The statistics calculator 40 calculates a sufficient statistic by using the second sequence x.sub.[kappa], the posterior probability [gamma].sub.c,t, the mean vector [mu].sub.c of each cluster, and the global covariance matrix [Sigma].
[0072] As described above, according to the present exemplary embodiment, the frame processor 10 calculates the second sequence x.sub.[kappa] from the first sequence o.sub.[tau], the posterior estimator 20 calculates the posterior probabilities [gamma].sub.c,t for each element x.sub.t included in the second sequence x.sub.[kappa] to a cluster c, and the statistics calculator 40 calculates a sufficient statistic by using the second sequence x.sub.[kappa], the posterior probability [gamma].sub.c,t, the mean vector [mu].sub.c of each cluster, and the global covariance matrix [Sigma]. Therefore, it is possible to extract features in a mode that requires generative modeling, while improving the performance of speaker verification.
[0073] Next, an outline of the present invention will be described.
[0074] With such a configuration, it is possible to extract features in a mode that requires generative modeling, while improving the performance of a speech processing application (e.g., speaker recognition).
[0075] Also, the frame processor 81 may calculate the second sequence by implementing a neural network including multiple layers learnt in advance.
[0076] Specifically, the neural network may include time-delay neural network layers, convolutional neural network layers, recurrent neural network layers, their variants, or their combination.
[0077] Also, the time resolution of the second sequence may be the same as the time resolution of the first sequence or larger.
[0078] Also, the posterior estimator 82 may calculate the posterior probabilities using the values calculated from fully connected layers of a neural network learnt in advance.
[0079] Also, the statistics calculator 83 may calculate a zero-order statistic and a first-order statistic as the sufficient statistic.
[0080] Also, the speech embedding apparatus 80 may include an i-vector extractor (for example, i-vector extractor 50) which extracts an i-vector using the calculated sufficient statistic.
[0081]
[0082] Each of the above-described speech embedding apparatuses is mounted on the computer 1000. The operation of the respective processing units described above is stored in the auxiliary storage device 1003 in the form of a program (a speech embedding program). The CPU 1001 reads the program from the auxiliary storage device 1003, deploys the program in the main memory 1002, and executes the above processing according to the program.
[0083] Note that in at least one of the exemplary embodiments, the auxiliary storage device 1003 is an exemplary non-transitory physical medium. Other examples of non-transitory physical media include a magnetic disc, a magneto-optical disk, a CD-ROM, a DVD-ROM, and a semiconductor memory that are connected via the interface 1004. In the case where the program is distributed to the computer 1000 via a communication line, the computer 1000 that has received the program may deploy the program in the main memory 1002 to execute the processing described above.
[0084] Incidentally, the program may implement a part of the functions described above. The program may implement the aforementioned functions in combination with another program stored in the auxiliary storage device 1003 in advance, that is, the program may be a differential file (differential program).
[0085] While the invention has been particularly shown and described with reference to example embodiments thereof, the invention is not limited to these embodiments. It will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present invention as defined by the claims.
[0086] The whole or part of the example embodiments disclosed above can be described as, but not limited to, the following supplementary notes.
[0087] (Supplementary note 1) A speech embedding apparatus comprising: a frame processor which calculates, from a first sequence of feature vectors, a second sequence of frame-level feature vectors; a posterior estimator which calculates posterior probabilities for each vector included in the second sequence to a cluster; and a statistics calculator which calculates a sufficient statistic used for extracting an i-vector by using the second sequence, the posterior probabilities, a mean vector of each cluster calculated at the time of learning of the frame processor and the posterior estimator, and a global covariance matrix calculated based on the mean vector.
[0088] (Supplementary note 2) The speech embedding apparatus according to Supplementary note 1, wherein, the frame processor calculates the second sequence by implementing a neural network including multiple layers learnt in advance.
[0089] (Supplementary note 3) The speech embedding apparatus according to Supplementary note 2, wherein, the neural network includes time-delay neural network layers, convolutional neural network layers, recurrent neural network layers, their variants, or their combination.
[0090] (Supplementary note 4) The speech embedding apparatus according to any one of Supplementary notes 1 to 3, wherein, the time resolution of the second sequence is the same as the time resolution of the first sequence or larger.
[0091] (Supplementary note 5) The speech embedding apparatus according to any one of Supplementary notes 1 to 4, wherein, the posterior estimator calculates the posterior probabilities using the values calculated from fully connected layers of a neural network learnt in advance.
[0092] (Supplementary note 6) The speech embedding apparatus according to any one of Supplementary notes 1 to 5, wherein, the statistics calculator calculates a zero-order statistic and a first-order statistic as the sufficient statistic.
[0093] (Supplementary note 7) The speech embedding apparatus according to any one of Supplementary notes 1 to 6, further comprising an i-vector extractor which extracts an i-vector using the calculated sufficient statistic.
[0094] (Supplementary note 8) A speech embedding method comprising: calculating, from a first sequence of feature vectors, a second sequence of frame-level feature vectors; calculating posterior probabilities for each vector included in the second sequence to a cluster; and calculating a sufficient statistic used for extracting an i-vector by using the second sequence, the posterior probabilities, a calculated mean vector of each cluster, and a global covariance matrix calculated based on the mean vector.
[0095] (Supplementary note 9) The speech embedding method according to Supplementary note 8, wherein, the second sequence is calculated by implementing a neural network including multiple layers learnt in advance.
[0096] (Supplementary note 10) A non-transitory computer readable recording medium storing a speech embedding program, when executed by a processor, that performs a method for: calculating, from a first sequence of feature vectors, a second sequence of frame-level feature vectors; calculating posterior probabilities for each vector included in the second sequence to a cluster; and calculating a sufficient statistic used for extracting an i-vector by using the second sequence, the posterior probabilities, a calculated mean vector of each cluster, and a global covariance matrix calculated based on the mean vector.
[0097] (Supplementary note 11) The non-transitory computer readable recording medium according to Supplementary note 10, wherein, the second sequence is calculated by implementing a neural network including multiple layers learnt in advance.
REFERENCE SIGNS LIST
[0098] 10 Frame processor
[0099] 20 Posterior estimator
[0100] 30 Storage unit
[0101] 31 Dictionary
[0102] 40 Statistics calculator
[0103] 50 I-vector extractor
[0104] 60 Probabilistic model generator
[0105] 100 Speech embedding apparatus