Estimation of reliability in speaker recognition

Abstract

A method for estimating the reliability of a result of a speaker recognition system concerning a testing audio and a speaker model, which is based on one, two, three or more model audios, the method using a Bayesian Network to estimate whether the result is reliable. In estimating the reliability of the result of the speaker recognition system one, two, three, four or more than four quality measures of the testing audio and one, two, three, four or more than four quality measures of the model audio(s) are used.

Claims

1. A method for estimating the reliability of a result of a speaker recognition system, the result concerning one, two, three or more testing audio(s) or a testing voice print and a speaker model, which is based on one, two, three or more model audios, the method comprising: using a Bayesian Network to estimate whether the result is reliable, wherein estimating the reliability of the result of the speaker recognition system includes using one, two, three, four or more than four quality measures of the testing audio(s) and one, two, three, four or more than four quality measures of the model audio(s), and wherein using the Bayesian network includes: using as nodes describing observed parameters an observed score and the quality measures, using as nodes describing hidden parameters a hidden score, states of quality, coefficients of the distribution describing the states of quality, mean and precision describing the groups of the quality measures, mean and precision describing the distribution of the offset between observed and hidden score, mean and precision describing the distribution of the hidden score, and a real label of a trial, and using as a node describing a deterministic value a hypothesis prior, wherein: the observed score is dependent on at least one of a group consisting of the states of quality, a clean score, a real trial label, the mean and precision of the distribution describing the offset between the observed score and the hidden score, the real trial label is dependent on the hypothesis prior and the hidden score is dependent on the (hidden) real label of trial and the mean and precision of the distribution describing the clean score, the states of quality depend on the coefficients of the distribution describing the states of quality, the observed quality measures depend on the states of quality and the mean and precision of the distribution describing the groups of the observed quality measures.

2. The method according to claim 1, wherein the distribution describing the states of quality is discrete and/or wherein the mean describing the offset between observed and clean score depends on the precision describing the offset between observed and clean score and/or wherein the mean describing the quality measures optionally depends on the precision describing the quality measures.

3. The method according to claim 1, further comprising training the Bayesian Network before the Bayesian Network is used to estimate the reliability of the result of the speaker recognition system.

4. The method according to claim 1, wherein for training of the Bayesian Network one, two, three, four or more than four quality measures are used.

5. The method according to claim 1, wherein the quality measures comprise one, two, three or four of the following: signal to noise ratio, modulation index, entropy, universal background model log likelihood.

6. The method according to claim 1, wherein the Bayesian Network is trained using an Expectation Maximization algorithm to extract the parameters of the model.

7. The method according to claim 1, wherein the Bayesian Network is trained in one of the following manners: supervised, unsupervised, blind.

8. The method according to claim 1, wherein the Bayesian Network is adapted in order to describe certain circumstances better.

9. The method according to claim 1, wherein the quality measures are provided by one, two or more systems different from the Bayesian Network.

10. The method according to claim 1, wherein the reliability is used to make a decision, optionally comprising one of the following: discarding unreliable trials, transforming a score, fusing the results of two, three or more speaker recognition systems.

11. The method according to claim 1, wherein the speaker recognition system is used for speaker verification and/or speaker identification.

12. A non-transitory computer readable medium comprising computer readable instructions for executing a method according to claim 1 when executed on a computer.

Description

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

(1) Further details of the invention are explained in the following figures.

(2) FIG. 1 shows a Bayesian Network as used in the prior art;

(3) FIG. 2 shows a Bayesian Network which may be used for a method according to the invention;

(4) FIG. 3 shows possible input parameters to a Bayesian Network;

(5) FIG. 4(a) shows a first training method for training a Bayesian Network;

(6) FIG. 4(b) shows a second training method for training a Bayesian Network;

(7) FIG. 4(c) shows a third training method for training a Bayesian Network;

(8) FIG. 5 shows steps which may be used for adaptation of a Bayesian Network;

(9) FIG. 6 shows the steps of a method for estimating the reliability of a decision of a speaker recognition system;

(10) FIG. 7 shows steps which may be comprised in a method according to the invention; and

(11) FIG. 8 shows a step which may be comprised in a method according to the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

(12) FIG. 1 shows a Bayesian Network used for example in the document “A probabilistic measure of modality reliability in speaker verification”, published in Acoustics, Speech and Signal Processing, 2005, Proceedings, (ICASSP '05), IEEE International Conference in 2005 by J. Richiardi et al. In it, empty nodes denote hidden variables, shaded nodes denote observed variables and a small solid node denotes a deterministic parameter. A node or group of nodes surrounded by a box (called a plate) labelled with N indicates that there are N nodes of that kind, for example N trials. The arcs between the nodes point from the parent to the child variables, representing the conditional dependencies between parents and children. Herein, a parent variable is a variable on which another variable, called its child variable, depends. The expressions used in FIG. 1 are known e.g. from the Bishop reference cited previously.

(13) The variables used in FIG. 1 are the following. s.sub.i is the observed speaker verification score, and Q.sub.i represents the observed speech quality measures related to one trial (only SNR in the previously mentioned document). θ.sub.i∈{T,NT} is the hidden label of the trial, where T is the hypothesis that the training and testing audio belong to the same speaker and NT is the hypothesis that the training and testing audio belong to different speakers. θ̂.sub.i is the observed speaker recognition decision for the i-th trial, which is marked by the i subscript, after applying a threshold ξ.sub.θ. R.sub.i∈{ℛ,U} is the hidden reliability of the trial, where ℛ is the hypothesis that the decision is reliable and U the hypothesis that it is unreliable. π.sub.θ=(P.sub.T,P.sub.NT) is the deterministic hypothesis prior, where P.sub.T is the target prior and P.sub.NT=1−P.sub.T the non-target prior. Finally, π.sub.R=(P.sub.ℛ,P.sub.U) is the deterministic reliability prior. Using the Bayesian Network, it is possible to compute the posterior distribution of R.sub.i given the observed and deterministic variables, P(R.sub.i|s.sub.i,Q.sub.i,θ̂.sub.i,π.sub.θ,π.sub.R).

(14) Such a model may have the disadvantage that the parameters of the Bayesian Network may depend on the speaker verification threshold ξ.sub.θ. Thus, a change of the threshold may make re-training necessary, which may not be an option or may not be advantageous in many real cases.

(15) FIG. 2 shows a Bayesian Network which may be used in some embodiments of the invention. In it, empty nodes denote hidden variables, shaded nodes denote observed variables and a small solid node denotes a deterministic parameter. Again, a group of nodes surrounded by a box (called a plate) labelled with the letter N or P indicates that there are N or P groups of nodes of that kind. In this example, there may be N trials and P quality measures, wherein N may be 1, 2, 3 or more and wherein P may be 1, 2, 3, 4 or more.

(16) A Bayesian Network which may be used for a method as described before may use or comprise some or all of the following components and variables:

(17) ŝ.sub.i is the observed score provided by the speaker recognition system. The testing audio and the model audio(s) may have been degraded. In general, such an ŝ.sub.i may be a vector of scores from different speaker recognition systems. In other embodiments it may be a scalar variable. The subscript i which may be between 1 and number of trials N (iε[1,N]), may represent the trial.

(18) Whenever a variable has a subscript i, it means that this is a realization of a random variable. Such a realization has the same name as the corresponding random variable, but carries the additional subscript i which the random variable does not have. For example, ŝ.sub.i is a realization of ŝ.

(19) s.sub.i is the clean score which may be a vector of clean scores e.g. of different speaker recognition systems. It may be a scalar variable in other embodiments. The subscript i, which may be between 1 and the number of trials N (i∈[1,N]), may represent the trial. Such a clean score would correspond to the score provided by a speaker recognition system without any degradation of the testing audio and model audio(s). In a general case, such a clean score may be a hidden variable. However, if the method comprises a training of the Bayesian network, depending on the training, the clean score may be observed in the training phase. In particular, if an artificially degraded database is used, for example by adding additive noises or convolutional distortion to the signals, a clean score may be observed in the training phase. The distribution of s under the condition θ may be assumed to be Gaussian: P(s|θ)=N(s; μ.sub.s.sub.θ,Λ.sub.s.sub.θ.sup.−1), where θ is the real trial label, which can be target or non-target (θ∈{T,NT}). μ.sub.s.sub.θ and Λ.sub.s.sub.θ.sup.−1 are the mean and variance (inverse of precision) of the (usually Gaussian) distribution which the clean scores associated with θ follow.

(20) Furthermore, the relationship between clean hidden and observed scores may be modelled following the expression ŝ.sub.i=s.sub.i+Δs.sub.i. Δs.sub.i may be the offset (difference) between the observed (noisy) score and the clean (hidden) score.

(21) π.sub.θ is the hypothesis prior π.sub.θ=(P.sub.T,P.sub.NT) with P.sub.T+P.sub.NT=1 and may be deterministic. Herein, P.sub.T may be the target prior and P.sub.NT the non-target prior. The target prior is the prior probability of a target trial. This may be considered as the probability of a target trial without knowing anything about a trial.

(22) z.sub.i are the states of quality (quality states) associated to i-th trial. It is a K-dimensional binary vector with elements z.sub.ik with k between 1 and the number of quality states K (kε[1,K]).

(23) z.sub.i is usually a binary vector. Given z.sub.i usually only one element will be equal to 1, while the others are 0. There may be K quality states. Thus, the element z.sub.ik which is equal to 1 determines the quality state associated to the i-th trial, the k-th in this case.

(24) Although the quality measures are usually continuous variables, the combination of all of them may be discretized and affect the distribution of Δs.sub.i. The distribution of z is given by

(25) $P(z) = \prod_{k=1}^{K} \left(\pi_{z_k}\right)^{z_k}.$

(26) π.sub.z are the coefficients of the optionally discrete distribution describing z. π.sub.z is usually a K-dimensional vector with elements π.sub.z.sub.k, wherein π.sub.z.sub.k is usually the probability of the k-th quality state (which usually is the probability of z.sub.k).

(27) π.sub.z may be a variable of the Bayesian network and is usually obtained during the training phase of the Bayesian Network. There may also be other variables of the Bayesian Network that are trained during the training phase. K-dimensional z determines one quality state. When it is associated with a trial, it is usually called z.sub.i.

(28) Thus, the probability of z is usually π.sub.z.sub.k, wherein z.sub.k is the element of z which is 1. This may e.g. be expressed as given above

(29) $P(z) = \prod_{k=1}^{K} \left(\pi_{z_k}\right)^{z_k}.$

(30) Q.sub.pi are the observed quality measures. It is considered that there are P groups of quality measures that are independent from each other given z.sub.i (pε[1,P]). This may allow forcing independence between variables, e.g. variables that should not be correlated. Herein, i may be the number of the trial, and p may run between 1 and the number of quality measures P. If Q.sub.p is modelled by Gaussians this may be the same as having a Gaussian block diagonal covariance matrix. Herein, Q.sub.p describes the observed quality measures. When they refer to a particular trial, they are referenced as Q.sub.pi.

(31) This set may be denoted as Q.sub.i={Q.sub.pi}.sub.p=1.sup.P.

(32) μ.sub.Q.sub.p and Λ.sub.Q.sub.p are the mean and precision (usually described by a matrix) of the usually Gaussian distributions that describe Q.sub.p. There are K different distributions, as many as quality states so that:

(33) $P(Q \mid z_k = 1) = \prod_{p=1}^{P} \mathcal{N}(Q_p; \mu_{Q_{pk}}, \Lambda_{Q_{pk}}^{-1})$
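As a minimal sketch of the expression above, the following Python function evaluates log P(Q|z.sub.k=1) for the simplified case in which each of the P groups holds a single scalar quality measure, so that every precision is a scalar. The function names and the numeric values are illustrative assumptions and are not taken from the described method.

```python
import math

def log_gaussian(x, mean, precision):
    """Log density of a scalar Gaussian parameterised by mean and precision."""
    return (0.5 * (math.log(precision) - math.log(2.0 * math.pi))
            - 0.5 * precision * (x - mean) ** 2)

def log_quality_likelihood(q, means_k, precisions_k):
    """log P(Q | z_k = 1): the product over the P groups becomes a sum of logs."""
    return sum(log_gaussian(x, m, lam)
               for x, m, lam in zip(q, means_k, precisions_k))

# Hypothetical example: two scalar quality measures for one quality state k.
q = [18.0, 3.2]
print(log_quality_likelihood(q, means_k=[20.0, 3.0], precisions_k=[0.04, 4.0]))
```

Working in the log domain is a design choice of this sketch; it avoids numerical underflow when P is large.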

(34) μ.sub.Δs and Λ.sub.Δs are the mean μ.sub.Δs and precision Λ.sub.Δs (usually described by a matrix) of the usually Gaussian distribution that describes Δs. There are 2K different distributions, one for each quality state and θ.

(35) So, P(ŝ|s,z.sub.k=1,θ)=N(ŝ; s+μ.sub.Δs.sub.kθ,Λ.sub.Δs.sub.kθ.sup.−1).

(36) Therein, there may be N groups of nodes comprising the variables ŝ.sub.i, s.sub.i, Q.sub.pi, z.sub.i and θ.sub.i (one group for each i∈[1,N]) and P groups of nodes comprising the variables Q.sub.pi, μ.sub.Q.sub.p and Λ.sub.Q.sub.p (one group for each p∈[1,P]). In particular, ŝ.sub.i may be dependent on z.sub.i, s.sub.i, θ.sub.i, μ.sub.Δs and Λ.sub.Δs. θ.sub.i may be dependent on the (optionally deterministic) π.sub.θ. s.sub.i may depend on θ.sub.i, Λ.sub.s and μ.sub.s, while μ.sub.s may depend on Λ.sub.s. z.sub.i may depend on π.sub.z, and μ.sub.Δs may depend on Λ.sub.Δs. Q.sub.pi may be an observed variable dependent on z.sub.i, μ.sub.Q.sub.p and Λ.sub.Q.sub.p, while μ.sub.Q.sub.p may be dependent on Λ.sub.Q.sub.p. ŝ.sub.i and Q.sub.pi may be observed; Λ.sub.Δs, μ.sub.Δs, Λ.sub.s, μ.sub.s, θ.sub.i, s.sub.i, Λ.sub.Q.sub.p, π.sub.z, z.sub.i and μ.sub.Q.sub.p may be hidden variables; and π.sub.θ may be deterministic.
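The dependencies listed above can be illustrated by drawing one trial from the network in parent-to-child order (ancestral sampling). The sketch below assumes scalar scores and scalar quality measures and treats all distribution parameters as already known; the function name, the data layout and all numeric values are hypothetical.

```python
import random

def sample_trial(rng, p_target, pi_z, clean, offset, quality):
    """Draw one trial (theta, k, s, s_hat, Q) by ancestral sampling.

    clean[theta]       -> (mean, precision) of the clean score s
    offset[(k, theta)] -> (mean, precision) of the offset delta_s
    quality[k]         -> list of (mean, precision), one per quality measure
    """
    theta = "T" if rng.random() < p_target else "NT"      # theta depends on pi_theta
    k = rng.choices(range(len(pi_z)), weights=pi_z)[0]    # z depends on pi_z
    mu_s, lam_s = clean[theta]
    s = rng.gauss(mu_s, lam_s ** -0.5)                    # std = precision^(-1/2)
    mu_d, lam_d = offset[(k, theta)]
    s_hat = s + rng.gauss(mu_d, lam_d ** -0.5)            # observed = clean + offset
    q = [rng.gauss(m, lam ** -0.5) for m, lam in quality[k]]
    return theta, k, s, s_hat, q

# Hypothetical parameters: K = 2 quality states, P = 1 quality measure.
clean = {"T": (2.0, 1.0), "NT": (-2.0, 1.0)}
offset = {(0, "T"): (0.0, 4.0), (0, "NT"): (0.0, 4.0),
          (1, "T"): (-1.0, 1.0), (1, "NT"): (1.0, 1.0)}
quality = [[(20.0, 0.04)], [(5.0, 0.04)]]
print(sample_trial(random.Random(0), 0.5, [0.7, 0.3], clean, offset, quality))
```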

(37) In the Bayesian Network described above, p will usually assume values between 1 and P, and i will usually assume values between 1 and N.

(38) Herein, P is the number of quality measures and N the number of trials.

(39) FIG. 3 shows a diagram of the input and output parameters of the Bayesian Network. In particular, the score of a speaker recognition system and the chosen quality measures are used by the Bayesian Network, for example, to calculate (estimate) the reliability. In other embodiments, these input parameters may be used for the training of the Bayesian Network. In particular, in the case shown, the quality measures signal to noise ratio, modulation index, entropy and Universal Background Model log likelihood are explicitly mentioned. It is indicated in the figure that other quality measures may additionally be used.

(40) In other embodiments, only one, two, three or four of the mentioned quality measures may be used or any number of the shown quality measures may be used in combination with any other quality measures not shown here.

(41) As a result, the reliability P(R.sub.i=ℛ|ŝ.sub.i,Q.sub.i) of the result of the recognition system may be estimated (calculated), usually for a particular testing audio and particular model audio(s). The result may for example be the probability that the decision of a trial, which has e.g. been found by comparing the observed score calculated by the speaker recognition system with a threshold, is reliable.

(42) To calculate that reliability, additionally the speaker recognition threshold used by the speaker recognition system and/or a reliability threshold will usually have to be provided as input parameters for the Bayesian Network as well (not shown).

(43) FIG. 4 shows three different training methods which may be used to train the Bayesian Network.

(44) In particular, the Bayesian Network may be trained using stereo development data (data wherein both the degraded and the clean data are present) in a supervised training. In it, Δs and z are observed during the training. The parameters are extracted using expectation maximization or any other suitable algorithm (FIG. 4 (a)).

(45) FIG. 4 (b) shows a different training approach for a Bayesian Network. In it, stereo development data (comprising clean data and degraded data) is used in an unsupervised training. In such training, Δs may be observed during the training while z may be hidden during the training. Again, the parameters of the model may be extracted using a suitable algorithm, for example an expectation maximization algorithm.

(46) FIG. 4 (c) shows blind training of the Bayesian Network. In particular, it may not be necessary to provide stereo data. The data used for training the Bayesian Network in blind training is usually degraded. Any degradation not seen in the development speech signals will usually not be modelled by the Bayesian Network. This is usually also true for other training methods, for example as described with regard to FIGS. 4(a) and 4(b). Usually, the accuracy of the Bayesian Network depends on the mismatch between the development data used to train the Bayesian Network and the testing data. With low mismatch, the accuracy of the Bayesian Network will be high, and vice versa.

(47) In blind training, Δs and z are hidden variables during the training. The parameters are extracted using a suitable algorithm, for example an expectation maximization algorithm.

(48) FIG. 5 shows steps which may be used in a method according to the invention for the adaptation of a Bayesian Network (its parameters). Starting from adaptation data and using the parameters of a Bayesian Network which has already been trained, the Bayesian Network (its parameters) can be adapted. The adaptation data may comprise the observed score(s) (ŝ.sub.i) provided by the speaker recognition system from the adaptation data and one, two, three or more quality measures from audios used for the adaptation. Usually, the adaptation data comprises all quality measures derived from the one or more audio(s) used for the adaptation that are considered in the Bayesian Network and the observed score(s) provided by the speaker recognition system. Usually, the quality measure(s) and/or the score(s) are not computed from the audio(s) in the adaptation training, but e.g. before the adaptation training. During the adaptation training Δs and z may be hidden.

(49) Such an adaptation may for example be done using a maximum a posteriori (MAP) algorithm.

(50) With such an approach, after the adaptation, an adapted set of parameters of the Bayesian Network may be present. Thus, the Bayesian Network may then be used with the adapted parameters.

(51) Such an adaptation process may be particularly useful, if only a small set of model audios are present for the situation for which the model should be trained. Then, the result which may be achieved by using an already trained Bayesian Network and adapting its parameters/adapting the Bayesian Network, may be more reliable than starting the training process with the (limited) amount of data available for the particular situation from scratch.

(52) FIG. 6 shows the steps of an embodiment of the method of the invention. In particular, using quality measures of the testing and model audios and the score of a speaker recognition system, which may both be derived indirectly or directly from the testing and model audios, the Bayesian Network with trained parameters may compute the reliability and make a decision based on that reliability. Usually, a speaker recognition threshold and/or a reliability threshold are needed to make a final decision.

(53) As explained above, such a decision may for example be a discarding of a trial if the decision is unreliable, a transformation of the score, for example by using one of the functions described above for that purpose or fusing of several systems (all of these are not shown in FIG. 6).
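One of the decisions named above, discarding a trial whose result is unreliable, can be sketched as a comparison against two thresholds. This is only an illustration under stated assumptions: the function name, the three return labels and the rule that a score at or above the threshold is accepted are hypothetical choices, and the reliability is taken as already estimated by the Bayesian Network.

```python
def decide(score, reliability, score_threshold, reliability_threshold):
    """Return 'discard' for an unreliable trial, otherwise 'accept' or 'reject'.

    score       : observed score of the speaker recognition system
    reliability : estimated probability that the result is reliable
    """
    if reliability < reliability_threshold:
        return "discard"  # unreliable trial: no accept/reject decision is made
    return "accept" if score >= score_threshold else "reject"

# Hypothetical thresholds and trials.
for score, reliability in [(2.3, 0.95), (-1.1, 0.90), (0.4, 0.30)]:
    print(score, reliability, decide(score, reliability, 0.0, 0.5))
```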

(54) For example, a score obtained by the speaker recognition system may be transformed into a transformed likelihood ratio or a transformed log likelihood ratio dependent on reliability to obtain a (calibrated) transformed (log) likelihood ratio. Thus, from a speaker recognition system providing a raw score which is not given as a (log) likelihood ratio, the score may be transformed into a (calibrated) transformed likelihood ratio (or a (calibrated) transformed log likelihood ratio LLR or a (calibrated) transformed score in a different format than a (log) likelihood ratio), or from a speaker recognition system providing a (calibrated) likelihood ratio (or a (calibrated) log likelihood ratio) the LR (LLR) may be transformed in view of the reliabilities estimated by the Bayesian Network to result in a (calibrated) transformed LLR or (calibrated) transformed LR (not shown in FIG. 6).

(55) FIG. 7 shows, in a diagram, how a final score may be calculated using the scores of several speaker recognition systems 1 to M (wherein M is the number of different speaker recognition systems and may be 1, 2, 3, 4 or more) and the reliability of those scores. This final score may correspond to a decision mentioned for example in FIG. 6. In particular, starting out from the data, which is usually a testing audio and one or more model audio(s), several speaker recognition systems (in this case 1 to M) calculate scores 1 to M. Each speaker recognition system then provides its score to the Bayesian Network. By using the quality measures of the testing audio and model audio(s) and the scores of the speaker recognition systems, the Bayesian Network then proceeds to make a decision. The quality measures are usually extracted from the data by an external module. This module, however, may also be integrated with the Bayesian Network in other embodiments. The decision may, for example, be a final score which may be compared against the threshold. For making such a decision, another Bayesian Network may be used.

(56) In other embodiments, some other module different from the Bayesian Network may make the decision using input from the Bayesian Network. For example, the scores may be fused by an external module according to their reliability which may be obtained with the explained Bayesian Network.

(57) In particular, a final score may be some combination of weighted scores wherein the scores with a higher reliability are weighted more than the scores with a lower reliability.
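A reliability-weighted combination of this kind may, for instance, be a normalised weighted mean of the M scores. The sketch below is one possible choice and not necessarily the combination used in any embodiment; the function name and the example values are assumptions.

```python
def fuse_scores(scores, reliabilities):
    """Reliability-weighted mean of the scores of M speaker recognition systems."""
    if len(scores) != len(reliabilities):
        raise ValueError("one reliability is needed per score")
    total = sum(reliabilities)
    if total <= 0.0:
        raise ValueError("at least one reliability must be positive")
    # Scores with a higher reliability contribute more to the final score.
    return sum(r * s for s, r in zip(scores, reliabilities)) / total

# Hypothetical example with M = 2 systems.
print(fuse_scores([2.0, -1.0], [0.9, 0.1]))
```

With equal reliabilities the function reduces to the plain average of the scores.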

(58) In particular, in such a fusion, one Bayesian Network may calculate the reliability for the trials provided by all speaker recognition systems, or two, three, or more Bayesian Networks may be used. In particular, for each score of a speaker recognition system, one Bayesian Network may be used to calculate reliability and then the decision may be taken in the following step (not shown in FIG. 7). Usually, when the speaker identification system is changed, the Bayesian Network has to be retrained. Thus, in some embodiments, two, three, or more Bayesian Networks may be used. In other embodiments, only one Bayesian Network may be used.

(59) FIG. 8 also shows a step which may be comprised in a method according to the invention. A Bayesian Network may use the input quality measures and the result of a speaker recognition system, for example, an observed score ŝ.sub.i, e.g. a (calibrated) log likelihood ratio (LLR) or a (calibrated) likelihood ratio (LR) as input. It may then calculate the reliability of the result of the speaker recognition system.

(60) Based on the reliability that is calculated, a decision may then be made. This may for example be made by calculating a (calibrated) transformed likelihood ratio or a (calibrated) transformed log likelihood ratio or a (calibrated) transformed score based on the reliability and the result of the speaker recognition system. However, usually, when the result of the speaker recognition system is a likelihood ratio or a log likelihood ratio, no transformed score in a format different than a (log) likelihood ratio can be calculated.

(61) If a likelihood ratio or log likelihood ratio is the result of a speaker recognition system, using the reliability a (calibrated) transformed likelihood ratio or (calibrated) transformed log likelihood ratio may be calculated as output.

(62) Starting from a likelihood ratio as result of a speaker recognition system, a (calibrated) transformed likelihood ratio or a (calibrated) transformed log likelihood ratio may be calculated. Accordingly, from a log likelihood ratio, a (calibrated) transformed likelihood ratio or a (calibrated) transformed log likelihood ratio may be calculated as a result.

(63) Alternatively a (calibrated) transformed score in a different format than a (log) likelihood ratio may be calculated using ŝ.sub.i in a different format than a (log) likelihood ratio.

(64) The transformed likelihood ratio and/or the transformed log likelihood ratio or the transformed score may or may not be calibrated. The log likelihood ratio or the likelihood ratio or the score provided by a speaker recognition system may also be calibrated or may not be calibrated.

(65) The steps of calculating a decision (for example a transformed likelihood ratio or transformed log likelihood ratio or transformed score) based on the result of the speaker recognition system (which may for example be a score ŝ.sub.i in a format different than a (log) likelihood ratio or log likelihood ratio or likelihood ratio) using the reliability estimated by the Bayesian Network may be done by a different module or system than a Bayesian Network, wherein the reliability may be provided by the Bayesian Network and the result of the speaker recognition system may be provided by the speaker recognition system as input for the module or system.

ANNEX I

(66) The posterior probability of the hidden score, given the observed score and the quality measures, P(s|ŝ, Q), can be expressed as follows (a method for calculating the posterior probability of the hidden score given the observed score and the quality measures may also e.g. be found in J. Villalba: A Bayesian Network for Reliability Estimation: Unveiling the Score Hidden under the Noise, Technical Report, University of Zaragoza, Zaragoza (Spain), 2012):

(67) $P(s \mid \hat{s}, Q) = \sum_{\theta \in \{T, NT\}} \sum_{k=1}^{K} P(s, \theta, z_k = 1 \mid \hat{s}, Q) = \sum_{\theta \in \{T, NT\}} \sum_{k=1}^{K} P(s \mid \hat{s}, Q, \theta, z_k = 1)\, P(\theta, z_k = 1 \mid \hat{s}, Q)$

(68) where it can be shown that P(s|ŝ,Q,θ,z.sub.k=1) follows a Gaussian distribution

(69) N(s; μ′.sub.s.sub.kθ,Λ′.sub.s.sub.kθ.sup.−1), where the precision and the mean are respectively:
Λ′.sub.s.sub.kθ=Λ.sub.Δs.sub.kθ+Λ.sub.s.sub.θ
μ′.sub.s.sub.kθ=Λ′.sub.s.sub.kθ.sup.−1(Λ.sub.Δs.sub.kθ(ŝ−μ.sub.Δs.sub.kθ)+Λ.sub.s.sub.θμ.sub.s.sub.θ)
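For scalar scores, this posterior reduces to the usual combination of two Gaussians: the posterior precision is the sum of the offset precision and the clean-score precision, and the posterior mean is the precision-weighted average of the offset-corrected observation and the clean-score mean. A minimal sketch follows; the function and argument names are illustrative assumptions.

```python
def clean_score_posterior(s_hat, mu_s, lam_s, mu_delta, lam_delta):
    """Mean and precision of P(s | s_hat, Q, theta, z_k = 1) for scalar scores.

    mu_s, lam_s         : mean and precision of the clean score for label theta
    mu_delta, lam_delta : mean and precision of the offset for state k and theta
    """
    lam_post = lam_delta + lam_s
    mu_post = (lam_delta * (s_hat - mu_delta) + lam_s * mu_s) / lam_post
    return mu_post, lam_post

# Hypothetical values: equal precisions put the posterior mean halfway between
# the corrected observation (3.0 - 0.5 = 2.5) and the clean-score mean (1.0).
print(clean_score_posterior(3.0, mu_s=1.0, lam_s=1.0, mu_delta=0.5, lam_delta=1.0))
# (1.75, 2.0)
```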

(70) On the other hand, using Bayes rule,

(71) $P(\theta, z_k = 1 \mid \hat{s}, Q) = \frac{P(\hat{s} \mid \theta, z_k = 1)\, P(Q \mid z_k = 1)\, P(\theta)\, \pi_{z_k}}{\sum_{\theta \in \{T, NT\}} \sum_{k=1}^{K} P(\hat{s} \mid \theta, z_k = 1)\, P(Q \mid z_k = 1)\, P(\theta)\, \pi_{z_k}}$

(72) Where:

(73) $P(Q \mid z_k = 1) = \prod_{p=1}^{P} \mathcal{N}(Q_p; \mu_{Q_{pk}}, \Lambda_{Q_{pk}}^{-1})$
$P(\hat{s} \mid \theta, z_k = 1) = \mathcal{N}(\hat{s}; \mu_{\hat{s}_{k\theta}}^{\prime}, \Lambda_{\hat{s}_{k\theta}}^{\prime -1})$
$\mu_{\hat{s}_{k\theta}}^{\prime} = \mu_{s_\theta} + \mu_{\Delta s_{k\theta}}$
$\Lambda_{\hat{s}_{k\theta}}^{\prime} = \Lambda_{s_\theta} \Lambda_{s_{k\theta}}^{\prime -1} \Lambda_{\Delta s_{k\theta}}$
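The joint posterior of the trial label and the quality state can be evaluated directly from these expressions. The following sketch assumes scalar scores and scalar quality measures, so that the precision of the observed score for a given label and state is the product of the two precisions divided by their sum. All names, the data layout and the example parameters are hypothetical.

```python
import math

def gaussian_pdf(x, mean, precision):
    """Density of a scalar Gaussian parameterised by mean and precision."""
    return math.sqrt(precision / (2.0 * math.pi)) * math.exp(-0.5 * precision * (x - mean) ** 2)

def joint_posterior(s_hat, q, prior, pi_z, clean, offset, quality):
    """P(theta, z_k = 1 | s_hat, Q) as a dict keyed by (theta, k)."""
    unnorm = {}
    for theta, p_theta in prior.items():
        mu_s, lam_s = clean[theta]
        for k, pi_k in enumerate(pi_z):
            mu_d, lam_d = offset[(k, theta)]
            # s_hat = s + delta_s: the means add and the variances add.
            lam_hat = lam_s * lam_d / (lam_s + lam_d)
            p_score = gaussian_pdf(s_hat, mu_s + mu_d, lam_hat)
            p_quality = math.prod(gaussian_pdf(x, m, lam)
                                  for x, (m, lam) in zip(q, quality[k]))
            unnorm[(theta, k)] = p_score * p_quality * p_theta * pi_k
    total = sum(unnorm.values())
    return {key: value / total for key, value in unnorm.items()}

# Hypothetical parameters: K = 2 quality states, P = 1 quality measure.
prior = {"T": 0.5, "NT": 0.5}
clean = {"T": (2.0, 1.0), "NT": (-2.0, 1.0)}
offset = {(0, "T"): (0.0, 4.0), (0, "NT"): (0.0, 4.0),
          (1, "T"): (-1.0, 1.0), (1, "NT"): (1.0, 1.0)}
quality = [[(20.0, 0.04)], [(5.0, 0.04)]]
post = joint_posterior(2.0, [18.0], prior, [0.5, 0.5], clean, offset, quality)
print(post)
```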

ANNEX II

(74) The EM algorithm is an iterative method that estimates the parameters of a statistical model that has some latent variables by using maximum likelihood as objective. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found in the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step. (A method of using an EM algorithm to extract the parameters of a statistical model may also e.g. be found in J. Villalba: A Bayesian Network for Reliability Estimation: Unveiling the Score Hidden under the Noise, Technical Report, University of Zaragoza, Zaragoza (Spain), 2012.)

(75) Step E

(76) The variable γ(z.sub.k)=P(z.sub.k=1|ŝ,Q,θ) is defined, which can be computed as:

(77) $\gamma(z_k) = \frac{\pi_{z_k}\, P(\hat{s} \mid z_k = 1, \theta)\, P(Q \mid z_k = 1)}{\sum_{k=1}^{K} \pi_{z_k}\, P(\hat{s} \mid z_k = 1, \theta)\, P(Q \mid z_k = 1)}$
$P(Q \mid z_k = 1) = \prod_{p=1}^{P} \mathcal{N}(Q_p; \mu_{Q_{pk}}, \Lambda_{Q_{pk}}^{-1})$
$P(\hat{s} \mid \theta, z_k = 1) = \mathcal{N}(\hat{s}; \mu_{\hat{s}_{k\theta}}^{\prime}, \Lambda_{\hat{s}_{k\theta}}^{\prime -1})$
$\mu_{\hat{s}_{k\theta}}^{\prime} = \mu_{s_\theta} + \mu_{\Delta s_{k\theta}}$
$\Lambda_{\hat{s}_{k\theta}}^{\prime} = \Lambda_{s_\theta} \Lambda_{s_{k\theta}}^{\prime -1} \Lambda_{\Delta s_{k\theta}}$

(78) Step M

(79) Step M provides the new estimation of the model parameters once the step E has been carried out:

(80) $\pi_{z_k} = \frac{\sum_{i=1}^{N} \gamma(z_{ik})}{\sum_{k=1}^{K} \sum_{i=1}^{N} \gamma(z_{ik})}$
$\mu_{Q_{pk}} = \frac{\sum_{i=1}^{N} \gamma(z_{ik})\, Q_{pi}}{\sum_{i=1}^{N} \gamma(z_{ik})}$
$\Lambda_{Q_{pk}}^{-1} = \frac{\sum_{i=1}^{N} \gamma(z_{ik})\, (Q_{pi} - \mu_{Q_{pk}}) (Q_{pi} - \mu_{Q_{pk}})^{T}}{\sum_{i=1}^{N} \gamma(z_{ik})}$
$\mu_{s_\theta} = \frac{\sum_{i=1}^{N} t_{i\theta}\, E[s_i]}{\sum_{i=1}^{N} t_{i\theta}}$
$\Lambda_{s_\theta}^{-1} = \frac{\sum_{i=1}^{N} t_{i\theta}\, E[s_i s_i^{T}]}{\sum_{i=1}^{N} t_{i\theta}} - \mu_{s_\theta} \mu_{s_\theta}^{T}$
$\mu_{\Delta s_{k\theta}} = \frac{\sum_{i=1}^{N} t_{i\theta}\, \gamma(z_{ik})\, (\hat{s}_i - \mu_{s_{i,k\theta}}^{\prime})}{\sum_{i=1}^{N} t_{i\theta}\, \gamma(z_{ik})}$
$\Lambda_{\Delta s_{k\theta}}^{-1} = \frac{\sum_{i=1}^{N} t_{i\theta}\, \gamma(z_{ik})\, (\hat{s}_i - \mu_{s_{i,k\theta}}^{\prime}) (\hat{s}_i - \mu_{s_{i,k\theta}}^{\prime})^{T}}{\sum_{i=1}^{N} t_{i\theta}\, \gamma(z_{ik})} + \Lambda_{s_{k\theta}}^{\prime -1} - \mu_{\Delta s_{k\theta}} \mu_{\Delta s_{k\theta}}^{T}$
$\mu_{s_{i,k\theta}}^{\prime} = \Lambda_{s_{k\theta}}^{\prime -1} \left( \Lambda_{\Delta s_{k\theta}} (\hat{s}_i - \mu_{\Delta s_{k\theta}}) + \Lambda_{s_\theta} \mu_{s_\theta} \right)$

(81) Λ′.sub.s.sub.kθ=Λ.sub.Δs.sub.kθ+Λ.sub.s.sub.θ where t.sub.iθ=1 if θ.sub.i=θ, and t.sub.iθ=0 if θ.sub.i≠θ. E is the expectation operator.
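The first three updates of Step M (for π.sub.z, μ.sub.Q.sub.pk and the inverse of Λ.sub.Q.sub.pk) can be sketched for a single scalar quality measure as follows. The responsibilities γ(z.sub.ik) are assumed to come from a preceding Step E; the function name and the example data are hypothetical, and the updates of the score-related parameters are not covered here.

```python
def m_step_quality(gamma, q):
    """Update pi_z, the means and the variances (inverse precisions) of one
    scalar quality measure.

    gamma[i][k] : responsibility of quality state k for trial i (from Step E)
    q[i]        : observed quality measure of trial i
    """
    n_states = len(gamma[0])
    counts = [sum(g[k] for g in gamma) for k in range(n_states)]
    total = sum(counts)
    pi_z = [c / total for c in counts]
    means = [sum(g[k] * x for g, x in zip(gamma, q)) / counts[k]
             for k in range(n_states)]
    variances = [sum(g[k] * (x - means[k]) ** 2 for g, x in zip(gamma, q)) / counts[k]
                 for k in range(n_states)]
    return pi_z, means, variances

# Hypothetical example: four trials with hard assignments to K = 2 states.
gamma = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
q = [1.0, 3.0, 10.0, 14.0]
print(m_step_quality(gamma, q))
# ([0.5, 0.5], [2.0, 12.0], [1.0, 4.0])
```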

ANNEX III

(82) The Maximum A Posteriori (MAP) algorithm is used to adapt the means and covariances of P(Q|z), P(Δs|θ,z) and P(s|θ) with few target data. Given the corresponding means and covariances initially included in the Bayesian Network (μ.sub.0,Q.sub.pk, Σ.sub.0,Q.sub.pk, μ.sub.0,Δs.sub.kθ, Σ.sub.0,Δs.sub.kθ, μ.sub.0,s.sub.kθ and Σ.sub.0,s.sub.kθ), which have been obtained with the development data, and the means and covariances extracted by the Bayesian Network training procedure with the target data (see Annex II; μ.sub.ML,Q.sub.pk, Σ.sub.ML,Q.sub.pk, μ.sub.ML,Δs.sub.kθ, Σ.sub.ML,Δs.sub.kθ, μ.sub.ML,s.sub.kθ and Σ.sub.ML,s.sub.kθ), adapted parameters are obtained by linear regression according to the amount of target data:

(83) $\mu_{Q_{pk}} = \frac{1}{\beta_k} \left( \beta_0 \mu_{0, Q_{pk}} + N_k \mu_{ML, Q_{pk}} \right)$
$\Sigma_{Q_{pk}} = \frac{1}{\rho_k} \left( \rho_0 \Sigma_{0, Q_{pk}} + N_k \Sigma_{ML, Q_{pk}} + \frac{\beta_0 N_k}{\beta_k} (\mu_{ML, Q_{pk}} - \mu_{0, Q_{pk}}) (\mu_{ML, Q_{pk}} - \mu_{0, Q_{pk}})^{T} \right)$
$\mu_{\Delta s_{k\theta}} = \frac{1}{\beta_k} \left( \beta_0 \mu_{0, \Delta s_{k\theta}} + N_k \mu_{ML, \Delta s_{k\theta}} \right)$
$\Sigma_{\Delta s_{k\theta}} = \frac{1}{\rho_k} \left( \rho_0 \Sigma_{0, \Delta s_{k\theta}} + N_k \Sigma_{ML, \Delta s_{k\theta}} + \frac{\beta_0 N_k}{\beta_k} (\mu_{ML, \Delta s_{k\theta}} - \mu_{0, \Delta s_{k\theta}}) (\mu_{ML, \Delta s_{k\theta}} - \mu_{0, \Delta s_{k\theta}})^{T} \right)$
$\mu_{s_{k\theta}} = \frac{1}{\beta_k} \left( \beta_0 \mu_{0, s_{k\theta}} + N_k \mu_{ML, s_{k\theta}} \right)$
$\Sigma_{s_{k\theta}} = \frac{1}{\rho_k} \left( \rho_0 \Sigma_{0, s_{k\theta}} + N_k \Sigma_{ML, s_{k\theta}} + \frac{\beta_0 N_k}{\beta_k} (\mu_{ML, s_{k\theta}} - \mu_{0, s_{k\theta}}) (\mu_{ML, s_{k\theta}} - \mu_{0, s_{k\theta}})^{T} \right)$

(84) where β.sub.0 and ρ.sub.0 are the relevance factors for the means and covariances, and N.sub.k is the number of trials belonging to quality state k in the target data. Also,
β.sub.k=N.sub.k+β.sub.0
ρ.sub.k=N.sub.k+ρ.sub.0
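For a scalar parameter, the adaptation formulas of this annex reduce to the short function below. It is a sketch only: the function and argument names are assumptions, and the covariance is replaced by a scalar variance.

```python
def map_adapt(mu_0, var_0, mu_ml, var_ml, n_k, beta_0, rho_0):
    """MAP-adapted mean and variance of a scalar Gaussian.

    mu_0, var_0    : prior mean and variance (from the development data)
    mu_ml, var_ml  : maximum likelihood estimates on the target data
    n_k            : number of target trials in quality state k
    beta_0, rho_0  : relevance factors for the mean and the variance
    """
    beta_k = n_k + beta_0
    rho_k = n_k + rho_0
    mu = (beta_0 * mu_0 + n_k * mu_ml) / beta_k
    var = (rho_0 * var_0 + n_k * var_ml
           + (beta_0 * n_k / beta_k) * (mu_ml - mu_0) ** 2) / rho_k
    return mu, var

# Hypothetical values: ten target trials and relevance factors of ten.
print(map_adapt(0.0, 1.0, 2.0, 1.0, n_k=10, beta_0=10.0, rho_0=10.0))
# (1.0, 2.0)
```

With no target data (N.sub.k=0) the function returns the prior parameters unchanged, and as N.sub.k grows the adapted mean moves towards the maximum likelihood estimate.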

CPC classification: G10L17/10, G10L17/12, G10L17/20, G10L15/01, G10L17/06, G10L15/083, G10L17/04

International classification: G10L15/00, G10L15/01, G10L17/12, G10L15/08, G10L17/06, G10L17/04, G10L17/10, G10L15/14
