SPEAKER EMBEDDING APPARATUS AND METHOD

20220270614 · 2022-08-25

Abstract

An input unit 81 inputs an observation at a current time step. A frame alignment unit 82 computes a frame alignment at the current time step by using the input observation. An i-vector computation unit 83 computes an i-vector and a precision matrix by using the computed frame alignment, the input observation, and a product obtained when computing the i-vector at the previous time step. An output unit 84 outputs the computed i-vector and precision matrix.

Claims

1. A speaker embedding apparatus using an i-vector comprising: a memory storing instructions; and one or more processors configured to execute the instructions to: input an observation at a current time step; compute a frame alignment at the current time step by using the input observation; compute an i-vector and a precision matrix by using the computed frame alignment, the input observation, and a product obtained when computing an i-vector at a previous time step; and output the computed i-vector and the precision matrix.

2. The speaker embedding apparatus according to claim 1, wherein the processor further executes instructions to update the i-vector and the precision matrix by using the i-vector and its precision matrix at the previous time step, the frame alignment, and the observation.

3. The speaker embedding apparatus according to claim 1, wherein the processor further executes instructions to update the i-vector and the precision matrix by using zero-order statistics and first-order statistics at the previous time step, the frame alignment, and the observation.

4. The speaker embedding apparatus according to claim 1, wherein the processor further executes instructions to update the i-vector and the precision matrix without directly using past observations other than the observation at the current time step.

5. The speaker embedding apparatus according to claim 1, wherein the processor further executes instructions to compute the i-vector and the precision matrix by recursively updating the product obtained at the time step of computation of the i-vector at the previous time step.

6. A speaker embedding method using an i-vector comprising: inputting an observation at a current time step; computing a frame alignment at the current time step by using the input observation; computing an i-vector and a precision matrix by using the computed frame alignment, the input observation, and a product obtained when computing an i-vector at a previous time step; and outputting the computed i-vector and the precision matrix.

7. The speaker embedding method according to claim 6, wherein the i-vector and the precision matrix are updated by using the i-vector and its precision matrix at the previous time step, the frame alignment, and the observation.

8. The speaker embedding method according to claim 6, wherein the i-vector and the precision matrix are updated by using zero-order statistics and first-order statistics at the previous time step, the frame alignment, and the observation.

9. A non-transitory computer readable recording medium storing a speaker embedding program using an i-vector that, when executed by a processor, performs a method comprising: inputting an observation at a current time step; computing a frame alignment at the current time step by using the input observation; computing an i-vector and a precision matrix by using the computed frame alignment, the input observation, and a product obtained when computing an i-vector at a previous time step; and outputting the computed i-vector and the precision matrix.

10. The non-transitory computer readable recording medium according to claim 9, wherein the i-vector and the precision matrix are updated by using the i-vector and its precision matrix at the previous time step, the frame alignment, and the observation.

11. The non-transitory computer readable recording medium according to claim 9, wherein the i-vector and the precision matrix are updated by using zero-order statistics and first-order statistics at the previous time step, the frame alignment, and the observation.

Description

BRIEF DESCRIPTION OF DRAWINGS

[0022] FIG. 1 depicts an exemplary block diagram illustrating the structure of the first exemplary embodiment of a speaker embedding apparatus according to the present invention.

[0023] FIG. 2 depicts an exemplary explanatory diagram illustrating the process of the first exemplary embodiment of the speaker embedding apparatus according to the present invention.

[0024] FIG. 3 depicts a flowchart illustrating the process of the first exemplary embodiment of the speaker embedding apparatus according to the present invention.

[0025] FIG. 4 depicts an exemplary block diagram illustrating the structure of the second exemplary embodiment of a speaker embedding apparatus according to the present invention.

[0026] FIG. 5 depicts an exemplary explanatory diagram illustrating the process of the second exemplary embodiment of the speaker embedding apparatus according to the present invention.

[0027] FIG. 6 depicts a flowchart illustrating the process of the second exemplary embodiment of the speaker embedding apparatus according to the present invention.

[0028] FIG. 7 depicts a block diagram illustrating an outline of the speaker embedding apparatus according to the present invention.

[0029] FIG. 8 depicts a schematic block diagram illustrating a configuration example of the computer according to the exemplary embodiments of the present invention.

[0030] FIG. 9 depicts an exemplary explanatory diagram illustrating a general extraction example of the i-vector.

DESCRIPTION OF EMBODIMENTS

[0031] The following describes exemplary embodiments of the present invention with reference to the drawings. In the present invention, when a new observation is given, the products obtained at the time step of computation of the i-vector are recursively and continuously updated, so that raw data (feature vectors) need not be held. The products are the i-vectors themselves or intermediate representations. Examples of intermediate representations include statistics such as zero-order statistics and first-order statistics.

[0032] In the present invention, since it is not necessary to keep raw data, it is possible to reduce the storage capacity. Also, since the i-vector provides the highest level of abstraction compared with acoustic features and has better irreversibility, it is also possible to meet data privacy requirements. Also, according to the present invention, an exact solution can be obtained rather than an approximate one, as compared with a general offline i-vector.

First Exemplary Embodiment

[0033] In the first exemplary embodiment, a method of performing speaker embedding by recursively updating the i-vector will be described. FIG. 1 depicts an exemplary block diagram illustrating the structure of a first exemplary embodiment of a speaker embedding apparatus according to the present invention. The speaker embedding apparatus 100 according to the present exemplary embodiment includes a storage unit 110, an input unit 120, a computation unit 130 and an output unit 140.

[0034] The speaker embedding apparatus 100 is connected to a recognition device 10, and the recognition device 10 performs speaker recognition (verification) using the processing result of the speaker embedding apparatus 100. Therefore, a system including the speaker embedding apparatus 100 of the present exemplary embodiment and the recognition device 10 can be referred to as a speaker recognition system (speaker verification system).

[0035] The storage unit 110 stores the computation result by the computation unit 130 described later. In addition, the storage unit 110 may store observations input by the input unit 120 described later. Note that the speaker embedding apparatus 100 according to the present exemplary embodiment updates the i-vector at the current time step using the products obtained at the time step of computation of the previous i-vector. Therefore, the speaker embedding apparatus 100 does not have to store all the past observations. The storage unit 110 also stores various parameters used for computation by the computation unit 130 described later. The storage unit 110 is realized by, for example, a magnetic disk or the like.

[0036] The input unit 120 receives an input of observations used by the computation unit 130 described later for updating the i-vector. Specifically, the input unit 120 receives the observation o.sub.t at the current time step t. The input unit 120 may also receive input of various parameters used for computation by the computation unit 130 described later.

[0037] The computation unit 130 updates the i-vector using the observation o.sub.t at the current time step t and the products obtained at the time step of computation of the i-vector at the previous time step t−1. In this exemplary embodiment, the computation unit 130 uses the observation o.sub.t at the current time step t and the i-vector [phi].sub.t-1 and its precision matrix L.sub.t-1 at the previous time step t−1 to compute the i-vector [phi].sub.t and precision matrix L.sub.t.

[0038] Specifically, first, the computation unit 130 computes an alignment [gamma].sub.C, t of the feature vector o.sub.t to each of the C Gaussian components. In the Gaussian mixture model-universal background model (GMM-UBM) approach, [gamma].sub.C, t can be said to be the posterior probability that the feature vector o.sub.t is generated from the c-th component distribution of the UBM. The computation unit 130 may compute the alignment [gamma].sub.C, t according to Equation 1 described above.
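As an illustration only, this frame-alignment step can be sketched in Python for a C-component diagonal-covariance GMM-UBM. Equation 1 itself is not reproduced in this excerpt, so the standard posterior formula gamma_c,t = omega_c N(o_t; mu_c, Sigma_c) / sum_k omega_k N(o_t; mu_k, Sigma_k) is assumed, and all parameter values below are random placeholders rather than a trained UBM:

```python
import numpy as np

# Hypothetical parameters of a C-component diagonal-covariance UBM
# (C, omega, mu, var stand in for the trained parameters referenced
# by Equation 1; the values are illustrative only).
rng = np.random.default_rng(0)
C, D = 4, 3                      # number of Gaussians, feature dimension
omega = np.full(C, 1.0 / C)      # mixture weights omega_c
mu = rng.normal(size=(C, D))     # component means mu_c
var = np.ones((C, D))            # diagonal covariances Sigma_c

def frame_alignment(o_t):
    """Posterior gamma_{c,t} that frame o_t was generated by component c."""
    diff = o_t - mu
    # log N(o_t; mu_c, Sigma_c) for a diagonal covariance
    log_norm = -0.5 * (np.sum(np.log(2 * np.pi * var), axis=1)
                       + np.sum(diff ** 2 / var, axis=1))
    log_post = np.log(omega) + log_norm
    log_post -= np.max(log_post)          # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()              # gamma_{c,t}, sums to 1 over c

gamma = frame_alignment(rng.normal(size=D))
```

The log-domain computation with the max subtracted is a common safeguard against underflow when the Gaussians are far from the frame.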

[0039] Next, the computation unit 130 computes the i-vector [phi].sub.t and its precision matrix L.sub.t. Specifically, the computation unit 130 updates i-vector [phi].sub.t and its precision matrix L.sub.t using the i-vector [phi].sub.t-1 and its precision matrix L.sub.t-1 estimated (computed) at previous time step t−1, and the observation o.sub.t and its alignment [gamma].sub.C, t at current time step t computed above.

[0040] The computation unit 130 updates the i-vector [phi].sub.t and its precision matrix L.sub.t using Equation 6 and Equation 7 described below.


[Math. 4]


ϕ.sub.t=L.sub.t.sup.−1[Σ.sub.c=1.sup.Cγ.sub.c,tT.sub.c.sup.TΣ.sub.c.sup.−1(o.sub.t−μ.sub.c)+L.sub.t-1ϕ.sub.t-1]  (Equation 6)


L.sub.t=[Σ.sub.c=1.sup.Cγ.sub.c,tT.sub.c.sup.TΣ.sub.c.sup.−1T.sub.c+L.sub.t-1]  (Equation 7)

[0041] C, [omega].sub.C, [mu].sub.C, [Sigma].sub.C, and T.sub.C are the same as the parameters described above. The observation (feature vector) o.sub.t and the number [tau] of feature vectors in the set are also the same as described above. [phi].sub.t-1 represents the i-vector estimated at the previous time step t−1, and L.sub.t-1 is the precision matrix of the i-vector estimated at the previous time step t−1.
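The update of Equations 6 and 7 can be sketched as follows. This is an illustrative Python rendering with toy random parameters (the names `update`, `T`, `Sigma_inv` are placeholders, and Sigma_c is fixed to the identity for brevity), not the claimed implementation:

```python
import numpy as np

# Illustrative rendering of the Equation 6/7 recursive update with
# toy random parameters; T_c and mu_c are placeholders.
rng = np.random.default_rng(1)
C, D, R = 4, 3, 2                          # components, feature dim, i-vector dim
mu = rng.normal(size=(C, D))               # component means mu_c
Sigma_inv = np.stack([np.eye(D)] * C)      # Sigma_c^{-1} blocks (identity here)
T = rng.normal(size=(C, D, R))             # total-variability blocks T_c

def update(phi_prev, L_prev, o_t, gamma_t):
    """One recursive step: (phi_{t-1}, L_{t-1}) -> (phi_t, L_t)."""
    # Equation 7: L_t = sum_c gamma_{c,t} T_c^T Sigma_c^{-1} T_c + L_{t-1}
    L_t = L_prev + sum(gamma_t[c] * T[c].T @ Sigma_inv[c] @ T[c] for c in range(C))
    # Equation 6: phi_t = L_t^{-1} [sum_c gamma_{c,t} T_c^T Sigma_c^{-1} (o_t - mu_c)
    #                               + L_{t-1} phi_{t-1}]
    rhs = sum(gamma_t[c] * T[c].T @ Sigma_inv[c] @ (o_t - mu[c]) for c in range(C))
    phi_t = np.linalg.solve(L_t, rhs + L_prev @ phi_prev)
    return phi_t, L_t

# initial state phi_0 = 0, L_0 = I, followed by one update step
phi, L = np.zeros(R), np.eye(R)
phi, L = update(phi, L, rng.normal(size=D), np.full(C, 1.0 / C))
```

Note that only (phi, L) persists between steps; no past observation is retained, which is the storage property the embodiment relies on.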

[0042] Thereafter, the input unit 120 and the computation unit 130 repeat the above processing each time a new observation is received. FIG. 2 is an exemplary explanatory diagram illustrating the process of the first exemplary embodiment of the speaker embedding apparatus 100 according to the present invention. First, at time step t=1, when the input unit 120 receives the observation o.sub.1, the computation unit 130 computes a frame alignment [gamma].sub.c, 1 based on the above-described parameters and the observation o.sub.1. Then, the computation unit 130 updates the i-vector and the precision matrix. In the initial state, they are initialized as [phi].sub.0=0 and L.sub.0=I, and the computation unit 130 updates the i-vector [phi].sub.1 and its precision matrix L.sub.1 by using {o.sub.1, [gamma].sub.c, 1}, [phi].sub.0 and L.sub.0.

[0043] Next, at time step t=2, when the input unit 120 receives the observation o.sub.2, the computation unit 130 computes a frame alignment [gamma].sub.c, 2 based on the above-described parameters and the observation o.sub.2. Then, the computation unit 130 updates the i-vector [phi].sub.2 and its precision matrix L.sub.2 by using [phi].sub.1 and L.sub.1 estimated at the previous time step t=1, and {o.sub.2, [gamma].sub.c, 2}. The same applies to time step t=3. Thereafter, each time the input unit 120 receives an observation, the above process is recursively repeated.

[0044] That is, when the input unit 120 receives an observation o.sub.[tau] at the current time step t=[tau], the computation unit 130 computes the frame alignment [gamma].sub.c, [tau] based on the above-described parameters and the observation o.sub.[tau]. Then, the computation unit 130 updates the i-vector [phi].sub.[tau] and its precision matrix L.sub.[tau] by using the i-vector [phi].sub.[tau]-1 and its precision matrix L.sub.[tau]-1 estimated at the previous time step t=[tau]−1, and {o.sub.[tau], [gamma].sub.c, [tau]}.

[0045] The output unit 140 outputs the updated i-vector [phi].sub.[tau] and its precision matrix L.sub.[tau]. The output unit 140 may output, for example, the i-vector [phi].sub.[tau] and its precision matrix L.sub.[tau] to the recognition device 10. The recognition device 10 may perform recognition (verification) processing using the updated i-vector [phi].sub.[tau] and its precision matrix L.sub.[tau].

[0046] The input unit 120, the computation unit 130 and the output unit 140 are implemented by a CPU of a computer operating according to a program (speaker embedding program). For example, the program may be stored in the storage unit 110, with the CPU reading the program and, according to the program, operating as the input unit 120, the computation unit 130 and the output unit 140. The functions of the speaker embedding apparatus may be provided in the form of SaaS (Software as a Service).

[0047] The input unit 120, the computation unit 130 and the output unit 140 may each be implemented by dedicated hardware. All or part of the components of each device may be implemented by general-purpose or dedicated circuitry, processors, or combinations thereof. They may be configured with a single chip, or configured with a plurality of chips connected via a bus. All or part of the components of each device may be implemented by a combination of the above-mentioned circuitry or the like and program.

[0048] In the case where all or part of the components of each device is implemented by a plurality of information processing devices, circuitry, or the like, the plurality of information processing devices, circuitry, or the like may be centralized or distributed. For example, the information processing devices, circuitry, or the like may be implemented in a form in which they are connected via a communication network, such as a client-and-server system or a cloud computing system.

[0049] Next, an operation example of the speaker embedding apparatus 100 according to the present exemplary embodiment will be described. FIG. 3 is a flowchart illustrating the process of the first exemplary embodiment of the speaker embedding apparatus 100 according to the present invention. First, the input unit 120 inputs initial conditions [phi].sub.0=0 and L.sub.0=I, and parameters {C, [omega].sub.C, [mu].sub.C, [Sigma].sub.C, and T.sub.C} (step S11). The initial conditions and parameters may be stored in advance in the storage unit 110.

[0050] Subsequently, the processing from step S12 to step S15 is repeated for each observation o.sub.t which is an element of {o.sub.1, o.sub.2, . . . , o.sub.[tau]}. The input unit 120 receives an input of the observation o.sub.t (step S12). The computation unit 130 computes the frame alignment [gamma].sub.c, t by using Equation 1 described above (step S13). Then, the computation unit 130 updates the precision matrix from L.sub.t-1 to L.sub.t by using Equation 7 described above (step S14), and updates the i-vector from [phi].sub.t-1 to [phi].sub.t by using Equation 6 described above (step S15). The computation unit 130 may store the computed i-vector and precision matrix in the storage unit 110.

[0051] Then, the output unit 140 outputs the computed sequence of i-vectors {[phi].sub.1, [phi].sub.2, . . . , [phi].sub.[tau]} and their precision matrices {L.sub.1, L.sub.2, . . . , L.sub.[tau]} (step S16).

[0052] Next, it will be described that the i-vector is appropriately updated by the speaker embedding apparatus 100 according to the present exemplary embodiment. The term L.sub.t-1[phi].sub.t-1 included in the above Equation 6 can be expanded as the following Equation 8.


[Math. 5]


L.sub.t-1L.sub.t-1.sup.−1(Σ.sub.c=1.sup.Cγ.sub.c,t-1T.sub.c.sup.TΣ.sub.c.sup.−1(o.sub.t-1−μ.sub.c)+L.sub.t-2ϕ.sub.t-2)  (Equation 8)

[0053] Since L.sub.t-1L.sub.t-1.sup.−1 is an identity matrix, only the expression in parentheses remains. By repeating this expansion, Equation 9 described below can be derived.

[Math. 6]

$$
\begin{aligned}
\phi_t &= L_t^{-1}\Big[\textstyle\sum_{c=1}^{C}\gamma_{c,t}T_c^{\mathsf T}\Sigma_c^{-1}(o_t-\mu_c)+L_{t-1}\phi_{t-1}\Big]\\
&= L_t^{-1}\Big[\textstyle\sum_{c=1}^{C}\gamma_{c,t}T_c^{\mathsf T}\Sigma_c^{-1}(o_t-\mu_c)+\Big(\textstyle\sum_{c=1}^{C}\gamma_{c,t-1}T_c^{\mathsf T}\Sigma_c^{-1}(o_{t-1}-\mu_c)+L_{t-2}\phi_{t-2}\Big)\Big]\\
&= L_t^{-1}\Big[\textstyle\sum_{c=1}^{C}\gamma_{c,t}T_c^{\mathsf T}\Sigma_c^{-1}(o_t-\mu_c)+\cdots+\textstyle\sum_{c=1}^{C}\gamma_{c,1}T_c^{\mathsf T}\Sigma_c^{-1}(o_1-\mu_c)+L_0\phi_0\Big]\\
&= L_t^{-1}\Big[\textstyle\sum_{c=1}^{C}\gamma_{c,t}T_c^{\mathsf T}\Sigma_c^{-1}(o_t-\mu_c)+\cdots+\textstyle\sum_{c=1}^{C}\gamma_{c,1}T_c^{\mathsf T}\Sigma_c^{-1}(o_1-\mu_c)\Big]\\
&= L_t^{-1}\Big[\textstyle\sum_{c=1}^{C}\sum_{l=1}^{t}\gamma_{c,l}T_c^{\mathsf T}\Sigma_c^{-1}(o_l-\mu_c)\Big]\\
&= L_t^{-1}\Big[\textstyle\sum_{c=1}^{C}T_c^{\mathsf T}\Sigma_c^{-1}\sum_{l=1}^{t}\gamma_{c,l}(o_l-\mu_c)\Big]
\end{aligned}
\qquad\text{(Equation 9)}
$$

[0054] The above Equation 9 is equal to the general offline computed i-vector described by the above Equation 4.

[0055] Similarly, the term L.sub.t-1 included in the above Equation 7 can be expanded as the following Equation 10.


[Math. 7]


Σ.sub.c=1.sup.Cγ.sub.c,t-1T.sub.c.sup.TΣ.sub.c.sup.−1T.sub.c+L.sub.t-2  (Equation 10)

[0056] By repeating this expansion process, Equation 11 described below can be derived.

[Math. 8]

$$
\begin{aligned}
L_t^{-1} &= \Big[\textstyle\sum_{c=1}^{C}\gamma_{c,t}T_c^{\mathsf T}\Sigma_c^{-1}T_c+L_{t-1}\Big]^{-1}\\
&= \Big[\textstyle\sum_{c=1}^{C}\gamma_{c,t}T_c^{\mathsf T}\Sigma_c^{-1}T_c+\textstyle\sum_{c=1}^{C}\gamma_{c,t-1}T_c^{\mathsf T}\Sigma_c^{-1}T_c+L_{t-2}\Big]^{-1}\\
&= \Big[\textstyle\sum_{c=1}^{C}\gamma_{c,t}T_c^{\mathsf T}\Sigma_c^{-1}T_c+\textstyle\sum_{c=1}^{C}\gamma_{c,t-1}T_c^{\mathsf T}\Sigma_c^{-1}T_c+\cdots+\textstyle\sum_{c=1}^{C}\gamma_{c,1}T_c^{\mathsf T}\Sigma_c^{-1}T_c+L_0\Big]^{-1}\\
&= \Big[\textstyle\sum_{c=1}^{C}\sum_{l=1}^{t}\gamma_{c,l}T_c^{\mathsf T}\Sigma_c^{-1}T_c+I\Big]^{-1}\\
&= \Big[\textstyle\sum_{c=1}^{C}N_cT_c^{\mathsf T}\Sigma_c^{-1}T_c+I\Big]^{-1}
\end{aligned}
\qquad\text{(Equation 11)}
$$

[0057] Equation 11 is equal to the general offline computed precision matrix described in Equation 5 above.
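The equivalence established by Equations 9 and 11 can be checked numerically. The sketch below (toy random parameters, with Sigma_c fixed to the identity so that Sigma_c^{-1} drops out) runs the Equation 6/7 recursion over a short sequence and compares the result with the offline batch solution:

```python
import numpy as np

# Toy numerical check that the recursion of Equations 6/7 reproduces the
# offline batch solution of Equations 9/11. All values are placeholders.
rng = np.random.default_rng(2)
C, D, R, tau = 3, 4, 2, 5                   # components, feature dim, i-vector dim, frames
mu = rng.normal(size=(C, D))                # component means mu_c
T = rng.normal(size=(C, D, R))              # total-variability blocks T_c
obs = rng.normal(size=(tau, D))             # observations o_1..o_tau
gam = rng.dirichlet(np.ones(C), size=tau)   # frame alignments gamma_{c,t}

# recursive (online) estimate: phi_0 = 0, L_0 = I
phi, L = np.zeros(R), np.eye(R)
for t in range(tau):
    L_new = L + sum(gam[t, c] * T[c].T @ T[c] for c in range(C))      # Equation 7
    rhs = sum(gam[t, c] * T[c].T @ (obs[t] - mu[c]) for c in range(C))
    phi = np.linalg.solve(L_new, rhs + L @ phi)                       # Equation 6
    L = L_new

# offline (batch) estimate over all frames at once (Equations 9 and 11)
L_off, b = np.eye(R), np.zeros(R)
for c in range(C):
    for t in range(tau):
        L_off += gam[t, c] * T[c].T @ T[c]
        b += gam[t, c] * T[c].T @ (obs[t] - mu[c])
phi_off = np.linalg.solve(L_off, b)

assert np.allclose(L, L_off) and np.allclose(phi, phi_off)
```

The two estimates agree to floating-point precision, which is the "exact solution, not an approximation" property claimed for the recursion.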

[0058] As described above, according to the present exemplary embodiment, the input unit 120 inputs the observation o.sub.t at the current time step t, and the computation unit 130 computes the frame alignment [gamma] at the current time step t by using the input observation o.sub.t. Furthermore, the computation unit 130 computes the i-vector and the precision matrix by using the computed frame alignment [gamma], the input observation o.sub.t, and a product obtained when computing the i-vector at the previous time step t−1, and the output unit 140 outputs the computed i-vector and precision matrix. Specifically, the computation unit 130 updates the i-vector [phi].sub.t and the precision matrix L.sub.t by using the i-vector [phi].sub.t-1 and its precision matrix L.sub.t-1 at the previous time step t−1, the frame alignment [gamma], and the observation o.sub.t. Therefore, it is possible to realize speaker embedding in real time while reducing the storage capacity.

[0059] That is, in the present exemplary embodiment, the computation unit 130 updates the i-vector and the precision matrix without directly using past observations other than the observation o.sub.t at the current time step t. In other words, to estimate the i-vector [phi].sub.t and its precision matrix L.sub.t at the current time step t, only the feature vector o.sub.t at the current time step t, and the i-vector [phi].sub.t-1 and its covariance matrix L.sub.t-1.sup.−1 at the previous time step t−1, are required. Therefore, there is no need to store past raw features, and the storage capacity can be reduced.

Second Exemplary Embodiment

[0060] In the second exemplary embodiment, a method of performing speaker embedding by recursively updating an intermediate representation will be described. FIG. 4 depicts an exemplary block diagram illustrating the structure of a second exemplary embodiment of a speaker embedding apparatus according to the present invention. The speaker embedding apparatus 200 according to the present exemplary embodiment includes a storage unit 210, an input unit 220, a computation unit 230 and an output unit 240.

[0061] The speaker embedding apparatus 200 is also connected to the recognition device 10, and the recognition device 10 performs speaker recognition (verification) using the processing result of the speaker embedding apparatus 200. Therefore, a system including the speaker embedding apparatus 200 of the present exemplary embodiment and the recognition device 10 can be referred to as a speaker recognition system (speaker verification system).

[0062] The storage unit 210 stores the computation result by the computation unit 230 described later. In addition, the storage unit 210 may store observations input by the input unit 220 described later. Note that the speaker embedding apparatus 200 according to the present exemplary embodiment also updates the i-vector at the current time step using the products obtained at the time step of computation of the previous i-vector. Therefore, the speaker embedding apparatus 200 does not have to store all the past observations. The storage unit 210 also stores various parameters used for computation by the computation unit 230 described later. The storage unit 210 is realized by, for example, a magnetic disk or the like.

[0063] The input unit 220 receives an input of observations used by the computation unit 230 described later for updating the i-vector. Specifically, the input unit 220 receives the observation o.sub.t at the current time step t. The input unit 220 may also receive input of various parameters used for computation by the computation unit 230 described later.

[0064] The computation unit 230 updates the i-vector using the observation o.sub.t at the current time step t and the products obtained at the time step of computation of the i-vector at the previous time step t−1. In this exemplary embodiment, the computation unit 230 uses the observation o.sub.t at the current time step t and the zero-order statistics and first-order statistics at the previous time step t−1 to compute the i-vector [phi].sub.t and precision matrix L.sub.t.

[0065] Specifically, first, the computation unit 230 computes an alignment [gamma].sub.C, t of the feature vector o.sub.t to each of the C Gaussian components by Equation 1 described above, similarly to the computation unit 130 of the first exemplary embodiment.

[0066] Next, the computation unit 230 computes the zero-order statistics and the first-order statistics. Specifically, the computation unit 230 updates the zero-order statistics and the first-order statistics using the zero-order statistics and the first-order statistics estimated (computed) at previous time step t−1, and the observation o.sub.t and its alignment [gamma].sub.C, t at current time step t computed above.

[0067] The computation unit 230 updates the zero-order statistics N.sub.C(t) and the first-order statistics F.sub.C(t) using Equation 12 and Equation 13 described below.


[Math. 9]


N.sub.c(t)=N.sub.c(t−1)+γ.sub.c,t  (Equation 12)


F.sub.c(t)=F.sub.c(t−1)+γ.sub.c,t(o.sub.t−μ.sub.c)  (Equation 13)

[0068] Then, the computation unit 230 infers the i-vector [phi].sub.t and its precision matrix L.sub.t using the updated zero-order statistics and first-order statistics. The computation unit 230 may estimate the i-vector [phi].sub.t and its precision matrix L.sub.t using Equation 4 and Equation 5 described above.
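A minimal sketch of this second embodiment, assuming toy placeholder parameters and Sigma_c = I: the statistics of Equations 12 and 13 are accumulated per frame, and the i-vector and precision matrix are then inferred from them. Since Equations 4 and 5 are not reproduced in this excerpt, the standard forms L = I + Σ.sub.c N.sub.c T.sub.c.sup.T Σ.sub.c.sup.−1 T.sub.c and φ = L.sup.−1 Σ.sub.c T.sub.c.sup.T Σ.sub.c.sup.−1 F.sub.c are assumed here:

```python
import numpy as np

# Minimal sketch of the second embodiment: keep only the zero/first-order
# statistics and infer the i-vector from them. Toy random parameters;
# Sigma_c = I; the Equation 4/5 forms used in infer() are assumptions.
rng = np.random.default_rng(3)
C, D, R = 3, 4, 2
mu = rng.normal(size=(C, D))     # component means mu_c (placeholders)
T = rng.normal(size=(C, D, R))   # total-variability blocks T_c

N = np.zeros(C)                  # zero-order statistics, N_c(0) = 0
F = np.zeros((C, D))             # first-order statistics, F_c(0) = 0

def accumulate(o_t, gamma_t):
    """Equations 12 and 13: fold one frame into the running statistics."""
    global N, F
    N = N + gamma_t                          # N_c(t) = N_c(t-1) + gamma_{c,t}
    F = F + gamma_t[:, None] * (o_t - mu)    # F_c(t) = F_c(t-1) + gamma_{c,t}(o_t - mu_c)

def infer():
    """Infer the i-vector and its precision matrix from the statistics."""
    L = np.eye(R) + sum(N[c] * T[c].T @ T[c] for c in range(C))
    b = sum(T[c].T @ F[c] for c in range(C))
    return np.linalg.solve(L, b), L

for _ in range(4):               # a short stream of 4 frames
    accumulate(rng.normal(size=D), rng.dirichlet(np.ones(C)))
phi, L = infer()
```

Compared with the first embodiment, the per-frame accumulation is cheaper, and the matrix solve is deferred to whenever an i-vector is actually needed.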

[0069] Thereafter, the input unit 220 and the computation unit 230 repeat the above processing each time a new observation is received. FIG. 5 is an exemplary explanatory diagram illustrating the process of the second exemplary embodiment of the speaker embedding apparatus 200 according to the present invention. First, at time step t=1, when the input unit 220 receives the observation o.sub.1, the computation unit 230 computes a frame alignment [gamma].sub.c, 1 based on the above-described parameters and the observation o.sub.1. Then, the computation unit 230 updates the zero-order statistics and the first-order statistics. In the initial state, the statistics are initialized as N.sub.C(0)=0 and F.sub.C(0)=0 for each c, and the computation unit 230 updates the zero-order statistics N.sub.C(1) and the first-order statistics F.sub.C(1) by using {o.sub.1, [gamma].sub.c, 1}, N.sub.C(0) and F.sub.C(0).

[0070] Then, the computation unit 230 infers the i-vector [phi].sub.1 and its precision matrix L.sub.1 by using the updated zero-order statistic N.sub.C(1) and the first-order statistic F.sub.C(1).

[0071] Next, at time step t=2, when the input unit 220 receives the observation o.sub.2, the computation unit 230 computes a frame alignment [gamma].sub.c, 2 based on the above-described parameters and the observation o.sub.2. The computation unit 230 updates the zero-order statistics N.sub.C(2) and the first-order statistics F.sub.C(2) by using the zero-order statistics N.sub.C(1) and the first-order statistics F.sub.C(1) updated at the previous time step t=1, and {o.sub.2, [gamma].sub.c, 2}. Then, the computation unit 230 infers the i-vector [phi].sub.2 and its precision matrix L.sub.2 by using the updated zero-order statistic N.sub.C(2) and the first-order statistic F.sub.C(2). The same applies to time step t=3. Thereafter, each time the input unit 220 receives an observation, the above process is recursively repeated.

[0072] That is, when the input unit 220 receives an observation o.sub.[tau] at the current time step t=[tau], the computation unit 230 computes the frame alignment [gamma].sub.c, [tau] based on the above-described parameters and the observation o.sub.[tau]. The computation unit 230 updates the zero-order statistics N.sub.C([tau]) and the first-order statistics F.sub.C([tau]) by using the zero-order statistics N.sub.C([tau]−1) and the first-order statistics F.sub.C([tau]−1) updated at the previous time step t=[tau]−1, and {o.sub.[tau], [gamma].sub.c, [tau]}. Then, the computation unit 230 infers the i-vector [phi].sub.[tau] and its precision matrix L.sub.[tau] by using the updated zero-order statistic N.sub.C([tau]) and first-order statistic F.sub.C([tau]).

[0073] The output unit 240 outputs the updated i-vector [phi].sub.[tau] and its precision matrix L.sub.[tau]. The output unit 240 may output, as in the first exemplary embodiment, the i-vector [phi].sub.[tau] and its precision matrix L.sub.[tau] to the recognition device 10. The recognition device 10 may perform recognition (verification) processing using the updated i-vector [phi].sub.[tau] and its precision matrix L.sub.[tau].

[0074] The input unit 220, the computation unit 230 and the output unit 240 are implemented by a CPU of a computer operating according to a program (speaker embedding program).

[0075] Next, an operation example of the speaker embedding apparatus 200 according to the present exemplary embodiment will be described. FIG. 6 is a flowchart illustrating the process of the second exemplary embodiment of the speaker embedding apparatus 200 according to the present invention. First, the input unit 220 inputs initial conditions N.sub.C(0)=0 and F.sub.C(0)=0, and parameters {C, [omega].sub.C, [mu].sub.C, [Sigma].sub.C, and T.sub.C} (step S21). The initial conditions and parameters may be stored in advance in the storage unit 210.

[0076] Subsequently, the processing from step S22 to step S27 is repeated for each observation o.sub.t which is an element of {o.sub.1, o.sub.2, . . . , o.sub.[tau]}. The input unit 220 receives an input of the observation o.sub.t (step S22). The computation unit 230 computes the frame alignment [gamma].sub.c, t by using Equation 1 described above (step S23). Then, the computation unit 230 updates the zero-order statistic from N.sub.C(t−1) to N.sub.C(t) by using Equation 12 described above (step S24), and updates the first-order statistic from F.sub.C(t−1) to F.sub.C(t) by using Equation 13 described above (step S25).

[0077] The computation unit 230 infers the precision matrix L.sub.t using Equation 5 described above (step S26), and infers the i-vector [phi].sub.t using Equation 4 described above (step S27). The computation unit 230 may store the computed i-vector and precision matrix in the storage unit 210.

[0078] Then, the output unit 240 outputs the inferred sequence of i-vectors {[phi].sub.1, [phi].sub.2, . . . , [phi].sub.[tau]} and their precision matrices {L.sub.1, L.sub.2, . . . , L.sub.[tau]} (step S28).

[0079] Next, it will be described that the i-vector is appropriately inferred by the speaker embedding apparatus 200 according to the present exemplary embodiment. The above Equation 2 can be expanded as the following Equation 14.


[Math. 10]


N.sub.c=Σ.sub.t=1.sup.τ-1γ.sub.c,t+γ.sub.c,τ  (Equation 14)

[0080] The first term corresponds to the zero-order statistic at t=[tau]−1, and the second term can be calculated from the observation o.sub.[tau] at t=[tau].

[0081] Similarly, the above Equation 3 can be expanded as the following Equation 15.


[Math. 11]


F.sub.c=Σ.sub.t=1.sup.τ-1γ.sub.c,t(o.sub.t−μ.sub.c)+γ.sub.c,τ(o.sub.τ−μ.sub.c)  (Equation 15)

[0082] The first term corresponds to the first-order statistic at t=[tau]−1, and the second term can be calculated from the observation o.sub.[tau] at t=[tau].

[0083] Therefore, the recursively computed Equations 14 and 15 are equal to the general offline computed zero-order statistics and first-order statistics described in Equations 2 and 3, respectively.
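This equivalence can be verified numerically. The sketch below (toy random values; `mu` is a placeholder for the UBM means) accumulates the statistics frame by frame per Equations 12 and 13 and compares them with the batch sums of Equations 2 and 3:

```python
import numpy as np

# Toy numerical check: accumulating the statistics frame by frame with
# Equations 12/13 gives the same result as the batch sums of Equations 2/3.
rng = np.random.default_rng(4)
C, D, tau = 3, 2, 6
mu = rng.normal(size=(C, D))                 # UBM component means (placeholders)
obs = rng.normal(size=(tau, D))              # observation sequence o_1..o_tau
gam = rng.dirichlet(np.ones(C), size=tau)    # frame alignments gamma_{c,t}

# streaming accumulation (Equations 12 and 13)
N, F = np.zeros(C), np.zeros((C, D))
for t in range(tau):
    N = N + gam[t]                            # N_c(t) = N_c(t-1) + gamma_{c,t}
    F = F + gam[t][:, None] * (obs[t] - mu)   # F_c(t) = F_c(t-1) + gamma_{c,t}(o_t - mu_c)

# batch computation over the whole sequence at once
N_batch = gam.sum(axis=0)
diff = obs[:, None, :] - mu[None, :, :]       # shape (tau, C, D)
F_batch = np.einsum('tc,tcd->cd', gam, diff)

assert np.allclose(N, N_batch) and np.allclose(F, F_batch)
```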

[0084] As described above, according to the present exemplary embodiment, the computation unit 230 updates the i-vector [phi].sub.t and the precision matrix L.sub.t by using the zero-order statistics and first-order statistics at the previous time step t−1, the frame alignment [gamma] and the observation o.sub.t. Therefore, as in the first exemplary embodiment, it is possible to realize speaker embedding in real time while reducing the storage capacity.

[0085] That is, in the present exemplary embodiment, the computation unit 230 also updates the i-vector and the precision matrix without directly using past observations other than the observation o.sub.t at the current time step t. In other words, to estimate the i-vector [phi].sub.t and its precision matrix L.sub.t at the current time step t, only the feature vector o.sub.t at the current time step t, and the zero-order statistics and the first-order statistics at the previous time step t−1, are required. Therefore, there is no need to store past raw features, and the storage capacity can be reduced.

[0086] Next, an outline of the present invention will be described. FIG. 7 depicts a block diagram illustrating an outline of the speaker embedding apparatus according to the present invention. The speaker embedding apparatus 80 (for example, the speaker embedding apparatus 100, 200) using an i-vector includes: an input unit 81 (for example, the input unit 120, 220) which inputs an observation (for example, the observation o.sub.t) at a current time step (for example, time step t); a frame alignment unit 82 (for example, the computation unit 130, 230) which computes a frame alignment (for example, the frame alignment [gamma]) at the current time step by using the input observation; an i-vector computation unit 83 (for example, the computation unit 130, 230) which computes an i-vector (for example, the i-vector [phi]) and a precision matrix (for example, L) by using the computed frame alignment, the input observation, and a product (for example, the i-vector, the precision matrix, the zero-order statistics, and the first-order statistics) obtained when computing the i-vector at the previous time step (for example, time step t−1); and an output unit 84 (for example, the output unit 140, 240) which outputs the computed i-vector and precision matrix.

[0087] With such a configuration, it is possible to realize speaker embedding in real time while reducing the storage capacity.

[0088] At that time, the i-vector computation unit 83 may update the i-vector and the precision matrix by using the i-vector (for example, i-vector [phi].sub.t-1) and its precision matrix (for example, precision matrix L.sub.t-1) at the previous time step (for example, time step t−1), the frame alignment, and the observation.

[0089] Also, the i-vector computation unit 83 may update the i-vector and the precision matrix by using zero-order statistics (for example, N.sub.C(t)) and first-order statistics (for example, F.sub.C(t)) at the previous time step (for example, time step t−1), the frame alignment, and the observation.

[0090] Specifically, the i-vector computation unit 83 may update the i-vector and the precision matrix without directly using past observations other than the observation at the current time step.

[0091] Also, the i-vector computation unit 83 may compute the i-vector and the precision matrix by recursively updating the product obtained at the time step of computation of the i-vector at the previous time step.
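The recursive update of the product can be sketched in isolation for the precision matrix. In this hedged sketch (placeholder T, diagonal inv_sigma, and random frame alignments, none of which come from the source), the per-component terms G_c = T_c^T Sigma_c^{-1} T_c are precomputed once, and L_t is obtained from L_{t−1} and the current frame alignment alone; the batch quantity L_batch, rebuilt from the accumulated zero-order statistics, coincides with the recursively updated L.

```python
import numpy as np

# Illustrative sketch: C components, feature dim D, i-vector dim R,
# with randomly generated placeholder model parameters.
rng = np.random.default_rng(1)
C, D, R = 3, 5, 2
inv_sigma = np.ones((C, D))             # assumed diagonal inverse covariances
T = rng.normal(size=(C, D, R)) * 0.1    # placeholder total-variability matrix

# Per-component products G_c = T_c^T Sigma_c^{-1} T_c, computed once offline.
G = np.stack([(T[c].T * inv_sigma[c]) @ T[c] for c in range(C)])  # (C, R, R)

L = np.eye(R)      # L_0: identity (prior precision)
N = np.zeros(C)    # accumulated zero-order statistics (for comparison only)
for t in range(5):
    gamma_t = rng.dirichlet(np.ones(C))          # placeholder frame alignment
    # Recursive product update: L_t = L_{t-1} + sum_c gamma_c(t) * G_c
    L = L + np.tensordot(gamma_t, G, axes=1)
    N = N + gamma_t

# The recursion reproduces the batch formula L_t = I + sum_c N_c(t) * G_c,
# so no past alignments or observations need to be retained.
L_batch = np.eye(R) + np.tensordot(N, G, axes=1)
```

The equivalence of L and L_batch is what allows the apparatus to discard everything except the running product between time steps.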

[0092] FIG. 8 depicts a schematic block diagram illustrating a configuration of a computer according to at least one of the exemplary embodiments. A computer 1000 includes a CPU 1001, a main storage device 1002, an auxiliary storage device 1003, and an interface 1004.

[0093] Each of the above-described speaker embedding apparatuses is implemented on the computer 1000. The operation of each of the processing units described above is stored in the auxiliary storage device 1003 in the form of a program (a speaker embedding program). The CPU 1001 reads the program from the auxiliary storage device 1003, deploys the program in the main storage device 1002, and executes the above processing according to the program.

[0094] Note that, in at least one of the exemplary embodiments, the auxiliary storage device 1003 is an example of a non-transitory physical medium. Other examples of the non-transitory physical medium include a magnetic disc, a magneto-optical disk, a CD-ROM, a DVD-ROM, and a semiconductor memory connected via the interface 1004. In the case where the program is distributed to the computer 1000 through a communication line, the computer 1000 that has received the program may deploy the program in the main storage device 1002 to execute the processing described above.

[0095] Incidentally, the program may implement a part of the functions described above. The program may implement the aforementioned functions in combination with another program stored in the auxiliary storage device 1003 in advance, that is, the program may be a differential file (differential program).

[0096] While the invention has been particularly shown and described with reference to example embodiments thereof, the invention is not limited to these embodiments. It will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present invention as defined by the claims.

[0097] The whole or part of the example embodiments disclosed above can be described as, but not limited to, the following supplementary notes.

[0098] (Supplementary note 1) A speaker embedding apparatus using an i-vector comprising: an input unit which inputs an observation at a current time step; a frame alignment unit which computes a frame alignment at the current time step by using the input observation; an i-vector computation unit which computes an i-vector and a precision matrix by using the computed frame alignment, the input observation, and a product obtained when computing the i-vector at the previous time step; and an output unit which outputs the computed i-vector and precision matrix.

[0099] (Supplementary note 2) The speaker embedding apparatus according to supplementary note 1, wherein the i-vector computation unit updates the i-vector and the precision matrix by using the i-vector and its precision matrix at the previous time step, the frame alignment, and the observation.

[0100] (Supplementary note 3) The speaker embedding apparatus according to supplementary note 1, wherein the i-vector computation unit updates the i-vector and the precision matrix by using zero-order statistics and first-order statistics at the previous time step, the frame alignment, and the observation.

[0101] (Supplementary note 4) The speaker embedding apparatus according to any one of supplementary notes 1 to 3, wherein the i-vector computation unit updates the i-vector and the precision matrix without directly using past observations other than the observation at the current time step.

[0102] (Supplementary note 5) The speaker embedding apparatus according to any one of supplementary notes 1 to 4, wherein the i-vector computation unit computes the i-vector and the precision matrix by recursively updating the product obtained at the time step of computation of the i-vector at the previous time step.

[0103] (Supplementary note 6) A speaker embedding method using an i-vector comprising: inputting an observation at a current time step; computing a frame alignment at the current time step by using the input observation; computing an i-vector and a precision matrix by using the computed frame alignment, the input observation, and a product obtained when computing the i-vector at the previous time step; and outputting the computed i-vector and precision matrix.

[0104] (Supplementary note 7) The speaker embedding method according to supplementary note 6, wherein the i-vector and the precision matrix are updated by using the i-vector and its precision matrix at the previous time step, the frame alignment, and the observation.

[0105] (Supplementary note 8) The speaker embedding method according to supplementary note 6, wherein the i-vector and the precision matrix are updated by using zero-order statistics and first-order statistics at the previous time step, the frame alignment, and the observation.

[0106] (Supplementary note 9) A non-transitory computer readable recording medium storing a speaker embedding program using an i-vector that, when executed by a processor, performs a method comprising: inputting an observation at a current time step; computing a frame alignment at the current time step by using the input observation; computing an i-vector and a precision matrix by using the computed frame alignment, the input observation, and a product obtained when computing the i-vector at the previous time step; and outputting the computed i-vector and precision matrix.

[0107] (Supplementary note 10) The non-transitory computer readable recording medium according to supplementary note 9, wherein the i-vector and the precision matrix are updated by using the i-vector and its precision matrix at the previous time step, the frame alignment, and the observation.

[0108] (Supplementary note 11) The non-transitory computer readable recording medium according to supplementary note 9, wherein the i-vector and the precision matrix are updated by using zero-order statistics and first-order statistics at the previous time step, the frame alignment, and the observation.