SPEECH RECOGNITION METHOD AND DEVICE

20170193987 · 2017-07-06


    Abstract

    This patent disclosure relates to voice technology and discloses a voice recognition method and electronic device. In some embodiments of this disclosure, soft clustering calculation is performed in advance according to N gausses obtained by model training, to obtain M soft clustering gausses; when voice recognition is performed, voice is converted to obtain an eigenvector, and the top L soft clustering gausses with the highest scores are calculated according to the eigenvector, wherein the L is less than the M; and member gausses among the L soft clustering gausses are used as the gausses that need to participate in calculation in an acoustic model in the voice recognition process, to calculate the likelihood of the acoustic model.

    Claims

    1. A voice recognition method, applied to a terminal, comprising the following steps: performing soft clustering calculation in advance according to N gausses obtained by model training, to obtain M soft clustering gausses; when voice recognition is performed, converting voice to obtain an eigenvector and calculating top L soft clustering gausses with highest scores according to the eigenvector, wherein the L is less than the M; and using member gausses among the L soft clustering gausses as gausses that need to participate in calculation in an acoustic model in a voice recognition process to calculate likelihood of the acoustic model.

    2. The voice recognition method according to claim 1, wherein the step of performing soft clustering calculation according to N gausses obtained by model training comprises the following sub-steps: allocating the N gausses to clustering gausses according to preset weights; and reestimating the clustering gausses according to update weights of the gausses to the clustering gausses to which the gausses belong, to obtain the M soft clustering gausses.

    3. The voice recognition method according to claim 1, wherein in the step of performing soft clustering calculation according to N gausses obtained by model training, any following algorithm is used to calculate the soft clustering: a K mean value algorithm, a C mean value algorithm, and a self-organization map algorithm.

    4. The voice recognition method according to claim 3, comprising: calculating a minimum clustering price of the clustering gausses when the K mean value algorithm is used to reestimate the clustering gausses; taking a derivative of the minimum clustering price and acquiring an update weight of each member gauss to each clustering gauss; calculating mean values and variances of the clustering gausses according to the acquired update weight of each member gauss to each clustering gauss, to obtain the reestimated clustering gausses; and using the estimated clustering gausses as the M soft clustering gausses.

    5. The voice recognition method according to claim 4, wherein the minimum clustering price Q is calculated according to the following formula: Q = Σ_{n=1}^{N} [ Σ_{i=1}^{M} g(i,n)·WSKLD(i,n) + ρ·Σ_{i=1}^{M} g(i,n)·log(1/g(i,n)) ], wherein g(i,n) represents an update weight of the n-th gauss to the i-th clustering gauss, ρ is a preset clustering hardness parameter, and WSKLD represents a weighted symmetric KL divergence used as a distance criterion between gausses.

    6. The voice recognition method according to claim 1, wherein a value of the L is a minimum value satisfying the following condition: Σ_{i=1}^{L} p(G_i|Y)^α > 0.95·Σ_{j=1}^{M×0.2} p(G_j|Y)^α, wherein the clustering gausses are sorted so that p(G_i|Y) ≥ p(G_{i+1}|Y), the Y represents the eigenvector, α is a compression index for a posterior probability of a gauss, G_i represents the i-th clustering gauss, and p(G_i|Y) represents a posterior probability of the i-th clustering gauss.

    7. The voice recognition method according to claim 1, wherein the step of calculating top L soft clustering gausses with highest scores according to the eigenvector comprises the following sub-steps: acquiring scores of soft clustering gausses according to the following formula: f_m(Y) = 1/((2π)^{d/2}·|Σ_m|^{1/2})·exp(−(1/2)·(Y − μ_m)ᵀ·Σ_m^{−1}·(Y − μ_m)), wherein the Y represents the eigenvector, μ_m represents a mean value of the m-th soft clustering gauss, and Σ_m represents a variance of the m-th soft clustering gauss.

    8. The voice recognition method according to claim 1, wherein in the step of converting voice to obtain an eigenvector, each voice frame is converted into the eigenvector.

    9-10. (canceled)

    11. A non-volatile computer storage medium, which stores computer executable instructions that, when executed by an electronic device, cause the electronic device to: perform soft clustering calculation in advance according to N gausses obtained by model training, to obtain M soft clustering gausses; when voice recognition is performed, convert voice to obtain an eigenvector and calculate top L soft clustering gausses with highest scores according to the eigenvector, wherein L is less than M; and use member gausses among the L soft clustering gausses as gausses that need to participate in calculation in an acoustic model in a voice recognition process to calculate likelihood of the acoustic model.

    12. The non-volatile computer storage medium according to claim 11, wherein the instructions to perform soft clustering calculation according to N gausses obtained by model training cause the electronic device to: allocate the N gausses to clustering gausses according to preset weights; and reestimate the clustering gausses according to update weights of the gausses to the clustering gausses to which the gausses belong, to obtain the M soft clustering gausses.

    13. The non-volatile computer storage medium according to claim 11, wherein, in performing the soft clustering calculation according to N gausses obtained by model training, any one of the following algorithms is used to calculate the soft clustering: a K mean value algorithm, a C mean value algorithm, and a self-organization map algorithm.

    14. The non-volatile computer storage medium according to claim 13, wherein a minimum clustering price of the clustering gausses is calculated when the K mean value algorithm is used to reestimate the clustering gausses; a derivative of the minimum clustering price is taken and an update weight of each member gauss to each clustering gauss is acquired; mean values and variances of the clustering gausses are calculated according to the acquired update weight of each member gauss to each clustering gauss, to obtain the reestimated clustering gausses; and the estimated clustering gausses are used as the M soft clustering gausses.

    15. The non-volatile computer storage medium according to claim 14, wherein the minimum clustering price Q is calculated according to the following formula: Q = Σ_{n=1}^{N} [ Σ_{i=1}^{M} g(i,n)·WSKLD(i,n) + ρ·Σ_{i=1}^{M} g(i,n)·log(1/g(i,n)) ], wherein g(i,n) represents an update weight of the n-th gauss to the i-th clustering gauss, ρ is a preset clustering hardness parameter, and WSKLD represents a weighted symmetric KL divergence used as a distance criterion between gausses.

    16. The non-volatile computer storage medium according to claim 11, wherein a value of the L is a minimum value satisfying the following condition: Σ_{i=1}^{L} p(G_i|Y)^α > 0.95·Σ_{j=1}^{M×0.2} p(G_j|Y)^α, wherein the clustering gausses are sorted so that p(G_i|Y) ≥ p(G_{i+1}|Y), the Y represents the eigenvector, α is a compression index for a posterior probability of a gauss, G_i represents the i-th clustering gauss, and p(G_i|Y) represents a posterior probability of the i-th clustering gauss.

    17. An electronic device, comprising: at least one processor; and a memory communicably connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor, and execution of the instructions by the at least one processor causes the at least one processor to: perform soft clustering calculation in advance according to N gausses obtained by model training, to obtain M soft clustering gausses; when voice recognition is performed, convert voice to obtain an eigenvector and calculate top L soft clustering gausses with highest scores according to the eigenvector, wherein L is less than M; and use member gausses among the L soft clustering gausses as gausses that need to participate in calculation in an acoustic model in a voice recognition process to calculate likelihood of the acoustic model.

    18. The electronic device according to claim 17, wherein the execution of the instructions to perform soft clustering calculation according to N gausses obtained by model training causes the at least one processor to: allocate the N gausses to clustering gausses according to preset weights; and reestimate the clustering gausses according to update weights of the gausses to the clustering gausses to which the gausses belong, to obtain the M soft clustering gausses.

    19. The electronic device according to claim 17, wherein, in performing the soft clustering calculation according to N gausses obtained by model training, any one of the following algorithms is used to calculate the soft clustering: a K mean value algorithm, a C mean value algorithm, and a self-organization map algorithm.

    20. The electronic device according to claim 19, wherein a minimum clustering price of the clustering gausses is calculated when the K mean value algorithm is used to reestimate the clustering gausses; a derivative of the minimum clustering price is taken and an update weight of each member gauss to each clustering gauss is acquired; mean values and variances of the clustering gausses are calculated according to the acquired update weight of each member gauss to each clustering gauss, to obtain the reestimated clustering gausses; and the estimated clustering gausses are used as the M soft clustering gausses.

    21. The electronic device according to claim 20, wherein the minimum clustering price Q is calculated according to the following formula: Q = Σ_{n=1}^{N} [ Σ_{i=1}^{M} g(i,n)·WSKLD(i,n) + ρ·Σ_{i=1}^{M} g(i,n)·log(1/g(i,n)) ], wherein g(i,n) represents an update weight of the n-th gauss to the i-th clustering gauss, ρ is a preset clustering hardness parameter, and WSKLD represents a weighted symmetric KL divergence used as a distance criterion between gausses.

    22. The electronic device according to claim 17, wherein a value of the L is a minimum value satisfying the following condition: Σ_{i=1}^{L} p(G_i|Y)^α > 0.95·Σ_{j=1}^{M×0.2} p(G_j|Y)^α, wherein the clustering gausses are sorted so that p(G_i|Y) ≥ p(G_{i+1}|Y), the Y represents the eigenvector, α is a compression index for a posterior probability of a gauss, G_i represents the i-th clustering gauss, and p(G_i|Y) represents a posterior probability of the i-th clustering gauss.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0019] One or more embodiments are exemplarily described by using the corresponding figures in the accompanying drawings; the exemplary descriptions do not constitute a limitation on the embodiments. Elements with the same reference signs in the accompanying drawings are similar elements. Unless otherwise particularly stated, the figures in the accompanying drawings do not constitute a scale limitation.

    [0020] FIG. 1 is a schematic diagram of a voice recognition system according to some implementation manners of this disclosure;

    [0021] FIG. 2 is a flowchart of calculation of soft clustering according to some implementation manners;

    [0022] FIG. 3 is a flowchart of a voice recognition method according to some implementation manners;

    [0023] FIG. 4 is a schematic diagram of dynamic Gaussian selection according to some implementation manners;

    [0024] FIG. 5 is a schematic structural diagram of a voice recognition apparatus according to some implementation manners; and

    [0025] FIG. 6 is a schematic structural diagram of an electronic device according to some implementation manners.

    DETAILED DESCRIPTION

    [0026] To make the objectives, technical solutions, and advantages of this disclosure clearer, the following describes the implementation manners of this disclosure in detail with reference to the accompanying drawings. However, a person skilled in the art may understand that many technical details are set forth in the implementation manners of this disclosure so that readers can better understand this disclosure; the technical solutions claimed in this disclosure can nevertheless be implemented even without these technical details or the various changes and modifications based on the following implementation manners.

    [0027] An objective of voice recognition is to provide the most probable text given an observed voice signal. As shown in FIG. 1, an HMM+GMM-based recognition system reads a segment of voice frame by frame and changes each frame of the voice signal into an eigenvector. The system evaluates the likelihood of each gauss in an acoustic model with reference to each frame's eigenvector. In addition, combinations of multiple words are hypothesized, likelihood evaluation is performed on these word combinations by using a language model, and the word combination with the greatest sum of acoustic likelihood and language likelihood is output as the recognition result.

    [0028] A first implementation manner of this disclosure relates to a voice recognition method. In this implementation manner, soft clustering calculation needs to be performed in advance according to N gausses obtained by model training, to obtain M soft clustering gausses. When voice recognition is performed, a quantity of member gausses to be calculated is controlled in a dynamic Gaussian selection manner. In this implementation manner, a calculation process of soft clustering is shown in FIG. 2.

    [0029] Step 201: Obtain N gausses by model training, such as obtaining 1000 gausses.

    [0030] Step 202: Allocate the N gausses to clustering gausses according to preset weights.

    [0031] Step 203: Reestimate the clustering gausses according to update weights of the gausses to the clustering gausses to which the gausses belong, to obtain M soft clustering gausses.

    [0032] A person skilled in the art may understand that a Gaussian Mixture Model (GMM) is used to describe the probability distribution of each state of a hidden Markov model (HMM) in voice recognition, and each state uses several gausses to describe its own probability distribution. Each Gaussian distribution has its own mean value and variance. To effectively use Gaussian selection in a recognition system, gausses need to be shared between states. An acoustic model sharing gausses is called a semi-continuous Markov model. When the same quantity of gausses is used, a semi-continuous model improves the description capacity of the model, thereby improving the recognition rate. N gausses (in a local recognition system, N is generally 1000) are obtained by model training, and a distance criterion between gausses must be clearly determined before clustering. In this implementation manner, a weighted symmetric KL divergence (WSKLD) is used as the distance criterion. The SKLD between a gauss n and a gauss m is:


    SKLD(n,m) = trace((Σ_n^{−1} + Σ_m^{−1})(μ_n − μ_m)(μ_n − μ_m)ᵀ + Σ_n^{−1}Σ_m + Σ_nΣ_m^{−1} − 2I)

    [0033] Σ_n is the variance of the gauss n, Σ_m is the variance of the gauss m, μ_n is the mean value of the gauss n, μ_m is the mean value of the gauss m, and I is a unit matrix.

    [0034] If the gauss model is divided into multiple sub-spaces, and each sub-space j has its weight ω_j, the WSKLD is:

    [00001] WSKLD(n,m) = Σ_{j=1}^{N_strm} ω_j·SKLD_j(n,m)

    [0035] N_strm is the quantity of sub-spaces of the gauss model.
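The distance criterion above can be sketched in code as follows. This is a minimal illustration, not the production implementation: it assumes diagonal covariances (so the trace reduces to a per-dimension sum) and represents each gauss as per-stream (mean, variance) lists; the function names are hypothetical.

```python
import math

def skld(mu_n, var_n, mu_m, var_m):
    """SKLD(n,m) = trace((S_n^-1 + S_m^-1)(mu_n - mu_m)(mu_n - mu_m)^T
    + S_n^-1 S_m + S_n S_m^-1 - 2I), evaluated per dimension for
    diagonal-covariance gausses given as (mean list, variance list)."""
    total = 0.0
    for un, vn, um, vm in zip(mu_n, var_n, mu_m, var_m):
        d = un - um
        total += (1.0 / vn + 1.0 / vm) * d * d + vm / vn + vn / vm - 2.0
    return total

def wskld(gauss_n, gauss_m, stream_weights):
    """WSKLD(n,m) = sum_j w_j * SKLD_j(n,m) over sub-spaces (streams);
    each gauss is a list of (mean, variance) pairs, one per stream."""
    return sum(w * skld(mn, vn, mm, vm)
               for w, (mn, vn), (mm, vm) in zip(stream_weights, gauss_n, gauss_m))
```

Note that SKLD is zero for identical gausses and symmetric in its arguments, which is what makes it usable as a clustering distance.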

    [0036] The calculation of soft clustering may use any one of the following algorithms in a specific implementation: a K mean value algorithm, a C mean value algorithm, and a self-organization map algorithm. A specific description is provided by using the K mean value algorithm as an example:

    [0037] The algorithm may be described by using the following pseudo code:

    [0038] 1. A quantity m of clustering gausses is set to 1, and all gausses are used as member gausses to estimate one clustering gauss.

    [0039] 2. while m < M (M is a target value of the quantity of the clustering gausses):

    [0040] 2a. find the clustering gauss that has the maximum intra-cluster WSKLD;

    [0041] 2b. split that gauss into two clustering gausses, m++;

    [0042] 2c. for cycle from 1 to T:

    [0043] 2c-1. for clustering gauss i, i from 1 to m:

    [0044] 2c-1-1. for member gauss n, n from 1 to N, where N is a quantity of member gausses:

    [0045] an update contribution g(i,n) of the member gauss to the i-th clustering gauss is calculated;

    [0046] 2c-1-2. based on g(i,n), the mean value μ_i and the variance Σ_i of the i-th clustering gauss are updated iteratively.

    [0047] In the foregoing pseudo code, the target of clustering is to minimize a clustering price Q. The calculation formula of Q is as follows:

    [00002] Q = Σ_{n=1}^{N} [ Σ_{i=1}^{m} g(i,n)·WSKLD(i,n) + ρ·Σ_{i=1}^{m} g(i,n)·log(1/g(i,n)) ]

    [0048] g(i,n) represents an update weight of the n-th gauss to the i-th clustering gauss, ρ is a preset clustering hardness parameter, and WSKLD represents the weighted symmetric KL divergence used as the distance criterion between gausses.

    [0049] The following parameters may be obtained through iteration: the mean values and variances of the clustering gausses, and an update weight of each member gauss to each clustering gauss:

    [00003] [μ̂_i, Σ̂_i, ĝ(i,n)] = argmin_{Σ_{i=1}^{m} g(i,n)=1} Q

    [0050] In the iterative process of acquiring the foregoing parameters, the first step is acquiring the optimal update weight:

    [00004] ĝ(i,n) = exp(−WSKLD(i,n)/ρ) / Σ_{j=1}^{m} exp(−WSKLD(j,n)/ρ)

    [0051] ĝ(i,n) is the update weight.
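The optimal update weight above is a softmax over the negative scaled distances. A minimal sketch (the distance values and the hardness parameter ρ, here `rho`, are illustrative inputs):

```python
import math

def soft_weights(dists, rho):
    """g^(i,n) = exp(-WSKLD(i,n)/rho) / sum_j exp(-WSKLD(j,n)/rho)
    for one member gauss n; dists[i] holds WSKLD(i,n) to each of the m
    current clustering gausses."""
    exps = [math.exp(-d / rho) for d in dists]
    z = sum(exps)
    return [e / z for e in exps]
```

With equal distances the weights are uniform; as ρ shrinks, the weight mass concentrates on the nearest clustering gauss, approaching hard clustering, which is why ρ is called a clustering hardness parameter.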

    [0052] The second step is acquiring the optimal mean value and variance based on the optimal weight. A method for updating a mean value of a clustering gauss is as follows:

    [00005] μ̂_i = [ Σ_{n=1}^{N} ĝ(i,n)·(Σ_i^{−1} + Σ_n^{−1}) ]^{−1} [ Σ_{n=1}^{N} ĝ(i,n)·(Σ_i^{−1} + Σ_n^{−1})·μ_n ]
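For diagonal covariances the mean reestimation formula reduces to a per-dimension weighted average of the member means; a rough sketch (function and parameter names are hypothetical):

```python
def reestimate_mean(g, member_mus, member_vars, cluster_var):
    """Mean update of clustering gauss i, specialized to diagonal
    covariances: each member mean mu_n is weighted by
    g^(i,n) * (1/var_i + 1/var_n) in every dimension.
    g[n] is the update weight g^(i,n) of member gauss n."""
    dim = len(cluster_var)
    mu = []
    for d in range(dim):
        num = den = 0.0
        for gn, mu_n, var_n in zip(g, member_mus, member_vars):
            w = gn * (1.0 / cluster_var[d] + 1.0 / var_n[d])
            num += w * mu_n[d]
            den += w
        mu.append(num / den)
    return mu
```

When all member variances are equal, the update collapses to the ordinary ĝ-weighted average of member means, matching the usual K-means centroid.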

    [0053] To calculate a variance of the clustering gauss, an auxiliary matrix Z may be constructed.

    [00006] Z = [ 0 A_1 ; A_2 0 ], A_1 = Σ_{n=1}^{N} ĝ(i,n)·[(μ_n − μ̂_i)(μ_n − μ̂_i)ᵀ + Σ_n], A_2 = Σ_{n=1}^{N} ĝ(i,n)·Σ_n^{−1}

    [0054] Based on the construction of Z, Z has DP positive eigenvalues and DP corresponding negative eigenvalues, where DP is the dimension of the mean values and variances. In this case, a 2DP-by-DP matrix V is constructed whose columns are the eigenvectors corresponding to the DP positive eigenvalues of Z. V is divided into an upper part U and a lower part W:

    [00007] V = [ U ; W ]

    [0055] Therefore, a covariance matrix of the clustering gauss is estimated as follows:


    Σ̂_i = U·W^{−1}

    [0056] After the mean value and the covariance matrix are alternately iterated for several rounds, the covariance matrix is constrained to be a diagonal matrix. This forced condition causes clustering not to converge in a few situations but does not influence clustering accuracy; the reestimated clustering gausses are thereby obtained as the M soft clustering gausses.
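The covariance reestimation via the auxiliary matrix Z can be sketched with NumPy as below. This is an illustrative reading of the construction (full member covariances assumed in A1 and A2, following the standard symmetric-KL centroid derivation); a single member with weight 1 and a matching mean should be recovered exactly.

```python
import numpy as np

def reestimate_cov(g, member_mus, member_covs, cluster_mu):
    """Build Z = [[0, A1], [A2, 0]], take the eigenvectors belonging to the
    DP positive eigenvalues, split them into an upper block U and a lower
    block W, and return the cluster covariance estimate U @ inv(W)."""
    dp = len(cluster_mu)
    A1 = np.zeros((dp, dp))
    A2 = np.zeros((dp, dp))
    for gn, mu_n, cov_n in zip(g, member_mus, member_covs):
        diff = (np.asarray(mu_n) - np.asarray(cluster_mu)).reshape(-1, 1)
        A1 += gn * (diff @ diff.T + cov_n)
        A2 += gn * np.linalg.inv(cov_n)
    Z = np.block([[np.zeros((dp, dp)), A1], [A2, np.zeros((dp, dp))]])
    vals, vecs = np.linalg.eig(Z)
    V = vecs[:, vals.real > 0].real  # eigenvectors of the DP positive eigenvalues
    U, W = V[:dp, :], V[dp:, :]
    return U @ np.linalg.inv(W)
```

The product U·W^{−1} is invariant to the scaling and ordering of the selected eigenvectors, so no normalization is needed; in practice the result is then diagonalized, as the text notes.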

    [0057] That is, in this implementation manner, the recognition system minimizes the clustering price of the clustering gausses, takes its derivative to acquire an update weight of each member gauss to each clustering gauss, and then calculates the mean values and variances of the clustering gausses according to the update weights, to obtain the reestimated clustering gausses as the M soft clustering gausses.

    [0058] Voice is recognized after the M soft clustering gausses are obtained. A specific process is shown in FIG. 3:

    [0059] Step 301: A recognition system reads a segment of voice frame by frame. For example, a length of each frame is 10 ms.

    [0060] Step 302: The recognition system changes each frame of a voice signal into an eigenvector, and the obtained eigenvector is used to evaluate a soft clustering gauss.

    [0061] Step 303: Calculate top L soft clustering gausses with highest scores according to the eigenvector (L is less than M).

    [0062] Specifically, as shown in FIG. 4, in a voice recognition process, after a segment of voice is converted into an eigenvector Y, all clustering gausses are first evaluated with this vector, and the top L soft clustering gausses with the highest scores are selected and put in a clustering gauss selection table. The scores of the soft clustering gausses may be acquired according to the following formula:

    [00008] f_m(Y) = 1/((2π)^{d/2}·|Σ_m|^{1/2})·exp(−(1/2)·(Y − μ_m)ᵀ·Σ_m^{−1}·(Y − μ_m))

    [0063] Y represents the eigenvector, μ_m represents the mean value of the m-th soft clustering gauss, and Σ_m represents the variance of the m-th soft clustering gauss. After the scores of the M clustering gausses are obtained, the top L clustering gausses with the highest scores are used as the selected clustering gausses.

    [0064] In this implementation manner, a value of the L is a minimum value satisfying the following condition:

    [00009] Σ_{i=1}^{L} p(G_i|Y)^α > 0.95·Σ_{j=1}^{M×0.2} p(G_j|Y)^α

    where the clustering gausses are sorted in descending order of posterior probability, that is, p(G_i|Y) ≥ p(G_{i+1}|Y).

    [0065] Y represents the eigenvector, α is a compression index for the posterior probability of a gauss, G_i represents the i-th clustering gauss, and p(G_i|Y) represents the posterior probability of the i-th clustering gauss.
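Steps 302–303 can be sketched as follows, assuming diagonal-covariance clustering gausses; `alpha` stands in for the compression index, and the 0.95 and 20% constants mirror the selection condition above. This is a sketch under those assumptions, not the exact production logic.

```python
import math

def gauss_score(y, mu, var):
    """Score f_m(Y) of a diagonal-covariance soft clustering gauss."""
    d = len(y)
    det = 1.0
    expo = 0.0
    for yd, md, vd in zip(y, mu, var):
        det *= vd
        expo += (yd - md) ** 2 / vd
    return math.exp(-0.5 * expo) / ((2.0 * math.pi) ** (d / 2.0) * math.sqrt(det))

def select_top_l(y, clusters, alpha=1.0):
    """Return indices of the smallest L clusters (sorted by descending score)
    whose compressed posteriors exceed 95% of the compressed posterior mass
    of the top 20% of all M clustering gausses."""
    ranked = sorted(((gauss_score(y, mu, var), i)
                     for i, (mu, var) in enumerate(clusters)), reverse=True)
    total = sum(s for s, _ in ranked)
    post = [((s / total) ** alpha, i) for s, i in ranked]
    top20 = max(1, int(len(clusters) * 0.2))
    target = 0.95 * sum(p for p, _ in post[:top20])
    selected, acc = [], 0.0
    for p, i in post:
        selected.append(i)
        acc += p
        if acc > target:
            break
    return selected
```

Because L is chosen per frame, a frame that lies clearly inside one cluster selects very few gausses, while an ambiguous frame automatically selects more; this is the dynamic aspect of the selection.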

    [0066] Step 304: Use member gausses among the L soft clustering gausses as gausses that need to participate in calculation in an acoustic model in a voice recognition process to calculate likelihood of the acoustic model.

    [0067] That is, whether a member gauss is selected and calculated depends on a clustering gauss selection table and a clustering-member gauss mapping table. As shown in FIG. 4, in the clustering gauss selection table, 1 represents that the corresponding clustering gauss is selected at the current moment in the recognition process. The member gausses corresponding to the selected clustering gausses are queried in the clustering-member gauss mapping table and are calculated. The likelihood of an unselected member gauss is replaced by a small value.
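Step 304 then touches only the member gausses mapped from the selected clustering gausses, and every other member gauss receives a small floor likelihood. A minimal sketch (the mapping table, the scoring callable, and the floor value are all illustrative):

```python
def member_likelihoods(selected, cluster_to_members, score_member, floor=1e-10):
    """Query the clustering-member gauss mapping table for every selected
    clustering gauss, compute likelihoods only for those member gausses,
    and replace the likelihood of every unselected member gauss by a
    small floor value."""
    n_members = 1 + max(m for members in cluster_to_members.values()
                        for m in members)
    likelihood = [floor] * n_members
    chosen = set()
    for c in selected:
        chosen.update(cluster_to_members[c])
    for m in chosen:
        likelihood[m] = score_member(m)
    return likelihood
```

Because one member gauss may belong to multiple clustering gausses under soft clustering, the union over the selection table is taken before scoring, so a member is never evaluated twice.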

    [0068] Step 305: Determine whether an unread voice frame exists. If yes, a voice frame still needs to be recognized; return to step 301 to read the next voice frame and continue recognition. Otherwise, voice recognition is finished; end the process.

    [0069] Step 306: Output a recognition result. Specifically, the voice recognition result in this step is the word combination with the greatest sum of acoustic likelihood and language likelihood. This step is the same as the prior art and is not described in detail herein.

    [0070] To verify the practicability of the voice recognition method in this implementation manner, the CPU time and recognition rate of several systems are tested on a test set, and the results are shown in Table 1:

    [0071] Hard gauss clustering means that each member gauss belongs to only one clustering gauss, and clustering uses only the mean value as a vector. Soft accurate clustering is the method described in some embodiments of this disclosure. A system not using gauss clustering serves as the baseline. It can be seen that hard gauss clustering is worse in accuracy than the method of some embodiments of this disclosure, while the two have the same speed. The baseline system is worse than some embodiments of this disclosure in both speed and accuracy.

    TABLE-US-00001 TABLE 1

                                 Word error  Gauss calculation  Decoding time  Gauss calculation
                                 rate        time (ms/frame)    (ms/frame)     percentage
    Hard gauss clustering        7.02%       1.4                6.1            17%
    Soft accurate clustering     6.65%       2.4                5.1            11%
    Not using gauss clustering   6.87%       15.3               6.7            100%

    [0072] It is not difficult to find that embodiments of this disclosure use an accurate K mean value (K-Means) method in the system training phase to perform soft clustering on gausses (that is, one member gauss may belong to multiple clustering gausses); the quantity of clusters increases gradually, and each split reflects the distribution of the model. During recognition, the quantity of member gausses to be calculated is controlled in a dynamic Gaussian selection manner, which improves the speed and precision of acoustic model likelihood evaluation and is more accurate and efficient than traditional Gaussian selection.

    [0073] A second implementation manner of this disclosure relates to a voice recognition method. The second implementation manner is roughly the same as the first implementation manner and mainly differs from the first implementation manner in that: in the first implementation manner, an accurate K mean value (K-Means) algorithm is used to perform soft clustering on gausses in a system training phase. In the second implementation manner of this disclosure, the C mean value algorithm is used to perform soft clustering on gausses in a system training phase. Because a specific implementation manner of using the C mean value algorithm to perform soft clustering is basically the same as the K mean value algorithm, it is not described in detail in this implementation manner.

    [0074] A third implementation manner of this disclosure relates to a voice recognition method. The third implementation manner is roughly the same as the first implementation manner and mainly differs from the first implementation manner in that: in the first implementation manner, an accurate K mean value (K-Means) algorithm is used to perform soft clustering on gausses in a system training phase. In the third implementation manner of this disclosure, the self-organization map algorithm is used to perform soft clustering on gausses in a system training phase. Because a specific implementation manner of using the self-organization map algorithm to perform soft clustering calculation is only slightly different in step 203, and the self-organization map algorithm is a well-known technology of existing clustering algorithms, it is not described in detail in this implementation manner.

    [0075] The step division of the above methods is only for clear description; during implementation, steps may be combined into one step, or a step may be split into multiple steps. As long as the steps include the same logical relationship, they are within the protection scope of this patent. Adding insignificant modifications to an algorithm or process, or introducing an insignificant design, without changing the core design of the algorithm and process, falls within the protection scope of this patent.

    [0076] A fourth implementation manner of this disclosure relates to a voice recognition apparatus, as shown in FIG. 5, including:

    [0077] a soft clustering acquisition module 510, configured to perform soft clustering calculation in advance according to N gausses obtained by model training, to obtain M soft clustering gausses;

    [0078] a vector conversion module 520, configured to, when voice recognition is performed, convert voice to obtain an eigenvector;

    [0079] a selection module 530, configured to calculate top L soft clustering gausses with highest scores according to the eigenvector and use member gausses among the top L soft clustering gausses as selected gausses, wherein the L is less than the M; and

    [0080] a calculation module 540, configured to use the gausses selected by the selection module as gausses that need to participate in calculation in an acoustic model in a voice recognition process to calculate likelihood of the acoustic model.

    [0081] The soft clustering acquisition module 510 includes:

    [0082] a weight allocation module, configured to allocate the N gausses to clustering gausses according to preset weights; and

    [0083] a reestimation module, configured to reestimate the clustering gausses according to update weights of gausses to the clustering gausses to which the gausses belong, to obtain the M soft clustering gausses.

    [0084] It is not difficult to find that this implementation manner is a system embodiment corresponding to the first implementation manner, and this implementation manner may be implemented in cooperation with the first implementation manner. Relevant technical details mentioned in the first implementation manner are still effective in this implementation manner and, in order to reduce repetition, are not described in detail herein. Correspondingly, relevant technical details mentioned in this implementation manner can also be applied to the first implementation manner.

    [0085] It is worth mentioning that the modules involved in this implementation manner are all logic modules. In an actual application, one logic unit may be a physical unit, a part of a physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of this disclosure, units not closely related to resolving the technical problem proposed in this disclosure are not introduced in this implementation manner, which does not indicate that other units do not exist in this implementation manner.

    [0086] A fifth implementation manner of this disclosure relates to a non-volatile computer storage medium, which stores computer executable instructions that can execute the voice recognition method in any one of the foregoing method embodiments.

    [0087] A sixth implementation manner of this disclosure relates to an electronic device. A schematic structural diagram of its hardware is shown in FIG. 6. The device includes:

    [0088] one or more processors 610 and a memory 620, where only one processor 610 is used as an example in FIG. 6.

    [0089] The device of the voice recognition method may further include: an input apparatus 630 and an output apparatus 640.

    [0090] The processor 610, the memory 620, the input apparatus 630, and the output apparatus 640 can be connected by means of a bus or in other manners. A connection by means of a bus is used as an example in FIG. 6.

    [0091] As a non-volatile computer readable storage medium, the memory 620 can be used to store non-volatile software programs, non-volatile computer executable programs, and modules, for example, the program instructions/modules corresponding to the voice recognition method in the embodiments of this disclosure (for example, the soft clustering acquisition module 510, the vector conversion module 520, the selection module 530, and the calculation module 540). The processor 610 executes various functional applications and data processing of the server, that is, implements the voice recognition method of the foregoing method embodiments, by running the non-volatile software programs, instructions, and modules that are stored in the memory 620.

    [0092] The memory 620 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application needed by at least one function, and the data storage area may store data created according to the use of the server, and the like. In addition, the memory 620 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one disk storage device, flash storage device, or another non-volatile solid-state storage device. In some embodiments, the memory 620 optionally includes memories that are remotely disposed with respect to the processor 610, and these remote memories may be connected to the server via a network. Examples of the foregoing network include but are not limited to: the Internet, an intranet, a local area network, a mobile communications network, or a combination thereof.

    [0093] The input apparatus 630 can receive entered digit or character information, and generate key signal inputs relevant to user settings and function control of the server. The output apparatus 640 may include a display device, for example, a display screen.

    [0094] The one or more modules are stored in the memory 620; when the one or more modules are executed by the one or more processors 610, the voice recognition method in any one of the foregoing method embodiments is executed.

    [0095] The foregoing product can execute the method provided in the embodiments of this disclosure, and has corresponding functional modules for executing the method and corresponding beneficial effects. For technical details not described in detail in this embodiment, refer to the method provided in the embodiments of this disclosure.

    [0096] The electronic device in this embodiment of this disclosure exists in multiple forms, including but not limited to:

    [0097] (1) Mobile communication device: such devices are characterized by having a mobile communication function, and primarily providing voice and data communications; terminals of this type include: a smart phone (for example, an iPhone), a multimedia mobile phone, a feature phone, a low-end mobile phone, and the like;

    [0098] (2) Ultra mobile personal computer device: such devices are essentially personal computers, which have computing and processing functions, and generally have the function of mobile Internet access; terminals of this type include: PDA, MID and UMPC devices, and the like, for example, an iPad;

    [0099] (3) Portable entertainment device: such devices can display and play multimedia content; devices of this type include: an audio and video player (for example, an iPod), a handheld game console, an e-book, an intelligent toy and a portable vehicle-mounted navigation device;

    [0100] (4) Server: a device that provides a computing service; a server includes a processor, a hard disk, a memory, a system bus, and the like; the architecture of a server is similar to that of a general-purpose computer. However, because a server needs to provide highly reliable services, it has higher requirements in terms of processing capability, stability, reliability, security, extensibility, and manageability; and

    [0101] (5) Other electronic apparatuses having a data interaction function.

    [0102] The apparatus embodiment described above is merely exemplary, and units described as separate components may or may not be physically separated; components presented as units may or may not be physical units; that is, the components may be located in one place, or may be distributed across multiple network units. Some or all of the modules therein may be selected according to an actual requirement to achieve the objective of the solution of this embodiment.

    [0103] Through the description of the foregoing implementation manners, a person skilled in the art can clearly understand that each implementation manner may be implemented by software in combination with a general-purpose hardware platform, and certainly may also be implemented by hardware. Based on such understanding, the essence of the foregoing technical solutions, or in other words, the part that contributes to the relevant technologies, may be embodied in the form of a software product. The computer software product may be stored in a computer readable storage medium, for example, a ROM/RAM, a magnetic disk, or a compact disc, and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method in the embodiments or in some parts of the embodiments.

    [0104] Finally, it should be noted that the foregoing embodiments are only used to describe the technical solutions of this disclosure, rather than to limit this disclosure. Although this disclosure is described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that modifications may still be made to the technical solutions disclosed in the foregoing embodiments, or equivalent replacements may be made to some of the technical features therein; however, these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this disclosure.