Method for verifying the identity of a speaker, system therefore and computer readable medium

09792912 · 2017-10-17

Abstract

The invention refers to a method of verifying the identity of a speaker based on the speaker's voice, comprising the steps of: receiving (1, 5) a first and a second voice utterance; using biometric voice data to verify (2, 6) that the speaker's voice corresponds to the speaker whose identity is to be verified, based on the received first and/or second voice utterance; and determining (8) the similarity of the two received voice utterances, characterized in that the similarity is determined using biometric voice characteristics of the two voice utterances or data derived from such biometric voice characteristics. The invention further refers to a system (80) for verifying the identity of a speaker based on the speaker's voice, comprising: a component (81) for receiving a first and a second voice utterance; a component (82) for using biometric voice data to verify that the speaker's voice corresponds to the speaker whose identity is to be verified, based on the received first and/or second voice utterance; and a component (83) for comparing the two received voice utterances in order to determine their similarity, characterized in that the similarity is determined using biometric voice characteristics of the two voice utterances or data derived from such biometric voice characteristics.

Claims

1. A method of verifying, using analysis of a speaker's voice, a specific identity of a speaker, the method comprising the steps of: a) receiving, at a first software component of an electronic access control system, a first voice utterance and a second voice utterance; b) verifying, using biometric voice data extracted from the first and second received voice utterances and using a second software component of the electronic access control system, that the speaker's voice corresponds to a specific identity; and c) determining, using a third software component of the electronic access control system, an indicia of similarity of the two received voice utterances, wherein the indicia of similarity is determined using analysis of the extracted biometric voice data of the two received voice utterances; wherein the extracted biometric voice data used for determining the indicia of similarity of the two received voice utterances comprise or are based on a first set of at least n values, wherein the first set of at least n values is determined from a time slice of one of the received voice utterances, the time slice having a length of between 10 and 40 ms, wherein n is a number between 2 and 40; wherein if the speaker's voice has been verified, using biometric voice data, to correspond to the specific identity, the second received voice utterance is requested from the speaker; wherein the speaker is requested to repeat the first voice utterance in order to receive the second voice utterance; and, wherein data derived for the first voice utterance is determined at least 50 times for the first voice utterance and data derived for the second voice utterance is determined at least 50 times for the second voice utterance.

2. The method of claim 1, wherein a set of biometric voice data is extracted from the two received voice utterances and the extracted set of data is used as biometric voice data for verifying that the speaker's voice corresponds to the specific identity and as biometric voice characteristics for determining an indicia of similarity of the two received voice utterances or for deriving data for determining the indicia of similarity of the two received voice utterances.

Description

(1) Preferred embodiments of the invention are disclosed in the figures. The preferred embodiments are not to be understood as limiting the invention; rather, they are provided in order to explain particularly useful ways of carrying out the invention.

(2) It is shown in:

(3) FIG. 1 a schematic indication of a method for carrying out the invention;

(4) FIG. 2 a schematic view of how to extract biometric voice characteristics;

(5) FIG. 3 an explanation of dynamic time warping;

(6) FIG. 4 a schematic view of how to obtain a derived set of data;

(7) FIG. 5 a schematic view for explaining correlation between data;

(8) FIG. 6 another preferred embodiment for carrying out the method; and

(9) FIG. 7 a schematic view of the system.

(10) FIG. 1 shows an example of a method. Here, in step 1 a first voice utterance is received. Based on this received voice utterance a speaker verification is performed (step 2). For this purpose, biometric voice data are extracted from the voice utterance and are compared with a statistical voice model such as a Gaussian Mixture Model or a Hidden Markov Model. In step 3 it is decided whether the identity is considered to be verified. If not, the speaker is rejected in item 4; otherwise the method proceeds to item 5, where a second voice utterance is received. Between step 3 and step 5 a further step may be provided which requests the speaker to provide the second voice utterance. In item 6 the second voice utterance is processed in order to verify the speaker. Here, in a similar way as in step 2, biometric voice data are obtained from the voice utterance and are, for example, checked against a statistical voice model such as a Gaussian Mixture Model or a Hidden Markov Model.
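
The patent leaves the verification against the statistical voice model at the level of FIG. 1. Below is a minimal sketch of how steps 2 and 6 might be implemented, assuming a diagonal-covariance Gaussian Mixture Model for the claimed speaker scored against a universal background model (UBM); the function names, parameter shapes and the decision threshold are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def gmm_log_likelihood(features, weights, means, variances):
    """Average per-frame log-likelihood of `features` (T x D) under a
    diagonal-covariance GMM with `weights` (K,), `means` (K x D) and
    `variances` (K x D)."""
    diff = features[:, None, :] - means[None, :, :]                       # (T, K, D)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)       # (K,)
    log_exp = -0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2)    # (T, K)
    log_comp = np.log(weights)[None, :] + log_norm[None, :] + log_exp     # (T, K)
    # log-sum-exp over components, then average over frames
    m = log_comp.max(axis=1, keepdims=True)
    frame_ll = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return frame_ll.mean()

def verify_speaker(features, speaker_gmm, ubm, threshold=0.5):
    """Accept if the log-likelihood ratio between the claimed speaker's model
    and a universal background model exceeds the (illustrative) threshold."""
    llr = gmm_log_likelihood(features, *speaker_gmm) - gmm_log_likelihood(features, *ubm)
    return llr > threshold

# usage (model parameters would come from enrolment / background training, omitted here):
# accepted = verify_speaker(feats, (w_spk, mu_spk, var_spk), (w_ubm, mu_ubm, var_ubm))
```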

(11) In item 7 it is decided whether the identity is considered to be verified based on step 6. If not, the speaker is rejected in item 4; otherwise the method proceeds to step 8. In this step the similarity between the first and the second voice utterance is determined. If the two voice utterances are found to be suspiciously similar, the method proceeds to rejection step 4; otherwise the speaker is accepted (item 10).

(12) The determination of the similarity between the first and the second voice utterance can also be performed directly after the second voice utterance has been received in step 5. The speaker verification of item 6 may then only be performed in case the two utterances are found not to be suspiciously similar. Also, the speaker verification of step 6 and the determination of similarity in step 8 may be processed in parallel. The results of the decisions of items 7 and 9 may then be combined in order to decide whether the speaker is to be rejected or accepted, or further steps may be carried out before deciding about acceptance or rejection.

(13) Further, instead of the acceptance in item 10, other tests may be carried out in order to check for fraud before accepting a speaker, such as a liveliness test (see PCT/EP2008/010478, FIGS. 4 and 5).

(14) In FIG. 2a the intensity 15 of a voice signal is shown as a function of time t. This can be the intensity of the signal of the first or of the second voice utterance. Different time slices T.sub.1, T.sub.2, T.sub.3 are shown schematically, which are defined as overlapping time slices. The time slice T.sub.1 extends from time 11 to time 13, T.sub.2 from time 12 to time 14 and T.sub.3 from time 13 to a time later than time 14. As can be seen, the time slices may overlap; the time slices T.sub.1 and T.sub.2, for example, overlap between times 12 and 13.
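
A minimal sketch of how a signal could be cut into overlapping time slices as in FIG. 2a, assuming a 16 kHz sampling rate, 25 ms slices and a 10 ms hop (values within the 10-40 ms range recited in claim 1, but otherwise illustrative assumptions).

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, slice_ms=25, hop_ms=10):
    """Cut a 1-D signal into overlapping time slices (frames)."""
    slice_len = int(sample_rate * slice_ms / 1000)   # samples per slice
    hop_len = int(sample_rate * hop_ms / 1000)       # samples between slice starts
    n_slices = 1 + max(0, (len(signal) - slice_len) // hop_len)
    frames = np.stack([signal[i * hop_len: i * hop_len + slice_len]
                       for i in range(n_slices)])
    return frames                                     # shape: (n_slices, slice_len)

# one second of audio yields roughly 98 overlapping slices with these settings
frames = frame_signal(np.random.randn(16000))
```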

(15) For each time slice, biometric data or biometric characteristics may be calculated. For example, for each time slice the signal portion 15 may be Fourier transformed and the envelope thereof may be determined, from which characteristic biometric data may be obtained, as shown in FIG. 2b. Here it is shown that for each time slice a set of values, in particular vector 16, is obtained, each vector or set of values having n values, where n may be, for example, a number between 10 and 30 or between 15 and 25. From those data, other sets of data may be obtained, as indicated with reference sign 17. Here, for time slice T.sub.2 the first half of the values is identical to the set of values 16, and the differences between the set of values 16 of time slice T.sub.1 and the corresponding values of T.sub.2 are appended as the second n values. The value ΔC.sub.1.sup.2 corresponds to the difference between C.sub.1.sup.1 and C.sub.1.sup.2. Thereby sets of values or vectors 17 having two times n values are obtained.
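
A minimal sketch of how a feature vector like 16 and an extended vector like 17 might be derived per time slice. It uses a log-magnitude spectrum followed by a DCT as a stand-in for the envelope-based characteristics (a cepstrum-style choice that is an assumption here, not prescribed by the patent), with frame-to-frame differences appended as in FIG. 2b.

```python
import numpy as np

def slice_features(frame, n=20):
    """Compute n characteristic values for one time slice: Fourier transform,
    log-magnitude envelope, then a DCT-II to compact it into n coefficients."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    log_env = np.log(spectrum + 1e-10)
    k = np.arange(len(log_env))
    basis = np.cos(np.pi / len(log_env) * (k[:, None] + 0.5) * np.arange(n)[None, :])
    return log_env @ basis                                   # vector 16, shape (n,)

def with_deltas(features):
    """Append frame-to-frame differences (vector 17 = [C, ΔC]), doubling n."""
    deltas = np.diff(features, axis=0, prepend=features[:1])
    return np.concatenate([features, deltas], axis=1)        # shape: (T, 2n)

# vectors 16 for every time slice, then vectors 17 with appended differences:
# feats16 = np.stack([slice_features(f) for f in frames]); feats17 = with_deltas(feats16)
```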

(16) In FIG. 3a the temporal evolution of any of the characteristics C.sub.y of FIG. 2 is shown. For each time slice, one such value is calculated and corresponds to a data point in FIG. 3a.

(17) For each voice utterance more than 1,000 or more than 10,000 time slices may be evaluated, giving more than 1,000 or more than 10,000 data points in FIG. 3a.

(18) The temporal evolution of such a characteristic C.sub.y may be compared between two different voice utterances.

(19) In FIG. 3a only one characteristic C.sub.y of the vectors 16 or 17 is shown; however, such comparisons between two voice utterances may be based on at least two, three or five such characteristics C of the sets of values. Preferably, only a subgroup of the values of the set of values 16 or 17 is used for determining the similarity.

(20) In FIG. 3b a particular example of applying dynamic time warping is shown. The characteristic C.sub.y corresponds, for example, to the first voice utterance and the characteristic C′.sub.y corresponds to the corresponding characteristic of the second voice utterance. As shown in FIG. 3b, each of the characteristics has a temporal evolution corresponding to the one shown in FIG. 3a.

(21) Line 23 in FIG. 3b indicates how one time axis t is matched onto the other time axis t′. Thereby, the time interval from 0 to t.sub.1 and the time interval from 0 to t′.sub.1 are linearly matched to each other, although they have different real time lengths. The time interval from 0 to t.sub.1 may be, for example, 200 milliseconds and the time interval from 0 to t′.sub.1 170 milliseconds. The data of one of the two voice utterances are temporally expanded or compressed such that the two intervals correspond to the same interval, e.g. 200 milliseconds. The time intervals from t.sub.1 to t.sub.2 and from t′.sub.1 to t′.sub.2 are matched in a different way. Line 23 does not have to be composed of straight segments, but may also be a curved line.
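
A minimal sketch of the linear matching described above: a 170 ms segment of one utterance's characteristic is resampled onto the 200 ms time base of the other by linear interpolation (the durations are taken from the example in the text; the interpolation method is an illustrative assumption).

```python
import numpy as np

def stretch_segment(values, target_len):
    """Linearly resample a 1-D sequence of characteristic values onto a new
    number of time slices, e.g. stretching a 170 ms segment to 200 ms."""
    src = np.linspace(0.0, 1.0, num=len(values))
    dst = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(dst, src, values)

# e.g. 17 slices (~170 ms at a 10 ms hop) stretched onto 20 slices (~200 ms)
segment = np.random.randn(17)
stretched = stretch_segment(segment, 20)
```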

(22) With the dynamic time warping as shown in FIG. 3b, each time indication for one voice utterance is mapped to another time indication on the other voice utterance such that at least for one of the two voice utterances some time intervals are stretched or compressed.

(23) In FIG. 3c it is shown, for example, how the curve 22 has been transformed by the dynamic time warping process into the curve 24, while curve 21 corresponding to C′.sub.y has not been changed. The two curves shown in FIG. 3c can be compared with each other in order to determine the similarity between the two voice utterances, for example by calculating a correlation of the two curves. As mentioned above, not only the correlation between one characteristic C.sub.y and C′.sub.y may be calculated; two, three, four or more characteristics may be used, such that multidimensional correlations are calculated.
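
A minimal sketch of dynamic time warping followed by a correlation-based similarity, applied to one characteristic per utterance; the distance measure, the absence of warping constraints and the use of Pearson correlation are illustrative assumptions.

```python
import numpy as np

def dtw_path(a, b):
    """Classic dynamic time warping between two 1-D sequences; returns the
    alignment path as lists of indices into a and b."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack from the end of both sequences to (0, 0)
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return [p[0] for p in path], [p[1] for p in path]

def warped_correlation(a, b):
    """Align the two curves via DTW, then correlate the aligned curves."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    ia, ib = dtw_path(a, b)
    return np.corrcoef(a[ia], b[ib])[0, 1]

# e.g. similarity = warped_correlation(curve_for_utterance_1, curve_for_utterance_2)
```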

(24) A set of values which represents the biometric voice characteristics used for determining the similarity of two voice utterances may be compared to a statistical voice model. This is shown schematically in FIG. 4a. For explanation, a statistical model composed of three Gaussians G1, G2, G3 (statistical voice model components) is shown in FIG. 4a. Please note that in FIG. 4a, for simplicity of explanation, one-dimensional Gaussians are used, indicating the probability W of a particular value of C.sub.n. A statistical voice model typically has more than 500 or 1000 and preferably fewer than 2000 components. The number of components is named l.

(25) A specific value v of the characteristic C.sub.n occurs with a different probability W according to each of the Gaussians G1, G2 and G3. This probability W, evaluated according to each of the Gaussians G1, G2, G3, leads to the values m shown in vector 24 in FIG. 4b.

(26) Specifically, the probability W(v) that the characteristic C.sub.n has the value v is calculated for expressing the coincidence of a biometric voice characteristic with a statistical voice model. This is an example of deriving a set of values from a set of values of biometric voice characteristics. The derived set of values may have, for example, the same number (l) of values as there are components (l) of the statistical voice model.

(27) Such a derived set 24 may be derived for multiple time slices T. Hence, the temporal evolution of each of the values m can be calculated, similarly to the evolution explained with reference to FIG. 3a. Hereby a matrix of size l times the number of time slices for which a data set was derived is obtained (for each voice utterance). In the matrix, only a subset (i.e. not all) of the l values may be taken into account for determining the similarity, for example only l/2 or l/3 values (times the number of time slices for which a data set was derived).
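
A minimal sketch of deriving the set of values 24 for every time slice, yielding an l × T matrix per utterance; the diagonal-covariance component form and the parameter shapes are illustrative assumptions (the components could, for instance, be those of the verification model sketched earlier).

```python
import numpy as np

def component_probabilities(features, means, variances):
    """For each time slice (row of `features`, shape T x D) compute the
    probability density under each of the l model components (means,
    variances: l x D, diagonal covariance).  Returns an l x T matrix,
    i.e. one derived vector 24 per time slice."""
    diff = features[:, None, :] - means[None, :, :]               # (T, l, D)
    norm = np.prod(2 * np.pi * variances, axis=1) ** -0.5         # (l,)
    dens = norm[None, :] * np.exp(-0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2))
    return dens.T                                                 # (l, T)

# as in paragraph (27), only a subset of the l rows might be kept, e.g.:
# matrix = component_probabilities(feats, means, variances)[: means.shape[0] // 2]
```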

(28) The temporal evolution of any of the values can be used to determine correlations between two voice utterances. Hereby, also dynamic time warping may be performed.

(29) In FIG. 5 another possible analysis for determining the similarity of two voice utterances is shown. In this figure two images are shown. One axis of the images refers to a time axis and the other axis refers to the coefficient numbering of the set of values 24. This means that one derived set of values 24 corresponds to one column of the data shown in FIG. 5. Instead of using the values of the derived set of values, the biometric voice data of the set of values 16 or 17 can be used.

(30) If the data shown in FIG. 5 are considered to be images, such as black-and-white images, the actual value of a number m.sub.x would correspond to the brightness. The image would have l lines. For each voice utterance the image 40 or 41 is calculated. High values of m.sub.x would correspond to a bright indication in the image, such as shown in stripes 44 and 49, while low values would correspond to darker portions, such as shown in regions 42, 46, 47 and 51. Medium values would lead to intermediate brightness, such as in zones 43, 45, 48 or 50.

(31) The data sets 40 and 41 shown in FIG. 5 can be analyzed in various ways in order to determine the similarity of the two received voice utterances. For example, image correlations may be calculated taking into account all data. Alternatively, only data with a value of m above a certain threshold may be analyzed, e.g. the values in regions 43, 44 and 45 together with the regions 48, 49 and 50.
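
A minimal sketch of an image-style correlation between the two matrices 40 and 41, optionally restricted to entries above a threshold; the normalization and the thresholding rule are illustrative assumptions.

```python
import numpy as np

def image_correlation(img_a, img_b, threshold=None):
    """Correlate two l x T matrices (images 40 and 41).  If a threshold is
    given, only positions where either image exceeds it contribute."""
    a, b = np.asarray(img_a, float), np.asarray(img_b, float)
    if threshold is not None:
        mask = (a > threshold) | (b > threshold)
        a, b = a[mask], b[mask]          # keep only the bright regions
    else:
        a, b = a.ravel(), b.ravel()
    return np.corrcoef(a, b)[0, 1]
```

This simple version assumes the two matrices have the same number of time slices; the dynamic time warping mentioned in paragraph (32) could be used to align them first when the two utterances differ in length.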

(32) Also, dynamic time warping for the data sets 40, 41 may be carried out in order to compare the two data sets.

(33) In FIG. 6 another preferred embodiment of the method is shown. In step 6, speaker verification (corresponding to step 2 in FIG. 1) is carried out. In step 61 it is decided whether the identity is considered to be verified; if not, the speaker is rejected (62), otherwise a passive test for falsification 63 is carried out. A passive test for falsification does not request any further input from the speaker apart from the first received voice utterance. In this respect, reference is made to the explanations and definitions of a passive test for falsification disclosed in the above-mentioned PCT application PCT/EP2008/010478.

(34) In case the passive test for falsification considers the voice utterance to be falsified in item 64, a second voice utterance is requested in item 65 and received in item 66. The speaker verification of the second received voice utterance is performed in item 67 and evaluated in item 68. If the identity of the speaker cannot be verified, the speaker is rejected in item 69. If the identity can be verified, the method proceeds to the determination of an exact match in item 70. The determination of the exact match according to the present method is done by calculating the similarity of the two received voice utterances using biometric voice characteristics. If this test indicates a falsification in item 71, the speaker is rejected in item 72; otherwise the speaker is accepted in item 73.

(35) The herein described determination of the similarity of the two received voice utterances can be carried out as a determination of an exact match in each of the cases mentioned in the above-mentioned PCT application PCT/EP2008/010478. The disclosure of this application is therefore fully included in the present application by reference; each of the methods mentioned in PCT/EP2008/010478 which mentions an exact match is considered to be included and disclosed herein by reference.

(36) FIG. 7 shows a particular embodiment of a system 80 having a component 81 for receiving a first and a second voice utterance through input 84.

(37) Furthermore, a component 82 is shown for using biometric voice data to verify that the speaker's voice corresponds to the speaker whose identity is to be verified, based on the received first and second voice utterances.

(38) Furthermore, a component 83 for comparing the two received voice utterances in order to determine the similarity of the two voice utterances is shown. This component 83 uses biometric voice characteristics of the two voice utterances or data derived from such biometric voice characteristics in order to determine the similarity of the two voice utterances. The result of the verification of the identity is output by means 85.
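
As a rough illustration of how the components 81, 82 and 83 of system 80 might be organized in software, the following sketch wires together the helper functions from the earlier sketches (frame_signal, slice_features, with_deltas, verify_speaker and warped_correlation are assumed to be in scope); the class structure, method names and threshold values are illustrative assumptions, not dictated by the patent.

```python
import numpy as np

class VerificationSystem:
    """Rough structure of system 80: a receiving component (81), a speaker
    verification component (82) and a similarity / exact-match component (83)."""

    def __init__(self, speaker_gmm, ubm, verify_threshold=0.5, similarity_threshold=0.98):
        self.speaker_gmm = speaker_gmm
        self.ubm = ubm
        self.verify_threshold = verify_threshold
        self.similarity_threshold = similarity_threshold

    def receive(self, signal):                        # component 81 (input 84)
        frames = frame_signal(signal)                 # overlapping time slices
        return with_deltas(np.stack([slice_features(f) for f in frames]))

    def verify(self, features):                       # component 82
        return verify_speaker(features, self.speaker_gmm, self.ubm,
                              self.verify_threshold)

    def suspiciously_similar(self, feats1, feats2):   # component 83
        # compare one characteristic's temporal evolution after DTW alignment
        score = warped_correlation(feats1[:, 0], feats2[:, 0])
        return score > self.similarity_threshold

    def decide(self, utterance1, utterance2):         # overall flow of FIG. 1, output 85
        f1, f2 = self.receive(utterance1), self.receive(utterance2)
        if not (self.verify(f1) and self.verify(f2)):
            return "rejected"
        return "rejected" if self.suspiciously_similar(f1, f2) else "accepted"
```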