SPEAKER AUTHENTICATION SYSTEM, METHOD, AND PROGRAM

20220375476 · 2022-11-24

Abstract

Provided is a speaker authentication system capable of achieving robustness against adversarial examples. A data storage unit 112 stores data related to voice of a speaker. A plurality of voice processing units 11 respectively perform speaker authentication based on input voice and the data stored in the data storage unit 112. A post-processing unit 116 specifies one speaker authentication result based on speaker authentication results obtained respectively by the plurality of the voice processing units 11. A method or parameters of the pre-processing applied to the voice in each voice processing unit 11 are different for each voice processing unit 11.

Claims

1. A speaker authentication system comprising: a data storage unit which stores data related to voice of a speaker, a plurality of voice processing units which perform speaker authentication based on input voice and the data stored in the data storage unit, and a post-processing unit which specifies one speaker authentication result based on speaker authentication results obtained respectively by the plurality of the voice processing units, wherein each voice processing unit includes, a pre-processing unit which performs pre-processing for the voice, a feature extraction unit which extracts features from voice data obtained by the pre-processing, a similarity calculation unit which calculates a similarity between the features and features obtained from the data stored in the data storage unit, and an authentication unit which performs speaker authentication based on the similarity calculated by the similarity calculation unit, and wherein a method or parameters of the pre-processing are different for each pre-processing unit included in each voice processing unit.

2. A speaker authentication system comprising: a data storage unit which stores data related to voice of a speaker, a plurality of voice processing units which calculate a similarity between features obtained from input voice and features obtained from the data stored in the data storage unit, and an authentication unit which performs speaker authentication based on the similarity obtained respectively by the plurality of the voice processing units, wherein each voice processing unit includes, a pre-processing unit which performs pre-processing for voice, a feature extraction unit which extracts features from voice data obtained by the pre-processing, and a similarity calculation unit which calculates the similarity between the features and the features obtained from the data stored in the data storage unit, and wherein a method or parameters of the pre-processing are different for each pre-processing unit included in each voice processing unit.

3. The speaker authentication system according to claim 1, wherein each pre-processing unit performs the pre-processing applying a mel filter after applying a short-time Fourier transform to the input voice, and a dimensionality of the mel filter is different for each pre-processing unit.

4. A speaker authentication method, wherein a plurality of voice processing units respectively perform speaker authentication based on input voice and data stored in a data storage unit which stores the data related to voice of a speaker, and a post-processing unit specifies one speaker authentication result based on speaker authentication results obtained respectively by the plurality of the voice processing units, wherein each voice processing unit performs pre-processing for voice, extracts features from voice data obtained by the pre-processing, calculates a similarity between the features and features obtained from the data stored in the data storage unit, and performs speaker authentication based on the calculated similarity, and wherein a method or parameters of the pre-processing are different for each voice processing unit.

5. (canceled)

6. The speaker authentication method according to claim 4, wherein each voice processing unit performs a processing applying a mel filter after applying a short-time Fourier transform to the input voice, as the pre-processing, and wherein a dimensionality of the mel filter is different for each voice processing unit.

7. (canceled)

8. (canceled)

9. (canceled)

Description

BRIEF DESCRIPTION OF DRAWINGS

[0043] FIG. 1 depicts a graph showing an experimental result of an experiment to check the attack success rate of adversarial examples in multiple speaker authentication systems with different dimensionality of the mel filter in pre-processing.

[0044] FIG. 2 depicts a block diagram showing a configuration example of a speaker authentication system of an example embodiment of the present invention.

[0045] FIG. 3 depicts a flowchart showing an example of the processing procedure of the first example embodiment.

[0046] FIG. 4 depicts a schematic block diagram showing a configuration example of a computer that realizes a speaker authentication system with each voice processing unit, a data storage unit, and a post-processing unit.

[0047] FIG. 5 depicts a block diagram showing a configuration example of a speaker authentication system of the second example embodiment of the present invention.

[0048] FIG. 6 depicts a flowchart showing an example of the processing procedure of the second example embodiment.

[0049] FIG. 7 depicts a block diagram showing a specific example of the configuration of a speaker authentication system of the first example embodiment.

[0050] FIG. 8 depicts a flowchart showing an example of the processing procedure in the specific example shown in FIG. 7.

[0051] FIG. 9 depicts a block diagram showing an example of an overview of a speaker authentication system of the present invention.

[0052] FIG. 10 depicts a block diagram showing another example of an overview of a speaker authentication system of the present invention.

[0053] FIG. 11 depicts a block diagram showing an example of a general speaker authentication system.

[0054] FIG. 12 depicts a schematic diagram showing a spoofing attack defense system described in Non-Patent Literature 2.

EXAMPLE EMBODIMENTS

[0055] First, the examination conducted by the inventor of the present invention will be described.

[0056] As mentioned above, in recent years, models trained by machine learning have been increasingly used in speaker authentication systems. One of the security issues with such models is adversarial examples. As already described, an adversarial example is data to which a perturbation has been intentionally added, calculated so that a false positive can be derived from the model. Adversarial examples are a problem that can arise in any model trained by machine learning, and to date, no model has been proposed that is unaffected by them. Methods have therefore been proposed, especially in the image domain, to ensure robustness against adversarial examples by adding a defense technique similar to the technique described in Non-Patent Literature 2. However, when such a defense technique relies on heuristic knowledge of how adversarial examples are generated, it has been reported that attacks using adversarial examples generated by a different method can still easily succeed. Therefore, it is highly desirable that defense techniques against adversarial examples do not use heuristic knowledge about adversarial examples.

[0057] One of the properties of adversarial examples is transferability. Transferability is the property that an adversarial example generated to attack one model can also attack another model that performs the same task. By exploiting transferability, even if the target model cannot be directly obtained or operated, an attacker can attack it by preparing a substitute model that performs the same task and generating adversarial examples against that substitute model.

[0058] In many speaker authentication systems, the voice to be authenticated is not treated as a raw voice waveform, but in the form of data converted into the frequency domain by a short-time Fourier transform or the like during pre-processing. In addition, various filters are often applied; one such filter is the mel filter. The inventor has experimentally shown that, when the pre-processing devices of individual speaker authentication systems apply mel filters of different dimensionality to the voice, an adversarial example with a high attack success rate against one speaker authentication system can have a significantly lower attack success rate against another speaker authentication system whose mel filter dimensionality differs. In other words, the inventor experimentally showed that transferability can be significantly reduced when the dimensionality of the mel filter in the pre-processing is different.

[0059] FIG. 1 is a graph showing the result of an experiment to check the attack success rate of adversarial examples in multiple speaker authentication systems with different mel filter dimensionality in pre-processing. In this experiment, three speaker authentication systems were used. The configuration of the three systems is the same, but the dimensionality of the mel filter in the pre-processing is 40, 65, and 90, respectively.

[0060] Among the three speaker authentication systems, adversarial examples were generated using the system with the 90-dimensional mel filter, and the change in the attack success rate when these adversarial examples are used to attack the above three speaker authentication systems is shown as a solid line in FIG. 1. The attack success rate against the system with the 90-dimensional mel filter is high, but it can be seen from FIG. 1 that the attack success rate decreases as the dimensionality decreases from 90 to 65 and 40.

[0061] Likewise, among the three speaker authentication systems, adversarial examples were generated using the system with the 40-dimensional mel filter, and the change in the attack success rate when these adversarial examples are used to attack the three speaker authentication systems is shown as a dashed line in FIG. 1. The attack success rate against the system with the 40-dimensional mel filter is high, but it can be seen from FIG. 1 that the attack success rate decreases as the dimensionality increases from 40 to 65 and 90.

[0062] Based on these findings, the inventor made the present invention.

[0063] Hereinafter, example embodiments of the present invention will be explained with reference to the drawings.

Example Embodiment 1

[0064] FIG. 2 is a block diagram showing a configuration example of a speaker authentication system of the first example embodiment of the present invention. The speaker authentication system of the first example embodiment comprises a plurality of voice processing units 11-1 to 11-n, a data storage unit 112, and a post-processing unit 116. When individual voice processing units are not specifically distinguished, the reference sign "11" is used to denote a voice processing unit without the suffixes "-1", "-2", . . . , "-n". The same applies to the reference signs of the elements included in each voice processing unit 11.

[0065] In this example, the number of voice processing units 11 is n (refer to FIG. 2).

[0066] Common voice is input to each voice processing unit 11, and each voice processing unit 11 performs speaker authentication for the voice. Specifically, each voice processing unit 11 performs a process to determine the speaker of the voice.

[0067] Each individual voice processing unit 11 includes a pre-processing unit 111, a feature extraction unit 113, a similarity calculation unit 114, and an authentication unit 115. For example, the voice processing unit 11-1 includes a pre-processing unit 111-1, a feature extraction unit 113-1, a similarity calculation unit 114-1, and an authentication unit 115-1.

[0068] In this example, it is assumed that each of the voice processing units 11-1 to 11-n, the data storage unit 112, and the post-processing unit 116 is realized by a separate computer, and that they are communicatively connected. However, the implementation of the voice processing units 11-1 to 11-n, the data storage unit 112, and the post-processing unit 116 is not limited to this example.

[0069] The pre-processing units 111-1 to 111-n installed in the voice processing units 11-1 to 11-n perform pre-processing on the voice. However, the method or parameters of the pre-processing are different for each of the pre-processing units 111-1 to 111-n. In other words, the method or parameters of the pre-processing are different for each individual pre-processing unit 111. Therefore, in this example, there are n types of pre-processing.

[0070] For example, each pre-processing unit 111 performs pre-processing by applying a short-time Fourier transform to the voice (more specifically, voice waveform data) input through a microphone, and then applying a mel filter to the result. The dimensionality of the mel filter is different for each pre-processing unit 111. Since the dimensionality of the mel filter differs for each pre-processing unit 111, the pre-processing performed on the voice differs for each pre-processing unit 111.
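
As an illustration of this pre-processing, the following sketch applies a short-time Fourier transform and then a mel filter whose dimensionality can be set per pre-processing unit 111. It is a minimal sketch rather than the patented implementation; the sampling rate, window length, hop size, and the specific mel-scale formula are assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):        # rising slope
            if center > left:
                fb[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):       # falling slope
            if right > center:
                fb[i - 1, k] = (right - k) / (right - center)
    return fb

def preprocess(waveform, n_mels, n_fft=512, hop=160, sr=16000):
    # Short-time Fourier transform: frame, window, FFT magnitude.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(waveform) - n_fft) // hop
    frames = np.stack([waveform[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    spectrum = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    # Mel filter: n_mels differs for each pre-processing unit 111.
    return spectrum @ mel_filterbank(n_mels, n_fft, sr).T
```

Running the same waveform through `preprocess` with `n_mels` equal to 40, 65, and 90 yields three differently pre-processed versions of the input, as in the experiment of FIG. 1.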

[0071] An aspect in which the method or parameters of the pre-processing are different for each pre-processing unit 111 is not limited to the above example. The method or parameters of the pre-processing may be different for each pre-processing unit 111 in other aspects.

[0072] The data storage unit 112 stores data related to voice for one or more speakers, for each speaker. Here, data related to voice is data from which features expressing the characteristics of voice of the speaker can be derived.

[0073] The data storage unit 112 may store, for each speaker, voice input through the microphone (more specifically, voice waveform data). Alternatively, the data storage unit 112 may store, for each speaker, data obtained by applying pre-processing to the voice waveform data. Alternatively, the data storage unit 112 may store, for each speaker, the features themselves extracted from data obtained by applying pre-processing to the voice waveform data, or data in a form obtained by applying an operation to the features.

[0074] As mentioned above, there are n types of pre-processing. Therefore, when storing data obtained after the pre-processing of voice waveform data, the data storage unit 112 stores n types of data per speaker. In other words, n types of data are stored in the data storage unit 112 for each speaker.

[0075] When the voice (voice waveform data) before pre-processing is stored in the data storage unit 112, the stored data does not depend on the pre-processing. Therefore, in this case, it is sufficient to store one type of voice waveform data for each speaker in the data storage unit 112. In the following description, for simplicity, the case where one type of voice waveform data is stored for each speaker in the data storage unit 112 will be explained first as an example. FIG. 2 illustrates how each pre-processing unit 111 obtains data from the data storage unit 112 in this case. The case where data obtained after pre-processing of the voice waveform data is stored in the data storage unit 112 will be described later.

[0076] As mentioned above, common voice is input to each voice processing unit 11, and each voice processing unit 11 performs speaker authentication on the voice. In other words, each voice processing unit 11 determines which of the speakers whose data is stored in the data storage unit 112 emitted the input voice.

[0077] Each of the pre-processing units 111-1 to 111-n performs, as pre-processing, a process of transforming the input voice into a format from which the feature extraction unit 113 can easily extract the features of the voice. An example of this pre-processing is applying a short-time Fourier transform to the voice (voice waveform data) and then applying a mel filter to the result. However, in this example embodiment, the dimensionality of the mel filter differs among the pre-processing units 111-1 to 111-n. In other words, the dimensionality of the mel filter is different for each pre-processing unit 111.

[0078] Examples of pre-processing are not limited to the above example. In addition, as already described, the aspect in which the method or parameters of the pre-processing are different for each pre-processing unit 111 is not limited to the above example.

[0079] When each pre-processing unit 111 pre-processes the input voice (voice waveform data), the pre-processing unit 111 also pre-processes the voice (voice waveform data) of each speaker stored in the data storage unit 112. As a result, one voice processing unit 11 obtains the result of pre-processing the input voice waveform data and the result of pre-processing the voice waveform data of each speaker. The same is true for each of the other voice processing units 11.

[0080] Each feature extraction unit 113 extracts voice features from the result of pre-processing the input voice waveform data. Similarly, each feature extraction unit 113 extracts voice features from the result of the pre-processing performed by the pre-processing unit 111 on the data of each speaker whose data is stored in the data storage unit 112 (hereinafter referred to as a registered speaker). As a result, in one voice processing unit 11, the features of the input voice and the features of the voice of each registered speaker are obtained. The same is true for each of the other voice processing units 11.

[0081] Each feature extraction unit 113 may extract features using a model obtained by machine learning, for example, or by performing statistical operation processing. However, the method of extracting features from the results of pre-processing is not limited to these methods, and other methods may be used.
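
As a toy example of the statistical-operation option mentioned above, the sketch below derives a fixed-length feature vector from pre-processed frames by taking the per-dimension mean and standard deviation over all frames. It is an illustrative stand-in for the feature extraction unit 113, not the method of the patent.

```python
import math

def extract_features(preprocessed):
    # preprocessed: list of frames, each a list of filter-bank values
    # (the output of one pre-processing unit 111).
    n_frames = len(preprocessed)
    dims = len(preprocessed[0])
    means = [sum(frame[d] for frame in preprocessed) / n_frames
             for d in range(dims)]
    stds = [math.sqrt(sum((frame[d] - means[d]) ** 2
                          for frame in preprocessed) / n_frames)
            for d in range(dims)]
    # Concatenating the statistics gives a fixed-length feature vector
    # regardless of the number of frames.
    return means + stds
```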

[0082] Each similarity calculation unit 114 calculates, for each registered speaker, the similarity between the features of the input voice and the features of the voice of the registered speaker. As a result, in one voice processing unit 11, a similarity is obtained for each registered speaker. The same is true for each of the other voice processing units 11.

[0083] Each similarity calculation unit 114 may calculate, as the similarity, a cosine similarity between the features of the input voice and the features of the voice of the registered speaker. Each similarity calculation unit 114 may also calculate, as the similarity, a reciprocal of the distance between the features of the input voice and the features of the voice of the registered speaker. However, the method of calculating the similarity is not limited to these methods, and other methods may also be used.
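
The two similarity measures mentioned above, i.e. the cosine similarity and the reciprocal of the distance, can be sketched as follows. The small `eps` term that avoids division by zero for identical vectors is an added assumption, not part of the patent.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def reciprocal_distance_similarity(a, b, eps=1e-9):
    # Reciprocal of the Euclidean distance between the feature vectors;
    # eps (an assumed regularization) guards against identical vectors.
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (dist + eps)
```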

[0084] Each authentication unit 115 performs speaker authentication based on the similarity calculated for each registered speaker. In other words, each authentication unit 115 determines which of the registered speakers emitted the input voice.

[0085] Each authentication unit 115 may, for example, compare the similarity calculated for each registered speaker with a threshold value, and identify a speaker whose similarity is greater than the threshold value as the speaker who emitted the input voice. If there is more than one speaker whose similarity is greater than the threshold value, each authentication unit 115 may identify the speaker with the greatest similarity among them as the speaker who emitted the input voice.

[0086] The above threshold value may be a fixed value or a variable value that varies according to a predetermined calculation method.
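
A minimal sketch of the thresholding logic of the authentication unit 115 described above might look like the following; returning `None` when no registered speaker exceeds the threshold is an assumed convention, not specified in the text.

```python
def authenticate(similarities, threshold):
    # similarities: mapping from registered speaker id to similarity score,
    # as produced by one similarity calculation unit 114.
    candidates = {spk: s for spk, s in similarities.items() if s > threshold}
    if not candidates:
        return None  # no registered speaker exceeds the threshold
    # If several speakers exceed the threshold, pick the greatest similarity.
    return max(candidates, key=candidates.get)
```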

[0087] In the voice processing units 11-1 to 11-n, the authentication units 115-1 to 115-n perform speaker authentication, so that a determination result of the speaker who emitted the input voice is obtained for each voice processing unit 11. Here, since the pre-processing differs in each voice processing unit 11, the determination results of the speaker obtained in the individual voice processing units 11 are not necessarily the same.

[0088] The post-processing unit 116 obtains the speaker authentication results from the authentication units 115-1 to 115-n, and specifies one speaker authentication result based on the speaker authentication results obtained by each of the authentication units 115-1 to 115-n. The post-processing unit 116 outputs the specified speaker authentication result to an output device (not shown in FIG. 2).

[0089] For example, the post-processing unit 116 may determine the speaker who emitted the input voice by majority voting based on the speaker authentication results obtained by the authentication units 115-1 to 115-n. In other words, the post-processing unit 116 may determine the speaker selected most often among the speaker authentication results of the authentication units 115-1 to 115-n as the speaker who emitted the input voice. However, the method by which the post-processing unit 116 specifies the single speaker authentication result is not limited to majority voting, and other methods may be used.
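
The majority-voting option for the post-processing unit 116 can be sketched as below. How ties are broken is not specified in the text; this sketch simply keeps the first speaker seen among those tied.

```python
from collections import Counter

def majority_vote(results):
    # results: one speaker decision per voice processing unit 11,
    # e.g. the outputs of the authentication units 115-1 to 115-n.
    counts = Counter(results)
    # most_common preserves first-seen order among equal counts.
    return counts.most_common(1)[0][0]
```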

[0090] In this example, each of the authentication units 115-1 to 115-n performs speaker authentication, and the post-processing unit 116 specifies the single speaker authentication result based on the speaker authentication results obtained by the authentication units 115-1 to 115-n. That is, the speaker authentication system includes a plurality of elements (the voice processing units 11) that perform speaker authentication, and the system as a whole specifies a single speaker authentication result.

[0091] The speaker authentication system of the example embodiment of the present invention can also be used as a detection system for adversarial examples by exploiting the differences among the pre-processing units 111-1 to 111-n. In other words, it can also be used as a system for determining whether the input voice is an adversarial example or natural voice. In this case, for example, the post-processing unit 116 may determine that the input voice is an adversarial example if the speaker authentication results of the voice processing units 11-1 to 11-n do not all match. However, the criterion for determining that the input voice is an adversarial example is not limited to this example.
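
The detection criterion described above can be sketched as follows. Reading "do not match" as "not all identical" is an interpretation on our part, and, as the text notes, other criteria are possible.

```python
def detect_adversarial(results):
    # results: speaker decisions from the voice processing units 11-1 to 11-n.
    # Flag the input as a possible adversarial example when the decisions
    # are not all identical (one possible criterion from the text).
    return len(set(results)) > 1
```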

[0092] In this example, each voice processing unit 11 is realized by a computer. In this case, the pre-processing unit 111, the feature extraction unit 113, the similarity calculation unit 114, and the authentication unit 115 in each voice processing unit 11 are realized by a CPU (Central Processing Unit) of a computer operating according to a voice processing program, for example. In this case, the CPU can read the voice processing program from a program storage medium such as a program storage device of the computer, and operate as the pre-processing unit 111, the feature extraction unit 113, the similarity calculation unit 114, and the authentication unit 115 according to the program.

[0093] Next, the processing procedure of the first example embodiment will be explained. FIG. 3 is a flowchart showing an example of the processing procedure of the first example embodiment. Matters already explained are omitted as appropriate.

[0094] First, common voice (voice waveform data) is input to the pre-processing units 111-1 to 111-n (step S1).

[0095] Next, the pre-processing units 111-1 to 111-n perform pre-processing on the input voice waveform data, respectively (step S2). In addition, in step S2, the pre-processing units 111-1 to 111-n obtain the voice waveform data stored in the data storage unit 112 for each registered speaker and perform pre-processing on the obtained voice waveform data, respectively.

[0096] As mentioned above, the method or parameters of the pre-processing are different for each individual pre-processing unit 111. For example, the dimensionality of the mel filter used in the pre-processing is different for each pre-processing unit 111.

[0097] Following step S2, the feature extraction units 113-1 to 113-n each extract voice features from the results of the pre-processing in the corresponding pre-processing unit 111 (step S3).

[0098] For example, the feature extraction unit 113-1 extracts the features of the input voice from the result of the pre-processing performed by the pre-processing unit 111-1 on the input voice waveform data. The feature extraction unit 113-1 extracts the features of the voice from the results of the pre-processing performed by the pre-processing unit 111-1 on the voice waveform data stored in the data storage unit 112, for each registered speaker. The other respective feature extraction units 113 operate in the same manner.

[0099] Following step S3, the similarity calculation units 114-1 to 114-n each calculate, for each registered speaker, a similarity between the features of the input voice and the features of the voice of the registered speaker (step S4).

[0100] Next, the authentication units 115-1 to 115-n each perform speaker authentication based on the similarity calculated for each registered speaker (step S5). In other words, the authentication units 115-1 to 115-n each determine which of the registered speakers emitted the input voice.

[0101] Next, the post-processing unit 116 obtains the speaker authentication results from the authentication units 115-1 to 115-n, and specifies one speaker authentication result based on them (step S6). For example, the post-processing unit 116 may determine the speaker selected most often among the speaker authentication results of the authentication units 115-1 to 115-n as the speaker who emitted the input voice.

[0102] Next, the post-processing unit 116 outputs the speaker authentication result specified in step S6 to an output device (not shown in FIG. 2) (step S7). The aspect of output in step S7 is not particularly limited. For example, the post-processing unit 116 may display the speaker authentication result specified in step S6 on a display device (not shown in FIG. 2).

[0103] In the first example embodiment, the method or parameters of the pre-processing are different for each pre-processing unit 111 included in each voice processing unit 11. Therefore, even if the attack success rate of an adversarial example is high against one voice processing unit 11, its attack success rate will be reduced against the other voice processing units 11. Accordingly, the speaker authentication result obtained by a voice processing unit 11 against which the attack success rate is high is ultimately not selected by the post-processing unit 116, so robustness against adversarial examples can be achieved. In addition, in this example embodiment, changing the method or parameters of the pre-processing for each pre-processing unit 111 makes the attack success rates on the multiple voice processing units 11 differ, which enhances the robustness against adversarial examples. No heuristic knowledge of known adversarial examples is used to achieve this robustness. As a result, according to this example embodiment, robustness can be ensured even against unknown adversarial examples.

[0104] As mentioned above, the speaker authentication system of this example embodiment can also be used as a detection system for adversarial examples by exploiting the differences among the pre-processing units 111-1 to 111-n. For example, the post-processing unit 116 may determine that the input voice is an adversarial example if the speaker authentication results of the voice processing units 11-1 to 11-n do not all match. As already explained, the criterion for determining that the input voice is an adversarial example is not limited to this example.

[0105] In the above description, the case where the data storage unit 112 stores, for each speaker, the voice (voice waveform data) input through the microphone was explained as an example. As already explained, the data storage unit 112 may instead store data obtained after pre-processing of the voice waveform data. This case will be explained below.

[0106] First, the case where the data storage unit 112 stores, for each speaker, the data obtained by applying pre-processing to the voice waveform data will be explained. Each pre-processing unit 111 has a different pre-processing method or parameters; in other words, there are n types of pre-processing. Therefore, when focusing on a single speaker (referred to as speaker p), the data obtained by applying each of the n types of pre-processing to the voice waveform data of speaker p should be prepared. Specifically, "data obtained by applying the pre-processing of the pre-processing unit 111-1 to the voice waveform data of speaker p", "data obtained by applying the pre-processing of the pre-processing unit 111-2 to the voice waveform data of speaker p", . . . , "data obtained by applying the pre-processing of the pre-processing unit 111-n to the voice waveform data of speaker p" are prepared. As a result, n types of data for speaker p are obtained. In the same way, n types of data are prepared for each speaker other than speaker p. The n types of data for each individual speaker may then be stored in the data storage unit 112.
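
The preparation of n types of pre-processed data per speaker described above can be sketched as below; `build_data_storage` and its arguments are hypothetical names used only for illustration.

```python
def build_data_storage(speaker_waveforms, preprocessors):
    # speaker_waveforms: mapping from speaker id to raw voice waveform data.
    # preprocessors: the n pre-processing callables, one per unit 111-i.
    # Returns, per speaker, the n types of pre-processed data that would
    # be kept in the data storage unit 112.
    return {speaker: [pre(waveform) for pre in preprocessors]
            for speaker, waveform in speaker_waveforms.items()}
```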

[0107] In the above example, when the voice processing unit 11 obtains the data stored in the data storage unit 112, the feature extraction unit 113 may obtain, for each registered speaker, the data produced by the pre-processing of its corresponding pre-processing unit 111 from the data storage unit 112 and extract the features from that data.

[0108] For example, when the voice processing unit 11-1 obtains the data stored in the data storage unit 112, the feature extraction unit 113-1 may obtain, for each registered speaker, the data produced by the pre-processing of the pre-processing unit 111-1 from the data storage unit 112 and extract the features from that data. The same applies when the other voice processing units 11 obtain the data stored in the data storage unit 112.

[0109] Next, the case where the data storage unit 112 stores, for each speaker, the features themselves extracted from the data obtained by pre-processing the voice waveform data will be explained. In this case as well, n types of data (features) per speaker may be prepared and stored in the data storage unit 112. For example, as the n types of data for speaker p, "features extracted from the pre-processing result of the pre-processing unit 111-1 on the voice waveform data of speaker p", "features extracted from the pre-processing result of the pre-processing unit 111-2 on the voice waveform data of speaker p", . . . , "features extracted from the pre-processing result of the pre-processing unit 111-n on the voice waveform data of speaker p" are prepared. In the same way, n types of data (features) are prepared for each speaker other than speaker p, and the n types of data for each individual speaker may be stored in the data storage unit 112.

[0110] In the above example, the data storage unit 112 stores data related to the voice in the format of features. Therefore, when the voice processing unit 11 obtains the data stored in the data storage unit 112, the similarity calculation unit 114 may obtain the features corresponding to the pre-processing of the pre-processing unit 111 corresponding to the feature extraction unit 113 from the data storage unit 112, for each registered speaker. Then, the similarity calculation unit 114 may calculate a similarity between the features and the features of the voice input to the voice processing unit 11.

[0111] For example, when the voice processing unit 11-1 obtains the features stored in the data storage unit 112, the similarity calculation unit 114-1 may obtain “features extracted from the pre-processing results of the pre-processing unit 111-1 on the voice waveform data of speaker” from the data storage unit 112, for each registered speaker. Then, the similarity calculation unit 114-1 may calculate a similarity between the features and the features of the voice input to the voice processing unit 11-1. The same applies when the other voice processing unit 11 obtains the features stored in the data storage unit 112.
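As an illustration only, the per-unit feature lookup described above can be sketched as follows. The dictionary layout keyed by (speaker, pre-processing unit index), the function name, and the use of cosine similarity are assumptions made for this sketch, not the claimed storage format.

```python
import numpy as np

def similarities_for_unit(feature_store, unit_index, input_features, speakers):
    # For one voice processing unit, fetch the stored features that were
    # extracted from that unit's own pre-processing results, then compute
    # a cosine similarity per registered speaker.
    sims = {}
    for p in speakers:
        ref = feature_store[(p, unit_index)]  # features matching unit 111-<unit_index>
        sims[p] = float(np.dot(input_features, ref)
                        / (np.linalg.norm(input_features) * np.linalg.norm(ref)))
    return sims

# Hypothetical store: two registered speakers, features kept per (speaker, unit) pair.
store = {
    ("p", 1): np.array([1.0, 0.0]),
    ("q", 1): np.array([0.0, 1.0]),
}
sims_1 = similarities_for_unit(store, 1, np.array([1.0, 0.0]), ["p", "q"])
```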

[0112] In the first example embodiment described above, each of the voice processing units 11-1 to 11-n, the data storage unit 112, and the post-processing unit 116 is realized by a separate computer as an example. In the following, the case where the speaker authentication system comprising the voice processing units 11-1 to 11-n, the data storage unit 112, and the post-processing unit 116 is realized by a single computer will be explained.

[0113] FIG. 4 is a schematic block diagram showing a configuration example of a single computer that realizes a speaker authentication system comprising the voice processing units 11-1 to 11-n, the data storage unit 112, and the post-processing unit 116. The computer 1000 comprises a CPU 1001, a main memory 1002, an auxiliary memory 1003, an interface 1004, a microphone 1005, and a display device 1006.

[0114] Microphone 1005 is an input device used for voice input. The input device used for voice input may be a device other than the microphone 1005.

[0115] The display device 1006 is used to display the speaker authentication result specified in step S6 (refer to FIG. 3) above. However, as mentioned above, the output aspect in step S7 (refer to FIG. 3) is not limited.

[0116] The operation of the speaker authentication system comprising the voice processing units 11-1 to 11-n, the data storage unit 112, and the post-processing unit 116 is stored in the form of a program in the auxiliary memory 1003. Hereinafter, this program is referred to as a speaker authentication program. The CPU 1001 reads the speaker authentication program from the auxiliary memory 1003, expands it to the main memory 1002, and, according to the speaker authentication program, operates as the plurality of voice processing units 11-1 to 11-n and the post-processing unit 116 in the first example embodiment. The data storage unit 112 may be realized by the auxiliary memory 1003, or by other storage devices provided by the computer 1000.

[0117] The auxiliary memory 1003 is an example of a non-transitory tangible medium. Other examples of non-transitory tangible media include magnetic disks, magneto-optical disks, CD-ROM (Compact Disk Read Only Memory), DVD-ROM (Digital Versatile Disk Read Only Memory), semiconductor memory, and the like, which are connected through the interface 1004.

[0118] When the speaker authentication program is delivered to the computer 1000 through a communication line, the computer 1000 receiving the delivery may expand the speaker authentication program into the main memory 1002 and operate as the plurality of voice processing units 11-1 to 11-n and the post-processing unit 116 in the first example embodiment.

Example Embodiment 2

[0119] FIG. 5 is a block diagram showing a configuration example of a speaker authentication system of the second example embodiment of the present invention. Elements similar to those of the first example embodiment are marked with the same code as in FIG. 2, and a detailed description is omitted. The speaker authentication system of the second example embodiment comprises a plurality of voice processing units 21-1 to 21-n, a data storage unit 112, and an authentication unit 215. In the case where individual voice processing units are not specifically distinguished, the code “21” is used to denote the voice processing unit without “-1”, “-2”, . . . , and “-n”. The same applies to the code representing each element included in the voice processing unit 21.

[0120] In this example, the number of voice processing units 21 is n (refer to FIG. 5).

[0121] Common voice is input to each voice processing unit 21, and each voice processing unit 21 calculates a similarity between features of the input voice and features of each registered speaker (features obtained from the data of each speaker stored in the data storage unit 112).

[0122] As described below, each voice processing unit 21 includes the pre-processing unit 111. The method or parameters of the pre-processing are different for each individual pre-processing unit 111.

[0123] The data storage unit 112 stores data related to voice for one or more speakers for each speaker, similar to the data storage unit 112 in the first example embodiment.

[0124] The data storage unit 112 may store, for each speaker, voice input through the microphone (more specifically, voice waveform data). Alternatively, the data storage unit 112 may store, for each speaker, data obtained by applying pre-processing to the voice waveform data. Alternatively, the data storage unit 112 may store, for each speaker, the features themselves extracted from data obtained by applying pre-processing to the voice waveform data, or data in a form obtained by applying an operation to the features.

[0125] When the data storage unit 112 stores the data obtained by applying pre-processing to the voice waveform data for each speaker, n types of data may be prepared for each speaker, and the n types of data of each individual speaker may be stored in the data storage unit 112.

[0126] When the data storage unit 112 stores the features themselves extracted from the data obtained by applying pre-processing to the voice waveform data for each speaker, n types of data (features) may be prepared for each speaker, and the n types of features of each speaker may be stored in the data storage unit 112.

[0127] In the case where the data storage unit 112 stores voice (voice waveform data) before pre-processing is performed, it is sufficient to store one type of voice waveform data for each speaker in the data storage unit 112.

[0128] Since these matters related to the data storage unit 112 have been described in the first example embodiment, a detailed explanation is omitted here.

[0129] Hereinafter, the case where the data storage unit 112 stores voice (voice waveform data) before the pre-processing is performed will be explained.

[0130] Each of the voice processing units 21 includes the pre-processing unit 111, the feature extraction unit 113, and the similarity calculation unit 114. For example, the voice processing unit 21-1 includes the pre-processing unit 111-1, the feature extraction unit 113-1, and the similarity calculation unit 114-1.

[0131] In this example, it is assumed that each of the voice processing units 21-1 to 21-n, the data storage unit 112, and the authentication unit 215 are realized by separate computers. Each of the voice processing units 21-1 to 21-n, the data storage unit 112, and the authentication unit 215 are communicatively connected. However, aspects of the voice processing units 21-1 to 21-n, the data storage unit 112, and the authentication unit 215 are not limited to such example.

[0132] The pre-processing units 111-1 to 111-n are the same as the pre-processing units 111-1 to 111-n in the first example embodiment. As explained in the first example embodiment, each of the pre-processing units 111-1 to 111-n performs, as pre-processing, a process of converting the input voice into a format from which the feature extraction unit 113 can easily extract the features of the voice. An example of this pre-processing is a process of applying a short-time Fourier transform to the voice (voice waveform data) and then applying a mel filter to the result. Here, the method or parameters of the pre-processing are different for each pre-processing unit 111. In this example, the dimensionality of the mel filter is assumed to be different for each of the pre-processing units 111-1 to 111-n.
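A minimal NumPy sketch of this pre-processing, a short-time Fourier transform followed by a mel filter whose dimensionality differs per unit, might look as follows. The window size, hop length, sampling rate, and the simple triangular filterbank construction are illustrative assumptions, not parameters taken from the embodiment.

```python
import numpy as np

def stft_mag(x, n_fft=512, hop=256):
    # Magnitude short-time Fourier transform with a Hann window.
    w = np.hanning(n_fft)
    frames = [np.abs(np.fft.rfft(x[s:s + n_fft] * w))
              for s in range(0, len(x) - n_fft + 1, hop)]
    return np.array(frames)                      # shape: (frames, n_fft // 2 + 1)

def mel_filterbank(n_mels, n_fft=512, sr=16000):
    # Triangular mel filters; n_mels is the dimensionality that differs
    # between the pre-processing units 111-1 to 111-n.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def preprocess(x, n_mels):
    # Pre-processing of one unit: STFT, then a mel filter of dimension n_mels.
    return stft_mag(x) @ mel_filterbank(n_mels).T

x = np.random.default_rng(0).standard_normal(16000)  # 1 s of dummy "voice"
out_a = preprocess(x, 40)   # e.g. pre-processing unit 111-1: 40-dimensional mel filter
out_b = preprocess(x, 64)   # e.g. pre-processing unit 111-2: 64-dimensional mel filter
```

Because each unit uses a different mel dimensionality, an adversarial perturbation tuned to one representation need not survive the others, which is the robustness mechanism the embodiments rely on.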

[0133] Examples of pre-processing are not limited to the above examples. The aspect in which the method or parameters of the pre-processing are different for each pre-processing unit 111 is not limited to the above example.

[0134] When each pre-processing unit 111 pre-processes the input voice (voice waveform data), the pre-processing unit 111 also pre-processes the voice (voice waveform data) of each speaker stored in the data storage unit 112.

[0135] Each feature extraction unit 113 is the same as each feature extraction unit 113 in the first example embodiment. Each feature extraction unit 113 extracts voice features from a result of pre-processing on the input voice waveform data. Similarly, each feature extraction unit 113 extracts voice features from a result of pre-processing performed by the pre-processing unit 111 for each registered speaker.

[0136] Each feature extraction unit 113 may extract features using a model obtained by machine learning, for example, or by performing statistical operation processing. However, the method of extracting features from the result of pre-processing is not limited to these methods, but may be other methods.

[0137] Each similarity calculation unit 114 calculates, for each registered speaker, a similarity between the features of the input voice and the features of the voice of the registered speaker.

[0138] Each similarity calculation unit 114 may calculate, as the similarity, a cosine similarity between the features of the input voice and the features of the voice of the registered speaker.

[0139] Each similarity calculation unit 114 may also calculate, as the similarity, a reciprocal of the distance between the features of the input voice and the features of the voice of the registered speaker. However, the method of calculating the similarity is not limited to these methods, and other methods may also be used.
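The two similarity measures mentioned above can be sketched as follows. The small epsilon in the reciprocal-distance variant is an added safeguard against division by zero when the two feature vectors coincide; it is not part of the described method.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between the input-voice features and a speaker's features.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def reciprocal_distance_similarity(a, b, eps=1e-9):
    # Reciprocal of the Euclidean distance between the two feature vectors.
    return float(1.0 / (np.linalg.norm(a - b) + eps))

# Identical feature vectors give the maximum cosine similarity.
print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # prints 1.0
```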

[0140] The authentication unit 215 performs speaker authentication based on the similarity calculated for each speaker by each of the voice processing units 21-1 to 21-n (more specifically, each of the similarity calculation units 114-1 to 114-n). In other words, the authentication unit 215 determines, based on the similarity calculated for each registered speaker in each of the similarity calculation units 114-1 to 114-n, which registered speaker's voice the input voice is. In addition, the authentication unit 215 outputs the speaker authentication result (which registered speaker's voice the input voice is) to an output device (not shown in FIG. 5).

[0141] An example of the speaker authentication operation performed by the authentication unit 215 will be explained below.

[0142] The authentication unit 215 obtains a similarity for each registered speaker from each of the n similarity calculation units 114-1 to 114-n. For example, assume that there are x registered speakers. In this case, the authentication unit 215 obtains the similarity of x speakers from the similarity calculation unit 114-1. Similarly, the authentication unit 215 obtains the similarity of x speakers from the similarity calculation units 114-2 to 114-n.

[0143] The authentication unit 215 holds a threshold value for each of the pre-processing units 111-1 to 111-n. In other words, the authentication unit 215 holds a threshold value Th1 corresponding to the pre-processing unit 111-1, a threshold value Th2 corresponding to the pre-processing unit 111-2, . . . , and a threshold value Thn corresponding to the pre-processing unit 111-n.

[0144] Then, the authentication unit 215 compares, for each voice processing unit 21, each similarity for each of the x persons obtained from the similarity calculation unit 114 in the voice processing unit 21 with the threshold value corresponding to the pre-processing unit 111 in the voice processing unit 21. As a result, for a single speaker, n comparison results between the similarity and the threshold value are obtained. The authentication unit 215 may specify, for each registered speaker, the number of comparison results in which the similarity is greater than the threshold value, and use the speaker with the largest number as the speaker authentication result. In other words, the authentication unit 215 may determine that the input voice is the voice of the speaker whose number is the largest.

[0145] For example, it is assumed that the speaker p is focused on among the plurality of registered speakers. The authentication unit 215 compares the magnitude relationship between the similarity calculated for speaker p, obtained from the similarity calculation unit 114-1, and the threshold value Th1 corresponding to the pre-processing unit 111-1. Similarly, the authentication unit 215 compares the magnitude relationship between the similarity calculated for speaker p, obtained from the similarity calculation unit 114-2, and the threshold value Th2 corresponding to the pre-processing unit 111-2. The authentication unit 215 performs the same process for the similarity calculated for speaker p, obtained from the respective similarity calculation units 114-3 to 114-n. As a result, n comparison results between the similarity and the threshold value are obtained for speaker p.

[0146] Here, the case where the speaker p is focused on has been described, but the authentication unit 215 similarly derives n comparison results between the similarity and the threshold value, for each registered speaker.

[0147] Then, the authentication unit 215 specifies, for each speaker, the number of comparison results in which the similarity is greater than the threshold value. Furthermore, the authentication unit 215 determines that the input voice is the voice of the speaker whose number is the largest.
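The threshold-voting operation described above can be sketched as follows. The data layout (one similarity dictionary per voice processing unit) and the function name are illustrative assumptions.

```python
def authenticate_by_vote(similarities, thresholds):
    # similarities[i][p]: similarity from voice processing unit 21-(i+1) for speaker p.
    # thresholds[i]: threshold Th(i+1) held for pre-processing unit 111-(i+1).
    counts = {}
    for sims_i, th in zip(similarities, thresholds):
        for speaker, s in sims_i.items():
            counts[speaker] = counts.get(speaker, 0) + (1 if s > th else 0)
    # The speaker that exceeds the per-unit thresholds most often wins.
    return max(counts, key=counts.get)

# Example with n = 3 units and two registered speakers p and q:
sims = [{"p": 0.9, "q": 0.2}, {"p": 0.7, "q": 0.8}, {"p": 0.6, "q": 0.1}]
ths = [0.5, 0.75, 0.5]
result = authenticate_by_vote(sims, ths)  # p clears units 1 and 3; q clears only unit 2
```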

[0148] The speaker authentication operation of the authentication unit 215 is not limited to the above example. In the above example, the case where the authentication unit 215 holds an individual threshold value for each of the individual pre-processing units 111-1 to 111-n has been described as an example. The authentication unit 215 may hold one type of threshold value independent of the pre-processing units 111-1 to 111-n. Hereinafter, an operation example of the authentication unit 215, when the authentication unit 215 holds one type of threshold value, will be shown.

[0149] The authentication unit 215 obtains a similarity for each registered speaker from each of the n similarity calculation units 114-1 to 114-n. This point is the same as the above-mentioned case.

[0150] Then, the authentication unit 215 calculates an arithmetic mean of the similarities obtained from each of the n similarity calculation units 114-1 to 114-n for each registered speaker. For example, it is assumed that the speaker p is focused on among the plurality of registered speakers. The authentication unit 215 calculates an arithmetic mean of “similarity calculated for speaker p obtained from the similarity calculation unit 114-1”, “similarity calculated for speaker p obtained from the similarity calculation unit 114-2”, . . . , and “similarity calculated for speaker p obtained from the similarity calculation unit 114-n”. As a result, the arithmetic mean of the similarities for speaker p is obtained.

[0151] The authentication unit 215 similarly calculates an arithmetic mean of the similarities for each registered speaker.

[0152] Then, the authentication unit 215 may compare the arithmetic mean of the similarity calculated for each registered speaker with the held threshold value, for example, and determine the speaker whose arithmetic mean of the similarity is greater than the threshold value as the speaker who uttered the input voice. When there are multiple speakers whose arithmetic mean of the similarity is greater than the threshold value, the authentication unit 215 may determine the speaker whose arithmetic mean of the similarity is the greatest among those speakers as the speaker who uttered the input voice.
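The single-threshold variant using an arithmetic mean can be sketched in the same style. Returning None when no mean exceeds the threshold is an assumed behavior for the reject case, which the text does not specify.

```python
def authenticate_by_mean(similarities, threshold):
    # similarities[i][p]: similarity from voice processing unit 21-(i+1) for speaker p.
    # One threshold shared by all units, per paragraph [0148].
    n = len(similarities)
    speakers = similarities[0].keys()
    means = {p: sum(s[p] for s in similarities) / n for p in speakers}
    # Keep only speakers whose mean similarity exceeds the threshold,
    # then pick the largest mean among them.
    accepted = {p: m for p, m in means.items() if m > threshold}
    return max(accepted, key=accepted.get) if accepted else None

# Example with n = 2 units: mean(p) = 0.85, mean(q) = 0.65.
sims = [{"p": 0.9, "q": 0.6}, {"p": 0.8, "q": 0.7}]
result = authenticate_by_mean(sims, 0.6)  # both exceed 0.6; p has the larger mean
```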

[0153] Here, the operation of speaker authentication when the authentication unit 215 holds n types of threshold values and the operation of speaker authentication when the authentication unit 215 holds one type of threshold value have been explained. In the second example embodiment, the authentication unit 215 may identify the speaker who uttered the input voice by a more complex operation based on the similarity for each speaker obtained from each similarity calculation unit 114.

[0154] In this example, each voice processing unit 21 is realized by a computer. In this case, the pre-processing unit 111, the feature extraction unit 113, and the similarity calculation unit 114 in each voice processing unit 21 are realized by a CPU of a computer operating according to a voice processing program, for example. In this case, the CPU can read the voice processing program from a program storage medium such as a program storage device of the computer, and operate as the pre-processing unit 111, the feature extraction unit 113, and the similarity calculation unit 114 according to the program.

[0155] Next, the processing process of the second example embodiment will be explained.

[0156] FIG. 6 is a flowchart showing an example of the processing process of the second example embodiment. The matters already described are omitted as appropriate. In addition, the explanation of the same processing as that of the first example embodiment will be omitted.

[0157] Steps S1 to S4 are the same as steps S1 to S4 in the first example embodiment, and the explanation thereof will be omitted.

[0158] After step S4, the authentication unit 215 performs speaker authentication based on the similarity calculated for each speaker by each of the similarity calculation units 114-1 to 114-n (step S11). In step S11, the authentication unit 215 obtains the similarity for each registered speaker from each of the n similarity calculation units 114-1 to 114-n. Then, based on the similarity, the authentication unit 215 determines which registered speaker's voice the input voice is.

[0159] Since the example of the operation of this authentication unit 215 has already been explained, it is omitted here.

[0160] Next, the authentication unit 215 outputs the speaker authentication result in step S11 to an output device (not shown in FIG. 5) (step S12). The output aspect in step S12 is not particularly limited. For example, the authentication unit 215 may display the speaker authentication result in step S11 on a display device (not shown in FIG. 5).

[0161] In the second example embodiment, as in the first example embodiment, it is possible to realize a speaker authentication system that is robust against adversarial examples. In the first example embodiment, each voice processing unit 11 includes the authentication unit 115 (refer to FIG. 2), but in the second example embodiment, each voice processing unit 21 does not include such an authentication unit. Therefore, in the second example embodiment, each voice processing unit 21 can be simplified.

[0162] In addition, the authentication unit 215 can realize speaker authentication in a different method from the first example embodiment, based on the similarity for each speaker obtained from each similarity calculation unit 114.

[0163] In the second example embodiment described above, the case where each of the voice processing units 21-1 to 21-n, the data storage unit 112, and the authentication unit 215 is realized by a separate computer has been explained as an example. In the following, the case where the speaker authentication system comprising the voice processing units 21-1 to 21-n, the data storage unit 112, and the authentication unit 215 is realized by a single computer will be explained as an example. This computer can be represented in the same way as in FIG. 4, and will be explained with reference to FIG. 4.

[0164] Microphone 1005 is an input device used for voice input. The input device used for voice input may be a device other than the microphone 1005.

[0165] The display device 1006 is used to display the speaker authentication result in the aforementioned step S11. However, as mentioned above, the output aspect in step S12 (refer to FIG. 6) is not particularly limited.

[0166] The operation of the speaker authentication system comprising the voice processing units 21-1 to 21-n, the data storage unit 112, and the authentication unit 215 is stored in the form of a program in the auxiliary memory 1003. In this example, this program is referred to as a speaker authentication program. The CPU 1001 reads the speaker authentication program from the auxiliary memory 1003, expands it to the main memory 1002, and, according to the speaker authentication program, operates as the plurality of voice processing units 21-1 to 21-n and the authentication unit 215 in the second example embodiment. The data storage unit 112 may be realized by the auxiliary memory 1003, or by other storage devices provided by the computer 1000.

Specific Example

[0167] Next, a specific example of the configuration of a speaker authentication system will be explained using the first example embodiment as an example. However, the matters explained in the first example embodiment will be omitted as appropriate. FIG. 7 is a block diagram showing a specific example of the configuration of a speaker authentication system of the first example embodiment. In the example shown in FIG. 7, the speaker authentication system comprises a plurality of voice processing devices 31-1 to 31-n, a data storage device 312, and a post-processing device 316. In the case where individual voice processing devices are not specifically distinguished, the code “31” is used to denote the voice processing device without “-1”, “-2”, . . . , and “-n”. The same applies to the code “317” representing the operation device included in the voice processing device 31.

[0168] In this example, it is assumed that the plurality of voice processing devices 31-1 to 31-n and the post-processing device 316 are realized by separate computers. These computers include a CPU, a memory, a network interface, and a magnetic storage device. For example, the voice processing devices 31-1 to 31-n may include a reading device for reading data from a computer-readable recording medium such as a CD-ROM, respectively.

[0169] Each of the voice processing devices 31 includes an operation device 317. The operation device 317 corresponds to a CPU, for example. Each operation device 317 expands, in a memory, a voice processing program stored in a magnetic storage device of the voice processing device 31 or received from outside through a network interface. Then, according to the voice processing program, each operation device 317 realizes the operation as the pre-processing unit 111, the feature extraction unit 113, the similarity calculation unit 114, and the authentication unit 115 (refer to FIG. 2) in the first example embodiment. However, the method or parameters of the pre-processing are different for each operation device 317 (in other words, for each voice processing device 31).

[0170] The CPU of the post-processing device 316 expands a program stored in a magnetic storage device of the post-processing device 316 or the program received from outside through a network interface in the memory. Then, according to the program, the CPU realizes the operation as the post-processing unit 116 (refer to FIG. 2) in the first example embodiment.

[0171] The data storage device 312 is, for example, a magnetic storage device or the like, which stores data related to voice for one or more speakers for each speaker, and provides the data to each of the operation devices 317-1 to 317-n. The data storage device 312 may be realized by a computer that includes a reading device for reading data from a computer-readable recording medium such as a flexible disk or a CD-ROM. The recording medium may then store the data related to the voice for each speaker.

[0172] FIG. 8 is a flowchart showing an example of the processing process in the specific example shown in FIG. 7. First, common voice is input to the operation devices 317-1 to 317-n (step S31). Step S31 corresponds to step S1 (refer to FIG. 3) in the first example embodiment.

[0173] Then, the operation devices 317-1 to 317-n execute the process corresponding to steps S2 to S5 in the first example embodiment (step S32).

[0174] The post-processing device 316 specifies one speaker authentication result based on the speaker authentication results obtained by each of the operation devices 317-1 to 317-n (step S33).

[0175] Then, the post-processing device 316 outputs the speaker authentication result specified in step S33 to an output device (not shown in FIG. 7) (step S34). The output aspect in step S34 is not particularly limited.

[0176] Steps S33 and S34 are equivalent to steps S6 and S7 in the first example embodiment.

[0177] Next, an overview of the present invention will be explained. FIG. 9 is a block diagram showing an example of an overview of a speaker authentication system of the present invention.

[0178] A speaker authentication system of the present invention comprises a data storage unit 112, a plurality of voice processing units 11, and a post-processing unit 116.

[0179] The data storage unit 112 stores data related to voice of a speaker.

[0180] Each of the plurality of voice processing units 11 performs speaker authentication based on input voice and the data stored in the data storage unit 112.

[0181] The post-processing unit 116 specifies one speaker authentication result based on speaker authentication results obtained respectively by the plurality of the voice processing units 11.

[0182] Each voice processing unit 11 includes a pre-processing unit 111, a feature extraction unit 113, a similarity calculation unit 114, and an authentication unit 115.

[0183] The pre-processing unit 111 performs pre-processing for the voice.

[0184] The feature extraction unit 113 extracts features from voice data obtained by the pre-processing.

[0185] The similarity calculation unit 114 calculates a similarity between the features and features obtained from the data stored in the data storage unit 112.

[0186] The authentication unit 115 performs speaker authentication based on the similarity calculated by the similarity calculation unit 114.

[0187] The method or parameters of the pre-processing are different for each pre-processing unit 111 included in each voice processing unit 11.

[0188] With such a configuration, it is possible to achieve robustness against adversarial examples.

[0189] FIG. 10 is a block diagram showing another example of an overview of a speaker authentication system of the present invention.

[0190] A speaker authentication system of the present invention comprises a data storage unit 112, a plurality of voice processing units 21, and an authentication unit 215.

[0191] The data storage unit 112 stores data related to voice of a speaker.

[0192] Each of the plurality of voice processing units 21 calculates a similarity between features obtained from input voice and features obtained from the data stored in the data storage unit 112.

[0193] The authentication unit 215 performs speaker authentication based on the similarity obtained respectively by the plurality of voice processing units 21.

[0194] Each voice processing unit 21 includes a pre-processing unit 111, a feature extraction unit 113, and a similarity calculation unit 114.

[0195] The pre-processing unit 111 performs pre-processing for voice.

[0196] The feature extraction unit 113 extracts features from voice data obtained by the pre-processing.

[0197] The similarity calculation unit 114 calculates a similarity between the features and the features obtained from the data stored in the data storage unit 112.

[0198] The method or parameters of the pre-processing are different for each pre-processing unit 111 included in each voice processing unit 21.

[0199] Even with such a configuration, it is possible to achieve robustness against adversarial examples.

[0200] In the speaker authentication system summarized in FIGS. 9 and 10, each pre-processing unit may perform, as the pre-processing, a process of applying a mel filter after applying a short-time Fourier transform to the input voice, and the dimensionality of the mel filter may be different for each pre-processing unit.

[0201] Although the invention of the present application has been described above with reference to the example embodiments, the present invention is not limited to the above example embodiments. Various changes can be made to the configuration and details of the present invention that can be understood by those skilled in the art within the scope of the present invention.

INDUSTRIAL APPLICABILITY

[0202] The present invention is suitably applied to speaker authentication systems.

REFERENCE SIGNS LIST

[0203] 11-1 to 11-n Voice processing unit
[0204] 111-1 to 111-n Pre-processing unit
[0205] 112 Data storage unit
[0206] 113-1 to 113-n Feature extraction unit
[0207] 114-1 to 114-n Similarity calculation unit
[0208] 115-1 to 115-n Authentication unit
[0209] 116 Post-processing unit
[0210] 21-1 to 21-n Voice processing unit
[0211] 215 Authentication unit