VOICE PROCESSING DEVICE, VOICE PROCESSING METHOD, RECORDING MEDIUM, AND VOICE AUTHENTICATION SYSTEM
20230326465 · 2023-10-12
Assignee
Inventors
Cpc classification
G10L17/20
PHYSICS
G10L25/18
PHYSICS
International classification
G10L17/20
PHYSICS
G10L17/02
PHYSICS
Abstract
The present disclosure implements speaker verification with high accuracy regardless of input devices. An integration unit (110) integrates voice data inputted using an input device and the frequency characteristic of the input device, and a feature extraction unit (120) extracts, from an integrated feature obtained by integrating the voice data and the frequency characteristic, a speaker feature for verifying the speaker of the voice.
Claims
1. A voice processing device comprising: a memory configured to store instructions; and at least one processor configured to run the instructions to perform: integrating a first feature of voice data input by using an input device and a second feature of a frequency response of the input device; and extracting a speaker feature for identifying a speaker of the voice data from an integrated feature obtained by integrating the first feature of the voice data and the second feature of the frequency response.
2. The voice processing device according to claim 1, wherein the at least one processor is configured to run the instructions to perform frequency conversion on the voice data to obtain an acoustic vector sequence that is a time series of an acoustic vector indicating a frequency response of the voice data input from the input device.
3. The voice processing device according to claim 2, wherein the at least one processor is configured to run the instructions to perform: calculating an average value of sensitivity of the input device for each frequency bin, and using the average value of the sensitivity calculated for each frequency bin as an element of a characteristic vector indicating the frequency response of the input device.
4. The voice processing device according to claim 3, wherein the at least one processor is configured to run the instructions to perform: obtaining the characteristic vector by concatenating two characteristic vectors for two input devices used at time of registration and at time of verification of a speaker.
5. The voice processing device according to claim 3, wherein the integrated feature is a characteristic-acoustic vector sequence, wherein the acoustic vector sequence that is an acoustic feature and the characteristic vector that is the device feature are concatenated, and the at least one processor is configured to run the instructions to perform: concatenating the acoustic vector sequence and the characteristic vector to obtain the characteristic-acoustic vector sequence.
6. The voice processing device according to claim 1, wherein the at least one processor is configured to run the instructions to perform: inputting the integrated feature to a deep neural network (DNN) and obtaining the speaker feature from a hidden layer of the DNN.
7. A voice processing method comprising: integrating a first feature of voice data input by using an input device and a second feature of a frequency response of the input device; and extracting a speaker feature for identifying a speaker of the voice data from an integrated feature obtained by integrating the first feature of the voice data and the second feature of the frequency response.
8. A non-transitory recording medium storing a program for causing a computer to execute: processing of integrating a first feature of voice data input by using an input device and a second feature of a frequency response of the input device; and processing of extracting a speaker feature for identifying a speaker of the voice data from an integrated feature obtained by integrating the first feature of the voice data and the second feature of the frequency response.
9. (canceled)
Description
EXAMPLE EMBODIMENT
Common to All Example Embodiments
[0023] First, an example of a configuration of a commonly applied voice authentication system according to all example embodiments described below will be described.
Voice Authentication System 1
[0024] An example of a configuration of a voice authentication system 1 will be described with reference to
[0025] As illustrated in
[0026] Processing and operations executed by the voice processing device 100(200) will be described in detail in the first and second example embodiments described below. The voice processing device 100(200) acquires voice data (hereinafter, referred to as registered voice data) of a speaker (person A) registered in advance from a database (DB) on a network or from a DB connected to the voice processing device 100(200). The voice processing device 100(200) acquires, from the input device, voice data (hereinafter, referred to as voice data for verification) of an object (person B) to be verified. The input device is used to input a voice to the voice processing device 100(200). In one example, the input device is a microphone for a call included in a smartphone or a headset microphone.
[0027] The voice processing device 100(200) generates speaker feature A based on the registered voice data. The voice processing device 100(200) generates speaker feature B based on the voice data for verification. The speaker feature A is obtained by integrating the registered voice data registered in the DB and the frequency response of the input device used to input the registered voice data. The acoustic feature is a feature vector having, as elements, one or a plurality of feature amounts (hereinafter, may be referred to as a first parameter), each of which is a numerical value quantitatively representing the feature of the registered voice data. The device feature is a feature vector having, as elements, one or a plurality of feature amounts (hereinafter, may be referred to as a second parameter), each of which is a numerical value quantitatively representing the feature of the input device. The speaker feature B is obtained by integrating the voice data for verification input using the input device and the frequency response of the input device used to input the voice data for verification.
[0028] The two-step processing below is referred to as “integration” of the voice data (registered voice data or voice data for verification) and the frequency response of the input device. Hereinafter, the registered voice data or the voice data for verification will be referred to as registered voice data/voice data for verification. The first step is to extract an acoustic feature related to the frequency response of the registered voice data/voice data for verification and to extract the device feature related to the frequency response of the sensitivity of the input device used for inputting. The second step is to concatenate the acoustic feature and the device feature. Concatenating is to break down the acoustic feature into its elements (the first parameter), break down the device feature into its elements (the second parameter), and generate a feature vector including both the first parameter and the second parameter as mutually independent dimensional elements. As described above, the first parameter is a feature amount extracted from the frequency response of the registered voice data/voice data for verification. The second parameter is a feature amount extracted from the frequency response of the sensitivity of the input device used to input the registered voice data/voice data for verification. In this case, concatenation is to generate an (n+m)-dimensional feature vector having, as elements, n feature amounts that are the first parameter constituting the acoustic feature and m feature amounts that are the second parameter constituting the device feature (n and m are each an integer).
[0029] Thus, one feature (hereinafter, referred to as integrated feature) that depends on both the frequency response of the registered voice data/voice data for verification and the frequency response of the sensitivity of the input device used to input the registered voice data/voice data for verification can be obtained. The integrated feature is a feature vector having a plurality of (n+m, in the above example) feature amounts as an element.
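As a worked illustration of the concatenation described above, the acoustic feature can be written as a vector a and the device feature as a vector d. The symbols a, d, and x are chosen here for illustration only and do not appear in the source.

```latex
\[
\mathbf{a} = (a_1, \dots, a_n), \qquad
\mathbf{d} = (d_1, \dots, d_m), \qquad
\mathbf{x} = (a_1, \dots, a_n,\; d_1, \dots, d_m) \in \mathbb{R}^{n+m}
\]
```

For instance, with n = 3 and m = 2, a = (0.2, 0.5, 0.1) and d = (1.0, 0.8) give the five-dimensional integrated feature x = (0.2, 0.5, 0.1, 1.0, 0.8).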
[0030] The meaning of the integration in each example embodiment described below is the same as the meaning described here.
[0031] The acoustic feature is extracted from the registered voice data and the voice data for verification. On the other hand, the device feature is extracted from data related to the input device (in one example, data indicating the frequency response of the sensitivity of the input device). Then, the voice processing device 100(200) transmits the speaker feature A and the speaker feature B to the verification device 10.
[0032] The verification device 10 receives the speaker feature A and the speaker feature B from the voice processing device 100(200). The verification device 10 checks whether the speaker is a registered person himself/herself based on the speaker feature A and the speaker feature B output from the voice processing device 100(200). More specifically, the verification device 10 compares the speaker feature A with the speaker feature B, and outputs a verification result. That is, the verification device 10 outputs information indicating whether the person A and the person B are the same person.
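The comparison performed by the verification device 10 could be sketched as follows. The source does not specify how the speaker feature A and the speaker feature B are compared; cosine similarity, the function names, and the threshold value of 0.7 are all assumptions made for illustration.

```python
import numpy as np

def cosine_score(feature_a: np.ndarray, feature_b: np.ndarray) -> float:
    """Cosine similarity between two speaker feature vectors (assumed metric)."""
    denom = np.linalg.norm(feature_a) * np.linalg.norm(feature_b)
    if denom == 0.0:
        return 0.0  # treat an all-zero feature as "no match"
    return float(np.dot(feature_a, feature_b) / denom)

def verify(feature_a, feature_b, threshold: float = 0.7) -> bool:
    """Return True when person A and person B are judged to be the same person.

    The threshold is a hypothetical value; in practice it would be tuned
    on held-out data.
    """
    a = np.asarray(feature_a, dtype=float)
    b = np.asarray(feature_b, dtype=float)
    return cosine_score(a, b) >= threshold
```

In this sketch, `verify(feature_a, feature_b)` plays the role of the verification result output by the verification device 10.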
[0033] The voice authentication system 1 may include a control device (control function) that controls an electronic lock of a door for entering an office, automatically activates or logs on an information terminal, or permits access to information on an intra-network based on a verification result output by the verification device 10.
[0034] The voice authentication system 1 may be achieved as a network service. In this case, the voice processing device 100(200) and the verification device 10 may be on a network and communicable with one or a plurality of input devices via a wireless network.
[0035] Hereinafter, a specific example of the voice processing device 100(200) included in the voice authentication system 1 will be described. In the description below, “voice data” refers to both “registered voice data” and “voice data for verification”.
First Example Embodiment
[0036] The voice processing device 100 will be described as the first example embodiment with reference to
Voice Processing Device 100
[0037] A configuration of the voice processing device 100 according to the present first example embodiment will be described with reference to
[0038] The integration unit 110 integrates the voice data input by using one or a plurality of input devices and the frequency response of the input device. The integration unit 110 is an example of an integration means.
[0039] In one example, the integration unit 110 acquires voice data (registered voice data or voice data for verification in
[0040] The integration unit 110 acquires data regarding the input device from the DB (
[0043] The integration unit 110 concatenates the acoustic feature thus obtained and the device feature to obtain the integrated feature based on the voice data for verification and the integrated feature based on the registered voice data. As described regarding the voice authentication system 1, the integrated feature is one feature vector that depends on both the frequency response of the registered voice data/voice data for verification and the frequency response of the sensitivity of the input device used to input the registered voice data/voice data for verification. As described above, the integrated feature includes the first parameter regarding the frequency response of the registered voice data/voice data for verification and the second parameter regarding the frequency response of the sensitivity of the input device used to input the registered voice data/voice data for verification. Details of the integration processing and an example of the integrated feature will be described in the second example embodiment. The integration unit 110 outputs the integrated feature thus obtained to the feature extraction unit 120.
[0044] The feature extraction unit 120 extracts speaker features (speaker features A and B) for verifying the speaker of voice from the integrated feature obtained by integrating the voice data and the frequency response. The feature extraction unit 120 is an example of a feature extraction means.
[0045] An example of processing in which the feature extraction unit 120 extracts the speaker feature from the integrated feature will be described with reference to
[0046] In the learning phase, the feature extraction unit 120 inputs training data and updates each parameter of the DNN based on any loss function so that an output result matches correct answer data. The correct answer data is data indicating a correct answer of the speaker. The DNN completes the learning so that the speaker can be verified based on the integrated feature before the phase for extracting the speaker feature.
[0047] The feature extraction unit 120 inputs the integrated feature to the trained DNN. The DNN of the feature extraction unit 120 verifies the speaker (for example, the person A or the person B) using the input integrated feature. The feature extraction unit 120 extracts the speaker feature of interest from the trained DNN.
[0048] Specifically, the feature extraction unit 120 extracts, from a hidden layer of the DNN, the speaker feature of interest for verifying the speaker. In other words, the feature extraction unit 120 extracts the speaker feature for verifying the speaker of the voice using the DNN and the integrated feature obtained by integrating the voice data and the frequency response. Because the speaker feature is acquired based on both the acoustic feature and the device feature, the speaker feature does not depend on the frequency response of the input device. Therefore, the verification device 10 can verify the speaker based on the speaker feature regardless of whether the same input device (having the same frequency response) or different input devices (having different frequency responses) are used at the time of registration and at the time of verification.
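The extraction of a speaker feature from a hidden layer can be sketched with a very small feed-forward network. This is a minimal sketch under stated assumptions, not the network of the disclosure: the class name `TinySpeakerDNN`, the single ReLU hidden layer, the averaging over frames, and the random (untrained) weights are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

class TinySpeakerDNN:
    """Toy stand-in for a trained DNN; weights here are random, not learned."""

    def __init__(self, in_dim: int, hidden_dim: int, n_speakers: int):
        self.w1 = rng.standard_normal((in_dim, hidden_dim)) * 0.1
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.standard_normal((hidden_dim, n_speakers)) * 0.1
        self.b2 = np.zeros(n_speakers)

    def hidden(self, integrated: np.ndarray) -> np.ndarray:
        """Speaker feature: hidden-layer activation averaged over frames.

        integrated has shape (frames, in_dim). Averaging over frames is an
        assumption; the source only says a hidden layer is used.
        """
        frame_hidden = np.maximum(0.0, integrated @ self.w1 + self.b1)  # ReLU
        return frame_hidden.mean(axis=0)

    def speaker_posteriors(self, integrated: np.ndarray) -> np.ndarray:
        """Output layer used during learning to classify the speaker."""
        logits = self.hidden(integrated) @ self.w2 + self.b2
        exp = np.exp(logits - logits.max())  # numerically stable softmax
        return exp / exp.sum()

dnn = TinySpeakerDNN(in_dim=8, hidden_dim=16, n_speakers=4)
integrated_feature = rng.standard_normal((50, 8))  # 50 frames of an (n+m)=8 feature
speaker_feature = dnn.hidden(integrated_feature)   # fixed-length vector of size 16
```

The output layer is only needed while learning; at extraction time the sketch reads `hidden(...)` and ignores the classification result.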
Operation of Voice Processing Device 100
[0049] An operation of the voice processing device 100 according to the present first example embodiment will be described with reference to
[0050] As illustrated in
[0051] The feature extraction unit 120 receives, from the integration unit 110, data of the integrated feature obtained by integrating the voice data and the frequency response. The feature extraction unit 120 extracts the speaker feature from the received integrated feature (S2).
[0052] The feature extraction unit 120 outputs data of the speaker feature obtained as a result of step S2. In one example, the feature extraction unit 120 transmits the data of the speaker feature to the verification device 10 (
[0053] Thus, the operation of the voice processing device 100 according to the present first example embodiment ends.
Effects of the Present Example Embodiment
[0054] With the configuration of the present example embodiment, the integration unit 110 integrates the voice data input using the input device and the frequency response of the input device, and the feature extraction unit 120 extracts, from the integrated feature obtained by integrating the voice data and the frequency response, the speaker feature for verifying the speaker of the voice. The speaker feature includes not only information related to the acoustic feature of the voice input using the input device but also the information related to the frequency response of the input device. Therefore, the verification device 10 of the voice authentication system 1 can perform the speaker verification with high accuracy based on the speaker feature regardless of the difference between the input device used to input the voice at the time of registration and the input device used to input the voice at the time of verification.
[0055] However, it is desirable that the input device used to input the voice at the time of registration has sensitivity in a wider band than the input device used to input the voice at the time of verification. More specifically, the use band (band having sensitivity) of the input device used to input the voice at the time of registration desirably includes the use band of the input device used to input the voice at the time of verification.
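The desirable relationship between the two use bands can be checked with a short sketch. It assumes that both sensitivity curves are sampled on the same frequency bins and that a bin "has sensitivity" when its value exceeds a floor; the function names and the floor value are hypothetical.

```python
import numpy as np

def band_mask(sensitivity, floor: float = 0.0) -> np.ndarray:
    """Boolean mask of frequency bins in which the device has sensitivity."""
    return np.asarray(sensitivity, dtype=float) > floor

def registration_band_covers(reg_sensitivity, ver_sensitivity, floor: float = 0.0) -> bool:
    """True if the registration device's use band includes the verification device's."""
    reg = band_mask(reg_sensitivity, floor)
    ver = band_mask(ver_sensitivity, floor)
    # Every bin used by the verification device must also be usable at registration.
    return bool(np.all(reg[ver]))
```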
Second Example Embodiment
[0056] The voice processing device 200 will be described as the second example embodiment with reference to
Voice Processing Device 200
[0057] A configuration of the voice processing device 200 according to the present second example embodiment will be described with reference to
[0058] The integration unit 210 integrates the voice data input by using the input device and the frequency response of the input device. The integration unit 210 is an example of an integration means. As illustrated in
[0059] The characteristic vector calculation unit 211 calculates, for each frequency bin, an average value of the sensitivity of the input device in a band of frequencies (a band having a predetermined width including frequency bins), and sets the average value calculated for each frequency bin as an element of the characteristic vector (an example of the device feature). The characteristic vector indicates the frequency response unique to the input device. The characteristic vector calculation unit 211 is an example of a characteristic vector calculation means.
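The per-bin averaging can be sketched as follows, assuming the sensitivity of the input device is available as samples over frequency. The function name, the argument names, and the choice to leave a bin at zero when no sample falls in its band are assumptions for illustration.

```python
import numpy as np

def characteristic_vector(freqs_hz, sensitivity, bin_centers_hz, band_width_hz):
    """Average the device sensitivity in a band around each frequency bin.

    freqs_hz / sensitivity: sampled frequency response of the input device.
    bin_centers_hz: center frequency of each frequency bin.
    band_width_hz: predetermined width of the band that includes each bin.
    """
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    sensitivity = np.asarray(sensitivity, dtype=float)
    half = band_width_hz / 2.0
    vec = np.zeros(len(bin_centers_hz))
    for i, center in enumerate(bin_centers_hz):
        in_band = (freqs_hz >= center - half) & (freqs_hz <= center + half)
        if in_band.any():  # bins with no samples stay at 0 (assumption)
            vec[i] = sensitivity[in_band].mean()
    return vec
```

For example, a response sampled at 100, 200, 300, and 400 Hz with sensitivities 1, 3, 5, and 7, averaged over 100 Hz bands centered at 150 Hz and 350 Hz, gives a two-element characteristic vector.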
[0060] In one example, the characteristic vector calculation unit 211 of the integration unit 210 acquires data related to the input device from the DB (
[0061] The voice conversion unit 212 obtains an acoustic vector sequence (an example of the acoustic feature) by converting the voice data from the time domain to the frequency domain. Here, the acoustic vector sequence represents the time series of the acoustic vector for each predetermined time width. The voice conversion unit 212 is an example of a voice conversion means.
[0062] In one example, the voice conversion unit 212 of the integration unit 210 receives the voice data for verification from the input device, and acquires the registered voice data from the DB. The voice conversion unit 212 performs a fast Fourier transform (FFT) to convert the voice data into amplitude spectrum data for each predetermined time width.
[0063] Further, the voice conversion unit 212 may divide the amplitude spectrum data for each predetermined time width for each predetermined frequency band using a filter bank.
[0064] The voice conversion unit 212 obtains a plurality of feature amounts from the amplitude spectrum data for each predetermined time width (or those obtained by dividing it for each predetermined frequency band using a filter bank). Then, the voice conversion unit 212 generates an acoustic vector including a plurality of feature amounts acquired. In one example, the feature amount is the acoustic intensity for each predetermined frequency range. In this way, the voice conversion unit 212 obtains the time series of the acoustic vector (hereinafter, referred to as an acoustic vector sequence) for each predetermined time width. Then, the voice conversion unit 212 transmits the data of the calculated acoustic vector sequence to the concatenating unit 213.
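The conversion from voice data to an acoustic vector sequence can be sketched with NumPy. The frame length, hop size, Hann window, and the uniform averaging that stands in for a filter bank are assumptions; the source specifies only an FFT per predetermined time width and an optional filter bank.

```python
import numpy as np

def acoustic_vector_sequence(signal, frame_len=400, hop=160, n_bands=8):
    """Return an array of shape (n_frames, n_bands): one acoustic vector per frame."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    # Cut the signal into overlapping frames (the "predetermined time width").
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # Amplitude spectrum of each frame via the FFT.
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    # Crude filter bank: average the spectrum over n_bands contiguous bands.
    bands = np.array_split(spectrum, n_bands, axis=1)
    return np.stack([band.mean(axis=1) for band in bands], axis=1)
```

Each row of the result is one acoustic vector whose elements are the acoustic intensity in each frequency range; a practical system would more likely use a perceptually spaced filter bank than the uniform bands used here.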
[0065] The concatenating unit 213 obtains a characteristic-acoustic vector sequence (an example of the integrated feature) by “concatenating” the acoustic vector sequence (an example of the acoustic feature) and the characteristic vector (an example of the device feature).
[0066] In one example, the concatenating unit 213 of the integration unit 210 receives the characteristic vector data from the characteristic vector calculation unit 211. The concatenating unit 213 receives the data of the acoustic vector sequence from the voice conversion unit 212.
[0067] Then, the concatenating unit 213 expands the dimension of each acoustic vector in the acoustic vector sequence and adds the elements of the characteristic vector as the elements of the expanded dimensions of each acoustic vector.
[0068] The concatenating unit 213 outputs the data of the characteristic-acoustic vector sequence thus obtained to the feature extraction unit 120.
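One reading of this concatenation is that the same characteristic vector is appended to every acoustic vector in the sequence. The sketch below implements that reading; the function name and the array layout (frames along the first axis) are assumptions.

```python
import numpy as np

def concat_characteristic(acoustic_seq, char_vec):
    """Append the characteristic vector to each acoustic vector of the sequence.

    acoustic_seq: shape (frames, n); char_vec: shape (m,).
    Returns the characteristic-acoustic vector sequence of shape (frames, n + m).
    """
    acoustic_seq = np.asarray(acoustic_seq, dtype=float)
    char_vec = np.asarray(char_vec, dtype=float)
    # Repeat the device feature once per frame, then join along the feature axis.
    tiled = np.tile(char_vec, (acoustic_seq.shape[0], 1))
    return np.concatenate([acoustic_seq, tiled], axis=1)
```

Because the device feature is constant over an utterance, the added m columns are identical in every frame; only the first n columns vary with time.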
[0069] The feature extraction unit 120 extracts the speaker feature for verifying the speaker of the voice from the characteristic-acoustic vector sequence (an example of the integrated feature) obtained by concatenating the acoustic vector sequence (an example of the acoustic feature) and the characteristic vector (an example of the device feature). The feature extraction unit 120 is an example of a feature extraction means.
[0070] In one example, the feature extraction unit 120 receives the characteristic-acoustic vector sequence data from the concatenating unit 213 of the integration unit 210. The feature extraction unit 120 inputs the characteristic-acoustic vector sequence data to the DNN that has learned (
[0071] The feature extraction unit 120 outputs the data of the integrated feature based on the characteristic-acoustic vector sequence to the verification device 10 (
Modification
[0072] In the present modification, the acoustic vector (speaker feature A) at the time of registration and the acoustic vector (speaker feature B) at the time of verification are compared in a common part of effective bands in which both the input device used at the time of verification and the input device used at the time of registration have sensitivity.
[0073] The characteristic vector calculation unit 211 according to the present modification obtains a third characteristic vector by combining (to be described below) a first characteristic vector indicating the frequency response of the sensitivity of an input device A and a second characteristic vector indicating the frequency response of the sensitivity of an input device B.
[0074] The characteristic vector calculation unit 211 according to the present modification outputs the data of the third characteristic vector thus calculated to the concatenating unit 213.
[0075] The concatenating unit 213 multiplies each of the acoustic vector (an example of the speaker feature A) at the time of registration and the acoustic vector (an example of the speaker feature B) at the time of verification by the third characteristic vector obtained by combining the two characteristic vectors.
[0076] In a band in which at least one of the input device used at the time of verification and the input device used at the time of registration has no sensitivity, a value of the third characteristic vector is zero. Therefore, the value of the acoustic vector multiplied by the third characteristic vector is also zero except for the common part of the effective bands in which the two input devices have sensitivity.
[0077] In this way, the effective band of the speaker feature A and the effective band of the speaker feature B are the same. Thus, the verification device 10 (
[0078] The combination of the two characteristic vectors in the present modification will be described in more detail. The characteristic vector calculation unit 211 compares an n-th element (fn) of the first characteristic vector with the corresponding element (gn) of the second characteristic vector. Then, the characteristic vector calculation unit 211 sets the smaller of these two elements (fn, gn) as the corresponding element of the third characteristic vector. Alternatively, the characteristic vector calculation unit 211 may set a geometric mean √(fn×gn) of the n-th element (fn) of the first characteristic vector and the corresponding element (gn) of the second characteristic vector as the n-th element of the third characteristic vector. Alternatively, the characteristic vector calculation unit 211 may input the first characteristic vector and the second characteristic vector to a DNN, which is not illustrated, and extract, from a hidden layer of the DNN, a third characteristic vector in which a value of zero is weighted to a component other than the common part of the effective bands of both the first characteristic vector and the second characteristic vector.
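The first two combination rules and the subsequent multiplication can be sketched as follows. The function names are hypothetical, element-wise multiplication is assumed for the multiplication by the third characteristic vector, and the sensitivities are assumed to be non-negative linear-scale values so that the geometric mean is defined.

```python
import numpy as np

def combine_min(f, g):
    """Third characteristic vector: element-wise smaller of fn and gn."""
    return np.minimum(np.asarray(f, dtype=float), np.asarray(g, dtype=float))

def combine_geometric_mean(f, g):
    """Third characteristic vector: element-wise sqrt(fn * gn)."""
    return np.sqrt(np.asarray(f, dtype=float) * np.asarray(g, dtype=float))

def apply_common_band(acoustic_vec, third_vec):
    """Multiply an acoustic vector element-wise by the third characteristic vector."""
    return np.asarray(acoustic_vec, dtype=float) * third_vec

f = [1.0, 0.5, 0.0]   # input device A: no sensitivity in the last bin
g = [0.25, 0.5, 1.0]  # input device B
third = combine_min(f, g)
masked = apply_common_band([2.0, 2.0, 2.0], third)
```

With either rule, a bin in which at least one device has zero sensitivity yields zero in the third characteristic vector, so the corresponding element of the multiplied acoustic vector is also zero.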
Operation of Voice Processing Device 200
[0079] An operation of the voice processing device 200 according to the present second example embodiment will be described with reference to
[0080] As illustrated in
[0081] The characteristic vector calculation unit 211 calculates, for each frequency bin, an average value of the sensitivity of the input device in a band of frequencies (a band having a predetermined width including frequency bins) from the data indicating the frequency response of the input device. The characteristic vector calculation unit 211 calculates the characteristic vector having the calculated average value of the sensitivity for each frequency bin as an element (S202). Then, the characteristic vector calculation unit 211 transmits the data of the calculated characteristic vector to the concatenating unit 213.
[0082] The voice conversion unit 212 executes frequency analysis on the voice data using the filter bank to obtain amplitude spectrum data for each predetermined time width. The voice conversion unit 212 calculates the above-described acoustic vector sequence from the amplitude spectrum data for each predetermined time width (S203). Then, the voice conversion unit 212 transmits the data of the calculated acoustic vector sequence to the concatenating unit 213.
[0083] The concatenating unit 213 concatenates the acoustic vector sequence (an example of the acoustic feature) based on the voice data input using the input device and the characteristic vector (an example of the device feature) related to the frequency response of the input device to calculate the characteristic-acoustic vector sequence (an example of the integrated feature) (S204). The concatenating unit 213 outputs the data of the characteristic-acoustic vector sequence thus obtained to the feature extraction unit 120.
[0084] The feature extraction unit 120 receives the characteristic-acoustic vector sequence data from the concatenating unit 213 of the integration unit 210. The feature extraction unit 120 extracts the speaker feature from the characteristic-acoustic vector sequence (S205). Specifically, the feature extraction unit 120 extracts the speaker feature A (
[0085] The feature extraction unit 120 outputs data of the speaker feature thus obtained. In one example, the feature extraction unit 120 transmits the data of the speaker feature to the verification device 10 (
[0086] Thus, the operation of the voice processing device 200 according to the present second example embodiment ends.
Effects of the Present Example Embodiment
[0087] With the configuration of the present example embodiment, the integration unit 210 integrates the voice data input using the input device and the frequency response of the input device, and the feature extraction unit 120 extracts, from the integrated feature obtained by integrating the voice data and the frequency response, the speaker feature for verifying the speaker of the voice. The speaker feature includes not only information related to the acoustic feature of the voice input using the input device but also the information related to the frequency response of the input device. Therefore, the verification device 10 of the voice authentication system 1 can perform the speaker verification with high accuracy based on the speaker feature regardless of the difference between the input device used to input the voice at the time of registration and the input device used to input the voice at the time of verification.
[0088] More specifically, the integration unit 210 includes the characteristic vector calculation unit 211 that calculates an average value of the sensitivity of the input device for each frequency bin and uses the average value calculated for each frequency bin as an element of the characteristic vector. The characteristic vector indicates the frequency response of the input device.
[0089] The integration unit 210 includes the voice conversion unit 212 that obtains the acoustic vector sequence by converting the voice data from the time domain to the frequency domain by Fourier transform and using the filter bank. The integration unit 210 includes the concatenating unit 213 that obtains the characteristic-acoustic vector sequence by concatenating the acoustic vector sequence and the characteristic vector. Thus, it is possible to obtain the characteristic-acoustic vector sequence in which the acoustic vector sequence that is an acoustic feature and the characteristic vector that is a device feature are concatenated.
[0090] The feature extraction unit 120 can obtain the speaker feature based on the characteristic-acoustic vector sequence. Therefore, as described above, the verification device 10 of the voice authentication system 1 can perform the speaker verification with high accuracy based on the speaker feature.
Hardware Configuration
[0091] Each component of the voice processing devices 100 and 200 described in the first and second example embodiments represents a block on a function basis. Some or all of these components are achieved by an information processing device 900 as illustrated, for example, in
[0092] As illustrated in
[0102] The components of the voice processing devices 100 and 200 described in the first and second example embodiments are achieved by the CPU 901 reading and executing the program 904 that achieves these functions. The program 904 for achieving the function of each component is stored in the storage device 905 or the ROM 902 in advance, for example, and the CPU 901 loads the program 904 into the RAM 903 and executes the program 904 as necessary. The program 904 may be supplied to the CPU 901 via the communication network 909, or may be stored in advance in the recording medium 906, and the drive device 907 may read the program and supply the program to the CPU 901.
[0103] With the above configuration, the voice processing devices 100 and 200 described in the first and second example embodiments are achieved as hardware. Therefore, effects similar to the effects described in the first and second example embodiments can be obtained.
INDUSTRIAL APPLICABILITY
[0104] In one example, the present disclosure can be used in a voice authentication system that performs verification by analyzing voice data input using an input device.
Reference Signs List
[0105] 1 voice authentication system
[0106] 10 verification device
[0107] 100 voice processing device
[0108] 110 integration unit
[0109] 120 feature extraction unit
[0110] 200 voice processing device
[0111] 210 integration unit
[0112] 211 characteristic vector calculation unit
[0113] 212 voice conversion unit