NEURAL NETWORK-BASED SIGNAL PROCESSING APPARATUS, NEURAL NETWORK-BASED SIGNAL PROCESSING METHOD, AND COMPUTER-READABLE STORAGE MEDIUM
20220335950 · 2022-10-20
CPC classification
G06N7/01 · G10L17/26 · G10L17/02 (Physics)
Abstract
A spoofing detection apparatus 100 includes a multi-channel spectrogram creation unit 10 and an evaluation unit 40. The multi-channel spectrogram creation unit 10 extracts different types of spectrograms from speech data and integrates them to create a multi-channel spectrogram. The evaluation unit 40 evaluates the created multi-channel spectrogram by applying it to a classifier constructed using labeled multi-channel spectrograms as training data, and classifies it as either genuine or spoofed.
Claims
1. A neural network-based signal processing apparatus comprising: at least one memory storing instructions; and at least one processor configured to execute the instructions to: receive multi-dimension features which contain two or more two-dimension feature maps; produce an attention weight for each element in the multi-dimension features by using a neural network; and produce low-dimension features or posterior probabilities for designated classes, based on the multi-dimension features and the attention weight.
2. The neural network-based signal processing apparatus according to claim 1, wherein the at least one processor is further configured to execute the instructions to: squeeze the multi-dimension features along two dimensions by calculating statistics and produce an attention weight for the remaining one dimension by using a neural network.
3. The neural network-based signal processing apparatus according to claim 1, wherein the at least one processor is further configured to execute the instructions to: squeeze the multi-dimension features along any single dimension by calculating statistics and produce an attention weight for the remaining two dimensions by using a neural network.
4. The neural network-based signal processing apparatus according to claim 1, wherein the at least one processor is further configured to execute the instructions to: receive multi-dimension features which contain two or more two-dimension feature maps, and train an attention network jointly with a classification network, using labeled multi-dimension features.
5. The neural network-based signal processing apparatus according to claim 4, wherein the at least one processor is further configured to execute the instructions to: multiply a weight matrix and the multi-dimension features, and train the attention network jointly with a classification network, using the labeled multi-dimension features after multiplication.
6. The neural network-based signal processing apparatus according to claim 1, wherein the at least one processor is further configured to execute the instructions to: produce a posterior probability that the input multi-dimension features are from genuine speech or spoofing.
7. A neural network-based signal processing method comprising: receiving multi-dimension features which contain two or more two-dimension feature maps, producing an attention weight for each element in the multi-dimension features by using a neural network, and producing low-dimension features or posterior probabilities for designated classes, based on the multi-dimension features and the attention weight.
8. A non-transitory computer-readable storage medium storing a program that includes commands for causing a computer to execute: receiving multi-dimension features which contain two or more two-dimension feature maps, producing an attention weight for each element in the multi-dimension features by using a neural network, and producing low-dimension features or posterior probabilities for designated classes, based on the multi-dimension features and the attention weight.
9. The neural network-based signal processing method according to claim 7, further comprising squeezing the multi-dimension features along two dimensions by calculating statistics and producing an attention weight for the remaining one dimension by using a neural network.
10. The neural network-based signal processing method according to claim 7, further comprising squeezing the multi-dimension features along any single dimension by calculating statistics and producing an attention weight for the remaining two dimensions by using a neural network.
11. The neural network-based signal processing method according to claim 7, further comprising receiving multi-dimension features which contain two or more two-dimension feature maps, and training an attention network jointly with a classification network, using labeled multi-dimension features.
12. The neural network-based signal processing method according to claim 11, further comprising multiplying a weight matrix and the multi-dimension features, and training the attention network jointly with a classification network, using the labeled multi-dimension features after multiplication.
13. The neural network-based signal processing method according to claim 7, further comprising producing a posterior probability that the input multi-dimension features are from genuine speech or spoofing.
14. The non-transitory computer-readable storage medium according to claim 8, wherein the program further includes commands causing the computer to execute: squeezing the multi-dimension features along two dimensions by calculating statistics and producing an attention weight for the remaining one dimension by using a neural network.
15. The non-transitory computer-readable storage medium according to claim 8, wherein the program further includes commands causing the computer to execute: squeezing the multi-dimension features along any single dimension by calculating statistics and producing an attention weight for the remaining two dimensions by using a neural network.
16. The non-transitory computer-readable storage medium according to claim 8, wherein the program further includes commands causing the computer to execute: receiving multi-dimension features which contain two or more two-dimension feature maps, and training an attention network jointly with a classification network, using labeled multi-dimension features.
17. The non-transitory computer-readable storage medium according to claim 16, wherein the program further includes commands causing the computer to execute: multiplying a weight matrix and the multi-dimension features, and training the attention network jointly with a classification network, using the labeled multi-dimension features after multiplication.
18. The non-transitory computer-readable storage medium according to claim 8, wherein the program further includes commands causing the computer to execute: producing a posterior probability that the input multi-dimension features are from genuine speech or spoofing.
Description
BRIEF DESCRIPTION OF DRAWINGS
[0015] The drawings, together with the detailed description, serve to explain the principles of the inventive neural network-based signal processing method. The drawings are for illustration and do not limit the application of the technique.
[0028] Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures illustrating integrated circuit architecture may be exaggerated relative to other elements to help to improve understanding of the present and alternate example embodiments.
DESCRIPTION OF EMBODIMENTS
[0029] Each example embodiment of the present invention will be described below with reference to the figures. The following detailed descriptions are merely exemplary in nature and are not intended to limit the invention or the application and uses of the invention. Furthermore, there is no intention to be bound by any theory presented in the preceding background of the invention or the following detailed description.
Embodiment
[0030] Example embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Apparatus Configuration
[0031] First, a configuration of a neural network-based signal processing apparatus 100 according to the present embodiment will be described using
[0032] As shown in
[0033] As described above, the neural network-based signal processing apparatus 100 makes it possible to evaluate important features and support selection of those features, even if the important features are located differently across feature maps.
[0034] Subsequently, the configuration of the neural network-based signal processing apparatus according to the embodiment will be more specifically described with reference to
[0035] In the present embodiment, the neural network-based signal processing apparatus functions in a training phase and a test phase. Therefore, in
[0036] As shown in
[0037] Among these, the feature map extraction unit 10 and the multiple feature map stacking unit 20 function in both phases. For this reason, the feature map extraction unit 10 is represented as 10_a in the training phase and 10_b in the testing phase. Similarly, the multiple feature map stacking unit 20 is represented as 20_a in the training phase and 20_b in the testing phase.
[0038] In the training phase, the feature map extraction unit 10_a extracts multiple feature maps from input training data. The multiple feature map stacking unit 20_a stacks the extracted feature maps into a 3D feature set. The multi-dimension attentive NN training unit 30 trains a neural network using the 3D feature sets and the labels of the training data, and stores the trained NN parameters in the NN parameter storage 40.
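As an illustrative aside (not part of the application text), the stacking step above can be sketched in Python/NumPy; the shapes and placeholder feature maps are assumptions chosen for the example:

```python
import numpy as np

def stack_feature_maps(feature_maps):
    """Stack several [d_t, d_f] feature maps into one [d_c, d_t, d_f]
    3D feature set, with one channel per feature-map type."""
    return np.stack(feature_maps, axis=0)

# Two hypothetical 2-D feature maps (e.g. two spectrogram variants),
# each with d_t = 100 frames and d_f = 64 frequency bins.
maps = [np.random.randn(100, 64) for _ in range(2)]
feats = stack_feature_maps(maps)
print(feats.shape)  # (2, 100, 64), i.e. [d_c, d_t, d_f]
```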
[0039] In the evaluation phase, the feature map extraction unit 10_b extracts multiple feature maps from input testing data. The multiple feature map stacking unit 20_b stacks the extracted feature maps into a 3D feature set. The multi-dimension attentive NN evaluation unit 50 receives the NN parameters from the storage 40 and receives the 3D feature set from the multiple feature map stacking unit 20_b. After that, the multi-dimension attentive NN evaluation unit 50 calculates the posterior for a certain output node.
[0040] In an example of spoofing detection, the multi-dimension attentive NN evaluation unit 50 calculates the posterior of the node "spoof" as the score. Note that the multi-dimension attentive NN evaluation unit 50 can also output hidden layers as a new feature set for the input audio. The feature set can then be used together with any classifier, such as cosine similarity, probabilistic linear discriminant analysis (PLDA), and so on.
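The cosine-similarity back-end mentioned above can be sketched minimally; the function name and vectors are illustrative, not from the application:

```python
import numpy as np

def cosine_score(embedding, reference):
    """Score a hidden-layer embedding against a reference vector with
    cosine similarity; higher means more similar."""
    return float(np.dot(embedding, reference) /
                 (np.linalg.norm(embedding) * np.linalg.norm(reference)))

e = np.array([1.0, 0.0, 1.0])
r = np.array([1.0, 0.0, 1.0])
print(cosine_score(e, r))  # 1.0 for identical directions
```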
[0041] Furthermore, the multi-dimension attentive NN evaluation unit 50 can squeeze the multi-dimension features along two dimensions by calculating statistics and produce an attention weight for the remaining one dimension by using the neural network. Moreover, the multi-dimension attentive NN evaluation unit 50 can squeeze the multi-dimension features along any single dimension by calculating statistics and produce an attention weight for the remaining two dimensions by using a neural network.
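The two squeezing variants described above can be illustrated as follows. This is a hedged sketch: the mean/std statistics follow the description, while the function names and shapes are assumptions:

```python
import numpy as np

def squeeze_two_dims(x, keep_axis):
    """Squeeze a [d_c, d_t, d_f] feature set along the two axes other
    than keep_axis; returns concatenated mean and standard deviation
    statistics of the kept dimension."""
    axes = tuple(a for a in range(3) if a != keep_axis)
    return np.concatenate([x.mean(axis=axes), x.std(axis=axes)])

def squeeze_one_dim(x, squeeze_axis):
    """Squeeze along a single axis; the statistics keep the remaining
    two dimensions."""
    return np.stack([x.mean(axis=squeeze_axis), x.std(axis=squeeze_axis)])

x = np.random.randn(2, 100, 64)   # [d_c, d_t, d_f]
s1 = squeeze_two_dims(x, 0)       # statistics of the channel dimension
s2 = squeeze_one_dim(x, 1)        # statistics keeping channel and frequency
print(s1.shape)  # (4,)  = mean and std, each of length d_c = 2
print(s2.shape)  # (2, 2, 64) = [mean/std, d_c, d_f]
```

Either statistics vector would then be fed to the attention network to produce the weights for the remaining dimension(s).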
[0042] Five specific examples of the multi-dimension attentive neural network training unit 30 will be described with reference to
[0044] The T&F squeezing unit 11_a squeezes the input 3D feature sets of [d_c, d_t, d_f] dimension along both the time and frequency dimensions, and obtains two statistics (mean and standard deviation) of d_c dimension. The channel-attentive NN training unit 12_a takes the statistics as input, outputs a set of weights for channels, and expands the weights of d_c dimension into [d_c, d_t, d_f] by copying, the same size as the input feature map.
[0045] One example of the channel-attentive NN training unit 12_a is shown
[0046] The T&C squeezing unit 13_a squeezes the 3D feature sets along both the time and channel dimensions, and obtains the mean and standard deviation statistics of d_f dimension. The frequency-attentive NN training unit 14_a takes the statistics as input, outputs a set (d_f) of weights for frequency bins, and expands the weights into [d_c, d_t, d_f] dimension, the same size as the input feature map. The frequency-attentive NN training unit 14_a can be the same as or different from the example of the channel-attentive NN training unit 12_a shown
[0047] The F&C squeezing unit 15_a squeezes the 3D feature sets along both the frequency and channel dimensions, and obtains the mean and standard deviation statistics of d_t dimension. The time-attentive NN training unit 16_a takes the statistics as input, outputs a set (d_t) of weights for time frames, and expands the weights into [d_c, d_t, d_f] dimension, the same size as the input feature map. The time-attentive NN training unit 16_a can be the same as or different from the example of the channel-attentive NN training unit 12_a shown
[0048] The multiplication unit 17_a multiplies the three weight matrices with the input 3D feature sets in an element-wise manner, and passes the result to the NN training unit 18_a, which includes one or more hidden layers and one output layer. In an example of spoofing detection, the output layer consists of two nodes, "spoof" and "genuine". In an example of speaker recognition, the nodes in the output layer are speaker IDs. Note that the multi-dimension attentive NN training unit 30 (11_a to 18_a) is trained jointly with only one objective function, for example, cross-entropy loss minimization.
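The parallel form of the first example can be sketched as below. Broadcasting plays the role of the copy-expansion described above; the softmax weights are placeholders for the outputs of the attention networks:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def parallel_attention(x, w_c, w_t, w_f):
    """Element-wise application of channel, time and frequency attention
    weights: each weight vector is expanded to [d_c, d_t, d_f] by
    broadcasting before the element-wise product."""
    d_c, d_t, d_f = x.shape
    return (x
            * w_c.reshape(d_c, 1, 1)   # channel weights
            * w_t.reshape(1, d_t, 1)   # time weights
            * w_f.reshape(1, 1, d_f))  # frequency weights

x = np.random.randn(2, 100, 64)
# Hypothetical weights; in the application they come from small NNs fed
# with the squeezed mean/std statistics.
w_c, w_t, w_f = (softmax(np.random.randn(n)) for n in (2, 100, 64))
y = parallel_attention(x, w_c, w_t, w_f)
print(y.shape)  # (2, 100, 64)
```

The weighted tensor `y` would then be passed to the classification layers trained with a single objective such as cross-entropy.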
[0050] The T&F squeezing unit 11_b squeezes the input 3D feature sets of [d_c, d_t, d_f] dimension along both the time and frequency dimensions, and obtains two statistics (mean and standard deviation) of d_c dimension. The channel-attentive NN training unit 12_b takes the statistics as input, outputs a set of weights for channels, and expands the weights of d_c dimension into [d_c, d_t, d_f], the same size as the input 3D feature sets. The channel-attentive NN training unit 12_b can be the same as or different from the example of the channel-attentive NN training unit 12_a shown
[0051] The T&C squeezing unit 13_b squeezes the output of 17_b along both the time and channel dimensions, and obtains the mean and standard deviation statistics of d_f dimension. The frequency-attentive NN training unit 14_b takes the statistics as input, outputs a set (d_f) of weights for frequency bins, and expands the weights into [d_c, d_t, d_f], the same size as the input feature map. The frequency-attentive NN training unit 14_b can be the same as or different from the example of the channel-attentive NN training unit 12_a shown
[0052] The F&C squeezing unit 15_b squeezes the feature map input along both the frequency and channel dimensions, and obtains the mean and standard deviation statistics of d_t dimension. The time-attentive NN training unit 16_b takes the statistics as input, outputs a set (d_t) of weights for time frames, and expands the weights into [d_c, d_t, d_f], the same size as the input feature map. The time-attentive NN training unit 16_b can be the same as or different from the example of the channel-attentive NN training unit 12_a shown
[0053] The NN training unit 18_b takes the output of the multiplication unit 17_d as input. The NN training unit 18_b includes one or more hidden layers and one output layer. Note that the multi-dimension attentive NN training unit 30 (11_b to 18_b) is trained jointly with only one objective function.
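The cascaded form of this second example can be sketched as follows; each attention re-squeezes the output of the previous multiplication rather than the raw input. The sigmoid-of-mean gates stand in for the attention networks, which the application leaves unspecified:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sequential_attention(x):
    """Apply channel, frequency, and time attention one after another,
    re-computing the squeezed statistics from each intermediate result."""
    x = x * sigmoid(x.mean(axis=(1, 2)))[:, None, None]  # channel gate
    x = x * sigmoid(x.mean(axis=(0, 1)))[None, None, :]  # frequency gate
    x = x * sigmoid(x.mean(axis=(0, 2)))[None, :, None]  # time gate
    return x

x = np.random.randn(2, 100, 64)
z = sequential_attention(x)
print(z.shape)  # (2, 100, 64)
```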
[0055] The T squeezing unit 19_a squeezes the input 3D feature sets of the dimension [d_c, d_t, d_f] along the time dimension, and obtains two statistics (mean and standard deviation) of [d_c, d_f] dimension. The channel-frequency attentive NN training unit 20_a takes the statistics as input, outputs a set of weights of dimension [d_c, d_f], and expands the weights into [d_c, d_t, d_f], the same size as the input feature map. The channel-frequency attentive NN training unit 20_a can be the same as or different from the example of the channel-attentive NN training unit 12_a shown
[0056] The F&C squeezing unit 15_a squeezes the input 3D feature sets along both the frequency and channel dimensions, and obtains the mean and standard deviation statistics of d_t dimension. The time-attentive NN training unit 16_a takes the statistics as input, outputs a set (d_t) of weights for time frames, and expands the weights into [d_c, d_t, d_f], the same size as the input feature map. The time-attentive NN training unit 16_a can be the same as or different from the example of the channel-attentive NN training unit 12_a shown
[0057] The multiplication unit 17_e multiplies the two weight matrices with the input 3D feature maps in an element-wise manner, and passes the result to the NN training unit 18_c, which includes one or more hidden layers and one output layer. Note that the multi-dimension attentive NN training unit 30 is trained together with only one objective function.
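The channel-frequency attention of this third example can be sketched as below. Squeezing only the time axis leaves the statistics, and hence the weights, with both the channel and frequency dimensions; a sigmoid over the combined statistics stands in for the (unspecified) attention network:

```python
import numpy as np

def channel_frequency_attention(x):
    """Squeeze the time axis of a [d_c, d_t, d_f] feature set and apply
    [d_c, d_f] weights, expanded along time by broadcasting."""
    mean = x.mean(axis=1)                    # [d_c, d_f]
    std = x.std(axis=1)                      # [d_c, d_f]
    w = 1.0 / (1.0 + np.exp(-(mean + std)))  # placeholder attention net
    return x * w[:, None, :]                 # broadcast over d_t

x = np.random.randn(2, 100, 64)
out = channel_frequency_attention(x)
print(out.shape)  # (2, 100, 64)
```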
[0059] The T squeezing unit 19_b squeezes the input 3D feature sets of [d_c, d_t, d_f] dimension along the time dimension, and obtains two statistics (mean and standard deviation) of [d_c, d_f] dimension. The channel-frequency attentive network 20_b takes the statistics as input, outputs a set of weights of [d_c, d_f] dimension, and expands the weights into [d_c, d_t, d_f], the same size as the input feature map. The channel-frequency attentive NN training unit 20_b can be the same as or different from the example of the channel-attentive NN training unit 12_a shown
[0060] The F&C squeezing unit 15_d squeezes the output of 17_f along both the frequency and channel dimensions, and obtains the mean and standard deviation statistics of d_t dimension. The time-attentive NN training unit 16_d takes the statistics as input, outputs a set (d_t) of weights for time frames, and expands the weights into [d_c, d_t, d_f], the same size as the input 3D feature sets. The time-attentive NN training unit 16_d can be the same as or different from the example of the channel-attentive NN training unit 12_a shown
[0061] The NN training unit 18_d takes the output of 17_g as input and includes one or more hidden layers and one output layer. Note that the multi-dimension attentive NN training unit 30 is trained together with only one objective function.
[0062] In the third (
[0064] The channel-time-frequency attentive network 21 takes the 3D feature sets as input and outputs a set of weights of [d_c, d_t, d_f] dimension. The channel-time-frequency attentive network 21 can be the same as or different from the example of the channel-attentive NN training unit 12_a shown
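The fifth example produces one weight per element, so no squeezing or expansion is needed before the element-wise product. A minimal sketch, with a plain element-wise sigmoid gate as a stand-in for the attention network:

```python
import numpy as np

def full_attention(x, attention_net):
    """Apply per-element attention: the network outputs weights of the
    same [d_c, d_t, d_f] shape as its input."""
    w = attention_net(x)
    assert w.shape == x.shape
    return x * w  # element-wise product (multiplication unit)

# Hypothetical attention network: an element-wise sigmoid gate.
gate = lambda v: 1.0 / (1.0 + np.exp(-v))
x = np.random.randn(2, 100, 64)
r = full_attention(x, gate)
print(r.shape)  # (2, 100, 64)
```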
[0065] The NN training unit 18_e takes the output of 17_h as input and includes one or more hidden layers and one output layer. Note that the multi-dimension attentive NN training unit 30 is trained together with only one objective function.
Operations of Apparatus
[0066] Operations performed by the neural network-based signal processing apparatus 100 according to the embodiment of the present invention will be described with reference to
[0068] First, as shown in
Effect of the Example Embodiment
[0071] This invention introduces an attention mechanism across multiple feature maps and supports automatic selection of the best features. According to the present embodiment, it is possible to select features important to the speech processing tasks, even if they are located differently across feature maps. The five examples of the multi-dimension attentive NN training unit (
[0072] The first (
[0073] The third (
[0074] The fifth (
Program
[0075] A program of the embodiment need only be a program for causing a computer to execute steps A01 to A02 shown in
[0076] The program according to the embodiment of the present invention may be executed by a computer system constructed using a plurality of computers. In this case, for example, each computer may function as a different one of the feature map extraction unit 10, the multiple feature map stacking unit 20, the multi-dimension attentive NN training unit 30, the NN parameter storage, and the multi-dimension attentive NN evaluation unit 50.
Physical Configuration
[0077] The following describes a computer that realizes the neural network-based signal processing apparatus by executing the program of the embodiment, with reference to
[0078] As shown in
[0079] The CPU 111 carries out various calculations by expanding programs (codes) according to the present embodiment, which are stored in the storage device 113, into the main memory 112 and executing them in a predetermined sequence. The main memory 112 is typically a volatile storage device such as a DRAM (Dynamic Random-Access Memory). Also, the program according to the present embodiment is provided in a state of being stored in a computer-readable storage medium 120. Note that the program according to the present embodiment may be distributed over the Internet, to which the computer is connected via the communication interface 117.
[0080] Also, specific examples of the storage device 113 include a semiconductor storage device such as a flash memory, in addition to a hard disk drive. The input interface 114 mediates data transmission between the CPU 111 and an input device 118 such as a keyboard or a mouse. The display controller 115 is connected to a display device 119 and controls display on the display device 119.
[0081] The data reader/writer 116 mediates data transmission between the CPU 111 and the storage medium 120, reads out programs from the storage medium 120, and writes results of processing performed by the computer 110 in the storage medium 120. The communication interface 117 mediates data transmission between the CPU 111 and another computer.
[0082] Also, specific examples of the storage medium 120 include a general-purpose semiconductor storage device such as CF (Compact Flash (registered trademark)) and SD (Secure Digital), a magnetic storage medium such as a flexible disk, and an optical storage medium such as a CD-ROM (Compact Disk Read Only Memory).
[0083] The neural network-based signal processing apparatus 100 according to the present exemplary embodiment can also be realized using items of hardware corresponding to the various components, rather than using a computer having the program installed therein. Furthermore, part of the neural network-based signal processing apparatus 100 may be realized by the program, and the remaining part of the neural network-based signal processing apparatus 100 may be realized by hardware.
[0084] The above-described embodiment can be partially or entirely expressed by, but is not limited to, the following Supplementary Notes 1 to 18.
(Supplementary Note 1)
[0085] A neural network-based signal processing apparatus comprising:
[0086] a multi-dimension attentive neural network evaluation unit that receives multi-dimension features which contain two or more two-dimension feature maps, produces an attention weight for each element in the multi-dimension features by using a neural network, and produces low-dimension features or posterior probabilities for designated classes, based on the multi-dimension features and the attention weight.
(Supplementary Note 2)
[0087] The neural network-based signal processing apparatus according to supplementary note 1,
[0088] wherein the multi-dimension attentive neural network evaluation unit squeezes the multi-dimension features along two dimensions by calculating statistics and produces an attention weight for the remaining one dimension by using a neural network.
(Supplementary Note 3)
[0089] The neural network-based signal processing apparatus according to supplementary note 1,
[0090] wherein the multi-dimension attentive neural network evaluation unit squeezes the multi-dimension features along any single dimension by calculating statistics and produces an attention weight for the remaining two dimensions by using a neural network.
(Supplementary Note 4)
[0091] The neural network-based signal processing apparatus according to any of supplementary notes 1 to 3, further comprising
[0092] a multi-dimension attentive network training unit that receives multi-dimension features which contain two or more two-dimension feature maps and trains an attention network jointly with a classification network, using labeled multi-dimension features.
(Supplementary Note 5)
[0093] The neural network-based signal processing apparatus according to supplementary note 4,
[0094] wherein the multi-dimension attentive network training unit multiplies a weight matrix and the multi-dimension features, and trains the attention network jointly with a classification network, using the labeled multi-dimension features after multiplication.
(Supplementary Note 6)
[0095] The neural network-based signal processing apparatus according to any of supplementary notes 1 to 5,
[0096] wherein the multi-dimension attentive neural network evaluation unit produces a posterior probability that the input multi-dimension features are from genuine speech or spoofing.
(Supplementary Note 7)
[0097] A neural network-based signal processing method comprising:
[0098] (a) a step of receiving multi-dimension features which contain two or more two-dimension feature maps, producing an attention weight for each element in the multi-dimension features by using a neural network, and producing low-dimension features or posterior probabilities for designated classes, based on the multi-dimension features and the attention weight.
(Supplementary Note 8)
[0099] The neural network-based signal processing method according to supplementary note 7,
[0100] wherein the step (a) includes squeezing the multi-dimension features along two dimensions by calculating statistics and producing an attention weight for the remaining one dimension by using a neural network.
(Supplementary Note 9)
[0101] The neural network-based signal processing method according to supplementary note 7,
[0102] wherein the step (a) includes squeezing the multi-dimension features along any single dimension by calculating statistics and producing an attention weight for the remaining two dimensions by using a neural network.
(Supplementary Note 10)
[0103] The neural network-based signal processing method according to any of supplementary notes 7 to 9, further comprising
[0104] (c) a step of receiving multi-dimension features which contain two or more two-dimension feature maps and training an attention network jointly with a classification network, using labeled multi-dimension features.
(Supplementary Note 11)
[0105] The neural network-based signal processing method according to supplementary note 10,
[0106] wherein the step (c) includes multiplying a weight matrix and the multi-dimension features, and training the attention network jointly with a classification network, using the labeled multi-dimension features after multiplication.
(Supplementary Note 12)
[0107] The neural network-based signal processing method according to any of supplementary notes 7 to 11,
[0108] wherein the step (a) includes producing a posterior probability that the input multi-dimension features are from genuine speech or spoofing.
(Supplementary Note 13)
[0109] A computer-readable storage medium storing a program that includes commands for causing a computer to execute:
[0110] (a) a step of receiving multi-dimension features which contain two or more two-dimension feature maps, producing an attention weight for each element in the multi-dimension features by using a neural network, and producing low-dimension features or posterior probabilities for designated classes, based on the multi-dimension features and the attention weight.
(Supplementary Note 14)
[0111] The computer-readable storage medium according to supplementary note 13, wherein the step (a) includes squeezing the multi-dimension features along two dimensions by calculating statistics and producing an attention weight for the remaining one dimension by using a neural network.
(Supplementary Note 15)
[0112] The computer-readable storage medium according to supplementary note 13,
[0113] wherein the step (a) includes squeezing the multi-dimension features along any single dimension by calculating statistics and producing an attention weight for the remaining two dimensions by using a neural network.
(Supplementary Note 16)
[0114] The computer-readable storage medium according to any of supplementary notes 13 to 15,
[0115] wherein the program further includes commands causing the computer to execute (c) a step of receiving multi-dimension features which contain two or more two-dimension feature maps and training an attention network jointly with a classification network, using labeled multi-dimension features.
(Supplementary Note 17)
[0116] The computer-readable storage medium according to supplementary note 16, wherein the step (c) includes multiplying a weight matrix and the multi-dimension features, and training the attention network jointly with a classification network, using the labeled multi-dimension features after multiplication.
(Supplementary Note 18)
[0117] The computer-readable storage medium according to any of supplementary notes 13 to 17,
[0118] wherein the step (a) includes producing a posterior probability that the input multi-dimension features are from genuine speech or spoofing.
[0119] Although the invention of the present application has been described above with reference to the embodiment, the invention of the present application is not limited to the above embodiment. Various changes that can be understood by a person skilled in the art can be made to the configurations and details of the invention of the present application within the scope of the invention of the present application.
INDUSTRIAL APPLICABILITY
[0120] As described above, according to the present invention, it is possible to suppress misrecognition by using multiple spectrograms obtained from speech in speaker spoofing detection. The present invention is useful in fields such as speaker verification.
REFERENCE SIGNS LIST
[0121] 10 feature map extraction unit
[0122] 20 multiple feature map stacking unit
[0123] 30 multi-dimension attentive neural network (NN) training unit
[0124] 40 neural network (NN) parameter storage
[0125] 50 multi-dimension attentive neural network (NN) evaluation unit
[0126] 100 neural network-based signal processing apparatus
[0127] 110 Computer
[0128] 111 CPU
[0129] 112 Main memory
[0130] 113 Storage device
[0131] 114 Input interface
[0132] 115 Display controller
[0133] 116 Data reader/writer
[0134] 117 Communication interface
[0135] 118 Input device
[0136] 119 Display apparatus
[0137] 120 Storage medium
[0138] 121 Bus