SPEECH IMAGERY RECOGNITION DEVICE, WEARING FIXTURE, SPEECH IMAGERY RECOGNITION METHOD, AND PROGRAM
20220238113 · 2022-07-28
CPC classification: G10L15/22; G10L15/02; G10L15/30; G10L25/18; G10L2015/025
International classification: G10L15/02; G10L15/22; G10L15/30
Abstract
According to one embodiment, a speech imagery recognition device is configured to recognize speech from electroencephalogram (EEG) signals during speech imagery. The speech imagery recognition device comprises an analysis processor and an extractor. The analysis processor is configured to analyze discrete signals, which are obtained from EEG signals received from a plurality of electrodes, for each of the electrodes and output a spectral time sequence. The extractor is configured to obtain eigenvectors for each phoneme from the spectral time sequence and output a phoneme-feature vector time sequence based on the eigenvectors.
Claims
1. A speech imagery recognition device configured to recognize speech from electroencephalogram (EEG) signals during speech imagery, the device comprising: an analysis processor configured to analyze discrete signals, which are obtained from EEG signals received from a plurality of electrodes, for each of the electrodes and output a spectral time sequence; and an extractor configured to obtain eigenvectors for each phoneme from the spectral time sequence and output a phoneme-feature vector time sequence based on the eigenvectors.
2. The speech imagery recognition device according to claim 1, further comprising an EEG input unit configured to convert the EEG signals received from the electrodes to the discrete signals.
3. The speech imagery recognition device according to claim 1, further comprising a preprocessor configured to subtract an average noise amplitude spectrum from a spectrum of a speech imagery signal obtained by converting the discrete signals to a frequency domain to remove noise from the EEG signals.
4. The speech imagery recognition device according to claim 3, wherein the preprocessor is further configured to perform an independent component analysis that extracts a small number of independent information sources from each electrode signal after noise removal.
5. The speech imagery recognition device according to claim 1, further comprising a recognizer configured to recognize speech based on the phoneme-feature vector time sequence.
6. The speech imagery recognition device according to claim 5, further comprising an output unit configured to output the speech recognized by the recognizer.
7. The speech imagery recognition device according to claim 6, wherein the output unit is further configured to display a screen that helps a user adjust the optimal position of the electrodes while performing speech imagery.
8. The speech imagery recognition device according to claim 1, wherein the analysis processor is further configured to extract the spectral time sequence using a linear predictive analysis.
9. The speech imagery recognition device according to claim 1, wherein the analysis processor is further configured to perform a process of absorbing a frequency fluctuation based on the discrete signals.
10. The speech imagery recognition device according to claim 1, wherein the analysis processor is further configured to extract a frequency derived from a peak on a frequency axis as a line spectrum component for each time frame.
11. The speech imagery recognition device according to claim 1, wherein the extractor is further configured to output a phoneme-likelihood vector time sequence, which is a linguistic feature, through a predetermined convolution operator.
12. The speech imagery recognition device according to claim 1, further comprising a plurality of electrodes configured to be placed over Broca's area.
13. The speech imagery recognition device according to claim 12, further comprising a wearing fixture configured to be worn on the head.
14. The speech imagery recognition device according to claim 1, comprising either or both of a mobile terminal and a server.
15. A wearing fixture for a speech imagery recognition device configured to recognize speech from electroencephalogram (EEG) signals during speech imagery, the wearing fixture comprising: a plurality of electrodes configured to be placed over Broca's area; and a processor configured to output signals from the electrodes, wherein the speech imagery recognition device is configured to: analyze discrete signals, which are obtained from EEG signals output from the processor, for each of the electrodes to output a spectral time sequence; and extract and output a phoneme-feature vector time sequence based on the spectral time sequence.
16. A speech imagery recognition method for recognizing speech from electroencephalogram (EEG) signals during speech imagery, the method comprising: analyzing discrete signals, which are obtained from EEG signals received from a plurality of electrodes, for each of the electrodes to output a spectral time sequence; and extracting and outputting a phoneme-feature vector time sequence based on the spectral time sequence.
17. A computer program product comprising a non-transitory computer-usable medium having a computer-readable program code embodied therein for recognizing speech from electroencephalogram (EEG) signals during speech imagery, the computer-readable program code causing a computer to: analyze discrete signals, which are obtained from EEG signals received from a plurality of electrodes, for each of the electrodes to output a spectral time sequence; and extract a phoneme-feature vector time sequence based on a spectral component for each of the electrodes.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
MODES FOR CARRYING OUT THE INVENTION
Embodiments
[0039] In the following, exemplary embodiments of a speech imagery recognition device according to the present invention will be described with reference to the accompanying drawings. Note that the drawings are used to illustrate the technical features of the invention, and are not intended to limit the configuration of the device as well as various processing procedures and the like to those aspects illustrated therein unless otherwise specifically mentioned. Incidentally, like parts are designated by like reference numerals or characters throughout the description of the embodiments.
[0041] The speech imagery recognition device 1 includes an EEG input unit 2, a preprocessor 3, an analysis processor 4, a linguistic feature extractor 5, a word/sentence recognizer 6, and a post-processing/output unit 7. The EEG input unit 2 is configured to convert EEG signals received from a plurality of electrodes placed on the scalp (not illustrated) into discrete signals. The preprocessor 3 is configured to remove noise from the discrete signals for each electrode. The analysis processor 4 is configured to analyze the discrete signals for each electrode and output a spectral time sequence. The linguistic feature extractor 5 is configured to extract and output a phoneme-feature vector time sequence based on the spectral time sequence of all the electrodes. The word/sentence recognizer 6 is configured to recognize words and sentences that constitute a spoken language from the phoneme-feature vector time sequence. The post-processing/output unit 7 is configured to display speech information or output the information in audio.
[0042] The EEG input unit 2 converts analog signals x(q, t) output from the EEG electrodes into discrete signals through A/D conversion or the like, and corrects the bias of the individual electrodes by using the average value of discrete signals of all the electrodes or the like. At the same time, the EEG input unit 2 removes unnecessary frequency components of 70 Hz or below by a low-frequency removal filter (high-pass filter) and unnecessary frequency components of 180 Hz or above by a high-frequency removal filter (low-pass filter) from discrete signals for each electrode, thereby outputting a signal x.sub.1(q, n).
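The processing of the EEG input unit 2 can be written as the following illustrative sketch (not part of the claimed embodiment); the sampling rate, the Butterworth filter order, and the use of a common-average reference for bias correction are assumptions not specified above:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def eeg_input(x, fs=1000.0, band=(70.0, 180.0)):
    """Convert multi-electrode EEG samples x (Q electrodes x N samples)
    into bias-corrected, band-limited discrete signals x1(q, n)."""
    x = np.asarray(x, dtype=float)
    # Bias correction: remove each electrode's DC offset, then subtract the
    # cross-electrode average (common-average reference) at each sample.
    x = x - x.mean(axis=1, keepdims=True)
    x = x - x.mean(axis=0, keepdims=True)
    # Band-pass 70-180 Hz: the high-pass edge removes components of 70 Hz or
    # below, the low-pass edge removes components of 180 Hz or above.
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x, axis=1)
```

With a common-average reference, any component shared identically by all electrodes cancels, while electrode-specific activity within the 70-180 Hz band is retained.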
[0044] The preprocessor 3 removes noise that has passed through the filters for each electrode. One example of this process will be described below. First, the discrete signal x.sub.1(q, n) (q: electrode number, n: time) of each electrode, which has undergone a series of processes in the EEG input unit 2, is multiplied by a certain time window, and then it is mapped from the time domain to the frequency domain using the Fast Fourier Transform (FFT). Thereafter, an amplitude spectral time sequence X.sub.1(q, f, n′) (f: frequency, n′: time frame number after windowing) is obtained from complex number components in the frequency domain as follows:
[Formula 1]
FFT: x_1(q,n) → Re{X_1(q,f,n′)} + j·Im{X_1(q,f,n′)}  (1)
[Formula 2]
X_1(q,f,n′) = [Re{X_1(q,f,n′)}² + Im{X_1(q,f,n′)}²]^{1/2}  (2)
[0045] where j represents an imaginary unit, and Re{ } and Im{ } represent a real part and an imaginary part, respectively. In noise subtraction, an average noise amplitude spectrum is obtained from the spectrum N(q, f, n′) of an EEG signal recorded prior to speech imagery by the following formula.
[0046] In the above formula, an average noise spectrum is calculated from 8 frames before and after time n′; however, it may be set as appropriate depending on the system. In the setting of time n′, generally, there may be the following two ways:
[0047] (a) The user performs speech imagery in response to a prompt signal (a signal that indicates the start of imagery) provided by a speech imagery recognition application system.
[0048] (b) The user performs speech imagery after providing the application system with a predetermined call (wake-up word) such as “Yamada-san”.
[0049] In both cases, N(q, f, n′) is calculated from EEG signals recorded in the section before or after the speech imagery.
[0050] Then, for each electrode q, Nav(q, f, n′) is subtracted from the speech imagery signal spectrum X.sub.1(q, f, n′) as represented by the following formula:
[Formula 4]
X_2(q,f,n′) = X_1(q,f,n′) − Nav(q,f,n′)  (4)
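The windowing, FFT, and noise-subtraction steps of Formulas (1), (2), and (4) can be sketched as follows. The Hann window, frame length, hop size, and the clipping of negative differences at zero are assumptions for illustration only:

```python
import numpy as np

def amplitude_spectrogram(x1, frame_len=64, hop=32):
    """Formulas (1)-(2): windowed FFT amplitude spectrum X1(q, f, n')
    for one electrode's discrete signal x1(n). Returns frames x freq."""
    win = np.hanning(frame_len)
    starts = range(0, len(x1) - frame_len + 1, hop)
    frames = np.stack([x1[s:s + frame_len] * win for s in starts])
    spec = np.fft.rfft(frames, axis=1)
    return np.abs(spec)  # sqrt(Re^2 + Im^2)

def subtract_noise(X1, N):
    """Formula (4): subtract the average noise amplitude spectrum,
    estimated from a pre-imagery recording N (frames x freq), from X1.
    Negative results are clipped at zero (an assumed floor)."""
    n_av = np.abs(N).mean(axis=0)  # Nav(f): average over noise frames
    return np.maximum(X1 - n_av, 0.0)
```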
[0052] It should be noted that it is effective to perform the process of extracting a small number of independent information sources from signals of the 9 electrodes after noise removal, i.e., independent component analysis (ICA) (Non-Patent Document 4). This process can remove unnecessary components that cannot be removed by the filtering process and can also select a small number of effective information sources from discrete signals of the 9 electrodes. However, ICA has a drawback of so-called permutation in which the order of independent components varies in the result of each analysis. How this drawback is eliminated to incorporate ICA into this embodiment will be explained later.
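The independent component analysis mentioned above can be sketched with a minimal symmetric FastICA. The tanh nonlinearity, the whitening procedure, and the iteration count are conventional choices for illustration, not details taken from the embodiment:

```python
import numpy as np

def fastica(x, n_components, n_iter=200, seed=0):
    """Minimal symmetric FastICA (tanh nonlinearity): extracts a small
    number of independent components from multi-electrode signals
    x (Q electrodes x N samples)."""
    rng = np.random.default_rng(seed)
    x = x - x.mean(axis=1, keepdims=True)
    # Whiten: project onto the leading eigenvectors of the covariance
    # matrix and scale each direction to unit variance.
    d, e = np.linalg.eigh(np.cov(x))
    idx = np.argsort(d)[::-1][:n_components]
    wh = (e[:, idx] / np.sqrt(d[idx])).T
    z = wh @ x
    # Fixed-point iteration with symmetric decorrelation.
    w = rng.standard_normal((n_components, n_components))
    for _ in range(n_iter):
        g = np.tanh(w @ z)
        w = (g @ z.T) / z.shape[1] - np.diag((1.0 - g ** 2).mean(axis=1)) @ w
        u, _, vt = np.linalg.svd(w)
        w = u @ vt  # W <- (W W^T)^(-1/2) W
    return w @ z  # recovered independent components
```

Note that the recovered components come back in arbitrary order and sign, which is exactly the permutation drawback discussed above.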
[0053] While the analysis processor 4 may use the spectral time sequence X.sub.2(q, f, n′) of the speech imagery signal after noise removal (and after extraction of q independent components) obtained by the preprocessor 3, linear predictive analysis (LPA) is used as an analysis method that brings out better the effect of the present invention in an example described below. The analysis processor 4 can use a spectrum or a line spectrum.
[0054] Linear predictive coding (LPC) is currently a global standard method for speech communication. There are two information sources in speech: pulse waves at a constant frequency produced by the vocal cords and random waves produced by narrowing the vocal tract. For this reason, a complex process is required in which the sound sources are stored separately as a codebook, every sound source in the codebook is passed through a filter defined by the linear prediction coefficients of speech (which represent the transfer function of the vocal tract), and the synthesized speech is then compared with the original speech.
[0055] On the other hand, the only source of information in brain waves is considered to be random waves as illustrated in
[0056] Since convolution in the time domain is multiplication in the frequency domain, the EEG spectrum can be expressed as X(f) = W(f)S(f) = S(f) (note: W(f) = 1 for the white random source), where S(f) represents the transfer (frequency) function of the impulse response s(n), which carries the spoken-language information in the frequency domain. The function S(f) can be obtained from the Fourier transform of the linear prediction coefficients {α_m} as represented by the following formula:
[Formula 5]
X(f) = S(f) = F[s(n)] = F[α_0·δ(n) + α_1·δ(n−1) + α_2·δ(n−2) + … + α_p·δ(n−p)]  (5)
[0057] where δ(n−p) is the unit impulse located at time n = p, and F[ ] is the Fourier transform. In linear predictive analysis (LPA) for EEG signals, as illustrated in
[Formula 6]
H(f) = σ/X(f) = σ/F[α_0·δ(n) + α_1·δ(n−1) + α_2·δ(n−2) + … + α_p·δ(n−p)]  (6)
[0058] where σ is an amplitude bias value. This method of performing accurate analysis through the synthesis process is called “Analysis-by-Synthesis (AbS)” and is also effective in EEG analysis. In the Fourier transform F[ ] of the above formula, zeros are appended to the p + 1 linear prediction coefficients (α_0 = 1.0), which is called zero padding and enables a Fourier transform of any number of points, for example, 128 points, 256 points, and so on. With the zero padding, the frequency resolution can be adjusted arbitrarily (64 points, 128 points, …) to obtain a spectral component A(q, f, n′).
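The linear predictive analysis with zero padding described above can be sketched as follows; the autocorrelation method, the Levinson-Durbin recursion, the model order, and the FFT length are illustrative assumptions:

```python
import numpy as np

def lpc_spectrum(x, order=10, n_fft=256):
    """Estimate linear prediction coefficients {alpha_m} by the
    Levinson-Durbin recursion on the autocorrelation, then compute the
    all-pole spectrum H(f) = sigma / |F[alpha]| with zero padding (6)."""
    x = np.asarray(x, dtype=float)
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0  # alpha_0 = 1.0
    e = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / e
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        e *= (1.0 - k * k)
    sigma = np.sqrt(max(e, 1e-12))
    # Zero padding: np.fft.rfft pads the p+1 coefficients with zeros up to
    # n_fft points, so the frequency resolution can be chosen freely.
    spec = np.abs(np.fft.rfft(a, n_fft))
    return sigma / np.maximum(spec, 1e-12)
```

The returned envelope is sharply peaked at the dominant sinusoidal components, which is the property exploited in the line-spectrum view of the EEG below.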
[0060] Through the LPA analysis, the spectrum of EEG during speech imagery is represented with a small number of spectral peaks. This suggests that in the brain (especially in Broca's area, where the linguistic information of speech imagery is produced), the linguistic representation is composed of short sine waves (tone bursts); in other words, it is represented by a distinctive line spectrum.
[0061] The linguistic feature extractor 5 extracts line spectral components as “linguistic representation” from spectra with a spread and outputs a phoneme-likelihood vector time sequence, which is a linguistic feature, through a phoneme-specific convolution operator.
[0062] The processing procedure will be described below with reference to the flowchart of
[0063] For the data within a certain time width (several frames before and after time n′) and frequency width (the adjacent frequencies f−1, f, f+1), the median of the whole window is obtained and used as a representative value. This process can absorb frequency fluctuations because it removes values that deviate from the median. The output of the nonlinear filter is generally smoothed further by a Gaussian window or the like.
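The median-based fluctuation absorption followed by Gaussian smoothing can be sketched as follows; the window sizes and the Gaussian sigma are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, median_filter

def absorb_fluctuation(X, time_half_width=2):
    """Nonlinear smoothing of a spectrogram X (frames x freq): take the
    median over a window of several frames before/after n' and the
    adjacent frequencies f-1, f, f+1, then smooth with a Gaussian."""
    med = median_filter(X, size=(2 * time_half_width + 1, 3), mode="nearest")
    return gaussian_filter(med, sigma=1.0)
```

Because the median ignores values far from the center of the window's distribution, an isolated frequency jitter in one frame does not survive, while broad spectral structure does.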
[0064] Next, the process of extracting a line spectrum (step S3) will be described. In this process, components derived from peaks that appear on the frequency axis are extracted as a line spectrum for each time frame (8 msec). Specifically, only the frequencies that satisfy the following conditions are defined as sinusoidal frequency components with the original amplitude, i.e., line spectrum components:
[0065] (i) A frequency at which the first derivative Δ_f = 0 and the spectrum takes a maximum value
[0066] (ii) At an inflection point, where the second derivative ΔΔ_f = 0:
[0067] if Δ_f > 0, a frequency at which ΔΔ_f changes from positive to negative
[0068] if Δ_f < 0, a frequency at which ΔΔ_f changes from negative to positive
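Conditions (i) and (ii) can be sketched with discrete derivatives; the use of numpy's gradient as the discrete Δ operator and the sign-change tests are illustrative choices:

```python
import numpy as np

def extract_line_spectrum(s):
    """Keep only the frequencies of one spectral frame s(f) that satisfy
    conditions (i)/(ii): local maxima of s, or inflection points where
    the second derivative changes sign. Kept bins retain the original
    amplitude; all other bins are set to zero."""
    s = np.asarray(s, dtype=float)
    d1 = np.gradient(s)   # Delta_f: first derivative along frequency
    d2 = np.gradient(d1)  # DeltaDelta_f: second derivative
    line = np.zeros_like(s)
    for f in range(1, len(s) - 1):
        # (i) maximum: the first derivative crosses zero from + to -.
        peak = d1[f - 1] > 0 and d1[f + 1] < 0
        # (ii) inflection: sign change of the second derivative,
        # oriented by the sign of the first derivative.
        shoulder = (d1[f] > 0 and d2[f - 1] > 0 and d2[f + 1] < 0) or \
                   (d1[f] < 0 and d2[f - 1] < 0 and d2[f + 1] > 0)
        if peak or shoulder:
            line[f] = s[f]
    return line
```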
[0071] The linguistic feature extractor 5 is aimed at extracting phoneme features in the end. Specifically, it is aimed at extracting phoneme components, which are the smallest unit of speech information, in the form of a phoneme feature vector from line spectral components of each electrode. Speech information in EEG signals has the so-called tensor structure that spans three axes: line spectrum (frequency information), electrodes (spatial information), and frames (temporal information).
[0072] The flowchart of
[0073] Next, principal component analysis is performed for each syllable (step S12), and the eigenvectors of the syllables are grouped by their constituent phonemes as follows: the phoneme /s/: {ψ^/sa/(m), ψ^/shi/(m), ψ^/su/(m), ψ^/se/(m), ψ^/so/(m)}; the phoneme /a/: {ψ^/a/(m), ψ^/ka/(m), ψ^/sa/(m), ψ^/ta/(m), ψ^/na/(m), …}. Then, an autocorrelation matrix is calculated from the eigenvectors of each phoneme group and integrated into the phoneme-specific autocorrelation matrices R^/s/, R^/a/, … (step S13). From each phoneme-specific autocorrelation matrix, the subspace (eigenvectors) φ^/s/(m), φ^/a/(m), … of the corresponding phoneme can be obtained.
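The construction of a phoneme-specific subspace from a phoneme-grouped autocorrelation matrix (steps S12 and S13) can be sketched as follows; the subspace dimension and the norm normalization of the pooled vectors are illustrative assumptions:

```python
import numpy as np

def phoneme_subspace(samples, n_dims=3):
    """Build phoneme-specific eigenvectors: pool the (norm-normalized)
    feature vectors of all syllables containing the phoneme, form the
    autocorrelation matrix R, and take its leading eigenvectors."""
    X = np.stack([v / np.linalg.norm(v) for v in samples])  # M x D
    R = X.T @ X / len(X)  # phoneme-specific autocorrelation matrix
    d, e = np.linalg.eigh(R)
    order = np.argsort(d)[::-1][:n_dims]
    return e[:, order]  # D x n_dims eigenvectors (the subspace)
```

Pooling, for example, the eigenvectors of /sa/, /shi/, /su/, /se/, /so/ into `samples` would yield the subspace for the phoneme /s/.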
[0074] After that, by using the eigenvectors obtained for each phoneme k as a “phoneme-specific convolution operator”, the phoneme similarity (likelihood) L(k) is calculated for an unknown line-spectral time sequence over the 9 electrodes (or the few components remaining after ICA) (step S4, step S14, step S15).
[Formula 7]
L(k) = Max_q ⟨X(q,f,n′), φ(f,n′)⟩²,  k = 1, 2, …, K  (7)
[0075] where Max means to take the maximum value for q electrodes or q ICA components, and < > represents an inner product operation. Note that X(q, f, n′) and φ(f, n′) are each normalized by a norm in advance.
[0076] A phoneme feature vector is defined as a vector composed of a series of K likelihoods L(k) of phoneme k; k=1, 2, . . . , K. In formula (7), the eigenvector φ(f, n′) of the phoneme is used to construct the phoneme-specific convolution operator, and a scalar value L(k) is obtained for each phoneme k as a likelihood. A vector of K scalar values is output from the linguistic feature extractor 5 as (phoneme-likelihood vector) time-sequence data as the time n′ of input X(f, n′) advances (step S5, step S16).
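The likelihood computation of Formula (7) can be sketched as follows; representing each electrode's line-spectral (f, n′) patch as one flattened vector, and using a single eigenvector per phoneme, are simplifying assumptions for illustration:

```python
import numpy as np

def phoneme_likelihoods(X, subspaces):
    """Formula (7): for an unknown multi-electrode frame X (Q electrodes x
    D features, each row a flattened line-spectral patch), compute
    L(k) = max_q <X(q), phi_k>^2 with both sides norm-normalized."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    L = []
    for phi in subspaces:  # one eigenvector phi_k per phoneme k
        phi_n = phi / np.linalg.norm(phi)
        L.append(np.max((Xn @ phi_n) ** 2))  # max over the q electrodes
    return np.array(L)  # the phoneme-likelihood vector, k = 1..K
```

Evaluating this at successive frames n′ yields the phoneme-likelihood vector time sequence output by the linguistic feature extractor 5.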
[0078] Since it is difficult to collect a large amount of speech imagery data at present, the problem is solved through a phoneme convolution operator in the example described herein. However, as brain databases related to speech imagery become more complete in the future, it will be possible to use a deep convolutional network (DCN), which has been widely used in fields such as image processing in recent years, instead of the phoneme-specific convolution operator.
[0079] The word/sentence recognizer 6 recognizes a word/sentence from the time-sequence data of the phoneme feature vector (more specifically, phoneme-likelihood vector time-sequence data). Several methods can be applied to word/sentence recognition, such as a method that uses a hidden Markov model (HMM) with triphones (sequences of three consecutive phonemes), which has been put to practical use in the field of speech recognition, and a method that uses a deep neural network (LSTM, etc.). In addition, linguistic information (probabilities of word sequences), an advantage of current speech recognition, can be used as well. Furthermore, as a shift along the time axis is a concern in speech imagery, “spotting”, which robust speech recognition systems perform to continuously search for words and sentences in the time direction, is also effective in improving quality in speech imagery.
[0080] The post-processing/output unit 7 receives a word (sequence) as the recognition result and produces the required display and audio output. The post-processing/output unit 7 may also have a function of giving the user feedback on whether the multi-electrode EEG sensor is in the correct position, based on the result of speech imagery recognition of a predetermined word/sentence; by moving the EEG sensor as guided through the screen of a terminal such as a smartphone or through voice instructions, the user can find the proper position.
[0081] The post-processing/output unit 7 displays a screen that helps the user adjust the optimal position of the electrodes while performing speech imagery.
[0082] As illustrated in
[0083] The speech imagery recognition device 1 illustrated in
[0084] While the speech imagery recognition device 1 has been described as including the EEG input unit 2, the preprocessor 3, the analysis processor 4, the linguistic feature extractor 5, the word/sentence recognizer 6, and the post-processing/output unit 7 as illustrated in
[0086] The processor 23 of the wearing fixture 11, the mobile terminal 12, and the server 13 each comprise, for example, a computer that includes a central processing unit (CPU), a memory, a read-only memory (ROM), a hard disk, and the like. The mobile terminal 12 can perform part or all of the processing of the speech imagery recognition device 1 illustrated in
[0087] A speech imagery recognition method of recognizing speech from EEG signals during speech imagery is performed by the wearing fixture 11, the mobile terminal 12, and/or the server 13, which can perform the method independently or in collaboration with one another (for example, by the mobile terminal 12 and the server 13 together).
[0088] A program that causes a computer to perform a speech imagery recognition process of recognizing speech from EEG signals during speech imagery may be downloaded or stored in the hard disk or the like. The program causes the computer to perform the analysis process of analyzing discrete signals of EEG signals received from a plurality of electrodes for each electrode and outputting a spectral time sequence, and the extraction process of extracting a phoneme-feature vector time sequence based on spectral components of each electrode.
[0091] As described above, according to the embodiments, line spectral components serving as a linguistic representation are extracted directly from EEG signals during speech imagery, and eigenvectors are obtained for each phoneme from the spectral components. An unknown input is then converted into a vector of a series of phoneme features (phoneme likelihoods) by using the eigenvectors as a convolution operator (see Formula 7).
[0092] In the following, additional notes will be provided with respect to the above embodiments.
(Additional Note 1)
[0093] A speech imagery recognition method for recognizing speech from electroencephalogram (EEG) signals during speech imagery, the method comprising:
[0094] an analysis process of analyzing discrete signals, which are obtained from EEG signals received from a plurality of electrodes, for each of the electrodes and outputting a spectral time sequence; and
[0095] an extraction process of outputting a phoneme-feature vector time sequence based on the spectral time sequence.
(Additional Note 2)
[0096] The speech imagery recognition method as set forth in additional note 1, further comprising converting the EEG signals received from the electrodes to the discrete signals.
(Additional Note 3)
[0097] The speech imagery recognition method as set forth in additional note 1 or 2, further comprising preprocessing of subtracting an average noise amplitude spectrum from a spectrum of a speech imagery signal obtained by converting the discrete signals to a frequency domain to remove noise from the EEG signals.
(Additional Note 4)
[0098] The speech imagery recognition method as set forth in additional note 3, wherein the preprocessing includes performing an independent component analysis that extracts a small number of independent information sources from each electrode signal after noise removal.
(Additional Note 5)
[0099] The speech imagery recognition method as set forth in any one of additional notes 1 to 4, further comprising recognizing speech based on the phoneme-feature vector time sequence.
(Additional Note 6)
[0100] The speech imagery recognition method as set forth in any one of additional notes 1 to 5, further comprising outputting the speech recognized.
(Additional Note 7)
[0101] The speech imagery recognition method as set forth in additional note 6, further comprising displaying a screen that helps a user adjust the optimal position of the electrodes while performing speech imagery.
(Additional Note 8)
[0102] The speech imagery recognition method as set forth in any one of additional notes 1 to 7, wherein the analysis process includes extracting the spectral time sequence using a linear predictive analysis.
(Additional Note 9)
[0103] The speech imagery recognition method as set forth in any one of additional notes 1 to 8, wherein the analysis process includes a process of absorbing a frequency fluctuation based on the discrete signals.
(Additional Note 10)
[0104] The speech imagery recognition method as set forth in any one of additional notes 1 to 9, wherein the analysis process includes extracting a frequency derived from a peak on a frequency axis as a line spectrum component for each time frame.
(Additional Note 11)
[0105] The speech imagery recognition method as set forth in any one of additional notes 1 to 10, wherein the extraction process includes outputting a phoneme-likelihood vector time sequence, which is a linguistic feature, through a predetermined convolution operator.
(Additional Note 12)
[0106] The speech imagery recognition method as set forth in any one of additional notes 1 to 11, implemented by either or both of a mobile terminal and a server.
(Additional Note 13)
[0107] The speech imagery recognition method as set forth in any one of additional notes 1 to 12, further comprising outputting signals from a plurality of electrodes of a wearing fixture, which are placed over Broca's area.
[0108] With a speech imagery recognition device, a wearing fixture, a method, and a program according to the embodiments of the present invention, line spectral components as a linguistic representation can be directly extracted from EEG signals during speech imagery and converted into phoneme features. Thus, a brain-computer interface (BCI) can be incorporated into the current framework of speech recognition.
LIST OF REFERENCE SIGNS
[0109] 1 Speech imagery recognition device
[0110] 2 EEG input unit
[0111] 3 Preprocessor
[0112] 4 Analysis processor
[0113] 5 Linguistic feature extractor
[0114] 6 Word/sentence recognizer
[0115] 7 Post-processing/output unit