SPEECH RECOGNITION

Abstract

An optical microphone arrangement comprises: an array of optical microphones (4) on a substrate (8), each of said optical microphones (4) providing a signal indicative of displacement of a respective membrane (24) as a result of an incoming audible sound; a first processor (12) arranged to receive said signals from said optical microphones (4) and to perform a first processing step on said signals to produce a first output; and a second processor (14) arranged to receive at least one of said signals or said first output; wherein at least said second processor (14) determines presence of at least one element of human speech from said audible sound.

Claims

1. An optical microphone arrangement comprising: an array of optical microphones on a substrate, each of said optical microphones providing a signal indicative of displacement of a respective membrane as a result of an incoming audible sound; a first processor arranged to receive said signals from said optical microphones and to perform a first processing step on said signals to produce a first output; and a second processor arranged to receive at least one of said signals or said first output; wherein at least said second processor determines presence of at least one element of human speech from said audible sound.

2. The optical microphone arrangement as claimed in claim 1 wherein the optical microphones are arranged at a mutual spacing of less than 5 mm.

3. The optical microphone arrangement as claimed in claim 1 wherein at least one of the first and second processors is arranged to perform a plurality of processing operations on said signals wherein said processing operations correspond to a plurality of assumptions that the signals emanate from a respective plurality of directions to give a plurality of candidate determinations; and thereafter to select one of said candidate assumptions based on a selection criterion.

4. The optical microphone arrangement as claimed in claim 1 wherein the first processor is arranged to determine presence of at least one element of human speech from said audible sound and, if said element is determined to be present, to issue a wake-up signal to cause said second processor to change from a relatively passive mode to a more active mode.

5. The optical microphone arrangement as claimed in claim 1 wherein the first processor and the optical microphone array are provided in a common device.

6. The optical microphone arrangement as claimed in claim 1 wherein the second processor is provided remotely of a or the device in which the optical microphone array is provided.

7. The optical microphone arrangement as claimed in claim 1 wherein the first processor is arranged to carry out initial signal processing to assist with speech recognition in the second processor.

8. The optical microphone arrangement as claimed in claim 1 wherein said first processor is arranged to carry out beamforming on said signals and said second processor is arranged to carry out speech recognition.

9. The optical microphone arrangement as claimed in claim 1 wherein the second processor is arranged to determine presence of at least one element of human speech from said audible sound using at least a base frequency and an overtone frequency which is an integer multiple of said base frequency.

10. The optical microphone arrangement as claimed in claim 9 arranged to use a plurality of overtones.

11. The optical microphone arrangement as claimed in claim 9 wherein the optical microphones have a mutual spacing less than half of a wavelength of said base frequency.

12. The optical microphone arrangement as claimed in claim 9 arranged to carry out beamforming at a frequency of the overtone(s).

13. The optical microphone arrangement as claimed in claim 12 wherein said beamforming is carried out by the first processor.

14. (canceled)

15. (canceled)

16. (canceled)

17. (canceled)

18. (canceled)

19. (canceled)

20. The optical microphone arrangement as claimed in claim 1 wherein the optical microphones comprise: a membrane; a light source arranged to direct light at said membrane such that at least a proportion of said light is reflected from the membrane; and an optical detector arranged to detect said reflected light.

21. The optical microphone arrangement as claimed in claim 20 wherein a diffractive element is provided between said light source and said membrane.

22. The optical microphone arrangement as claimed in claim 21 wherein the diffractive element comprises a diffractive pattern formed by a reflective material.

23. The optical microphone arrangement as claimed in claim 20 comprising a plurality of detectors for each microphone.

24. The optical microphone arrangement as claimed in claim 20 comprising a plurality of diffractive elements for each microphone.

25. A method of determining presence of at least one element of speech from an incoming audible sound, said audible sound having at least a portion thereof within a wavelength band, the method comprising receiving said audible sound using an array of optical microphones as claimed in claim 1, said microphones having a mutual spacing less than half of a longest wavelength of said wavelength band; and processing the signals from the microphones to detect said element of speech.

26. (canceled)

27. The method as claimed in claim 25 wherein the microphones have a mutual spacing less than half of a median wavelength of said wavelength band.

28. The method as claimed in claim 25 comprising processing the signals from the microphones so as to use preferentially a portion of said audible sound received from a given direction or range of directions.

29. The method as claimed in claim 28 comprising using sound from a plurality of directions and selecting one of said directions based on which gives a best result.

Description

[0042] Certain embodiments of the invention will now be described, by way of example only with reference to the accompanying drawings in which:

[0043] FIG. 1 shows an array of optical microphones in accordance with the invention;

[0044] FIG. 2 is a block system diagram of a speech recognition system embodying the invention;

[0045] FIG. 3 is a series of schematic illustrations of the basic operating principle of the optical microphones in the array of FIG. 1;

[0046] FIG. 4 is a graph showing light intensity at each of the two detectors against membrane displacement for the microphone of FIG. 3;

[0047] FIG. 5 is similar to FIG. 3 but with a variant of the design of optical microphone;

[0048] FIG. 6 is a graph of intensity vs displacement for the detectors of FIG. 5; and

[0049] FIG. 7 is a more detailed sectional view of a possible optical microphone layout;

[0050] FIG. 8 is a flow chart describing the candidate selection process which may be employed in accordance with the invention;

[0051] FIG. 9 is a graph showing the received frequency spectrum for a spoken "a" sound; and

[0052] FIG. 10 is a flowchart describing operation of a further embodiment of the invention which employs overtone detection.

[0053] FIG. 1 shows an array of optical microphones 2. The microphones 2 are provided on a common substrate 4 which could, for example, be a printed circuit board (PCB). The microphones may, purely by way of example, have a centre-to-centre spacing of approximately 2 mm. The array could, for example, have an extent of 2 cm across, or 2 cm by 2 cm in the case of a square array. The array might therefore comprise of the order of a hundred individual microphone elements.

[0054] FIG. 2 is a block system diagram for a mobile electronic device 8, such as a smartphone, smart watch or tablet computer, which includes the array of optical microphones 2. The signal outputs from the microphones 2 are connected to a data bus 10. The microphones 2 could feed raw data signals to the bus or some elementary processing could be carried out at each microphone 2, e.g. filtering or amplification. The bus 10 connects the microphones to a digital signal processor (DSP) 12. This could be a standard DSP or custom designed. The output from the DSP 12 is fed to an applications processor 14, also provided on the device 8. The applications processor 14 communicates with a remotely located processor 16 by means of a suitable data network. This could involve any known wireless data network such as WiFi, Zigbee, Bluetooth etc.

[0055] In use, the microphones 2 are active when the device 8 is in an active state (i.e. not in standby) and they pass signals to the DSP 12 via the bus 10. The DSP 12 carries out processing on the received signals as will now be described. First, assuming that the array comprises $P$ individual microphone elements, the signals $y(t)$ received by the microphones, denoted here as $y_1(t), y_2(t), \ldots, y_P(t)$, are recorded. Next, the frequency spectrum of one or more of those signals is estimated from a time-sample. A crude yet fast and effective way of doing this for the r'th signal from the array is to compute

[00001] $\hat{P}_{r}(\omega) = \frac{1}{N} \left| \sum_{k=0}^{N-1} y_{r}(t-k)\, e^{-ik\omega} \right|^{2}$

[0056] For a set of frequencies $\{\omega\}$ of interest. This power spectrum estimate can be computed efficiently via a Fast Fourier Transform, noting that the term inside the brackets $|\cdot|$ is simply a Discrete Fourier Transform (DFT) of the incoming signal $y_r(t)$.
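The power spectrum estimate described above can be sketched numerically as follows. This is a minimal illustration only: the function names, the 16 kHz sample rate and the 1 kHz test tone are assumptions made for the example, and the frame is indexed so that element k stands for the sample k steps back in time (for a real signal the ordering affects only the phase, not the power).

```python
import numpy as np

def periodogram(y_r, omegas):
    """Direct evaluation of |sum_k y_r[k] e^{-i k omega}|^2 / N for each omega (rad/sample)."""
    y_r = np.asarray(y_r, dtype=float)
    n = len(y_r)
    k = np.arange(n)
    basis = np.exp(-1j * np.outer(omegas, k))   # one row of exponentials per frequency
    return np.abs(basis @ y_r) ** 2 / n

def periodogram_fft(y_r):
    """Same estimate on the FFT grid omega_m = 2*pi*m/N, computed with an FFT."""
    y_r = np.asarray(y_r, dtype=float)
    return np.abs(np.fft.fft(y_r)) ** 2 / len(y_r)

# Hypothetical test frame: a 1 kHz tone sampled at 16 kHz (16 full cycles in 256 samples).
fs = 16_000
n_samples = 256
t = np.arange(n_samples) / fs
frame = np.sin(2 * np.pi * 1_000 * t)

grid = 2 * np.pi * np.arange(n_samples) / n_samples
direct = periodogram(frame, grid)
fast = periodogram_fft(frame)

# Frequency of the strongest bin in the non-negative half of the spectrum.
peak_hz = np.argmax(fast[: n_samples // 2]) * fs / n_samples
```

The direct sum and the FFT give the same values on the FFT frequency grid; the FFT simply evaluates all N frequencies at once, which is why it suits a low-power first processing stage.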

[0057] Third, based on the power spectrum estimates $\hat{P}_r(\omega)$ (one of them or a plurality of them could be computed), a decision can be made whether to do something else. Such a decision could involve starting a further process in the first processor 12 to carry out better signal extraction, using for example beam forming or other separation techniques. Alternatively the decision could be to wake up the second processor 14.

[0058] In a first simplistic example, the processor 12 uses a crude detection mechanism to detect a key word, say "hello". This mechanism could be such that it considers the power spectrum of an uttered sentence, to see if it has a match with the power spectrum of the word "hello". Such a matching operation can be done with very low power requirements, via, for instance, a hardware-enabled Discrete Fourier Transform (DFT) to derive an estimate of the power spectrum as explained above, and also in more detail in e.g. "Statistical Digital Signal Processing and Modeling" by M. H. Hayes. If there is a match, as could be detected using any kind of classifier such as linear or discriminant analysis, the second processor 14 could be woken up to listen in on both a buffered signal (such as the "hello" candidate) as well as follow-up utterances, such as "open file" or "turn off computer".

[0059] The first detection step may, as a consequence of the simpler implementation, be rather crude. For instance, the word "hotel" could have a similar DFT power spectrum to "hello", and lead to a wake-up of the second processor 14 as well. However, at this stage, the more advanced processing power of the second processor 14 means that it can disambiguate the word "hotel" from the word "hello", and hence make a decision not to follow up with more processing and instead return to its sleep state.
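The crude first-stage match described above might be sketched as below. The cosine-similarity score, the 0.8 threshold and the synthetic tone mixtures standing in for spoken words are all assumptions made for illustration; the text leaves the classifier open, and a real key word would need a far richer template.

```python
import numpy as np

def power_spectrum(frame):
    """Power spectrum estimate of a real-valued frame via the FFT."""
    frame = np.asarray(frame, dtype=float)
    return np.abs(np.fft.rfft(frame)) ** 2 / len(frame)

def spectral_similarity(spectrum, template):
    """Cosine similarity between two power spectra (1.0 means identical shape)."""
    denom = np.linalg.norm(spectrum) * np.linalg.norm(template)
    return float(spectrum @ template / denom) if denom > 0 else 0.0

def first_stage_wakes(frame, template, threshold=0.8):
    """True if the crude match is good enough to wake the second processor (threshold is hypothetical)."""
    return spectral_similarity(power_spectrum(frame), template) >= threshold

# Synthetic stand-ins for utterances: mixtures of tones at 8 kHz sampling.
fs = 8_000
n = np.arange(400)

def tone_mix(freqs, amp=1.0):
    return amp * sum(np.sin(2 * np.pi * f * n / fs) for f in freqs)

template = power_spectrum(tone_mix([300, 600]))   # stored key-word spectrum
quiet_repeat = tone_mix([300, 600], amp=0.5)      # same spectral shape, lower level
other_sound = tone_mix([2000])                    # different spectral shape
```

Because the score depends only on the shape of the spectrum, a quieter repetition still triggers the wake-up, while a sound with a different spectrum does not. A similar-sounding word would pass this stage too, which is why the second processor makes the final decision.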

[0060] The optical microphones 2 are advantageous over more conventional MEMS microphones. The lower self-noise means that the power spectrum estimates will be more accurate and able to pick up trigger words at longer distances than with conventional MEMS microphones. Moreover two or more optical microphones from the array can be used to accurately detect the direction of arrival of the sound using any known direction of arrival (DOA) technique, such as simplistic beam forming, time-delayed signal subtraction or the MUSIC algorithm (see e.g. "Spectral Analysis of Signals", by P. Stoica & Randolph Moses). For example this could be used to estimate whether the sound is likely to have come from someone speaking in front of the device or from a source that is, say, to the side of the device. The low noise characteristics of the optical MEMS microphones mean that such useful detection angles can be computed even with a very small baseline array, making it particularly useful for small form factor devices such as smart watches, bracelets or glasses.
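As a rough illustration of direction-of-arrival estimation with a very small baseline, the sketch below estimates the arrival angle of a single tone from the phase difference between two microphones. It is a simple phase-comparison method chosen for brevity, not MUSIC or any specific technique from the text; the 2 cm baseline, 1 kHz tone, 48 kHz sample rate, noise-free signals and speed of sound are all assumed values.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed

def doa_from_phase(x1, x2, fs, f0, spacing):
    """Angle from broadside (degrees), positive when the sound reaches microphone 1 first."""
    n = np.arange(len(x1))
    probe = np.exp(-2j * np.pi * f0 * n / fs)          # single-bin DFT at f0
    phase_lag = np.angle(np.sum(x1 * probe) * np.conj(np.sum(x2 * probe)))
    sin_theta = phase_lag * SPEED_OF_SOUND / (2 * np.pi * f0 * spacing)
    return np.degrees(np.arcsin(np.clip(sin_theta, -1.0, 1.0)))

# Simulated plane wave arriving 30 degrees off broadside.
fs, f0, spacing = 48_000, 1_000.0, 0.02
true_angle = 30.0
delay = spacing * np.sin(np.radians(true_angle)) / SPEED_OF_SOUND  # seconds
t = np.arange(480) / fs                                            # 10 full cycles
mic1 = np.cos(2 * np.pi * f0 * t)
mic2 = np.cos(2 * np.pi * f0 * (t - delay))

estimate = doa_from_phase(mic1, mic2, fs, f0, spacing)
```

The inter-microphone delay here is about 29 microseconds, only slightly more than one sample period, and the corresponding phase difference is under 0.2 rad. With noisy microphones such a small phase difference would be hard to measure, which is the point the paragraph makes about low self-noise.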

[0061] In a second and more advanced example, the first processor 12 is used to detect a key word such as "hello", but this may happen after beam forming has been used. The processor 12 may react to certain characteristics of the incoming signals. This could be a distribution of signals looking like speech, such as a sub- or super-Gaussian distribution, as explained in e.g. "Independent Component Analysis for Mixed Sub-Gaussian and Super-Gaussian Sources", by Te-Won Lee and Terrence J. Sejnowski. Then, the processor 12 decides to turn on beam forming to try to locate the source. It can work on both stored signals as well as new incoming signals. If the output of a beam former produces a word that could be recognized as a potential triggering word, the second processor 14 is woken up. Again, this second processor can, using its greater processing power, matching methods and word dictionary size, detect that the word "hello" was not actually spoken (but perhaps instead "halo"), and go back to its sleep state.

[0062] In this second example, the usefulness of the array of optical microphones 2 is twofold. First, the original signal distribution recovered by the microphones is more accurate than with conventional microphones due to the previously-mentioned low-noise characteristics. Second, the use of the combination of microphone elements 2, by high-resolution array beam forming, enables detection of lower level sounds (such as whispers or far away sound), as well as better (i.e. less noise-prone) candidates for word detection both at the first 12 and the second 14 processor. Without the optical microphones, the array would have had to be built much bigger to exhibit the same level of sensitivity, i.e. by using a bigger baseline.

[0063] In both of the above cases, the second processor 14 can use more powerful means of signal extraction than the first one. For instance, the first processor 12 may use a crude beam-forming approach, such as delay-and-sum (DAS) beam forming. It could also use more sophisticated approaches such as adaptive (Capon) beam forming. However generally, the second processor 14 will use more powerful means of spatial signal extraction than the first 12.
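A narrowband delay-and-sum beam former of the kind mentioned above can be sketched in a few lines. The uniform linear geometry, the 2 mm spacing, the ten elements, the 4 kHz analysis frequency and the function names are assumptions for illustration only.

```python
import numpy as np

def steering_vector(theta, n_mics, spacing, freq, c=343.0):
    """Narrowband steering vector of a uniform linear array; theta is measured from the array axis."""
    p = np.arange(n_mics)
    return np.exp(-2j * np.pi * freq * spacing * p * np.cos(theta) / c)

def das_weights(theta0, n_mics, spacing, freq):
    """Delay-and-sum weights: phase-align the look direction, then average."""
    return steering_vector(theta0, n_mics, spacing, freq) / n_mics

def beamform(w, y):
    """Aggregate output z = w^H y for one frequency bin."""
    return np.vdot(w, y)

n_mics, spacing, freq = 10, 0.002, 4_000.0
w = das_weights(np.pi / 2, n_mics, spacing, freq)                # look straight ahead
gain_forward = abs(beamform(w, steering_vector(np.pi / 2, n_mics, spacing, freq)))
gain_endfire = abs(beamform(w, steering_vector(0.0, n_mics, spacing, freq)))
```

In this configuration the forward response is unity by construction, while a source along the array axis is attenuated only slightly (to roughly 0.9), because the whole array spans well under a quarter of a wavelength at 4 kHz. That weak selectivity is one reason a second stage might use an adaptive method instead.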

[0064] For instance, if the first processor 12 used DAS beam forming, then the second processor 14 might use adaptive beam forming to increase the effective resolution/performance over the first. Or, the second processor 14 may use a time-domain de-convolution approach for source separation, which generally requires inversion of a Block-Toeplitz matrix structure, as explained in e.g. "Blind Speech Separation in Time-Domain Using Block-Toeplitz Structure of Reconstructed Signal Matrices", by Zbyněk Koldovský, Jiří Málek and Petr Tichavský. This is typically much more CPU-intensive than using frequency domain based methods, but can also yield much higher accuracy and resolution in its signal recovery efforts. The second processor 14 may also use more advanced word recognition methods than the first processor. For instance, while the first processor 12 may use the matching of a power spectrum as a first approximation, the second processor may use techniques such as Hidden Markov Models (HMM), Artificial Neural Networks (ANN) or approaches incorporating language models (LMs) to boost its performance. It may also have a bigger and/or more cleverly searchable set of words which it can use for recognition due to its increased memory.

[0065] The processing necessary to carry out speech recognition may be conducted entirely on the device 8. However advanced processing could be carried out by the remote processor 16 instead of or in addition to the local second processor 14.

[0066] FIG. 3 shows schematically the main functional parts of an exemplary optical microphone manufactured using standard micro-electromechanical systems (MEMS) technology. It comprises a substrate 18 on which is mounted an upstanding housing 20. The housing has an aperture 22 in its upper face across which spans a flexible silicon nitride membrane 24. Inside the housing, mounted on the substrate 18, are a light source in the form of a laser, e.g. a vertical cavity surface-emitting laser (VCSEL) 26, and two photo-detectors 28, 30. Between the laser diode 26 and the membrane 24 is a diffractive element 32. This could, for example, be implemented by reflective metal strips deposited in a diffractive pattern on top of a transparent plate such as a bonded glass chip (see FIG. 7) or provided by elements suspended at appropriate positions inside the housing 20.

[0067] The left hand diagram of FIG. 3 illustrates the membrane having been flexed upwardly, the centre diagram illustrates it being in a neutral position and the right hand diagram illustrates it being flexed downwardly. These represent different instantaneous positions of the membrane 24 as it is driven by an incoming sound wave. As will be appreciated from FIG. 3, the position of the membrane 24 determines the distance between it and the diffractive element 32.

[0068] In use some of the light from the laser 26 passes through the pattern of the diffractive element 32 and some is reflected by the lines making up the pattern. The light passing through reflects from the rear surface of the membrane 24 and back through the diffractive element 32. The relative phase of the light that has travelled these two paths determines the fraction of light which is directed into the different diffraction orders of the diffractive element (each diffraction order being directed in a fixed direction). In presently preferred embodiments the diffractive element 32 is in the form of a diffractive Fresnel lens. Thus the lines of the diffractive pattern 32 are sized and spaced according to the standard Fresnel formula which gives a central focal area corresponding to the zeroth order. The first photo-detector 28 is positioned to receive the light in the zeroth order, while the second photo-detector 30 is positioned to receive light from the focused first diffraction order of the diffractive Fresnel lens. When the spacing between the diffractive element 32 and the membrane 24 is half of the wavelength of the laser light from the diode 26 or an integer multiple thereof, virtually all light reflected by the diffractive element is directed into the zeroth diffraction order. At this position the second detector 30 receives very little light as it is located at the position of the diffractive element's first order (which is focussed into a point for a diffractive Fresnel lens).

[0069] As will be appreciated, the optical path length is of course dependent on the distance between the diffractive element 32 and the membrane 24. The intensity of light recorded by the first photo-detector 28 measuring the zeroth diffraction order and the second photo-detector 30 (whose positions are fixed), varies as the above-mentioned spacing varies but in an out-of-phase manner. This is illustrated by the graph in FIG. 4. One line 34 corresponds to the intensity recorded at the first photo-detector 28 and the other line 36 corresponds to the intensity recorded at the second photo-detector 30. As mentioned above, when the spacing is equal to half of the wavelength (or an integer multiple thereof) the intensity 34 at the first detector 28 is at a maximum and drops off to zero as the spacing changes to a quarter wavelength or odd multiples thereof. The intensity 36 recorded at the second detector 30 is a quarter wavelength out of phase with this and so the second line 36 is at a maximum when the first line 34 is at a minimum and vice versa.

[0070] The sensitivity of the microphone is determined by the change in output signal for a given change in displacement of the membrane. It can be seen from FIG. 4 therefore that the maximum sensitivity occurs in the zones 38 in which the lines 34, 36 have maximum gradient. This is also the zone in which the gradient is approximately linear.

[0071] Although it may be possible to carry out the necessary measurement with only one photo-detector, the two detectors 28, 30, measuring the zeroth and first diffraction orders respectively, may be advantageous as taking the difference between those two signals could provide a measurement that is corrected for fluctuations in laser intensity.
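The benefit of using both detectors can be illustrated with an idealized model. The sketch below assumes lossless, complementary intensities (cos² for the zeroth order, sin² for the first order, matching the maxima and minima described above) and assumes that the difference is normalized by the sum; both the model and the normalization are simplifying assumptions for illustration, not a description of the actual read-out electronics.

```python
import numpy as np

def detector_intensities(gap, wavelength, laser_power=1.0):
    """Idealized zeroth- and first-order intensities for a given element-to-membrane gap."""
    phase = 2 * np.pi * gap / wavelength
    i0 = laser_power * np.cos(phase) ** 2   # maximum at gap = n * wavelength / 2
    i1 = laser_power * np.sin(phase) ** 2   # maximum at odd multiples of wavelength / 4
    return i0, i1

def normalized_difference(i0, i1):
    """Difference of the two signals divided by their sum; the laser power cancels."""
    return (i0 - i1) / (i0 + i1)

wavelength = 850e-9          # hypothetical laser wavelength in metres
gap = 0.3 * wavelength
steady = normalized_difference(*detector_intensities(gap, wavelength, laser_power=1.0))
dimmed = normalized_difference(*detector_intensities(gap, wavelength, laser_power=0.7))
```

In this model the normalized difference equals cos(4π·gap/wavelength) regardless of the laser power, so a 30% drop in laser intensity leaves the reading unchanged, whereas either detector signal on its own would drop by 30%.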

[0072] A variant of the arrangement described above is shown in FIGS. 5 and 6. In this arrangement there are two separate diffractive elements 40, 42, with a relative offset in distance relative to the microphone membrane 24 (in this case an offset of one eighth of the wavelength of the laser). With one photo-detector 44 positioned in alignment with a particular diffraction order of the first diffractive element 40 and a second photo-detector 46 aligned with an order of the second diffractive element 42, the lines 48, 50 respectively of FIG. 6 are achieved. From these it can be seen that the signals detected by the two detectors 44, 46 are one eighth of a wavelength out of phase with one another. The maximum sensitivity zones 52, 54 of the two respective diffractive elements are therefore contiguous, and so by using the signals from both detectors 44, 46 the dynamic range of the microphone can be extended.

[0073] It is possible of course to use three or more diffractive elements with predetermined offsets relative to the membrane, in order to produce three or more signals with predetermined phase offsets. Those signals can then be recombined in order to provide a measurement of the membrane displacement with high linearity, on a large dynamic range and compensated for fluctuations in laser intensity.
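One way to picture the recombination of phase-offset signals is as a quadrature read-out. The sketch below assumes two idealized, unit-normalized cos² intensity signals whose gaps differ by one eighth of a wavelength, which makes them a quarter of an intensity period apart, and recovers the displacement with an arctangent. The signal model, the normalization and the 850 nm wavelength are assumptions; compensating for laser intensity as well would need a third signal, as the paragraph above notes.

```python
import numpy as np

def offset_intensities(displacement, wavelength):
    """Idealized, unit-normalized intensities for two elements offset by wavelength / 8."""
    i_a = np.cos(2 * np.pi * displacement / wavelength) ** 2
    i_b = np.cos(2 * np.pi * (displacement + wavelength / 8) / wavelength) ** 2
    return i_a, i_b

def displacement_from_quadrature(i_a, i_b, wavelength):
    """Recover displacement, valid within +/- wavelength / 4 of the reference position."""
    u = 2.0 * i_a - 1.0            # cos(phi), with phi = 4 * pi * displacement / wavelength
    v = 2.0 * i_b - 1.0            # cos(phi + pi / 2) = -sin(phi)
    phi = np.arctan2(-v, u)
    return phi * wavelength / (4 * np.pi)

wavelength = 850e-9                                        # hypothetical value in metres
true_displacement = np.linspace(-0.2, 0.2, 41) * wavelength
recovered = displacement_from_quadrature(
    *offset_intensities(true_displacement, wavelength), wavelength
)
```

A single detector is approximately linear only near its steepest slope, whereas the recombined read-out is exactly linear in this idealized model over a range of half a wavelength, which illustrates the extended dynamic range.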

[0074] FIG. 7 shows an exemplary optical microphone in a little more detail. This comprises a transparent glass substrate 56 which includes a central portion 58 on which is provided the diffractive element 60 formed as a number of reflective lines. A silicon layer 62 is provided on top of the glass substrate 56 and the silicon nitride membrane 64 is provided between them. The glass substrate 56 has been structured in order to allow air to be displaced from under the membrane 64 when the latter moves under the action of incident sound waves.

[0075] As previously mentioned the oversampled array of optical microphones described herein can be used to analyse received sound on a number of different assumptions. As will be described below these could correspond to differing directions of emanation or environmental conditions. These candidates can then each be used to attempt speech recognition with the most successful one being adopted.

[0076] First the use of an array of microphones to focus on sound from a particular direction will be explained. This is known as beam forming and can be considered to be equivalent to the problem of maximizing the energy received from a particular direction (taken in this example to be the forward direction, normal to the array) whilst minimizing energy from other directions.

[0077] Minimizing the narrowband energy coming into an antenna array (in a half-plane) through a beam former, subject to the constraint of fixing energy (and avoiding distortions) in the forward-looking direction, amounts to:

[00002] $\min_{w} \int_{0}^{\pi} \left| w^{H} a(\theta) \right|^{2} \, d\theta \quad \text{subject to} \quad w^{H} \mathbf{1} = \text{constant}$ Equation (1)

[0078] Where $a(\theta)$ is a steering vector at the angle $\theta$, and $w \in \mathbb{C}^{P}$ is the antenna weight vector, which is complex and hence can encompass both time-delays and weighting (the present analysis is carried out in the frequency domain). $P$ is the number of array elements. The purpose of the weights is to work on the incoming signals to get an aggregate signal. Let $y$ denote the Fourier-transformed signal vector coming from the array. Then the aggregate signal, or the output from the beam former, becomes $z = w^{H} y$.

[0079] The objective is to design the weights vector w such that the aggregate signal z has certain characteristics. In array processing, these are typically related to spatial behavior, i.e. how much the aggregate signal z is influenced by signals coming from some direction versus other directions. This will now be explained in more detail. Equation (1) can be discretized as:

[00003] $\min_{w} \sum_{i=1}^{N} \left| w^{H} a(\theta_{i}) \right|^{2} \quad \text{subject to} \quad w^{H} \mathbf{1} = \text{constant}$ Equation (2)

[0080] For some discretization of angles $\theta_1, \theta_2, \ldots, \theta_N$. The sum can be rewritten as:

[00004] $\sum_{i=1}^{N} \left| w^{H} a(\theta_{i}) \right|^{2} = \sum_{i=1}^{N} w^{H} a(\theta_{i})\, a(\theta_{i})^{H} w = w^{H} \left\{ \sum_{i=1}^{N} a(\theta_{i})\, a(\theta_{i})^{H} \right\} w = w^{H} C w \quad \text{where} \quad C = \sum_{i=1}^{N} a(\theta_{i})\, a(\theta_{i})^{H}$ Equation (3)

[0081] So the discretized optimization criterion becomes:

$\min_{w} w^{H} C w \quad \text{subject to} \quad w^{H} \mathbf{1} = \text{constant}$ Equation (4)

[0082] This is a modified or constrained eigenvector problem that could be solved using a number of well-known techniques. One such variant will be described. It should be noted that, in general, the vector $\mathbf{1}$ is equal to one of the steering vectors, namely the one where $\theta = \pi/2$. The problem could therefore be reformulated as one having a least squares focus, which is to try to fit the beam pattern so that there is full focus forwards and as low energy as possible in all other directions. This could be accomplished as:

[00005] $\min_{w} \sum_{i=1,\, i \neq k}^{N} \alpha_{i} \left\| w^{H} a(\theta_{i}) - 0 \right\|_{2}^{2} + \alpha_{k} \left\| w^{H} a(\theta_{k}) - 1 \right\|_{2}^{2}$ Equation (5)

[0083] Where $k$ is the index of the forward looking steering vector, i.e. $a(\theta_k) = \mathbf{1}$, and the $\alpha_i$ are non-negative weights. This expression states that the weights $w$ are chosen to force every angular response towards zero, except the forward looking one, which is forced towards unity. It is generally presumed that there is no preference as to which directions (other than the forward looking one) are more important to force down, so it can be assumed that $\alpha_i = \alpha_j = c$ for $i, j \neq k$. Note that this can now be rewritten as:

[00006] $\min_{w} c\, w^{H} \tilde{C} w + \alpha_{k} \left\| w^{H} \mathbf{1} - 1 \right\|_{2}^{2} \;\Leftrightarrow\; \min_{w} w^{H} \tilde{C} w + \frac{\alpha_{k}}{c} \left\| w^{H} \mathbf{1} - 1 \right\|_{2}^{2}$ Equation (6)

[0084] Where $\tilde{C}$ is the matrix generated the same way as $C$, but with the k'th steering vector kept out, i.e.:

[00007] $\tilde{C} = \sum_{i=1,\, i \neq k}^{N} a(\theta_{i})\, a(\theta_{i})^{H}$ Equation (7)

[0085] It should be noted that for the original optimization problem in Equation (4), it makes no difference whether one tries to minimize $w^{H} \tilde{C} w$ or $w^{H} C w$; the relationship between the forward-looking vector $\mathbf{1}$ and the weights $w$ (i.e. the constraint) makes sure of this.
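The reason is that the two matrices differ only by the forward-looking term, $w^{H} C w = w^{H} \tilde{C} w + |w^{H} \mathbf{1}|^{2}$, and the constraint fixes that last term. The sketch below checks this numerically using a standard Lagrange-multiplier closed form, $w = C^{-1}\mathbf{1} / (\mathbf{1}^{H} C^{-1} \mathbf{1})$, for the constant equal to 1. The closed form is one well-known way of solving such a problem and is not necessarily the approach intended here; the half-wavelength spacing, four elements and 19-angle grid are assumptions chosen so that the matrices are well conditioned.

```python
import numpy as np

def steering_matrix(thetas, n_mics, spacing_in_wavelengths=0.5):
    """Columns are steering vectors a(theta_i) of a uniform linear array."""
    p = np.arange(n_mics)[:, None]
    return np.exp(-2j * np.pi * spacing_in_wavelengths * p * np.cos(thetas)[None, :])

def constrained_weights(C):
    """Minimize w^H C w subject to w^H 1 = 1 (C assumed Hermitian and invertible)."""
    ones = np.ones(C.shape[0], dtype=complex)
    x = np.linalg.solve(C, ones)
    return x / np.vdot(ones, x)

n_mics = 4
thetas = np.linspace(0.0, np.pi, 19)          # grid includes theta = pi/2 at index 9
k = 9
A = steering_matrix(thetas, n_mics)
C = A @ A.conj().T                            # all directions, as in Equation (3)
A_without_k = np.delete(A, k, axis=1)
C_tilde = A_without_k @ A_without_k.conj().T  # forward direction left out, as in Equation (7)

w_from_C = constrained_weights(C)
w_from_C_tilde = constrained_weights(C_tilde)
uniform = np.ones(n_mics, dtype=complex) / n_mics   # also satisfies the constraint

def energy(w, M):
    return np.vdot(w, M @ w).real
```

Both matrices lead to the same weight vector, and that vector admits less energy from the sampled directions than plain uniform weighting does. For a strongly oversampled array the matrix becomes ill conditioned and a direct solve like this would need some form of regularization.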

[0086] It will be noted also that the right hand side of Equation (6) is the Lagrange multiplier expression for solving the modified eigenvalue problem (when the constant=1). So Equations (4) and (6) are equivalent, and so also Equations (4), (5) and (6) are equivalent under the foregoing assumptions. So, starting to work on Equation (5), it may be seen that it can be rewritten as:

[00008] $\min_{w} \sum_{i=1}^{N} \alpha_{i} \left\| w^{H} a(\theta_{i}) - e_{i} \right\|_{2}^{2}$ Equation (8)

[0087] Where $e_i = 0$ for all $i$ but $k$, where $e_k = 1$.

[0088] By defining $a_i = a(\theta_i)$ there is now:

[00009] $\sum_{i=1}^{N} \alpha_{i} \left\| w^{H} a_{i} - e_{i} \right\|_{2}^{2} = \sum_{i=1}^{N} \left\| w^{H} \left( \sqrt{\alpha_{i}}\, a_{i} \right) - \sqrt{\alpha_{i}}\, e_{i} \right\|_{2}^{2} = \sum_{i=1}^{N} \left\| w^{H} \tilde{a}_{i} - \tilde{e}_{i} \right\|_{2}^{2}$ Equation (9)

[0089] This simply implies seeking the least squares solution to the problem:

$\min_{w} \left\| w^{H} \tilde{A} - \tilde{e} \right\|_{F}^{2}$ Equation (10)

[0090] where $\tilde{A} = [\sqrt{\alpha_1}\, a_1, \sqrt{\alpha_2}\, a_2, \ldots, \sqrt{\alpha_N}\, a_N]$ and $\tilde{e} = [\sqrt{\alpha_1}\, e_1, \sqrt{\alpha_2}\, e_2, \ldots, \sqrt{\alpha_N}\, e_N] = [0, 0, \ldots, \sqrt{\alpha_k}, 0, 0, \ldots]$.

[0091] This is effectively saying that it is necessary to try to find a complex vector ($w$) whose elements combine the rows of the matrix $\tilde{A}$ so that they become a scaled unit row vector, where only the k'th element is different from zero. But more generally, in trying to separate the different spatial directions, one could choose multiple vectors $\{w_i\}$, each focusing in on a different spatial direction. Having solved this problem, it will be the case that Equation (10) above will also have been solved.

[0092] This would be to try to find a matrix W such that:

$W^{H} \tilde{A} = \sqrt{\alpha_{k}} \cdot I$ where $W = [w_1, w_2, \ldots, w_N]$ Equation (11)

[0093] However this simply amounts to saying that the matrix $\tilde{A}$ has a (pseudo)-inverse. Moreover, it should be noted that if $\tilde{A}$ has a pseudo-inverse, then $A$ also has a pseudo-inverse. This follows since the columns of the matrix $\tilde{A}$ are simply rescaled versions of the columns of $A$. It is therefore possible, quite generally, to focus on whether or not $A$ has a pseudo-inverse, and under which circumstances. In array processing, the steering vectors of a uniform, linear array (ULA) become sampled, complex sinusoids. This means that the column vectors of $A$ are simply complex sinusoids. If more and more elements are added within the baseline of the array (i.e. the array is oversampled), the sampling quality (or resolution) of those sinusoids is gradually improved.
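The role of the pseudo-inverse can be illustrated with a small numerical sketch. It uses the unweighted steering matrix $A$ (the column rescaling does not affect the argument, as noted above), with an assumed half-wavelength uniform linear array of eight elements and four arbitrary look directions; these values are hypothetical.

```python
import numpy as np

def steering_matrix(thetas, n_mics, spacing_in_wavelengths=0.5):
    """P x N matrix whose columns are sampled complex sinusoids a(theta_i)."""
    p = np.arange(n_mics)[:, None]
    return np.exp(-2j * np.pi * spacing_in_wavelengths * p * np.cos(thetas)[None, :])

thetas = np.radians([30.0, 70.0, 110.0, 150.0])   # N = 4 directions to separate
A = steering_matrix(thetas, n_mics=8)              # P = 8 elements

# Least-squares solution of W^H A = I: the rows of pinv(A) are the vectors w_i^H.
W_H = np.linalg.pinv(A)
response = W_H @ A      # entry (i, j): response of beam former i to direction j
```

Here `response` is the 4 x 4 identity to numerical precision: each weight vector passes its own direction with unit gain and nulls the other three, which is what the left-hand side of Equation (11) asks for up to a scale factor.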

[0094] When, hypothetically, the number of rows tends to infinity, then the columns of the matrix A will be samplings of continuous complex sinusoids. Any (non-continuous) level of resolution can be seen as a quantization of the continuous complex sinusoids.

[0095] Let $\omega_1, \omega_2, \ldots, \omega_Q$ be a set of frequencies, with $\omega_i \neq \omega_j$ for all $i \neq j$.

[0096] Let R be the support length. Let

[00010] $f_{k}(t) = e^{i t \omega_{k}}$

for $t \in [0, R]$, and $f_k(t) = 0$ elsewhere. Then the functions $f_k(t)$ are linearly independent.

[0097] What this implies is that in the theoretically idealized case where there are an infinite number of array antenna elements, infinitely closely spaced, the sinusoids corresponding to the spatial directions (i.e. the steering vectors) would all be unique and identifiable, and no one sinusoid could be constructed as a linear combination of others. This is what yields the invertibility of the (row-continuous) matrix $A$. However, in practice, there is a finite number of elements, which results in a discretization of this perfect situation. While the continuous sinusoids are all unique and linearly independent of one another, there is no guarantee that a discretization of the same sinusoids obeys the same properties. In fact, if the number of antenna elements is lower than the number of angles which the device is trying to separate spatially, it is guaranteed that the sinusoids are not independent of one another. It follows, however, that as the number of rows in the matrix $A$ increases (i.e. the number of antenna elements in the array increases), the matrix $A$ becomes more and more invertible because it approaches closer and closer to the perfect (continuous) situation. As more antenna elements are inserted, the dimensions of the matrix $C$ increase, as does the number of rows in the matrix $A$, from which the matrix $C$ is derived. As explained above, the more invertible the matrix $A$, the easier it becomes to satisfy the conditions in Equation (2) above, i.e. $\min_{w} w^{H} C w$ subject to $w^{H} \mathbf{1} = \text{constant}$.
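The dependence on the number of elements can be checked numerically. The sketch below keeps the baseline fixed and varies only the number of elements placed within it; the two-wavelength baseline, the four directions and the element counts are hypothetical values chosen for illustration.

```python
import numpy as np

def steering_matrix(thetas, n_mics, aperture_in_wavelengths=2.0):
    """Steering vectors for n_mics elements spread evenly over a fixed baseline."""
    positions = np.linspace(0.0, aperture_in_wavelengths, n_mics)
    return np.exp(-2j * np.pi * positions[:, None] * np.cos(thetas)[None, :])

thetas = np.radians([30.0, 70.0, 110.0, 150.0])   # N = 4 directions to separate

# Rank of A (number of linearly independent sampled sinusoids) per element count.
ranks = {
    n_mics: int(np.linalg.matrix_rank(steering_matrix(thetas, n_mics)))
    for n_mics in (2, 3, 8, 16)
}
```

With two or three elements the four sampled sinusoids cannot be independent, so the rank stays below four and no set of weights can isolate every direction. With eight or sixteen elements in the same baseline the rank reaches four and a pseudo-inverse separating all four directions exists.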

[0098] It is easy to see how the above considerations become important for the optimal implementation of the invention, and in particular to the real-life challenges arising. The processor carrying out the algorithms in accordance with the invention is effectively working with eigenvectors of matrices and is concerned with small eigenvector/eigenvalue pairs, i.e. those that will minimize or closely minimize

$s(w|C) = \min_{w} w^{H} C w$ Equation (12).

[0099] This means that there are specific precautions that must be taken. Ignoring for the moment the constraint $w^{H}\mathbf{1} = \text{constant}$ (since this can be shown to be a minor modification giving a projection onto a subspace), and recalling how the eigenvalues and eigenvectors behave, the eigenvalue decomposition of the matrix C (which is Hermitian) can be considered:

[00011] $\begin{matrix} C = \sum_{i = 1}^{r = \mathrm{rank}(C)} \lambda_{i} v_{i} v_{i}^{H} & \text{Equation (13)} \end{matrix}$

[0100] Where $\{\lambda_{i}\}$ is the set of non-zero eigenvalues, sorted in decreasing order. The following term is considered:

[00012] $\begin{matrix} w^{H} C w = w^{H} \left[ \sum_{i = 1}^{r = \mathrm{rank}(C)} \lambda_{i} v_{i} v_{i}^{H} \right] w = \sum_{i = 1}^{r} \lambda_{i} w^{H} v_{i} v_{i}^{H} w = \sum_{i = 1}^{r} \lambda_{i} \left| w^{H} v_{i} \right|^{2} & \text{Equation (14)} \end{matrix}$

[0101] It can be seen that the more closely w is aligned with the eigenvectors corresponding to small eigenvalues, the smaller this term becomes. It is also known that eigenvectors corresponding to small eigenvalues are generally unstable. This means that a small change to the matrix C could give very different scores, for instance $s(w|C) \ll s(w|\tilde{C})$

[0102] for some perturbation $\tilde{C}$ of the matrix C. This means that, if there were a small error in C, the effective array resolution (which is related to s) could be dramatically degraded.
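This sensitivity can be reproduced with a toy example. The sketch below is illustrative only and uses assumed values: a real symmetric 2×2 matrix with eigenvalues 1 and 10⁻⁶, and a perturbation that rotates its eigenvectors by 0.01 rad. The weight vector w is aligned with the small-eigenvalue eigenvector of the unperturbed matrix, and the score is the quadratic form w^T C w (the real-valued analogue of w^H C w).

```python
import math


def matrix_from_eigen(lam_large, lam_small, delta):
    """Build lam_large*v1*v1^T + lam_small*v2*v2^T for orthonormal
    eigenvectors v1, v2 rotated by the angle delta (radians)."""
    c, s = math.cos(delta), math.sin(delta)
    v1, v2 = (c, s), (-s, c)
    return [[lam_large * v1[i] * v1[j] + lam_small * v2[i] * v2[j]
             for j in range(2)] for i in range(2)]


def score(C, w):
    """Quadratic form w^T C w."""
    return sum(w[i] * C[i][j] * w[j] for i in range(2) for j in range(2))


# w is aligned with the small-eigenvalue eigenvector of the unperturbed C.
w = (0.0, 1.0)
C = matrix_from_eigen(1.0, 1e-6, 0.0)
C_perturbed = matrix_from_eigen(1.0, 1e-6, 0.01)  # small change to C

# How much larger the score becomes after the perturbation.
ratio = score(C_perturbed, w) / score(C, w)
```

Although the entries of the matrix change by only about one percent of its largest eigenvalue, the score for the same w grows by roughly two orders of magnitude, because a small fraction of the large eigenvalue leaks into the direction of w.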

[0103] However, this is exactly what will happen in many real-life scenarios. Consider the matrix C specifically, which is constructed as:

[00013] $\begin{matrix} C = \sum_{i = 1}^{N} a(\theta_{i}) a(\theta_{i})^{H} & \text{Equation (15)} \end{matrix}$

[0104] The steering vectors $a(\theta)$ are related to, among other things, the speed of sound. In practice, however, the speed of sound will change relative to its assumed value as a result of temperature or humidity changes. For example, a change from an assumed value of 340 m/s to an actual value of 345 m/s would give rise to a distortion of C (to become $\tilde{C}$) which could have an order of magnitude impact on the score s.
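The consequence of a wrong speed-of-sound assumption can be sketched for the simplest possible case. The example below is an assumption-laden illustration rather than the method of the invention: it uses a hypothetical two-element array (4 mm spacing, 4 kHz), designs weights with unit gain in the forward direction and an exact null towards an interferer at θ = 0 under the assumed speed of 340 m/s, and then evaluates the same weights when the true speed is 345 m/s.

```python
import cmath
import math


def phase_step(theta, c, freq_hz=4000.0, spacing_m=0.004):
    """Inter-element phase shift for arrival angle theta and sound speed c."""
    return 2.0 * math.pi * freq_hz * spacing_m * math.cos(theta) / c


def null_steering_weights(theta_interferer, c_assumed):
    """Conjugated weights (w0*, w1*) satisfying w^H 1 = 1 and
    w^H a(theta_interferer) = 0 under the assumed speed of sound."""
    phi = phase_step(theta_interferer, c_assumed)
    w1_conj = 1.0 / (1.0 - cmath.exp(-1j * phi))
    return (1.0 - w1_conj, w1_conj)


def response(weights_conj, theta, c_actual):
    """Magnitude of w^H a(theta) when the true speed of sound is c_actual."""
    phi = phase_step(theta, c_actual)
    return abs(weights_conj[0] + weights_conj[1] * cmath.exp(-1j * phi))


weights = null_steering_weights(theta_interferer=0.0, c_assumed=340.0)
leak_matched = response(weights, 0.0, 340.0)      # null holds
leak_mismatched = response(weights, 0.0, 345.0)   # null is lost
forward_gain = response(weights, math.pi / 2, 345.0)
```

The forward gain stays at 1 in both cases, but the interferer that was cancelled exactly under the assumed speed leaks through once the true speed differs, which is why several candidate values of the speed of sound may need to be tried.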

[0105] For the purpose of speech recognition therefore, it might be necessary to apply several versions of the matrix C and the associated (optimal) weights w, to get the desired resolution. This could happen in a number of ways including: trying out different combinations C/w relating to different temperatures, and seeing which array output has the lowest overall energy; trying out different combinations C/w relating to different temperatures, and seeing which array output has the signal output which is most representative of speech (say, reflecting the statistical distribution of a speech signal); and trying out different combinations C/w relating to different temperatures, and seeing which array gives the highest classification rates with a speech recognition engine.

[0106] Referring back to FIG. 2, it may be seen that, although the first processor 14 may be sufficiently powerful to carry out some of these steps, the demands on this processor will quickly become high and hence drive the cost of the circuitry and/or the power consumption up to a level which is too high for a mobile device. However, by using the remote processor 16 to conduct this more extensive search whenever it is needed, power can be saved by keeping the remote processor in a low power mode when such operations are not necessary. It will be appreciated of course that this advantage can be achieved even if both processors are provided on the same device. It is therefore not essential for one of the processors to be provided remotely.

[0107] A more specific example of the use of greater processing power to select from multiple candidates will now be described with reference to FIG. 8. In the first step 101 a candidate for a speech signal is detected from one or more microphones 2, as previously described. The detection could be carried out by the first processor 12.

[0108] Next, in step 102, the signal separation algorithm is set up, meaning that it is based on certain assumptions about the physical conditions and realities around the microphone array. For instance, the steering vectors $a(\theta)$ have a relation to the speed of sound, and so an assumption as to what the speed of sound is (it could be 340, 330 or 345 m/s depending on things like temperature or humidity) would be a parameter that could be set. Next, in step 103, those parameters are applied with a signal separation algorithm. This would often be a beam former, but it could also be a time-domain de-convolution approach or any other approach. The output, or potentially the plurality of outputs, from this process is/are then fed to a speech recognition engine at step 104.

[0109] If the speech recognition engine recognizes a word from a dictionary or a vocabulary, that word, or some other indication of that word such as its short form, hash code or index, can be fed to an application at step 105. It should be noted that although the term "word" is used herein, this could be replaced with a phrase, a sound, or some other entity that is of importance for natural speech recognition.

[0110] If no word is recognized at step 104, or if the likelihood of correct classification is too low, or some other key criterion is met such as the determined risk of dual or multiple word matches being deemed too high, the process moves on to step 106, where the key parameters are modified. As mentioned before, those could relate to key physical variables like the speed of sound and the resulting impact on the steering vectors (and in turn, the matrix C). However, they could also relate to different beam patterns or focusing strategies. For instance, in one instance of the parametric selection, a relatively broad beam may be used, and in another, a narrower beam. They could also relate to different algorithm selections. For instance, if beam formers were used at first without success, more computationally complex searches like time-domain de-convolution approaches could be attempted.

[0111] The legal set of parameters for this search may be contained in a parameter database 107. This could be implemented as a list, matrix or other structure of legal and relevant parameters to use for the search, and could include, without being limited to: speed of sound, background noise characteristics, assumptions of positions of potential interfering sources, assumptions of sensor overload (saturation), or any other searchable quantity. Likewise, the database 107 need not be a fixed database with a final set of parameter settings; it could equally well be a generator algorithm that constructs new parameter sets using a set of rules to search for words using a variety of said settings.

[0112] Even though the implementation here is shown as sequential, a parallel implementation can equally well be envisaged, where various levels of confidence in the detection process of words are matched against each other and the winner selected. Depending on the CPU architecture, such an approach may sometimes be much faster and more efficient.
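The sequential search of FIG. 8 can be outlined in a few lines. This is a hedged sketch only: the function names, the confidence threshold of 0.8 and the toy separation and recognition stand-ins are all hypothetical and are not taken from the description; a real system would plug in an actual signal separation algorithm and speech recognition engine.

```python
def search_for_word(signals, parameter_sets, separate, recognize,
                    min_confidence=0.8):
    """Try parameter sets in turn until the recognizer is confident.

    parameter_sets plays the role of the parameter database 107 and may be
    a list or a generator; separate and recognize are caller-supplied.
    """
    for params in parameter_sets:                 # steps 102 and 106
        separated = separate(signals, params)     # step 103
        word, confidence = recognize(separated)   # step 104
        if word is not None and confidence >= min_confidence:
            return word, params                   # step 105
    return None                                   # search exhausted


def parameter_generator():
    """Hypothetical generator of candidate speed-of-sound assumptions."""
    for c in (340.0, 330.0, 345.0):
        yield {"speed_of_sound": c}


def toy_separate(signals, params, true_speed=345.0):
    """Stand-in for a beam former: records how wrong the assumption is."""
    return {"signals": signals,
            "mismatch": abs(params["speed_of_sound"] - true_speed)}


def toy_recognize(separated):
    """Stand-in recognizer whose confidence falls as the mismatch grows."""
    return "hello", 1.0 / (1.0 + separated["mismatch"])


result = search_for_word([0.0, 0.1, 0.0], parameter_generator(),
                         toy_separate, toy_recognize)
```

A parallel variant would evaluate all parameter sets at once and keep the candidate with the highest confidence instead of stopping at the first acceptable one.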

Impact of Noise

[0113] Consideration is now given to the impact of noise in real-world implementations. For this the algorithm seeks to use the weights vector w to lock energy/focus in the forwards direction. At the same time there should ideally be as little energy as possible coming in through the beam former from other directions, whether it is interference (from other directions) or noise. This is illustrated in FIG. 8 where it is desirable to lock onto and receive the main beam whilst suppressing the side lobes.

[0114] A suitable discretization yields the following equation:

[00014] $\begin{matrix} y = \int_{0}^{\pi} a(\theta) s(\theta) \, d\theta + n = \sum_{i = 1}^{N} a(\theta_{i}) s(\theta_{i}) + n & \text{Equation (16)} \end{matrix}$

[0115] In fact, this is an approximation, but the associated error could be modeled into the noise term n, so this can be accepted for now. Here, the numbers $s(\theta_{i})$ are the signals arriving from the different directions $\theta_{i}$. Those are complex numbers representing phase and amplitude, since it is the frequency domain being considered. Writing this in vector/matrix form gives:

[00015] $\begin{matrix} y = \sum_{i = 1}^{N} a(\theta_{i}) s(\theta_{i}) + n = As + n \quad \text{where} \quad A = [a(\theta_{1}) \; a(\theta_{2}) \; \cdots \; a(\theta_{N})] = [a_{1} \; a_{2} \; \cdots \; a_{N}], \quad s = \begin{bmatrix} s(\theta_{1}) \\ s(\theta_{2}) \\ \vdots \\ s(\theta_{N}) \end{bmatrix} = \begin{bmatrix} s_{1} \\ s_{2} \\ \vdots \\ s_{N} \end{bmatrix} \quad \text{and} \quad n = \begin{bmatrix} n_{1} \\ n_{2} \\ \vdots \\ n_{N} \end{bmatrix} & \text{Equation (17)} \end{matrix}$

[0116] Where $n_{i}$ is the (complex) noise at each sensor. To bring the forward looking lock into focus, this can be rewritten as:

$y = As + n = \tilde{A}\tilde{s} + a_{k}s_{k} + \tilde{n}$ Equation (18)

[0117] Where k is the index of the forward looking vector ($\theta = \pi/2$), which means that $a_{k} = \mathbf{1}$.

[0118] A beam forming weights vector w is now applied to obtain a beam formed signal

$z = w^{H}y = w^{H}[As + n] = w^{H}[\tilde{A}\tilde{s} + a_{k}s_{k} + \tilde{n}] = w^{H}[\tilde{A}\tilde{s} + \mathbf{1}s_{k} + \tilde{n}] = w^{H}\tilde{A}\tilde{s} + w^{H}\mathbf{1}s_{k} + w^{H}\tilde{n}$ Equation (19)

[0119] It is already known that $w^{H}\mathbf{1} = 1$ (because w was derived under this condition) so the expression is now:

$z = w^{H}\tilde{A}\tilde{s} + s_{k} + w^{H}\tilde{n}$ Equation (20)

[0120] What is of interest is the signal $s_{k}$, which is the signal coming from the forward direction. In trying to recover this signal as well as possible (through beam forming), the other two terms, $w^{H}\tilde{A}\tilde{s}$ and $w^{H}\tilde{n}$, should be as small as possible in terms of magnitude. Since z already captures the signal $s_{k}$ (and must do so due to the design of w), effectively one wishes to minimize the expectation of $|z|^{2}$. This amounts to wanting to minimize

[00016] $\begin{matrix} \begin{aligned} E|z|^{2} &= E\{zz^{*}\} \\ &= E\{(w^{H}\tilde{A}\tilde{s} + s_{k} + w^{H}\tilde{n})(w^{H}\tilde{A}\tilde{s} + s_{k} + w^{H}\tilde{n})^{*}\} \\ &= E(w^{H}\tilde{A}\tilde{s})(w^{H}\tilde{A}\tilde{s})^{*} + |s_{k}|^{2} + E(w^{H}\tilde{n})(w^{H}\tilde{n})^{*} \\ &= E(w^{H}\tilde{A}\tilde{s}\tilde{s}^{H}\tilde{A}^{H}w) + |s_{k}|^{2} + E(w^{H}\tilde{n}\tilde{n}^{H}w) \\ &= w^{H}E(\tilde{A}\tilde{s}\tilde{s}^{H}\tilde{A}^{H})w + |s_{k}|^{2} + w^{H}E(\tilde{n}\tilde{n}^{H})w \\ &= w^{H}\tilde{A}\tilde{A}^{H}w + |s_{k}|^{2} + \sigma^{2}w^{H}Iw \\ &= w^{H}\tilde{A}\tilde{A}^{H}w + |s_{k}|^{2} + \sigma^{2}w^{H}w \\ &= w^{H}\tilde{A}\tilde{A}^{H}w + |s_{k}|^{2} + \sigma^{2}\|w\|_{2}^{2} \\ &= w^{H}\tilde{C}w + |s_{k}|^{2} + \sigma^{2}\|w\|_{2}^{2} \end{aligned} & \text{Equations (21)} \end{matrix}$

[0121] Where it has been assumed the sources (s) are uncorrelated and of equal (unit) energy, although other energy levels make no difference to the following arguments. Now, the first term may already be recognized as the one minimized originally, so this is, in a certain sense, already minimal for the w chosen. The second term is fixed and the third term has two components, the noise variance and the norm of the vector w. The signal-to-noise-and-interference ratio can be described as:

[00017] $\begin{matrix} \mathrm{SINR} = \frac{|s_{k}|^{2}}{w^{H}\tilde{C}w + \sigma^{2}\|w\|_{2}^{2}} = |s_{k}|^{2} \cdot \frac{1}{w^{H}\tilde{C}w + \sigma^{2}\|w\|_{2}^{2}} & \text{Equation (22)} \end{matrix}$

[0122] Where only the last factor needs to be observed, since the signal energy is going to be a (situation dependent) constant. Clearly, the variance of the noise is important, and so the low noise level of the optical microphones is particularly desirable for obtaining a good SINR in the beam forming context.
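The role of the noise variance in Equation (22) can be checked with a small calculation. The sketch below assumes hypothetical numbers throughout: a four-element uniform weight vector, unit signal energy, and an interference term $w^{H}\tilde{C}w$ that has already been evaluated and is passed in as a single value.

```python
def sinr(signal_power, w, interference_power, noise_variance):
    """Equation (22): |s_k|^2 / (w^H C~ w + sigma^2 * ||w||_2^2).

    interference_power stands for the already-evaluated term w^H C~ w.
    """
    w_norm_sq = sum(abs(x) ** 2 for x in w)
    return signal_power / (interference_power + noise_variance * w_norm_sq)


# Hypothetical values: four equal weights summing to 1, so ||w||^2 = 0.25.
w = [0.25, 0.25, 0.25, 0.25]
sinr_noisy = sinr(1.0, w, interference_power=0.05, noise_variance=0.1)
sinr_quiet = sinr(1.0, w, interference_power=0.05, noise_variance=0.01)
```

For a fixed weight vector and interference term, lowering the sensor noise variance raises the SINR directly, and the gain is larger the bigger the norm of w.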

[0123] FIG. 9 shows a Fast Fourier Transform plot of a typical audio signal received when a person utters the letter sound "a". From this it may be seen that the spectrum has a main peak 202 at a base frequency of 226 Hz. However, there are additional clear overtones 204, 206, 208, 210 at twice, four times, eight times and sixteen times the base frequency. These can be used to further boost performance of speech recognition, as will be described below with reference to FIG. 10. Although the specific examples given here are power-of-two multiples of the base frequency, this is not essential; the invention can be used with any convenient integer multiples of the base frequency.

[0124] FIG. 10 is a flowchart describing operation of a further embodiment of the invention which employs the overtones 204-210 illustrated in FIG. 9. This is a modified version of the operation described above with reference to FIG. 8.

[0125] As before, in the first step 1010 a candidate for a speech signal is detected from one or more microphones 2 and in step 1020, the signal separation algorithm is set up, meaning that it is based on certain assumptions about the physical conditions and realities around the microphone array such as the speed of sound etc.

[0126] Next, in step 1030, those parameters are applied with signal separation algorithms to signals at the base frequency and also, in parallel steps 1031, 1032, at the first to nth overtone frequencies. The separation can be made individually, based on individual parameters for each of the frequencies of interest. However, the separation can also share one or more parameters, such as those relating to a series of guesses of spatial directions, which will typically co-occur for any given audio source outputting multiple frequencies (i.e. overtones). Other parameters, such as guesses on the amplitude of the signal components (which could be based on predictive approaches), could also be shared.

[0127] In step 1040, the outputs of the overtone signal separations are combined. This could happen in any number of ways. For instance, the separated overtone signals could be added up before being passed on to step 1050. In other embodiments, the amplitudes or envelopes of the signals could be added. In yet other embodiments, the signals or their envelopes/amplitudes could be subject to separate filters before being joined, so that, for instance, any component too contaminated by noise or interference is not made part of the sum. This could happen using e.g. an outlier detection mechanism, where for instance the envelopes of the frequency components are used. Frequencies with an envelope pattern diverging significantly from the other envelope patterns may be kept out of the calculations/combinations.
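One possible form of such an outlier check is sketched below. It is an assumption, not the mechanism of the description: each envelope is compared with the per-sample median of all envelopes using a Pearson correlation, and envelopes whose correlation falls below a hypothetical threshold of 0.5 are left out of the sum. The envelope values are made up for illustration.

```python
import math
from statistics import median


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def combine_envelopes(envelopes, min_correlation=0.5):
    """Sum the envelopes that resemble the median envelope; drop the rest.

    Returns the combined envelope and the indices of the envelopes kept.
    """
    reference = [median(column) for column in zip(*envelopes)]
    kept = [i for i, env in enumerate(envelopes)
            if pearson(env, reference) >= min_correlation]
    combined = [sum(envelopes[i][t] for i in kept)
                for t in range(len(reference))]
    return combined, kept


# Made-up envelopes: base tone, two overtones, and one contaminated band.
envelopes = [
    [0.0, 1.0, 2.0, 1.0, 0.0],
    [0.0, 2.0, 4.0, 2.0, 0.0],
    [0.0, 1.5, 3.0, 1.5, 0.0],
    [3.0, 0.0, 3.0, 0.0, 3.0],
]
combined, kept = combine_envelopes(envelopes)
```

Using the median as the reference keeps a single contaminated band from dominating the comparison; other robust references or thresholds could equally be chosen.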

[0128] Even though the frequencies are treated separately in steps 1030, 1031, . . . 1032 and then recombined at step 1040, the treatment of overtones need not be divided up explicitly. For instance, other embodiments could use time-domain techniques which do not employ Fourier transformations, and hence do not use individual frequencies per se, but instead use pure time-domain representations and then effectively tie information about overtones into the estimation approach by using appropriate covariance matrices, which essentially build the expected effect of co-varying base-tones and overtones into a signal estimation approach.

[0129] As before, a speech recognition engine is used to see whether it recognizes a word from a dictionary or a vocabulary at step 1050. If so, that word, or some other indication of that word such as its short form, hash code or index, can be fed to an application at step 1060. It should be noted that although the term "word" is used herein, this could be replaced with a phrase, a sound, or some other entity that is of importance for natural speech recognition.

[0130] If no word is recognized at step 1050, or if the likelihood of correct classification is too low, or some other key criterion is met such as the determined risk of dual or multiple word matches being deemed too high, the process moves on to step 1070, where the key parameters are modified.

[0131] Again, as before, the legal set of parameters for this search may be contained in a parameter database 1080.
