ELECTRONIC DEVICE AND METHOD FOR OBTAINING A USER'S SPEECH IN A FIRST SOUND SIGNAL
20230197094 · 2023-06-22
CPC classification: H04R25/407 · H04R25/554 · H04R2201/107 (ELECTRICITY)
Abstract
An electronic device includes: a first external input transducer configured to capture a first sound signal that comprises a first speech part of a speech of a user and a first noise part of noise from a surrounding; an internal input transducer configured to capture a second signal, the second signal comprising a second speech part of the speech of the user, where the first speech part and the second speech part are of a same speech portion of the speech at a first time interval; and a signal processor configured to: estimate a fundamental frequency of the speech of the user at the first time interval based on the second signal; update a first model based on the estimated fundamental frequency of the speech at the first time interval; and process the first sound signal based on the updated first model to obtain the first speech part.
Claims
1. A method performed by an electronic device, the method comprising: capturing, by a first external input transducer of the electronic device, a first sound signal, the first sound signal comprising a first speech part of a speech of a user and a first noise part of noise from a surrounding; capturing, by an internal input transducer of the electronic device, a second signal, the second signal comprising a second speech part of the speech of the user, where the first speech part and the second speech part are of a same speech portion of the speech at a first time interval; estimating, by a signal processor of the electronic device, a first fundamental frequency of the speech of the user at the first time interval, the first fundamental frequency being estimated based on the second signal; updating, by the signal processor, a first model based on the estimated first fundamental frequency of the speech of the user at the first time interval; and processing, by the signal processor, the first sound signal based on the updated first model to obtain the first speech part of the first sound signal.
2. The method according to claim 1, further comprising: capturing, by the first external input transducer, a third sound signal, the third sound signal comprising a third speech part of the speech of the user; capturing, by the internal input transducer, a fourth signal, the fourth signal comprising a fourth speech part of the speech of the user, where the third speech part and the fourth speech part are of a same speech portion of the speech of the user at a second time interval; estimating a second fundamental frequency of the speech of the user at the second time interval, the second fundamental frequency being estimated based on the fourth signal; updating the first model based on the estimated second fundamental frequency of the speech of the user at the second time interval; and processing the third sound signal to obtain the third speech part, wherein the act of processing the third sound signal is performed based on the first model that has been updated based on the estimated second fundamental frequency.
3. The method according to claim 1, further comprising: estimating additional fundamental frequencies of the speech of the user at additional time intervals respectively; updating the first model based on the estimated additional fundamental frequency at each of the additional time intervals; and obtaining a speech part for each of the additional time intervals.
4. The method according to claim 1, wherein the first model is a periodic model.
5. The method according to claim 1, wherein the act of processing the first sound signal based on the updated first model to obtain the first speech part comprises filtering the first sound signal in a periodic filter.
6. The method according to claim 5, wherein the act of filtering the first sound signal in the periodic filter comprises applying multiples of the estimated first fundamental frequency.
7. The method according to claim 5, wherein the first model is a harmonic model, and wherein the periodic filter is a harmonic filter.
8. The method according to claim 1, further comprising processing the obtained first speech part; wherein the act of processing the obtained first speech part comprises mixing a noise signal with the obtained first speech part.
9. The method according to claim 1, wherein the internal input transducer is configured to be arranged in an ear canal of the user or on a body of the user.
10. The method according to claim 1, wherein the internal input transducer comprises a vibration sensor.
11. The method according to claim 10, wherein a bandwidth of the vibration sensor is configured to span low frequencies of the speech of the user, the low frequencies being up to approximately 1.5 kHz.
12. The method according to claim 1, wherein the first external input transducer is a microphone configured to point towards the surrounding.
13. The method according to claim 1, wherein the electronic device further comprises a second external input transducer, and wherein the act of processing the first sound signal based on the updated first model to obtain the first speech part comprises beamforming the first sound signal in a periodic beamformer.
14. The method according to claim 1, wherein the electronic device comprises a first hearing device and a second hearing device, and wherein the first fundamental frequency is estimated by the first hearing device and/or the second hearing device.
15. An electronic device comprising: a first external input transducer configured to capture a first sound signal, the first sound signal comprising a first speech part of a speech of a user and a first noise part of noise from a surrounding; an internal input transducer configured to capture a second signal, the second signal comprising a second speech part of the speech of the user, where the first speech part and the second speech part are of a same speech portion of the speech at a first time interval; and a signal processor configured to: estimate a fundamental frequency of the speech of the user at the first time interval based on the second signal; update a first model based on the estimated fundamental frequency of the speech at the first time interval; and process the first sound signal based on the updated first model to obtain the first speech part.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0121] The above and other features and advantages will become readily apparent to those skilled in the art by the following detailed description of exemplary embodiments thereof with reference to the attached drawings, in which:
DETAILED DESCRIPTION
[0130] Various embodiments are described hereinafter with reference to the figures. Like reference numerals refer to like elements throughout. Like elements will, thus, not be described in detail with respect to the description of each figure. It should also be noted that the figures are only intended to facilitate the description of the embodiments. They are not intended as an exhaustive description of the claimed invention or as a limitation on the scope of the claimed invention. In addition, an illustrated embodiment need not have all the aspects or advantages shown. An aspect or an advantage described in conjunction with a particular embodiment is not necessarily limited to that embodiment and may be practiced in any other embodiment even if not so illustrated or not so explicitly described.
[0131] Throughout, the same reference numerals are used for identical or corresponding parts.
[0133] The method comprises, in the signal processor 6:
[0134] estimating a fundamental frequency of the user's speech at the first interval in time, the fundamental frequency being estimated based on the second signal 14;
[0135] applying the estimated fundamental frequency of the user's speech at the first interval in time into a first model to update the first model;
[0136] processing the first sound signal 10 based on the updated first model to obtain the first speech part of the first sound signal.
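By way of a hedged, non-limiting sketch (not the patented implementation), the three steps above, estimating the fundamental frequency from the internal signal 14, updating a periodic model, and processing the first sound signal 10, could fit together as follows. The names `pitch_period` and `obtain_speech` are hypothetical, the pitch estimator is a simple autocorrelation method, and a one-tap time-domain comb filter stands in for whatever periodic filter an actual device would use:

```python
import numpy as np

def pitch_period(internal, fs, f_min=60.0, f_max=400.0):
    """Pitch period (in samples) of the user's speech, estimated from
    the internal signal via the autocorrelation peak inside the
    plausible pitch-lag range."""
    x = internal - np.mean(internal)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    lo, hi = int(fs / f_max), int(fs / f_min)
    return lo + int(np.argmax(r[lo:hi + 1]))

def obtain_speech(external, internal, fs):
    """Update the periodic model with the current pitch period T, then
    comb-filter the external signal: components repeating every T
    samples (the periodic speech) add coherently, while aperiodic
    noise is averaged down."""
    T = pitch_period(internal, fs)
    y = external.copy()
    y[T:] = 0.5 * (external[T:] + external[:-T])
    return y
```

Averaging each sample with the sample one pitch period earlier leaves an exactly periodic speech component untouched while halving the variance of uncorrelated noise, which is the basic mechanism a more elaborate periodic filter would exploit.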
[0139] The first external input transducer 4 and the second external input transducer 4′ may be arranged on a part, e.g. a housing, of the electronic device 2 which is arranged in the ear 20 of the user.
[0140] The electronic device 2 may comprise a third external input transducer 4″, e.g. arranged on a part of the electronic device which is arranged behind the ear 20 of the user.
[0141] The electronic device 2 comprises an internal input transducer 12 which is configured to be arranged in the ear canal of the user's ear 20. Alternatively, the internal input transducer 12 may be arranged on the body of the user, e.g. arranged on the user's wrist.
[0143] Thus, the electronic device 2 may comprise a transceiver 16 and an antenna 18 for transmitting, via a wireless connection 30, the signal processed in the signal processor 6, e.g. the first speech part of the first sound signal, to another device, such as a smart phone paired with the electronic device 2. Phone calls with far-end callers 24 may be performed using the smart phone, whereby the first speech part of the first sound signal may be transmitted via the wireless connection 30 in the phone call to a transceiver 32 of a second electronic device, such as a smart phone of the far-end caller 24.
[0147] The drawings include spectrograms, where:
(a) is a clean signal recorded with external microphones.
(b) is the clean signal zoomed in at low frequencies between 0-1 kHz.
(c) is the noisy external microphone signal corrupted by babble noise.
(d) is the noisy signal zoomed in at low frequencies between 0-1 kHz.
(e) is the vibration sensor signal.
(f) is the vibration sensor signal zoomed in at low frequencies between 0-1 kHz.
[0148] The spectrograms illustrate how the low frequencies are better preserved in the vibration sensor signal, whereas the high frequencies are better preserved in the external microphone signal. Therefore, it is an advantage to use the vibration sensor signal to estimate the fundamental frequency of the user's speech and, based on this, to obtain the first speech part of the user's speech from the external microphone signal.
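As an illustrative sketch of why the band-limited internal signal suffices for pitch estimation, the following assumes a simple autocorrelation estimator (one of several possible methods; `estimate_f0` is a hypothetical name) applied to a synthetic frame whose energy, like a vibration-sensor signal, sits in the low band:

```python
import numpy as np

def estimate_f0(x, fs, f_min=60.0, f_max=400.0):
    """Estimate the fundamental frequency of a voiced frame by
    locating the autocorrelation peak inside the plausible range."""
    x = x - np.mean(x)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    lo = int(fs / f_max)   # shortest plausible pitch period, in samples
    hi = int(fs / f_min)   # longest plausible pitch period, in samples
    lag = lo + int(np.argmax(r[lo:hi + 1]))
    return fs / lag

# Synthetic stand-in for one vibration-sensor frame: a 120 Hz voiced
# sound containing only its first few low-frequency harmonics.
fs = 8000
t = np.arange(int(0.05 * fs)) / fs
frame = sum(np.sin(2 * np.pi * 120 * k * t) / k for k in (1, 2, 3))
```

Only the position of the periodicity peak matters for the estimate, so the loss of high frequencies in the internal signal does not degrade it.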
[0154] The top left graph shows a speech signal, where the x-axis is time in seconds, and the y-axis is amplitude. The speech signal has a duration of 2.5 seconds.
[0155] The speech signal is transformed to a frequency representation in the top right graph, where the x-axis is time in seconds, and the y-axis is frequency in Hz. This frequency representation shows a spectrogram of speech, corresponding to the spectrograms described above.
[0156] Going back to the speech signal in the top left graph, this speech signal can be divided into segments of time. One segment of the speech signal is shown in the bottom left graph. The segment of the speech signal has a length of 0.025 seconds. The periodicity of the speech signal in the specific segment is illustrated by the red vertical lines every 0.005 seconds.
[0157] The segment of the speech signal is transformed to a frequency representation in the bottom right graph, where the x-axis is now frequency in Hz, and the y-axis is power.
[0158] The bottom right graph shows the corresponding spectrum of the segment. The bottom right graph shows the signal divided into harmonic frequencies, where the harmonic frequency ω0 is the lowest frequency at about 25 Hz, the next harmonic ω1 is at about 50 Hz, and then a number of harmonics are shown up to about 100 Hz.
[0159] From the bottom right graph showing the corresponding spectrum of the segment, a fundamental frequency ω0 of the speech segment is estimated as shown in the middle right graph, where the x-axis is time in seconds, and the y-axis is fundamental frequency ω0 in Hz.
[0160] The estimated fundamental frequency in the middle right graph is shown below the spectrum of speech in the top right graph, and as the x-axes of both these graphs are time in seconds, the estimated fundamental frequency at a time t in the middle right graph can be seen together with the spectrum of speech at the same time t in the top right graph. Thus, the graphs can be viewed in connection with each other.
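The frame-by-frame analysis described above (segments of 0.025 seconds, one fundamental-frequency estimate per segment) can be sketched as follows, again assuming an autocorrelation-based estimator rather than the specific method used in the device; `f0_track` is a hypothetical name. Note that a periodicity of 0.005 seconds, as marked by the vertical lines, corresponds to a 200 Hz fundamental:

```python
import numpy as np

def f0_track(x, fs, frame_len=0.025, f_min=60.0, f_max=400.0):
    """One fundamental-frequency estimate per frame of `frame_len`
    seconds, using the autocorrelation peak within the pitch range,
    yielding an f0 trajectory over the whole signal."""
    n = int(frame_len * fs)
    lo, hi = int(fs / f_max), int(fs / f_min)
    track = []
    for start in range(0, len(x) - n + 1, n):
        seg = x[start:start + n]
        seg = seg - np.mean(seg)
        r = np.correlate(seg, seg, mode="full")[n - 1:]
        lag = lo + int(np.argmax(r[lo:hi + 1]))
        track.append(fs / lag)
    return np.array(track)
```

Plotting this trajectory against time reproduces the middle right graph: one f0 value per segment, aligned with the spectrogram above it.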
[0161] Although particular features have been shown and described, it will be understood that they are not intended to limit the claimed invention, and it will be obvious to those skilled in the art that various changes and modifications may be made without departing from the scope of the claimed invention. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The claimed invention is intended to cover all alternatives, modifications and equivalents.
[0162] Items:
1. A method in an electronic device, for obtaining a user's speech in a first sound signal, the first sound signal comprising the user's speech and noise from the surroundings, the electronic device comprising: [0163] a first external input transducer configured for capturing the first sound signal, the first sound signal comprising a first speech part of the user's speech and a first noise part; [0164] an internal input transducer configured for capturing a second signal, the second signal comprising a second speech part of the user's speech;
where the first speech part and the second speech part are of a same speech portion of the user's speech at a first interval in time; [0165] a signal processor;
where the method comprises, in the signal processor: [0166] estimating a first fundamental frequency of the user's speech at the first interval in time, the first fundamental frequency being estimated based on the second signal; [0167] applying the estimated first fundamental frequency of the user's speech at the first interval in time into a first model to update the first model; and [0168] processing the first sound signal based on the updated first model to obtain the first speech part of the first sound signal.
2. The method according to any of the preceding items, wherein
the first external input transducer is configured for capturing a third sound signal, the third sound signal comprising a third speech part of the user's speech and a third noise part;
the internal input transducer is configured for capturing a fourth signal, the fourth signal comprising a fourth speech part of the user's speech;
where the third speech part and the fourth speech part are of a same speech portion of the user's speech at a second interval in time;
where the method comprises, in the signal processor: [0169] estimating a second fundamental frequency of the user's speech at the second interval in time, the second fundamental frequency being estimated based on the fourth signal; [0170] applying the estimated second fundamental frequency of the user's speech at the second interval in time into the first model to update the first model; [0171] processing the third sound signal based on the updated first model to obtain the third speech part of the third sound signal.
3. The method according to any of the preceding items,
wherein the method is configured to be performed at regular intervals in time for obtaining the user's speech over a time period,
where the method comprises estimating the current fundamental frequency of the user's speech at each interval in time;
where the method comprises applying the current fundamental frequency in the first model to update the first model;
where the method comprises obtaining a current speech part at each interval in time.
4. The method according to any of the preceding items, wherein the first model is a periodic model.
5. The method according to any of the preceding items, wherein processing the first sound signal based on the updated first model to obtain the first speech part comprises filtering the first sound signal in a periodic filter.
6. The method according to any of the preceding items, wherein filtering the first sound signal in the periodic filter comprises applying multiples of the estimated first fundamental frequency of the user's speech.
7. The method according to any of the preceding items, wherein the periodic model is a harmonic model, and wherein the periodic filter is a harmonic filter.
8. The method according to any of the preceding items, wherein the method further comprises: [0172] processing the obtained first speech part; and wherein the processing of the obtained first speech part comprises mixing a noise signal with the obtained first speech part.
9. The method according to any of the preceding items, wherein the internal input transducer is configured to be arranged in the ear canal of the user or on the body of the user.
10. The method according to any of the preceding items, wherein the internal input transducer comprises a vibration sensor.
11. The method according to any of the preceding items, wherein the bandwidth of the vibration sensor is configured to span low frequencies of the user's speech, the low frequencies being up to approximately 1.5 kHz.
12. The method according to any of the preceding items, wherein the first external input transducer is a microphone configured to point towards the surroundings.
13. The method according to any of the preceding items, wherein the electronic device further comprises a second external input transducer, and wherein processing the first sound signal based on the updated first model to obtain the first speech part comprises beamforming the first sound signal in a periodic beamformer.
14. The method according to any of the preceding items, wherein the electronic device comprises a first hearing device and a second hearing device, and wherein the first fundamental frequency is configured to be estimated in the first hearing device and/or in the second hearing device.
15. An electronic device for obtaining a user's speech in a first sound signal, the first sound signal comprising the user's speech and noise from the surroundings, the electronic device comprising: [0173] a first external input transducer configured for capturing the first sound signal, the first sound signal comprising a first speech part of the user's speech and a first noise part; [0174] an internal input transducer configured for capturing a second signal, the second signal comprising a second speech part of the user's speech;
where the first speech part and the second speech part are of a same speech portion of the user's speech at a first interval in time; [0175] a signal processor configured for: [0176] estimating a fundamental frequency of the user's speech at the first interval in time, the fundamental frequency being estimated based on the second signal; [0177] applying the estimated fundamental frequency of the user's speech at the first interval in time into a first model to update the first model; [0178] processing the first sound signal based on the updated first model to obtain the first speech part of the first sound signal.
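Items 5 to 7 describe a periodic (harmonic) filter that applies multiples of the estimated fundamental frequency. One possible realization, offered only as a sketch under the assumption of an FFT-domain comb mask (the actual filter structure is not specified here, and `harmonic_filter` is a hypothetical name), keeps narrow bands around each harmonic k·f0 and suppresses everything in between:

```python
import numpy as np

def harmonic_filter(x, fs, f0, bandwidth=20.0):
    """Keep only narrow bands around multiples of f0 (the harmonics
    of the user's speech), suppressing the frequencies in between."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    # distance from each frequency bin to the nearest harmonic k * f0
    dist = np.abs(freqs - f0 * np.round(freqs / f0))
    mask = (dist <= bandwidth / 2) & (freqs >= f0 / 2)
    return np.fft.irfft(X * mask, n=len(x))
```

Noise components falling between the harmonics are removed, while the speech energy concentrated at multiples of the fundamental frequency passes through, which is the intended effect of the periodic filter of item 5.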
LIST OF REFERENCES
[0179] 2 electronic device
[0180] 4 first external input transducer
[0181] 4′ second external input transducer
[0182] 4″ third external input transducer
[0183] 6 signal processor
[0184] 8 output transducer
[0185] 10 first sound signal comprising a first speech part of the user's speech and a first noise part
[0186] 12 internal input transducer
[0187] 14 second signal comprising a second speech part of the user's speech
[0188] 16 transceiver
[0189] 18 antenna
[0190] 20 user's ear
[0191] 22 user of the electronic device
[0192] 24 far-end caller or recipient
[0193] 26 user's speech
[0194] 28 noise/sounds from the surrounding
[0195] 30 wireless connection
[0196] 32 transceiver of a second electronic device
[0197] 100 method for obtaining a user's speech in a first sound signal
[0198] 102 step of estimating a first fundamental frequency of the user's speech at the first interval in time
[0199] 104 step of applying the estimated first fundamental frequency of the user's speech at the first interval in time into a first model to update the first model
[0200] 106 step of processing the first sound signal based on the updated first model to obtain the first speech part of the first sound signal