ELECTRONIC DEVICE AND METHOD FOR OBTAINING A USER'S SPEECH IN A FIRST SOUND SIGNAL

20230197094 · 2023-06-22

Abstract

An electronic device includes: a first external input transducer configured to capture a first sound signal that comprises a first speech part of a speech of a user and a first noise part of noise from a surrounding; an internal input transducer configured to capture a second signal, the second signal comprising a second speech part of the speech of the user, where the first speech part and the second speech part are of a same speech portion of the speech at a first time interval; and a signal processor configured to: estimate a fundamental frequency of the speech of the user at the first time interval based on the second signal; update a first model based on the estimated fundamental frequency of the speech at the first time interval; and process the first sound signal based on the updated first model to obtain the first speech part.

Claims

1. A method performed by an electronic device, the method comprising: capturing, by a first external input transducer of the electronic device, a first sound signal, the first sound signal comprising a first speech part of a speech of a user and a first noise part of noise from a surrounding; capturing, by an internal input transducer of the electronic device, a second signal, the second signal comprising a second speech part of the speech of the user, where the first speech part and the second speech part are of a same speech portion of the speech at a first time interval; estimating, by a signal processor of the electronic device, a first fundamental frequency of the speech of the user at the first time interval, the first fundamental frequency being estimated based on the second signal; updating, by the signal processor, a first model based on the estimated first fundamental frequency of the speech of the user at the first time interval; and processing, by the signal processor, the first sound signal based on the updated first model to obtain the first speech part of the first sound signal.

2. The method according to claim 1, further comprising: capturing, by the first external input transducer, a third sound signal, the third sound signal comprising a third speech part of the speech of the user; capturing, by the internal input transducer, a fourth signal, the fourth signal comprising a fourth speech part of the speech of the user, where the third speech part and the fourth speech part are of a same speech portion of the speech of the user at a second time interval; estimating a second fundamental frequency of the speech of the user at the second time interval, the second fundamental frequency being estimated based on the fourth signal; updating the first model based on the estimated second fundamental frequency of the speech of the user at the second time interval; and processing the third sound signal to obtain the third speech part, wherein the act of processing the third sound signal is performed based on the first model that has been updated based on the estimated second fundamental frequency.

3. The method according to claim 1, further comprising: estimating additional fundamental frequencies of the speech of the user at additional time intervals respectively; updating the first model based on the estimated additional fundamental frequency at each of the additional time intervals; and obtaining a speech part for each of the additional time intervals.

4. The method according to claim 1, wherein the first model is a periodic model.

5. The method according to claim 1, wherein the act of processing the first sound signal based on the updated first model to obtain the first speech part comprises filtering the first sound signal in a periodic filter.

6. The method according to claim 5, wherein the act of filtering the first sound signal in the periodic filter comprises applying multiples of the estimated first fundamental frequency.

7. The method according to claim 5, wherein the first model is a harmonic model, and wherein the periodic filter is a harmonic filter.

8. The method according to claim 1, further comprising processing the obtained first speech part; wherein the act of processing the obtained first speech part comprises mixing a noise signal with the obtained first speech part.

9. The method according to claim 1, wherein the internal input transducer is configured to be arranged in an ear canal of the user or on a body of the user.

10. The method according to claim 1, wherein the internal input transducer comprises a vibration sensor.

11. The method according to claim 10, wherein a bandwidth of the vibration sensor is configured to span low frequencies of the speech of the user, the low frequencies being up to approximately 1.5 kHz.

12. The method according to claim 1, wherein the first external input transducer is a microphone configured to point towards the surrounding.

13. The method according to claim 1, wherein the electronic device further comprises a second external input transducer, and wherein the act of processing the first sound signal based on the updated first model to obtain the first speech part comprises beamforming the first sound signal in a periodic beamformer.

14. The method according to claim 1, wherein the electronic device comprises a first hearing device and a second hearing device, and wherein the first fundamental frequency is estimated by the first hearing device and/or the second hearing device.

15. An electronic device comprising: a first external input transducer configured to capture a first sound signal, the first sound signal comprising a first speech part of a speech of a user and a first noise part of noise from a surrounding; an internal input transducer configured to capture a second signal, the second signal comprising a second speech part of the speech of the user, where the first speech part and the second speech part are of a same speech portion of the speech at a first time interval; and a signal processor configured to: estimate a fundamental frequency of the speech of the user at the first time interval based on the second signal; update a first model based on the estimated fundamental frequency of the speech at the first time interval; and process the first sound signal based on the updated first model to obtain the first speech part.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0121] The above and other features and advantages will become readily apparent to those skilled in the art by the following detailed description of exemplary embodiments thereof with reference to the attached drawings, in which:

[0122] FIG. 1 schematically illustrates an example of a method in an electronic device, for obtaining a user's speech in a first sound signal.

[0123] FIGS. 2A and 2B schematically illustrate examples of an electronic device for obtaining a user's speech in a first sound signal.

[0124] FIG. 3 schematically illustrates an example of a user's ear with an electronic device in the ear.

[0125] FIG. 4 schematically illustrates an example of using the obtained first speech part in a phone call between the user of the electronic device and a far-end caller or recipient.

[0126] FIGS. 5A and 5B schematically illustrate examples of block diagrams of a method for obtaining a first speech part of a first sound signal, where FIG. 5A schematically illustrates an example of a block diagram for harmonic filter own-voice pick-up using a first external microphone, and where FIG. 5B schematically illustrates an example of a block diagram for harmonic beamformer own-voice pick-up using at least two external microphones.

[0127] FIG. 6 shows examples of spectrograms of speech signals.

[0128] FIGS. 7A and 7B schematically illustrate examples of beamformers.

[0129] FIG. 8 schematically illustrates an example of representations and segments of a speech signal, and how the fundamental frequency for time segments or time intervals can be estimated from a speech signal.

DETAILED DESCRIPTION

[0130] Various embodiments are described hereinafter with reference to the figures. Like reference numerals refer to like elements throughout. Like elements will, thus, not be described in detail with respect to the description of each figure. It should also be noted that the figures are only intended to facilitate the description of the embodiments. They are not intended as an exhaustive description of the claimed invention or as a limitation on the scope of the claimed invention. In addition, an illustrated embodiment need not have all the aspects or advantages shown. An aspect or an advantage described in conjunction with a particular embodiment is not necessarily limited to that embodiment and can be practiced in any other embodiments even if not so illustrated, or if not so explicitly described.

[0131] Throughout, the same reference numerals are used for identical or corresponding parts.

[0132] FIG. 1 schematically illustrates an example of a method 100 in an electronic device for obtaining a user's speech in a first sound signal. The first sound signal comprises the user's speech and noise from the surroundings. The electronic device comprises a first external input transducer configured for capturing the first sound signal, the first sound signal comprising a first speech part of the user's speech and a first noise part. The electronic device comprises an internal input transducer configured for capturing a second signal, the second signal comprising a second speech part of the user's speech. The first speech part and the second speech part are of a same speech portion of the user's speech at a first interval in time. The electronic device comprises a signal processor, which may be configured for processing the first sound signal and the second signal. The method comprises, in the signal processor, estimating 102 a first fundamental frequency of the user's speech at the first interval in time, the first fundamental frequency being estimated based on the second signal. The method comprises, in the signal processor, applying 104 the estimated first fundamental frequency of the user's speech at the first interval in time into a first model to update the first model. The method comprises, in the signal processor, processing 106 the first sound signal based on the updated first model to obtain the first speech part of the first sound signal.
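The three steps 102, 104, and 106 can be sketched in code. The sketch below is illustrative only and is not the disclosed implementation: it assumes an autocorrelation-based pitch estimator, represents the first model simply as the set of multiples of the estimated fundamental frequency, and uses an assumed sample rate, pitch search range, and harmonic band width. All names are hypothetical.

```python
import numpy as np

FS = 8000  # assumed sample rate in Hz

def estimate_f0(internal_segment, fs=FS, f_min=60.0, f_max=400.0):
    """Step 102: estimate the fundamental frequency from the internal
    (e.g. vibration sensor) signal via the autocorrelation peak."""
    seg = internal_segment - internal_segment.mean()
    ac = np.correlate(seg, seg, mode="full")[len(seg) - 1:]
    lo, hi = int(fs / f_max), int(fs / f_min)
    return fs / (lo + np.argmax(ac[lo:hi]))

def update_model(f0, n_harmonics=10):
    """Step 104: here the 'first model' is simply the set of harmonic
    frequencies k*f0 implied by the current pitch estimate."""
    return f0 * np.arange(1, n_harmonics + 1)

def process(external_segment, harmonics, fs=FS, width_hz=20.0):
    """Step 106: keep only the spectral bins of the first sound signal
    that lie near the modelled harmonics."""
    spec = np.fft.rfft(external_segment)
    freqs = np.fft.rfftfreq(len(external_segment), 1.0 / fs)
    keep = np.zeros(freqs.shape, dtype=bool)
    for h in harmonics:
        keep |= np.abs(freqs - h) < width_hz
    return np.fft.irfft(spec * keep, n=len(external_segment))
```

In this sketch the second (internal) signal would be passed to `estimate_f0` and the first (external) sound signal to `process`; repeating the three calls for successive time intervals keeps the model updated as the pitch varies.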

[0133] FIG. 2A schematically illustrates an example of an electronic device 2 for obtaining a user's speech in a first sound signal 10. The first sound signal 10 comprises the user's speech and noise from the surroundings. The electronic device 2 comprises a first external input transducer 4 configured for capturing the first sound signal 10, the first sound signal 10 comprising a first speech part of the user's speech and a first noise part. The electronic device 2 comprises an internal input transducer 12 configured for capturing a second signal 14, the second signal 14 comprising a second speech part of the user's speech. The first speech part and the second speech part are of a same speech portion of the user's speech at a first interval in time. The electronic device 2 comprises a signal processor 6, which may be configured for processing the first sound signal 10 and the second signal 14. The signal processor 6 is configured to:

[0134] estimate a fundamental frequency of the user's speech at the first interval in time, the fundamental frequency being estimated based on the second signal 14;

[0135] apply the estimated fundamental frequency of the user's speech at the first interval in time into a first model to update the first model; and

[0136] process the first sound signal 10 based on the updated first model to obtain the first speech part of the first sound signal.

[0137] FIG. 2B schematically illustrates an example of an electronic device 2 for obtaining a user's speech in a first sound signal 10. The electronic device of FIG. 2B comprises the same features as in FIG. 2A. Furthermore, FIG. 2B shows that the electronic device 2 may also comprise an output transducer 8 connected to the signal processor 6 for outputting a signal, e.g. the first speech part of the first sound signal, processed in the signal processor 6 to the user's own ear canal. Furthermore, FIG. 2B shows that the electronic device 2 may also comprise a transceiver 16 and an antenna 18 for transmitting the signal, e.g. the first speech part of the first sound signal, processed in the signal processor 6 to another device, such as a smart phone paired with the electronic device. Phone calls with far-end callers may be performed using the smart phone, whereby the first speech part of the first sound signal may be transmitted in the phone call to the far-end caller.

[0138] FIG. 3 schematically illustrates an example of a user's ear 20 with an electronic device 2 in the ear 20. The electronic device 2 comprises a first external input transducer 4 which may be a microphone configured to be arranged on an external facing surface of the electronic device 2 to point towards the surroundings. The electronic device 2 may further comprise a second external input transducer 4′ also arranged on an external facing surface of the electronic device 2 to point towards the surroundings.

[0139] The first external input transducer 4 and the second external input transducer 4′ may be arranged on a part, e.g. a housing, of the electronic device 2 which is arranged in the ear 20 of the user.

[0140] The electronic device 2 may comprise a third external input transducer 4″, e.g. arranged on a part of the electronic device which is arranged behind the ear 20 of the user.

[0141] The electronic device 2 comprises an internal input transducer 12 which is configured to be arranged in the ear canal of the user's ear 20. Alternatively, the internal input transducer 12 may be arranged on the body of the user, e.g. arranged on the user's wrist.

[0142] FIG. 4 schematically illustrates an example of using the obtained first speech part in a phone call between the user 22 of the electronic device and a far-end caller or recipient 24. When the user 22 of the electronic device 2 speaks, the first external input transducer 4 of the electronic device 2 may capture both the user's speech 26 and sounds 28 from the surroundings. If the user 22 of the electronic device 2 is having a phone call via a wireless connection 30 with a far-end caller 24, the user's speech 26 may be captured by the external input transducer 4 of the electronic device 2 and transmitted to the far-end caller 24. However, as the external input transducer 4 may capture both the user's speech 26 and sounds 28 from the surroundings, the sounds 28 from the surroundings may be perceived as noise in the phone call, where it is desired to transmit only the user's speech 26 and not the sound/noise 28 from the surroundings. According to the present method and electronic device, the user's speech 26 or own voice is obtained from the first sound signal with no, or only limited, noise 28 in the signal. Thus, the first speech part is transmitted via the wireless connection 30 to the far-end recipient 24, whereby the far-end recipient 24 receives the first speech part and not the noise 28 of the first sound signal. Thereby, the far-end recipient 24 receives a clean speech signal with no, or only few, sounds/noise 28 from the surroundings of the user 22.

[0143] Thus, the electronic device 2 may comprise a transceiver 16 and an antenna 18 for transmitting 30 the signal, e.g. the first speech part of the first sound signal, processed in the signal processor 6 to another device, such as a smart phone paired with the electronic device 2. Phone calls with far-end callers 24 may be performed using the smart phone, whereby the first speech part of the first sound signal may be transmitted via the wireless connection 30 in the phone call to a transceiver 32 of a second electronic device, such as a smart phone of the far-end caller 24.

[0144] FIGS. 5A and 5B schematically illustrate examples of block diagrams of a method for obtaining a first speech part of a first sound signal.

[0145] FIG. 5A schematically illustrates an example of a block diagram for harmonic filter own-voice pick-up using a first external microphone. A vibration sensor is an example of an internal input transducer 12. The vibration sensor captures a vibration signal, which is an example of a second signal, and provides this signal to a pitch estimation, which is a first fundamental frequency estimation. The pitch estimation estimates a pitch or first fundamental frequency ω0, which is applied to a harmonic model, which is an example of a first model. An external microphone is an example of a first external input transducer 4. The external microphone captures a sound signal, which is an example of a first sound signal, and provides this signal to a harmonic filter, to which the harmonic model is also provided. Based on this, the harmonic filter provides an own-voice signal, which is an example of a first speech part.

[0146] FIG. 5B schematically illustrates an example of a block diagram for harmonic beamformer own-voice pick-up using at least two external microphones. A vibration sensor is an example of an internal input transducer 12. The vibration sensor captures a vibration signal, which is an example of a second signal, and provides this signal to a pitch estimation, which is a first fundamental frequency estimation. The pitch estimation estimates a pitch or first fundamental frequency ω0, which is applied to a harmonic model, which is an example of a first model. External microphones are examples of external input transducers; thus there may be at least a first external microphone 4 and a second external microphone 4′. The external microphones capture a sound signal, which is an example of a first sound signal, and provide this signal to a harmonic beamformer, to which the harmonic model is also provided. Based on this, the harmonic beamformer provides an own-voice signal, which is an example of a first speech part.
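As a rough illustration of FIG. 5B, a harmonic beamformer can be sketched as a frequency-domain delay-and-sum over two external microphone signals in which only the bins near multiples of the estimated fundamental frequency ω0 are retained. This is a hypothetical sketch, not the disclosed implementation; the microphone spacing, steering convention, number of harmonics, and band width are assumed values.

```python
import numpy as np

FS = 8000   # assumed sample rate in Hz
C = 343.0   # speed of sound in m/s
D = 0.01    # assumed spacing between the two external microphones, in m

def harmonic_beamform(x1, x2, f0, theta=0.0, n_harmonics=10,
                      fs=FS, width_hz=20.0):
    """Delay-and-sum the two microphone signals toward angle theta
    (radians), then keep only the bins near the harmonics k*f0."""
    n = len(x1)
    s1, s2 = np.fft.rfft(x1), np.fft.rfft(x2)
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    tau = D * np.sin(theta) / C  # interchannel delay for angle theta
    aligned = 0.5 * (s1 + s2 * np.exp(2j * np.pi * freqs * tau))
    keep = np.zeros(freqs.shape, dtype=bool)
    for k in range(1, n_harmonics + 1):
        keep |= np.abs(freqs - k * f0) < width_hz
    return np.fft.irfft(aligned * keep, n=n)
```

For own-voice pick-up, theta would be the fixed direction of the user's mouth relative to the microphone pair; the harmonic mask then suppresses interference between the harmonic frequencies, as illustrated for the beampattern in FIG. 7B.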

[0147] FIG. 6 shows examples of spectrograms. The spectrograms are of signals, such as speech signals. The x-axis is time in seconds. The y-axis is frequency in kHz.

(a) is a clean signal recorded with external microphones.
(b) is the clean signal zoomed in at low frequencies between 0-1 kHz.
(c) is the noisy external microphone signal corrupted by babble noise.
(d) is the noisy signal zoomed in at low frequencies between 0-1 kHz.
(e) is the vibration sensor signal.
(f) is the vibration sensor signal zoomed in at low frequencies between 0-1 kHz.

[0148] The spectrograms illustrate how the low frequencies are better preserved in the vibration sensor signal, whereas the high frequencies are better preserved in the external microphone signal. Therefore, it is an advantage to use the vibration sensor signal to estimate the fundamental frequency of the user's speech and, based on this, obtain the first speech part of the user's speech from the external microphone signal.

[0149] FIGS. 7A and 7B schematically illustrate examples of beamformers. The x-axis is angle in degrees. The y-axis is frequency in Hz.

[0150] FIG. 7A schematically illustrates an example of a broadband beamformer. FIG. 7B schematically illustrates an example of a harmonic beamformer.

[0151] FIG. 7A shows the beampattern of a broadband beamformer with its directivity steered to 0 degrees, preserving most of the signal along an entire lobe from 0 Hz to 4000 Hz.

[0152] FIG. 7B shows the beampattern of a harmonic beamformer with its directivity steered to 0 degrees, preserving the signal only at the harmonic frequencies distributed from 0 Hz to 4000 Hz while eliminating the interference between the harmonic frequencies.
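The directional behaviour behind these beampatterns can be illustrated with the magnitude response of a simple two-microphone delay-and-sum beamformer steered to 0 degrees. This is a hypothetical sketch; the 1 cm microphone spacing is an assumed value in the range of an in-ear device and is not taken from the figures.

```python
import numpy as np

C = 343.0   # speed of sound in m/s
D = 0.01    # assumed microphone spacing in m

def das_gain(freq_hz, angle_deg, spacing=D):
    """Magnitude response of a two-microphone delay-and-sum beamformer
    steered to 0 degrees, for a plane wave arriving from angle_deg."""
    tau = spacing * np.sin(np.deg2rad(angle_deg)) / C
    return float(np.abs(0.5 * (1.0 + np.exp(-2j * np.pi * freq_hz * tau))))
```

On-axis (0 degrees) the gain is 1 at every frequency; off-axis the gain drops, and it drops faster at high frequencies, which is the lobe structure of FIG. 7A. A harmonic beamformer additionally zeroes the response at frequencies away from the harmonics k·ω0, regardless of angle, which corresponds to the striped pattern of FIG. 7B.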

[0153] FIG. 8 schematically illustrates an example of representations and segments of a speech signal, and how the fundamental frequency for time segments or time intervals can be estimated from a speech signal.

[0154] The top left graph shows a speech signal, where the x-axis is time in seconds, and the y-axis is amplitude. The speech signal has a duration/length of 2.5 seconds.

[0155] The speech signal is transformed to a frequency representation in the top right graph, where the x-axis is time in seconds, and the y-axis is frequency in Hz. This frequency representation shows a spectrogram of speech, which corresponds to the spectrograms in FIG. 6.

[0156] Going back to the speech signal in the top left graph, this speech signal can be divided into segments of time. One segment of the speech signal is shown in the bottom left graph. The segment of the speech signal has a length of 0.025 seconds. The periodicity of the speech signal in the specific segment is illustrated by the red vertical lines every 0.005 seconds.

[0157] The segment of the speech signal is transformed to a frequency representation in the bottom right graph, where the x-axis is now frequency in Hz, and the y-axis is power.

[0158] The bottom right graph shows the corresponding spectrum of the segment, with the signal divided into harmonic frequencies: the harmonic frequency ω0 is the lowest frequency at about 25 Hz, the next harmonic ω1 is at about 50 Hz, and then a number of harmonics are shown up to about 100 Hz.

[0159] From the bottom right graph showing the corresponding spectrum of the segment, a fundamental frequency ω0 of the speech segment is estimated as shown in the middle right graph, where the x-axis is time in seconds, and the y-axis is fundamental frequency ω0 in Hz.

[0160] The estimated fundamental frequency in the middle right graph is shown below the spectrum of speech in the top right graph, and as the x-axes of both these graphs are time in seconds, the estimated fundamental frequency at a time t in the middle right graph can be seen together with the spectrum of speech at the same time t in the top right graph. Thus, the graphs of FIG. 8 show how the fundamental frequency for time segments or time intervals can be estimated from a speech signal.
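The segmentation and per-segment pitch estimation of FIG. 8 can be sketched as follows. This is an illustrative autocorrelation-based estimator: the 25 ms segment length follows the figure, while the sample rate and pitch search range are assumed values, and the function name is hypothetical.

```python
import numpy as np

FS = 8000  # assumed sample rate in Hz

def f0_track(speech, fs=FS, seg_dur=0.025, f_min=60.0, f_max=400.0):
    """Divide the signal into non-overlapping segments (25 ms, as in
    FIG. 8) and estimate a fundamental frequency for each segment
    from the location of its autocorrelation peak."""
    seg_len = int(seg_dur * fs)
    lo, hi = int(fs / f_max), int(fs / f_min)
    track = []
    for start in range(0, len(speech) - seg_len + 1, seg_len):
        seg = speech[start:start + seg_len]
        seg = seg - seg.mean()
        ac = np.correlate(seg, seg, mode="full")[seg_len - 1:]
        track.append(fs / (lo + np.argmax(ac[lo:hi])))
    return np.array(track)
```

Applied to the second (internal) signal, such a track supplies the per-interval fundamental frequency with which the first model is updated.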

[0161] Although particular features have been shown and described, it will be understood that they are not intended to limit the claimed invention, and it will be obvious to those skilled in the art that various changes and modifications may be made without departing from the scope of the claimed invention. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The claimed invention is intended to cover all alternatives, modifications, and equivalents.

[0162] Items:

1. A method in an electronic device, for obtaining a user's speech in a first sound signal, the first sound signal comprising the user's speech and noise from the surroundings, the electronic device comprising: [0163] a first external input transducer configured for capturing the first sound signal, the first sound signal comprising a first speech part of the user's speech and a first noise part; [0164] an internal input transducer configured for capturing a second signal, the second signal comprising a second speech part of the user's speech;
where the first speech part and the second speech part are of a same speech portion of the user's speech at a first interval in time; [0165] a signal processor;
where the method comprises, in the signal processor: [0166] estimating a first fundamental frequency of the user's speech at the first interval in time, the first fundamental frequency being estimated based on the second signal; [0167] applying the estimated first fundamental frequency of the user's speech at the first interval in time into a first model to update the first model; and [0168] processing the first sound signal based on the updated first model to obtain the first speech part of the first sound signal.
2. The method according to any of the preceding items, wherein
the first external input transducer is configured for capturing a third sound signal, the third sound signal comprising a third speech part of the user's speech and a third noise part;
the internal input transducer is configured for capturing a fourth signal, the fourth signal comprising a fourth speech part of the user's speech;
where the third speech part and the fourth speech part are of a same speech portion of the user's speech at a second interval in time;
where the method comprises, in the signal processor: [0169] estimating a second fundamental frequency of the user's speech at the second interval in time, the second fundamental frequency being estimated based on the fourth signal; [0170] applying the estimated second fundamental frequency of the user's speech at the second interval in time into the first model to update the first model; [0171] processing the third sound signal based on the updated first model to obtain the third speech part of the third sound signal.
3. The method according to any of the preceding items,
wherein the method is configured to be performed at regular intervals in time for obtaining/deriving the user's speech during/over a time period,
where the method comprises estimating the current fundamental frequency of the user's speech at each interval in time;
where the method comprises applying the current fundamental frequency in the first model to update the first model;
where the method comprises obtaining a current speech part at each interval in time.
4. The method according to any of the preceding items, wherein the first model is a periodic model.
5. The method according to any of the preceding items, wherein processing the first sound signal based on the updated first model to obtain the first speech part comprises filtering the first sound signal in a periodic filter.
6. The method according to any of the preceding items, wherein filtering the first sound signal in the periodic filter comprises applying multiples of the estimated first fundamental frequency of the user's speech.
7. The method according to any of the preceding items, wherein the periodic model is a harmonic model, and wherein the periodic filter is a harmonic filter.
8. The method according to any of the preceding items, wherein the method further comprises: [0172] processing the obtained first speech part; and wherein the processing of the obtained first speech part comprises mixing a noise signal with the obtained first speech part.
9. The method according to any of the preceding items, wherein the internal input transducer is configured to be arranged in the ear canal of the user or on the body of the user.
10. The method according to any of the preceding items, wherein the internal input transducer comprises a vibration sensor.
11. The method according to any of the preceding items, wherein the bandwidth of the vibration sensor is configured to span low frequencies of the user's speech, the low frequencies being up to approximately 1.5 kHz.
12. The method according to any of the preceding items, wherein the first external input transducer is a microphone configured to point towards the surroundings.
13. The method according to any of the preceding items, wherein the electronic device further comprises a second external input transducer, and wherein processing the first sound signal based on the updated first model to obtain the first speech part comprises beamforming the first sound signal in a periodic beamformer.
14. The method according to any of the preceding items, wherein the electronic device comprises a first hearing device and a second hearing device, and wherein the first fundamental frequency is configured to be estimated in the first hearing device and/or in the second hearing device.
15. An electronic device for obtaining a user's speech in a first sound signal, the first sound signal comprising the user's speech and noise from the surroundings, the electronic device comprising: [0173] a first external input transducer configured for capturing the first sound signal, the first sound signal comprising a first speech part of the user's speech and a first noise part; [0174] an internal input transducer configured for capturing a second signal, the second signal comprising a second speech part of the user's speech;
where the first speech part and the second speech part are of a same speech portion of the user's speech at a first interval in time; [0175] a signal processor configured for: [0176] estimating a fundamental frequency of the user's speech at the first interval in time, the fundamental frequency being estimated based on the second signal; [0177] entering/applying the estimated fundamental frequency of the user's speech at the first interval in time into a first model to update the first model; [0178] processing the first sound signal based on the updated first model to obtain the first speech part of the first sound signal.

LIST OF REFERENCES

[0179] 2 electronic device
[0180] 4 first external input transducer
[0181] 4′ second external input transducer
[0182] 4″ third external input transducer
[0183] 6 signal processor
[0184] 8 output transducer
[0185] 10 first sound signal comprising a first speech part of the user's speech and a first noise part
[0186] 12 internal input transducer
[0187] 14 second signal comprising a second speech part of the user's speech
[0188] 16 transceiver
[0189] 18 antenna
[0190] 20 user's ear
[0191] 22 user of the electronic device
[0192] 24 far-end caller or recipient
[0193] 26 user's speech
[0194] 28 noise/sounds from the surroundings
[0195] 30 wireless connection
[0196] 32 transceiver of a second electronic device
[0197] 100 method for obtaining a user's speech in a first sound signal
[0198] 102 step of estimating a first fundamental frequency of the user's speech at the first interval in time
[0199] 104 step of applying the estimated first fundamental frequency of the user's speech at the first interval in time into a first model to update the first model
[0200] 106 step of processing the first sound signal based on the updated first model to obtain the first speech part of the first sound signal