HEARING DEVICE SYSTEM AND METHOD FOR PROCESSING AUDIO SIGNALS
20210281958 · 2021-09-09
Inventors
CPC classification (ELECTRICITY)
H04R2499/11
H04R25/70
H04R2420/07
H04R25/554
International classification
Abstract
A hearing device (2) comprises a recording unit (5) for recording an input signal (I), an audio processing unit (6) for determining an output signal (O) and a playback unit (7) for playing back the output signal (O) to a user (U). The audio processing unit (6) comprises a neural network (8) for separating a user voice signal (u) from the input signal (I). Further, a system (1) and a method for processing audio signals are described.
Claims
1. A hearing device comprising: a recording unit for recording an input signal, an audio processing unit for determining an output signal, wherein the audio processing unit comprises a neural network for separating a user voice signal from the input signal, and a playback unit for playing back the output signal to a user.
2. The hearing device according to claim 1, wherein the audio processing unit further comprises a classical audio signal processing means for processing at least parts of the input signal to denoise at least parts of the input signal.
3. The hearing device according to claim 2, wherein the classical audio signal processing means and the neural network are configured to be run in parallel and/or in series.
4. The hearing device according to claim 1, wherein the neural network is configured as a long short-term memory network with three layers.
5. The hearing device according to claim 1, further comprising: a voice detection unit for measuring a presence of a user voice signal in the input signal.
6. A system for processing audio signals, comprising: a hearing device comprising an audio processing unit for determining an output signal, wherein the audio processing unit comprises a first neural network for separating a user voice signal from an input signal, and a secondary device, wherein the secondary device comprises a secondary audio processing unit for determining a secondary output signal, wherein the secondary audio processing unit comprises a secondary neural network for denoising at least parts of a secondary input signal, wherein the secondary device is configured to form a wireless data connection with the hearing device for transmitting at least parts of the secondary output signal to the hearing device and/or to receive the secondary input signal from the hearing device.
7. The system according to claim 6, wherein the secondary device comprises a secondary recording unit for recording the secondary input signal.
8. The system according to claim 6, wherein the secondary neural network is configured to separate the user voice signal from the secondary input signal.
9. The system according to claim 6, wherein the secondary audio processing unit comprises a calibration neural network for calibrating the neural network and/or the secondary neural network.
10. The system according to claim 6, wherein the secondary device is a mobile device or a wireless microphone.
11. The system according to claim 6, wherein the data connection is implemented using a BLUETOOTH protocol or using a proprietary protocol, wherein the proprietary protocol has a lower latency than BLUETOOTH.
12. A method for processing audio signals, the method comprising: recording an input signal using a recording unit, determining an output signal using an audio processing unit, wherein a user voice signal is separated from the input signal by a neural network, and providing the output signal to the user using a hearing device.
13. The method according to claim 12, wherein determining the output signal comprises denoising at least parts of the input signal by a classical audio signal processing means.
14. The method according to claim 12, wherein at least parts of the input signal are denoised by the classical audio signal processing means in parallel to the separation of the user voice signal by the neural network.
15. The method according to claim 12, wherein at least parts of the input signal are denoised by the classical audio signal processing means after the user voice signal is separated from the input signal by the neural network.
16. The method according to claim 12, wherein the neural network is a first neural network, further comprising: determining a secondary output signal by denoising at least parts of a secondary input signal using a secondary neural network, and transmitting at least parts of the secondary output signal to the hearing device.
17. The method according to claim 16, wherein denoising of the secondary input signal by the secondary neural network comprises separating the user voice signal from the secondary input signal.
18. The method according to claim 16, wherein the secondary output signal is at least partially included in the output signal by the audio processing unit of the hearing device.
19. The method according to claim 16, wherein the method further comprises the step of calibrating the neural network and/or the secondary neural network using a calibration neural network being part of the secondary audio processing unit.
20. The method according to claim 19, wherein a calibration input signal is provided to and analyzed by the calibration neural network.
Description
BRIEF DESCRIPTION OF THE FIGURES
DETAILED DESCRIPTION
[0072] The hearing device 2 comprises a power supply 4 in the form of a battery. The hearing device 2 comprises a recording unit 5, an audio processing unit 6 and a playback unit 7. The recording unit 5 is configured to record an input signal I. The input signal I corresponds to sound, in particular ambient sound, which has been recorded with the recording unit 5. The audio processing unit 6 is configured to determine an output signal O. The playback unit 7 is configured to play back the output signal O to a user U.
[0073] The audio processing unit 6 comprises a neural network 8 and a classical audio signal processing means 9. The neural network 8 is an artificial neural network. The classical audio signal processing means 9 comprise computational means for audio processing which do not use a neural network. The classical audio signal processing means 9 can, for example, coincide with audio processing means used in known hearing aids, such as digital signal processing algorithms carried out on a digital signal processor (DSP). The audio processing unit 6 is configured as an arithmetic unit on which the neural network 8 and/or the classical audio signal processing means 9 can be executed.
[0074] The neural network 8 is configured to separate a user voice signal u from the input signal I.
[0075] The neural network 8 is highly specialized. It can be run efficiently with low computational requirements. Further, running the neural network 8 does not require much energy. The neural network 8 can be reliably run on the hearing device 2 for long times on a single charge of the power supply 4. The neural network 8 can have any suitable neural network architecture. An exemplary neural network 8 is a long short-term memory (LSTM) network with three layers. In an exemplary embodiment, each layer has 256 units.
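Purely as an illustrative sketch (not part of the disclosure), the parameter budget of such a three-layer, 256-unit LSTM can be estimated with a short calculation. The input feature dimension (64 here) and the gate parameterization (four gates, separate input-hidden and hidden-hidden biases, as used e.g. in PyTorch) are assumptions for illustration only:

```python
# Rough parameter count for a 3-layer LSTM with 256 units per layer.
# Assumed: input feature dimension 64; 4-gate parameterization with
# separate input-hidden and hidden-hidden bias vectors.

def lstm_layer_params(input_size: int, hidden_size: int) -> int:
    # 4 gates, each with an input-hidden matrix, a hidden-hidden matrix
    # and two bias vectors of length hidden_size
    return (4 * hidden_size * input_size
            + 4 * hidden_size * hidden_size
            + 8 * hidden_size)

hidden = 256       # units per layer, as in the exemplary network
feature_dim = 64   # assumed input feature dimension (hypothetical)

# first layer sees the input features, layers 2 and 3 see 256-dim states
total = lstm_layer_params(feature_dim, hidden) + 2 * lstm_layer_params(hidden, hidden)
print(f"approx. {total:,} parameters")  # about 1.4 million
```

At roughly 1.4 million parameters under these assumptions, such a network is small enough to plausibly run continuously on a hearing-device-class processor, which is consistent with the low-power claim above.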
[0076] The hearing device 2 comprises a sensor 10. The sensor 10 is a vibration sensor. The sensor 10 detects vibrations caused by the user U speaking. The sensor 10 can be used to measure a presence of the user voice signal u in the input signal I.
[0077] The hearing device 2 comprises a data interface 11. The secondary device 3 comprises a secondary data interface 12. The hearing device 2 and the secondary device 3 are connected via a wireless data connection 13, e.g., via a standard BLUETOOTH wireless data connection or via a wireless data connection implemented with a proprietary protocol, such as the ROGER protocol or a proprietary protocol implemented by modifying the BLUETOOTH protocol. A proprietary protocol, such as ROGER, can offer the advantage of a lower audio delay than can be achieved with standard protocols.
[0078] The secondary device 3 comprises a secondary power supply 14. The secondary device 3 comprises a secondary recording unit 15 and a secondary audio processing unit 16. The secondary recording unit 15 comprises one or more microphones to record a secondary input signal J. The secondary input signal J corresponds to sounds, in particular ambient sounds, which have been recorded with the secondary recording unit 15. Many modern mobile phones comprise several microphones which may be used by the secondary recording unit. Using several microphones, spatial information about the secondary input signal J can be captured. Further, the secondary input signal J can be recorded in stereo.
[0079] The secondary audio processing unit 16 is configured to determine a secondary output signal P. The secondary output signal P is determined based on the secondary input signal J. The secondary audio processing unit 16 comprises a secondary neural network 17. The secondary neural network 17 is configured to separate the user voice signal u from the secondary input signal J. To this end, the secondary neural network 17 uses the same user's speaker embedding as the neural network 8. In contrast to the neural network 8, the secondary neural network 17 does not return the user voice signal u, but the remaining audio signals contained in the secondary input signal J which do not correspond to the user voice signal u. The secondary neural network 17 removes the user voice signal u from the secondary input signal J. In other words, the secondary neural network 17 calculates the relative complement of the user voice signal u in the secondary input signal J, i.e. J−u. The secondary neural network 17 is further configured to denoise the secondary input signal J. In other words, the secondary neural network 17 filters noise and the user voice signal u from the secondary input signal J. The output of the secondary neural network 17 is hence the denoised relative complement of the user voice signal u, i.e. a denoised version of the audio signals (J−u). The secondary output signal P comprises the output of the secondary neural network 17.
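As an illustrative sketch only (not the disclosed implementation), the signal relationship P = denoise(J − u) can be expressed in a few lines. The separation here is exact subtraction and the "denoising" is a crude noise gate standing in for the secondary neural network 17; the signal values are synthetic:

```python
import numpy as np

def remove_user_voice(J: np.ndarray, u: np.ndarray) -> np.ndarray:
    # relative complement of the user voice u in the secondary input J, i.e. J - u
    return J - u

def noise_gate(x: np.ndarray, threshold: float = 0.05) -> np.ndarray:
    # crude time-domain noise gate standing in for the network's denoising
    return np.where(np.abs(x) > threshold, x, 0.0)

# toy mixture: user voice plus a louder ambient component plus weak noise
t = np.linspace(0.0, 1.0, 8000)
u = 0.3 * np.sin(2 * np.pi * 220 * t)   # user's voice
r = 0.5 * np.sin(2 * np.pi * 880 * t)   # other ambient sound
noise = 0.01 * np.random.default_rng(0).standard_normal(t.shape)
J = u + r + noise                        # secondary input signal

# secondary output: denoised relative complement, i.e. denoise(J - u)
P = noise_gate(remove_user_voice(J, u))
```

In the actual system the separation is performed by a trained network conditioned on the speaker embedding rather than by subtraction of a known reference, but the input/output relationship is the same.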
[0080] The secondary neural network 17 can perform more advanced operations on the secondary input signal J than the neural network 8 performs on the input signal I. Hence, the secondary neural network 17 requires more computational power. This is possible because the secondary device 3 is not subject to the same constraints concerning computational capability and power supply capacity as the hearing device 2. Hence, the secondary device 3 is able to run the more complex secondary neural network 17.
[0081] Any suitable network architecture can be used for the secondary neural network 17. An exemplary secondary neural network is a long short-term memory (LSTM) network with four layers. Per layer, the secondary neural network may comprise 300 units. In other embodiments, the secondary audio processing unit 16 may comprise more than one secondary neural network 17. In these embodiments, different ones of the secondary neural networks 17 may be specialized for different purposes. For example, one of the secondary neural networks 17 may be configured to remove the user voice signal u from the secondary input signal J. One or more different secondary neural networks may be specialized for denoising specific kinds of audio signals, for example voices, music and/or traffic noise.
[0082] The secondary audio processing unit 16 further comprises a calibration neural network 18. The calibration neural network 18 is configured to calibrate the neural network 8 and the secondary neural network 17. The calibration neural network 18 calculates the user's speaker embedding needed to identify the user voice signal. To this end, the calibration neural network 18 receives a calibration input signal containing information about the user's voice characteristics. In particular, the calibration neural network 18 uses Mel Frequency Cepstral Coefficients (MFCC), as well as two derivatives thereof, computed from examples of the user's voice. The calibration neural network 18 returns the user's speaker embedding, which is used as an input variable in the neural network 8 as well as in the secondary neural network 17.
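A minimal sketch of the calibration feature pipeline, for illustration only: the deltas here are simple first differences (real systems typically use a regression-based delta), the MFCC values are random placeholders, and mean pooling stands in for the calibration network 18 that maps the feature sequence to a fixed-length speaker embedding:

```python
import numpy as np

def deltas(mfcc: np.ndarray) -> np.ndarray:
    # first-order difference along the time axis (frames x coefficients),
    # padded so the frame count is preserved; a simple stand-in for the
    # usual regression-based delta computation
    return np.diff(mfcc, axis=0, prepend=mfcc[:1])

def calibration_features(mfcc: np.ndarray) -> np.ndarray:
    # MFCCs together with their first and second derivatives, as the
    # calibration network's input
    d1 = deltas(mfcc)
    d2 = deltas(d1)
    return np.concatenate([mfcc, d1, d2], axis=1)

# toy input: 100 frames of 13 MFCCs (values are placeholders)
mfcc = np.random.default_rng(1).standard_normal((100, 13))
feats = calibration_features(mfcc)   # shape (100, 39)

# mean pooling as a placeholder for the LSTM that maps the feature
# sequence to a fixed-length speaker embedding
embedding = feats.mean(axis=0)       # shape (39,)
```

The resulting fixed-length vector plays the role of the speaker embedding that is distributed to both the neural network 8 and the secondary neural network 17.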
[0083] Any suitable architecture can be used for the calibration neural network 18. An exemplary calibration neural network 18 is a long short-term memory (LSTM) network with three layers and 256 units per layer.
[0084] The secondary neural network 17 and the calibration neural network 18 are run on the secondary audio processing unit 16. In the shown embodiment, the secondary audio processing unit 16 comprises two secondary arithmetic units 19, on which the secondary neural network 17 and the calibration neural network 18 can be run respectively. In the shown embodiment, the secondary arithmetic units 19 are AI chips of the secondary device 3. In alternative embodiments, the secondary neural network 17 and the calibration neural network 18 can be run on the same arithmetic unit. In such embodiments, the secondary audio processing unit 16 can consist of a single arithmetic unit.
[0085] The secondary device 3 further comprises a user interface 20. The user interface 20 of the secondary device is a touchscreen of the mobile phone. Via the user interface 20, information about the audio processing on the hearing device 2 and the secondary device 3 is presented to the user U. Further, the user U can influence the audio processing, e.g. by setting preferences and changing operation modes. For example, the user U can set the degree of denoising and/or the amplification of the output signal.
[0086] The secondary device 3 comprises secondary device sensors 21. The secondary device sensors 21 collect user data. The audio processing can be adapted based on the user data. For example, the audio processing can be adapted to the position and/or movement of the user. In embodiments with several secondary neural networks 17, the user data can, for example, be used to select one or more of the secondary neural networks 17 which are best adapted to the surroundings of the user U.
[0087] In the shown embodiment, the hardware of the secondary device 3 is the usual hardware of a modern mobile phone. The functionality of the secondary device 3, in particular the functionality of the secondary audio processing unit 16, is provided by software, in particular an app, which is installed on the mobile phone. The software comprises the secondary neural network 17 as well as the calibration neural network 18. Further, the software provides a graphical interface displayed to the user U via the user interface 20.
[0088] With reference to
[0089] After the provision step 25, the system 1 is calibrated in a calibration step 26. In the calibration step 26, the calibration neural network 18 is used to calibrate the neural network 8 on the hearing device 2 as well as the secondary neural network 17 on the secondary device 3. Samples of the user's voice are recorded using the secondary recording unit 15. The secondary audio processing unit 16 calculates the Mel Frequency Cepstral Coefficients (MFCC) as well as two derivatives thereof from the samples of the user's voice. The calibration neural network 18 evaluates the calculated Mel Frequency Cepstral Coefficients and the derivatives to calculate the user's speaker embedding. The calculated user's speaker embedding is provided to the secondary neural network 17. The calculated user's speaker embedding is transferred to the hearing device 2, in particular the neural network 8, via the data connection 13.
[0090] The samples of the user's voice are recorded for a given amount of time, e.g. between 5 seconds and 30 minutes. For example, the samples may be recorded for about 3 minutes. The more samples are recorded, i.e. the longer the recording time, the more precise the calibration becomes. In the shown embodiment, the calibration is performed once, when the user U starts to use the system 1. In other embodiments, the calibration step 26 can also be repeated at later times, in order to gradually improve the user's speaker embedding and therefore the quality of the separation of the user voice signal u from the input signal I and the secondary input signal J, respectively.
[0091] The calibrated system can be used for audio processing by the user in an audio processing step 27. In the audio processing step 27, the hearing device 2 is used to generate the output signal O which is played back to the user U. The system 1 provides different operation modes for the audio processing step 27. In the
[0092] A first operation mode 28, which is shown in
[0093] Suppose that the user is in a surrounding with the ambient sound S. The ambient sound S is recorded as the input signal I by the recording unit 5 of the hearing device 2 in an input recording step 30. The input signal I may comprise the user voice signal u and further audio signals marked with the letter R. The audio signals R are the relative complement of the user voice signal u in the input signal I: R=I−u. At the same time, the ambient sound S is recorded by the secondary recording unit 15 of the secondary device 3 in form of a secondary input signal J in a secondary input step 31. The secondary input signal J mainly coincides with the input signal I, e.g., it may contain the user voice signal u and the further audio signals R. Possible differences between the input signal I and the secondary input signal J may be caused by the different positions of the recording unit 5 and the secondary recording unit 15 and/or their different recording quality.
[0094] In the following, the input signal I and the secondary input signal J are processed in parallel in the hearing device 2 and the secondary device 3. The secondary input signal J is passed to the secondary audio processing unit 16 for a secondary output signal determination step 32. In the secondary output signal determination step 32, the secondary neural network 17 removes the user voice signal u from the secondary input signal J in a user voice signal removal step 33. The remaining audio signals R are denoised in a denoising step 34 using the secondary neural network 17. In other embodiments, the user voice signal removal step 33 and the denoising step 34 can be executed in parallel by the secondary neural network 17. In further embodiments, the user voice signal removal step 33 and the denoising step 34 can be subsequently performed by two different secondary neural networks.
[0095] The denoised remaining audio signals are transmitted as the secondary output signal P to the hearing device 2 in a transmission step 35.
[0096] The audio processing unit 6 of the hearing device 2 performs an output signal determination step 36. In the output signal determination step 36, the neural network 8 is used to separate the user voice signal u from the input signal I in a user voice signal separation step 37. After the user voice signal separation step 37, the user voice signal u is combined with the secondary output signal P, which has been received from the secondary device 3, in a combination step 38. In the combination step 38, the user voice signal u and the denoised secondary output signal P can be mixed with varying amplitudes in order to adapt the output signal O to the preferences of the user U. The output signal O contains the user voice signal u and the secondary output signal P. The output signal O is transferred to the playback unit 7. The output signal O is played back to the user U in the form of the processed sound S′ in a playback step 39.
[0097] Since the user voice signal u and the secondary output signal P can be amplified before being combined, the user can choose how loud the user voice signal is in respect to the remaining audio signals R. In particular, the user can choose that the user voice signal u is not being played back to him.
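The combination step described above amounts to a gain-weighted mix. A minimal sketch, for illustration only (signal values and gain names are placeholders, not part of the disclosure):

```python
import numpy as np

def mix_output(u, P, gain_u: float = 1.0, gain_p: float = 1.0):
    # combine the user voice signal u and the secondary output P with
    # user-adjustable gains; gain_u = 0 suppresses playback of the own voice
    return gain_u * np.asarray(u) + gain_p * np.asarray(P)

u = np.array([0.2, -0.1, 0.3])     # toy user-voice frames
P = np.array([0.05, 0.00, -0.02])  # toy secondary-output frames

O_default = mix_output(u, P)               # own voice included
O_no_own = mix_output(u, P, gain_u=0.0)    # own voice suppressed
```

Setting `gain_u` to zero corresponds to the case where the user chooses not to have his own voice played back.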
[0098] In the above described operation mode 28 of the audio processing step 27, the user voice signal u as well as the rest of the audio signals R are processed by neural networks, i.e. the neural network 8 and the secondary neural network 17, respectively. Processing the user voice signal u directly on the hearing device 2 has the advantage that the processed user voice signal u does not have to be transferred from the secondary device 3 to the hearing device 2.
[0099] Hence, the user voice signal can be processed and played back to the user with low latency. Disturbing echoing effects, which occur when the user hears both his own voice and the processed version of his own voice, are thereby avoided. At the same time, the rest of the audio signals R are denoised using the secondary neural network 17 on the secondary device 3, which ensures optimum quality of the output signal O and the processed sound S′. Processing the rest of the audio signals R on the secondary device 3 requires transmitting the secondary output signal P from the secondary device 3 to the hearing device 2. This increases the latency with which the rest of the audio signals R are played back to the user. However, since the echoing effect is less pronounced for audio signals which do not correspond to the user's voice, the increased latency of the playback of the rest of the audio signals does not disturb the user.
[0100] In this regard, it is important to mention that the audio processing step 27 is a continuous process in which the input signal I and the secondary input signal J are permanently recorded and processed. Due to the lower latency of the processing of the user voice signal u, the processed user voice signal u is combined with a secondary output signal P which corresponds to audio signals R which have been recorded slightly earlier than the user voice signal u.
[0101] In total, the latency, with which the user voice signal u is played back to the user, is 50 ms or less, in particular 25 ms or less, in particular 20 ms or less, in particular 15 ms or less, in particular 10 ms or less.
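The asymmetric latency of the two paths can be modelled with a simple delay line: the low-latency user-voice frame is always mixed with the most recently available secondary frame, which was recorded some frames earlier. This is an illustrative sketch only; the class name, frame granularity, and delay value are assumptions, not the disclosed mechanism:

```python
from collections import deque

class DelayedMixer:
    """Mix the low-latency user-voice path with the higher-latency
    secondary path. Secondary frames arrive delay_frames later, so each
    user-voice frame is combined with a secondary frame recorded
    slightly earlier (illustrative values)."""

    def __init__(self, delay_frames: int):
        # pre-fill with silence to model the transmission delay
        self.secondary = deque([0.0] * delay_frames)

    def process(self, user_frame: float, secondary_frame: float) -> float:
        self.secondary.append(secondary_frame)
        delayed = self.secondary.popleft()  # frame recorded delay_frames earlier
        return user_frame + delayed

mixer = DelayedMixer(delay_frames=2)
out = [mixer.process(u, p) for u, p in zip([1, 2, 3, 4], [10, 20, 30, 40])]
# the first outputs contain silence on the secondary path until its frames arrive
```

This reflects the continuous processing described in paragraph [0100]: the current user voice signal u is combined with a secondary output signal P corresponding to slightly earlier audio.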
[0102] In the operation mode 28 shown in
[0103] With reference to
[0105] In the output determining step 36a the input signal I is duplicated. One duplicate of the input signal I is processed in the user voice signal separation step 37 by the neural network 8. The user voice signal separation step 37 returns the user voice signal u in high quality. In parallel, a copy of the input signal I is classically denoised in a classical denoising step 40 using the classical audio signal processing means 9. The denoised input signal I′ is combined with the user voice signal u in a combination step 38a. The output signal O hence contains the user voice signal u and the classically denoised input signal I′. In operation mode 28a the neural network 8 and the classical audio signal processing means 9 are run in parallel by the audio processing unit 6. However, the output signal O contains the high-quality user voice signal and the entire classically denoised input signal I′ which itself contains the user voice signal u with less quality.
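Operation mode 28a can be sketched as follows, purely for illustration: the input is duplicated, one copy goes through voice separation (here an injected placeholder callable, since the trained network 8 is not reproduced), the other through a classical denoiser (a moving average standing in for DSP denoising), and the two results are summed:

```python
import numpy as np

def classical_denoise(x: np.ndarray, kernel: int = 5) -> np.ndarray:
    # moving-average smoothing as a stand-in for classical DSP denoising
    k = np.ones(kernel) / kernel
    return np.convolve(x, k, mode="same")

def output_mode_28a(I: np.ndarray, separate_voice) -> np.ndarray:
    # duplicate the input: one copy through neural voice separation,
    # one copy through classical denoising, then combine
    u = separate_voice(I)              # high-quality user voice
    I_denoised = classical_denoise(I)  # classically denoised full input
    return u + I_denoised

rng = np.random.default_rng(2)
I = np.sin(np.linspace(0, 4 * np.pi, 200)) + 0.05 * rng.standard_normal(200)

# placeholder separator: a real system would use the trained network 8
O = output_mode_28a(I, separate_voice=lambda x: 0.5 * x)
```

As the paragraph notes, the output in this mode contains the user voice twice: once in high quality from the separation path and once, at lower quality, inside the classically denoised copy of the full input.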
[0107] In
[0108] In another operation mode, which is not shown in the figures, the output signal determination step 36 is performed without using the neural network 8. The neural network 8 may be temporarily deactivated, e.g., when the input signal I does not comprise the user voice signal u. In these use cases the neural network 8 is deactivated and the input signal I is simply processed by the classical audio signal processing means 9. This operation mode might also be used to save energy, in particular when the charging state of the power supply 4 is low.
[0109] In a variant of the above-described operation modes, the output signal determination step comprises an additional pre-processing step for pre-processing the input signal I. In the pre-processing step, the hearing device 2 can use sensor data of the sensor 10 in order to measure whether the user voice signal u is present. To do so, the sensor 10 measures vibrations caused by the user speaking. Alternatively, the presence of the user voice signal u can be measured using the relative loudness of the user's voice with respect to other audio signals.
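The loudness-based detection can be sketched as a simple energy threshold, for illustration only (the threshold ratio and frame values are placeholders; the disclosed device additionally uses the vibration sensor 10):

```python
import numpy as np

def voice_present(frame: np.ndarray, noise_floor: float, ratio: float = 2.0) -> bool:
    # flag the user's voice as present when the frame's RMS level exceeds
    # the estimated noise floor by a given ratio; a stand-in for the
    # relative-loudness measurement described above
    rms = np.sqrt(np.mean(frame ** 2))
    return bool(rms > ratio * noise_floor)

# toy frames: background noise alone vs. noise plus a strong voice-like tone
quiet = 0.01 * np.random.default_rng(3).standard_normal(160)
loud = quiet + 0.5 * np.sin(np.linspace(0, 10, 160))
```

Such a pre-processing decision could be used to skip the neural separation entirely when no user voice is present, consistent with the energy-saving operation mode of paragraph [0108].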
[0110] The different operation modes can be chosen by the user U, e.g., by a command input via the user interface 20. This way the user can choose whether or not he wants his own voice to be played back to him. Further, the user can choose the quality with which the remaining audio signals R are denoised, in particular whether the remaining audio signals R are denoised using the secondary neural network 17 of the secondary device 3 or the classical audio signal processing means 9 of the hearing device 2.
[0111] The system 1 can also automatically change between the different operation modes. For example, the hearing device 2 will automatically use one of the operation modes 28a, 28b, 28c discussed with reference to
[0112] In further embodiments which are not shown in the figures, the system comprises more than one hearing device, in particular two hearing devices.