VOICE GRAFTING USING MACHINE LEARNING
20230154450 · 2023-05-18
Inventors
CPC classification
G10L15/25
PHYSICS
A61F2/20
HUMAN NECESSITIES
G10L21/00
PHYSICS
A61F2002/206
HUMAN NECESSITIES
International classification
G10L13/033
PHYSICS
G10L15/25
PHYSICS
Abstract
A process labeled “voice grafting” can be understood in terms of the source-filter model of speech production as follows: For a patient who has partially or completely lost the ability to phonate, but retained at least a partial ability to articulate, the techniques described herein computationally “graft” the patient's time-varying filter function, i.e. articulation, onto a source function, i.e. phonation, which is based on the speech output of one or more healthy speakers, in order to synthesize natural-sounding speech in real time.
Claims
1. A method for creating an artificial voice for a patient with missing or impaired phonation but at least residual articulation function, wherein an acoustic signal of one or more healthy speakers reading a known body of text out loud is recorded, at least one vocal tract signal of the patient mouthing the same known body of text is recorded, the acoustic signal and the at least one vocal tract signal are used to train a machine learning algorithm, and the machine learning algorithm is used in an electronic voice prosthesis measuring the patient's at least one vocal tract signal and converting it to an acoustic speech output in real time.
2. The method according to claim 1, wherein the patient is on mechanical ventilation, has undergone a partial or complete laryngectomy, or suffers from vocal fold paresis or paralysis.
3. The method according to claim 1, wherein at least one of the one or more healthy speakers is identical with the patient prior to impairment.
4. The method according to claim 1, wherein the one or more healthy speakers comprise a plurality of healthy speakers with different voice characteristics and a particular voice is chosen for the patient based on the patient's gender, age, natural pitch, other vocal characteristics prior to the impairment, and/or preferences.
5. The method according to claim 1, wherein the acoustic signal of the one or more healthy speakers and the at least one vocal tract signal of the patient are synchronized.
6-8. (canceled)
9. The method according to claim 1, wherein the machine learning algorithm is a convolutional neural network and wherein the convolutional neural network is trained to directly convert the recorded vocal tract signal to the acoustic speech output.
10. (canceled)
11. The method according to claim 1, wherein the machine learning algorithm is a convolutional neural network, and wherein the convolutional neural network is trained to convert the recorded vocal tract signal to elements of speech, such as phonemes, syllables or words, which are then synthesized to the acoustic speech output.
12. The method according to claim 1, wherein the machine learning algorithm is a convolutional neural network, and wherein the convolutional neural network is pre-trained based on the one or more healthy speakers and further impaired patients and re-trained for the patient.
13. The method according to claim 1, wherein the at least one vocal tract signal comprises an electromagnetic signal in the radio frequency range, optionally recorded using a radar transceiver.
14. The method according to claim 13, wherein electromagnetic waves in the frequency range of 1 kHz to 12 GHz, optionally microwaves between 1 GHz and 10 GHz are emitted, and reflected and/or transmitted and/or otherwise influenced waves are received using one or more antennas in contact with or proximity to the patient's skin.
15. (canceled)
16. The method according to claim 1, wherein the at least one vocal tract signal comprises one or more images of the patient's lips and/or face, recorded using a camera sensor.
17. (canceled)
18. The method according to claim 1, wherein the at least one vocal tract signal comprises a patient's residual voice output, measured using an acoustic microphone.
19-21. (canceled)
22. The method according to claim 1, wherein the at least one vocal tract signal comprises one or more ultrasound signals, and wherein low frequency ultrasound waves in the range between 20 and 100 kHz are emitted using a loudspeaker in contact with or in proximity to the patient's skin or near the patient's mouth and detected using a microphone.
23. The method according to claim 1, wherein the electronic voice prosthesis comprises a mobile computing device, and wherein the mobile computing device is a smart phone or a tablet carrying out the conversion of the at least one vocal tract signal to the acoustic speech output locally on the device.
24. (canceled)
25. The method according to claim 1, wherein the electronic voice prosthesis comprises a mobile computing device, and wherein the mobile computing device is a smart phone or a tablet connected to the internet and the conversion of the at least one vocal tract signal to the acoustic speech output is carried out on a remote computing platform.
26. (canceled)
27. The method according to claim 25, wherein the at least one vocal tract signal comprises one or more images of the patient's lips and/or face, recorded using a built-in camera sensor of the mobile computing device.
28-29. (canceled)
30. A device for a patient with missing or impaired phonation but at least residual articulation function, wherein the device is configured to measure at least one vocal tract signal of the patient and to convert it to an acoustic speech output in real time using a machine learning algorithm, the machine learning algorithm having been trained with data that includes an acoustic signal of one or more healthy persons reading a body of text out loud and at least one vocal tract signal of one or more persons mouthing the same body of text.
31. The method according to claim 23, wherein the at least one vocal tract signal comprises one or more images of the patient's lips and/or face, recorded using a built-in camera sensor of the mobile computing device.
32. The method according to claim 23, wherein the mobile computing device is connected to one or more external sensors via a wireless interface, the one or more external sensors configured to record one or more of the at least one vocal tract signal.
33. The method according to claim 25, wherein the mobile computing device is connected to one or more external sensors via a wireless interface, the one or more external sensors configured to record one or more of the at least one vocal tract signal.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0039] FIG. 1 shows a schematic of the anatomy relevant to physiologic voice production and its impairments.
[0040] FIG. 2 shows a schematic of different causes of aphonia.
[0041] FIG. 3 shows a schematic of different voice rehabilitation options.
[0042] FIG. 4 shows a schematic of an example implementation.
[0043] FIG. 5 shows a schematic of different implementation options for vocal tract sensors.
[0044] FIG. 6 shows a schematic of different implementation options for processing vocal tract signals.
[0045] FIG. 7 shows a schematic of a voice prosthesis according to preferred embodiment 1 (radar and video based method for bedridden patients).
[0046] FIG. 8 shows a schematic of a voice prosthesis according to preferred embodiment 2 (radar and video based method for mobile patients).
[0047] FIG. 9 shows a schematic of a voice prosthesis according to preferred embodiment 3 (low-frequency ultrasound and video based method for mobile patients).
[0048] FIG. 10 shows a schematic of a voice prosthesis according to preferred embodiment 4 (audio and video based method for mobile patients with residual voice).
DETAILED DESCRIPTION OF EMBODIMENTS
[0049] Hereinafter, techniques of generating a speech output based on a residual articulation of the patient (voice grafting) are described. Various techniques are based on the finding that prior-art implementations of speech rehabilitation face certain restrictions and drawbacks. For example, for many patients they do not achieve the objective of restoring natural-sounding speech. Esophageal speech and speaking with the help of a speaking valve, voice prosthesis, or electrolarynx are difficult to learn for some patients and often result in distorted, unnatural speech. The need to hold and activate an electrolarynx device or to cover a tracheostoma or valve opening with a finger makes these solutions cumbersome and obtrusive. In-dwelling prostheses also carry the risk of fungal infections.
[0050] For example, details with respect to the electrolarynx are described in: Kaye, Rachel, Christopher G. Tang, and Catherine F. Sinclair. “The electrolarynx: voice restoration after total laryngectomy.” Medical Devices (Auckland, NZ) 10 (2017): 133.
[0051] For example, details with respect to a speaking valve are described in: Passy, Victor, et al. “Passy-Muir tracheostomy speaking valve on ventilator-dependent patients.” The Laryngoscope 103.6 (1993): 653-658. Also, see Kress, P., et al. “Are modern voice prostheses better? A lifetime comparison of 749 voice prostheses.” European Archives of Oto-Rhino-Laryngology 271.1 (2014): 133-140.
[0052] An overview of traditional voice prostheses is provided by: Reutter, Sabine. Prothetische Stimmrehabilitation nach totaler Kehlkopfentfernung-eine historische Abhandlung seit Billroth (1873). Diss. Universität Ulm, 2008.
[0053] Many different implementations of voice grafting are possible. One implementation is shown schematically in FIG. 4.
[0054] Step 1: Creating training data, i.e., training phase. A healthy speaker (25) providing the “target voice” reads a sample text (24) (also labelled reference text) out loud, while the voice output is recorded with a microphone (26). This can create one or more reference audio signals, or simply audio training data. The resulting audio training data (27), including the text and its audio recording, can be thought of as an “audio book”; in fact, it can also be an existing audio book recording or any other available speech corpus. (Step 1a, FIG. 4(a)).
[0055] The same text is then “read” by a patient with impaired phonation (29), while signals characterizing the patient's time-varying vocal tract configuration, i.e. the articulation, are recorded with suitable sensors (30), yielding vocal tract training data (31). If the patient is completely aphonic, “reading” the text here means silently “mouthing” it. To record the articulation, various options exist. For example, the patient's vocal tract is probed using electromagnetic and/or acoustic waves, and backscattered and/or transmitted waves are measured. The measurement setup is collectively referred to as the “vocal tract sensors” and the measured signals as “vocal tract signals”. (Step 1b, FIG. 4(b)).
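For purely illustrative purposes, the following minimal Python sketch shows one way the paired training material of Steps 1a and 1b could be organized in software; the names (TrainingUtterance, durations_s) and the assumption that the recordings are held as NumPy arrays are illustrative and not part of the described techniques.

import numpy as np
from dataclasses import dataclass

@dataclass
class TrainingUtterance:
    """One excerpt of the sample/reference text (24) with its paired recordings."""
    text: str                  # excerpt of the reference text
    audio: np.ndarray          # audio training data (27) of the healthy speaker, shape (n_samples,)
    audio_rate: int            # audio sampling rate in Hz
    vocal_tract: np.ndarray    # vocal tract training data (31), shape (n_frames, n_channels)
    frame_rate: float          # vocal tract sensor frame rate in Hz

def durations_s(utt: TrainingUtterance) -> tuple:
    """Durations of both recordings; they generally differ until Step 1c synchronizes them."""
    return (len(utt.audio) / utt.audio_rate, len(utt.vocal_tract) / utt.frame_rate)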
[0057] Step 2: Training the algorithm (FIG. 4(d)). The audio training data (27) and the synchronized vocal tract training data (31) are used to train a machine learning algorithm (32).
[0060] Step 3: Using the voice prosthesis (FIG. 4(e)). The trained machine learning algorithm (32) is deployed in an electronic voice prosthesis: the impaired patient's (29) vocal tract signals are measured with the vocal tract sensors (30) and converted, e.g. on a mobile computing device (34) connected via a wireless connection (33), into acoustic speech output (35) in real time.
[0061] Thus, as a general rule, a corresponding method can include receiving one or more live vocal-tract signals of the patient and then converting the one or more live vocal-tract signals into one or more associated live audio signals including speech output, based on the machine learning algorithm.
[0062] For each step described above, a wide range of implementations is possible, which will be described below.
[0063] Step 1a: Creating the audio training data. The audio training data representing the “healthy voice” can come from a range of different sources. In the most straightforward implementation it is the voice of a single healthy speaker. The training data set can also come from multiple speakers. In one implementation, a library of recordings of speakers with different vocal characteristics could be used. In the training step, training data of a matching target voice would be chosen for the impaired patient. The matching could happen based on gender, age, pitch, accent, dialect, and other characteristics, e.g., defined by a respective patient dataset.
[0064] Thus, as a general rule, it would be possible to train multiple configurations of the machine learning algorithm, using multiple speech outputs having various speech characteristics and/or using multiple articulations of the reference text having various articulation characteristics.
[0065] The method could further include selecting a configuration from the plurality of configurations of the machine learning algorithm based on a patient dataset indicative of demographic and phonetic characteristics of the patient.
[0066] The speech characteristics may specify characteristics of the speech output. Example speech characteristics include: pitch; gender; accent; age; etc. Accordingly, it would be possible that the known body of text has been pre-recorded with a plurality of healthy speakers with different voice characteristics. The articulation characteristics can specify characteristics of the articulation and/or its sensing. Example articulation characteristics include: type of vocal tract impairment; type of sensor technology used for recording the one or more vocal-tract signals; etc.
[0067] The patient dataset may specify speech characteristics of the patient and/or articulation characteristics of the patient. Thereby, a tailored configuration of the machine learning algorithm can be selected, providing an appropriate speech output based on the specific articulation of the patient.
[0068] The audio training data can either be custom-generated for a specific patient, or serve for a range of patients, or it can be a pre-existing database of recordings. Several such databases are available, for example through the Bavarian Archive for Speech Signals or through OpenSLR. A custom voice sample could also be matched to the types of conversations the patient is likely to have. This may be especially advantageous with very severely handicapped patients for whom the therapy goal is to reliably communicate with a limited vocabulary.
[0069] A preferred voice sample would be of the patient's own voice. This requires a recording of a sufficient body of text of the patient's original voice prior to injury or surgery, to be used as a training data set for the algorithm. For example, in cases of total laryngectomy it is conceivable that the patient's voice gets extensively recorded before the surgery. In such an implementation, the recording of the audio training data (Step 1a) and the recording of the corresponding vocal tract signals (Step 1b) can occur concurrently.
[0070] Thus, it would also be possible that the speech output of the reference text, included in the one or more reference audio signals, is provided by the patient prior to the impairment. Accordingly, the healthy speaker could be identical with the impaired patient, prior to the impairment. Thereby, a particularly accurate training of the machine learning algorithm and a unique speech output tailored to the patient can be provided.
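For illustration of how a matching target voice could be selected from such a library, the following Python sketch scores hypothetical library entries against a patient dataset (gender, age, pitch); the names VoiceConfig, PatientProfile and select_voice, as well as the weighting, are assumptions and not part of the described techniques.

from dataclasses import dataclass

@dataclass
class VoiceConfig:
    name: str
    gender: str       # e.g. "female" or "male"
    age: int          # typical speaker age in years
    pitch_hz: float   # mean fundamental frequency of the target voice

@dataclass
class PatientProfile:
    gender: str
    age: int
    pitch_hz: float   # e.g. estimated from pre-impairment recordings

def select_voice(library: list, patient: PatientProfile) -> VoiceConfig:
    """Pick the library voice whose characteristics best match the patient dataset."""
    def mismatch(v: VoiceConfig) -> float:
        gender_penalty = 0.0 if v.gender == patient.gender else 10.0
        return gender_penalty + abs(v.age - patient.age) / 10.0 + abs(v.pitch_hz - patient.pitch_hz) / 20.0
    return min(library, key=mismatch)

library = [VoiceConfig("voice_a", "female", 35, 210.0), VoiceConfig("voice_b", "male", 60, 120.0)]
print(select_voice(library, PatientProfile("male", 58, 115.0)).name)   # -> voice_b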
[0071] Step 1b: Creating the vocal tract training data. The vocal tract training data can also come from a range of different sources. In the most straightforward implementation it comes from a single person, the same impaired patient who will use the voice prosthesis in Step 3. The vocal tract signal training data can also come from multiple persons, who do not all have to be impaired patients. For example, the training can be performed in two steps: a first training step can train the machine learning algorithm with a large body of training data consisting of audio data from multiple healthy speakers and vocal tract data from multiple healthy and/or impaired persons. A second training step can then re-train the network with a smaller body of training data containing the vocal tract signals of the impaired patient who will use the voice prosthesis.
[0072] Thus, as a general rule, it would be possible that the machine learning algorithm is pre-trained based on a plurality of healthy speakers and impaired patients, and then re-trained for a particular impaired patient. I.e., it would be possible that the one or more vocal-tract signals based on which the training of the machine learning algorithm is executed are at least partially associated with the patient (and optionally also partially associated with one or more other persons, as already described above).
[0073] To measure articulatory movement, a range of different electromagnetic or acoustic sensors, or both, can be used to probe the vocal tract and characterize its time-varying shape, as shown schematically in FIG. 5.
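For illustration of the two-step training described in paragraphs [0071] and [0072], a minimal PyTorch-style sketch is given below; the tiny network, the feature dimensions and the (commented-out) data loaders are placeholders, assumed only for this example, and do not represent the actual architecture.

import torch
import torch.nn as nn

def make_model(n_in: int, n_out: int) -> nn.Module:
    # Placeholder network standing in for whatever DNN is actually used.
    return nn.Sequential(nn.Linear(n_in, 256), nn.ReLU(), nn.Linear(256, n_out))

def train(model, loader, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for vocal_tract, audio_target in loader:       # synchronized, paired frames
            opt.zero_grad()
            loss = loss_fn(model(vocal_tract), audio_target)
            loss.backward()
            opt.step()
    return model

model = make_model(n_in=64, n_out=13)                  # e.g. 64 sensor features -> 13 acoustic parameters
# First training step: pre-train on a large multi-person corpus.
# model = train(model, multi_speaker_loader, epochs=50, lr=1e-3)
# Second training step: re-train on the individual patient's smaller data set.
# model = train(model, patient_loader, epochs=10, lr=1e-4)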
[0082] Step 1c: Synchronizing audio and the vocal tract training data. As a general rule, the method may further include synchronizing a timing of the one or more reference audio signals with a further timing of the one or more vocal-tract signals. By synchronizing the timings, an accurate training of the machine learning algorithm is enabled. The timing of a signal can correspond to the duration between corresponding information content. For example, the timing of the one or more reference audio signals may be characterized by the time duration required to cover a certain fraction of the reference text by the speech output; similarly, the further timing of the one or more vocal-tract signals can correspond to the time duration required to cover a certain fraction of the reference text by the articulation.
[0083] Time synchronization between the audio training data and the vocal tract training data can be achieved in a variety of different ways, either during recording or afterwards. If the two data sets are acquired at the same time from the same subject, synchronization is achieved automatically.
[0084] If the two data sets are acquired consecutively, synchronization can be accomplished by providing visual or auditory cues to the subject recording the vocal tract training data. This can be done, for example, by displaying the sample text to be recorded on a screen, with a cursor moving along at the speed of the audio recording of the target voice, or by quietly playing back that recording. In each case, the subject whose vocal tract signals are recorded aims to match the given speed. Thus, for example, it would be possible to control a human-machine-interface (HMI) to provide temporal guidance to the patient when articulating the reference text in accordance with the timing of the one or more reference audio signals. For example, it would be possible that the synchronization is achieved by providing optical or acoustic cues to the impaired patient while the vocal-tract signals are being recorded.
[0085] Alternatively or additionally, it would also be possible that said synchronizing includes controlling the HMI to obtain a temporal guidance from the patient when articulating the reference text. For example, the impaired patient could provide synchronization information by pointing at the part of the text being articulated, while the one or more vocal-tract signals are being recorded. Gesture detection or eye tracking may be employed. The position of an electronic pointer, e.g., a mouse cursor, could be analyzed. A third approach is to synchronize the audio training data and the vocal tract data computationally after recording, by selectively slowing down or speeding up one of the two data recordings. This requires both data streams to be annotated with timing cues. For the vocal tract signals, the subject recording the training set can provide these timing cues themselves by moving an input device, such as a mouse or a stylus, through the text at the speed of his or her reading. For the audio training data, the timing cues can be generated in a similar way, or by manually annotating the data after recording, or with the help of state-of-the-art speech recognition software. Thus, as a general rule, it would be possible that said synchronizing includes postprocessing at least one of the reference audio signals and the vocal-tract signals by changing a respective timing. In other words, it would be possible that said synchronizing is implemented electronically after recording of the one or more reference audio signals and/or the one or more vocal-tract signals, e.g., by selectively speeding up/accelerating or slowing down/decelerating the one or more reference audio signals and/or the one or more vocal-tract signals.
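For illustration of the computational post-hoc synchronization based on timing cues, the following NumPy sketch piecewise-linearly warps a vocal tract frame sequence onto the timeline of the reference audio; the function name warp_to_audio_timeline and the example numbers are assumptions for this sketch only.

import numpy as np

def warp_to_audio_timeline(frames: np.ndarray, frame_rate: float,
                           cues_vt_s: np.ndarray, cues_audio_s: np.ndarray,
                           audio_duration_s: float) -> np.ndarray:
    """frames: (n_frames, n_channels) vocal tract data; cues_*: times (in seconds) at which
    the same text positions are reached in the vocal tract and audio recordings."""
    n_out = int(round(audio_duration_s * frame_rate))
    t_audio = np.arange(n_out) / frame_rate
    # Map each instant on the audio timeline to the corresponding vocal tract instant.
    t_vt = np.interp(t_audio, cues_audio_s, cues_vt_s)
    src_idx = np.clip((t_vt * frame_rate).round().astype(int), 0, len(frames) - 1)
    return frames[src_idx]

frames = np.random.randn(500, 8)             # 10 s of vocal tract data at 50 frames per second
cues_vt = np.array([0.0, 5.0, 10.0])          # cue times in the vocal tract recording
cues_audio = np.array([0.0, 6.0, 12.0])       # same text positions in the reference audio
warped = warp_to_audio_timeline(frames, 50.0, cues_vt, cues_audio, audio_duration_s=12.0)
print(warped.shape)                           # (600, 8)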
[0086] Step 2: Training the machine learning algorithm. A range of different algorithms commonly used in speech recognition and speech synthesis can be adapted to the task of transforming vocal tract signals into acoustic speech output. The transformation can either be done via intermediate representations of speech, or end-to-end, omitting any explicit intermediate steps. Intermediate representations can be, for example, elements of speech such as phonemes, syllables, or words, or acoustic speech parameters such as mel-frequency cepstral coefficients (MFCCs).
[0087] Different options for transforming input vocal tract signals into acoustic speech output are illustrated in FIG. 6.
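For illustration of the variant that uses MFCCs as an intermediate representation (cf. FIG. 6(b)), a minimal PyTorch sketch of a small convolutional network mapping pre-processed vocal tract feature frames to MFCC frames is given below; a separate acoustic waveform synthesis step would then produce the speech waveform. The architecture and dimensions are illustrative assumptions only, not the actual network.

import torch
import torch.nn as nn

class VocalTractToMFCC(nn.Module):
    def __init__(self, n_features: int = 64, n_mfcc: int = 13):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_features, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, n_mfcc, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_features) -> (batch, time, n_mfcc)
        return self.net(x.transpose(1, 2)).transpose(1, 2)

model = VocalTractToMFCC()
dummy = torch.randn(2, 100, 64)   # 2 utterances, 100 frames, 64 vocal tract features each
print(model(dummy).shape)         # torch.Size([2, 100, 13])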
[0095] Step 3: Using the voice prosthesis. The trained neural network can then be used to realize an electronic voice prosthesis, a medical device that can alternatively be referred to as a “voice graft”, an “artificial voice”, or a “voice transplant”. A wide range of implementations is possible for the voice prosthesis. In practice, the choice will depend on the target patient scenario, i.e. the type of vocal impairment, the patient's residual vocal capabilities, the therapy goals, and aspects of the target setting, such as in-hospital vs. at home, or bedridden vs. mobile.
[0096] Four elements can interact to provide a voice prosthesis: vocal tract sensors, preferably light-weight, compact and unobtrusive; a computing device, ideally mobile and wirelessly connected; the machine learning algorithm; and an acoustic output device, ideally unobtrusive but in proximity to the patient. Next, implementation choices for each of these elements are discussed.
[0097] The first element is the set of vocal tract sensors. A wide range of implementation options was discussed in the section “Step 1b: Creating the vocal tract training data”, above. The choice and placement of sensors during algorithm training and use of the voice prosthesis are typically the same. The optimal choice of sensors depends on the patient scenario. Electromagnetic or ultrasonic sensing of the vocal tract is chosen based on the reliability with which elements of speech can be recognized for the target patient type and setting. An auxiliary lip and facial camera will be advantageous in many scenarios to increase the reliability of recognition. In scenarios where the patient has any residual vocal output, such as residual phonation or the ability to whisper, a microphone will be an advantageous sensor modality. If the extrinsic laryngeal muscles and neck musculature are active, surface EMG can be an advantageous auxiliary sensor. Sensors should be light-weight and compact so as not to impede the patient's movement and articulation. In most mobile and at-home settings, unobtrusiveness will be an aspect of emotional and social importance to patients.
[0098] The second element of the voice prosthesis is a local computing device. It provides the computing power to carry out the trained machine learning algorithm or connects with a cloud-based computing platform where the algorithm is deployed, connects the algorithm with the acoustic output device, i.e. the loudspeaker, and provides a user interface. The requirements for portability and connectivity of the computing device depend on the patient scenario: for use at home, a mobile computing device is preferred. Compactness, affordability, and easy usability make a smartphone or tablet a preferred choice. It helps that a smartphone is not perceived as a prosthetic device, but as an item of daily use. In an ICU setting, by contrast, compactness and unobtrusiveness play a lesser role and the computing device can be integrated into a bedside unit. For home settings, a wireless connection, such as Bluetooth, between the sensor and the computing device will be desirable. In an ICU setting, on the other hand, wired connections are more acceptable. Thus, as a general rule, it is possible that the conversion of the one or more vocal-tract signals is locally implemented on at least one mobile computing device of the patient, or is remotely implemented using cloud-computing.
[0099] The third element is the deployed trained machine learning algorithm. Depending on the needed computing power and the available transmission bandwidth, it can be deployed on the local computing device, or remotely, i.e. in the cloud. In a mobile, smartphone-based implementation, cloud deployment of the algorithm can be advantageous. In a stationary bedside setting, the algorithm can run locally. A wide range of algorithm types known from the fields of speech recognition and speech synthesis was discussed in the section “Step 2: Training the algorithm”, above. The choice of algorithm depends on the type and number of sensors, the amount of training data available, and the degree to which the speech output needs to be customized to an individual patient. Generally, thanks to progress in neural network architecture and the increasing availability of computing power, end-to-end DNNs are becoming an increasingly attractive choice.
[0100] The fourth element is an acoustic output device, for example a loudspeaker. Ideally, this loudspeaker is both unobtrusive and in proximity to the patient's mouth, to make for a natural appearance of the artificial voice output. The closest proximity can be achieved by integrating a loudspeaker in the sensor unit, located at the patient's throat, under the mandible, or on a headset cantilever in front of the patient's face. Alternatively, a simpler solution for smartphone-based implementations would be to use the loudspeaker output of the smartphone.
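For illustration of how the four elements interact at run time, the following Python sketch shows a stubbed real-time loop of the voice prosthesis: acquire a vocal tract frame, convert it (locally or via the cloud), and play the resulting audio chunk. All function names, rates and dimensions are placeholders and not part of the described techniques.

import numpy as np

FRAME_RATE = 50          # vocal tract frames per second (assumption)
AUDIO_RATE = 16000       # output audio sampling rate in Hz (assumption)

def read_vocal_tract_frame() -> np.ndarray:
    """Stub: one pre-processed frame from the vocal tract sensors (30)."""
    return np.zeros(64, dtype=np.float32)

def synthesize_audio(frame: np.ndarray) -> np.ndarray:
    """Stub for the trained algorithm (32): one frame in, one audio chunk out."""
    return np.zeros(AUDIO_RATE // FRAME_RATE, dtype=np.float32)

def play_audio(chunk: np.ndarray) -> None:
    """Stub: hand the chunk to the audio amplifier / loudspeaker (63)."""
    pass

def run_prosthesis(n_frames: int = 500) -> None:
    for _ in range(n_frames):               # in practice: until the user stops the device
        frame = read_vocal_tract_frame()    # sensor acquisition
        chunk = synthesize_audio(frame)     # local or cloud-based conversion
        play_audio(chunk)                   # low-latency playback near the patient's mouth

run_prosthesis()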
[0101] Based on the range of the implementation options for each step above, a wide range of embodiments of the techniques described herein is possible. We describe four preferred embodiments of the invention for different patient scenarios. It is understood that combinations of various aspects of these embodiments can also be advantageous in these and other scenarios and that more embodiments of the invention can be generated from the implementation options discussed above. Also, the preferred embodiments described can apply to scenarios other than the ones mentioned in the description.
Preferred Embodiment 1: Radar and Video Based Method for Bedridden Patients
[0102] For a bedridden patient with no laryngeal airflow, such as a patient who is mechanically ventilated through a cuffed tracheostomy tube, embodiment 1 is a preferred embodiment. Such patients generally have no residual voice output and are not able to whisper. Therefore, the combination of radar sensing to obtain robust vocal tract signals and a video camera to capture lip and facial movements is preferred.
[0103] The main elements of the corresponding voice prosthesis are shown in FIG. 7.
[0104] Two or more antennas (36) are used to collect reflected and transmitted radar signals that encode the time-varying vocal tract shape. The antennas are placed in proximity to the patient's vocal tract, e.g. under the right and left jaw bone. To keep their position stable relative to the vocal tract they can be attached directly to the patient's skin as patch antennas. Each antenna can send and receive modulated electromagnetic signals in a frequency band between 1 kHz and 12 GHz, optionally between 1 GHz and 12 GHz, so that (complex) reflection and transmission can be measured. Possible modulations of the signal are: frequency sweep, stepped frequency sweep, pulse, frequency comb, frequency, phase, or amplitude modulation. In addition, a video camera (48) captures a video stream of the patient's face, containing information about the patient's lip and facial movements. The video camera is mounted in front of the patient's face on a cantilever (62) attached to the patient bed. The same cantilever can support the loudspeaker (63) for acoustic speech output.
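For illustration of the stepped frequency sweep modulation mentioned above, the following NumPy sketch estimates a complex reflection coefficient at each stepped frequency by correlating the received signal with the emitted tone; the baseband frequencies, durations and the simulated reflection are illustrative assumptions, not the parameters of an actual radar front end.

import numpy as np

def complex_amplitude(signal, freq_hz, fs):
    """Project a real narrow-band signal onto exp(-j*2*pi*f*t) to obtain its complex amplitude."""
    t = np.arange(len(signal)) / fs
    return 2.0 * np.mean(signal * np.exp(-2j * np.pi * freq_hz * t))

def stepped_sweep_reflection(freqs_hz, fs, emit_tone, measure_response):
    """Complex reflection coefficient at each stepped frequency of the sweep."""
    coeffs = []
    for f in freqs_hz:
        tx = emit_tone(f)                 # emitted tone at this frequency step
        rx = measure_response(tx, f)      # received (reflected/transmitted) signal
        coeffs.append(complex_amplitude(rx, f, fs) / complex_amplitude(tx, f, fs))
    return np.array(coeffs)

# Illustrative use with a simulated reflection (amplitude 0.3, phase 45 degrees):
fs = 1_000_000.0                          # illustrative baseband sampling rate in Hz
freqs = np.linspace(1_000.0, 5_000.0, 5)  # illustrative stepped baseband frequencies in Hz

def emit_tone(f):
    t = np.arange(int(fs * 0.01)) / fs    # 10 ms tone per frequency step
    return np.cos(2 * np.pi * f * t)

def measure_response(tx, f):
    t = np.arange(len(tx)) / fs
    return 0.3 * np.cos(2 * np.pi * f * t + np.pi / 4)

print(np.round(stepped_sweep_reflection(freqs, fs, emit_tone, measure_response), 3))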
[0105] The computing device (58) contained in the bedside unit (60) locally provides the necessary computing power to receive signals from the signal processing electronics (57) and the video camera (48), run the machine learning algorithm, output acoustic waveforms to the audio amplifier (59), and communicate wirelessly with the portable touchscreen device (61) serving as the user interface. The machine learning algorithm uses a deep neural network to transform the pre-processed radar signals and the stream of video images into an acoustic waveform in real time. The acoustic waveform is sent via the audio amplifier (59) to the loudspeaker (63).
[0106] The corresponding method for creating an artificial voice is as follows. An existing speech database is used to obtain audio training data for multiple target voices with different characteristics such as gender, age, and pitch. To create a corresponding body of vocal tract data, the sample text of the audio training data is read by a number of different speakers without speech impairment while their vocal tract signals are being recorded with the same radar sensor and video camera setup as for the eventual voice prosthesis. As the speakers read the sample text off a display screen, they follow the text along with an input stylus and timing cues are recorded. The timing cues are used to synchronize the vocal tract training data with the audio training data.
[0107] The audio training data sets of different target voices are separately combined with the synchronized vocal tract training data and used to train a deep neural network algorithm to convert radar and video data into the target voice. This results in a number of different DNNs, one for each target voice. The voice prosthesis is pre-equipped with these pre-trained DNNs.
[0108] To deal with the subject-to-subject variation in vocal tract signals, a pre-trained DNN is re-trained for a particular patient before use. To this end, first the pre-trained DNN that best matches the intended voice for the patient is selected. Then, the patient creates a patient-specific set of vocal tract training data, by mouthing an excerpt of the sample text that was used to pre-train the DNNs, while vocal tract data are being recorded. This second vocal tract training data set is synchronized and combined with the corresponding audio sample of the selected target voice. This smaller, patient-specific second set of training data is now used to re-train the DNN. The resulting patient-specific DNN is used in the voice prosthesis to transform the patient's vocal tract signal to voice output with the characteristics of the selected target voice.
Preferred Embodiment 2: Radar and Video Based Method for Mobile Patients
[0109] For a mobile patient with no laryngeal airflow, such as a patient whose larynx has been surgically removed, embodiment 2 is a preferred embodiment. Like the patient in embodiment 1, such patients also have no residual voice output and are not able to whisper. Therefore, the combination of radar sensing and a video camera to capture lip and facial movements is preferred in this case, too.
[0110] The main elements of the corresponding voice prosthesis are shown in FIG. 8.
[0111] As in embodiment 1, two or more antennas (36) are used to collect reflected and transmitted radar signals that encode the time-varying vocal tract shape. The antennas are placed in proximity to the patient's vocal tract, e.g. under the right and left jaw bone. To keep their position stable relative to the vocal tract they can be attached directly to the patient's skin. Each antenna can send and receive modulated electromagnetic signals in a frequency band between 1 kHz and 12 GHz, so that (complex) reflection and transmission can be measured. Possible preferred modulations of the signal are: frequency sweep, stepped frequency sweep, pulse, frequency comb, frequency, phase, or amplitude modulation.
[0112] In addition, a video camera (48) captures a video stream of the patient's face, containing information about the patient's lip and facial movements. For portability the video camera is mounted in front of the patient's face on a cantilever (62) worn by the patient like a microphone headset.
[0113] The portable touchscreen device (61) is also the computing device that locally provides the necessary computing power to receive the processed radar signals and the video images from the wireless transmitter (65), run the machine learning algorithm, output the acoustic speech waveforms via the built-in speaker (63), and provide the user interface on the touchscreen. The machine learning algorithm uses a deep neural network to transform the pre-processed radar signals and the stream of video images into an acoustic waveform in real time.
[0114] The corresponding method for creating an artificial voice is the same as in embodiment 1.
Preferred Embodiment 3: Low-Frequency Ultrasound and Video Based Method for Mobile Patients
[0115] For a mobile patient with no laryngeal airflow, such as a patient whose larynx has been surgically removed, embodiment 3 is an alternative preferred embodiment. Instead of radar sensing, in this embodiment low-frequency ultrasound is used to characterize the time-varying shape of the vocal tract.
[0116] The main elements of the corresponding voice prosthesis are shown in FIG. 9.
[0117] A low-frequency ultrasound loudspeaker (42) is used to emit ultrasound signals in the range of 20 to 30 kHz that are directed at the patient's mouth and nose. The ultrasound signals reflected from the patient's vocal tract are captured by an ultrasound microphone (45). The ultrasound loudspeaker and microphone are mounted in front of the patient's face on a cantilever (62) worn by the patient like a microphone headset.
[0118] With this setup, the complex reflection coefficient can be measured as a function of frequency. The frequency dependence of the reflection or transmission is measured by sending signals in a continuous frequency sweep, or in a series of wave packets with stepwise increasing frequencies, or by sending a short pulse and measuring the impulse response in a time-resolved manner.
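For illustration of how the frequency-dependent reflection could be extracted from a continuous frequency sweep, the following Python sketch (assuming NumPy and SciPy) divides the spectra of the received and emitted signals over the excited band; the sampling rate, sweep range and the simulated reflection are illustrative assumptions only.

import numpy as np
from scipy.signal import chirp, lfilter

fs = 192_000                                         # illustrative sampling rate of the audio front end in Hz
t = np.arange(int(fs * 0.1)) / fs                    # 100 ms sweep
emitted = chirp(t, f0=20_000, t1=t[-1], f1=30_000)   # 20-30 kHz sweep from the ultrasound loudspeaker (42)
received = lfilter([0.2, 0.1], [1.0], emitted)       # stand-in for the signal picked up by the microphone (45)

spectrum_tx = np.fft.rfft(emitted)
spectrum_rx = np.fft.rfft(received)
freqs = np.fft.rfftfreq(len(emitted), d=1.0 / fs)
band = (freqs >= 20_000) & (freqs <= 30_000)         # only the excited band carries information
reflection = spectrum_rx[band] / spectrum_tx[band]   # complex reflection vs. frequency
print(reflection.shape, np.abs(reflection[:3]))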
[0119] In addition, a video camera (48) captures a video stream of the patient's face, containing information about the patient's lip and facial movements. The video camera is mounted on the same cantilever (62) as the ultrasound loudspeaker and microphone.
[0120] As in embodiment 2, the portable touchscreen device (61) is also the computing device. It locally provides the necessary computing power to receive the ultrasound signals converted by the analog-to-digital converter (68) and the video images via the wireless transmitter (65), run the machine learning algorithm, output the acoustic speech waveforms via the built-in speaker (63), and provide the user interface on the touchscreen. The machine learning algorithm uses a DNN to transform the pre-processed ultrasound signals and the stream of video images into an acoustic waveform in real time.
[0121] The corresponding method for creating an artificial voice is the same as in embodiments 1 and 2.
Preferred Embodiment 4: Audio and Video Based Method for Mobile Patients with Residual Voice
[0122] For a mobile patient with residual voice output, such as residual phonation, a whisper voice, or a pure whisper without phonation, embodiment 4 is a preferred embodiment. For such a patient, the combination of an acoustic microphone to pick up the residual voice output and a video camera to capture lip and facial movements is preferred.
[0123] The main elements of the corresponding voice prosthesis are shown in FIG. 10.
[0124] A microphone (52) capturing the acoustic signal of the residual voice and a video camera (48) capturing lip and facial movements are placed in front of the patient's face on a cantilever (62) worn by the patient like a microphone headset. The microphone and camera signals are sent to the computing device (58), which runs the machine learning algorithm and outputs the acoustic speech output via the audio amplifier (59) and a loudspeaker (63) that is also mounted on the cantilever in front of the patient's face. The machine learning algorithm uses a DNN to transform the acoustic and video vocal tract signals into an acoustic waveform in real time.
[0125] The corresponding method for creating an artificial voice differs from the previous embodiments. Since the residual voice depends strongly on the patient's condition and may even change over time, a patient specific DNN algorithm is trained for each patient.
[0126] An existing speech database is used to obtain audio training data for a target voice that matches the patient in characteristics such as gender, age, and pitch. To create a corresponding body of vocal tract data, the sample text of the audio training data is read by the patient with the same microphone and video camera setup as for the eventual voice prosthesis. As the patient reads the sample text off a display screen, he or she follows the text along with an input stylus and timing cues are recorded. The timing cues are used to synchronize the vocal tract training data with the audio training data.
[0127] The combined training data set is used to train the DNN algorithm to transform the patient's vocal tract signals, i.e. residual voice and lip and facial movements, into acoustic speech output. If over time the patient's residual voice output changes enough to degrade the quality of the speech output, the algorithm can be re-trained by recording a new set of vocal tract training data.
[0128] Summarizing, at least the following examples have been described above.
[0129] EXAMPLE 1. A method, comprising: [0130] training a machine learning algorithm based on one or more reference audio signals of a speech output of a reference text, and one or more vocal tract signals associated with an articulation of the reference text by a patient.
[0131] EXAMPLE 2. The method of EXAMPLE 1, [0132] wherein multiple configurations of the machine learning algorithm are trained using at least one of multiple speech outputs having varying speech characteristics, or multiple articulations of the reference text having varying articulation characteristics, [0133] wherein the method further comprises: [0134] selecting a configuration from the multiple configurations of the machine learning algorithm based on a patient dataset of the patient.
[0135] EXAMPLE 3. The method of EXAMPLE 1 or 2, further comprising: [0136] synchronizing a timing of the one or more reference audio signals with a further timing of the one or more vocal tract signals.
[0137] EXAMPLE 4. The method of EXAMPLE 3, wherein said synchronizing comprises: [0138] controlling a human-machine interface to provide temporal guidance to the patient when articulating the reference text in accordance with the timing of the one or more reference audio signals.
[0139] EXAMPLE 5. The method of EXAMPLE 3 or 4, wherein said synchronizing comprises: [0140] controlling a human-machine-interface to obtain a temporal guidance from the patient when articulating the reference text.
[0141] EXAMPLE 6. The method of any one of EXAMPLEs 3 to 5, wherein said synchronizing comprises: [0142] postprocessing at least one of the one or more reference audio signals and the one or more vocal-tract signals by changing a respective timing.
[0143] EXAMPLE 7. The method of any one of EXAMPLEs 1 to 6, [0144] wherein the machine learning algorithm is trained end-to-end to convert a live articulation of the patient to a live speech output.
[0145] EXAMPLE 8. The method of any one of EXAMPLEs 1 to 6, [0146] wherein the machine learning algorithm is trained end-to-end to convert a live articulation of the patient to fragments of a live speech output.
[0147] EXAMPLE 9. The method of any one of the preceding EXAMPLEs, [0148] wherein the one or more reference audio signals and/or the one or more vocal-tract signals are provided by at least one of the patient or one or more other persons.
[0149] EXAMPLE 10. The method of any one of the preceding EXAMPLEs, further comprising: [0150] receiving one or more live vocal tract signals of the patient, and [0151] based on the machine learning algorithm, converting the one or more live vocal-tract signals into associated one or more live audio signals comprising speech output.
[0152] EXAMPLE 11. The method of EXAMPLE 10, [0153] wherein said converting is locally implemented on at least one mobile computing device of the patient, or is remotely implemented using cloud-computing.
[0154] EXAMPLE 12. The method of EXAMPLE 10 or 11, further comprising: [0155] recording at least a part of the one or more live vocal-tract signals using one or more sensors associated with at least one mobile computing device.
[0156] EXAMPLE 13. The method of EXAMPLE 12, wherein the one or more sensors are selected from the group comprising: a lip camera; a facial camera; a headset microphone; an ultrasound transceiver; a neck or larynx surface electromyogram; and a radar transceiver.
[0157] EXAMPLE 14. The method of any one of EXAMPLEs 10 to 13, further comprising: [0158] outputting the one or more live audio signals using a speaker of at least one mobile computing device of the patient.
[0159] EXAMPLE 15. The method of any one of the preceding EXAMPLEs, wherein the patient is on mechanical ventilation through a tracheostomy, has undergone a partial or complete laryngectomy, or suffers from vocal fold paresis or paralysis.
[0160] EXAMPLE 16. The method of EXAMPLE 15, wherein the speech output of the reference text is provided by the patient prior to speech impairment.
[0161] EXAMPLE 17. A device comprising a control circuitry configured to: [0162] receive one or more live vocal tract signals of a patient, [0163] based on a machine learning algorithm, convert the one or more live vocal tract signals into one or more associated live audio signals comprising a speech output, the machine learning algorithm being trained based on one or more reference audio signals of a speech output of a reference text, and one or more reference vocal tract signals of a patient associated with an articulation of the reference text by a patient.
[0164] EXAMPLE 18. The device of EXAMPLE 17, wherein the control circuitry is configured to execute the method of any one of the EXAMPLES 1 to 16.
[0165] Although the invention has been shown and described with respect to certain preferred embodiments, equivalents and modifications will occur to others skilled in the art upon the reading and understanding of the specification. The present invention includes all such equivalents and modifications and is limited only by the scope of the appended claims.
[0166] For instance, various examples have been described with respect to certain sensors used to record one or more vocal tract signals. Depending on the patient's condition, residual vocal capabilities, the therapy goals and the setting, different vocal tract sensors can be used. They are preferably unobtrusive and wearable, i.e., light-weight and compact, preferably with low power consumption and wireless operation.
[0167] For further illustration, various examples have been described with respect to a trained machine learning algorithm. Depending on the computing power requirements and the transmission bandwidth, modifications are possible: for example, the trained machine learning algorithm could be deployed locally (i.e., on a mobile computing device) or remotely, i.e., using a cloud computing service. The mobile computing device can be used to connect one or more sensors with a platform executing the machine learning algorithm. The mobile computing device can also be used to output, via a loudspeaker, one or more audio signals including speech output determined based on the machine learning algorithm.
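For illustration of this local-versus-remote deployment choice, the following Python sketch switches between on-device conversion and a cloud call; the endpoint URL, its JSON request/response format and the local_model object are hypothetical placeholders, not an actual service or API.

import json
import urllib.request
import numpy as np

REMOTE_URL = "https://example.invalid/voice-graft/convert"   # hypothetical cloud endpoint

def convert_locally(frame: np.ndarray, local_model) -> np.ndarray:
    """Run the trained algorithm on the mobile computing device itself."""
    return local_model(frame)

def convert_remotely(frame: np.ndarray) -> np.ndarray:
    """Send the pre-processed frame to a (hypothetical) cloud service and receive the audio chunk."""
    payload = json.dumps({"frame": frame.tolist()}).encode()
    req = urllib.request.Request(REMOTE_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=0.2) as resp:    # tight timeout for real-time use
        return np.asarray(json.loads(resp.read())["audio"], dtype=np.float32)

def convert(frame: np.ndarray, local_model=None) -> np.ndarray:
    """Prefer local conversion when a model is deployed on-device; otherwise use the cloud."""
    return convert_locally(frame, local_model) if local_model else convert_remotely(frame)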
[0168] For further illustration, various examples have been described in which multiple configurations of the machine learning algorithm are trained using varying speech characteristics and/or varying articulation characteristics. In this regard, many levels of matching the speech characteristic to the patient characteristic are conceivable: gender, age, pitch, accent or dialect, etc. The matching can be done by selecting from a “library” of configurations of the machine learning algorithm, by modifying an existing configuration, or by custom recording the voice of a “voice donor”.
[0169] For still further illustration, the particular type or set of sensors is not germane to the functioning of the subject techniques. Different sensor types are advantageous in different situations: (i) Lip/facial cameras. A camera recording the motion of the lips and facial features will be useful in most cases, since these cues are available in most disease scenarios, are fairly information-rich (cf. lip reading), and are easy to pick up with a light-weight, relatively unobtrusive setup. A modified microphone headset with one or more miniature CCD cameras mounted on the cantilever may be used. Multiple CCD cameras or depth-sensing cameras, such as cameras using time-of-flight technology, may be advantageous to enable stereoscopic image analysis. (ii) Radar transceiver. Short-range radar operating in the frequency range between 1 and 12 GHz is an attractive technology for measuring the internal vocal tract configuration. These frequencies penetrate several centimeters to tens of centimeters into tissue and are safe for continuous use at the extremely low average power levels (microwatts) required. The radar signal can be emitted into a broad beam and detected either with a single antenna or in a spatially (i.e. angularly) resolved manner with multiple antennas. (iii) Ultrasound transceiver. Ultrasound can be an alternative to radar sensing in measuring the vocal tract configuration. At frequencies in the range of 1-5 MHz, ultrasound also penetrates and images the pertinent tissues well and can be operated safely in a continuous way. Ultra-compact, chip-based phased-array ultrasound transceivers are available for endoscopic applications. Ultrasound can also be envisioned to be used in a non-imaging mode. (iv) Surface EMG sensors. Surface EMG sensors may provide complementary data to the vocal tract shape information, especially in cases where the extrinsic laryngeal musculature is present and active. In those cases, EMG may help by providing information on intended loudness (i.e. adding dynamic range to the speech output) and, more fundamentally, by distinguishing speech from silence. The latter is a fundamental need in speech recognition, as the shape of the vocal tract alone does not reveal whether or not acoustic excitation (phonation) is present. (v) Acoustic microphone. Acoustic microphones make sense as (additional) sensors in all cases with residual voice present. Note that in this context, “residual voice” may include a whispering voice. Whispering needs air flow through the vocal tract, but does not involve phonation (i.e. vocal fold motion). In many cases, picking up a whispered voice, perhaps in combination with observing lip motion, may be enough to reconstruct and synthesize natural-sounding speech. In many scenarios, this would greatly simplify speech therapy, as it reduces the challenge from getting the patient to speak to teaching the patient to whisper. Microphones could attach to the patient's throat, under the mandible, or in front of the mouth (e.g. on the same headset cantilever as a lip/facial camera).
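For illustration of how several of the sensor modalities listed above could be combined into a single input for the conversion algorithm, the following Python sketch concatenates per-frame features from camera, radar, EMG and microphone channels, filling in zeros for modalities that are absent in a given patient scenario; the feature dimensions and names are illustrative assumptions only.

from typing import Optional
import numpy as np

def fuse_frame(lip_camera: Optional[np.ndarray],
               radar: Optional[np.ndarray],
               emg: Optional[np.ndarray],
               microphone: Optional[np.ndarray]) -> np.ndarray:
    """Concatenate whichever modalities are present; absent modalities contribute zeros so
    that the downstream conversion algorithm always receives a fixed-length input."""
    sizes = {"lip_camera": 32, "radar": 16, "emg": 4, "microphone": 12}   # illustrative dimensions
    parts = []
    for name, vec in (("lip_camera", lip_camera), ("radar", radar),
                      ("emg", emg), ("microphone", microphone)):
        parts.append(np.zeros(sizes[name], dtype=np.float32) if vec is None
                     else np.asarray(vec, dtype=np.float32))
    return np.concatenate(parts)

frame = fuse_frame(np.random.rand(32), np.random.rand(16), None, np.random.rand(12))
print(frame.shape)   # (64,)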
[0170] For still further illustration, various examples have been described in connection with using a machine learning algorithm to transform vocal-tract signals into audio signals associated with speech. It is not mandatory to use a machine learning algorithm; other types of algorithms may be used for the transformation.
LIST OF REFERENCE NUMERALS
FIG. 1: Schematic of the Anatomy Relevant to Physiologic Voice Production and its Impairments
[0171] 1 anatomical structures involved in phonation: lungs (not shown), trachea, and larynx (“source”)
[0172] 2 anatomical structures involved in articulation: vocal tract (“filter”)
[0173] 3 trachea
[0174] 4 larynx
[0175] 4a glottis
[0176] 5 epiglottis
[0177] 6 pharynx
[0178] 7 velum
[0179] 8 oral cavity
[0180] 9 tongue
[0181] 10a upper teeth
[0182] 10b lower teeth
[0183] 11a upper lip
[0184] 11b lower lip
[0185] 12 nasal cavity
[0186] 13 nostrils
[0187] 14 esophagus
[0188] 15 thyroid
[0189] 16 recurrent laryngeal nerve
FIG. 2: Schematic of Different Causes of Aphonia
[0190] (a) Tracheostomy
[0191] 17 tracheostomy for mechanical ventilation
[0192] 17a tracheostomy tube
[0193] 17b inflated cuff
[0194] (b) Laryngectomy
[0195] 18 tracheostoma after laryngectomy
[0196] 3 trachea
[0197] 14 esophagus
[0198] (c) Recurrent nerve injury
[0199] 19 laryngeal nerve injury after thyroidectomy
[0200] 16 recurrent laryngeal nerve
[0201] 16a nerve injury
FIG. 3: Schematic of Different Voice Rehabilitation Options
[0202] (a) Tracheoesphageal puncture (TEP)
[0203] 20 tracheoesophageal puncture and valve
[0204] 21 finger
[0205] 22 vibrations
[0206] (b) Esophageal speech
[0207] 22 vibrations
[0208] (c) Electrolarynx
[0209] 23 electrolarynx
[0210] 22 vibrations
FIG. 4: Schematic of an Example Implementation
[0211] (a) Step 1a: Creating the audio training data
[0212] 24 sample text
[0213] 25 healthy speaker
[0214] 26 microphone
[0215] 27 audio training data
[0216] (b) Step 1b: Creating the vocal tract training data
[0217] 28 display with sample text
[0218] 29 impaired patient
[0219] 30 vocal tract sensors
[0220] 31 vocal tract training data
[0221] (c) Step 1c: Synchronizing audio and vocal tract training data
[0222] 27 audio training data
[0223] 31 vocal tract training data
[0224] (d) Step 2: Training the algorithm
[0225] 27 audio training data
[0226] 31 vocal tract training data
[0227] 32 trained machine learning algorithm
[0228] (e) Step 3: Using the voice prosthesis
[0229] 29 impaired patient
[0230] 30 vocal tract sensors
[0231] 32 trained machine learning algorithm
[0232] 33 wireless connection
[0233] 34 mobile computing device
[0234] 35 acoustic speech output
FIG. 5: Schematic of Different Implementation Options for Vocal Tract Sensors
[0235] (a) Microwave radar sensing
[0236] 36 radar antenna
[0237] 37 emitted radar signal
[0238] 38 backscattered/transmitted radar signal
[0239] (b) Ultrasound sensing
[0240] 39 ultrasound transducer
[0241] 40 emitted ultrasound signal
[0242] 41 backscattered ultrasound signal
[0243] (c) Low-frequency ultrasound
[0244] 42 ultrasound loudspeaker
[0245] 43 emitted ultrasound signal
[0246] 44 reflected ultrasound signal
[0247] 45 ultrasound microphone
[0248] (d) Lip and facial camera
[0249] 46 ambient light
[0250] 47 reflected light
[0251] 48 video camera
[0252] (e) Surface electromyography
[0253] 49 surface electromyography sensors (for extralaryngeal musculature)
[0254] 50 surface electromyography sensors (for neck and facial musculature)
[0255] (f) Acoustic microphone
[0256] 51 residual acoustic voice signal
[0257] 52 acoustic microphone
FIG. 6: Schematic of Different Implementation Options for Processing Vocal Tract Signals
[0258] (a) using elements of speech and MFCCs as intermediate representations of speech
[0259] 70 vocal tract data: series of frames
[0260] 71 data pre-processing
[0261] 72 time series of feature vectors
[0262] 73 speech recognition algorithm
[0263] 74 elements of speech: phonemes, syllables, words
[0264] 75 speech synthesis algorithm
[0265] 76 mel-frequency cepstral coefficients
[0266] 77 acoustic waveform synthesis
[0267] 78 acoustic speech waveform
[0268] (b) using MFCCs as intermediate representations of speech
[0269] 70 vocal tract data: series of frames
[0270] 71 data pre-processing
[0271] 72 time series of feature vectors
[0272] 76 mel-frequency cepstral coefficients
[0273] 77 acoustic waveform synthesis
[0274] 78 acoustic speech waveform
[0275] 79 deep neural network algorithm
[0276] (c) End-to-end machine learning algorithm using no intermediate representations of speech
[0277] 70 vocal tract data: series of frames
[0278] 71 data pre-processing
[0279] 72 time series of feature vectors
[0280] 78 acoustic speech waveform
[0281] 80 end-to-end deep neural network algorithm
[0282] (d) End-to-end machine learning algorithm using no explicit pre-processing and no intermediate representations of speech
[0283] 70 vocal tract data: series of frames
[0284] 78 acoustic speech waveform
[0285] 80 end-to-end deep neural network algorithm
FIG. 7: Schematic of Voice Prosthesis for Preferred Embodiment 1: Radar and Video Based Method for Bedridden Patients
[0286] 36 radar antennas
[0287] 48 video camera
[0288] 53 bedridden patient
[0289] 54 patient bed
[0290] 55 power supply
[0291] 56 radar transmission and receiving electronics
[0292] 57 signal processing electronics
[0293] 58 computing device
[0294] 59 audio amplifier
[0295] 60 bedside unit
[0296] 61 mobile computing device with touchscreen
[0297] 62 cantilever
[0298] 63 loudspeaker
FIG. 8: Schematic of Voice Prosthesis for Preferred Embodiment 2: Radar and Video Based Method for Mobile Patients
[0299] 36 radar antennas
[0300] 48 video camera
[0301] 55 power supply
[0302] 56 radar transmission and receiving electronics
[0303] 57 signal processing electronics
[0304] 61 mobile computing device with touchscreen
[0305] 62 cantilever
[0306] 63 loudspeaker
[0307] 64 mobile patient
[0308] 65 wireless transmitter and receiver
[0309] 66 portable electronics unit
FIG. 9: Schematic of Voice Prosthesis for Preferred Embodiment 3: Low-Frequency Ultrasound and Video Based Method for Mobile Patients
[0310] 42 ultrasound loudspeaker
[0311] 45 ultrasound microphone
[0312] 48 video camera
[0313] 55 power supply
[0314] 57 signal processing electronics
[0315] 61 mobile computing device with touchscreen
[0316] 62 cantilever
[0317] 63 loudspeaker
[0318] 64 mobile patient
[0319] 65 wireless transmitter and receiver
[0320] 66 portable electronics unit
[0321] 67 ultrasound waveform generator
[0322] 68 analog-to-digital converter
FIG. 10: Schematic of Voice Prosthesis for Preferred Embodiment 4: Audio and Video Based Method for Mobile Patients with Residual Voice
[0323] 48 video camera
[0324] 52 microphone
[0325] 55 power supply
[0326] 58 computing device
[0327] 59 audio amplifier
[0328] 62 cantilever
[0329] 63 loudspeaker
[0330] 64 mobile patient
[0331] 66 portable electronics unit
[0332] 69 user interface