VOICE GRAFTING USING MACHINE LEARNING
20230154450 · 2023-05-18
Inventors
CPC classification
G10L15/25
PHYSICS
A61F2/20
HUMAN NECESSITIES
G10L21/00
PHYSICS
A61F2002/206
HUMAN NECESSITIES
International classification
G10L13/033
PHYSICS
G10L15/25
PHYSICS
Abstract
A process labeled “voice grafting” can be understood in terms of the source-filter model of speech production as follows: For a patient who has partially or completely lost the ability to phonate, but retained at least a partial ability to articulate, the techniques described herein computationally “graft” the patient's time-varying filter function, i.e. articulation, onto a source function, i.e. phonation, which is based on the speech output of one or more healthy speakers, in order to synthesize natural-sounding speech in real time.
Claims
1. A method for creating an artificial voice for a patient with missing or impaired phonation but at least residual articulation function, wherein an acoustic signal of one or more healthy speakers reading a known body of text out loud is recorded, at least one vocal tract signal of the patient mouthing the same known body of text is recorded, the acoustic signal and the at least one vocal tract signal are used to train a machine learning algorithm, and the machine learning algorithm is used in an electronic voice prosthesis measuring the patient's at least one vocal tract signal and converting it to an acoustic speech output in real time.
2. The method according to claim 1, wherein the patient is on mechanical ventilation, has undergone a partial or complete laryngectomy, or suffers from vocal fold paresis or paralysis.
3. The method according to claim 1, wherein at least one of the one or more healthy speakers is identical with the patient prior to impairment.
4. The method according to claim 1, wherein the one or more healthy speakers comprise a plurality of healthy speakers with different voice characteristics and a particular voice is chosen for the patient based on the patient's gender, age, natural pitch, other vocal characteristics prior to the impairment, and/or preferences.
5. The method according to claim 1, wherein the acoustic signal of the one or more healthy speakers and the at least one vocal tract signal of the patient are synchronized.
6-8. (canceled)
9. The method according to claim 1, wherein the machine learning algorithm is a convolutional neural network and wherein the convolutional neural network is trained to directly convert the recorded vocal tract signal to the acoustic speech output.
10. (canceled)
11. The method according to claim 1, wherein the machine learning algorithm is a convolutional neural network, and wherein the convolutional neural network is trained to convert the recorded vocal tract signal to elements of speech, such as phonemes, syllables or words, which are then synthesized to the acoustic speech output.
12. The method according to claim 1, wherein the machine learning algorithm is a convolutional neural network, and wherein the convolutional neural network is pre-trained based on the one or more healthy speakers and further impaired patients and re-trained for the patient.
13. The method according to claim 1, wherein the at least one vocal tract signal comprises an electromagnetic signal in the radio frequency range, optionally recorded using a radar transceiver.
14. The method according to claim 13, wherein electromagnetic waves in the frequency range of 1 kHz to 12 GHz, optionally microwaves between 1 GHz and 10 GHz are emitted, and reflected and/or transmitted and/or otherwise influenced waves are received using one or more antennas in contact with or proximity to the patient's skin.
15. (canceled)
16. The method according to claim 1, wherein the at least one vocal tract signal comprises one or more images of the patient's lips and/or face, recorded using a camera sensor.
17. (canceled)
18. The method according to claim 1, wherein the at least one vocal tract signal comprises a patient's residual voice output, measured using an acoustic microphone.
19-21. (canceled)
22. The method according to claim 1, wherein the at least one vocal tract signal comprises one or more ultrasound signals, and wherein low frequency ultrasound waves in the range between 20 and 100 kHz are emitted using a loudspeaker in contact with or in proximity to the patient's skin or near the patient's mouth and detected using a microphone.
23. The method according to claim 1, wherein the electronic voice prosthesis comprises a mobile computing device, and wherein the mobile computing device is a smart phone or a tablet carrying out the conversion of the at least one vocal tract signal to the acoustic speech output locally on the device.
24. (canceled)
25. The method according to claim 1, wherein the electronic voice prosthesis comprises a mobile computing device, and wherein the mobile computing device is a smart phone or a tablet connected to the internet and the conversion of the at least one vocal tract signal to the acoustic speech output is carried out on a remote computing platform.
26. (canceled)
27. The method according to claim 25, wherein the at least one vocal tract signal comprises one or more images of the patient's lips and/or face, recorded using a built-in camera sensor of the mobile computing device.
28-29. (canceled)
30. A device for a patient with missing or impaired phonation but at least residual articulation function, wherein the device is configured to measure at least one vocal tract signal of the patient and to convert it to an acoustic speech output in real time using a machine learning algorithm, the machine learning algorithm having been trained with data that includes an acoustic signal of one or more healthy persons reading a body of text out loud and at least one vocal tract signal of one or more persons mouthing the same body of text.
31. The method according to claim 23, wherein the at least one vocal tract signal comprises one or more images of the patient's lips and/or face, recorded using a built-in camera sensor of the mobile computing device.
32. The method according to claim 23, wherein the mobile computing device is connected to one or more external sensors via a wireless interface, the one or more external sensors configured to record one or more of the at least one vocal tract signal.
33. The method according to claim 25, wherein the mobile computing device is connected to one or more external sensors via a wireless interface, the one or more external sensors configured to record one or more of the at least one vocal tract signal.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0039] FIG. 1 shows a schematic of the anatomy relevant to physiologic voice production and its impairments.
[0040] FIG. 2 shows a schematic of different causes of aphonia.
[0041] FIG. 3 shows a schematic of different voice rehabilitation options.
[0042] FIG. 4 shows a schematic of an example implementation.
[0043] FIG. 5 shows a schematic of different implementation options for vocal tract sensors.
[0044] FIG. 6 shows a schematic of different implementation options for processing vocal tract signals.
[0045] FIG. 7 shows a schematic of a voice prosthesis according to preferred embodiment 1 (radar and video based method for bedridden patients).
[0046] FIG. 8 shows a schematic of a voice prosthesis according to preferred embodiment 2 (radar and video based method for mobile patients).
[0047] FIG. 9 shows a schematic of a voice prosthesis according to preferred embodiment 3 (low-frequency ultrasound and video based method for mobile patients).
[0048] FIG. 10 shows a schematic of a voice prosthesis according to preferred embodiment 4 (audio and video based method for mobile patients with residual voice).
DETAILED DESCRIPTION OF EMBODIMENTS
[0049] Hereinafter, techniques of generating a speech output based on a residual articulation of the patient (voice grafting) are described. Various techniques are based on the finding that prior-art implementations of speech rehabilitation face certain restrictions and drawbacks. For example, for many patients they do not achieve the objective of restoring natural-sounding speech. Esophageal speech and speaking with the help of a speaking valve, voice prosthesis, or electrolarynx are difficult to learn for some patients and often result in distorted, unnatural speech. The need to hold and activate an electrolarynx device or to cover a tracheostoma or valve opening with a finger makes these solutions cumbersome and obtrusive. In-dwelling prostheses also carry the risk of fungal infections.
[0050] For example, details with respect to the electrolarynx are described in: Kaye, Rachel, Christopher G. Tang, and Catherine F. Sinclair. “The electrolarynx: voice restoration after total laryngectomy.” Medical Devices (Auckland, NZ) 10 (2017): 133.
[0051] For example, details with respect to a speaking valve are described in: Passy, Victor, et al. “Passy-Muir tracheostomy speaking valve on ventilator-dependent patients.” The Laryngoscope 103.6 (1993): 653-658. Also, see Kress, P., et al. “Are modern voice prostheses better? A lifetime comparison of 749 voice prostheses.” European Archives of Oto-Rhino-Laryngology 271.1 (2014): 133-140.
[0052] An overview of traditional voice prostheses is provided by: Reutter, Sabine. Prothetische Stimmrehabilitation nach totaler Kehlkopfentfernung-eine historische Abhandlung seit Billroth (1873). Diss. Universität Ulm, 2008.
[0053] Many different implementations of voice grafting are possible. One implementation is shown schematically in FIG. 4.
[0054] Step 1: Creating training data, i.e., training phase. A healthy speaker (25) providing the “target voice” reads a sample text (24) (also labelled reference text) out loud, while the voice output is recorded with a microphone (26). This can create one or more reference audio signals, or simply audio training data. The resulting audio training data (27), including the text and its audio recording, can be thought of as an “audio book”; in fact, it can also be an existing audio book recording or any other available speech corpus. (Step 1a, FIG. 4(a)).
[0055] The same text is then “read” by a patient with impaired phonation (29), while signals characterizing the patient's time-varying vocal tract configuration, i.e. the articulation, are recorded with suitable sensors (30), yielding vocal tract training data (31). If the patient is completely aphonic, “reading” the text here means silently “mouthing” it. To record the articulation, various options exist. For example, the patient's vocal tract is probed using electromagnetic and/or acoustic waves, and backscattered and/or transmitted waves are measured. The measurement setup is collectively referred to as the “vocal tract sensors” and the measured signals as “vocal tract signals”. (Step 1b, FIG. 4(b)).
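For purely illustrative purposes, the following minimal Python sketch shows one way the paired training material of Steps 1a and 1b could be organized in software; the names (TrainingUtterance, durations_s) and the assumption that the recordings are held as NumPy arrays are illustrative and not part of the described techniques.

import numpy as np
from dataclasses import dataclass

@dataclass
class TrainingUtterance:
    """One excerpt of the sample/reference text (24) with its paired recordings."""
    text: str                  # excerpt of the reference text
    audio: np.ndarray          # audio training data (27) of the healthy speaker, shape (n_samples,)
    audio_rate: int            # audio sampling rate in Hz
    vocal_tract: np.ndarray    # vocal tract training data (31), shape (n_frames, n_channels)
    frame_rate: float          # vocal tract sensor frame rate in Hz

def durations_s(utt: TrainingUtterance) -> tuple:
    """Durations of both recordings; they generally differ until Step 1c synchronizes them."""
    return (len(utt.audio) / utt.audio_rate, len(utt.vocal_tract) / utt.frame_rate)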
[0057] Step 2: Training the algorithm (FIG. 4(d)). The audio training data (27) and the synchronized vocal tract training data (31) are used to train a machine learning algorithm (32).
[0060] Step 3: Using the voice prosthesis (FIG. 4(e)). The trained machine learning algorithm (32) is deployed in an electronic voice prosthesis: the impaired patient's (29) vocal tract signals are measured with the vocal tract sensors (30) and converted, e.g. on a mobile computing device (34) connected via a wireless connection (33), into acoustic speech output (35) in real time.
[0061] Thus, as a general rule, a corresponding method can include receiving one or more live vocal-tract signals of the patient and then converting the one or more live vocal-tract signals into one or more associated live audio signals including speech output, based on the machine learning algorithm.
[0062] For each step described above, a wide range of implementations is possible, which will be described below.
[0063] Step 1a: Creating the audio training data. The audio training data representing the “healthy voice” can come from a range of different sources. In the most straightforward implementation it is the voice of a single healthy speaker. The training data set can also come from multiple speakers. In one implementation, a library of recordings of speakers with different vocal characteristics could be used. In the training step, training data of a matching target voice would be chosen for the impaired patient. The matching could happen based on gender, age, pitch, accent, dialect, and other characteristics, e.g., defined by a respective patient dataset.
[0064] Thus, as a general rule, it would be possible to train multiple configurations of the machine learning algorithm, using multiple speech outputs having various speech characteristics and/or using multiple articulations of the reference text having various articulation characteristics.
[0065] The method could further include selecting a configuration from the plurality of configurations of the machine learning algorithm based on a patient dataset indicative of demographic and phonetic characteristics of the patient.
[0066] The speech characteristics may specify characteristics of the speech output. Example speech characteristics include: pitch; gender; accent; age; etc. Accordingly, it would be possible that the known body of text has been pre-recorded with a plurality of healthy speakers with different voice characteristics. The articulation characteristics can specify characteristics of the articulation and/or its sensing. Example articulation characteristics include: type of vocal tract impairment; type of sensor technology used for recording the one or more vocal-tract signals; etc.
[0067] The patient dataset may specify speech characteristics of the patient and/or articulation characteristics of the patient. Thereby, a tailored configuration of the machine learning algorithm can be selected, providing an appropriate speech output based on the specific articulation of the patient.
[0068] The audio training data can either be custom-generated for a specific patient, or serve for a range of patients, or it can be a pre-existing database of recordings. Several such databases are available, for example through the Bavarian Archive for Speech Signals or through OpenSLR. A custom voice sample could also be matched to the types of conversations the patient is likely to have. This may be especially advantageous with very severely handicapped patients for whom the therapy goal is to reliably communicate with a limited vocabulary.
[0069] A preferred voice sample would be of the patient's own voice. This requires a recording of a sufficient body of text of the patient's original voice prior to injury or surgery, to be used as a training data set for the algorithm. For example, in cases of total laryngectomy it is conceivable that the patient's voice gets extensively recorded before the surgery. In such an implementation, the recording of the audio training data (Step 1a) and the recording of the corresponding vocal tract signals (Step 1b) can occur concurrently.
[0070] Thus, it would also be possible that the speech output of the reference text, included in the one or more reference audio signals, is provided by the patient prior to the impairment. Accordingly, the healthy speaker could be identical with the impaired patient, prior to the impairment. Thereby, a particularly accurate training of the machine learning algorithm and a unique speech output tailored to the patient can be provided.
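For illustration of how a matching target voice could be selected from such a library, the following Python sketch scores hypothetical library entries against a patient dataset (gender, age, pitch); the names VoiceConfig, PatientProfile and select_voice, as well as the weighting, are assumptions and not part of the described techniques.

from dataclasses import dataclass

@dataclass
class VoiceConfig:
    name: str
    gender: str       # e.g. "female" or "male"
    age: int          # typical speaker age in years
    pitch_hz: float   # mean fundamental frequency of the target voice

@dataclass
class PatientProfile:
    gender: str
    age: int
    pitch_hz: float   # e.g. estimated from pre-impairment recordings

def select_voice(library: list, patient: PatientProfile) -> VoiceConfig:
    """Pick the library voice whose characteristics best match the patient dataset."""
    def mismatch(v: VoiceConfig) -> float:
        gender_penalty = 0.0 if v.gender == patient.gender else 10.0
        return gender_penalty + abs(v.age - patient.age) / 10.0 + abs(v.pitch_hz - patient.pitch_hz) / 20.0
    return min(library, key=mismatch)

library = [VoiceConfig("voice_a", "female", 35, 210.0), VoiceConfig("voice_b", "male", 60, 120.0)]
print(select_voice(library, PatientProfile("male", 58, 115.0)).name)   # -> voice_b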
[0071] Step 1b: Creating the vocal tract training data. The vocal tract training data can also come from a range of different sources. In the most straightforward implementation it comes from a single person, the same impaired patient who will use the voice prosthesis in Step 3. The vocal tract signal training data can also come from multiple persons, who do not all have to be impaired patients. For example, the training can be performed in two steps: a first training step can train the machine learning algorithm with a large body of training data consisting of audio data from multiple healthy speakers and vocal tract data from multiple healthy and/or impaired persons. A second training step can then re-train the network with a smaller body of training data containing the vocal tract signals of the impaired patient who will use the voice prosthesis.
[0072] Thus, as a general rule, it would be possible that the machine learning algorithm is pre-trained based on a plurality of healthy speakers and impaired patients, and then re-trained for a particular impaired patient. I.e., it would be possible that the one or more vocal-tract signals based on which the training of the machine learning algorithm is executed are at least partially associated with the patient (and optionally also partially associated with one or more other persons, as already described above).
[0073] To measure articulatory movement, a range of different electromagnetic or acoustic sensors, or both, can be used to probe the vocal tract and characterize its time-varying shape, as shown schematically in FIG. 5.
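For illustration of the two-step training described in paragraphs [0071] and [0072], a minimal PyTorch-style sketch is given below; the tiny network, the feature dimensions and the (commented-out) data loaders are placeholders, assumed only for this example, and do not represent the actual architecture.

import torch
import torch.nn as nn

def make_model(n_in: int, n_out: int) -> nn.Module:
    # Placeholder network standing in for whatever DNN is actually used.
    return nn.Sequential(nn.Linear(n_in, 256), nn.ReLU(), nn.Linear(256, n_out))

def train(model, loader, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for vocal_tract, audio_target in loader:       # synchronized, paired frames
            opt.zero_grad()
            loss = loss_fn(model(vocal_tract), audio_target)
            loss.backward()
            opt.step()
    return model

model = make_model(n_in=64, n_out=13)                  # e.g. 64 sensor features -> 13 acoustic parameters
# First training step: pre-train on a large multi-person corpus.
# model = train(model, multi_speaker_loader, epochs=50, lr=1e-3)
# Second training step: re-train on the individual patient's smaller data set.
# model = train(model, patient_loader, epochs=10, lr=1e-4)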
[0082] Step 1c: Synchronizing audio and the vocal tract training data. As a general rule, the method may further include synchronizing a timing of the one or more reference audio signals with a further timing of the one or more vocal-tract signals. By synchronizing the timings, an accurate training of the machine learning algorithm is enabled. The timing of a signal can correspond to the duration between corresponding information content. For example, the timing of the one or more reference audio signals may be characterized by the time duration required to cover a certain fraction of the reference text by the speech output; similarly, the further timing of the one or more vocal-tract signals can correspond to the time duration required to cover a certain fraction of the reference text by the articulation.
[0083] Time synchronization between the audio training data and the vocal tract training data can be achieved in a variety of different ways, either during recording or afterwards. If the two data sets are acquired at the same time from the same subject, synchronization is achieved automatically.
[0084] If the two data sets are acquired consecutively, synchronization can be accomplished by providing visual or auditory cues to the subject recording the vocal tract training data. This can be done, for example, by displaying the sample text to be recorded on a screen, with a cursor moving along at the speed of the audio recording of the target voice, or by quietly playing back that recording. In each case, the subject whose vocal tract signals are recorded aims to match the given speed. Thus, for example, it would be possible to control a human-machine-interface (HMI) to provide temporal guidance to the patient when articulating the reference text in accordance with the timing of the one or more reference audio signals. For example, it would be possible that the synchronization is achieved by providing optical or acoustic cues to the impaired patient while the vocal-tract signals are being recorded.
[0085] Alternatively or additionally, it would also be possible that said synchronizing includes controlling the HMI to obtain a temporal guidance from the patient when articulating the reference text. For example, the impaired patient could provide synchronization information by pointing at the part of the text being articulated, while the one or more vocal-tract signals are being recorded. Gesture detection or eye tracking may be employed. The position of an electronic pointer, e.g., a mouse cursor, could be analyzed. A third approach is to synchronize the audio training data and the vocal tract data computationally after recording, by selectively slowing down or speeding up one of the two data recordings. This requires both data streams to be annotated with timing cues. For the vocal tract signals, the subject recording the training set can provide these timing cues themselves by moving an input device, such as a mouse or a stylus, through the text at the speed of his or her reading. For the audio training data, the timing cues can be generated in a similar way, or by manually annotating the data after recording, or with the help of state-of-the-art speech recognition software. Thus, as a general rule, it would be possible that said synchronizing includes postprocessing at least one of the reference audio signals and the vocal-tract signals by changing a respective timing. In other words, it would be possible that said synchronizing is implemented electronically after recording of the one or more reference audio signals and/or the one or more vocal-tract signals, e.g., by selectively speeding up/accelerating or slowing down/decelerating the one or more reference audio signals and/or the one or more vocal-tract signals.
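For illustration of the computational post-hoc synchronization based on timing cues, the following NumPy sketch piecewise-linearly warps a vocal tract frame sequence onto the timeline of the reference audio; the function name warp_to_audio_timeline and the example numbers are assumptions for this sketch only.

import numpy as np

def warp_to_audio_timeline(frames: np.ndarray, frame_rate: float,
                           cues_vt_s: np.ndarray, cues_audio_s: np.ndarray,
                           audio_duration_s: float) -> np.ndarray:
    """frames: (n_frames, n_channels) vocal tract data; cues_*: times (in seconds) at which
    the same text positions are reached in the vocal tract and audio recordings."""
    n_out = int(round(audio_duration_s * frame_rate))
    t_audio = np.arange(n_out) / frame_rate
    # Map each instant on the audio timeline to the corresponding vocal tract instant.
    t_vt = np.interp(t_audio, cues_audio_s, cues_vt_s)
    src_idx = np.clip((t_vt * frame_rate).round().astype(int), 0, len(frames) - 1)
    return frames[src_idx]

frames = np.random.randn(500, 8)             # 10 s of vocal tract data at 50 frames per second
cues_vt = np.array([0.0, 5.0, 10.0])          # cue times in the vocal tract recording
cues_audio = np.array([0.0, 6.0, 12.0])       # same text positions in the reference audio
warped = warp_to_audio_timeline(frames, 50.0, cues_vt, cues_audio, audio_duration_s=12.0)
print(warped.shape)                           # (600, 8)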
[0086] Step 2: Training the machine learning algorithm. A range of different algorithms commonly used in speech recognition and speech synthesis can be adapted to the task of transforming vocal tract signals into acoustic speech output. The transformation can either be done via intermediate representations of speech, or end-to-end, omitting any explicit intermediate steps. Intermediate representations can be, for example, elements of speech such as phonemes, syllables, or words, or acoustic speech parameters such as mel-frequency cepstral coefficients (MFCCs).
[0087] Different options for transforming input vocal tract signals into acoustic speech output are illustrated in FIG. 6.
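For illustration of the variant that uses MFCCs as an intermediate representation (cf. FIG. 6(b)), a minimal PyTorch sketch of a small convolutional network mapping pre-processed vocal tract feature frames to MFCC frames is given below; a separate acoustic waveform synthesis step would then produce the speech waveform. The architecture and dimensions are illustrative assumptions only, not the actual network.

import torch
import torch.nn as nn

class VocalTractToMFCC(nn.Module):
    def __init__(self, n_features: int = 64, n_mfcc: int = 13):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_features, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, n_mfcc, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_features) -> (batch, time, n_mfcc)
        return self.net(x.transpose(1, 2)).transpose(1, 2)

model = VocalTractToMFCC()
dummy = torch.randn(2, 100, 64)   # 2 utterances, 100 frames, 64 vocal tract features each
print(model(dummy).shape)         # torch.Size([2, 100, 13])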
[0095] Step 3: Using the voice prosthesis. The trained neural network can then be used to realize an electronic voice prosthesis, a medical device that can alternatively be referred to as a “voice graft”, an “artificial voice”, or a “voice transplant”. A wide range of implementations is possible for the voice prosthesis. In practice, the choice will depend on the target patient scenario, i.e. the type of vocal impairment, the patient's residual vocal capabilities, the therapy goals, and aspects of the target setting, such as in-hospital vs. at home, or bedridden vs. mobile.
[0096] Four elements can interact to provide a voice prosthesis: vocal tract sensors, preferably light-weight, compact and unobtrusive; a computing device, ideally mobile and wirelessly connected; the machine learning algorithm; and an acoustic output device, ideally unobtrusive but in proximity to the patient. Next, implementation choices for each of these elements are discussed.
[0097] The first element is the set of vocal tract sensors. A wide range of implementation options was discussed in the section “Step 1b: Creating the vocal tract training data”, above. The choice and placement of sensors during algorithm training and use of the voice prosthesis are typically the same. The optimal choice of sensors depends on the patient scenario. Electromagnetic or ultrasonic sensing of the vocal tract is chosen based on the reliability with which elements of speech can be recognized for the target patient type and setting. An auxiliary lip and facial camera will be advantageous in many scenarios to increase the reliability of recognition. In scenarios where the patient has any residual vocal output, such as residual phonation or the ability to whisper, a microphone will be an advantageous sensor modality. If the extrinsic laryngeal muscles and neck musculature are active, surface EMG can be an advantageous auxiliary sensor. Sensors should be light-weight and compact so as not to impede the patient's movement and articulation. In most mobile and at-home settings, unobtrusiveness will be an aspect of emotional and social importance to patients.
[0098] The second element of the voice prosthesis is a local computing device. It provides the computing power to carry out the trained machine learning algorithm or connects with a cloud-based computing platform where the algorithm is deployed, connects the algorithm with the acoustic output device, i.e. the loudspeaker, and provides a user interface. The requirements for portability and connectivity of the computing device depend on the patient scenario: for use at home, a mobile computing device is preferred. Compactness, affordability, and easy usability make a smartphone or tablet a preferred choice. It helps that a smartphone is not perceived as a prosthetic device, but as an item of daily use. In an ICU setting, by contrast, compactness and unobtrusiveness play a lesser role and the computing device can be integrated into a bedside unit. For home settings, a wireless connection, such as Bluetooth, between the sensor and the computing device will be desirable. In an ICU setting, on the other hand, wired connections are more acceptable. Thus, as a general rule, it is possible that the conversion of the one or more vocal-tract signals is locally implemented on at least one mobile computing device of the patient, or is remotely implemented using cloud-computing.
[0099] The third element is the deployed trained machine learning algorithm. Depending on the needed computing power and the available transmission bandwidth, it can be deployed on the local computing device, or remotely, i.e. in the cloud. In a mobile, smartphone-based implementation, cloud deployment of the algorithm can be advantageous. In a stationary bedside setting, the algorithm can run locally. A wide range of algorithm types known from the fields of speech recognition and speech synthesis was discussed in the section “Step 2: Training the algorithm”, above. The choice of algorithm depends on the type and number of sensors, the amount of training data available, and the degree to which the speech output needs to be customized to an individual patient. Generally, thanks to progress in neural network architecture and the increasing availability of computing power, end-to-end DNNs are becoming an increasingly attractive choice.
[0100] The fourth element is an acoustic output device, for example a loudspeaker. Ideally, this loudspeaker is both unobtrusive and in proximity to the patient's mouth, to make for a natural appearance of the artificial voice output. The closest proximity can be achieved by integrating a loudspeaker in the sensor unit, located at the patient's throat, under the mandible, or on a headset cantilever in front of the patient's face. Alternatively, a simpler solution for smartphone-based implementations would be to use the loudspeaker output of the smartphone.
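For illustration of how the four elements interact at run time, the following Python sketch shows a stubbed real-time loop of the voice prosthesis: acquire a vocal tract frame, convert it (locally or via the cloud), and play the resulting audio chunk. All function names, rates and dimensions are placeholders and not part of the described techniques.

import numpy as np

FRAME_RATE = 50          # vocal tract frames per second (assumption)
AUDIO_RATE = 16000       # output audio sampling rate in Hz (assumption)

def read_vocal_tract_frame() -> np.ndarray:
    """Stub: one pre-processed frame from the vocal tract sensors (30)."""
    return np.zeros(64, dtype=np.float32)

def synthesize_audio(frame: np.ndarray) -> np.ndarray:
    """Stub for the trained algorithm (32): one frame in, one audio chunk out."""
    return np.zeros(AUDIO_RATE // FRAME_RATE, dtype=np.float32)

def play_audio(chunk: np.ndarray) -> None:
    """Stub: hand the chunk to the audio amplifier / loudspeaker (63)."""
    pass

def run_prosthesis(n_frames: int = 500) -> None:
    for _ in range(n_frames):               # in practice: until the user stops the device
        frame = read_vocal_tract_frame()    # sensor acquisition
        chunk = synthesize_audio(frame)     # local or cloud-based conversion
        play_audio(chunk)                   # low-latency playback near the patient's mouth

run_prosthesis()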
[0101] Based on the range of the implementation options for each step above, a wide range of embodiments of the techniques described herein is possible. We describe four preferred embodiments of the invention for different patient scenarios. It is understood that combinations of various aspects of these embodiments can also be advantageous in these and other scenarios and that more embodiments of the invention can be generated from the implementation options discussed above. Also, the preferred embodiments described can apply to scenarios other than the ones mentioned in the description.
Preferred Embodiment 1: Radar and Video Based Method for Bedridden Patients
[0102] For a bedridden patient with no laryngeal airflow, such as a patient who is mechanically ventilated through a cuffed tracheostomy tube, embodiment 1 is a preferred embodiment. Such patients generally have no residual voice output and are not able to whisper. Therefore, the combination of radar sensing to obtain robust vocal tract signals and a video camera to capture lip and facial movements is preferred.
[0103] The main elements of the corresponding voice prosthesis are shown in FIG. 7.
[0104] Two or more antennas (36) are used to collect reflected and transmitted radar signals that encode the time-varying vocal tract shape. The antennas are placed in proximity to the patient's vocal tract, e.g. under the right and left jaw bone. To keep their position stable relative to the vocal tract they can be attached directly to the patient's skin as patch antennas. Each antenna can send and receive modulated electromagnetic signals in a frequency band between 1 kHz and 12 GHz, optionally between 1 GHz and 12 GHz, so that (complex) reflection and transmission can be measured. Possible modulations of the signal are: frequency sweep, stepped frequency sweep, pulse, frequency comb, frequency, phase, or amplitude modulation. In addition, a video camera (48) captures a video stream of the patient's face, containing information about the patient's lip and facial movements. The video camera is mounted in front of the patient's face on a cantilever (62) attached to the patient bed. The same cantilever can support the loudspeaker (63) for acoustic speech output.
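For illustration of the stepped frequency sweep modulation mentioned above, the following NumPy sketch estimates a complex reflection coefficient at each stepped frequency by correlating the received signal with the emitted tone; the baseband frequencies, durations and the simulated reflection are illustrative assumptions, not the parameters of an actual radar front end.

import numpy as np

def complex_amplitude(signal, freq_hz, fs):
    """Project a real narrow-band signal onto exp(-j*2*pi*f*t) to obtain its complex amplitude."""
    t = np.arange(len(signal)) / fs
    return 2.0 * np.mean(signal * np.exp(-2j * np.pi * freq_hz * t))

def stepped_sweep_reflection(freqs_hz, fs, emit_tone, measure_response):
    """Complex reflection coefficient at each stepped frequency of the sweep."""
    coeffs = []
    for f in freqs_hz:
        tx = emit_tone(f)                 # emitted tone at this frequency step
        rx = measure_response(tx, f)      # received (reflected/transmitted) signal
        coeffs.append(complex_amplitude(rx, f, fs) / complex_amplitude(tx, f, fs))
    return np.array(coeffs)

# Illustrative use with a simulated reflection (amplitude 0.3, phase 45 degrees):
fs = 1_000_000.0                          # illustrative baseband sampling rate in Hz
freqs = np.linspace(1_000.0, 5_000.0, 5)  # illustrative stepped baseband frequencies in Hz

def emit_tone(f):
    t = np.arange(int(fs * 0.01)) / fs    # 10 ms tone per frequency step
    return np.cos(2 * np.pi * f * t)

def measure_response(tx, f):
    t = np.arange(len(tx)) / fs
    return 0.3 * np.cos(2 * np.pi * f * t + np.pi / 4)

print(np.round(stepped_sweep_reflection(freqs, fs, emit_tone, measure_response), 3))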
[0105] The computing device (58) contained in the bedside unit (60) locally provides the necessary computing power to receive signals from the signal processing electronics (57) and the video camera (48), run the machine learning algorithm, output acoustic waveforms to the audio amplifier (59), and communicate wirelessly with the portable touchscreen device (61) serving as the user interface. The machine learning algorithm uses a deep neural network to transform the pre-processed radar signals and the stream of video images into an acoustic waveform in real time. The acoustic waveform is sent via the audio amplifier (59) to the loudspeaker (63).
[0106] The corresponding method for creating an artificial voice is as follows. An existing speech database is used to obtain audio training data for multiple target voices with different characteristics such as gender, age, and pitch. To create a corresponding body of vocal tract data, the sample text of the audio training data is read by a number of different speakers without speech impairment while their vocal tract signals are being recorded with the same radar sensor and video camera setup as for the eventual voice prosthesis. As the speakers read the sample text off a display screen, they follow the text along with an input stylus and timing cues are recorded. The timing cues are used to synchronize the vocal tract training data with the audio training data.
[0107] The audio training data sets of different target voices are separately combined with the synchronized vocal tract training data and used to train a deep neural network algorithm to convert radar and video data into the target voice. This results in a number of different DNNs, one for each target voice. The voice prosthesis is pre-equipped with these pre-trained DNNs.
[0108] To deal with the subject-to-subject variation in vocal tract signals, a pre-trained DNN is re-trained for a particular patient before use. To this end, first the pre-trained DNN that best matches the intended voice for the patient is selected. Then, the patient creates a patient-specific set of vocal tract training data, by mouthing an excerpt of the sample text that was used to pre-train the DNNs, while vocal tract data are being recorded. This second vocal tract training data set is synchronized and combined with the corresponding audio sample of the selected target voice. This smaller, patient-specific second set of training data is now used to re-train the DNN. The resulting patient-specific DNN is used in the voice prosthesis to transform the patient's vocal tract signal to voice output with the characteristics of the selected target voice.
Preferred Embodiment 2: Radar and Video Based Method for Mobile Patients
[0109] For a mobile patient with no laryngeal airflow, such as a patient whose larynx has been surgically removed, embodiment 2 is a preferred embodiment. Like the patient in embodiment 1, such patients also have no residual voice output and are not able to whisper. Therefore, the combination of radar sensing and a video camera to capture lip and facial movements is preferred in this case, too.
[0110] The main elements of the corresponding voice prosthesis are shown in FIG. 8.
[0111] As in embodiment 1, two or more antennas (36) are used to collect reflected and transmitted radar signals that encode the time-varying vocal tract shape. The antennas are placed in proximity to the patient's vocal tract, e.g. under the right and left jaw bone. To keep their position stable relative to the vocal tract they can be attached directly to the patient's skin. Each antenna can send and receive modulated electromagnetic signals in a frequency band between 1 kHz and 12 GHz, so that (complex) reflection and transmission can be measured. Possible preferred modulations of the signal are: frequency sweep, stepped frequency sweep, pulse, frequency comb, frequency, phase, or amplitude modulation.
[0112] In addition, a video camera (48) captures a video stream of the patient's face, containing information about the patient's lip and facial movements. For portability the video camera is mounted in front of the patient's face on a cantilever (62) worn by the patient like a microphone headset.
[0113] The portable touchscreen device (61) is also the computing device that locally provides the necessary computing power to receive the processed radar signals and the video images from the wireless transmitter (65), run the machine learning algorithm, output the acoustic speech waveforms via the built-in speaker (63), and provide the user interface on the touchscreen. The machine learning algorithm uses a deep neural network to transform the pre-processed radar signals and the stream of video images into an acoustic waveform in real time.
[0114] The corresponding method for creating an artificial voice is the same as in embodiment 1.
Preferred Embodiment 3: Low-Frequency Ultrasound and Video Based Method for Mobile Patients
[0115] For a mobile patient with no laryngeal airflow, such as a patient whose larynx has been surgically removed, embodiment 3 is an alternative preferred embodiment. Instead of radar sensing, in this embodiment low-frequency ultrasound is used to characterize the time-varying shape of the vocal tract.
[0116] The main elements of the corresponding voice prosthesis are shown in FIG. 9.
[0117] A low-frequency ultrasound loudspeaker (42) is used to emit ultrasound signals in the range of 20 to 30 kHz that are directed at the patient's mouth and nose. The ultrasound signals reflected from the patient's vocal tract are captured by an ultrasound microphone (45). The ultrasound loudspeaker and microphone are mounted in front of the patient's face on a cantilever (62) worn by the patient like a microphone headset.
[0118] With this setup, the complex reflection coefficient can be measured as a function of frequency. The frequency dependence of the reflection or transmission is measured by sending signals in a continuous frequency sweep, or in a series of wave packets with stepwise increasing frequencies, or by sending a short pulse and measuring the impulse response in a time-resolved manner.
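For illustration of how the frequency-dependent reflection could be extracted from a continuous frequency sweep, the following Python sketch (assuming NumPy and SciPy) divides the spectra of the received and emitted signals over the excited band; the sampling rate, sweep range and the simulated reflection are illustrative assumptions only.

import numpy as np
from scipy.signal import chirp, lfilter

fs = 192_000                                         # illustrative sampling rate of the audio front end in Hz
t = np.arange(int(fs * 0.1)) / fs                    # 100 ms sweep
emitted = chirp(t, f0=20_000, t1=t[-1], f1=30_000)   # 20-30 kHz sweep from the ultrasound loudspeaker (42)
received = lfilter([0.2, 0.1], [1.0], emitted)       # stand-in for the signal picked up by the microphone (45)

spectrum_tx = np.fft.rfft(emitted)
spectrum_rx = np.fft.rfft(received)
freqs = np.fft.rfftfreq(len(emitted), d=1.0 / fs)
band = (freqs >= 20_000) & (freqs <= 30_000)         # only the excited band carries information
reflection = spectrum_rx[band] / spectrum_tx[band]   # complex reflection vs. frequency
print(reflection.shape, np.abs(reflection[:3]))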
[0119] In addition, a video camera (48) captures a video stream of the patient's face, containing information about the patient's lip and facial movements. The video camera is mounted on the same cantilever (62) as the ultrasound loudspeaker and microphone.
[0120] As in embodiment 2, the portable touchscreen device (61) is also the computing device. It locally provides the necessary computing power to receive the ultrasound signals converted by the analog-to-digital converter (68) and the video images via the wireless transmitter (65), run the machine learning algorithm, output the acoustic speech waveforms via the built-in speaker (63), and provide the user interface on the touchscreen. The machine learning algorithm uses a DNN to transform the pre-processed ultrasound signals and the stream of video images into an acoustic waveform in real time.
[0121] The corresponding method for creating an artificial voice is the same as in embodiments 1 and 2.
Preferred Embodiment 4: Audio and Video Based Method for Mobile Patients with Residual Voice
[0122] For a mobile patient with residual voice output, such as residual phonation, a whisper voice, or a pure whisper without phonation, embodiment 4 is a preferred embodiment. For such a patient, the combination of an acoustic microphone to pick up the residual voice output and a video camera to capture lip and facial movements is preferred.
[0123] The main elements of the corresponding voice prosthesis are shown in FIG. 10.
[0124] A microphone (52) capturing the acoustic signal of the residual voice and a video camera (48) capturing lip and facial movements are placed in front of the patient's face on a cantilever (62) worn by the patient like a microphone headset. The microphone and camera signals are sent to the computing device (58), which runs the machine learning algorithm and outputs the acoustic speech output via the audio amplifier (59) and a loudspeaker (63) that is also mounted on the cantilever in front of the patient's face. The machine learning algorithm uses a DNN to transform the acoustic and video vocal tract signals into an acoustic waveform in real time.
[0125] The corresponding method for creating an artificial voice differs from the previous embodiments. Since the residual voice depends strongly on the patient's condition and may even change over time, a patient specific DNN algorithm is trained for each patient.
[0126] An existing speech database is used to obtain audio training data for a target voice that matches the patient in characteristics such as gender, age, and pitch. To create a corresponding body of vocal tract data, the sample text of the audio training data is read by the patient with the same microphone and video camera setup as for the eventual voice prosthesis. As the patient reads the sample text off a display screen, he or she follows the text along with an input stylus and timing cues are recorded. The timing cues are used to synchronize the vocal tract training data with the audio training data.
[0127] The combined training data set is used to train the DNN algorithm to transform the patient's vocal tract signals, i.e. residual voice and lip and facial movements, into acoustic speech output. If over time the patient's residual voice output changes enough to degrade the quality of the speech output, the algorithm can be re-trained by recording a new set of vocal tract training data.
[0128] Summarizing, at least the following examples have been described above.
[0129] EXAMPLE 1. A method, comprising: [0130] training a machine learning algorithm based on one or more reference audio signals of a speech output of a reference text, and one or more vocal tract signals associated with an articulation of the reference text by a patient.
[0131] EXAMPLE 2. The method of EXAMPLE 1, [0132] wherein multiple configurations of the machine learning algorithm are trained using at least one of multiple speech outputs having varying speech characteristics, or multiple articulations of the reference text having varying articulation characteristics, [0133] wherein the method further comprises: [0134] selecting a configuration from the multiple configurations of the machine learning algorithm based on a patient dataset of the patient.
[0135] EXAMPLE 3. The method of EXAMPLE 1 or 2, further comprising: [0136] synchronizing a timing of the one or more reference audio signals with a further timing of the one or more vocal tract signals.
[0137] EXAMPLE 4. The method of EXAMPLE 3, wherein said synchronizing comprises: [0138] controlling a human-machine interface to provide temporal guidance to the patient when articulating the reference text in accordance with the timing of the one or more reference audio signals.
[0139] EXAMPLE 5. The method of EXAMPLE 3 or 4, wherein said synchronizing comprises: [0140] controlling a human-machine-interface to obtain a temporal guidance from the patient when articulating the reference text.
[0141] EXAMPLE 6. The method of any one of EXAMPLEs 3 to 5, wherein said synchronizing comprises: [0142] postprocessing at least one of the one or more reference audio signals and the one or more vocal-tract signals by changing a respective timing.
[0143] EXAMPLE 7. The method of any one of EXAMPLEs 1 to 6, [0144] wherein the machine learning algorithm is trained end-to-end to convert a live articulation of the patient to a live speech output.
[0145] EXAMPLE 8. The method of any one of EXAMPLEs 1 to 6, [0146] wherein the machine learning algorithm is trained end-to-end to convert a live articulation of the patient to fragments of a live speech output.
[0147] EXAMPLE 9. The method of any one of the preceding EXAMPLEs, [0148] wherein the one or more reference audio signals and/or the one or more vocal-tract signals are provided by at least one of the patient or one or more other persons.
[0149] EXAMPLE 10. The method of any one of the preceding EXAMPLEs, further comprising: [0150] receiving one or more live vocal tract signals of the patient, and [0151] based on the machine learning algorithm, converting the one or more live vocal-tract signals into associated one or more live audio signals comprising speech output.
[0152] EXAMPLE 11. The method of EXAMPLE 10, [0153] wherein said converting is locally implemented on at least one mobile computing device of the patient, or is remotely implemented using cloud-computing.
[0154] EXAMPLE 12. The method of EXAMPLE 10 or 11, further comprising: [0155] recording at least a part of the one or more live vocal-tract signals using one or more sensors associated with at least one mobile computing device.
[0156] EXAMPLE 13. The method of EXAMPLE 12, wherein the one or more sensors are selected from the group comprising: a lip camera; a facial camera; a headset microphone; an ultrasound transceiver; a neck or larynx surface electromyogram; and a radar transceiver.
[0157] EXAMPLE 14. The method of any one of EXAMPLEs 10 to 13, further comprising: [0158] outputting the one or more live audio signals using a speaker of at least one mobile computing device of the patient.
[0159] EXAMPLE 15. The method of any one of the preceding EXAMPLEs, wherein the patient is on mechanical ventilation through a tracheostomy, has undergone a partial or complete laryngectomy, or suffers from vocal fold paresis or paralysis.
[0160] EXAMPLE 16. The method of EXAMPLE 15, wherein the speech output of the reference text is provided by the patient prior to speech impairment.
[0161] EXAMPLE 17. A device comprising a control circuitry configured to: [0162] receive one or more live vocal tract signals of a patient, [0163] based on a machine learning algorithm, convert the one or more live vocal tract signals into one or more associated live audio signals comprising a speech output, the machine learning algorithm being trained based on one or more reference audio signals of a speech output of a reference text, and one or more reference vocal tract signals of a patient associated with an articulation of the reference text by a patient.
[0164] EXAMPLE 18. The device of EXAMPLE 17, wherein the control circuitry is configured to execute the method of any one of the EXAMPLES 1 to 16.
[0165] Although the invention has been shown and described with respect to certain preferred embodiments, equivalents and modifications will occur to others skilled in the art upon the reading and understanding of the specification. The present invention includes all such equivalents and modifications and is limited only by the scope of the appended claims.
[0166] For instance, various examples have been described with respect to certain sensors used to record one or more vocal tract signals. Depending on the patient's condition, residual vocal capabilities, the therapy goals and the setting, different vocal tract sensors can be used. They are preferably unobtrusive and wearable, i.e., light-weight and compact, preferably with low power consumption and wireless operation.
[0167] For further illustration, various examples have been described with respect to a trained machine learning algorithm. Depending on the computing power requirements and the transmission bandwidth, modifications are possible: for example, the trained machine learning algorithm could be deployed locally (i.e., on a mobile computing device) or remotely, i.e., using a cloud computing service. The mobile computing device can be used to connect one or more sensors with a platform executing the machine learning algorithm. The mobile computing device can also be used to output, via a loudspeaker, one or more audio signals including speech output determined based on the machine learning algorithm.
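For illustration of this local-versus-remote deployment choice, the following Python sketch switches between on-device conversion and a cloud call; the endpoint URL, its JSON request/response format and the local_model object are hypothetical placeholders, not an actual service or API.

import json
import urllib.request
import numpy as np

REMOTE_URL = "https://example.invalid/voice-graft/convert"   # hypothetical cloud endpoint

def convert_locally(frame: np.ndarray, local_model) -> np.ndarray:
    """Run the trained algorithm on the mobile computing device itself."""
    return local_model(frame)

def convert_remotely(frame: np.ndarray) -> np.ndarray:
    """Send the pre-processed frame to a (hypothetical) cloud service and receive the audio chunk."""
    payload = json.dumps({"frame": frame.tolist()}).encode()
    req = urllib.request.Request(REMOTE_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=0.2) as resp:    # tight timeout for real-time use
        return np.asarray(json.loads(resp.read())["audio"], dtype=np.float32)

def convert(frame: np.ndarray, local_model=None) -> np.ndarray:
    """Prefer local conversion when a model is deployed on-device; otherwise use the cloud."""
    return convert_locally(frame, local_model) if local_model else convert_remotely(frame)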
[0168] For further illustration, various examples have been described in which multiple configurations of the machine learning algorithm are trained using varying speech characteristics and/or varying articulation characteristics. In this regard, many levels of matching the speech characteristic to the patient characteristic are conceivable: gender, age, pitch, accent or dialect, etc. The matching can be done by selecting from a “library” of configurations of the machine learning algorithm, by modifying an existing configuration, or by custom recording the voice of a “voice donor”.
[0169] For still further illustration, the particular type or set of sensors is not germane to the functioning of the subject techniques. Different sensor types are advantageous in different situations: (i) Lip/facial cameras. A camera recording the motion of the lips and facial features will be useful in most cases, since these cues are available in most disease scenarios, are fairly information-rich (cf. lip reading), and are easy to pick up with a light-weight, relatively unobtrusive setup. A modified microphone headset with one or more miniature CCD cameras mounted on the cantilever may be used. Multiple CCD cameras or depth-sensing cameras, such as cameras using time-of-flight technology, may be advantageous to enable stereoscopic image analysis. (ii) Radar transceiver. Short-range radar operating in the frequency range between 1 and 12 GHz is an attractive technology for measuring the internal vocal tract configuration. These frequencies penetrate several centimeters to tens of centimeters into tissue and are safe for continuous use at the extremely low average power levels (microwatts) required. The radar signal can be emitted into a broad beam and detected either with a single antenna or in a spatially (i.e. angularly) resolved manner with multiple antennas. (iii) Ultrasound transceiver. Ultrasound can be an alternative to radar sensing in measuring the vocal tract configuration. At frequencies in the range of 1-5 MHz, ultrasound also penetrates and images the pertinent tissues well and can be operated safely in a continuous way. Ultra-compact, chip-based phased-array ultrasound transceivers are available for endoscopic applications. Ultrasound can also be envisioned to be used in a non-imaging mode. (iv) Surface EMG sensors. Surface EMG sensors may provide complementary data to the vocal tract shape information, especially in cases where the extrinsic laryngeal musculature is present and active. In those cases, EMG may help by providing information on intended loudness (i.e. adding dynamic range to the speech output) and, more fundamentally, by distinguishing speech from silence. The latter is a fundamental need in speech recognition, as the shape of the vocal tract alone does not reveal whether or not acoustic excitation (phonation) is present. (v) Acoustic microphone. Acoustic microphones make sense as (additional) sensors in all cases with residual voice present. Note that in this context, “residual voice” may include a whispering voice. Whispering needs air flow through the vocal tract, but does not involve phonation (i.e. vocal fold motion). In many cases, picking up a whispered voice, perhaps in combination with observing lip motion, may be enough to reconstruct and synthesize natural-sounding speech. In many scenarios, this would greatly simplify speech therapy, as it reduces the challenge from getting the patient to speak to teaching the patient to whisper. Microphones could attach to the patient's throat, under the mandible, or in front of the mouth (e.g. on the same headset cantilever as a lip/facial camera).
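For illustration of how several of the sensor modalities listed above could be combined into a single input for the conversion algorithm, the following Python sketch concatenates per-frame features from camera, radar, EMG and microphone channels, filling in zeros for modalities that are absent in a given patient scenario; the feature dimensions and names are illustrative assumptions only.

from typing import Optional
import numpy as np

def fuse_frame(lip_camera: Optional[np.ndarray],
               radar: Optional[np.ndarray],
               emg: Optional[np.ndarray],
               microphone: Optional[np.ndarray]) -> np.ndarray:
    """Concatenate whichever modalities are present; absent modalities contribute zeros so
    that the downstream conversion algorithm always receives a fixed-length input."""
    sizes = {"lip_camera": 32, "radar": 16, "emg": 4, "microphone": 12}   # illustrative dimensions
    parts = []
    for name, vec in (("lip_camera", lip_camera), ("radar", radar),
                      ("emg", emg), ("microphone", microphone)):
        parts.append(np.zeros(sizes[name], dtype=np.float32) if vec is None
                     else np.asarray(vec, dtype=np.float32))
    return np.concatenate(parts)

frame = fuse_frame(np.random.rand(32), np.random.rand(16), None, np.random.rand(12))
print(frame.shape)   # (64,)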
[0170] For still further illustration, various examples have been described in connection with using a machine learning algorithm to transform vocal-tract signals into audio signals associated with speech. It is not mandatory to use a machine learning algorithm; other types of algorithms may be used for the transformation.
LIST OF REFERENCE NUMERALS
FIG. 1: Schematic of the Anatomy Relevant to Physiologic Voice Production and its Impairments
[0171] 1 anatomical structures involved in phonation: lungs (not shown), trachea, and larynx (“source”)
[0172] 2 anatomical structures involved in articulation: vocal tract (“filter”)
[0173] 3 trachea
[0174] 4 larynx
[0175] 4a glottis
[0176] 5 epiglottis
[0177] 6 pharynx
[0178] 7 velum
[0179] 8 oral cavity
[0180] 9 tongue
[0181] 10a upper teeth
[0182] 10b lower teeth
[0183] 11a upper lip
[0184] 11b lower lip
[0185] 12 nasal cavity
[0186] 13 nostrils
[0187] 14 esophagus
[0188] 15 thyroid
[0189] 16 recurrent laryngeal nerve
FIG. 2: Schematic of Different Causes of Aphonia
[0190] (a) Tracheostomy
[0191] 17 tracheostomy for mechanical ventilation
[0192] 17a tracheostomy tube
[0193] 17b inflated cuff
[0194] (b) Laryngectomy
[0195] 18 tracheostoma after laryngectomy
[0196] 3 trachea
[0197] 14 esophagus
[0198] (c) Recurrent nerve injury
[0199] 19 laryngeal nerve injury after thyroidectomy
[0200] 16 recurrent laryngeal nerve
[0201] 16a nerve injury
FIG. 3: Schematic of Different Voice Rehabilitation Options
[0202] (a) Tracheoesphageal puncture (TEP)
[0203] 20 tracheoesophageal puncture and valve
[0204] 21 finger
[0205] 22 vibrations
[0206] (b) Esophageal speech
[0207] 22 vibrations
[0208] (c) Electrolarynx
[0209] 23 electrolarynx
[0210] 22 vibrations
FIG. 4: Schematic of an Example Implementation
[0211] (a) Step 1a: Creating the audio training data
[0212] 24 sample text
[0213] 25 healthy speaker
[0214] 26 microphone
[0215] 27 audio training data
[0216] (b) Step 1b: Creating the vocal tract training data
[0217] 28 display with sample text
[0218] 29 impaired patient
[0219] 30 vocal tract sensors
[0220] 31 vocal tract training data
[0221] (c) Step 1c: Synchronizing audio and vocal tract training data
[0222] 27 audio training data
[0223] 31 vocal tract training data
[0224] (d) Step 2: Training the algorithm
[0225] 27 audio training data
[0226] 31 vocal tract training data
[0227] 32 trained machine learning algorithm
[0228] (e) Step 3: Using the voice prosthesis
[0229] 29 impaired patient
[0230] 30 vocal tract sensors
[0231] 32 trained machine learning algorithm
[0232] 33 wireless connection
[0233] 34 mobile computing device
[0234] 35 acoustic speech output
FIG. 5: Schematic of Different Implementation Options for Vocal Tract Sensors
[0235] (a) Microwave radar sensing
[0236] 36 radar antenna
[0237] 37 emitted radar signal
[0238] 38 backscattered/transmitted radar signal
[0239] (b) Ultrasound sensing
[0240] 39 ultrasound transducer
[0241] 40 emitted ultrasound signal
[0242] 41 backscattered ultrasound signal
[0243] (c) Low-frequency ultrasound
[0244] 42 ultrasound loudspeaker
[0245] 43 emitted ultrasound signal
[0246] 44 reflected ultrasound signal
[0247] 45 ultrasound microphone
[0248] (d) Lip and facial camera
[0249] 46 ambient light
[0250] 47 reflected light
[0251] 48 video camera
[0252] (e) Surface electromyography
[0253] 49 surface electromyography sensors (for extralaryngeal musculature)
[0254] 50 surface electromyography sensors (for neck and facial musculature)
[0255] (f) Acoustic microphone
[0256] 51 residual acoustic voice signal
[0257] 52 acoustic microphone
FIG. 6: Schematic of Different Implementation Options for Processing Vocal Tract Signals
[0258] (a) using elements of speech and MFCCs as intermediate representations of speech
[0259] 70 vocal tract data: series of frames
[0260] 71 data pre-processing
[0261] 72 time series of feature vectors
[0262] 73 speech recognition algorithm
[0263] 74 elements of speech: phonemes, syllables, words
[0264] 75 speech synthesis algorithm
[0265] 76 mel-frequency cepstral coefficients
[0266] 77 acoustic waveform synthesis
[0267] 78 acoustic speech waveform
[0268] (b) using MFCCs as intermediate representations of speech
[0269] 70 vocal tract data: series of frames
[0270] 71 data pre-processing
[0271] 72 time series of feature vectors
[0272] 76 mel-frequency cepstral coefficients
[0273] 77 acoustic waveform synthesis
[0274] 78 acoustic speech waveform
[0275] 79 deep neural network algorithm
[0276] (c) End-to-end machine learning algorithm using no intermediate representations of speech
[0277] 70 vocal tract data: series of frames
[0278] 71 data pre-processing
[0279] 72 time series of feature vectors
[0280] 78 acoustic speech waveform
[0281] 80 end-to-end deep neural network algorithm
[0282] (d) End-to-end machine learning algorithm using no explicit pre-processing and no intermediate representations of speech
[0283] 70 vocal tract data: series of frames
[0284] 78 acoustic speech waveform
[0285] 80 end-to-end deep neural network algorithm
FIG. 7: Schematic of Voice Prosthesis for Preferred Embodiment 1: Radar and Video Based Method for Bedridden Patients
[0286] 36 radar antennas
[0287] 48 video camera
[0288] 53 bedridden patient
[0289] 54 patient bed
[0290] 55 power supply
[0291] 56 radar transmission and receiving electronics
[0292] 57 signal processing electronics
[0293] 58 computing device
[0294] 59 audio amplifier
[0295] 60 bedside unit
[0296] 61 mobile computing device with touchscreen
[0297] 62 cantilever
[0298] 63 loudspeaker
FIG. 8: Schematic of Voice Prosthesis for Preferred Embodiment 2: Radar and Video Based Method for Mobile Patients
[0299] 36 radar antennas
[0300] 48 video camera
[0301] 55 power supply
[0302] 56 radar transmission and receiving electronics
[0303] 57 signal processing electronics
[0304] 61 mobile computing device with touchscreen
[0305] 62 cantilever
[0306] 63 loudspeaker
[0307] 64 mobile patient
[0308] 65 wireless transmitter and receiver
[0309] 66 portable electronics unit
FIG. 9: Schematic of Voice Prosthesis for Preferred Embodiment 3: Low-Frequency Ultrasound and Video Based Method for Mobile Patients
[0310] 42 ultrasound loudspeaker
[0311] 45 ultrasound microphone
[0312] 48 video camera
[0313] 55 power supply
[0314] 57 signal processing electronics
[0315] 61 mobile computing device with touchscreen
[0316] 62 cantilever
[0317] 63 loudspeaker
[0318] 64 mobile patient
[0319] 65 wireless transmitter and receiver
[0320] 66 portable electronics unit
[0321] 67 ultrasound waveform generator
[0322] 68 analog-to-digital converter
FIG. 10: Schematic of Voice Prosthesis for Preferred Embodiment 4: Audio and Video Based Method for Mobile Patients with Residual Voice
[0323] 48 video camera
[0324] 52 microphone
[0325] 55 power supply
[0326] 58 computing device
[0327] 59 audio amplifier
[0328] 62 cantilever
[0329] 63 loudspeaker
[0330] 64 mobile patient
[0331] 66 portable electronics unit
[0332] 69 user interface