Patent classifications
G10L25/30
Emitting word timings with end-to-end models
A method includes receiving a training example that includes audio data representing a spoken utterance and a ground truth transcription. For each word in the spoken utterance, the method also includes inserting a placeholder symbol before the respective word, identifying a respective ground truth alignment for the beginning and end of the respective word, determining a beginning word piece and an ending word piece, and generating a first constrained alignment for the beginning word piece and a second constrained alignment for the ending word piece. The first constrained alignment is aligned with the ground truth alignment for the beginning of the respective word, and the second constrained alignment is aligned with the ground truth alignment for the end of the respective word. The method also includes constraining an attention head of a second pass decoder by applying the first and second constrained alignments.
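As a rough illustration of the constrained alignments described above, the following Python sketch inserts a placeholder symbol before each word, picks beginning and ending word pieces with a stand-in tokenizer, and builds per-piece frame windows plus a boolean attention mask for a hypothetical second-pass decoder. The tokenizer, the tolerance, and the mask format are illustrative assumptions, not the patent's specification.

```python
PLACEHOLDER = "<w>"  # placeholder symbol inserted before each word

def word_pieces(word):
    # Stand-in tokenizer: the first piece is the "beginning" word piece and the
    # last piece is the "ending" word piece of the word.
    return [word[:2]] + ["##" + ch for ch in word[2:]]

def build_constrained_alignments(words, tolerance=2):
    """words: list of (word, begin_frame, end_frame) ground-truth alignments.
    Returns the token sequence (with placeholders) and, for the beginning and
    ending word piece of each word, the frame window it is constrained to."""
    tokens, constraints = [], {}
    for word, begin, end in words:
        tokens.append(PLACEHOLDER)
        start_idx = len(tokens)
        tokens.extend(word_pieces(word))
        # First constrained alignment: beginning word piece near the word start.
        constraints[start_idx] = range(max(0, begin - tolerance), begin + tolerance + 1)
        # Second constrained alignment: ending word piece near the word end.
        constraints[len(tokens) - 1] = range(max(0, end - tolerance), end + tolerance + 1)
    return tokens, constraints

def attention_mask(num_frames, tokens, constraints):
    """Boolean mask [num_tokens][num_frames]; True = the token may attend to the
    frame. Constrained tokens may only attend within their frame window."""
    mask = [[True] * num_frames for _ in tokens]
    for tok_idx, frames in constraints.items():
        mask[tok_idx] = [t in frames for t in range(num_frames)]
    return mask

words = [("hello", 3, 12), ("world", 14, 25)]     # dummy ground-truth alignments
tokens, constraints = build_constrained_alignments(words)
mask = attention_mask(30, tokens, constraints)
print(tokens)
print(sum(mask[1]))   # frames the beginning piece of "hello" may attend to
```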
Apparatus and method for encoding/decoding audio signal using information of previous frame
Disclosed is an apparatus and method for encoding/decoding an audio signal using information of a previous frame. An audio signal encoding method includes: generating a current latent vector by reducing the dimensionality of a current frame of an audio signal; generating a concatenation vector by concatenating a previous latent vector, generated by reducing the dimensionality of a previous frame of the audio signal, with the current latent vector; and encoding and quantizing the concatenation vector.
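A minimal PyTorch sketch of this encoding step follows: each frame is reduced to a latent vector, the previous and current latent vectors are concatenated, and the concatenation vector is encoded and quantized. The frame size, latent width, linear layers, and uniform scalar quantizer are toy assumptions; the patent does not fix these details.

```python
import torch
import torch.nn as nn

class PrevFrameCodecEncoder(nn.Module):
    def __init__(self, frame_size=320, latent_dim=64, code_dim=32):
        super().__init__()
        # Dimensionality reduction of a single frame into a latent vector.
        self.reduce = nn.Linear(frame_size, latent_dim)
        # Encoding of the concatenated previous + current latent vectors.
        self.encode = nn.Linear(2 * latent_dim, code_dim)

    def quantize(self, x, levels=256):
        # Placeholder uniform scalar quantizer over [-1, 1].
        x = torch.tanh(x)
        return torch.round((x + 1) / 2 * (levels - 1)).to(torch.int64)

    def forward(self, prev_frame, cur_frame):
        prev_latent = self.reduce(prev_frame)                  # previous latent vector
        cur_latent = self.reduce(cur_frame)                    # current latent vector
        concat = torch.cat([prev_latent, cur_latent], dim=-1)  # concatenation vector
        return self.quantize(self.encode(concat))              # encoded, quantized codes

enc = PrevFrameCodecEncoder()
prev, cur = torch.randn(1, 320), torch.randn(1, 320)
print(enc(prev, cur).shape)   # torch.Size([1, 32])
```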
METHOD AND APPARATUS FOR AUTOMATIC COUGH DETECTION
A method for identifying cough sounds in an audio recording of a subject, including: operating at least one electronic processor to identify potential cough sounds in the audio recording; operating the at least one electronic processor to transform one or more of the potential cough sounds into corresponding one or more image representations; operating the at least one electronic processor to apply the one or more image representations to a representation pattern classifier trained to confirm whether a potential cough sound is or is not a cough sound; and operating the at least one electronic processor to flag one or more of the potential cough sounds as confirmed cough sounds based on an output of the representation pattern classifier.
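A minimal sketch of this pipeline follows, assuming a log-magnitude spectrogram as the image representation and a small CNN as the representation pattern classifier; neither choice is fixed by the patent, and the classifier here is untrained.

```python
import torch
import torch.nn as nn

def to_image_representation(segment, n_fft=256, hop=64):
    """Transform a potential cough sound (1-D waveform) into a 2-D image
    (log-magnitude spectrogram)."""
    spec = torch.stft(segment, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    return torch.log1p(spec.abs()).unsqueeze(0)   # shape: [1, freq, time]

class CoughPatternClassifier(nn.Module):
    """Representation pattern classifier: cough vs. not-cough."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(8),
            nn.Flatten(), nn.Linear(8 * 8 * 8, 2))

    def forward(self, x):
        return self.net(x)

def flag_confirmed_coughs(segments, classifier, threshold=0.5):
    """Flag potential cough sounds whose classifier output confirms them."""
    confirmed = []
    for i, seg in enumerate(segments):
        image = to_image_representation(seg).unsqueeze(0)   # add batch dimension
        prob_cough = torch.softmax(classifier(image), dim=-1)[0, 1]
        if prob_cough >= threshold:
            confirmed.append(i)
    return confirmed

segments = [torch.randn(4000) for _ in range(3)]   # dummy candidate segments
print(flag_confirmed_coughs(segments, CoughPatternClassifier()))
```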
Systems and Methods for Assisted Translation and Lip Matching for Voice Dubbing
Systems and methods for generating candidate translations for use in creating synthetic or human-acted voice dubbings, aiding human translators in generating translations that match the corresponding video, automatically grading how well a candidate translation matches the corresponding video, suggesting modifications to the speed and/or timing of the translated text to improve the grading of a candidate translation, and suggesting modifications to the voice dubbing and/or video to improve the grading of a candidate translation. In that regard, the present technology may be used to fully automate the process of generating lip-matched translations and associated voice dubbings, or as an aid for human-in-the-loop processes that may reduce or eliminate the time and effort required from translators, adapters, voice actors, and/or audio editors to generate voice dubbings.
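One way to ground the "grading" idea is a duration-matching score: estimate how long a candidate translation would take to speak, compare it with the time the original line occupies on screen, and suggest a speed adjustment. The syllable-based duration model and the scoring formula below are illustrative assumptions, not the patent's method.

```python
def estimated_duration_sec(text, syllables_per_sec=4.0):
    """Rough spoken-duration estimate from a crude syllable count (vowel groups)."""
    vowels, syllables, prev_vowel = "aeiouy", 0, False
    for ch in text.lower():
        is_vowel = ch in vowels
        if is_vowel and not prev_vowel:
            syllables += 1
        prev_vowel = is_vowel
    return max(syllables, 1) / syllables_per_sec

def grade_candidate(candidate_text, source_duration_sec):
    """Grade how well a candidate translation matches the video timing (0..1)
    and suggest a playback-speed factor that would close the gap."""
    duration = estimated_duration_sec(candidate_text)
    ratio = duration / source_duration_sec
    score = max(0.0, 1.0 - abs(ratio - 1.0))   # 1.0 = perfect duration match
    return {"score": round(score, 2), "suggested_speed": round(ratio, 2)}

# Example: the source line occupies 2.0 seconds of video.
for cand in ["Where are you going?", "Where exactly do you think you are going?"]:
    print(cand, "->", grade_candidate(cand, source_duration_sec=2.0))
```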
DIFFICULT AIRWAY EVALUATION METHOD AND DEVICE BASED ON MACHINE LEARNING VOICE TECHNOLOGY
The present disclosure relates to a difficult airway evaluation method and device based on machine learning voice technology. The method includes the following steps: acquiring voice data of a patient; extracting features from the voice data, obtaining the pitch period of the pronunciations, and acquiring voiced and unvoiced sound features based on that pitch period; and constructing a difficult airway evaluation classifier based on the machine learning voice technology, analyzing the received voiced and unvoiced sound features with the trained difficult airway evaluation classifier, and scoring the severity of the difficult airway to obtain an evaluation result.
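A minimal sketch of the feature side follows, assuming autocorrelation-based pitch-period estimation, an energy-based voiced/unvoiced decision, and a generic scikit-learn classifier trained on placeholder data; the disclosure's actual features, model, and severity scale are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pitch_period(frame, fs=16000, fmin=75, fmax=400):
    """Estimate the pitch period (in samples) of one frame via autocorrelation."""
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    return lo + int(np.argmax(corr[lo:hi]))

def voiced_unvoiced_features(signal, fs=16000, frame_len=400):
    """Voiced-sound feature (mean pitch period of voiced frames) and unvoiced
    features (unvoiced-frame ratio, mean zero-crossing rate)."""
    periods, unvoiced, zcrs = [], 0, []
    for start in range(0, len(signal) - frame_len, frame_len):
        frame = signal[start:start + frame_len]
        zcrs.append(float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2))
        if float(np.mean(frame ** 2)) > 1e-4:      # crude voiced/unvoiced decision
            periods.append(pitch_period(frame, fs))
        else:
            unvoiced += 1
    n_frames = max(len(periods) + unvoiced, 1)
    return [float(np.mean(periods)) if periods else 0.0,
            unvoiced / n_frames,
            float(np.mean(zcrs))]

# Evaluation classifier scoring difficult-airway severity from the features.
X = np.random.rand(40, 3)                      # placeholder training features
y = np.random.randint(0, 3, size=40)           # placeholder severity scores 0-2
clf = RandomForestClassifier().fit(X, y)

voice = np.random.randn(16000)                 # one second of dummy voice data
print("severity score:", clf.predict([voiced_unvoiced_features(voice)])[0])
```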
Method and System for Dereverberation of Speech Signals
A system and method for reverberation reduction is disclosed. A first Deep Neural Network (DNN) produces a first estimate of a target direct-path signal from a mixture of acoustic signals that include the target direct-path signal and a reverberation of the target direct-path signal. A filter modeling a room impulse response (RIR) for the first estimate is estimated. The filter, when applied to the first estimate of the target direct-path signal, generates a result closest to a residual between the mixture of the acoustic signals and the first estimate of the target direct-path signal according to a distance function. A mixture with reduced reverberation of the target direct-path signal is obtained by removing the result of applying the filter to the first estimate of the target direct-path signal from the received mixture. A second DNN produces a second estimate of the target direct-path signal from the mixture with reduced reverberation.
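The two-stage structure can be sketched compactly: estimate an FIR filter by least squares so that filtering the first estimate best matches the residual, subtract that filtered signal from the mixture, and refine with a second network. The NumPy code below uses identity functions as stand-ins for the two DNNs and a squared-error distance; the actual network architectures and distance function are not specified here.

```python
import numpy as np

def estimate_rir_filter(first_estimate, residual, taps=64):
    """Find the FIR filter that, applied to the first direct-path estimate,
    best matches the residual (mixture minus first estimate) in least squares."""
    n = len(residual)
    A = np.zeros((n, taps))                     # convolution matrix of the estimate
    for k in range(taps):
        A[k:, k] = first_estimate[:n - k]
    g, *_ = np.linalg.lstsq(A, residual, rcond=None)
    return g

def dereverberate(mixture, dnn1, dnn2, taps=64):
    s1 = dnn1(mixture)                            # first direct-path estimate
    residual = mixture - s1                       # mostly reverberation
    g = estimate_rir_filter(s1, residual, taps)   # filter modeling the RIR
    reverb_estimate = np.convolve(s1, g)[:len(mixture)]
    reduced = mixture - reverb_estimate           # mixture with reduced reverberation
    return dnn2(reduced)                          # second direct-path estimate

# Toy stand-ins for the two DNNs and a synthetic reverberant mixture.
dnn1 = dnn2 = lambda x: x
direct = np.random.randn(8000)
rir_tail = 0.3 * np.random.randn(64)
mixture = direct + np.convolve(direct, rir_tail)[:len(direct)]
print(dereverberate(mixture, dnn1, dnn2).shape)
```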