Patent classifications
G10L25/30
Method and system for speech enhancement
A method and a system for speech enhancement including a time synchronization unit configured to synchronize microphone signals sent from at least two microphones; a source separation unit configured to separate the synchronized microphone signals and output a separated speech signal, which corresponds to a speech source; and a noise reduction unit including a feature extraction unit configured to extract a speech feature of the separated speech signal and a neural network configured to receive the speech feature and output a clean speech feature.
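The claimed pipeline (synchronize, separate, extract features, denoise with a network) can be sketched very loosely as follows. Everything here is a stand-in: synchronization uses cross-correlation, "separation" is a crude average of the aligned channels, the speech feature is a framed log-magnitude spectrum, and the neural network stage is omitted entirely.

```python
# Hypothetical sketch of the claimed front half of the pipeline: time
# synchronization of two microphone signals via cross-correlation, a crude
# stand-in for source separation, and log-spectral feature extraction.
import numpy as np

def synchronize(ref, other):
    """Align `other` to `ref` by the lag that maximizes cross-correlation."""
    corr = np.correlate(other, ref, mode="full")
    lag = np.argmax(corr) - (len(ref) - 1)
    return np.roll(other, -lag)

def extract_features(signal, frame=64):
    """Log-magnitude spectra of non-overlapping frames (the speech feature)."""
    n = len(signal) // frame
    frames = signal[: n * frame].reshape(n, frame)
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1)))

rng = np.random.default_rng(0)
speech = rng.standard_normal(1024)
mic1 = speech + 0.1 * rng.standard_normal(1024)
mic2 = np.roll(speech, 5) + 0.1 * rng.standard_normal(1024)  # delayed copy

aligned = synchronize(mic1, mic2)
separated = 0.5 * (mic1 + aligned)     # placeholder for the separation unit
features = extract_features(separated) # input to the noise-reduction network
print(features.shape)
```

In the patent, `features` would be fed to the trained network that outputs a clean speech feature; here the sketch stops at feature extraction.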
Training Speech Synthesis to Generate Distinct Speech Sounds
A method (800) of training a text-to-speech (TTS) model (108) includes obtaining training data (150) including reference input text (104) that includes a sequence of characters, a sequence of reference audio features (402) representative of the sequence of characters, and a sequence of reference phone labels (502) representative of distinct speech sounds of the reference audio features. For each of a plurality of time steps, the method includes generating a corresponding predicted audio feature (120) based on a respective portion of the reference input text for the time step and generating, using a phone label mapping network (510), a corresponding predicted phone label (520) associated with the predicted audio feature. The method also includes aligning the predicted phone label with the reference phone label to determine a corresponding predicted phone label loss (622) and updating the TTS model based on the corresponding predicted phone label loss.
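The phone-label loss at the heart of this claim can be sketched as a per-time-step cross-entropy. The phone inventory size, the alignment, and the mapping network are all placeholders: we assume the mapping network emits one logit vector per predicted audio frame and that alignment has already paired frame `t` with reference label `ref[t]`.

```python
# Hedged sketch of the per-time-step phone-label loss: cross-entropy between
# the mapping network's predicted phone distribution and the aligned
# reference phone label, averaged over time steps.
import numpy as np

def phone_label_loss(logits, ref_labels):
    """Mean cross-entropy between predicted phone distributions and labels."""
    z = logits - logits.max(axis=1, keepdims=True)       # stable softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(ref_labels)), ref_labels].mean()

rng = np.random.default_rng(0)
T, n_phones = 8, 40                        # 8 time steps, 40-phone inventory
logits = rng.standard_normal((T, n_phones))  # phone-label mapping network out
ref = rng.integers(0, n_phones, size=T)      # aligned reference phone labels
loss = phone_label_loss(logits, ref)         # scalar used to update the model
print(float(loss) > 0)
```

The scalar `loss` is what the claim's update step would backpropagate through the TTS model.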
Audio reconstruction method and device which use machine learning
Provided are an audio reconstruction method and device for providing improved sound quality by reconstructing a decoding parameter or an audio signal obtained from a bitstream, by using machine learning. The audio reconstruction method includes obtaining a plurality of decoding parameters of a current frame by decoding a bitstream, determining characteristics of a second parameter included in the plurality of decoding parameters and associated with a first parameter, based on the first parameter included in the plurality of decoding parameters, obtaining a reconstructed second parameter by applying a machine learning model to at least one of the plurality of decoding parameters, the second parameter, and the characteristics of the second parameter, and decoding an audio signal, based on the reconstructed second parameter.
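The reconstruction step can be illustrated with a toy stand-in (not the patented codec): treat the "second parameter" as a coarsely quantized decoded value, the associated "first parameter" as correlated side information, and fit a least-squares linear model in place of the machine learning model.

```python
# Illustrative sketch only: a linear least-squares model refines a quantized
# decoding parameter using a correlated first parameter as side information.
import numpy as np

rng = np.random.default_rng(1)
true_vals = np.sin(np.linspace(0, 6, 200))     # original second parameter
first_param = np.cos(np.linspace(0, 6, 200))   # associated first parameter
quantized = np.round(true_vals * 4) / 4        # decoded (lossy) parameter

# Features: the quantized parameter, its side information, and a bias term.
X = np.stack([quantized, first_param, np.ones_like(quantized)], axis=1)
w, *_ = np.linalg.lstsq(X, true_vals, rcond=None)  # "machine learning model"
reconstructed = X @ w                              # refined second parameter

err_q = np.mean((quantized - true_vals) ** 2)
err_r = np.mean((reconstructed - true_vals) ** 2)
print(err_r <= err_q)  # the fit can do no worse than the raw quantized values
```

The patent's model would be trained offline and applied per frame; the least-squares fit here just shows the shape of the computation.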
Multimodal sentiment classification
Sentiment classification can be implemented by an entity-level multimodal sentiment classification neural network. The neural network can include left, right, and target entity subnetworks. The neural network can further include an image network that generates representation data that is combined and weighted with data output by the left, right, and target entity subnetworks to output a sentiment classification for an entity included in a network post.
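The fusion described above can be sketched at the vector level. The subnetworks and the weighting scheme are assumptions here: each subnetwork is reduced to an output vector, the combination weights are a normalized random vector standing in for learned weights, and a random linear head stands in for the trained classifier.

```python
# Toy sketch of the fusion step: left-context, right-context, and
# target-entity subnetworks each yield a vector; the image network yields
# another; learned weights (random here) combine them before a softmax.
import numpy as np

rng = np.random.default_rng(0)
d, n_classes = 16, 3                       # hidden size; neg/neutral/pos

left, right, target, image = (rng.standard_normal(d) for _ in range(4))
alphas = np.exp(rng.standard_normal(4))
alphas /= alphas.sum()                     # attention-style mixing weights

fused = alphas[0]*left + alphas[1]*right + alphas[2]*target + alphas[3]*image
W = rng.standard_normal((n_classes, d))    # classification head
logits = W @ fused
probs = np.exp(logits - logits.max())
probs /= probs.sum()                       # sentiment distribution for entity
print(int(np.argmax(probs)))
```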
Method and apparatus for target exaggeration for deep learning-based speech enhancement
The present disclosure relates to a speech enhancement apparatus and, specifically, to a method and apparatus for target exaggeration in deep learning-based speech enhancement. According to an embodiment of the present disclosure, the target-exaggeration apparatus can preserve the speech component of a noisy speech signal while performing speech enhancement that removes the noise component.
Deep multi-channel acoustic modeling using multiple microphone array geometries
Techniques for speech processing using a deep neural network (DNN) based acoustic model front-end are described. A new modeling approach directly models multi-channel audio data received from a microphone array using a first model (e.g., multi-geometry/multi-channel DNN) that is trained using a plurality of microphone array geometries. Thus, the first model may receive a variable number of microphone channels, generate multiple outputs using multiple microphone array geometries, and select the best output as a first feature vector that may be used similarly to beamformed features generated by an acoustic beamformer. A second model (e.g., feature extraction DNN) processes the first feature vector and transforms it to a second feature vector having a lower dimensional representation. A third model (e.g., classification DNN) processes the second feature vector to perform acoustic unit classification and generate text data. The DNN front-end enables improved performance despite a reduction in microphones.
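The three-stage front-end can be sketched at the shape level. Random weights stand in for the trained DNNs, and maximum vector norm stands in for the unspecified "best output" selection among the geometry-specific heads.

```python
# Shape-level sketch of the three-stage DNN front-end: per-geometry heads
# over stacked microphone channels, best-output selection, a dimensionality-
# reducing feature-extraction DNN, and a classification DNN.
import numpy as np

rng = np.random.default_rng(0)
channels, frame, n_geometries = 4, 128, 3
x = rng.standard_normal(channels * frame)   # stacked multi-channel input

heads = [rng.standard_normal((64, channels * frame))
         for _ in range(n_geometries)]
candidates = [np.tanh(H @ x) for H in heads]          # one output per geometry
first = max(candidates, key=lambda v: np.linalg.norm(v))  # "best" feature vec

W2 = rng.standard_normal((32, 64))          # feature-extraction DNN
second = np.tanh(W2 @ first)                # lower-dimensional representation

W3 = rng.standard_normal((48, 32))          # classification DNN
logits = W3 @ second                        # acoustic-unit scores
print(first.shape, second.shape, logits.shape)
```

In the patent each stage is a trained deep network and the selection criterion is learned; the sketch only shows how the three feature vectors chain together.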
Synthetic speech processing
A speech-processing system receives input data representing text. An input encoder processes the input data to determine first embedding data representing the text. A local attention encoder processes a subset of the first embedding data in accordance with a predicted size to determine second embedding data. An attention encoder processes the second embedding data to determine third embedding data. A decoder processes the third embedding data to determine audio data corresponding to the text.
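The chain of encoders can be sketched with scaled dot-product attention. This is a loose reading: the "predicted size" that bounds the local attention is hard-coded as a fixed window, and the decoder is a single affine layer emitting a few audio features per text position.

```python
# Rough sketch of the encoder chain: input-encoder embeddings, windowed
# (local) attention, full attention, then a minimal affine "decoder".
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
T, d = 10, 8
first = rng.standard_normal((T, d))  # input-encoder embeddings of the text
window = 4                           # stand-in for the predicted subset size

# Local attention: each position attends only within its trailing window.
second = np.stack([attention(first[t:t+1],
                             first[max(0, t - window + 1):t+1],
                             first[max(0, t - window + 1):t+1])[0]
                   for t in range(T)])
third = attention(second, second, second)  # full attention encoder
W_dec = rng.standard_normal((d, 5))
audio = third @ W_dec                      # 5 audio features per position
print(audio.shape)
```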
Airport noise classification method and system
An aircraft noise monitoring system uses a set of geographically distributed noise sensors to receive data corresponding to events captured by the noise sensors. Each event corresponds to noise that exceeds a threshold level. For each event, the system will receive a classification of the event as an aircraft noise event or a non-aircraft noise event. It will then use the data corresponding to the events and the received classifications to train a convolutional neural network (CNN) in a classification process. After training, when the system receives a new noise event, it will use the CNN to classify the new noise event as an aircraft noise event or a non-aircraft noise event, and it will generate an output indicating whether the new noise event is an aircraft noise event or a non-aircraft noise event.
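The monitoring loop described above can be sketched in two steps: flag readings whose level exceeds a threshold as events, then score each event with a tiny 1-D convolution plus logistic layer standing in for the trained CNN (the weights here are random and untrained, and the 70 dB threshold is an arbitrary choice for illustration).

```python
# Minimal sketch of the claimed loop: threshold-based event detection,
# followed by a tiny conv + logistic stand-in for the aircraft/non-aircraft
# CNN classifier.
import numpy as np

rng = np.random.default_rng(0)

def detect_events(levels, threshold=70.0):
    """Indices of readings whose noise level (dB) exceeds the threshold."""
    return np.flatnonzero(levels > threshold)

def classify_event(waveform, kernel, w_out):
    """CNN stand-in: 1-D convolution, ReLU, mean-pool, logistic output."""
    feat = np.maximum(np.convolve(waveform, kernel, mode="valid"), 0.0)
    score = float(w_out * feat.mean())
    return 1.0 / (1.0 + np.exp(-score))   # P(aircraft) in [0, 1]

levels = rng.uniform(40, 90, size=24)     # simulated sensor readings (dB)
events = detect_events(levels)
p = classify_event(rng.standard_normal(256), rng.standard_normal(8), 1.5)
label = "aircraft" if p > 0.5 else "non-aircraft"
print(len(events), label)
```

The real system would train the CNN on human-labelled events first; this sketch only shows the detect-then-classify control flow and the binary output.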