G10L15/16

Systems and methods for generating labeled data to facilitate configuration of network microphone devices

Systems and methods for generating training data are described herein. Pieces of metadata captured by a plurality of networked sensor systems can be received, where each piece of metadata is associated with a specific set of sensor data captured by one of the plurality of networked sensor systems and includes a set of characteristics for the specific set of captured sensor data. A probabilistic model can be generated based on the received metadata, and simulations can be performed based upon a training corpus by generating multiple scenarios; for each scenario, a scenario-specific version of a particular annotated sample is generated by performing a simulation using the particular annotated sample. The scenario-specific versions of annotated samples from the training corpus can be stored as a training data set on at least one network device.
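The flow in this abstract (metadata, then a probabilistic model, then scenarios, then scenario-specific samples) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and does not come from the patent: the single `noise_rms` characteristic, the Gaussian fit, the additive-noise simulation, and all function and class names.

```python
import random
import statistics
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class AnnotatedSample:
    audio: List[float]  # toy waveform
    label: str          # annotation carried through the simulation unchanged


def fit_model(metadata: List[Dict[str, float]]) -> Tuple[float, float]:
    """Assumed model: a Gaussian over the reported 'noise_rms' characteristic."""
    values = [m["noise_rms"] for m in metadata]
    return statistics.mean(values), statistics.stdev(values)


def sample_scenarios(model: Tuple[float, float], n: int, rng: random.Random) -> List[float]:
    """Draw n scenarios; each scenario is just a non-negative noise level."""
    mean, std = model
    return [max(0.0, rng.gauss(mean, std)) for _ in range(n)]


def simulate(sample: AnnotatedSample, noise_rms: float, rng: random.Random) -> AnnotatedSample:
    """Assumed simulation: add Gaussian noise at the scenario's level."""
    noisy = [x + rng.gauss(0.0, noise_rms) for x in sample.audio]
    return AnnotatedSample(noisy, sample.label)


def build_training_set(corpus: List[AnnotatedSample],
                       metadata: List[Dict[str, float]],
                       n_scenarios: int,
                       seed: int = 0) -> List[AnnotatedSample]:
    """One scenario-specific version of every corpus sample per scenario."""
    rng = random.Random(seed)
    scenarios = sample_scenarios(fit_model(metadata), n_scenarios, rng)
    return [simulate(s, level, rng) for s in corpus for level in scenarios]


corpus = [AnnotatedSample([0.0, 0.5, -0.5, 0.25], "wake_word")]
metadata = [{"noise_rms": 0.01}, {"noise_rms": 0.03}, {"noise_rms": 0.02}]
training_set = build_training_set(corpus, metadata, n_scenarios=3)
```

A real system would model many characteristics jointly and use an acoustic simulator; the sketch only shows how the stages connect.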

Wearable speech input-based vision to audio interpreter
11551688 · 2023-01-10

An eyewear device with camera-based compensation improves the user experience for users with partial or complete blindness. The camera-based compensation determines features, such as objects, and then converts the determined objects to audio that is indicative of the objects and that is perceptible to the eyewear user. The camera-based compensation may use a region-based convolutional neural network (RCNN) to generate a feature map including text that is indicative of objects in images captured by a camera. The feature map is then processed through a speech to audio algorithm featuring a natural language processor to generate audio indicative of the objects in the processed images.

Electronic device and method for controlling the electronic device thereof

An electronic device is provided. The electronic device includes a memory configured to store a speech translation model and at least one processor electronically connected with the memory. The at least one processor is configured to train the speech translation model based on first information related to conversion between a speech in a first language and a text corresponding to the speech in the first language, and second information related to conversion between a text in the first language and a text in a second language corresponding to the text in the first language, and the speech translation model is trained to convert a speech in the first language into a text in the second language and output the text.

METHOD AND APPARATUS FOR TARGET EXAGGERATION FOR DEEP LEARNING-BASED SPEECH ENHANCEMENT

The present disclosure relates to a speech enhancement apparatus, and specifically to a method and apparatus for target exaggeration for deep learning-based speech enhancement. According to an embodiment of the present disclosure, the apparatus can preserve the speech signal contained in a noisy speech signal while performing speech enhancement to remove the noise signal.

Method and apparatus for determining output token
11574190 · 2023-02-07

A method for determining an output token includes predicting a first probability of each of candidate output tokens of a first model, predicting a second probability of each of the candidate output tokens of a second model interworking with the first model, adjusting the second probability of each of the candidate output tokens based on the first probability, and determining the output token among the candidate output tokens based on the first probability and the adjusted second probability.
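The four steps in this abstract can be sketched as follows. The abstract does not say how the second probability is adjusted or how the two are combined, so both rules below are assumptions chosen for illustration: the adjustment multiplies by the first probability and renormalizes, and the final score is a weighted sum. The function names and the `weight` parameter are hypothetical.

```python
from typing import Dict


def adjust_second(first: Dict[str, float], second: Dict[str, float]) -> Dict[str, float]:
    """Assumed adjustment: scale each second-model probability by the
    first-model probability of the same token, then renormalize."""
    scaled = {tok: second[tok] * first[tok] for tok in first}
    total = sum(scaled.values())
    if total == 0.0:
        return dict(second)  # nothing to rescale; fall back to the original values
    return {tok: p / total for tok, p in scaled.items()}


def determine_output_token(first: Dict[str, float],
                           second: Dict[str, float],
                           weight: float = 0.5) -> str:
    """Assumed combination: first probability plus a weighted adjusted
    second probability; the highest-scoring candidate wins."""
    adjusted = adjust_second(first, second)
    scores = {tok: first[tok] + weight * adjusted[tok] for tok in first}
    return max(scores, key=scores.get)


# Probabilities over the same candidate tokens from two cooperating models,
# for example a recognition model and a language model.
first_probs = {"cat": 0.5, "cap": 0.3, "car": 0.2}
second_probs = {"cat": 0.2, "cap": 0.2, "car": 0.6}

adjusted_probs = adjust_second(first_probs, second_probs)
chosen = determine_output_token(first_probs, second_probs)
```

With these inputs the adjustment lowers the second model's preference for "car" because the first model rates it least likely, and the combined score selects "cat".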

Deep multi-channel acoustic modeling using multiple microphone array geometries

Techniques for speech processing using a deep neural network (DNN) based acoustic model front-end are described. A new modeling approach directly models multi-channel audio data received from a microphone array using a first model (e.g., multi-geometry/multi-channel DNN) that is trained using a plurality of microphone array geometries. Thus, the first model may receive a variable number of microphone channels, generate multiple outputs using multiple microphone array geometries, and select the best output as a first feature vector that may be used similarly to beamformed features generated by an acoustic beamformer. A second model (e.g., feature extraction DNN) processes the first feature vector and transforms it to a second feature vector having a lower dimensional representation. A third model (e.g., classification DNN) processes the second feature vector to perform acoustic unit classification and generate text data. The DNN front-end enables improved performance despite a reduction in microphones.
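The three-stage data flow described above can be sketched with simple stand-ins. None of the stages below is a neural network: the first stage is replaced by fixed per-geometry channel weights with highest-energy selection, the second by average pooling, and the third by nearest-prototype matching. These substitutions, the selection criterion, and all names are assumptions made only to show how the stages connect.

```python
from typing import Dict, List


def combine(channels: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted sum across microphone channels for one array geometry."""
    n = len(channels[0])
    return [sum(w * ch[i] for w, ch in zip(weights, channels)) for i in range(n)]


def energy(vector: List[float]) -> float:
    return sum(x * x for x in vector)


def first_model(channels: List[List[float]], geometries: List[List[float]]) -> List[float]:
    """Stand-in for the multi-geometry stage: produce one output per geometry
    that matches the channel count and keep the highest-energy one."""
    outputs = [combine(channels, g) for g in geometries if len(g) == len(channels)]
    return max(outputs, key=energy)  # raises ValueError if no geometry matches


def second_model(features: List[float], out_dim: int) -> List[float]:
    """Stand-in for feature extraction: average-pool to a lower dimension."""
    step = len(features) // out_dim
    return [sum(features[i * step:(i + 1) * step]) / step for i in range(out_dim)]


def third_model(features: List[float], prototypes: Dict[str, List[float]]) -> str:
    """Stand-in for classification: pick the nearest prototype's label."""
    def distance(label: str) -> float:
        return sum((a - b) ** 2 for a, b in zip(features, prototypes[label]))
    return min(prototypes, key=distance)


# Two-microphone input; the three-weight geometry is skipped automatically.
channels = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
geometries = [[0.5, 0.5], [0.5, -0.5], [1 / 3, 1 / 3, 1 / 3]]
prototypes = {"speech_unit": [1.0, 1.0], "silence": [0.0, 0.0]}

first_features = first_model(channels, geometries)
second_features = second_model(first_features, out_dim=2)
unit = third_model(second_features, prototypes)
```

Filtering geometries by channel count mirrors the abstract's point that the first model accepts a variable number of microphone channels.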