G10L15/16

Deep multi-channel acoustic modeling using multiple microphone array geometries

Techniques for speech processing using a deep neural network (DNN) based acoustic model front-end are described. A new modeling approach directly models multi-channel audio data received from a microphone array using a first model (e.g., multi-geometry/multi-channel DNN) that is trained using a plurality of microphone array geometries. Thus, the first model may receive a variable number of microphone channels, generate multiple outputs using multiple microphone array geometries, and select the best output as a first feature vector that may be used similarly to beamformed features generated by an acoustic beamformer. A second model (e.g., feature extraction DNN) processes the first feature vector and transforms it to a second feature vector having a lower dimensional representation. A third model (e.g., classification DNN) processes the second feature vector to perform acoustic unit classification and generate text data. The DNN front-end enables improved performance despite a reduction in microphones.
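
The three-model chain in this abstract can be sketched in plain Python. This is only an illustration under stated assumptions: the abstract does not say how the "best" output is chosen, so the sketch assumes highest energy, and `make_branch`, `front_end`, `reduce_dimension`, and `classify` are hypothetical stand-ins for the trained DNNs rather than anything described in the source.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Frame = List[float]                      # one feature vector
Branch = Callable[[Sequence[Frame]], Frame]


def make_branch(weights: Sequence[float]) -> Branch:
    """Hypothetical per-geometry branch: a weighted average across channels."""
    def branch(channels: Sequence[Frame]) -> Frame:
        # zip() stops at the shorter input, so fewer channels than weights is fine.
        used = list(zip(weights, channels))
        total = sum(w for w, _ in used) or 1.0
        width = len(channels[0])
        return [sum(w * ch[i] for w, ch in used) / total for i in range(width)]
    return branch


def energy(frame: Frame) -> float:
    return sum(x * x for x in frame)


def front_end(channels: Sequence[Frame], branches: Dict[str, Branch]) -> Tuple[str, Frame]:
    """First model: one output per geometry, keep the best (assumed: max energy)."""
    outputs = {name: branch(channels) for name, branch in branches.items()}
    best = max(outputs, key=lambda name: energy(outputs[name]))
    return best, outputs[best]


def reduce_dimension(frame: Frame, out_dim: int) -> Frame:
    """Second model stand-in: average adjacent values to get a shorter vector."""
    step = len(frame) // out_dim
    return [sum(frame[i * step:(i + 1) * step]) / step for i in range(out_dim)]


def classify(features: Frame, labels: Sequence[str]) -> str:
    """Third model stand-in: pick the label with the largest feature value."""
    index = max(range(len(features)), key=lambda i: features[i])
    return labels[index]
```

Calling `front_end`, then `reduce_dimension`, then `classify` mirrors the first, second, and third models in order; a real system would replace each stand-in with a trained network.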

Attention-based joint acoustic and text on-device end-to-end model

A method includes receiving a training example for a listen-attend-spell (LAS) decoder of a two-pass streaming neural network model and determining whether the training example corresponds to a supervised audio-text pair or an unpaired text sequence. When the training example corresponds to an unpaired text sequence, the method also includes determining a cross entropy loss based on a log probability associated with a context vector of the training example. The method also includes updating the LAS decoder and the context vector based on the determined cross entropy loss.
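
The branch on example type and the cross entropy computation might look like the following minimal sketch. The `TrainingExample` layout, the use of `audio is None` to mark unpaired text, and the per-step logits are assumptions made for illustration; the parameter update of the decoder and context vector is left out because the abstract does not specify an optimizer.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class TrainingExample:
    text: List[int]                        # target token ids
    audio: Optional[List[float]] = None    # None marks an unpaired text sequence (assumed)


def is_unpaired(example: TrainingExample) -> bool:
    return example.audio is None


def log_softmax(logits: Sequence[float]) -> List[float]:
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(step_logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean negative log probability assigned to the target tokens."""
    total = 0.0
    for logits, target in zip(step_logits, targets):
        total -= log_softmax(logits)[target]
    return total / len(targets)
```

In a training loop, `is_unpaired` would select the loss path, and the logits passed to `cross_entropy` would come from the decoder conditioned on the context vector.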

METHOD AND DEVICE FOR INFORMATION PROCESSING
20180005624 · 2018-01-04

An information processing method and an electronic device are provided. The method includes: obtaining audio data collected by a slave device; obtaining contextual data corresponding to the slave device; and obtaining a recognition result of recognizing the audio data based on the contextual data. The contextual data characterizes a voice environment of the audio data collected by the slave device.

Method and System for Facilitating the Detection of Time Series Patterns
20180012120 · 2018-01-11

According to a first aspect of the present disclosure, a method for facilitating the detection of one or more time series patterns is conceived, comprising building one or more artificial neural networks, wherein, for at least one time series pattern to be detected, a specific one of said artificial neural networks is built. According to a second aspect of the present disclosure, a corresponding computer program is provided. According to a third aspect of the present disclosure, a non-transitory computer-readable medium is provided that comprises a computer program of the kind set forth. According to a fourth aspect of the present disclosure, a corresponding system for facilitating the detection of one or more time series patterns is provided.
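
The idea of building one network per pattern can be illustrated with a deliberately tiny example. The sketch below assumes fixed-length windows and uses a single perceptron as a stand-in for each artificial neural network; `train_perceptron`, `build_detectors`, and `detect` are hypothetical names, and the abstract does not specify the network type or training rule.

```python
from typing import Dict, List, Sequence, Tuple

Detector = Tuple[List[float], float]     # (weights, bias)


def train_perceptron(samples: Sequence[Sequence[float]], labels: Sequence[int],
                     epochs: int = 20, lr: float = 0.1) -> Detector:
    """Train a single-neuron classifier with the classic perceptron rule."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = y - (1 if activation > 0 else 0)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


def build_detectors(training_sets: Dict[str, Tuple[Sequence[Sequence[float]], Sequence[int]]]
                    ) -> Dict[str, Detector]:
    """Build one dedicated detector for each pattern name."""
    return {name: train_perceptron(samples, labels)
            for name, (samples, labels) in training_sets.items()}


def detect(detector: Detector, window: Sequence[float]) -> bool:
    weights, bias = detector
    return sum(w * xi for w, xi in zip(weights, window)) + bias > 0
```

Keeping a separate detector per pattern means a new pattern can be added by training one more small model without retraining the others.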

Information-processing device, vehicle, computer-readable storage medium, and information-processing method
11710499 · 2023-07-25

An information-processing device includes a first feature-value information acquiring unit for acquiring an acoustic feature-value vector and a language feature-value vector extracted from a user's spoken voice. The information-processing device includes a second feature-value information acquiring unit for acquiring an image feature-value vector extracted from the user's facial image. The information-processing device includes an emotion estimating unit including a learned model including: a first attention layer using, as inputs, a first vector generated from the acoustic feature-value vector and a second vector generated from the image feature-value vector; and a second attention layer using, as an input, an output vector from the first attention layer and a third vector generated from the language feature-value vector, wherein the emotion estimating unit is for estimating the user's emotion based on the output vector from the second attention layer.
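
The order in which the three modalities are fused can be sketched as two stacked attention steps. This is a simplified illustration, not the patented model: it assumes scaled dot-product attention with hypothetical learned queries `query_1` and `query_2`, assumes all vectors share one dimension, and omits the final emotion classifier.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, vectors: Sequence[Vector]) -> Vector:
    """Scaled dot-product attention of one query over a small set of vectors."""
    dim = len(query)
    scores = [sum(q * v for q, v in zip(query, vec)) / math.sqrt(dim) for vec in vectors]
    weights = softmax(scores)
    return [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]


def fuse(acoustic: Vector, image: Vector, language: Vector,
         query_1: Vector, query_2: Vector) -> Vector:
    # First attention layer: acoustic-derived and image-derived vectors.
    first_out = attend(query_1, [acoustic, image])
    # Second attention layer: first layer's output and the language-derived vector.
    return attend(query_2, [first_out, language])
```

The vector returned by `fuse` corresponds to the second attention layer's output, which the abstract says the emotion estimating unit uses for its estimate.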

SYSTEMS AND METHODS FOR IMPROVED USER INTERFACE

Aspects of the present disclosure relate to systems and methods for a voice-centric virtual or soft keyboard (or keypad). Unlike other keyboards, embodiments of the present disclosure prioritize the voice keyboard while providing users with quick and uniform navigation to other keyboards (e.g., alphabet, punctuation, symbols, emojis, etc.). In addition, in embodiments, common actions, such as delete and return, are also easily accessible. In embodiments, the keyboard is also configurable to allow a user to organize buttons according to their desired use and layout. Embodiments of such a keyboard provide a voice-centric, seamless, and powerful interface experience for users.

Graphical user interface and parametric equalizer in gaming systems

A system that incorporates the subject disclosure may include, for example, a gaming system that cooperates with a graphical user interface to enable user modification and enhancement of one or more audio streams associated with the gaming system. In embodiments, the audio streams may include a game audio stream, a chat audio stream of conversation among players of a video game, and a microphone audio stream of a player of the video game. Additional embodiments are disclosed.
