G10L15/04

ELECTRONIC DEVICE AND METHOD FOR PROCESSING USER INPUT
20220413988 · 2022-12-29

Disclosed herein are an electronic device and method thereof. The electronic device includes an input device; an output device; a processor operatively connected to the input device and the output device; and a memory operatively connected to the processor, wherein the memory stores instructions. The instructions are executable by the processor to implement the method, including identifying a function for executing the user input, based at least on input data corresponding to the user input; classifying the identified function based on executability of the function; determining a completion possibility of the classified function; and providing response data via the output device based on a result of the classifying and the completion probability.

Processing speech signals in voice-based profiling
11538472 · 2022-12-27

This document describes a data processing system for processing a speech signal for voice-based profiling. The data processing system segments the speech signal into a plurality of segments, with each segment representing a portion of the speech signal. For each segment, the data processing system generates a feature vector comprising data indicative of one or more features of the portion of the speech signal represented by that segment and determines whether the feature vector comprises data indicative of one or more features with a threshold amount of confidence. For each of a subset of the generated feature vectors, the system processes data in that feature vector to generate a prediction of a value of a profile parameter and transmits an output responsive to machine executable code that generates a visual representation of the prediction of the value of the profile parameter.
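
The pipeline in this abstract (segment, build a feature vector per segment, keep only vectors that meet a confidence threshold, then predict a profile parameter from each kept vector) can be sketched as follows. This is a minimal illustration, not the patented method: the two features (mean energy and zero-crossing rate), the energy-based confidence score, and the names `segment`, `extract`, and `profile` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class FeatureVector:
    values: List[float]  # assumed features: [mean energy, zero-crossing rate]
    confidence: float    # assumed confidence score in [0, 1]


def segment(signal: Sequence[float], size: int) -> List[Sequence[float]]:
    """Split the signal into consecutive, non-overlapping segments."""
    return [signal[i:i + size] for i in range(0, len(signal), size)]


def extract(seg: Sequence[float]) -> FeatureVector:
    """Toy feature extraction; confidence is simply tied to segment energy."""
    energy = sum(x * x for x in seg) / len(seg)
    crossings = sum(1 for a, b in zip(seg, seg[1:]) if a * b < 0)
    zcr = crossings / max(len(seg) - 1, 1)
    return FeatureVector([energy, zcr], confidence=min(1.0, energy))


def profile(
    signal: Sequence[float],
    size: int,
    threshold: float,
    predict: Callable[[FeatureVector], float],
) -> List[float]:
    """Predict a profile parameter only from sufficiently confident vectors."""
    vectors = [extract(s) for s in segment(signal, size)]
    kept = [v for v in vectors if v.confidence >= threshold]
    return [predict(v) for v in kept]
```

Here `predict` stands in for whatever model maps a feature vector to a profile parameter value; the abstract does not specify one.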

AUDIO INFORMATION PROCESSING METHOD, APPARATUS, ELECTRONIC DEVICE AND STORAGE MEDIUM
20220406311 · 2022-12-22

The present disclosure relates to an audio information processing method, an apparatus, an electronic device and a computer-readable storage medium. The audio information processing method includes: determining whether an audio recording start condition is satisfied; collecting audio information associated with an electronic device in response to determining that the audio recording start condition is satisfied; performing word segmentation on text information corresponding to the audio information to obtain word-segmented text information; and displaying the word-segmented text information on a user interface of the electronic device.

DETECTION APPARATUS, METHOD AND PROGRAM FOR THE SAME

A detection device includes: a labeling acoustic feature calculation unit configured to calculate a labeling acoustic feature from voice data; a time information acquisition unit configured to acquire a label with time information corresponding to the voice data from a label with no time information corresponding to the voice data and the labeling acoustic feature, using a labeling acoustic model configured to receive, as inputs, a label with no time information and a labeling acoustic feature and to output a label with time information; an acoustic feature prediction unit configured to predict an acoustic feature corresponding to the label with time information and acquire a predicted value, using an acoustic model configured to receive, as an input, a label with time information and to output an acoustic feature; an acoustic feature calculation unit configured to calculate an acoustic feature from the voice data; a difference calculation unit configured to determine an acoustic difference between the acoustic feature and the predicted value; and a detection unit configured to detect a labeling error based on whether the difference is larger or smaller than a predetermined threshold value.
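
The final two units (difference calculation and threshold-based detection) can be illustrated with a short sketch. It assumes the acoustic features are equal-length sequences of numbers, that the difference is a mean absolute difference, and that a difference above the threshold signals a labeling error; the abstract fixes none of these choices, and the function names are hypothetical.

```python
from typing import Sequence


def acoustic_difference(
    observed: Sequence[float], predicted: Sequence[float]
) -> float:
    """Mean absolute difference between observed and predicted features."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("feature sequences must be non-empty and equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def has_labeling_error(
    observed: Sequence[float], predicted: Sequence[float], threshold: float
) -> bool:
    """Assumption: a difference larger than the threshold indicates an error."""
    return acoustic_difference(observed, predicted) > threshold
```

In practice `observed` would come from the voice data and `predicted` from an acoustic model driven by the time-aligned label; both are plain lists here to keep the example self-contained.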

MULTI-ENCODER END-TO-END AUTOMATIC SPEECH RECOGNITION (ASR) FOR JOINT MODELING OF MULTIPLE INPUT DEVICES

An end-to-end automatic speech recognition (ASR) system includes: a first encoder configured for close-talk input captured by a close-talk input mechanism; a second encoder configured for far-talk input captured by a far-talk input mechanism; and an encoder selection layer configured to select at least one of the first and second encoders for use in producing ASR output. The selection is made based on at least one of short-time Fourier transform (STFT), Mel-frequency Cepstral Coefficient (MFCC) and filter bank derived from at least one of the close-talk input and the far-talk input. If signals from both the close-talk input mechanism and the far-talk input mechanism are present for a speech segment, the encoder selection layer dynamically selects between the close-talk encoder and the far-talk encoder to select the encoder that better recognizes the speech segment. An encoder-decoder model is used to produce the ASR output.
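
As a rough illustration of the encoder selection step, the sketch below scores each available input from its feature vector and picks the encoder with the highest softmax probability. The linear scoring with a shared `weights` vector, the hard argmax choice, and the names `select_encoder`, `"close"`, and `"far"` are assumptions for the example; the abstract only states that selection is based on STFT, MFCC, or filter bank features.

```python
import math
from typing import Dict, List, Optional, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_encoder(
    close_features: Optional[Sequence[float]],
    far_features: Optional[Sequence[float]],
    weights: Sequence[float],
) -> str:
    """Return "close" or "far" for one speech segment.

    A missing input (None) is never selected; if both inputs are present,
    the one with the higher (assumed linear) score wins.
    """
    available: Dict[str, Sequence[float]] = {}
    if close_features is not None:
        available["close"] = close_features
    if far_features is not None:
        available["far"] = far_features
    if not available:
        raise ValueError("at least one input must be present")
    names = list(available)
    scores = [sum(w * f for w, f in zip(weights, available[n])) for n in names]
    probs = softmax(scores)
    return names[probs.index(max(probs))]
```

A trained system would learn the selection jointly with the encoders and decoder; the fixed weights here only show the control flow.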

Transportation vehicle control with phoneme generation
11530930 · 2022-12-20

A transportation vehicle has a navigation system and an operating system connected to the navigation system for data transmission via a bus system. The transportation vehicle has a microphone and includes a phoneme generation module for generating phonemes from an acoustic voice signal or the output signal of the microphone, the phonemes being part of a predefined selection of exclusively monosyllabic phonemes, and a phoneme-to-grapheme module for generating inputs to operate the transportation vehicle based on the monosyllabic phonemes generated by the phoneme generation module.

Detecting a trigger of a digital assistant

Systems and processes for operating an intelligent automated assistant are provided. In accordance with one example, a method includes, at an electronic device with one or more processors, memory, and a plurality of microphones, sampling, at each of the plurality of microphones of the electronic device, an audio signal to obtain a plurality of audio signals; processing the plurality of audio signals to obtain a plurality of audio streams; and determining, based on the plurality of audio streams, whether any of the plurality of audio signals corresponds to a spoken trigger. The method further includes, in accordance with a determination that the plurality of audio signals corresponds to the spoken trigger, initiating a session of the digital assistant; and in accordance with a determination that the plurality of audio signals does not correspond to the spoken trigger, forgoing initiating a session of the digital assistant.
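
The decision logic described here (check every microphone's stream, start a session if any matches the trigger, otherwise do nothing) can be sketched in a few lines. The scoring function, the threshold, and the `Assistant` class are hypothetical stand-ins; the abstract does not say how a stream is matched against the spoken trigger.

```python
from typing import Callable, Sequence

Stream = Sequence[float]


def detect_trigger(
    streams: Sequence[Stream],
    score_fn: Callable[[Stream], float],
    threshold: float,
) -> bool:
    """True if any stream's trigger score reaches the threshold."""
    return any(score_fn(s) >= threshold for s in streams)


class Assistant:
    """Hypothetical assistant that tracks whether a session is active."""

    def __init__(self) -> None:
        self.session_active = False

    def handle(
        self,
        streams: Sequence[Stream],
        score_fn: Callable[[Stream], float],
        threshold: float,
    ) -> bool:
        # Initiate a session only when a trigger is detected; otherwise
        # leave the current state unchanged.
        if detect_trigger(streams, score_fn, threshold):
            self.session_active = True
        return self.session_active
```

A real `score_fn` would be a keyword-spotting model; for experimentation, something as simple as peak amplitude (`lambda s: max(abs(x) for x in s)`) is enough to exercise the control flow.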