G10L2015/025

ELECTRONIC DEVICE AND METHOD FOR CONTROLLING THEREOF
20220343939 · 2022-10-27

An electronic device and a method for controlling the same are provided. The electronic device includes a memory storing a neural network model and a processor configured to input data to the neural network model to obtain output data. Based on a comparison between first output data obtained from an input first modality and second output data obtained from an input second modality, the neural network model is trained to output, in response to the second modality being input, the first modality corresponding to the first output data. The second modality may include at least one masking element.

SYSTEM AND METHOD FOR ANALYSING AN AUDIO TO MEASURE ORAL READING FLUENCY
20220343790 · 2022-10-27

A system (1) analyzes audio to measure oral reading fluency, or progress in oral reading fluency (2), for a text read aloud in the audio. The system (1) includes an input unit (3) which receives a target audio (4) from a user. The target audio (4) relates to an oral reading of the text by the user. The system (1) further includes a transcribing unit (5) which receives and processes the target audio (4) and generates a target transcription (6) of the target audio (4). The system (1) also includes a processing unit (7) which receives and processes at least one of the target transcription (6), the text (8), the target audio (4), or a reference audio (9), or a combination thereof, and generates primary metrics (10) comprising various parameters that measure reading fluency. The system supports user-specific dictionary customization to incorporate non-dictionary words in the analysis.
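
The abstract does not name the parameters that make up the primary metrics. As one hedged illustration, the sketch below computes words correct per minute, a measure commonly used for oral reading fluency, from the reference text and the transcription. The function name, the word-alignment method (`difflib.SequenceMatcher`), and the sample sentences are assumptions made for this example, not details from the application.

```python
from difflib import SequenceMatcher


def words_correct_per_minute(reference_text: str, transcription: str,
                             duration_seconds: float) -> float:
    """Count reference words that also appear, in order, in the transcription,
    then scale the count to a per-minute rate."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    reference_words = reference_text.lower().split()
    spoken_words = transcription.lower().split()
    # Align the two word sequences; matching blocks are runs of identical words.
    matcher = SequenceMatcher(a=reference_words, b=spoken_words, autojunk=False)
    correct = sum(block.size for block in matcher.get_matching_blocks())
    return correct * 60.0 / duration_seconds


# Hypothetical reading: one word ("sat") is misread as "sit" over 30 seconds.
rate = words_correct_per_minute("the cat sat on the mat",
                                "the cat sit on the mat", 30.0)
print(rate)  # 10.0
```

A real system would also need to normalize punctuation and handle the user-specific dictionary entries mentioned above; both are left out here.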

TECHNIQUES FOR LANGUAGE INDEPENDENT WAKE-UP WORD DETECTION
20230082944 · 2023-03-16

A method for a user device, including receiving a first acoustic input of a user speaking a wake-up word in the target language; providing a first acoustic feature derived from the first acoustic input to an acoustic model stored on the user device to obtain a first sequence of speech units corresponding to the wake-up word spoken by the user in the target language, the acoustic model trained on a corpus of training data in a source language different than the target language; receiving a second acoustic input including the wake-up word in the target language; providing a second acoustic feature derived from the second acoustic input to the acoustic model to obtain a second sequence of speech units corresponding to the wake-up word in the target language; and comparing the first and second sequences of speech units to recognize the wake-up word in the target language.
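
The final comparison step can be sketched as follows. The abstract does not say how the two sequences of speech units are compared; this example assumes a length-normalized Levenshtein distance, and the function names, phone labels, and the 0.34 threshold are all hypothetical.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences of speech units."""
    prev = list(range(len(b) + 1))
    for i, unit_a in enumerate(a, start=1):
        curr = [i]
        for j, unit_b in enumerate(b, start=1):
            cost = 0 if unit_a == unit_b else 1
            curr.append(min(prev[j] + 1,          # delete unit_a
                            curr[j - 1] + 1,      # insert unit_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def matches_wake_word(enrolled: Sequence[str], candidate: Sequence[str],
                      max_normalized_distance: float = 0.34) -> bool:
    """Accept the candidate if it is close enough to the enrolled sequence.
    The threshold is an assumed value, not one taken from the source."""
    longest = max(len(enrolled), len(candidate))
    if longest == 0:
        return False
    return edit_distance(enrolled, candidate) / longest <= max_normalized_distance


# Both sequences would come from the same source-language acoustic model.
enrolled = ["HH", "EH", "L", "OW"]       # first acoustic input (enrollment)
later = ["HH", "AH", "L", "OW"]          # second acoustic input, one unit differs
other = ["G", "UH", "D", "B", "AY"]      # unrelated utterance

print(matches_wake_word(enrolled, later))  # True
print(matches_wake_word(enrolled, other))  # False
```

Because both sequences are produced by the same acoustic model, systematic errors from the language mismatch tend to appear in both, which is why a tolerant sequence comparison is a plausible choice here.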

SYSTEMS, METHODS, DEVICES AND APPARATUSES FOR DETECTING FACIAL EXPRESSION

A system, method and apparatus for detecting facial expressions according to EMG signals.

Automatic speech recognition correction

Systems, methods, and computer-readable media for correcting transcriptions created through automatic speech recognition. A transcription of speech created using an automatic speech recognition system can be received. One or more domain-specific contexts associated with the speech can be identified and a text span that includes a mistranscribed entry can be recognized from the speech based on the one or more domain-specific contexts. Additionally, features can be extracted from the mistranscribed entry and the extracted features can be matched against an index of domain-specific entries to identify a correct entry of the mistranscribed entry. Subsequently, the transcription can be corrected by replacing the mistranscribed entry with the correct entry.
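
The match-and-replace step might look like the sketch below. The abstract does not specify which features are extracted or how the index is searched; this example assumes character-level similarity via the standard library's `difflib.get_close_matches`, and the index contents, function name, and cutoff are hypothetical.

```python
from difflib import get_close_matches

# Hypothetical index of domain-specific entries (here, medication names).
DOMAIN_INDEX = ["metformin", "lisinopril", "atorvastatin"]


def correct_span(transcription: str, span: str, index: list,
                 cutoff: float = 0.6) -> str:
    """Replace a suspected mistranscribed span with its closest index entry.
    Returns the transcription unchanged when no entry is similar enough."""
    # Stand-in for feature extraction: lowercase and drop internal whitespace.
    key = "".join(span.lower().split())
    matches = get_close_matches(key, index, n=1, cutoff=cutoff)
    if not matches:
        return transcription
    return transcription.replace(span, matches[0], 1)


fixed = correct_span("the patient takes met forming daily",
                     "met forming", DOMAIN_INDEX)
print(fixed)  # the patient takes metformin daily
```

Identifying the span itself, which the abstract ties to the domain-specific context, is outside the scope of this sketch; the span is passed in directly.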

SYSTEMS AND METHODS FOR CORRECTING AUTOMATIC SPEECH RECOGNITION ERRORS

A system may include processor(s), and memory in communication with the processor(s) and storing instructions configured to cause the system to correct ASR errors. The system may receive a transcription comprising transcribed word(s) and may determine whether the transcribed word(s) exceed associated predefined confidence level(s). Responsive to determining a transcribed word does not exceed a predefined confidence level, the system may generate a predicted word. The system may calculate a distance between numerical representations of the transcribed word and the predicted word and may determine whether the distance exceeds a predefined threshold. Responsive to determining the distance exceeds the predefined threshold, the system may determine whether at least one red flag word of a list of red flag words corresponds to a context of the transcription, and, responsive to making that determination, may classify the transcription as associated with a first category.
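
The two threshold checks that gate the red flag lookup can be sketched as below. The abstract does not name the distance measure or any threshold values; cosine distance, the function names, and the defaults `min_confidence=0.8` and `max_distance=0.5` are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 minus cosine similarity of two numerical word representations."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("zero-length vector has no direction")
    return 1.0 - dot / (norm_u * norm_v)


def needs_red_flag_check(confidence: float,
                         transcribed_vec: Sequence[float],
                         predicted_vec: Sequence[float],
                         min_confidence: float = 0.8,
                         max_distance: float = 0.5) -> bool:
    """True when the word is low-confidence AND far from the predicted word."""
    if confidence > min_confidence:
        return False  # confident enough: no predicted word is generated
    return cosine_distance(transcribed_vec, predicted_vec) > max_distance


# Low confidence and orthogonal vectors: proceed to the red flag word lookup.
print(needs_red_flag_check(0.4, [1.0, 0.0], [0.0, 1.0]))  # True
# High confidence: the transcribed word is accepted as is.
print(needs_red_flag_check(0.95, [1.0, 0.0], [0.0, 1.0]))  # False
```

In a full system the predicted word's vector would come from a language model conditioned on the surrounding transcription; here both vectors are supplied by the caller.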

Artificial intelligence-based animation character drive method and related apparatus

This application discloses an artificial intelligence (AI) based animation character drive method. A first expression base of a first animation character corresponding to a speaker is determined by acquiring media data including a facial expression change when the speaker says a speech, and the first expression base may reflect different expressions of the first animation character. After target text information is obtained, an acoustic feature and a target expression parameter corresponding to the target text information are determined according to the target text information, the foregoing acquired media data, and the first expression base. A second animation character having a second expression base may be driven according to the acoustic feature and the target expression parameter, so that the second animation character may simulate the speaker's sound and facial expression when saying the target text information, thereby improving experience of interaction between the user and the animation character.

Method and system for parametric speech synthesis
11605371 · 2023-03-14

Embodiments of the present systems and methods may provide techniques for synthesizing speech in any voice in any language in any accent. For example, in an embodiment, a text-to-speech conversion system may comprise a text converter adapted to convert input text to at least one phoneme selected from a plurality of phonemes stored in memory, a machine-learning model storing voice patterns for a plurality of individuals and adapted to receive the at least one phoneme and an identity of a speaker and to generate acoustic features for each phoneme, and a decoder adapted to receive the generated acoustic features and to generate a speech signal simulating a voice of the identified speaker in a language.

Dialog device, dialog method, and dialog computer program

The dialog device according to the present invention includes a prediction unit 254 configured to predict an utterance length attribute of a user utterance in response to a machine utterance, a selection unit 256 configured to use the utterance length attribute to select, as a feature model for usage in an end determination of the user utterance, at least one of an acoustic feature model or a lexical feature model, and an estimation unit 258 configured to estimate an end point in the user utterance using the selected model. By using this dialog device, it is possible to shorten the waiting time until a response is output to a user utterance by a machine, and to realize a more natural conversation between a user and a machine.

SPEECH RECOGNITION MODEL STRUCTURE INCLUDING CONTEXT-DEPENDENT OPERATIONS INDEPENDENT OF FUTURE DATA

A speech recognition method includes obtaining a speech recognition model including a plurality of feature aggregation nodes connected via a first type operation element, where a context-dependent operation of the first type operation element is based on past speech data and is independent of future speech data. The method further includes receiving streaming speech data, the speech data comprising audio data including speech, and processing the streaming speech data via the speech recognition model to obtain a speech recognition text corresponding to the streaming speech data, and outputting the speech recognition text.
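
One common operation with the stated property is a causal one-dimensional convolution, sketched below. The abstract does not say which operation the first type operation element performs, so the choice of convolution, the function name, and the sample values are assumptions.

```python
from typing import List, Sequence


def causal_conv1d(frames: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """y[t] = sum over k of kernel[k] * x[t - k].
    Only the current and past frames are read; frames before t = 0 count as zero."""
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            if t - k >= 0:  # never index a frame later than t
                acc += weight * frames[t - k]
        out.append(acc)
    return out


kernel = [0.5, 0.5]
full = causal_conv1d([1.0, 2.0, 3.0, 4.0], kernel)
prefix = causal_conv1d([1.0, 2.0], kernel)
print(full)                 # [0.5, 1.5, 2.5, 3.5]
print(full[:2] == prefix)   # True
```

The second print shows why this matters for streaming: outputs already emitted for early frames do not change when later audio arrives, so text can be produced without waiting for the end of the utterance.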