
Robust audio identification with interference cancellation

Audio distortion compensation methods that improve the accuracy and efficiency of audio content identification are described; the methods are also applicable to speech recognition. Methods to detect interference from speakers and other sources, and distortion introduced by the environment and devices, are discussed. Additional methods to detect distortion to the content after performing search and correlation are illustrated. The causes of actual distortion at each client are measured, registered, and learnt to generate rules for determining likely distortion and interference sources. The learnt rules are applied at the client: likely distortions that are detected are compensated, or heavily distorted sections are ignored, at the audio level or at the signature and feature level, depending on the compute resources available. Further methods to subtract the likely distortions in the query, both at the audio level and after processing at the signature and feature level, are described.
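The rule-driven flow this abstract describes, measuring distortion per client, learning rules, then compensating mild distortion and ignoring heavily distorted sections, might be sketched roughly as follows. The distortion metric, thresholds, and rule format here are illustrative assumptions, not the patent's actual method.

```python
# Illustrative sketch: apply learnt distortion rules to audio segments.
# Segments whose estimated distortion exceeds a threshold are dropped;
# milder distortion is compensated with a learnt gain correction.
# All names and thresholds below are assumptions.

def apply_distortion_rules(segments, rules, drop_threshold=0.8):
    """segments: list of (signal_power, noise_power) tuples per segment.
    rules: dict mapping a learnt distortion level to a compensation gain."""
    kept = []
    for signal, noise in segments:
        distortion = noise / (signal + noise)  # fraction of power that is noise
        if distortion > drop_threshold:
            continue  # heavily distorted: ignore this section entirely
        # pick the compensation gain of the nearest learnt distortion level
        level = min(rules, key=lambda r: abs(r - distortion))
        kept.append(signal * rules[level])
    return kept

# Example: two usable segments, one hopelessly noisy one
rules = {0.1: 1.0, 0.5: 1.5}  # learnt: boost moderately noisy segments
segments = [(9.0, 1.0), (1.0, 1.0), (0.5, 9.5)]
print(apply_distortion_rules(segments, rules))
```

The same gate could equally run on signatures or features instead of raw audio, which is the resource trade-off the abstract mentions.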

COMPUTER SYSTEMS EXHIBITING IMPROVED COMPUTER SPEED AND TRANSCRIPTION ACCURACY OF AUTOMATIC SPEECH TRANSCRIPTION (AST) BASED ON MULTIPLE SPEECH-TO-TEXT ENGINES AND METHODS OF USE THEREOF

In some embodiments, an exemplary inventive system for improving computer speed and accuracy of automatic speech transcription includes at least a computer processor configured to perform: generating a recognition model specification for a plurality of distinct speech-to-text transcription engines, where each distinct engine corresponds to a respective distinct speech recognition model; receiving at least one audio recording representing a person's speech; segmenting the audio recording into a plurality of audio segments; determining a respective distinct speech-to-text transcription engine to transcribe a respective audio segment; receiving, from the respective transcription engine, a hypothesis for the respective audio segment; accepting the hypothesis, which removes the need to submit the respective audio segment to another distinct engine and results in the improved computer speed and accuracy of automatic speech transcription; and generating a transcript of the audio recording from the respective accepted hypotheses for the plurality of audio segments.
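The routing-and-acceptance loop described above might be sketched as below. The confidence threshold, the engine registry, and the fallback-to-other-engines step are assumptions added for illustration; the abstract itself only specifies accepting a hypothesis so that no second engine is needed.

```python
# Illustrative sketch: route each audio segment to one engine and accept
# its hypothesis when confidence clears a threshold, avoiding re-submission.
# Engine objects, confidence scores, and the routing rule are assumptions.

def transcribe(segments, engines, pick_engine, accept=0.9):
    """segments: list of opaque audio segments.
    engines: dict name -> callable returning (text, confidence).
    pick_engine: callable segment -> engine name (the 'model specification')."""
    transcript = []
    for seg in segments:
        name = pick_engine(seg)
        text, conf = engines[name](seg)
        if conf >= accept:
            transcript.append(text)  # accepted: no second engine needed
        else:
            # fall back to the other engines, keep the best hypothesis
            best = max((engines[n](seg) for n in engines if n != name),
                       key=lambda tc: tc[1], default=(text, conf))
            transcript.append(best[0] if best[1] > conf else text)
    return " ".join(transcript)

# Toy engines: the 'phone' model is confident on short segments only
engines = {
    "phone": lambda s: (s.upper(), 0.95 if len(s) < 6 else 0.5),
    "general": lambda s: (s.upper(), 0.8),
}
print(transcribe(["hello", "everybody"], engines,
                 lambda s: "phone" if len(s) < 6 else "general"))
```

The speed gain comes from the accept branch: a confident first hypothesis means the segment is decoded exactly once.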

ALTERNATE NATURAL LANGUAGE INPUT GENERATION
20230110205 · 2023-04-13

Techniques for handling errors during processing of natural language inputs are described. A system may process a natural language input to generate an ASR hypothesis or NLU hypothesis. The system may use more than one data searching technique (e.g., deep neural network searching, convolutional neural network searching, etc.) to generate an alternate ASR hypothesis or NLU hypothesis, depending on the type of hypothesis input for alternate hypothesis processing.
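The dispatch the abstract hints at, choosing an alternate-generation technique based on the type of the failed hypothesis, could be sketched as follows. The technique registry and the stub rescorer are assumptions; the real searching techniques (deep or convolutional neural networks) are stubbed out here.

```python
# Illustrative sketch: pick an alternate-hypothesis technique based on the
# type of hypothesis (ASR vs. NLU) submitted for alternate processing.
# The registry layout and the stub techniques are assumptions.

def alternate(hypothesis, hyp_type, techniques):
    """techniques: hyp_type -> list of callables producing alternates.
    Returns the first alternate that differs from the input, else None."""
    for technique in techniques.get(hyp_type, []):
        alt = technique(hypothesis)
        if alt and alt != hypothesis:
            return alt
    return None

techniques = {
    "ASR": [lambda h: h.replace("wether", "weather")],  # stub rescorer
    "NLU": [lambda h: None],                            # stub: no alternate
}
print(alternate("whats the wether", "ASR", techniques))
```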

Methods for natural language model training in natural language understanding (NLU) systems

Systems and methods for determining whether to perform an action of a query using a trained natural language model of a natural language understanding (NLU) system are disclosed herein. A text string that corresponds to a prescribed action and includes at least a content entity is received. A determination is made as to whether the text string corresponds to an audio input of a first group. In response to determining that the text string corresponds to an audio input of the first group, a determination is made as to whether the text string includes an obsequious expression. If the text string corresponds to an audio input of the first group and includes an obsequious expression, a determination is made to perform the prescribed action. If the text string corresponds to an audio input of the first group but does not include an obsequious expression, a determination is made not to perform the prescribed action.
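The two-condition decision above reduces to a small predicate. The list of obsequious expressions is an assumed example (the abstract does not enumerate them), and returning False for non-first-group inputs is an assumption, since the abstract only defines behaviour for the first group.

```python
# Illustrative sketch of the decision flow: perform the prescribed action
# only when the text string maps to a first-group audio input AND contains
# an obsequious expression. The expression list is an assumed example.

OBSEQUIOUS = ("please", "kindly", "would you mind")

def should_perform(text, is_first_group):
    """Return True only for first-group inputs containing an
    obsequious expression, per the decision logic in the abstract."""
    if not is_first_group:
        return False  # behaviour outside the first group is assumed here
    lowered = text.lower()
    return any(expr in lowered for expr in OBSEQUIOUS)

print(should_perform("Please play some jazz", True))  # both conditions met
print(should_perform("Play some jazz", True))         # no obsequious expression
```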

INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD, MOBILE OBJECT CONTROL DEVICE, AND MOBILE OBJECT CONTROL METHOD
20220319514 · 2022-10-06

An information processing apparatus capable of controlling a mobile object on the basis of an instruction uttered by a user identifies which of a plurality of use scenes applies to a target user when the mobile object is used, acquires utterance information of the target user, and selects a different machine learning model according to the identified use scene. The information processing apparatus estimates the intent of the target user's utterance by using the selected machine learning model.
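The scene-to-model selection might look roughly like the sketch below. The scene names, the model registry, and the default fallback are assumptions for illustration; the real models would be trained intent classifiers rather than lambdas.

```python
# Illustrative sketch: select a different intent-estimation model per
# identified use scene. Scene names and the registry are assumptions.

class IntentEstimator:
    def __init__(self, models):
        self.models = models  # scene name -> callable(utterance) -> intent

    def estimate(self, scene, utterance):
        # the identified use scene picks the machine learning model
        model = self.models.get(scene, self.models["default"])
        return model(utterance)

models = {
    "driving": lambda u: "navigate" if "go to" in u else "unknown",
    "parked":  lambda u: "media" if "play" in u else "unknown",
    "default": lambda u: "unknown",
}
est = IntentEstimator(models)
print(est.estimate("driving", "go to the station"))
```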

EFFICIENT EMPIRICAL DETERMINATION, COMPUTATION, AND USE OF ACOUSTIC CONFUSABILITY MEASURES
20230206914 · 2023-06-29

A computer-implemented method includes generating an empirically derived acoustic confusability measure by processing example utterances and iterating from an initial estimate of the acoustic confusability measure to improve it. The method can further include using the acoustic confusability measure to selectively limit the phrases to be made recognizable by a speech recognition application.
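One way to picture "initial estimate, then iterate" is the toy version below: a smoothed confusion-frequency estimate from example (reference, recognized) pairs, iteratively renormalised. The smoothing and the renormalisation step are assumptions standing in for the patent's actual estimation procedure.

```python
# Illustrative sketch: derive a symbol confusability measure from example
# (reference, recognized) pairs, starting from a smoothed frequency
# estimate and iterating to a per-row distribution. The smoothing and
# iteration rule are assumptions, not the patent's method.

from collections import Counter

def confusability(pairs, smoothing=1.0, iterations=3):
    """pairs: list of (reference_symbol, recognized_symbol)."""
    counts = Counter(pairs)
    symbols = {s for p in pairs for s in p}
    # initial estimate: smoothed relative confusion frequency
    est = {(a, b): (counts[(a, b)] + smoothing) / (len(pairs) + smoothing)
           for a in symbols for b in symbols}
    for _ in range(iterations):
        # iterate: renormalise each row so estimates form a distribution
        for a in symbols:
            row = sum(est[(a, b)] for b in symbols)
            for b in symbols:
                est[(a, b)] /= row
    return est

pairs = [("m", "n"), ("m", "n"), ("m", "m"), ("n", "n")]
est = confusability(pairs)
print(est[("m", "n")] > est[("n", "m")])  # 'm' misheard as 'n' more often
```

A recognizer could then refuse to add a new command phrase whose symbols are too confusable with an existing one, which is the "selectively limit phrases" use the abstract mentions.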

Mandarin and dialect mixed modeling and speech recognition

The present disclosure provides a modeling method for speech recognition and a device. The method includes: determining N types of tags; training a neural network according to speech data of Mandarin to generate a recognition model whose outputs are the N types of tags; inputting speech data of each dialect into the recognition model to obtain an output tag for each frame of the speech data of each dialect; determining, according to the output tags and the tagged true tags, error rates of the N types of tags for each dialect; generating M types of target tags according to the tags with error rates greater than a preset threshold; and training an acoustic model according to third speech data of Mandarin and third speech data of the P dialects, the outputs of the acoustic model being the N types of tags and the M types of target tags corresponding to each dialect.
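The tag-splitting step, computing per-tag error rates for a dialect against the Mandarin model's outputs and minting new target tags for the bad ones, could be sketched as follows. The `tag@dialect` naming and the threshold value are assumptions for illustration.

```python
# Illustrative sketch of the tag-splitting step: compute per-tag error
# rates for a dialect from the Mandarin model's frame outputs, and create
# new dialect-specific target tags for tags above the error threshold.

def split_tags(output_tags, true_tags, dialect, threshold=0.3):
    """output_tags/true_tags: per-frame tags from the recognition model
    and the ground truth. Returns the new dialect-specific target tags."""
    totals, errors = {}, {}
    for out, true in zip(output_tags, true_tags):
        totals[true] = totals.get(true, 0) + 1
        if out != true:
            errors[true] = errors.get(true, 0) + 1
    # a tag the Mandarin model gets badly wrong earns a dialect variant
    return {f"{tag}@{dialect}"
            for tag in totals
            if errors.get(tag, 0) / totals[tag] > threshold}

out  = ["a", "b", "b", "c", "c", "c"]
true = ["a", "b", "a", "c", "c", "c"]  # tag 'a' misrecognised half the time
print(split_tags(out, true, "cantonese"))
```

The joint acoustic model then learns both the shared N tags and these M dialect-specific targets.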

SPEECH PROCESSING SYSTEM AND SPEECH PROCESSING METHOD

A speech intelligibility enhancing system for enhancing speech, the system comprising: a speech input for receiving speech to be enhanced; an enhanced speech output to output the enhanced speech; and a processor configured to convert speech received by the speech input into enhanced speech to be output by the enhanced speech output, the processor being configured to: extract a portion of the speech received by the speech input; calculate the power of the portion; estimate the contribution due to late reverberation to the power of the portion when reverberated; calculate a target late reverberation power; determine a time t_i for the estimated late reverberation contribution to decay to the target late reverberation power; calculate a pause duration using the time t_i; and insert a pause having the calculated duration into the received speech at a first location, wherein the first location is followed by the portion.
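The pause computation can be made concrete under one common assumption: late reverberation power decays exponentially at the room's RT60 rate (60 dB of decay per RT60 seconds). That decay model and the numbers below are assumptions, not the patent's specification.

```python
# Illustrative sketch of the pause computation: assuming the late
# reverberation power decays exponentially at the room's RT60 rate,
# find the time t_i for it to fall to the target power, and use t_i
# as the pause duration inserted before the next portion of speech.

import math

def pause_duration(late_power, target_power, rt60):
    """Time (s) for late reverberation power to decay to target_power."""
    if late_power <= target_power:
        return 0.0  # already quiet enough: no pause needed
    drop_db = 10.0 * math.log10(late_power / target_power)
    return rt60 * drop_db / 60.0  # 60 dB of decay takes rt60 seconds

# A 20 dB drop in a room with RT60 = 0.6 s needs a 0.2 s pause
print(round(pause_duration(late_power=1.0, target_power=0.01, rt60=0.6), 3))
```

Inserting a pause of this length lets the tail of the previous portion die away before the next portion starts, which is what improves intelligibility.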

Dynamic speech recognition methods and systems with user-configurable performance

Methods and systems are provided for assisting operation of a vehicle using speech recognition. One method involves identifying a user-configured speech recognition performance setting value selected from among a plurality of speech recognition performance setting values, selecting a speech recognition model configuration corresponding to the user-configured speech recognition performance setting value from among a plurality of speech recognition model configurations, where each speech recognition model configuration of the plurality of speech recognition model configurations corresponds to a respective one of the plurality of speech recognition performance setting values, and recognizing an audio input as an input state using the speech recognition model configuration corresponding to the user-configured speech recognition performance setting value.
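The one-to-one mapping from setting value to model configuration might look like the sketch below. The setting names, beam widths, vocabulary choices, and the stand-in decoder are all assumptions for illustration.

```python
# Illustrative sketch: one speech recognition model configuration per
# user-configurable performance setting value. Names and values are
# assumptions; the decoder is a stand-in for a real recognizer.

CONFIGS = {
    "fast":     {"beam_width": 4,  "vocabulary": "commands"},
    "balanced": {"beam_width": 16, "vocabulary": "commands"},
    "accurate": {"beam_width": 64, "vocabulary": "full"},
}

def select_model_config(setting):
    """Each performance setting value maps to exactly one configuration."""
    if setting not in CONFIGS:
        raise ValueError(f"unknown performance setting: {setting}")
    return CONFIGS[setting]

def recognize(audio_features, setting, decoder):
    """Recognize an audio input as an input state with the selected config."""
    return decoder(audio_features, select_model_config(setting))

# Stand-in decoder: pretend wider beams recover more of the input state
state = recognize("HDG270", "accurate", lambda a, c: a[: c["beam_width"]])
print(state)
```

Letting the pilot or driver pick the setting trades decoding latency against recognition accuracy without swapping out the whole system.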
