G10L15/144

TWO STAGE USER CUSTOMIZABLE WAKE WORD DETECTION

Described herein are devices, methods, and systems for detecting a phrase from uttered speech. A processing device may determine a first model for phrase recognition based on a likelihood ratio using a set of training utterances. The set of training utterances may be analyzed by the first model to determine a second model, the second model comprising a training state sequence for each of the set of training utterances, wherein each training state sequence indicates a likely state for each time interval of the corresponding training utterance. A determination of whether a detected utterance corresponds to the phrase may be based on a concatenation of the first model and the second model.
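The two-stage decision described above can be sketched as follows. Everything here is an illustrative assumption rather than the patented implementation: the abstract does not specify the scoring functions, the state-sequence comparison, or the thresholds, so the names, the log-likelihood-ratio gate, and the 0.7 overlap criterion are all hypothetical.

```python
def stage_one_score(phrase_frame_ll, background_frame_ll):
    """Stage 1 (assumed form): log-likelihood ratio of a phrase model
    versus a background model, summed over frames."""
    return sum(phrase_frame_ll) - sum(background_frame_ll)

def stage_two_match(state_sequence, training_sequences, min_overlap=0.7):
    """Stage 2 (assumed form): compare the decoded state sequence against
    the stored training state sequences; accept if enough time intervals
    agree with any template."""
    for template in training_sequences:
        n = min(len(state_sequence), len(template))
        if n == 0:
            continue
        agree = sum(a == b for a, b in zip(state_sequence, template))
        if agree / n >= min_overlap:
            return True
    return False

def detect_phrase(phrase_frame_ll, background_frame_ll,
                  state_sequence, training_sequences, llr_threshold=0.0):
    """Concatenate the stages: stage 2 runs only when stage 1 fires."""
    if stage_one_score(phrase_frame_ll, background_frame_ll) < llr_threshold:
        return False
    return stage_two_match(state_sequence, training_sequences)
```

Cascading a cheap likelihood-ratio gate before the per-utterance state-sequence check is a common way to keep always-on detection inexpensive while letting user-specific templates reduce false accepts.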

Unsupervised learning system and method for performing weighting for improvement in speech recognition performance and recording medium for performing the method
11164565 · 2021-11-02

A learning system and method for updating recognition performance by assigning weights according to a confidence level of data are discussed. The unsupervised learning system includes a memory configured to store speech data received from a server that performs speech recognition; and a processor configured to measure confidence levels of pieces of learnable data stored in the memory and classify the pieces of learnable data into learning data and adaptation data, generate a learning model by performing unsupervised learning on the learning data, generate an adaptation model using the adaptation data, and evaluate speech recognition performance for the learning model and the adaptation model, wherein the processor is configured to assign weights by applying the measured confidence levels to the learning model and the adaptation model and update recognition performance with the learning model and the adaptation model to which the weights are applied.
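The split-then-weight scheme above can be sketched as a minimal example. The 0.8 confidence threshold, the linear blend of model scores, and all function names are assumptions for illustration; the abstract does not state how confidence is measured or how the weighted models are combined.

```python
def split_by_confidence(samples, threshold=0.8):
    """Classify learnable (sample, confidence) pairs into high-confidence
    learning data and lower-confidence adaptation data."""
    learning = [s for s, c in samples if c >= threshold]
    adaptation = [s for s, c in samples if c < threshold]
    return learning, adaptation

def combine_models(learning_score, adaptation_score,
                   learning_conf, adaptation_conf):
    """Weight each model's score by the measured confidence of the data it
    was built from (hypothetical linear blend), then combine."""
    total = learning_conf + adaptation_conf
    w_learn = learning_conf / total
    w_adapt = adaptation_conf / total
    return w_learn * learning_score + w_adapt * adaptation_score
```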

Machine learning used to detect alignment and misalignment in conversation
11817086 · 2023-11-14

Digitized media is received that records a conversation between individuals. Cues are extracted from the digitized media that indicate properties of the conversation. The cues are entered as training data into a machine learning module to create a trained machine learning model. The trained machine learning model is used in a processor to detect misalignments in subsequent digitized conversations.
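The cue-extraction step can be sketched as below. The particular cues chosen (speaker overlap and response latency between turns) and the turn representation are hypothetical examples; the abstract does not enumerate which conversation properties are used.

```python
def extract_cues(turns):
    """Extract simple conversational cues from an ordered list of turns,
    each a dict with "start" and "end" times in seconds (assumed format):
    how often a speaker starts before the prior turn ends, and the mean
    silence between consecutive turns."""
    overlaps = sum(1 for a, b in zip(turns, turns[1:]) if b["start"] < a["end"])
    latencies = [max(0.0, b["start"] - a["end"]) for a, b in zip(turns, turns[1:])]
    mean_latency = sum(latencies) / len(latencies) if latencies else 0.0
    return {"overlaps": overlaps, "mean_latency": mean_latency}
```

Feature vectors like this would then be labeled (aligned vs. misaligned) and fed to whatever classifier the machine learning module implements.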

DISPLAY-BASED CONTEXTUAL NATURAL LANGUAGE PROCESSING
20220246139 · 2022-08-04

Multi-modal natural language processing systems are provided. Some systems are context-aware systems that use multi-modal data to improve the accuracy of natural language understanding as it is applied to spoken language input. Machine learning architectures are provided that jointly model spoken language input (“utterances”) and information displayed on a visual display (“on-screen information”). Such machine learning architectures can improve upon, and solve problems inherent in, existing spoken language understanding systems that operate in multi-modal contexts.
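One concrete way such a joint model helps is resolving deictic references ("this", "that") against on-screen information. The sketch below is a deliberately naive, non-learned stand-in for the machine learning architectures the abstract describes; the item schema and matching rules are assumptions.

```python
def resolve_reference(utterance_tokens, on_screen_items):
    """Naive multi-modal fusion: prefer an on-screen item whose title shares
    a token with the utterance; for bare deictic words, fall back to the
    item currently in focus. Items are dicts with "id", "title", "focused"
    (a hypothetical schema)."""
    tokens = set(utterance_tokens)
    for item in on_screen_items:
        if set(item["title"].lower().split()) & tokens:
            return item["id"]
    if {"this", "that", "it"} & tokens:
        for item in on_screen_items:
            if item.get("focused"):
                return item["id"]
    return None
```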

Interaction data and processing natural language inputs

Techniques for determining and using interaction affinity data are described. Interaction affinity data may indicate a latent affinity between pieces of information corresponding to an interaction, such as intents, entities, the device type from which a user input is received, domain, etc. A system may use the interaction affinity data to determine an alternative input representation for a spoken input to cause output of a desired response to the spoken input. The system may also use the interaction affinity data to recommend an action to a user.
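A minimal sketch of tracking such affinities is shown below, using plain co-occurrence counts between interaction attributes and actions. This is an illustrative stand-in: the abstract speaks of latent affinities, which would typically be learned rather than counted, and the class and method names are invented.

```python
from collections import Counter

class AffinityStore:
    """Track affinities between interaction context (intent, entity) and
    the action that served the user, via co-occurrence counts."""

    def __init__(self):
        self.counts = Counter()

    def record(self, intent, entity, action):
        """Log one interaction outcome."""
        self.counts[(intent, entity, action)] += 1

    def recommend(self, intent, entity):
        """Recommend the action most often associated with this context,
        or None if the context has never been seen."""
        candidates = {a: c for (i, e, a), c in self.counts.items()
                      if i == intent and e == entity}
        return max(candidates, key=candidates.get) if candidates else None
```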

MASKING SYSTEMS AND METHODS

Term masking is performed by generating a time-alignment value for a plurality of identifiable units of sound in vocal audio content contained in a mixed audio track, force-aligning each of the plurality of identifiable units of sound to the vocal audio content based on the time-alignment value, thereby generating a plurality of force-aligned identifiable units of sound, identifying from the plurality of force-aligned identifiable units of sound a force-aligned identifiable unit of sound to be muddled, and audio muddling the force-aligned identifiable unit of sound to be muddled.
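Once forced alignment has produced per-word timings, masking a term reduces to editing the aligned sample range. The sketch below silences the region instead of performing the richer "audio muddling" the abstract describes, and the alignment format (word, (start_sec, end_sec)) is an assumption.

```python
def muddle_term(samples, sample_rate, alignments, term):
    """Mute the audio region force-aligned to the target term.
    `alignments` is a list of (word, (start_sec, end_sec)) pairs; a real
    system would muddle (scramble, bleep, or cross-fade) rather than
    simply zero the samples."""
    out = list(samples)
    for word, (start, end) in alignments:
        if word == term:
            i, j = int(start * sample_rate), int(end * sample_rate)
            for k in range(i, min(j, len(out))):
                out[k] = 0.0
    return out
```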

ENHANCING SIGNATURE WORD DETECTION IN VOICE ASSISTANTS
20210327420 · 2021-10-21

Systems and methods for detecting a spoken sentence in a speech recognition system are disclosed herein. Speech data is buffered based on an audio signal captured at a computing device operating in an active mode. The speech data is buffered irrespective of whether the speech data comprises a signature word. The buffered speech data is processed to detect the presence of a sentence comprising at least one command and a query for the computing device. Processing the buffered speech data includes detecting the signature word in the buffered speech data, and in response to detecting the signature word in the speech data, initiating detection of the sentence in the buffered speech data.
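The key idea, buffering speech regardless of whether the signature word has appeared yet, can be sketched with a ring buffer of recognized words. Real systems buffer raw audio frames, not words, and the class name, buffer size, and word-level matching here are simplifying assumptions.

```python
from collections import deque

class SignatureWordDetector:
    """Continuously buffer incoming words; once the signature word appears
    anywhere in the buffer, switch to sentence (command + query) detection
    over the same buffer, so speech preceding the wake word is not lost."""

    def __init__(self, signature, max_frames=100):
        self.signature = signature
        self.buffer = deque(maxlen=max_frames)  # ring buffer: oldest drops off
        self.sentence_mode = False

    def feed(self, word):
        """Buffer one word; return whether sentence detection is active."""
        self.buffer.append(word)
        if not self.sentence_mode and self.signature in self.buffer:
            self.sentence_mode = True
        return self.sentence_mode

    def sentence(self):
        """Return the buffered words once detection has been triggered."""
        return list(self.buffer) if self.sentence_mode else []
```

Because buffering starts before the wake word, a command spoken as "turn hey lights" style mid-sentence wake-ups still yields the words uttered before the signature word.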

Multi-language mixed speech recognition method

The invention discloses a multi-language mixed speech recognition method, which belongs to the technical field of speech recognition. The method comprises: step S1, configuring a multi-language mixed dictionary covering a plurality of different languages; step S2, training an acoustic recognition model on the multi-language mixed dictionary and multi-language speech data; step S3, training a language recognition model on multi-language text corpora; and step S4, forming the speech recognition system from the multi-language mixed dictionary, the acoustic recognition model, and the language recognition model. The system then recognizes mixed speech and outputs a corresponding recognition result. This solution supports recognition of speech that mixes multiple languages and improves recognition accuracy and efficiency, thereby improving the performance of the speech recognition system.
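Step S1, the multi-language mixed dictionary, can be sketched as merging per-language lexicons while tagging each pronunciation with its language so downstream models can tell entries apart. The lexicon format (word to phone list) and function name are assumptions; the patent does not specify the data structure.

```python
def build_mixed_dictionary(lexicons):
    """Merge per-language pronunciation lexicons into one mixed dictionary.
    `lexicons` maps a language code to {word: phone_list}; the result maps
    each word to a list of (language, phone_list) pronunciations, so the
    same surface form may carry entries from several languages."""
    mixed = {}
    for lang, lexicon in lexicons.items():
        for word, phones in lexicon.items():
            mixed.setdefault(word, []).append((lang, phones))
    return mixed
```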

METHOD AND SYSTEM FOR PROVIDING ADJUNCT SENSORY INFORMATION TO A USER

A method for providing information to a user, the method including: receiving an input signal from a sensing device associated with a sensory modality of the user; generating a preprocessed signal upon preprocessing the input signal with a set of preprocessing operations; extracting a set of features from the preprocessed signal; processing the set of features with a neural network system; mapping outputs of the neural network system to a device domain associated with a device including a distribution of haptic actuators in proximity to the user; and at the distribution of haptic actuators, cooperatively producing a haptic output representative of at least a portion of the input signal, thereby providing information to the user.
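The final mapping stage, from neural network outputs to the device domain of haptic actuators, can be sketched as binning a feature vector across the actuator array and clamping drive levels to a safe range. The binning rule and [0, 1] clamp are illustrative assumptions; the abstract leaves the mapping unspecified.

```python
def map_to_actuators(features, n_actuators):
    """Map a feature vector (e.g. neural network outputs) to per-actuator
    drive levels: features are binned across the actuator array, each bin
    keeps its strongest activation, and levels are clamped to [0, 1]."""
    levels = [0.0] * n_actuators
    for i, f in enumerate(features):
        idx = i * n_actuators // max(len(features), 1)
        levels[idx] = max(levels[idx], min(max(f, 0.0), 1.0))
    return levels
```

In a full pipeline, this would sit after preprocessing and feature extraction, with each output level driving one haptic actuator in the wearable distribution.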