Patent classifications
G10L2015/0635
Token-wise training for attention based end-to-end speech recognition
A method of attention-based end-to-end (A-E2E) automatic speech recognition (ASR) training includes performing cross-entropy training of a model based on one or more input features of a speech signal; determining a posterior probability vector at the time of a first wrong token among one or more output tokens of the cross-entropy-trained model; and determining a loss of the first wrong token at that time, based on the determined posterior probability vector. The method further includes determining a total loss of a training set of the cross-entropy-trained model, based on the determined loss of the first wrong token, and updating the cross-entropy-trained model based on the determined total loss of the training set.
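The token-wise loss described above can be illustrated with a minimal sketch in plain Python. This is not the patented training procedure itself; it assumes posteriors are given as one token-to-probability mapping per output step, and all function names are illustrative.

```python
import math

def first_wrong_token_loss(posteriors, hypothesis, reference):
    """Locate the first hypothesis token that disagrees with the
    reference, then score the posterior vector at that time step.

    posteriors: list of dicts mapping token -> probability, one per step
    hypothesis: decoded tokens from the cross-entropy-trained model
    reference:  ground-truth tokens
    Returns (step_index, loss), or (None, 0.0) if no token is wrong.
    """
    for t, (hyp, ref) in enumerate(zip(hypothesis, reference)):
        if hyp != ref:
            # Negative log-probability of the correct token at the
            # time of the first error.
            return t, -math.log(posteriors[t][ref])
    return None, 0.0

def total_loss(batch):
    """Sum the first-wrong-token losses over a training set."""
    return sum(first_wrong_token_loss(p, h, r)[1] for p, h, r in batch)
```

In a real system the total loss would then drive a gradient update of the cross-entropy-trained model; that step is omitted here.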
VOICE ASSISTANCE SYSTEM
A voice assistance system is described, comprising a microphone and a processor with memory instructions configured to: receive an audio input of at least one user from the microphone, identify at least one object associated with a symbol of a database, determine the preferred language associated with the symbol, and transmit a wireless signal to at least one smart device. The smart device is able to interact with the object associated with the database symbol, either through a signal managed by an infrared-ray activation module, or by means of a signal managed by an activation module housed in electrical derivation boxes, with a power transistor driving a classic relay connected to the object associated with the database symbol. The system further comprises at least one further microphone, at least one further processor with memory instructions configured to receive an audio input, and an additional loudspeaker.
INTERFACING WITH APPLICATIONS VIA DYNAMICALLY UPDATING NATURAL LANGUAGE PROCESSING
Dynamic interfacing with applications is provided. For example, a system receives a first input audio signal. The system processes, via a natural language processing technique, the first input audio signal to identify an application. The system activates the application for execution on the client computing device. The application declares a function the application is configured to perform. The system modifies the natural language processing technique responsive to the function declared by the application. The system receives a second input audio signal. The system processes, via the modified natural language processing technique, the second input audio signal to detect one or more parameters. The system determines that the one or more parameters are compatible for input into an input field of the application. The system generates an action data structure for the application. The system inputs the action data structure into the application, which executes the action data structure.
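The flow above — an application declaring its functions and the assistant's parser updating itself in response — can be sketched as follows. This is an illustrative simplification, not the claimed implementation: text stands in for processed audio, and a regex per declared slot stands in for the modified natural language processing technique. All class and method names are assumptions.

```python
import re

class AssistantNLP:
    """Sketch of dynamically updating parsing after an application
    declares the functions it is configured to perform."""

    def __init__(self):
        self.slot_patterns = {}  # (function, slot) -> regex added at runtime

    def on_app_declares(self, function_name, slots):
        # Modify the NLP technique: register a pattern per declared slot.
        for slot, pattern in slots.items():
            self.slot_patterns[(function_name, slot)] = re.compile(pattern)

    def parse(self, text):
        # Detect parameters using only the patterns declared so far.
        params = {}
        for (fn, slot), pat in self.slot_patterns.items():
            m = pat.search(text)
            if m:
                params.setdefault(fn, {})[slot] = m.group(1)
        return params

    def action_data_structure(self, fn, input_fields, params):
        # Keep only parameters compatible with the app's input fields,
        # then package them for execution by the application.
        compat = {k: v for k, v in params.get(fn, {}).items()
                  if k in input_fields}
        return {"function": fn, "inputs": compat}
```

The design point is that `parse` knows nothing about any application until one declares itself, mirroring the patent's "modifies the natural language processing technique responsive to the function declared by the application."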
FINE-TUNING MULTI-HEAD NETWORK FROM A SINGLE TRANSFORMER LAYER OF PRE-TRAINED LANGUAGE MODEL
Techniques are provided for customizing or fine-tuning a pre-trained version of a machine-learning model that includes multiple layers and is configured to process audio or textual language input. Each of the multiple layers is configured with a plurality of layer-specific pre-trained parameter values corresponding to a plurality of parameters, and each of the multiple layers is configured to implement multi-head attention. An incomplete subset of the multiple layers is identified for which corresponding layer-specific pre-trained parameter values are to be fine-tuned using a client data set. The machine-learning model is fine-tuned using the client data set to generate an updated version of the machine-learning model, where the layer-specific pre-trained parameter values configured for each layer of one or more of the multiple layers not included in the incomplete subset are frozen during the fine-tuning. Use of the updated version of the machine-learning model is facilitated.
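The freeze-all-but-a-subset idea can be sketched framework-free in a few lines. This is an assumption-laden toy (layers as dicts, a hand-rolled gradient step); in practice one would flip the trainable flag on parameter tensors in the framework being used.

```python
def freeze_for_fine_tuning(layers, tune_indices):
    """Mark only an incomplete subset of layers as trainable.

    layers: list of dicts, each holding layer-specific parameter values
    tune_indices: indices of the layers to fine-tune on the client data
    """
    for i, layer in enumerate(layers):
        layer["trainable"] = i in tune_indices
    return layers

def fine_tune_step(layers, grads, lr=0.1):
    """Apply one gradient step, skipping frozen layers so their
    pre-trained parameter values are preserved."""
    for layer, g in zip(layers, grads):
        if layer["trainable"]:
            layer["params"] = [p - lr * gi
                               for p, gi in zip(layer["params"], g)]
    return layers
```

After a step, layers outside the chosen subset still hold their pre-trained values, which is the invariant the patent describes.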
Communication System And Related Methods
A communication system and related methods are disclosed, in particular a method of operating a communication system. The method comprises obtaining audio data representative of one or more voices, the audio data including first audio data of a first voice; obtaining first voice data based on the first audio data, wherein obtaining first voice data comprises applying a voice model to the first audio data, and wherein the first voice data includes first speaker metric data; outputting a first voice representation indicative of the first voice data; obtaining first voice validation data, based on the first voice representation, from a first validator; obtaining second voice validation data, based on the first voice representation, from a second validator; determining an agreement metric based on the first voice validation data and the second voice validation data; determining a first validation score based on the agreement metric; and outputting the first validation score.
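The agreement-metric step lends itself to a short sketch. The abstract does not define the metric, so the fraction-of-matching-judgements measure and the thresholding below are assumptions for illustration only.

```python
def agreement_metric(validation_a, validation_b):
    """Fraction of validation judgements on which two validators agree."""
    matches = sum(a == b for a, b in zip(validation_a, validation_b))
    return matches / len(validation_a)

def validation_score(validation_a, validation_b, threshold=0.5):
    """Illustrative scoring rule: pass the agreement through as the
    score when the validators mostly agree, otherwise score zero."""
    agreement = agreement_metric(validation_a, validation_b)
    return agreement if agreement >= threshold else 0.0
```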
METHOD AND SYSTEM FOR UNSUPERVISED DISCOVERY OF UNIGRAMS IN SPEECH RECOGNITION SYSTEMS
A system and method of automatically discovering unigrams in a speech data element may include receiving a language model that includes a plurality of n-grams, where each n-gram includes one or more unigrams; applying an acoustic machine-learning (ML) model to one or more speech data elements to obtain a character distribution function; applying a greedy decoder to the character distribution function to predict an initial corpus of unigrams; filtering out one or more unigrams of the initial corpus to obtain a corpus of candidate unigrams, where the candidate unigrams are not included in the language model; analyzing the one or more speech data elements to extract at least one n-gram that comprises a candidate unigram; and updating the language model to include the extracted at least one n-gram.
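The greedy-decode-then-filter pipeline can be sketched as below. The CTC-style collapse (argmax per frame, merge repeats, drop blanks) is a common reading of "greedy decoder" but is an assumption here, as is representing the character distribution as one dict per frame.

```python
def greedy_decode(char_distribution, blank="_"):
    """Greedy decoding sketch: take the argmax character per frame,
    collapse consecutive repeats, and drop the blank symbol."""
    best = [max(frame, key=frame.get) for frame in char_distribution]
    out, prev = [], None
    for ch in best:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

def candidate_unigrams(decoded_texts, language_model_vocab):
    """Keep only unigrams the existing language model does not contain."""
    seen = {w for text in decoded_texts for w in text.split()}
    return sorted(seen - set(language_model_vocab))
```

A later pass would extract n-grams containing each candidate and fold them into the language model; that step depends on the corpus format and is omitted.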
Recognizing accented speech
Techniques and apparatuses for recognizing accented speech are described. In some embodiments, an accent module recognizes accented speech using an accent library based on device data, uses different speech recognition correction levels based on an application field into which recognized words are set to be provided, or updates an accent library based on corrections made to incorrectly recognized speech.
METHOD AND DEVICE FOR PERFORMING VOICE RECOGNITION USING GRAMMAR MODEL
A method of updating speech recognition data including a language model used for speech recognition, the method including obtaining language data including at least one word; detecting a word that does not exist in the language model from among the at least one word; obtaining at least one phoneme sequence regarding the detected word; obtaining components constituting the at least one phoneme sequence by dividing the at least one phoneme sequence into predetermined unit components; determining information regarding probabilities that the respective components constituting each of the at least one phoneme sequence appear during speech recognition; and updating the language model based on the determined probability information.
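The division of a phoneme sequence into predetermined unit components, and the estimation of how likely each component is, can be sketched as follows. Fixed-size chunks stand in for the "predetermined unit components" (the patent leaves the unit unspecified), and relative frequency stands in for the probability information.

```python
from collections import Counter

def phoneme_components(phoneme_seq, unit_size=2):
    """Divide a phoneme sequence into fixed-size unit components.
    The unit size is an illustrative choice."""
    return [tuple(phoneme_seq[i:i + unit_size])
            for i in range(0, len(phoneme_seq), unit_size)]

def component_probabilities(sequences, unit_size=2):
    """Estimate the probability that each component appears, as its
    relative frequency over all observed phoneme sequences."""
    counts = Counter(c for seq in sequences
                     for c in phoneme_components(seq, unit_size))
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}
```

Updating the language model would then attach these probabilities to the detected out-of-vocabulary word's components so the recognizer can score it at decode time.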
Acoustic model training using corrected terms
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for speech recognition. One of the methods includes receiving first audio data corresponding to an utterance; obtaining a first transcription of the first audio data; receiving data indicating (i) a selection of one or more terms of the first transcription and (ii) one or more replacement terms; determining that one or more of the replacement terms are classified as a correction of one or more of the selected terms; in response to determining that the one or more of the replacement terms are classified as a correction of the one or more of the selected terms, obtaining a first portion of the first audio data that corresponds to one or more terms of the first transcription; and using the first portion of the first audio data that is associated with the one or more terms of the first transcription to train an acoustic model for recognizing the one or more of the replacement terms.
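Two pieces of the method above are easy to sketch: deciding whether a user's replacement is a correction (rather than an unrelated rewrite), and pairing the corrected term's audio portion with the replacement as training data. The edit-distance heuristic is an assumption; the patent does not specify how the classification is made.

```python
def is_correction(selected, replacement, max_distance=2):
    """Heuristic sketch: treat the replacement as a correction when it
    is close in edit distance to the selected term."""
    a, b = selected.lower(), replacement.lower()
    # Simple Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1] <= max_distance

def training_pairs(audio_segments, transcript, corrections):
    """Pair each corrected term's audio portion with its replacement,
    yielding (audio, label) examples for acoustic-model training.

    audio_segments: per-term audio portions, aligned with transcript
    corrections:    mapping of selected term -> replacement term
    """
    return [(audio_segments[i], corrections[term])
            for i, term in enumerate(transcript) if term in corrections]
```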
Systems and methods for crowdsourced actions and commands
Embodiments facilitate the intuitive creation, maintenance, and distribution of action datasets that include computing events or tasks that can be reproduced when a command is received by a digital assistant. The digital assistant can generate new action datasets, on-board new action datasets, and receive new action datasets or updates to existing action datasets locally or via a digital assistant server, among other things. The digital assistant server can also receive action datasets, maintain action datasets, and distribute action datasets to one or more digital assistants, among other things.