Patent classifications
G10L15/183
Electronic device and method for controlling the electronic device
Disclosed are an electronic device capable of efficiently performing speech recognition and natural language understanding and a method for controlling the same. The electronic device includes: a microphone; a non-volatile memory configured to store virtual assistant model data comprising data that is classified according to a plurality of domains and data that is commonly used for the plurality of domains; a volatile memory; and a processor configured to: based on receiving, through the microphone, a trigger input to perform speech recognition for a user speech, initiate loading the virtual assistant model data from the non-volatile memory into the volatile memory, load, into the volatile memory, first data from among the data classified according to the plurality of domains and, while loading the first data into the volatile memory, load at least a part of the data commonly used for the plurality of domains into the volatile memory.
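The loading behavior described in this abstract (domain-specific data and common data loaded at the same time once a trigger arrives) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the patent: the `NON_VOLATILE` dictionary stands in for non-volatile storage, a plain Python dict stands in for volatile memory, and threads are only one possible way to overlap the two loads.

```python
import threading
import time

# Hypothetical stand-in for non-volatile storage: per-domain data plus shared data.
NON_VOLATILE = {
    "domains": {"weather": b"weather-model", "music": b"music-model"},
    "common": b"shared-model",
}


def load_on_trigger(first_domain, storage=NON_VOLATILE):
    """Copy one domain's data and the common data into a dict, in parallel.

    The returned dict plays the role of volatile memory (an assumption).
    """
    volatile = {}
    lock = threading.Lock()

    def load(key, blob):
        time.sleep(0.01)  # simulate read latency from storage
        with lock:
            volatile[key] = bytes(blob)

    workers = [
        # First data: the domain selected for this trigger.
        threading.Thread(
            target=load, args=(first_domain, storage["domains"][first_domain])
        ),
        # Common data: loaded while the first data is still loading.
        threading.Thread(target=load, args=("common", storage["common"])),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return volatile
```

Calling `load_on_trigger("weather")` returns a dict holding both the `weather` entry and the `common` entry; the `music` entry stays in storage until it is needed.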
System and method using parameterized speech synthesis to train acoustic models
A method for removing private data from an acoustic model includes capturing speech from a large population of users, creating a text-to-speech voice from at least a portion of the large population of users, discarding speech data from a database of speech, creating text-to-speech waveforms from the text-to-speech voice and the new database of speech with the discarded speech data, and generating an automatic speech recognition model using the text-to-speech waveforms.
INTERACTIVE CONTENT OUTPUT
Techniques for outputting interactive content and processing interactions with respect to the interactive content are described. While outputting requested content, a system may determine that interactive content is to be outputted. The system may determine output data including a first portion indicating that interactive content is going to be output and a second portion representing content corresponding to an item. The system may send the output data to the device. A user may interact with the output data, for example, by requesting performance of an action with respect to the item.
NEURAL NETWORK MEMORY FOR AUDIO
Techniques for utilizing memory for a neural network are described. For example, some techniques utilize a plurality of memory types to respond to a query from a neural network including a short-term memory to store fine-grained information for recent text of a document and receiving a first value in response, an episodic long-term memory to store information discarded from the short-term memory in a compressed form and receiving a second value in response, and a semantic long-term memory to store relevant facts per entity in the document.
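The three memory types named in this abstract can be sketched as a small data structure. This is a toy illustration under stated assumptions, not the patented technique: the class name `TieredMemory`, the keyword-match lookup, and the "keep the first three words" compression are all hypothetical stand-ins for learned components.

```python
from collections import deque


class TieredMemory:
    """Toy three-tier memory: short-term, episodic long-term, semantic long-term."""

    def __init__(self, short_capacity=3):
        self.short_capacity = short_capacity
        self.short_term = deque()  # fine-grained recent text
        self.episodic = []         # compressed items evicted from short-term
        self.semantic = {}         # facts keyed by entity

    def write(self, sentence, entity_facts=None):
        self.short_term.append(sentence)
        # Anything pushed out of short-term memory is kept in compressed form.
        while len(self.short_term) > self.short_capacity:
            evicted = self.short_term.popleft()
            # Assumed compression: keep only the first three words.
            self.episodic.append(" ".join(evicted.split()[:3]))
        for entity, fact in (entity_facts or {}).items():
            self.semantic.setdefault(entity, []).append(fact)

    def query(self, keyword):
        """Return one value per memory type for a simple keyword query."""
        needle = keyword.lower()
        return {
            "short_term": [s for s in self.short_term if needle in s.lower()],
            "episodic": [s for s in self.episodic if needle in s.lower()],
            "semantic": self.semantic.get(keyword, []),
        }
```

A query returns one value from each tier, mirroring the first value, second value, and per-entity facts mentioned in the abstract.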
CROSS-LINGUAL KNOWLEDGE TRANSFER LEARNING
Methods and systems for training a neural network include training language-specific teacher models using different respective source language datasets. A student model is trained, using the different respective source language datasets and soft labels generated by the language-specific teacher models, including shuffling the source language datasets and shuffling weights of language-dependent layers in language-specific parts of the student model. Weights of language-independent layers of the student model are copied to language-independent layers of a target model to initialize the target model's language-independent layers. The target model is trained with a target language dataset.
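Two steps from this abstract lend themselves to a short sketch: scoring a student against a teacher's soft labels, and copying language-independent weights into a target model. The code below is a minimal illustration, not the claimed method. It assumes a temperature-scaled softmax with cross-entropy as the distillation loss, represents models as plain dicts of weight lists, and uses a hypothetical `shared.` name prefix to mark language-independent layers.

```python
import copy
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between teacher soft labels and student predictions."""
    soft_labels = softmax(teacher_logits, temperature)
    predictions = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(soft_labels, predictions))


def init_target_from_student(student, target, shared_prefix="shared."):
    """Copy language-independent layer weights from student to target.

    Layers are assumed to be language-independent when their name starts
    with `shared_prefix`; all other target layers are left untouched.
    """
    for name, weights in student.items():
        if name.startswith(shared_prefix):
            target[name] = copy.deepcopy(weights)
    return target
```

After initialization, the target model would be trained on the target language dataset; that training loop is outside the scope of this sketch.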
MULTI-TIER SPEECH PROCESSING AND CONTENT OPERATIONS
A multi-tier architecture is provided for processing user voice queries and making routing decisions for generating responses, including responses to book browsing requests and other content requests. When an utterance is associated with multiple applications in a given domain, the applications may be organized into a subdomain and a tier of routing decisions may be added to the inter-domain and intra-domain routing decision system. The system uses contextual signals to make subdomain routing decisions, including signals regarding content items that are already in a user's content catalog, consumption status of individual content items in the user's catalog, and the like.
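The subdomain routing tier described in this abstract can be pictured as a small decision function over contextual signals. The sketch below is purely illustrative: the application names, the catalog layout, and the specific rules (for example, sending finished titles back to the store) are assumptions made for the example and do not come from the patent.

```python
# Hypothetical grouping of applications into a subdomain of the "books" domain.
SUBDOMAIN_APPS = {
    "books": ["audiobook_player", "ebook_reader", "book_store"],
}


def route_within_subdomain(domain, title, catalog):
    """Pick an application using catalog membership and consumption status.

    `catalog` maps a title to {"format": "audio" | "text", "status": str}.
    Returns None when the domain has no subdomain tier.
    """
    if domain not in SUBDOMAIN_APPS:
        return None
    item = catalog.get(title)
    if item is None:
        # Signal: the title is not in the user's catalog, so browse the store.
        return "book_store"
    if item["status"] == "finished":
        # Assumed rule: a finished title suggests the user wants something new.
        return "book_store"
    # Signal: the title is owned and unfinished, so resume in the matching app.
    return "audiobook_player" if item["format"] == "audio" else "ebook_reader"
```

For example, a request for an owned, partly read audio title routes to `audiobook_player`, while a request for a title missing from the catalog routes to `book_store`.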