Patent classifications
G10L15/148
Systems and Methods for Generating Labeled Data to Facilitate Configuration of Network Microphone Devices
Systems and methods for generating training data are described herein. Pieces of metadata captured by a plurality of networked sensor systems can be captured, where each piece of metadata is associated with a specific set of sensor data captured by one of the plurality of networked sensor systems and includes a set of characteristics for the specific set of captured sensor data. A probabilistic model can be generated based on the received metadata and simulations can be performed based upon a training corpus by generating multiple scenarios, and, for each scenario, a scenario specific version of a particular annotated sample is generated by performing a simulation using the particular annotated sample. The scenario specific versions of annotated samples from the training corpus can be stored as a training data set on the at least one network device.
WORD HASH LANGUAGE MODEL
A language model may be used in a variety of natural language processing tasks, such as speech recognition, machine translation, sentence completion, part-of-speech tagging, parsing, handwriting recognition, or information retrieval. A natural language processing task may use a vocabulary of words, and a word hash vector may be created for each word in the vocabulary. A sequence of input words may be received, and a hash vector may be obtained for each word in the sequence. A language model may process the hash vectors for the sequence of input words to generate an output hash vector that describes words that are likely to follow the sequence of input words. One or words may then be selected using the output word hash vector and used for a natural language processing task.
PULSE-BASED AUTOMATIC SPEECH RECOGNITION
Various examples are provided related to speech recognition. In one example, a method includes converting an auditory signal into a pulse train, segmenting the pulse train into a series of frames having a predefined duration, and identifying a portion of the auditory signal by applying at least a portion of the series of frames segmented from the pulse train to a kernel adaptive autoregressive-moving-average (KAARMA) network. In another example, a speech recognition system includes processing circuitry configured to convert an auditory signal into a pulse train, segment the pulse train into a secured of frames, and identifying a portion of the auditory signal by applying at least a portion of the series of frames segmented from the pulse train to a KAARMA network. The series of frames segmented from the pulse train can be applied to a KAARMA chain including a plurality of KAARMA networks for identification.
Word hash language model
A language model may be used in a variety of natural language processing tasks, such as speech recognition, machine translation, sentence completion, part-of-speech tagging, parsing, handwriting recognition, or information retrieval. A natural language processing task may use a vocabulary of words, and a word hash vector may be created for each word in the vocabulary. A sequence of input words may be received, and a hash vector may be obtained for each word in the sequence. A language model may process the hash vectors for the sequence of input words to generate an output hash vector that describes words that are likely to follow the sequence of input words. One or words may then be selected using the output word hash vector and used for a natural language processing task.
SPEECH SYNTHESIS DEVICE, SPEECH SYNTHESIS METHOD, SPEECH SYNTHESIS MODEL TRAINING DEVICE, SPEECH SYNTHESIS MODEL TRAINING METHOD, AND COMPUTER PROGRAM PRODUCT
A speech synthesis device of an embodiment includes a memory unit, a creating unit, a deciding unit, a generating unit and a waveform generating unit. The memory unit stores, as statistical model information of a statistical model, an output distribution of acoustic feature parameters including pitch feature parameters and a duration distribution. The creating unit creates a statistical model sequence from context information and the statistical model information. The deciding unit decides a pitch-cycle waveform count of each state using a duration based on the duration distribution of each state of each statistical model in the statistical model sequence, and pitch information based on the output distribution of the pitch feature parameters. The generating unit generates an output distribution sequence based on the pitch-cycle waveform count, and acoustic feature parameters based on the output distribution sequence. The waveform generating unit generates a speech waveform from the generated acoustic feature parameters.
WORD HASH LANGUAGE MODEL
A language model may be used in a variety of natural language processing tasks, such as speech recognition, machine translation, sentence completion, part-of-speech tagging, parsing, handwriting recognition, or information retrieval. A natural language processing task may use a vocabulary of words, and a word hash vector may be created for each word in the vocabulary. A sequence of input words may be received, and a hash vector may be obtained for each word in the sequence. A language model may process the hash vectors for the sequence of input words to generate an output hash vector that describes words that are likely to follow the sequence of input words. One or words may then be selected using the output word hash vector and used for a natural language processing task.
GENERATION DEVICE, RECOGNITION SYSTEM, AND GENERATION METHOD FOR GENERATING FINITE STATE TRANSDUCER
A generation device includes a receiving unit and a generating unit. The receiving unit receives a model representing correspondence between one or more phonetic symbols and one or more words. The generating unit generates a first finite state transducer based on the model, the first finite state transducer at least including, as outgoing transitions from a first state representing transition destination of a first transition which has a first phonetic symbol of a predetermined type as input symbol, a second transition that has a second phonetic symbol, which is different than a particular symbol representing part or whole of input symbol of the first transition, as input symbol, and a third transition that has a third phonetic symbol, which represents the particular symbol or silence, as input symbol.
Duration ratio modeling for improved speech recognition
In speech recognition, the duration of a phoneme is taken into account when determining recognition scores. Specifically, the duration of a phoneme may be evaluated relative to the duration of neighboring phonemes. A phoneme that is interpreted to be significantly longer or shorter than its neighbors may be given a lower duration score. A duration score for a phoneme may be calculated and used to adjust a recognition score. In this manner a duration model may supplement an acoustic model and language model to improve speech recognition results.
Systems and methods for generating labeled data to facilitate configuration of network microphone devices
Systems and methods for generating training data are described herein. Pieces of metadata captured by a plurality of networked sensor systems can be captured, where each piece of metadata is associated with a specific set of sensor data captured by one of the plurality of networked sensor systems and includes a set of characteristics for the specific set of captured sensor data. A probabilistic model can be generated based on the received metadata and simulations can be performed based upon a training corpus by generating multiple scenarios, and, for each scenario, a scenario specific version of a particular annotated sample is generated by performing a simulation using the particular annotated sample. The scenario specific versions of annotated samples from the training corpus can be stored as a training data set on the at least one network device.
SYSTEMS AND METHODS FOR GENERATING LABELED DATA TO FACILITATE CONFIGURATION OF NETWORK MICROPHONE DEVICES
Systems and methods for generating training data are described herein. Pieces of metadata captured by a plurality of networked sensor systems can be captured, where each piece of metadata is associated with a specific set of sensor data captured by one of the plurality of networked sensor systems and includes a set of characteristics for the specific set of captured sensor data. A probabilistic model can be generated based on the received metadata and simulations can be performed based upon a training corpus by generating multiple scenarios, and, for each scenario, a scenario specific version of a particular annotated sample is generated by performing a simulation using the particular annotated sample. The scenario specific versions of annotated samples from the training corpus can be stored as a training data set on the at least one network device.