Patent classifications
G10L2015/022
System and method for assessing expressive language development of a key child
A method of assessing expressive language development of a key child. The method can include processing an audio recording taken in a language environment of the key child to identify segments of the audio recording that correspond to vocalizations of the key child. The method also can include applying an adult automatic speech recognition phone decoder to the segments of the audio recordings to identify each occurrence of a plurality of phone categories and to determine a duration for each of the plurality of phone categories. The method additionally can include determining a duration distribution for the plurality of phone categories based on the durations for the plurality of phone categories. The method further can include using the duration distribution for the plurality of phone categories in an age-based model to assess the expressive language development of the key child. The age-based model is selected based on a chronological age of the key child and the age-based model includes a plurality of different weights associated with the plurality of phone categories. Other embodiments are described.
Neural network training method and apparatus using experience replay sets for recognition
A training method and apparatus for speech recognition is disclosed, where an example of the training method includes determining whether a current iteration for training a neural network is performed by an experience replay iteration using an experience replay set, selecting a sample from at least one of the experience replay set and a training set based on a result of the determining, and training the neural network based on the selected sample.
Real-time accent conversion model
Techniques for real-time accent conversion are described herein. An example computing device receives an indication of a first accent and a second accent. The computing device further receives, via at least one microphone, speech content having the first accent. The computing device is configured to derive, using a first machine-learning algorithm trained with audio data including the first accent, a linguistic representation of the received speech content having the first accent. The computing device is configured to, based on the derived linguistic representation of the received speech content having the first accent, synthesize, using a second machine learning-algorithm trained with (i) audio data comprising the first accent and (ii) audio data including the second accent, audio data representative of the received speech content having the second accent. The computing device is configured to convert the synthesized audio data into a synthesized version of the received speech content having the second accent.
Speech recognition apparatus and speech recognition method
A speech recognition apparatus according to an embodiment includes a microphone that acquires an audio stream in which speech vocalized by a person is recorded, a camera that acquires an image data in which at least a mouth of the person is captured, and an operation element that recognizes speech including a consonant vocalized by the person, based on the audio stream, estimates the consonant vocalized by the person, based on the shape of the mouth of the person in the image data, and specifies the consonant based on the estimated consonant and the speech-recognized consonant.
DOMAIN ADAPTIVE SPEECH RECOGNITION USING ARTIFICIAL INTELLIGENCE
Methods, systems, and computer program products for domain adaptive speech recognition using artificial intelligence are provided herein. A computer-implemented method includes generating a set of language data candidates, each language data candidate comprising one or more graphemes, by processing a sequence of phonemes related to input speech data using an artificial intelligence-based data conversion model; determining, for a target pair of phonemes and graphemes, a subset of graphemes from the set of language data candidates; generating a first speech recognition output by processing the subset of graphemes using at least one biasing language model and an artificial intelligence-based speech recognition model; generating a second speech recognition output by replacing at least a portion of the subset of graphemes in the first speech recognition output with at least one of the graphemes from the target pair; and performing automated actions based on the second speech recognition output.
METHOD AND SYSTEM OF HIGH ACCURACY KEYPHRASE DETECTION FOR LOW RESOURCE DEVICES
Techniques related to keyphrase detection for applications such as wake on voice are disclosed herein. Such techniques may have high accuracy by using scores of phone positions in triphones to select which triphones to use with a rejection model, using context-related phones for the rejection model, adding silence before keyphrase sounds for a keyphrase model, or any combination of these.
Training acoustic models using connectionist temporal classification
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training acoustic models and using the trained acoustic models. A connectionist temporal classification (CTC) acoustic model is accessed, the CTC acoustic model having been trained using a context-dependent state inventory generated from approximate phonetic alignments determined by another CTC acoustic model trained without fixed alignment targets. Audio data for a portion of an utterance is received. Input data corresponding to the received audio data is provided to the accessed CTC acoustic model. Data indicating a transcription for the utterance is generated based on output that the accessed CTC acoustic model produced in response to the input data. The data indicating the transcription is provided as output of an automated speech recognition service.
METHODS FOR REAL-TIME ACCENT CONVERSION AND SYSTEMS THEREOF
Techniques for real-time accent conversion are described herein. An example computing device receives an indication of a first accent and a second accent. The computing device further receives, via at least one microphone, speech content having the first accent. The computing device is configured to derive, using a first machine-learning algorithm trained with audio data including the first accent, a linguistic representation of the received speech content having the first accent. The computing device is configured to, based on the derived linguistic representation of the received speech content having the first accent, synthesize, using a second machine learning-algorithm trained with (i) audio data comprising the first accent and (ii) audio data including the second accent, audio data representative of the received speech content having the second accent. The computing device is configured to convert the synthesized audio data into a synthesized version of the received speech content having the second accent.
TRAINING METHOD AND APPARATUS FOR SPEECH RECOGNITION
A training method and apparatus for speech recognition is disclosed, where an example of the training method includes determining whether a current iteration for training a neural network is performed by an experience replay iteration using an experience replay set, selecting a sample from at least one of the experience replay set and a training set based on a result of the determining, and training the neural network based on the selected sample.
SPEECH RECOGNITION METHOD AND APPARATUS
A speech recognition method comprises: generating, based on a preset speech knowledge source, a search space comprising preset client information and for decoding a speech signal; extracting a characteristic vector sequence of a to-be-recognized speech signal; calculating a probability at which the characteristic vector corresponds to each basic unit of the search space; and executing a decoding operation in the search space by using the probability as an input to obtain a word sequence corresponding to the characteristic vector sequence.