Patent classifications
G10L2015/022
Speaker-adaptive speech recognition
A method for generating a test-speaker-specific adaptive system for recognising sounds in speech spoken by a test speaker; the method employing: (i) training data comprising speech items spoken by the test speaker; and (ii) an input network component and a speaker adaptive output network, the input network component and speaker adaptive output network having been trained using training data from training speakers;
the method comprising: (a) using the training data to train a test-speaker-specific adaptive model component of an adaptive model comprising the input network component, and the test-speaker-specific adaptive model component, and (b) providing the test-speaker-specific adaptive system comprising the input network component, the trained test-speaker-specific adaptive model component, and the speaker-adaptive output network.
Keyword detector and keyword detection method
A keyword detector includes a processor configured to: calculate a feature vector for each frame from a speech signal; input the feature vector for each frame to a DNN to calculate, for each of at least one state of an HMM, a first output probability for each triphone according to a sequence of phonemes contained in a predetermined keyword and a second output probability for each monophone; calculate a first likelihood representing the probability that the predetermined keyword is uttered in the speech signal by applying the first output probability to the HMM; calculate a second likelihood for the most probable phoneme string in the speech signal by applying the second output probability to the HMM; and determine whether the keyword is to be detected on the basis of the first likelihood and the second likelihood.
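The final decision step of the keyword detector can be sketched as a comparison of the two likelihoods. This is only a minimal illustration, assuming both likelihoods are already available as log values. The function name `detect_keyword`, the log-difference rule, and the threshold value are hypothetical; the abstract states only that the decision is based on both likelihoods.

```python
def detect_keyword(keyword_log_lik: float,
                   best_phoneme_log_lik: float,
                   threshold: float = -5.0) -> bool:
    """Decide whether the keyword was uttered.

    Assumption (not from the abstract): the two likelihoods are compared
    through their log difference against a fixed threshold.
    """
    # A small gap means the keyword path explains the audio almost as well
    # as the unconstrained most probable phoneme string.
    log_ratio = keyword_log_lik - best_phoneme_log_lik
    return log_ratio >= threshold
```

For example, `detect_keyword(-100.0, -98.0)` accepts the keyword under the assumed threshold, while `detect_keyword(-120.0, -98.0)` rejects it.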
Method and apparatus for exemplary morphing computer system background
Method and apparatus for reducing a size of databases required for recorded speech data.
SYSTEM AND METHOD FOR ASSESSING EXPRESSIVE LANGUAGE DEVELOPMENT OF A KEY CHILD
A method of assessing expressive language development of a key child. The method can include processing an audio recording taken in a language environment of the key child to identify segments of the audio recording that correspond to vocalizations of the key child. The method also can include applying an adult automatic speech recognition phone decoder to the segments of the audio recordings to identify each occurrence of a plurality of phone categories and to determine a duration for each of the plurality of phone categories. The method additionally can include determining a duration distribution for the plurality of phone categories based on the durations for the plurality of phone categories. The method further can include using the duration distribution for the plurality of phone categories in an age-based model to assess the expressive language development of the key child. The age-based model is selected based on a chronological age of the key child and the age-based model includes a plurality of different weights associated with the plurality of phone categories. Other embodiments are described.
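The duration-distribution and weighting steps described above might look roughly like the following sketch. It assumes the phone decoder output is already available as (category, duration) pairs. The function names, the nearest-age model lookup, and the linear weighted sum are illustrative assumptions, not details given in the abstract.

```python
from collections import defaultdict


def duration_distribution(phone_events):
    """Return each phone category's share of the total decoded duration.

    phone_events: iterable of (phone_category, duration_seconds) pairs,
    assumed to come from a phone decoder run on the child's segments.
    """
    totals = defaultdict(float)
    for category, duration in phone_events:
        totals[category] += duration
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {c: d / grand_total for c, d in totals.items()}


def expressive_language_score(distribution, age_months, models_by_age):
    """Score a distribution with the age-based model nearest to the child's age.

    models_by_age: dict mapping an age in months to a dict of per-category
    weights. Both the nearest-age lookup and the weighted sum are assumptions.
    """
    model_age = min(models_by_age, key=lambda a: abs(a - age_months))
    weights = models_by_age[model_age]
    return sum(weights.get(c, 0.0) * share for c, share in distribution.items())
```

With events `[("vowel", 0.3), ("stop", 0.1), ("vowel", 0.1)]`, the vowel category accounts for 80% of the total duration and the stop category for 20%.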
SPEECH RECOGNITION APPARATUS AND SPEECH RECOGNITION METHOD
A speech recognition apparatus according to an embodiment includes a microphone that acquires an audio stream in which speech vocalized by a person is recorded, a camera that acquires image data in which at least a mouth of the person is captured, and an operation element that recognizes speech including a consonant vocalized by the person, based on the audio stream, estimates the consonant vocalized by the person, based on the shape of the mouth of the person in the image data, and specifies the consonant based on the estimated consonant and the speech-recognized consonant.
Apparatus and method for recognizing speech based on a deep-neural-network (DNN) sound model
A speech recognition apparatus based on a deep-neural-network (DNN) sound model includes a memory and a processor. As the processor executes a program stored in the memory, the processor generates sound-model state sets corresponding to a plurality of pieces of set training speech data included in multi-set training speech data, generates a multi-set state cluster from the sound-model state sets, and sets the multi-set training speech data as an input node and the multi-set state cluster as output nodes so as to learn a DNN structured parameter.
METHODS FOR REAL-TIME ACCENT CONVERSION AND SYSTEMS THEREOF
Techniques for real-time accent conversion are described herein. An example computing device receives an indication of a first accent and a second accent. The computing device further receives, via at least one microphone, speech content having the first accent. The computing device is configured to derive, using a first machine-learning algorithm trained with audio data including the first accent, a linguistic representation of the received speech content having the first accent. The computing device is configured to, based on the derived linguistic representation of the received speech content having the first accent, synthesize, using a second machine-learning algorithm trained with (i) audio data comprising the first accent and (ii) audio data including the second accent, audio data representative of the received speech content having the second accent. The computing device is configured to convert the synthesized audio data into a synthesized version of the received speech content having the second accent.
Systems and methods for speech animation using visemes with phonetic boundary context
Speech animation may be performed using visemes with phonetic boundary context. A viseme unit may comprise an animation that simulates lip movement of an animated entity. Individual ones of the viseme units may correspond to one or more complete phonemes and phoneme context of the one or more complete phonemes. Phoneme context may include a phoneme that is adjacent to the one or more complete phonemes that correspond to a given viseme unit. Potential sets of viseme units that correspond with individual phoneme string portions may be determined. One of the potential sets of viseme units may be selected for individual ones of the phoneme string portions based on a fit metric that conveys a match between individual ones of the potential sets and the corresponding phoneme string portion.
System and method for emotion assessment
A method of determining an emotion of an utterance. The method can include receiving the utterance at a processor-based device comprising an audio engine. The method also can include extracting emotion-related acoustic features from the utterance. The method additionally can include comparing the emotion-related acoustic features to a plurality of emotion models that are representative of emotions. The method further can include selecting a model from the plurality of emotion models based on comparing the emotion-related acoustic features to the plurality of emotion models. The method additionally can include outputting the emotion of the utterance, wherein the emotion corresponds to the selected model. Other embodiments are provided.
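The compare-and-select steps can be illustrated with a nearest-model lookup. This is a simplified sketch: it assumes each emotion model is reduced to a single reference feature vector and that Euclidean distance is the comparison, neither of which is stated in the abstract. The name `select_emotion` is hypothetical.

```python
import math


def select_emotion(features, emotion_models):
    """Return the label of the emotion model closest to the features.

    features: list of emotion-related acoustic feature values.
    emotion_models: dict mapping an emotion label to a reference feature
    vector of the same length (an assumed, simplified model form).
    """
    def distance(a, b):
        # Euclidean distance between two equal-length vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    return min(emotion_models,
               key=lambda label: distance(features, emotion_models[label]))
```

A real system would more likely score the features against statistical models per emotion; the selection step, choosing the best-matching model and outputting its emotion, has the same shape.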
Accent correction in speech recognition systems
A method comprising receiving an audio input signal comprising speech, determining an accent class corresponding to the speech, identifying an accented phone pattern within the speech, replacing the accented phone pattern with an unaccented phone pattern, and generating an unaccented output signal from the unaccented phone pattern.
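The pattern replacement step can be sketched on a phone sequence as follows. This is an illustrative sketch only: it assumes the speech has already been decoded into a list of phone symbols and that the accent class has already selected a mapping from accented to unaccented patterns. The function name, the mapping format, and the longest-match-first rule are assumptions.

```python
def replace_phone_patterns(phones, pattern_map):
    """Replace accented phone patterns with their unaccented counterparts.

    phones: list of phone symbols decoded from the input speech.
    pattern_map: dict mapping a tuple of accented phones to a tuple of
    unaccented phones (hypothetical format for one accent class).
    """
    # Try longer patterns first so a short pattern cannot pre-empt a longer one.
    patterns = sorted(pattern_map, key=len, reverse=True)
    result = []
    i = 0
    while i < len(phones):
        for pattern in patterns:
            if pattern and tuple(phones[i:i + len(pattern)]) == pattern:
                result.extend(pattern_map[pattern])
                i += len(pattern)
                break
        else:
            # No pattern starts at this position; keep the phone unchanged.
            result.append(phones[i])
            i += 1
    return result
```

The resulting phone sequence would then be passed to a synthesizer or recognizer to generate the unaccented output signal, which is outside the scope of this sketch.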