Patent classifications
G10L25/00
Method and apparatus for assigning keyword model to voice operated function
A method, performed in an electronic device, for assigning a target keyword to a function is disclosed. In this method, a list of target keywords is received at the electronic device via a communication network, and a particular target keyword is selected from the list. The method may further include receiving a keyword model for the particular target keyword via the communication network. The particular target keyword is then assigned to a function of the electronic device such that the function is performed in response to detecting, based on the keyword model, the particular target keyword in an input sound received at the electronic device.
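The assignment flow described in this abstract can be sketched as a keyword-to-function registry. All class, method, and keyword names below are illustrative assumptions, not the patent's actual implementation:

```python
# Hypothetical sketch: bind a selected target keyword (with its received
# keyword model) to a device function, and fire the function when the
# model detects the keyword in an input sound.

class KeywordFunctionBinder:
    def __init__(self):
        self._bindings = {}  # keyword -> (keyword_model, function)

    def assign(self, keyword, keyword_model, function):
        """Assign a target keyword and its model to a device function."""
        self._bindings[keyword] = (keyword_model, function)

    def on_input_sound(self, sound):
        """Run each keyword model on the input sound; invoke the bound
        function when a model reports a detection."""
        for keyword, (model, function) in self._bindings.items():
            if model.detect(sound):  # assumed model interface
                return function()
        return None

class ToyKeywordModel:
    """Stand-in model: 'detects' the keyword if it appears in the input
    text. A real keyword model would score acoustic features instead."""
    def __init__(self, keyword):
        self.keyword = keyword

    def detect(self, sound):
        return self.keyword in sound

binder = KeywordFunctionBinder()
binder.assign("hey camera", ToyKeywordModel("hey camera"),
              lambda: "camera launched")
print(binder.on_input_sound("user said hey camera now"))  # camera launched
```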
Acoustic sound signature detection based on sparse features
A low power sound recognition sensor is configured to receive an analog signal that may contain a signature sound. Sparse sound parameter information is extracted from the analog signal. The extracted sparse sound parameter information is processed using a sound signature database stored in the sound recognition sensor to identify sounds or speech contained in the analog signal, wherein the sound signature database comprises a plurality of sound signatures each representing an entire word or multiword phrase.
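The sparse-parameter idea can be illustrated with a toy extractor that keeps only the highest-energy frames of a signal and matches them against stored signatures. The frame size, sparsity level, and overlap score are assumptions, not the patented method:

```python
import numpy as np

def sparse_features(signal, frame=64, keep=4):
    """Extract sparse sound parameters: retain only the indices of the
    `keep` highest-energy frames (a stand-in for sparse coding)."""
    n = len(signal) // frame
    energies = np.array([np.sum(signal[i*frame:(i+1)*frame]**2)
                         for i in range(n)])
    return set(np.argsort(energies)[-keep:])

def match_signature(features, database):
    """Score each stored signature (a word/phrase label paired with its own
    sparse feature set) by overlap with the extracted features."""
    best, score = None, 0.0
    for label, sig in database.items():
        s = len(features & sig) / max(len(sig), 1)
        if s > score:
            best, score = label, s
    return best
```

A real sensor would hold the database in on-chip memory so the analog front end can run at low power.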
Voice pattern coding sequence and cataloging voice matching system
A method for voice pattern coding and catalog matching. The method includes identifying a set of vocal variables for a user, by a voice recognition system, based, at least in part, on a user interaction with the voice recognition system. The method further includes generating a voice model of speech patterns that represent the speaking of a particular language using the identified set of vocal variables, wherein the voice model is adapted to improve recognition of the user's voice by the voice recognition system. The method further includes matching the generated voice model to a catalog of speech patterns, and identifying a voice model code that represents speech patterns in the catalog that match the generated voice model. The method further includes providing the identified voice model code to the user.
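The catalog-matching step can be sketched as a nearest-neighbor lookup, assuming the voice model is represented as a vector of vocal variables. The vector representation and code labels are illustrative assumptions:

```python
import numpy as np

def match_voice_model(model_vec, catalog):
    """Match a generated voice model (here, a feature vector of vocal
    variables) against a catalog of speech patterns and return the voice
    model code of the nearest entry."""
    best_code, best_dist = None, float("inf")
    for code, pattern in catalog.items():
        d = float(np.linalg.norm(np.asarray(model_vec) - np.asarray(pattern)))
        if d < best_dist:
            best_code, best_dist = code, d
    return best_code
```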
Augmenting speech segmentation and recognition using head-mounted vibration and/or motion sensors
Example methods and systems use multiple sensors to determine whether a speaker is speaking. Audio data in an audio-channel speech band detected by a microphone can be received. Vibration data in a vibration-channel speech band representative of vibrations detected by a sensor other than the microphone can be received. The microphone and the sensor can be associated with a head-mountable device (HMD). It is determined whether the audio data is causally related to the vibration data. If the audio data and the vibration data are causally related, an indication can be generated that the audio data contains HMD-wearer speech. Causally related audio and vibration data can be used to increase accuracy of text transcription of the HMD-wearer speech. If the audio data and the vibration data are not causally related, an indication can be generated that the audio data does not contain HMD-wearer speech.
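One plausible test of whether the audio and vibration channels are causally related is the peak normalized cross-correlation of their energy envelopes; the abstract does not specify the method, so the envelope choice and threshold below are assumptions:

```python
import numpy as np

def causally_related(audio, vibration, threshold=0.7):
    """Decide whether audio and vibration data are causally related by
    checking the peak normalized cross-correlation of their mean-removed
    magnitude envelopes (an assumed test, not the patented one)."""
    a = np.abs(audio) - np.mean(np.abs(audio))
    v = np.abs(vibration) - np.mean(np.abs(vibration))
    denom = np.linalg.norm(a) * np.linalg.norm(v)
    if denom == 0:
        return False  # a flat channel carries no usable envelope
    corr = np.correlate(a, v, mode="full") / denom
    return float(np.max(corr)) >= threshold
```

When the check passes, downstream transcription can treat the audio as HMD-wearer speech; when it fails, the audio can be flagged as not wearer speech.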
Network computer system to generate voice response communications
A network computer system for managing a network service (e.g., a transport service) can include a voice-assistant subsystem for generating dialogues and performing actions for service providers of the network service. The network computer system can receive, from a user device, a request for the network service. In response, the network computer system can identify a service provider and transmit an invitation to the provider device of the service provider. In response to the identification of the service provider for the request, the voice-assistant subsystem can trigger an audio voice prompt to be presented on the provider device and a listening period during which the provider device monitors for an audio input from the service provider. Based on the audio input captured by the provider device, the network computer system can determine an intent corresponding to whether the service provider accepts or declines the invitation.
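The accept/decline intent determination can be sketched as a keyword match over the captured utterance; a production system would use a trained intent model, so the word lists here are purely illustrative:

```python
def classify_invite_intent(transcript):
    """Toy intent check for whether a provider's captured utterance
    accepts or declines a trip invitation (keyword matching as a
    stand-in for the system's actual intent model)."""
    text = transcript.lower()
    if any(w in text for w in ("yes", "accept", "okay", "sure")):
        return "accept"
    if any(w in text for w in ("no", "decline", "pass", "skip")):
        return "decline"
    return "unknown"
```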
Method and apparatus for recognizing speech by lip reading
A dictation device includes: an audio input device configured to receive a voice utterance including a plurality of words; a video input device configured to receive video of lip motion during the voice utterance; a memory portion; a controller configured according to instructions in the memory portion to generate first data packets including an audio stream representative of the voice utterance and a video stream representative of the lip motion; and a transceiver for sending the first data packets to a server end device and receiving second data packets including combined dictation based upon the audio stream and the video stream from the server end device. In the combined dictation, first dictation generated based upon the audio stream has been corrected by second dictation generated based upon the video stream.
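The correction step, in which video-derived dictation fixes audio-derived dictation, can be sketched as a confidence-gated merge. The word-for-word alignment and confidence threshold are assumptions for illustration:

```python
def combine_dictation(audio_words, video_words, confidences, threshold=0.5):
    """Merge two dictation streams: keep each audio word when its
    recognition confidence is high, otherwise substitute the lip-reading
    word (assumed aligned word-for-word with the audio stream)."""
    return [a if c >= threshold else v
            for a, v, c in zip(audio_words, video_words, confidences)]
```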
System and method for applying a convolutional neural network to speech recognition
A system and method for applying a convolutional neural network (CNN) to speech recognition. The CNN may provide input to a hidden Markov model and has at least one pair of a convolution layer and a pooling layer. The CNN operates along the frequency axis, with units that operate upon one or more local frequency bands of an acoustic signal, thereby mitigating acoustic variation.
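A single convolution-and-pooling pair along the frequency axis can be illustrated in plain NumPy. The filter, ReLU, and pool width are minimal-sketch assumptions, not the patented architecture:

```python
import numpy as np

def freq_conv_pool(spectrogram, kernel, pool=2):
    """One convolution+pooling pair applied along the frequency axis:
    each unit sees a local frequency band, and max pooling over adjacent
    frequency outputs gives tolerance to small spectral shifts.

    spectrogram: (frames, freq_bins); kernel: (k,) filter over frequency.
    """
    frames, bins_ = spectrogram.shape
    k = len(kernel)
    conv = np.empty((frames, bins_ - k + 1))
    for f in range(bins_ - k + 1):      # slide over local frequency bands
        conv[:, f] = spectrogram[:, f:f + k] @ kernel
    conv = np.maximum(conv, 0.0)        # ReLU nonlinearity
    usable = (conv.shape[1] // pool) * pool
    pooled = conv[:, :usable].reshape(frames, -1, pool).max(axis=2)
    return pooled  # features that could feed a hybrid CNN-HMM back end
```

In a hybrid system, the pooled features (possibly after further layers) would supply observation scores to the hidden Markov model.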
Generating audio fingerprints based on audio signal complexity
An audio identification system accounts for an audio signal's complexity when generating a test audio fingerprint for identification of the audio signal. In particular, the audio identification system determines a complexity of an audio signal to be fingerprinted. For example, the audio signal's complexity may be determined by performance of an autocorrelation on the audio signal. Based on the determined complexity, the audio identification system determines a length of a sample of the audio signal used to generate a test audio fingerprint. A sample having the length is then obtained and used to generate a test audio fingerprint for the audio signal. The test audio fingerprint may be compared to a set of reference audio fingerprints to identify the audio signal.
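The complexity-to-sample-length mapping can be sketched with an autocorrelation-based heuristic: a signal whose autocorrelation decays quickly (little self-similarity) is treated as more complex and needs a shorter sample. The specific measure and length bounds are assumptions:

```python
import numpy as np

def signal_complexity(signal):
    """Estimate complexity from the normalized autocorrelation: low
    off-peak autocorrelation means little self-similarity, i.e. a more
    complex signal (heuristic stand-in for the patent's measure)."""
    x = signal - np.mean(signal)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0..N-1
    if ac[0] == 0:
        return 0.0
    ac = ac / ac[0]
    return float(1.0 - np.mean(np.abs(ac[1:])))

def sample_length(signal, min_len=0.5, max_len=5.0):
    """Map complexity to fingerprint sample length in seconds: complex
    signals get shorter samples, repetitive ones longer samples."""
    c = signal_complexity(signal)
    return max_len - c * (max_len - min_len)
```

The sample of the chosen length would then be fingerprinted and compared against the reference fingerprints.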
Name recognition system
A speech recognition system uses, in one embodiment, an extended phonetic dictionary that is obtained by processing words in a user's set of databases, such as a user's contacts database, with a set of pronunciation guessers. The speech recognition system can use a conventional phonetic dictionary and the extended phonetic dictionary to recognize speech inputs that are user requests to use the contacts database, for example, to make a phone call, etc. The extended phonetic dictionary can be updated in response to changes in the contacts database, and the set of pronunciation guessers can include pronunciation guessers for a plurality of locales, each locale having its own pronunciation guesser.
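The extended-dictionary idea can be sketched by running contact names through per-locale pronunciation guessers and keeping the result in sync with the contacts database. The letter-to-sound rules below are toy assumptions, not real guessers:

```python
# Illustrative per-locale pronunciation guessers (crude stand-ins).
def guess_en_US(name):
    return name.lower().replace("ph", "f")

def guess_de_DE(name):
    return name.lower().replace("w", "v")

GUESSERS = {"en_US": guess_en_US, "de_DE": guess_de_DE}

def build_extended_dictionary(contacts, guessers=GUESSERS):
    """Map each contact name to its guessed pronunciation per locale."""
    return {name: {loc: g(name) for loc, g in guessers.items()}
            for name in contacts}

def update_on_contact_change(dictionary, added=(), removed=(),
                             guessers=GUESSERS):
    """Keep the extended dictionary in sync with the contacts database."""
    for name in removed:
        dictionary.pop(name, None)
    for name in added:
        dictionary[name] = {loc: g(name) for loc, g in guessers.items()}
    return dictionary
```

The recognizer would consult both this extended dictionary and its conventional phonetic dictionary when decoding a contacts-related request.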