Patent classifications
G10L2015/025
METHOD AND SYSTEM FOR ENHANCING THE INTELLIGIBILITY OF INFORMATION FOR A USER
A system for providing information to a user includes and/or interfaces with a set of models and/or algorithms. Additionally or alternatively, the system can include and/or interface with any or all of: a processing subsystem; a sensory output device; a user device; an audio input device; and/or any other components. A method for providing information to a user includes and/or interfaces with: receiving a set of inputs; processing the set of inputs to determine a set of sensory outputs; and providing the set of sensory outputs.
Clockwork hierarchical variational encoder
A method of providing a frame-based mel spectral representation of speech includes receiving a text utterance having at least one word and selecting a mel spectral embedding for the text utterance. Each word has at least one syllable and each syllable has at least one phoneme. For each phoneme, the method further includes using the selected mel spectral embedding to: (i) predict a duration of the corresponding phoneme based on corresponding linguistic features associated with the word that includes the corresponding phoneme and corresponding linguistic features associated with the syllable that includes the corresponding phoneme; and (ii) generate a plurality of fixed-length predicted mel-frequency spectrogram frames based on the predicted duration for the corresponding phoneme. Each fixed-length predicted mel-frequency spectrogram frame represents mel-spectral information of the corresponding phoneme.
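The relationship between a predicted phoneme duration and the number of fixed-length frames generated for it can be sketched as follows. This is only an illustration: the 5 ms frame length, the function names, and the round-up rule are assumptions, not details from the abstract, and the duration prediction itself is not modeled.

```python
import math

FRAME_MS = 5.0  # assumed fixed frame length; the abstract does not give a value


def frames_for_phonemes(durations_ms, frame_ms=FRAME_MS):
    """Return how many fixed-length frames cover each predicted phoneme duration."""
    # Round up so every phoneme gets at least one frame (an assumption).
    return [max(1, math.ceil(d / frame_ms)) for d in durations_ms]


def frame_to_phoneme_alignment(durations_ms, frame_ms=FRAME_MS):
    """Return, for every output frame, the index of the phoneme it belongs to."""
    alignment = []
    for index, count in enumerate(frames_for_phonemes(durations_ms, frame_ms)):
        alignment.extend([index] * count)
    return alignment
```

With predicted durations of 50 ms, 12 ms and 5 ms, the three phonemes would receive 10, 3 and 1 frames respectively under these assumptions.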
End-to-End Streaming Keyword Spotting
A method for detecting a hotword includes receiving a sequence of input frames that characterize streaming audio captured by a user device and generating a probability score indicating a presence of a hotword in the streaming audio using a memorized neural network. The network includes sequentially-stacked singular value decomposition filter (SVDF) layers and each SVDF layer includes at least one neuron. Each neuron includes a respective memory component, a first stage configured to perform filtering on audio features of each input frame individually and output to the memory component, and a second stage configured to perform filtering on all the filtered audio features residing in the respective memory component. The method also includes determining whether the probability score satisfies a hotword detection threshold and initiating a wake-up process on the user device for processing additional terms.
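The two-stage neuron described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the class name, the weight layout, and the use of a fixed-size FIFO as the memory component are hypothetical, and no training, layer stacking, or threshold logic is shown.

```python
from collections import deque


class SVDFNeuron:
    """Toy single neuron: per-frame feature filter, then a filter over memory."""

    def __init__(self, feature_weights, time_weights):
        self.feature_weights = feature_weights  # stage 1: one weight per audio feature
        self.time_weights = time_weights        # stage 2: one weight per memory slot
        # Memory component: fixed-size FIFO of stage-1 outputs (assumed layout).
        self.memory = deque([0.0] * len(time_weights), maxlen=len(time_weights))

    def step(self, frame):
        # Stage 1: filter the features of the current input frame on its own.
        filtered = sum(w * x for w, x in zip(self.feature_weights, frame))
        self.memory.append(filtered)  # the oldest entry is dropped automatically
        # Stage 2: filter across everything currently held in memory.
        return sum(w * m for w, m in zip(self.time_weights, self.memory))
```

Because each call to `step` consumes one frame and reuses the stored stage-1 outputs, the sketch processes streaming audio frame by frame without recomputing earlier frames.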
VOICE-TO-TEXT DATA PROCESSING
A computing system includes a processor configured to convert a word spoken by a user into a pattern of symbols in response to an unsuccessful attempt to retrieve the word in a list. The pattern of symbols provides a visual representation of speech sounds identifying the word. The pattern of symbols of the converted word is compared to a database of patterns, with the patterns in the database being in a format of symbols corresponding to the words in the list. Each pattern used in the comparison has a match value assigned thereto based on being compared to the pattern of symbols of the converted word. The processor provides the word in the list corresponding to the pattern having the match value that is indicative of a match to the converted word.
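A rough sketch of the pattern comparison step follows. The `to_pattern` conversion is a deliberately crude, hypothetical stand-in for a real phonetic transcription, and the 0.6 cutoff and the use of `difflib.SequenceMatcher` as the match value are assumptions made only for illustration.

```python
from difflib import SequenceMatcher


def to_pattern(word):
    """Hypothetical symbol pattern: drop non-initial vowels and repeated symbols."""
    letters = [c for c in word.lower() if c.isalpha()]
    pattern = []
    for i, c in enumerate(letters):
        if i > 0 and c in "aeiou":
            continue  # ignore vowels after the first letter
        if pattern and pattern[-1] == c:
            continue  # collapse immediate repeats
        pattern.append(c)
    return "".join(pattern)


def best_match(spoken, entries, min_score=0.6):
    """Return the list entry whose pattern best matches the spoken word, or None."""
    if not entries:
        return None
    target = to_pattern(spoken)
    # Assign each candidate a match value in [0, 1] against the spoken pattern.
    scored = [(SequenceMatcher(None, target, to_pattern(e)).ratio(), e) for e in entries]
    score, entry = max(scored)
    return entry if score >= min_score else None
```

For example, a misrecognized "Jon Smyth" would still resolve to a stored "John Smith" entry, because the two reduced patterns differ by only one symbol.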
INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD, AND NON-TRANSITORY COMPUTER-READABLE RECORDING MEDIUM STORING INFORMATION PROCESSING PROGRAM FOR SELECTING SET VALUE USED TO EXECUTE FUNCTION
The processor of an information processing apparatus serves, by executing an information processing program, as: a function determiner; a morpheme analyzer configured to analyze a message input by a user in morphemes; a word detector configured to detect a predetermined time-representing word indicating temporal nearness or farness and a predetermined keyword which is modified by the time-representing word and which indicates settings associated with the function from the message analyzed in morphemes by the morpheme analyzer; a setting selector configured to select a newest set value when the word detector has detected the time-representing word indicating temporal nearness and to select a set value used when the user used the function in the past when the word detector has detected the time-representing word indicating temporal farness; and a function executor configured to execute the function determined by the function determiner using the set value selected by the setting selector.
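The word detector and setting selector can be sketched as below. Everything concrete here is an assumption for illustration: the word lists, the keyword "settings", whitespace splitting in place of real morpheme analysis, and the choice of the earliest recorded value as the "past" set value.

```python
NEAR_WORDS = {"latest", "newest", "recent"}  # hypothetical "temporal nearness" words
FAR_WORDS = {"old", "earlier", "previous"}   # hypothetical "temporal farness" words
KEYWORD = "settings"                         # hypothetical settings keyword


def select_setting(message, history):
    """Pick a set value from history (ordered oldest to newest), or None."""
    if not history:
        return None
    # Whitespace splitting stands in for morpheme analysis (an assumption).
    tokens = message.lower().split()
    for modifier, word in zip(tokens, tokens[1:]):
        if word != KEYWORD:
            continue  # only consider words that directly modify the keyword
        if modifier in NEAR_WORDS:
            return history[-1]  # newest set value
        if modifier in FAR_WORDS:
            return history[0]   # a set value used in the past (earliest, assumed)
    return None
```

A message such as "copy with the latest settings" would select the most recent value, while "copy with the old settings" would fall back to an earlier one.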
SPEECH RECOGNITION DEVICE AND OPERATING METHOD THEREOF
Provided are a method and device for speech recognition. The speech recognition method includes: receiving a speech signal generated by an utterance of a user; identifying a named entity from the received speech signal; determining a speech signal portion, which corresponds to the identified named entity, from the received speech signal; generating a first acoustic embedding vector corresponding to the speech signal portion, based on an acoustic embedding model; determining a second acoustic embedding vector that is one of a plurality of acoustic embedding vectors corresponding to a plurality of named entities included in an acoustic embedding database (DB), based on distances between the plurality of acoustic embedding vectors and the first acoustic embedding vector; determining a corrected named entity corresponding to the second acoustic embedding vector; and providing a result of speech recognition with respect to the speech signal, based on the corrected named entity.
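The selection of the second acoustic embedding vector amounts to a nearest-neighbor lookup, which can be sketched as follows. The abstract only says "distances"; Euclidean distance, the dictionary layout of the database, and the function name are assumptions.

```python
import math


def nearest_entity(query_vector, embedding_db):
    """Return the named entity whose stored embedding is closest to the query.

    embedding_db maps entity name -> embedding vector (hypothetical layout).
    """
    # Euclidean distance is assumed; the abstract does not name a metric.
    return min(embedding_db, key=lambda name: math.dist(query_vector, embedding_db[name]))
```

The returned name plays the role of the corrected named entity that replaces the originally recognized one in the recognition result.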
Remote distress monitor
A remote distress monitor includes a steerable microphone array, a memory, and a control system. The steerable microphone array is configured to detect audio data and generate sound data. The memory stores machine-readable instructions. The control system includes one or more processors configured to execute the machine-readable instructions. The generated sound data from the steerable microphone array is analyzed. Based at least in part on the analysis, a token associated with the audio data detected by the steerable microphone array is generated. The audio data is representative of one or more sounds associated with a distress event. The generated token is transmitted, via a communications network, to an electronic device to cause a distress response action to occur. The distress response action is associated with the distress event.
System and method for combining phonetic and automatic speech recognition search
A text search query including one or more words may be received. An ASR index created for an audio recording may be searched using the query to produce ASR search results including words, each word associated with a confidence score. For each word in the ASR search results whose confidence score is below a threshold (and which, in some cases, has one or more preceding words and one or more subsequent words in the ASR index), a phonetic representation of the audio recording may be searched for that word where it occurs in the audio recording, possibly after the one or more preceding words and before the one or more subsequent words, to produce phonetic search results. The search results returned may include both ASR and phonetic results.
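The fallback from ASR results to phonetic search can be sketched as follows. The hit dictionaries, the 0.5 threshold, and the `phonetic_search` callable are hypothetical; in particular, the sketch assumes that low-confidence ASR hits are replaced by whatever the phonetic search confirms, which the abstract does not state.

```python
def combined_search(asr_hits, phonetic_search, threshold=0.5):
    """Merge confident ASR hits with phonetic re-checks of low-confidence hits.

    asr_hits: list of {"word", "time", "confidence"} dicts (assumed shape).
    phonetic_search: callable(word, around_time) -> list of confirmed times.
    """
    results = []
    for hit in asr_hits:
        if hit["confidence"] >= threshold:
            # Confident ASR hit: keep it as an ASR result.
            results.append({"word": hit["word"], "time": hit["time"], "source": "asr"})
            continue
        # Low confidence: search the phonetic representation near the ASR position.
        for time in phonetic_search(hit["word"], hit["time"]):
            results.append({"word": hit["word"], "time": time, "source": "phonetic"})
    return results
```

Restricting the phonetic search to the neighborhood of the low-confidence hit mirrors the abstract's use of preceding and subsequent words to locate where the word occurs in the recording.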
SPEECH RECOGNITION METHOD AND APPARATUS, DEVICE, STORAGE MEDIUM, AND PROGRAM PRODUCT
A computer device acquires speech content. The device performs feature extraction on the speech content to obtain an intermediate feature. The intermediate feature is used for indicating an audio expression characteristic of the speech content. The device decodes the intermediate feature based on an attention mechanism to obtain a first word graph network. The device performs feature mapping on the intermediate feature based on pronunciation of the speech content to obtain a second word graph network. The device determines a recognition result of the speech content according to candidate word connection relationships indicated by the first word graph network and the second word graph network.
VOICE INTERACTIVE WAKEUP ELECTRONIC DEVICE AND METHOD BASED ON MICROPHONE SIGNAL, AND MEDIUM
Disclosed are an electronic device configured with a microphone, a voice interaction wake-up method executed by an electronic device equipped with a microphone, and a computer-readable medium. The electronic device comprises a memory and a central processing unit. The memory stores computer-executable instructions that, when executed by the central processing unit, perform the following operations: analyzing a sound signal collected by the microphone; identifying whether the sound signal contains speech spoken by a person and whether it contains wind noise generated by airflow hitting the microphone as a result of that speech; and, in response to determining that the sound signal contains both the speech and the resulting wind noise, processing the sound signal as speech input by the user. The solution disclosed in the present application is applicable to voice input when a user carries an intelligent electronic device. It simplifies the steps of voice input and reduces the burden and difficulty of interaction, making the interaction more natural.