Patent classifications
G10L2015/025
SPEECH RECOGNITION APPARATUS, METHOD AND PROGRAM
A score integration unit 7 obtains a new score Score (l.sub.1:n.sup.b, c) that integrates a score Score (l.sub.1:n.sup.b, c) and a score Score (w.sub.1:o.sup.b, c). This new score Score (l.sub.1:n.sup.b, c) becomes a score Score (l.sub.1:n.sup.b) in a hypothesis selection unit 8. Thus, the score Score (l.sub.1:n.sup.b) can be said to take into account the score Score (w.sub.1:o.sup.b, c). In a speech recognition apparatus, first information is extracted on the basis of the score Score (l.sub.1:n.sup.b) taking into account the score Score (w.sub.1:o.sup.b, c). Thus, speech recognition with higher performance than that in the related art can be achieved.
System-on-a-chip incorporating artificial neural network and general-purpose processor circuitry
A circuit system and a method of analyzing audio or video input data that is capable of detecting, classifying, and post-processing patterns in an input data stream. The circuit system may consist of one or more digital processors, one or more configurable spiking neural network circuits, and digital logic for the selection of two-dimensional input data. The system may use the neural network circuits for detecting and classifying patterns and one or more the digital processors to perform further detailed analyses on the input data and for signaling the result of an analysis to outputs of the system.
Systems and methods for variably paced real-time translation between the written and spoken forms of a word
An enunciation system (ES) enables users to gain acquaintance, understanding, and mastery of the relationship between letters and sounds in the context of an alphabetic writing system. The ES enables the user to experience the action of sounding out a word, before their own phonics knowledge enables them to sound out the word independently; its continuous, unbroken speech output or input avoids the common confusions that ensue from analyzing words by breaking them up into discrete sounds; its user-controlled pacing allows the user to slow down enunciation at specific points of difficulty within the word; its real-time touch control allows the written word to be “played” like a musical instrument, with expressive and aesthetic possibilities; and its highlighting of the letter cluster that is responsible for the recognized phoneme enunciated by the user as it occurs allows the user to more easily associated the letters with the sounds.
Systems and Methods for Assisted Translation and Lip Matching for Voice Dubbing
Systems and methods for generating candidate translations for use in creating synthetic or human-acted voice dubbings, aiding human translators in generating translations that match the corresponding video, automatically grading how well a candidate translation matches the corresponding video, suggesting modifications to the speed and/or timing of the translated text to improve the grading of a candidate translation, and suggesting modifications to the voice dubbing and/or video to improve the grading of a candidate translation. In that regard, the present technology may be used to fully automate the process of generating lip-matched translations and associated voice dubbings, or as an aid for human-in-the-loop processes that may reduce or eliminate the time and effort required from translators, adapters, voice actors, and/or audio editors to generate voice dubbings.
Speech recognition method and apparatus
A speech recognition method includes receiving speech data, obtaining, from the received speech data, a candidate text including at least one word and a phonetic symbol sequence associated with a pronunciation of a target word included in the received speech data, using a speech recognition model, replacing the phonetic symbol sequence included in the candidate text with a replacement word corresponding to the phonetic symbol sequence, and determining a target text corresponding to the received speech data based on a result of the replacing.
End-to-end streaming keyword spotting
A method for detecting a hotword includes receiving a sequence of input frames that characterize streaming audio captured by a user device and generating a probability score indicating a presence of a hotword in the streaming audio using a memorized neural network. The network includes sequentially-stacked single value decomposition filter (SVDF) layers and each SVDF layer includes at least one neuron. Each neuron includes a respective memory component, a first stage configured to perform filtering on audio features of each input frame individually and output to the memory component, and a second stage configured to perform filtering on all the filtered audio features residing in the respective memory component. The method also includes determining whether the probability score satisfies a hotword detection threshold and initiating a wake-up process on the user device for processing additional terms.
Automatic synthesis of translated speech using speaker-specific phonemes
An embodiment includes converting an original audio signal to an original text string, the original audio signal being from a recording of the original text string spoken by a specific person in a source language. The embodiment generates a translated text string by translating the original text string from the source language to a target language, including translation of a word from the source language to a target language. The embodiment assembles a standard phoneme sequence from a set of standard phonemes, where the standard phoneme sequence includes a standard pronunciation of the translated word. The embodiment also associates a custom phoneme with a standard phoneme of the standard phoneme sequence, where the custom phoneme includes the specific person's pronunciation of a sound in the translated word. The embodiment synthesizes the translated text string to a translated audio signal including the translated word pronounced using the custom phoneme.
Systems, methods, devices and apparatuses for detecting facial expression
A system, method and apparatus for detecting facial expressions according to EMG signals.
Phonetic comparison for virtual assistants
In an approach for optimizing an intelligent virtual assistant by using phonetic comparison to find a response stored in a local database, a processor receives an audio input on a computing device. A processor transcribes the audio input to text. A processor compares the text to a set of user queries and commands in a local database of the computing device using a phonetic algorithm. A processor determines whether a user query or command of the set of user queries and commands meets a pre-defined threshold of similarity. Responsive to determining that the user query or command meets the pre-defined threshold of similarity, a processor identifies an intention of a set of intentions stored in the local database corresponding to the user query or command. A processor identifies a response of a set of responses in the local database corresponding to the intention. A processor outputs the response audibly.
Electronic device and method for providing conversational service
A method, performed by an electronic device, of providing a conversational service includes: receiving an utterance input; identifying a temporal expression representing a time in a text obtained from the utterance input; determining a time point related to the utterance input based on the temporal expression; selecting a database corresponding to the determined time point from among a plurality of databases storing information about a conversation history of a user using the conversational service; interpreting the text based on information about the conversation history of the user, the conversation history information being acquired from the selected database; generating a response message to the utterance input based on a result of the interpreting; and outputting the generated response message.