Patent classifications
G10L2015/025
System and method for recommendation of terms, including recommendation of search terms in a search system
Embodiments of systems and methods for providing search term suggestions in a search system are disclosed. Embodiments as disclosed may utilize the sound of an original search term to locate candidate terms based on the sound of the candidate terms and the frequency of appearance of the candidate terms in the corpus of documents being searched. A set of search term suggestions can then be determined from the candidate terms and returned to the user as search term suggestions for the original search term.
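The idea in the abstract above, locating candidate terms by sound and then ranking them by how often they appear in the corpus, can be illustrated with a minimal sketch. This is not the patented method: it assumes American Soundex as the phonetic key (the abstract does not name an encoding), and the helper names `soundex` and `suggest_terms` are hypothetical.

```python
from collections import Counter

# Letter-to-digit table for American Soundex (an assumed phonetic encoding).
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex key of a word ('' if it has no letters)."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    out = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            out += code
        # 'h' and 'w' do not separate letters that share a code; vowels do.
        if ch not in "hw":
            prev = code
    return (out + "000")[:4]


def suggest_terms(query, corpus_terms, limit=3):
    """Suggest corpus terms that sound like `query`, most frequent first."""
    counts = Counter(term.lower() for term in corpus_terms)
    key = soundex(query)
    candidates = [(term, n) for term, n in counts.items()
                  if soundex(term) == key and term != query.lower()]
    # Rank by descending corpus frequency, then alphabetically for stability.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in candidates[:limit]]


corpus = ["robert", "robert", "rupert", "rabbit", "robert", "rupert", "rubin"]
print(suggest_terms("robbert", corpus))
```

Here the misspelled query shares a Soundex key with "robert" and "rupert" but not with "rabbit" or "rubin", so only the first two are suggested, with the more frequent term first.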
Unified speech representation learning
Systems and methods are provided for training a machine learning model to learn speech representations. Labeled speech data, or both labeled and unlabeled data sets, are applied to a feature extractor of a machine learning model to generate latent speech representations. The latent speech representations are applied to a quantizer to generate quantized latent speech representations and to a transformer context network to generate contextual representations. Each contextual representation included in the contextual representations is aligned with a phoneme label to generate phonetically aware contextual representations. Quantized latent representations are aligned with phoneme labels to generate phonetically aware latent speech representations. Systems and methods also include randomly replacing a subset of the contextual representations with quantized latent speech representations during their alignments to phoneme labels and aligning the phonetically aware latent speech representations to the contextual representations using supervised learning.
System and method for combining phonetic and automatic speech recognition search
A text search query including one or more words may be received. An ASR index created for an audio recording may be searched using the query to produce ASR search results including words, each word associated with a confidence score. For each word in the ASR search results associated with a confidence score below a threshold (and in some cases having one or more preceding words and one or more subsequent words in the ASR index), a phonetic representation of the audio recording may be searched for that word where it occurs in the audio recording, possibly after the one or more preceding words and before the one or more subsequent words, to produce phonetic search results. The search results returned may include both ASR and phonetic results.
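The fallback logic described above, where low-confidence ASR matches trigger a phonetic search for the same word, might look roughly like the following sketch. The `Hit` record, the `combined_search` function, the `phonetic_search` callback, and the default threshold of 0.6 are all assumptions made for illustration; the abstract does not specify data structures or a threshold value, and the preceding/subsequent-word constraint is omitted here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hit:
    word: str
    start_sec: float
    confidence: float
    source: str  # "asr" or "phonetic"


def combined_search(query: str,
                    asr_index: List[Hit],
                    phonetic_search: Callable[[str], List[Hit]],
                    threshold: float = 0.6) -> List[Hit]:
    """Return ASR hits for the query words plus phonetic hits for weak ones."""
    query_words = set(query.lower().split())
    results: List[Hit] = []
    for hit in asr_index:
        if hit.word.lower() not in query_words:
            continue
        results.append(hit)
        # Only words the recognizer was unsure about go to the phonetic search.
        if hit.confidence < threshold:
            results.extend(phonetic_search(hit.word))
    return results
```

In this sketch `phonetic_search` stands in for whatever engine searches the phonetic representation of the recording; a caller would supply it.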
Dynamic adjustment of story time special effects based on contextual data
The disclosure provides technology for enabling a computing device to provide context-sensitive special effects that supplement a text source as it is read aloud. An example method includes receiving, by a processing device, audio data comprising a spoken word of a user; analyzing contextual data associated with the user; determining a match between the audio data and data of a text source; and initiating a physical effect in response to determining the match, wherein the physical effect corresponds to the text source and is based on the contextual data.
Systems, methods, devices and apparatuses for detecting facial expression
A system, method and apparatus for detecting facial expressions according to EMG signals.
Urgency level estimation apparatus, urgency level estimation method, and program
An urgency level estimation technique of estimating an urgency level of a speaker for free uttered speech, which does not require a specific word, is provided. An urgency level estimation apparatus includes a feature amount extracting part configured to extract a feature amount of an utterance from uttered speech, and an urgency level estimating part configured to estimate an urgency level of a speaker of the uttered speech from the feature amount based on a relationship between a feature amount extracted from uttered speech and an urgency level of a speaker of the uttered speech, the relationship being determined in advance, and the feature amount includes at least one of a feature indicating speaking speed of the uttered speech, a feature indicating voice pitch of the uttered speech and a feature indicating a power level of the uttered speech.
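As a rough illustration of estimating urgency from speaking speed, voice pitch, and power level, the sketch below uses a logistic model as the "relationship determined in advance". That choice is an assumption, not something the abstract states, and the names `UtteranceFeatures` and `urgency_score` as well as the weights and bias are hypothetical placeholders rather than trained or published values.

```python
import math
from dataclasses import dataclass


@dataclass
class UtteranceFeatures:
    speaking_rate: float   # e.g. syllables per second
    mean_pitch_hz: float   # average fundamental frequency
    mean_power_db: float   # average signal power


def urgency_score(features: UtteranceFeatures,
                  weights=(0.8, 0.01, 0.05),  # placeholder values, not trained
                  bias: float = -8.0) -> float:
    """Map the three features to a score in (0, 1); higher means more urgent."""
    z = (bias
         + weights[0] * features.speaking_rate
         + weights[1] * features.mean_pitch_hz
         + weights[2] * features.mean_power_db)
    return 1.0 / (1.0 + math.exp(-z))


calm = UtteranceFeatures(speaking_rate=3.0, mean_pitch_hz=120.0, mean_power_db=55.0)
rushed = UtteranceFeatures(speaking_rate=6.5, mean_pitch_hz=260.0, mean_power_db=75.0)
print(urgency_score(calm) < urgency_score(rushed))
```

In practice the weights would be fitted to labeled utterances; no specific word needs to be recognized because the score depends only on acoustic features.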
Electronic device for executing application by using phoneme information included in audio data and operation method therefor
An electronic device according to various embodiments may comprise a memory in which one or more applications are installed, a communication circuit, and a processor, wherein the processor is configured to: acquire audio data during execution of a designated application among the one or more applications, wherein acquiring the audio data comprises storing, in the memory, at least a portion of the audio data that includes multiple pieces of phoneme information; when a designated condition is satisfied, transmit the at least a portion to an external electronic device so that the external electronic device generates designated information for execution of at least one application among the one or more applications by using at least a part of the multiple pieces of phoneme information stored before the designated condition is satisfied; and, on the basis of the designated information, execute the at least one application in relation to the designated application.
System to convert phonemes into phonetics-based words
A system to convert phonemes into phonetics-based words is disclosed. The system is implemented in one or more computing systems, in association with a system that provides required inputs. The system comprises a phoneme enhancer, a phoneme sequence buffer, a phoneme sequence to phonetics-based word converter that comprises a sliding window phoneme sequence matcher, a phoneme sequence to phonetics-based word custom data memory, a most frequent phonetics-based word predictive memory, a phoneme similarity matrix, and a phonetics-based word output unit.
Systems and methods for generating disambiguated terms in automatically generated transcriptions including instructions within a particular knowledge domain
A system and method for generating disambiguated terms in automatically generated transcriptions including instructions within a knowledge domain, and for employing the system, are disclosed. Exemplary implementations may: obtain a set of transcripts related to the knowledge domain representing various speech from users; obtain indications of correlated correct and incorrect transcripts of spoken terms within the knowledge domain; use a vector generation model to generate vectors for individual instances of the transcribed terms in the set of transcripts that are part of the lexicography of the knowledge domain such that a first set of vectors and a second set of vectors are generated that numerically represent the instances of the first correctly transcribed term and the first incorrectly transcribed term, respectively, and in different contexts; and train the vector generation model to reduce spatial separation of vectors generated for instances of correlated correct and incorrect transcripts of spoken terms within the knowledge domain.
Natural human-computer interaction for virtual personal assistant systems
Technologies for natural language interactions with virtual personal assistant systems include a computing device configured to capture audio input, distort the audio input to produce a number of distorted audio variations, and perform speech recognition on the audio input and the distorted audio variations. The computing device selects a result from a large number of potential speech recognition results based on contextual information. The computing device may measure a user's engagement level by using an eye tracking sensor to determine whether the user is visually focused on an avatar rendered by the virtual personal assistant. The avatar may be rendered in a disengaged state, a ready state, or an engaged state based on the user engagement level. The avatar may be rendered as semitransparent in the disengaged state, and the transparency may be reduced in the ready state or the engaged state. Other embodiments are described and claimed.