Patent classifications
G10L2015/027
METHOD OF RECOGNIZING SPEECH OFFLINE, ELECTRONIC DEVICE, AND STORAGE MEDIUM
The present disclosure provides a method of recognizing speech offline, an electronic device, and a storage medium, relating to the field of artificial intelligence, including speech recognition, natural language processing, and deep learning. The method may include: decoding speech data to be recognized into a syllable recognition result; and transforming the syllable recognition result into corresponding text, which serves as the speech recognition result of the speech data.
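The second step above, transforming a syllable recognition result into text, can be sketched as a lexicon lookup. This is a minimal illustration only: the syllable-to-text lexicon, the Pinyin-style syllable names, and the greedy longest-match rule are all assumptions, not details from the patent.

```python
# Hypothetical syllable-to-text lexicon; entries are invented for illustration.
SYLLABLE_LEXICON = {
    ("ni3", "hao3"): "你好",
    ("xie4", "xie4"): "谢谢",
}

def syllables_to_text(syllables):
    """Greedy longest-match transformation of a syllable sequence into text."""
    text, i = [], 0
    while i < len(syllables):
        for span in range(len(syllables) - i, 0, -1):
            key = tuple(syllables[i:i + span])
            if key in SYLLABLE_LEXICON:
                text.append(SYLLABLE_LEXICON[key])
                i += span
                break
        else:
            text.append("<unk>")  # no lexicon entry covers this syllable
            i += 1
    return "".join(text)
```

A real system would resolve homophones with a language model rather than a fixed table.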
Acoustic signal processing with neural network using amplitude, phase, and frequency
According to one embodiment, a signal generation device includes one or more processors. The processors convert an acoustic signal and output an amplitude and a phase at each of a plurality of frequencies. For each of a plurality of nodes of a hidden layer in a neural network that takes the amplitude and the phase as input, the processors obtain a frequency based on a plurality of weights used in the arithmetic operation of the node. The processors then generate an acoustic signal based on the plurality of obtained frequencies and on the amplitude and phase corresponding to each of the plurality of nodes.
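The final generation step, producing a signal from a frequency, amplitude, and phase per node, can be sketched as a sum of sinusoids. The sinusoidal-sum model and all numeric values below are illustrative assumptions; the patent does not specify this synthesis formula.

```python
import numpy as np

def synthesize(freqs, amps, phases, sr=16000, duration=0.01):
    """Sum one sinusoid per node into a single waveform."""
    t = np.arange(int(sr * duration)) / sr
    return sum(a * np.sin(2 * np.pi * f * t + p)
               for f, a, p in zip(freqs, amps, phases))

# Two hypothetical nodes: 440 Hz at full amplitude, 880 Hz at half amplitude.
signal = synthesize([440.0, 880.0], [1.0, 0.5], [0.0, np.pi / 2])
```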
Personalizing a DNN-based text-to-speech system using small target speech corpus
A personalized text-to-speech (TTS) system configured to perform speaker adaptation is disclosed. The TTS system includes an acoustic model comprising a base neural network and a differential neural network. The base neural network is configured to generate acoustic parameters corresponding to a base speaker or voice actor, while the differential neural network is configured to generate acoustic parameters corresponding to the differences between the acoustic parameters of the base speaker and those of a particular target speaker. The output of the acoustic model is then a weighted linear combination of the outputs of the base neural network and the differential neural network. The two networks share a first input layer and a first plurality of hidden layers. Beyond these shared layers, the base neural network further comprises a second plurality of hidden layers and an output layer, while, in parallel, the differential neural network further comprises a third plurality of hidden layers and a separate output layer.
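The weighted linear combination at the output of the acoustic model can be sketched directly. The networks are stubbed with fixed arrays, and the blend weight `alpha` is an assumption for illustration; the patent leaves the weighting unspecified here.

```python
import numpy as np

def acoustic_parameters(base_out, diff_out, alpha=0.7):
    """Blend base-speaker parameters with target-speaker differences."""
    return alpha * base_out + (1.0 - alpha) * diff_out

base = np.array([1.0, 2.0])   # stand-in for the base network's output
diff = np.array([0.0, 1.0])   # stand-in for the differential network's output
blended = acoustic_parameters(base, diff)
```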
Digitally aware neural dictation interface
Systems and methods for populating the fields of an electronic form are disclosed. A method includes: receiving a speech input from a user corresponding to a first field of a plurality of fields of an electronic form provided on a display screen of the user device; converting the speech input from audio into text; displaying, on the display screen of the user device, the text in the first field of the electronic form to allow visual verification by the user; prompting, via a speaker of the user device, the user for information for a second field and each subsequent field in the plurality of fields as each preceding field is populated with text from converted speech inputs; determining that the form is complete based on the set of populated fields in the plurality of fields; and enabling submission of the completed form.
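The prompt-then-fill loop described above can be sketched as follows. The speech recognition and audible-prompt calls are stubbed out as callables, and the field names are invented for illustration.

```python
# Hypothetical form fields; a real form would define its own.
FIELDS = ["name", "date_of_birth", "address"]

def fill_form(recognize, prompt):
    """Prompt for each field in turn and populate it with recognized text.

    recognize(field) -> text converted from the user's speech (stubbed).
    prompt(field)    -> speaks an audible cue for the next field (stubbed).
    """
    form = {}
    for field in FIELDS:
        prompt(field)                   # audible prompt for this field
        form[field] = recognize(field)  # speech input converted to text
    complete = all(form.get(f) for f in FIELDS)
    return form, complete
```

In use, `recognize` would wrap a speech-to-text engine and `prompt` a TTS voice; here any callables suffice.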
Method and device for updating language model and performing speech recognition based on language model
A method of updating a grammar model used during speech recognition includes obtaining a corpus including at least one word, obtaining the at least one word from the corpus, splitting the at least one obtained word into at least one segment, generating a hint for recombining the at least one segment into the at least one word, and updating the grammar model by using the at least one segment together with the hint.
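The split-and-hint idea can be sketched with a simple marker scheme: non-final segments carry a continuation hint so a decoded segment sequence can be recombined into words. The "+" marker is an assumed convention, not the patent's actual hint format.

```python
def split_with_hints(word, segments):
    """Attach a continuation hint ("+") to every segment except the last."""
    assert "".join(segments) == word
    return [s + "+" for s in segments[:-1]] + [segments[-1]]

def recombine(hinted):
    """Rebuild words from hinted segments in a decoded sequence."""
    words, current = [], ""
    for seg in hinted:
        if seg.endswith("+"):
            current += seg[:-1]          # hint says the word continues
        else:
            words.append(current + seg)  # word boundary reached
            current = ""
    return words
```

Subword units with boundary markers like this keep the grammar model small while still recovering whole words at output time.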
METHOD, APPARATUS AND DEVICE FOR TRAINING NETWORK AND STORAGE MEDIUM
Embodiments of the present disclosure disclose a method, apparatus and device for training a network, and a storage medium, relate to the field of artificial intelligence technology such as deep learning and speech analysis. A semantic prediction network comprises: an encoder network and at least one decoder network; and a particular solution is: acquiring a first speech feature of a target speech sample; the target speech sample being a synthesized speech sample or a real speech sample, the synthesized speech sample being attached with a sample syllable label and a semantic label comprising a value of the domain, and the real speech sample being attached with a sample syllable label; and jointly training an initial semantic prediction network and a syllable classification network using the first speech feature of the target speech sample, to obtain a trained semantic prediction network.
ALGORITHMIC DETERMINATION OF A STORY READER'S DISCONTINUATION OF READING
The disclosure provides technology for enhancing the ability of a computing device to detect when a user has discontinued reading a text source. An example method includes receiving audio data comprising a spoken word associated with a text source, wherein the audio data comprises a first duration and a second duration; comparing the audio data with data of the text source, wherein the first duration of the audio data corresponds with the data of the text source; calculating, by a processing device, a correspondence measure between the second duration of the audio data and the data of the text source; and responsive to determining the correspondence measure satisfies a threshold, transmitting a signal to cease comparing audio data with the data of the text source.
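The correspondence check can be sketched with a simple word-overlap measure: if too few of the recognized words in the second duration match the expected text, the device stops comparing. The overlap ratio and the threshold value are assumed stand-ins for whatever measure the patent actually uses.

```python
def correspondence(recognized_words, expected_words):
    """Fraction of recognized words that appear in the expected text."""
    if not recognized_words:
        return 0.0
    expected = set(expected_words)
    hits = sum(1 for w in recognized_words if w in expected)
    return hits / len(recognized_words)

def should_stop(recognized_words, expected_words, threshold=0.5):
    """True when the reader appears to have stopped following the text."""
    return correspondence(recognized_words, expected_words) < threshold
```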
Device for recognizing speech input from user and operating method thereof
Provided are a device for recognizing a speech input including a named entity from a user and an operating method thereof. The device is configured to: generate a weighted finite state transducer model by using a vocabulary list including a plurality of named entities; obtain a first string from a speech input received from a user, by using a first decoding model; obtain a second string by using a second decoding model that uses the weighted finite state transducer model, the second string including a word sequence, which corresponds to at least one named entity, and an unrecognized word sequence not identified as a named entity; and output a text corresponding to the speech input by substituting the unrecognized word sequence of the second string with a word sequence included in the first string.
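The final substitution step can be sketched as a merge of the two decoders' outputs: spans of the second string not identified as named entities (marked `"<unk>"` here, an assumed placeholder) are replaced with the aligned words from the first string. Alignment is simplified to position-by-position, which is an illustration only; a real system would align spans of differing lengths.

```python
def merge_strings(first, second, unk="<unk>"):
    """Keep named-entity words from `second`; fall back to `first` elsewhere."""
    return [f if s == unk else s for f, s in zip(first, second)]

# Hypothetical decoder outputs of equal length.
first  = ["play", "miles", "tonight"]        # general first decoder
second = ["play", "Miles_Davis", "<unk>"]    # WFST-based second decoder
merged = merge_strings(first, second)
```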
Method and system for speech emotion recognition
A method of speech emotion recognition for enriching speech-to-text communications between users in speech chat sessions includes implementing a speech emotion recognition model that converts emotions observed in speech samples into visual emotion content in text, by: generating a data set of speech samples labeled with a plurality of emotion classes; extracting a set of acoustic features for each of the emotion classes; generating a machine learning (ML) model based on the acoustic features and the data set; training the ML model on acoustic features from speech samples during speech chat sessions; predicting emotion content in the observed speech based on the trained ML model; generating enriched text based on the emotion content predicted by the trained ML model; and presenting the enriched text in speech-to-text communications between users in the chat session to provide visual notice of an emotion observed in the speech sample.
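The pipeline above can be sketched end to end: labeled acoustic feature vectors train a classifier, and its prediction selects a visual marker appended to the transcribed text. The feature vectors, labels, and emoticon map are invented, and a nearest-centroid rule stands in for the patent's unspecified ML model.

```python
import numpy as np

EMOJI = {"happy": ":)", "sad": ":("}  # hypothetical emotion-to-marker map

def train_centroids(features, labels):
    """One mean feature vector per emotion class."""
    return {lab: np.mean([f for f, l in zip(features, labels) if l == lab],
                         axis=0)
            for lab in set(labels)}

def predict(centroids, feature):
    """Nearest-centroid classification of a single feature vector."""
    return min(centroids,
               key=lambda lab: np.linalg.norm(feature - centroids[lab]))

def enrich(text, emotion):
    """Append the visual emotion marker to the transcribed text."""
    return f"{text} {EMOJI.get(emotion, '')}".strip()

# Toy 2-D "acoustic features" with emotion labels.
feats = [np.array([1.0, 0.0]), np.array([0.9, 0.1]),
         np.array([0.0, 1.0]), np.array([0.1, 0.9])]
labs = ["happy", "happy", "sad", "sad"]
model = train_centroids(feats, labs)
```

Real systems would extract features such as pitch and energy statistics from audio and use a far stronger classifier; the structure of the pipeline is the same.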
SYLLABLE BASED AUTOMATIC SPEECH RECOGNITION
Systems, methods, and computer programs are described which utilize the structure of syllables as an organizing element of automated speech recognition processing to overcome variations in pronunciation, to efficiently resolve confusable aspects, to exploit context, and to map the speech to orthography.
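Using syllable structure as an organizing element can be illustrated with a toy onset-nucleus-coda split of an orthographic syllable. The vowel/consonant classes are simplified assumptions, not the patent's actual phone inventory.

```python
VOWELS = set("aeiou")  # simplified vowel class for illustration

def parse_syllable(syl):
    """Split one syllable string into (onset, nucleus, coda)."""
    i = 0
    while i < len(syl) and syl[i] not in VOWELS:
        i += 1                         # consonants before the vowel: onset
    j = i
    while j < len(syl) and syl[j] in VOWELS:
        j += 1                         # vowel run: nucleus
    return syl[:i], syl[i:j], syl[j:]  # remaining consonants: coda
```

Anchoring recognition on the nucleus and resolving onsets and codas separately is one way such structure helps absorb pronunciation variation.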