Patent classification: G10L2015/081
Method and apparatus for generating speech
A speech generation method and apparatus are disclosed. The speech generation method includes obtaining, by a processor, a linguistic feature and a prosodic feature from an input text; determining, by the processor, a first candidate speech element through a cost calculation and a Viterbi search based on the linguistic feature and the prosodic feature; generating, by a speech element generator implemented by the processor, a second candidate speech element based on the linguistic feature or the prosodic feature and the first candidate speech element; and outputting, by the processor, an output speech by concatenating the second candidate speech element and a speech sequence determined through the Viterbi search.
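A minimal sketch of the cost-plus-Viterbi unit selection the abstract describes, assuming a target cost (mismatch against the linguistic/prosodic target) and a concatenation cost (join quality between adjacent units); the function names and candidate lists are illustrative, not the patent's implementation:

```python
import numpy as np

def viterbi_select(candidates, target_cost, concat_cost):
    """Pick one speech element per position minimizing total cost.

    candidates: list of lists; candidates[t] holds the units for position t.
    target_cost(unit, t): mismatch between a unit and the target at position t.
    concat_cost(prev, unit): cost of joining two adjacent units.
    """
    T = len(candidates)
    # best[t][j] = lowest cumulative cost of a path ending at candidates[t][j]
    best = [np.array([target_cost(u, 0) for u in candidates[0]])]
    back = []
    for t in range(1, T):
        prev = best[-1]
        cur, ptr = [], []
        for u in candidates[t]:
            trans = prev + np.array([concat_cost(p, u) for p in candidates[t - 1]])
            k = int(np.argmin(trans))
            cur.append(trans[k] + target_cost(u, t))
            ptr.append(k)
        best.append(np.array(cur))
        back.append(ptr)
    # Trace back the lowest-cost path.
    j = int(np.argmin(best[-1]))
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [candidates[t][j] for t, j in enumerate(path)]
```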
SYSTEM FOR PROVIDING CUSTOMIZED VIDEO PRODUCING SERVICE USING CLOUD-BASED VOICE COMBINING
A system for providing a customized video production service using cloud-based voice combination comprises: a user terminal that receives and uploads a user's utterance as voice data, selects one category among at least one type of category to select content including an image or a video, selects a subtitle or background music, and plays a customized video including the content, the uploaded voice data, and the subtitle or background music; and a customized video production service providing server including: a database unit that classifies and stores text, images, videos, and background music by the at least one type of category; an upload unit that receives the voice data corresponding to the user's utterance uploaded from the user terminal; a conversion unit that converts the uploaded voice data into text data using speech-to-text (STT) and stores the converted text data; a provision unit that, when a category is selected at the user terminal, provides the image or video previously mapped to and stored under the selected category; and a creation unit that, on receiving subtitle data or a background-music selection from the user terminal, creates the customized video including the content, the uploaded voice, and the subtitle or background music.
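A rough sketch of the server-side flow under the structure above; the STT engine is left unspecified in the abstract, so `speech_to_text` is a stand-in, and all names here are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Database:
    # Database unit: content classified and stored by category.
    content: dict = field(default_factory=dict)  # category -> image/video ids
    music: dict = field(default_factory=dict)    # category -> music track ids

def speech_to_text(voice_data: bytes) -> str:
    # Stand-in for the STT engine, which the abstract leaves unspecified.
    return "<transcript of the user's utterance>"

def produce_video(db, voice_data, category, subtitle=None, track=None):
    transcript = speech_to_text(voice_data)   # conversion unit
    clips = db.content.get(category, [])      # provision unit
    music = track or (db.music.get(category) or [None])[0]
    # Creation unit: assemble the customized video description.
    return {"clips": clips, "voice": voice_data, "transcript": transcript,
            "subtitle": subtitle, "music": music}
```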
Method of recognising a sound event
A method for recognising at least one of a non-verbal sound event and a scene in an audio signal comprising a sequence of frames of audio data, the method comprising: for each frame of the sequence, processing the frame of audio data to extract multiple acoustic features for the frame, and classifying the acoustic features by determining, for each of a set of sound classes, a score indicating that the frame represents the sound class; processing the sound class scores for multiple frames of the sequence to generate a sound class decision for each frame; and processing the sound class decisions for the sequence of frames to recognise the at least one of a non-verbal sound event and a scene.
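A minimal sketch of the two post-classification stages: turning per-frame class scores into per-frame decisions (here with a short median filter, one simple choice of smoothing the patent does not prescribe), then merging runs of decisions into recognised events:

```python
import numpy as np

def recognise_events(frame_scores, threshold=0.5, min_frames=5):
    """frame_scores: (T, C) array of per-frame sound class scores."""
    T, C = frame_scores.shape
    k = 3  # median-filter half-width, an illustrative value
    smoothed = np.stack([np.median(frame_scores[max(0, t - k):t + k + 1], axis=0)
                         for t in range(T)])
    # Per-frame decision: best class if its score clears the threshold.
    decisions = np.where(smoothed.max(axis=1) >= threshold,
                         smoothed.argmax(axis=1), -1)  # -1 = no class
    # Merge runs of identical decisions into (class, start, end) events.
    events, start = [], 0
    for t in range(1, T + 1):
        if t == T or decisions[t] != decisions[start]:
            if decisions[start] >= 0 and t - start >= min_frames:
                events.append((int(decisions[start]), start, t))
            start = t
    return events
```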
SCALABLE ENTITIES AND PATTERNS MINING PIPELINE TO IMPROVE AUTOMATIC SPEECH RECOGNITION
A computing system obtains features that have been extracted from an acoustic signal, where the acoustic signal comprises spoken words uttered by a user. The computing system performs automatic speech recognition (ASR) based upon the features and a language model (LM) generated based upon expanded pattern data. The expanded pattern data includes a name of an entity and a search term, where the entity belongs to a segment identified in a knowledge base. The search term has been included in queries for entities belonging to the segment. The computing system identifies a sequence of words corresponding to the features based upon results of the ASR. The computing system transmits computer-readable text to a search engine, where the text includes the sequence of words.
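A sketch of how expanded pattern data might be produced from the knowledge base, assuming search terms are stored as templates with an `<entity>` placeholder; the placeholder convention and all names are illustrative, not the patent's format:

```python
from itertools import product

def expand_patterns(knowledge_base, segment_queries):
    """Cross entity names in each segment with search terms observed in
    queries for that segment, yielding sentences for LM training."""
    sentences = []
    for segment, entities in knowledge_base.items():
        for term, entity in product(segment_queries.get(segment, []), entities):
            sentences.append(term.replace("<entity>", entity))
    return sentences

kb = {"restaurants": ["burger palace", "thai garden"]}
queries = {"restaurants": ["directions to <entity>", "<entity> opening hours"]}
print(expand_patterns(kb, queries))
# ['directions to burger palace', 'directions to thai garden', ...]
```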
ELECTRONIC DEVICE AND SPEECH PROCESSING METHOD THEREOF
According to various example embodiments, an electronic device includes a microphone configured to receive an audio signal including speech of a user, a processor, and a memory configured to store instructions executable by the processor and personal information of the user, in which the processor is configured to extract a plurality of speech recognition candidates by analyzing a feature of the speech of the user, extract a keyword based on the plurality of speech recognition candidates, search for replacement data, based on the keyword and the personal information, and generate a recognition result corresponding to the speech of the user, based on the replacement data.
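One way the candidate-keyword-replacement step could look, assuming the keyword is a word on which the N-best hypotheses disagree and the personal information is a contact list; the agreement heuristic and fuzzy match are illustrative choices, not the patent's method:

```python
import difflib

def generate_result(candidates, personal_info):
    """candidates: N-best ASR hypotheses; personal_info: user's stored data."""
    token_sets = [set(c.split()) for c in candidates]
    common = set.intersection(*token_sets)  # tokens all hypotheses agree on
    best = candidates[0]
    for word in best.split():
        if word in common:
            continue  # consistently recognised; leave it alone
        # The hypotheses disagree here: search the user's personal
        # information for replacement data close to the uncertain word.
        match = difflib.get_close_matches(
            word, personal_info.get("contacts", []), n=1, cutoff=0.6)
        if match:
            return best.replace(word, match[0])
    return best
```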
Deliberation Model-Based Two-Pass End-To-End Speech Recognition
A method of performing speech recognition using a two-pass deliberation architecture includes receiving a first-pass hypothesis and an encoded acoustic frame and encoding the first-pass hypothesis at a hypothesis encoder. The first-pass hypothesis is generated by a recurrent neural network (RNN) decoder model for the encoded acoustic frame. The method also includes generating, using a first attention mechanism attending to the encoded acoustic frame, a first context vector, and generating, using a second attention mechanism attending to the encoded first-pass hypothesis, a second context vector. The method also includes decoding the first context vector and the second context vector at a context vector decoder to form a second-pass hypothesis.
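A compact PyTorch sketch of the second pass: encode the first-pass hypothesis, attend separately to the acoustic frames and to the encoded hypothesis, and decode the concatenated context vectors. Layer sizes and the choice of LSTM/multi-head attention are assumptions for illustration:

```python
import torch
import torch.nn as nn

class DeliberationDecoder(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.hyp_encoder = nn.LSTM(dim, dim, batch_first=True)
        self.acoustic_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.hypothesis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.decoder = nn.LSTM(2 * dim, dim, batch_first=True)

    def forward(self, state, acoustic_frames, first_pass_embeds):
        # Encode the first-pass hypothesis at the hypothesis encoder.
        hyp_enc, _ = self.hyp_encoder(first_pass_embeds)
        # First attention: attend to the encoded acoustic frames.
        ctx_a, _ = self.acoustic_attn(state, acoustic_frames, acoustic_frames)
        # Second attention: attend to the encoded first-pass hypothesis.
        ctx_h, _ = self.hypothesis_attn(state, hyp_enc, hyp_enc)
        # Decode both context vectors into second-pass decoder states.
        out, _ = self.decoder(torch.cat([ctx_a, ctx_h], dim=-1))
        return out

# Shapes: (batch, decoder steps, dim), (batch, frames, dim), (batch, tokens, dim)
m = DeliberationDecoder()
out = m(torch.zeros(1, 10, 256), torch.zeros(1, 200, 256), torch.zeros(1, 12, 256))
```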
Adaptively modifying dialog output by an artificial intelligence engine during a conversation with a customer based on changing the customer's negative emotional state to a positive one
In some examples, a server may receive an utterance from a customer. The utterance may be included in a conversation between an artificial intelligence engine and the customer. The server may convert the utterance to text and determine a customer intent based on the text and a user history. The server may determine a user model of the customer based on the text and the customer intent. The server may update a conversation state associated with the conversation based on the customer intent and the user model. The server may determine a user state based on the user model and the conversation state. The server may select, using a reinforcement learning based module, a particular action from a set of actions, the particular action including a response, and provide the response to the customer.
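A minimal sketch of the action-selection step, using epsilon-greedy over a learned value table as one standard reinforcement-learning policy; the patent does not specify which RL method its module uses, and all names here are illustrative:

```python
import random

def select_action(user_state, q_table, actions, epsilon=0.1):
    """Choose a response action for the current user state."""
    if random.random() < epsilon:
        return random.choice(actions)  # explore occasionally
    # Exploit: pick the action with the highest learned value for this
    # user state (e.g. intent plus inferred emotional state).
    return max(actions, key=lambda a: q_table.get((user_state, a), 0.0))
```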
SPECIFYING PREFERRED INFORMATION SOURCES TO AN ASSISTANT
Implementations relate to interactions between a user and an automated assistant during a dialog between the user and the automated assistant. Some implementations relate to processing received user request input to determine that it is of a particular type that is associated with a source parameter rule and, in response, causing one or more sources indicated as preferred by the source parameter rule and one or more additional sources not indicated by the source parameter rule to be searched based on the user request input. Further, those implementations relate to identifying search results of the search(es), and generating, in dependence on the search results, a response to the user request that includes content from search result(s) of the preferred source(s) and/or content from search result(s) of the additional source(s). Generating the response further includes including, in the response, an indication of whether the source parameter rule was followed or violated in generating the response.
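A sketch of that routing logic under one plausible data layout (a list of rule dicts and a `search(text, source)` callable); the shapes of the rule and request objects are assumptions, not the implementations' actual representation:

```python
def answer(request, source_rules, search):
    """Search preferred plus additional sources; note rule compliance."""
    rule = next((r for r in source_rules if r["type"] == request["type"]), None)
    preferred = rule["sources"] if rule else []
    additional = [s for s in request.get("all_sources", []) if s not in preferred]
    results = {s: search(request["text"], s) for s in preferred + additional}
    # Use content from a preferred source when any of them answered.
    hits = [s for s in preferred if results.get(s)]
    chosen = hits[0] if hits else next((s for s in additional if results[s]), None)
    followed = chosen in preferred
    note = ("answered from a preferred source" if followed
            else "answered from a non-preferred source")
    return {"content": results.get(chosen), "note": note}
```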
Deep learning internal state index-based search and classification
Systems and methods are disclosed for generating internal state representations of a neural network during processing and using the internal state representations for classification or search. In some embodiments, the internal state representations are generated from the output activation functions of a subset of nodes of the neural network. The internal state representations may be used for classification by training a classification model using internal state representations and corresponding classifications. The internal state representations may be used for search by producing a search feature from a search input and comparing the search feature with one or more feature representations to find the feature representation with the highest degree of similarity.
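A minimal sketch of the two pieces: collecting activations from a chosen subset of layers as the internal state representation, and a nearest-match search over indexed representations. Cosine similarity and the layer-list model interface are illustrative assumptions:

```python
import numpy as np

def internal_state(model, x, layers):
    """model: a sequence of callable layers; layers: indices to tap."""
    states = []
    for i, layer in enumerate(model):
        x = layer(x)
        if i in layers:
            states.append(np.ravel(x))  # tap this layer's activations
    return np.concatenate(states)

def search(query_state, index):
    """index: list of {'state': vector, ...}; return the closest entry."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return max(index, key=lambda item: cos(query_state, item["state"]))
```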
System and method for robust access and entry to large structured data using voice form-filling
A method, apparatus and machine-readable medium are provided. A phonotactic grammar is utilized to perform speech recognition on received speech and to generate a phoneme lattice. A document shortlist is generated based on using the phoneme lattice to query an index. A grammar is generated from the document shortlist. Data for each of at least one input field is identified based on the received speech and the generated grammar.
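A sketch of the shortlist-then-grammar steps, assuming the index is an inverted map from phoneme n-grams (taken from the recognition lattice) to record ids and that shortlisted records supply the per-field vocabularies; the scoring and data layout are illustrative:

```python
def shortlist(lattice_ngrams, inverted_index, k=50):
    """Score records by shared phoneme n-grams with the lattice; keep top k."""
    scores = {}
    for ngram in lattice_ngrams:
        for rec_id in inverted_index.get(ngram, []):
            scores[rec_id] = scores.get(rec_id, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)[:k]

def field_grammar(records, shortlist_ids, fields):
    """Restrict each input field's grammar to values occurring in the
    shortlisted records; the speech is then re-recognised against it."""
    return {f: sorted({records[i][f] for i in shortlist_ids}) for f in fields}
```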