G10L15/148

SPEECH SYNTHESIS DEVICE, SPEECH SYNTHESIS METHOD, SPEECH SYNTHESIS MODEL TRAINING DEVICE, SPEECH SYNTHESIS MODEL TRAINING METHOD, AND COMPUTER PROGRAM PRODUCT
20200357381 · 2020-11-12

A speech synthesis device of an embodiment includes a memory unit, a creating unit, a deciding unit, a generating unit and a waveform generating unit. The memory unit stores, as statistical model information of a statistical model, an output distribution of acoustic feature parameters including pitch feature parameters and a duration distribution. The creating unit creates a statistical model sequence from context information and the statistical model information. The deciding unit decides a pitch-cycle waveform count of each state using a duration based on the duration distribution of each state of each statistical model in the statistical model sequence, and pitch information based on the output distribution of the pitch feature parameters. The generating unit generates an output distribution sequence based on the pitch-cycle waveform count, and acoustic feature parameters based on the output distribution sequence. The waveform generating unit generates a speech waveform from the generated acoustic feature parameters.
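As a rough illustration of the deciding unit above, a pitch-cycle waveform count can be derived from a state's duration and its pitch. This is a minimal sketch, not the patented method: the function name and the use of a mean duration (seconds) and mean F0 (Hz) drawn from the respective distributions are assumptions.

```python
# Illustrative sketch (hypothetical helper, not from the patent):
# a state lasting `duration_sec` seconds at fundamental frequency
# `f0_hz` contains roughly duration * F0 pitch-cycle waveforms.

def pitch_cycle_waveform_count(duration_sec: float, f0_hz: float) -> int:
    """Approximate number of pitch-cycle waveforms in one state."""
    return max(1, round(duration_sec * f0_hz))

# e.g. a 50 ms state at 200 Hz, and a 120 ms state at 120 Hz
counts = [pitch_cycle_waveform_count(d, f0)
          for d, f0 in [(0.05, 200.0), (0.12, 120.0)]]
```

The output distribution sequence in the abstract would then contain one entry per counted pitch-cycle waveform rather than per fixed-length frame.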

SYSTEM AND METHOD FOR KEY PHRASE SPOTTING

A method for key phrase spotting may comprise: obtaining an audio; obtaining a plurality of candidate words corresponding to a plurality of audio portions and obtaining a first probability score for each correspondence between an obtained candidate word and an audio portion; determining whether the plurality of candidate words respectively match a plurality of key words of a key phrase and whether the first probability score of each of the plurality of candidate words exceeds a corresponding first threshold, the plurality of candidate words constituting a candidate phrase; and in response to determining that the plurality of candidate words match the plurality of key words and that each first probability score exceeds the corresponding first threshold, obtaining a second probability score representing a matching relationship between the candidate phrase and the key phrase based on the first probability score of each of the plurality of candidate words.
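The gating-then-scoring logic of this abstract can be sketched as follows. This is a hedged illustration: the function name, the data layout, and the choice of `min` as the rule for combining first scores into the second score are all assumptions, since the patent does not specify a combination rule here.

```python
def spot_key_phrase(candidates, keywords, thresholds, combine=min):
    """candidates: list of (word, first_probability) pairs, one per
    audio portion. Returns a second probability score only if every
    candidate matches its key word AND exceeds its first threshold;
    otherwise returns None (the phrase is rejected)."""
    scores = []
    for (word, p), kw, th in zip(candidates, keywords, thresholds):
        if word != kw or p <= th:
            return None          # gating step fails
        scores.append(p)
    # Combination rule is an assumption; min is one plausible choice.
    return combine(scores)
```

For example, `spot_key_phrase([("hey", 0.9), ("siri", 0.8)], ["hey", "siri"], [0.5, 0.5])` passes the gate and yields a second score of 0.8, while a mismatched word rejects the candidate phrase outright.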

Pulse-based automatic speech recognition

Various examples are provided related to speech recognition. In one example, a method includes converting an auditory signal into a pulse train, segmenting the pulse train into a series of frames having a predefined duration, and identifying a portion of the auditory signal by applying at least a portion of the series of frames segmented from the pulse train to a kernel adaptive autoregressive-moving-average (KAARMA) network. In another example, a speech recognition system includes processing circuitry configured to convert an auditory signal into a pulse train, segment the pulse train into a series of frames, and identify a portion of the auditory signal by applying at least a portion of the series of frames segmented from the pulse train to a KAARMA network. The series of frames segmented from the pulse train can be applied to a KAARMA chain including a plurality of KAARMA networks for identification.
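The segmentation step (pulse train into fixed-duration frames) is the simplest part of this pipeline to sketch. This is an assumed illustration, not the patented implementation; the KAARMA network itself is omitted.

```python
def segment_frames(pulse_train, frame_len):
    """Segment a pulse train (a flat list of samples) into
    consecutive fixed-length frames, dropping any trailing
    partial frame. frame_len plays the role of the predefined
    frame duration in samples."""
    return [pulse_train[i:i + frame_len]
            for i in range(0, len(pulse_train) - frame_len + 1, frame_len)]
```

Each resulting frame would then be fed, in sequence, to the KAARMA network (or chain of networks) for identification.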

INFORMATION PROCESSING DEVICE, INFORMATION PROCESSING METHOD, AND PROGRAM
20200143796 · 2020-05-07

The aim is to improve accuracy in detecting the presence or absence of a target object. A time-series segmentation unit 102 creates first time-series data by segmenting processing target data into frames over n time zones. First determination units 103 create m second time-series data by determining each frame of the first time-series data using m models having different characteristics. A second determination unit 104 creates a second determination result, as a presence probability of the target object, for a set of second time-series data including n × m data.
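The n × m structure of the second time-series data can be sketched as applying each of the m models to each of the n frames. This is a hedged illustration only; the function name and the use of plain callables as stand-ins for the differently characterized models are assumptions.

```python
def second_time_series(first_series, models):
    """first_series: n frames of first time-series data.
    models: m per-frame determination models (callables here).
    Returns m second time-series, i.e. an m x n grid containing
    n * m determinations in total."""
    return [[model(frame) for frame in first_series] for model in models]

# Toy stand-ins for two models with different characteristics
models = [lambda x: x > 0, lambda x: x > 1]
grid = second_time_series([0, 1, 2], models)
```

A second determination unit would then consume this whole grid to produce a single presence probability for the target object.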

Generation device, recognition system, and generation method for generating finite state transducer
10600407 · 2020-03-24

A generation device includes a receiving unit and a generating unit. The receiving unit receives a model representing correspondence between one or more phonetic symbols and one or more words. The generating unit generates a first finite state transducer based on the model, the first finite state transducer at least including, as outgoing transitions from a first state representing the transition destination of a first transition which has a first phonetic symbol of a predetermined type as its input symbol, a second transition that has as its input symbol a second phonetic symbol, which is different from a particular symbol representing part or the whole of the input symbol of the first transition, and a third transition that has as its input symbol a third phonetic symbol, which represents the particular symbol or silence.

SPEECH EMOTION DETECTION METHOD AND APPARATUS, COMPUTER DEVICE, AND STORAGE MEDIUM

A speech emotion detection system may obtain to-be-detected speech data. The system may generate speech frames by performing framing processing on the to-be-detected speech data. The system may extract speech features corresponding to the speech frames to form a speech feature matrix corresponding to the to-be-detected speech data. The system may input the speech feature matrix to an emotion state probability detection model. The system may generate, based on the speech feature matrix and the emotion state probability detection model, an emotion state probability matrix corresponding to the to-be-detected speech data. The system may input the emotion state probability matrix and the speech feature matrix to an emotion state transition model. The system may generate an emotion state sequence based on the emotion state probability matrix, the speech feature matrix, and the emotion state transition model. The system may determine an emotion state based on the emotion state sequence.
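The final decoding step (probability matrix plus transition model into a state sequence) can be sketched greedily. This is a simplified assumption, not the patented models: the function name, the dict-based transition scores, and the greedy (rather than, say, Viterbi) decode are all illustrative choices.

```python
def emotion_state_sequence(prob_matrix, transition, states):
    """Greedy decode: for each frame, weight the per-state emotion
    probabilities by a transition score from the previously chosen
    state (a toy stand-in for the emotion state transition model),
    then pick the best-scoring state."""
    seq, prev = [], None
    for frame_probs in prob_matrix:
        scored = {
            s: p * (transition.get((prev, s), 1.0) if prev is not None else 1.0)
            for s, p in zip(states, frame_probs)
        }
        prev = max(scored, key=scored.get)
        seq.append(prev)
    return seq

states = ["happy", "sad"]
transition = {("happy", "happy"): 1.5, ("happy", "sad"): 0.5}
seq = emotion_state_sequence([[0.9, 0.1], [0.4, 0.6]], transition, states)
```

Here the transition model overrides the raw frame probabilities in the second frame: 0.4 × 1.5 beats 0.6 × 0.5, so the sequence stays "happy".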

AUDIO SEGMENTATION METHOD BASED ON ATTENTION MECHANISM

An audio segmentation method based on an attention mechanism is provided. The audio segmentation method according to an embodiment obtains a mapping relationship between an inputted text and an audio spectrum feature vector for generating an audio signal, the audio spectrum feature vector being automatically synthesized from the inputted text, and segments an inputted audio signal by using the mapping relationship. Accordingly, high quality can be maintained while the effort, time, and cost of audio segmentation are noticeably reduced by utilizing the attention mechanism.
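One way such a text-to-audio mapping can induce segment boundaries is to assign each audio frame to the text token that attends to it most strongly, then cut wherever the owning token changes. This is an assumed sketch of that idea, not the patented method.

```python
def segment_by_attention(attn):
    """attn[t][f]: attention weight of text token t on audio frame f
    (the mapping relationship between text and audio features).
    Assign each frame to its highest-weight token, then emit
    (token_index, start_frame, end_frame) segments."""
    n_frames = len(attn[0])
    owner = [max(range(len(attn)), key=lambda t: attn[t][f])
             for f in range(n_frames)]
    segments, start = [], 0
    for f in range(1, n_frames + 1):
        if f == n_frames or owner[f] != owner[start]:
            segments.append((owner[start], start, f))
            start = f
    return segments

# Two text tokens over four audio frames: token 0 dominates the
# first two frames, token 1 the last two.
segs = segment_by_attention([[0.9, 0.8, 0.1, 0.0],
                             [0.1, 0.2, 0.9, 1.0]])
```

The boundaries fall exactly where attention ownership switches from one token to the next, which is the intuition the abstract relies on.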

Systems and methods for generating labeled data to facilitate configuration of network microphone devices

Systems and methods for generating training data are described herein. Pieces of metadata can be received from a plurality of networked sensor systems, where each piece of metadata is associated with a specific set of sensor data captured by one of the plurality of networked sensor systems and includes a set of characteristics for the specific set of captured sensor data. A probabilistic model can be generated based on the received metadata, and simulations can be performed on a training corpus by generating multiple scenarios; for each scenario, a scenario-specific version of a particular annotated sample is generated by performing a simulation using that sample. The scenario-specific versions of annotated samples from the training corpus can be stored as a training data set on at least one network device.

MULTI-LANGUAGE MIXED SPEECH RECOGNITION METHOD
20190378497 · 2019-12-12

The invention discloses a multi-language mixed speech recognition method, which belongs to the technical field of speech recognition. The method comprises: step S1, configuring a multi-language mixed dictionary including a plurality of different languages; step S2, performing training according to the multi-language mixed dictionary and multi-language speech data including a plurality of different languages to form an acoustic recognition model; step S3, performing training according to a multi-language text corpus including a plurality of different languages to form a language recognition model; step S4, forming a speech recognition system from the multi-language mixed dictionary, the acoustic recognition model, and the language recognition model; and subsequently, recognizing mixed speech by using the speech recognition system and outputting a corresponding recognition result. This technical solution has the beneficial effects of supporting the recognition of mixed speech in multiple languages and improving the accuracy and efficiency of recognition, thereby improving the performance of the speech recognition system.
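The multi-language mixed dictionary of step S1 can be pictured as a single lookup table whose entries carry both a language tag and a pronunciation. Everything below is hypothetical: the words, the language tags, the phone symbols, and the helper function are illustrative stand-ins, not the patent's dictionary format.

```python
# Hypothetical mixed dictionary: each word maps to (language tag,
# phone sequence). English entries use ARPAbet-style phones and the
# French entry uses SAMPA-style phones purely for illustration.
mixed_dictionary = {
    "hello":   ("en", ["HH", "AH", "L", "OW"]),
    "world":   ("en", ["W", "ER", "L", "D"]),
    "bonjour": ("fr", ["b", "o~", "Z", "u", "R"]),
}

def languages_in(utterance_words):
    """Languages present in a (hypothetical) mixed utterance,
    looked up word by word in the mixed dictionary."""
    return {mixed_dictionary[w][0]
            for w in utterance_words if w in mixed_dictionary}
```

A mixed utterance such as `["hello", "bonjour"]` spans both languages through one dictionary, which is what lets a single acoustic and language model pair handle code-switched speech in steps S2 to S4.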

Word hash language model
10482875 · 2019-11-19

A language model may be used in a variety of natural language processing tasks, such as speech recognition, machine translation, sentence completion, part-of-speech tagging, parsing, handwriting recognition, or information retrieval. A natural language processing task may use a vocabulary of words, and a word hash vector may be created for each word in the vocabulary. A sequence of input words may be received, and a hash vector may be obtained for each word in the sequence. A language model may process the hash vectors for the sequence of input words to generate an output hash vector that describes words that are likely to follow the sequence of input words. One or more words may then be selected using the output hash vector and used for the natural language processing task.
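A deterministic per-word hash vector, as the abstract's vocabulary step requires, can be sketched as follows. This is an assumed illustration only: the patent's hash vectors may be learned or constructed differently, and the ±1 encoding, dimension, and use of SHA-256 here are arbitrary choices.

```python
import hashlib

def word_hash_vector(word: str, dim: int = 16) -> list:
    """Deterministic +/-1 hash vector for a word: bits of the word's
    SHA-256 digest are mapped to +1 or -1 components. The same word
    always yields the same vector; different words almost always
    yield different vectors."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return [1 if (digest[i % len(digest)] >> (i % 8)) & 1 else -1
            for i in range(dim)]
```

A language model would then consume one such vector per input word and emit an output hash vector, against which candidate next words are scored (e.g. by dot product with their own hash vectors).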