G10L2015/027

INFORMATION PROCESSING APPARATUS, KEYWORD DETECTING APPARATUS, AND INFORMATION PROCESSING METHOD
20210065684 · 2021-03-04

According to one embodiment, an information processing apparatus includes the following units. The acquisition unit acquires first training data including a combination of a voice feature quantity and a correct phoneme label of the voice feature quantity. The training unit trains an acoustic model using the first training data so that it outputs the correct phoneme label in response to input of the voice feature quantity. The extraction unit extracts, from the first training data, second training data including voice feature quantities of at least one of a keyword, a sub-word, a syllable, or a phoneme included in the keyword. The adaptation processing unit adapts the trained acoustic model to a keyword detection model using the second training data.
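The extraction step can be illustrated with a minimal sketch. It assumes the first training data is a list of (feature, phoneme label) pairs and that the keyword is given as a list of its phonemes. The function name `extract_keyword_data` and this data layout are hypothetical and not taken from the abstract.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[List[float], str]  # (voice feature quantity, phoneme label); assumed layout


def extract_keyword_data(training_data: Sequence[Sample],
                         keyword_phonemes: Sequence[str]) -> List[Sample]:
    """Keep only the samples whose phoneme label occurs in the keyword.

    Hypothetical helper: the abstract also allows extraction at the keyword,
    sub-word, or syllable level, which this sketch does not cover.
    """
    wanted = set(keyword_phonemes)
    return [(feat, label) for feat, label in training_data if label in wanted]


first_data = [([0.1], "k"), ([0.2], "a"), ([0.3], "t"), ([0.4], "o")]
second_data = extract_keyword_data(first_data, ["k", "a"])
```

The resulting `second_data` would then be used to adapt the already trained acoustic model; that adaptation step is not shown here.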

Method and System of Providing Speech Rehearsal Assistance

A method and system for speech rehearsal assistance during a presentation rehearsal includes receiving audio data from a speech rehearsal session over a network, receiving a transcript for the audio data, the transcript including a plurality of words spoken during the speech rehearsal session, calculating a real-time speaking rate for the speech rehearsal session, determining if the speaking rate is within a threshold range, detecting utterance of a filler phrase or sound during the speech rehearsal session using at least in part a machine learning model trained for identifying filler phrases and sounds in text, and upon determining the speaking rate falls outside the threshold range or detecting the utterance of the filler phrase or sound, enabling real-time display of a notification on a display device.
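The notification decision can be sketched as follows. The threshold values (110 to 160 words per minute), the function names, and the use of a plain boolean in place of the machine learning filler detector are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import Sequence


def speaking_rate_wpm(word_count: int, elapsed_seconds: float) -> float:
    """Words per minute over the elapsed window."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return word_count * 60.0 / elapsed_seconds


def should_notify(transcript_words: Sequence[str],
                  elapsed_seconds: float,
                  filler_detected: bool,
                  low_wpm: float = 110.0,    # assumed lower bound
                  high_wpm: float = 160.0    # assumed upper bound
                  ) -> bool:
    """Notify when the rate leaves the threshold range or a filler was detected."""
    rate = speaking_rate_wpm(len(transcript_words), elapsed_seconds)
    out_of_range = not (low_wpm <= rate <= high_wpm)
    return out_of_range or filler_detected
```

In a real system `filler_detected` would come from the trained model mentioned in the abstract, and the rate would be recomputed continuously over a sliding window of the transcript.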

Assessment of speech consumability by text analysis

Methods, computer program products, and systems are presented. The methods include, for instance: obtaining an input text for an output speech. The number of words and syllables are counted in each sentence, and a mean sentence length of the input text is calculated. Each sentence length is checked against the mean sentence length and a variation for each sentence is calculated. For the input text, the consumability-readability score is produced as an average of variations for all sentences in the input text. The consumability-readability score indicates the level of satisfaction for the listener of the output speech based on the input text.
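A minimal sketch of the scoring described above, under several assumptions: sentences are split on `.`, `!`, and `?`; sentence length is measured in words only (syllable counting is omitted); and the "variation" of a sentence is taken to be its absolute deviation from the mean length. The abstract does not define these details, so the function names and choices here are hypothetical.

```python
import re
from typing import List


def sentence_lengths(text: str) -> List[int]:
    """Word count of each non-empty sentence (naive punctuation split)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def consumability_score(text: str) -> float:
    """Average of per-sentence variations from the mean sentence length."""
    lengths = sentence_lengths(text)
    if not lengths:
        raise ValueError("text contains no sentences")
    mean_len = sum(lengths) / len(lengths)
    # Assumed definition of variation: absolute deviation from the mean.
    variations = [abs(n - mean_len) for n in lengths]
    return sum(variations) / len(variations)


score = consumability_score("One two. One two three four.")
```

Here the sentence lengths are 2 and 4 words, the mean is 3, each variation is 1, and so the score is 1.0.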

Intelligent health monitoring

Embodiments are disclosed for health assessment and diagnosis implemented in an artificial intelligence (AI) system. In an embodiment, a method comprises: capturing, using one or more sensors of a device, signals including information about a user's symptoms; using one or more processors of the device to: collect other data correlative of symptoms experienced by the user; and implement pre-trained data-driven methods to: determine one or more symptoms of the user; determine a disease or disease state of the user based on the determined one or more symptoms; determine a medication effectiveness in suppressing at least one determined symptom or improving the determined disease state of the user; and present, using an output device, one or more items of evidence for at least one of the determined symptoms, the disease, disease state, or an indication of the medication effectiveness for the user.

METHOD, APPARATUS, DEVICE AND COMPUTER READABLE STORAGE MEDIUM FOR RECOGNIZING AND DECODING VOICE BASED ON STREAMING ATTENTION MODEL
20210020175 · 2021-01-21

A method, apparatus, device, and computer readable storage medium for recognizing and decoding a voice based on a streaming attention model are provided. The method may include generating a plurality of acoustic paths for decoding the voice using the streaming attention model, and then merging acoustic paths with identical last syllables of the plurality of acoustic paths to obtain a plurality of merged acoustic paths. The method may further include selecting a preset number of acoustic paths from the plurality of merged acoustic paths as retained candidate acoustic paths. Embodiments of the present disclosure present a concept that the acoustic score calculation for a current voice fragment is affected only by the immediately preceding voice fragment and is independent of earlier voice history, and merge acoustic paths with identical last syllables among the plurality of candidate acoustic paths.
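The merge-then-select step can be sketched as below. This assumes each acoustic path is a non-empty tuple of syllables with a log-score, that merging keeps the highest-scoring path among those sharing a last syllable, and that selection keeps the top `beam_size` merged paths. The merge rule and all names are assumptions; the abstract does not say how merged paths are scored.

```python
from typing import Dict, List, Sequence, Tuple

Path = Tuple[Tuple[str, ...], float]  # (syllable sequence, score); assumed layout


def merge_and_prune(paths: Sequence[Path], beam_size: int) -> List[Path]:
    """Merge paths ending in the same syllable, then keep the best beam_size."""
    best_by_last: Dict[str, Path] = {}
    for syllables, score in paths:
        last = syllables[-1]
        kept = best_by_last.get(last)
        # Assumed merge rule: the higher-scoring path represents the group.
        if kept is None or score > kept[1]:
            best_by_last[last] = (syllables, score)
    merged = sorted(best_by_last.values(), key=lambda p: p[1], reverse=True)
    return merged[:beam_size]


candidates = [
    (("ni", "hao"), -1.0),
    (("li", "hao"), -2.0),
    (("ni", "ma"), -1.5),
    (("ni", "men"), -3.0),
]
retained = merge_and_prune(candidates, beam_size=2)
```

With these inputs the two paths ending in "hao" collapse into one, leaving three merged paths, of which the two with the highest scores are retained.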

Multimedia authoring apparatus with synchronized motion and voice feature and method for the same

Disclosed is a technique for a multimedia authoring tool embodied in a computer program. A voice clip is displayed on a timeline, and a playback time point for each syllable and pronunciation information of the corresponding syllable are also displayed. Also, a motion clip may be edited in synchronization with the voice clip on the basis of the playback time point for each syllable. By moving a syllable of the voice clip along the timeline, a portion of the voice clip may be altered.

DEVICE FOR RECOGNIZING SPEECH INPUT FROM USER AND OPERATING METHOD THEREOF

Provided are a device for recognizing a speech input including a named entity from a user and an operating method thereof. The device is configured to: generate a weighted finite state transducer model by using a vocabulary list including a plurality of named entities; obtain a first string from a speech input received from a user, by using a first decoding model; obtain a second string by using a second decoding model that uses the weighted finite state transducer model, the second string including a word sequence, which corresponds to at least one named entity, and an unrecognized word sequence not identified as a named entity; and output a text corresponding to the speech input by substituting the unrecognized word sequence of the second string with a word sequence included in the first string.

METHOD AND DEVICE FOR GENERATING SPEECH RECOGNITION MODEL AND STORAGE MEDIUM
20200402500 · 2020-12-24

A method and device for generating a speech recognition model are provided. The method includes: obtaining training samples, wherein each training sample includes a speech frame sequence and a labeled text sequence; training the encoder by using the speech frame sequence as an input feature and using speech encoded frames of the speech frame sequence as an output feature; training the decoder by using the speech encoded frames as a first input feature and using the labeled text sequence as a first output feature, and obtaining a current prediction text sequence; and training the decoder again by using the speech encoded frames as a second input feature and using a sequence as a second output feature, wherein the sequence is obtained by sampling the labeled text sequence and the current prediction text sequence based on a preset probability.
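The final sampling step can be sketched as a per-position choice between the labeled token and the predicted token. Treating the sampling as per-position, requiring equal-length sequences, and the names `mix_sequences` and `p_label` are assumptions for illustration; the abstract only states that sampling is based on a preset probability.

```python
import random
from typing import List, Sequence


def mix_sequences(labeled: Sequence[str],
                  predicted: Sequence[str],
                  p_label: float,
                  rng: random.Random) -> List[str]:
    """At each position take the labeled token with probability p_label,
    otherwise the token from the current prediction."""
    if len(labeled) != len(predicted):
        raise ValueError("sequences must have the same length in this sketch")
    return [gold if rng.random() < p_label else pred
            for gold, pred in zip(labeled, predicted)]


rng = random.Random(0)
labeled = ["the", "cat", "sat"]
predicted = ["the", "cap", "sat"]
mixed = mix_sequences(labeled, predicted, p_label=0.7, rng=rng)
```

Setting `p_label` to 1.0 reproduces the labeled sequence and 0.0 reproduces the prediction; intermediate values expose the decoder to its own outputs during the second training pass.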

Method and device for extracting factoid associated words from natural language sentences
10861437 · 2020-12-08

A method and system for extracting factoid associated words from natural language sentences is disclosed. The method includes creating an input vector that includes a plurality of parameters for each target word in a sentence. For a target word, the plurality of parameters includes a Part of Speech (POS) vector, a word embedding, a word embedding for a head word of the target word, a dependency label, and a semantic role label. The method includes processing, for each target word, the input vector through a trained neural network and assigning one or more factoid tags to each target word in the sentence. The method includes extracting text associated with factoids from the sentence based on the one or more factoid tags. The method further includes providing a response to the sentence inputted by the user based on the text associated with the factoids.

Methods, devices and computer-readable storage media for real-time speech recognition

Methods, apparatuses, devices and computer-readable storage media for real-time speech recognition are provided. The method includes: based on an input speech signal, obtaining truncating information for truncating a sequence of features of the speech signal; based on the truncating information, truncating the sequence of features into a plurality of subsequences; and for each subsequence in the plurality of subsequences, obtaining a real-time recognition result through attention mechanism.
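The truncation step can be sketched under the assumption that the truncating information is a sorted list of boundary indices into the feature sequence. How those boundaries are obtained from the speech signal, and the attention step applied to each subsequence, are not shown. The function name and data layout are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def truncate_features(features: Sequence[T],
                      boundaries: Sequence[int]) -> List[List[T]]:
    """Split a feature sequence into subsequences at the given boundary indices.

    Assumes boundaries are sorted and lie within the sequence.
    """
    subsequences: List[List[T]] = []
    start = 0
    for end in boundaries:
        subsequences.append(list(features[start:end]))
        start = end
    if start < len(features):
        # Keep any frames after the last boundary as a final subsequence.
        subsequences.append(list(features[start:]))
    return subsequences


chunks = truncate_features(list(range(10)), [3, 7])
```

Each element of `chunks` would then be passed to the attention mechanism independently, which is what allows a recognition result to be produced before the whole utterance has been received.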