G10L2015/027

SYSTEMS AND METHODS FOR AUDIO SIGNAL GENERATION
20230317092 · 2023-10-05 ·

The method for audio signal generation may include obtaining a bone conduction audio signal and an air conduction audio signal. The method may also include obtaining a trained machine learning model that provides a mapping relationship between a set of bone conduction data derived from a specific bone conduction audio signal and one or more sets of equivalent air conduction data derived from a specific equivalent air conduction audio signal. The method may also include determining a target set of equivalent air conduction data corresponding to the bone conduction audio signal using the trained machine learning model based on the bone conduction audio signal and the air conduction audio signal. The method may further include causing an audio signal output device to output a target audio signal representing the speech of the user based on the target set of equivalent air conduction data.
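
The selection step can be illustrated with a toy sketch. All names and data here are invented: `trained_model` is a placeholder for the learned one-to-many mapping, and the least-squares match against the air-conduction signal is one plausible way to pick the target set, not necessarily the patented one.

```python
import numpy as np

rng = np.random.default_rng(0)

def trained_model(bone_features):
    # Placeholder for the learned mapping: one set of bone-conduction
    # data maps to several candidate sets of equivalent air-conduction
    # data (here, three gain-scaled variants of a fixed transform).
    base = np.tanh(bone_features)
    return [base * g for g in (0.8, 1.0, 1.2)]

def select_target(bone_features, air_features):
    # The picked-up air-conduction signal disambiguates among the
    # candidates: keep the one closest to it in least-squares sense.
    candidates = trained_model(bone_features)
    errors = [np.linalg.norm(c - air_features) for c in candidates]
    return candidates[int(np.argmin(errors))]

bone = rng.normal(size=16)
air = np.tanh(bone) * 1.15   # stand-in reference air-conduction data
target = select_target(bone, air)
```

The chosen target set would then drive the audio output device in place of the raw bone-conduction signal.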

Method and apparatus for speech recognition, and storage medium

Proposed are a method and apparatus for speech recognition, and a storage medium. The specific solution includes: obtaining audio data to be recognized; decoding the audio data to obtain a first syllable of a to-be-converted word, in which the first syllable is a combination of at least one phoneme corresponding to the to-be-converted word; obtaining a sentence to which the to-be-converted word belongs and a converted word in the sentence, and obtaining a second syllable of the converted word; encoding the first syllable and the second syllable to generate first encoding information of the first syllable; and decoding the first encoding information to obtain a text corresponding to the to-be-converted word.
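
The role of the converted words' syllables as context can be shown with a toy stand-in. The patent describes a neural encoder/decoder; here a lookup table with invented pinyin-style syllables and homophone candidates replaces it, purely to illustrate how context syllables disambiguate a to-be-converted syllable.

```python
# Invented data: one syllable may map to several written words; the
# syllables of already-converted words in the sentence pick among them.
HOMOPHONES = {"shi4": ["是", "事", "市"]}
CONTEXT_PREFERENCE = {("gu3", "shi4"): "市"}  # e.g. "gu3 shi4" -> market

def convert(first_syllable, context_syllables):
    # Decode the to-be-converted syllable to text, preferring the
    # candidate supported by a converted word's syllable as context.
    candidates = HOMOPHONES.get(first_syllable, [])
    for ctx in context_syllables:
        choice = CONTEXT_PREFERENCE.get((ctx, first_syllable))
        if choice in candidates:
            return choice
    return candidates[0] if candidates else None
```

With the context syllable "gu3" present, "shi4" resolves to "市"; without context it falls back to the most common candidate.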

APPARATUS AND METHOD FOR SELF-SUPERVISED TRAINING OF END-TO-END SPEECH RECOGNITION MODEL

Disclosed herein are an apparatus and method for self-supervised training of an end-to-end speech recognition model. The apparatus includes memory in which at least one program is recorded and a processor for executing the program. The program trains an end-to-end speech recognition model, including an encoder and a decoder, using untranscribed speech data. The program may add predetermined noise to the input signal of the end-to-end speech recognition model, and may calculate loss by reflecting a predetermined constraint based on the output of the encoder of the end-to-end speech recognition model.
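
The loss construction can be sketched minimally. This is an assumed consistency-style reading of the abstract's encoder-output constraint, with a toy linear-plus-tanh "encoder": the perturbed input's encoder output is pulled toward the clean output, which needs no transcript.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 8)) * 0.1  # toy "encoder" weights (placeholder)

def encoder(x):
    return np.tanh(x @ W)

def self_supervised_loss(x, noise_scale=0.05):
    # Add predetermined noise to the input, then penalise the distance
    # between encoder outputs for the clean and noisy inputs -- a
    # constraint computable from untranscribed speech alone.
    noisy = x + rng.normal(scale=noise_scale, size=x.shape)
    clean_h, noisy_h = encoder(x), encoder(noisy)
    return float(np.mean((clean_h - noisy_h) ** 2))

x = rng.normal(size=(4, 8))  # a batch of untranscribed feature frames
loss = self_supervised_loss(x)
```

In a full system this term would be combined with the usual end-to-end objective when transcripts are available.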

Intelligent health monitoring

Embodiments are disclosed for health assessment and diagnosis implemented in an artificial intelligence (AI) system. The AI system takes as input information from a multitude of sensors measuring different biomarkers in a continuous or intermittent fashion. The proposed techniques disclosed herein address the unique challenges encountered in implementing such an AI system.

METHOD OF PROCESSING SPEECH INFORMATION, METHOD OF TRAINING MODEL, AND WAKE-UP METHOD
20230360638 · 2023-11-09 ·

A method of processing speech information, a method of training a speech model, a speech wake-up method, an electronic device, and a storage medium are provided, which relate to the field of artificial intelligence technology, in particular to the fields of human-computer interaction, deep learning and intelligent speech technologies. A specific implementation solution includes: performing syllable recognition on the speech information to obtain a posterior probability sequence for the speech information, where the speech information includes a speech frame sequence, the posterior probability sequence corresponds to the speech frame sequence, and each posterior probability in the posterior probability sequence represents a similarity between a syllable in the speech frame matched with the posterior probability and a predetermined syllable; and determining a target peak speech frame from the speech frame sequence based on the posterior probability sequence.
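
Peak-frame selection over the posterior sequence can be sketched as follows. The threshold value and local-maximum rule are assumptions for illustration; the patent does not fix a particular selection criterion.

```python
def target_peak_frame(posteriors, threshold=0.5):
    # Return the index of the frame whose posterior for the
    # predetermined (wake-up) syllable is a local maximum at or above
    # the threshold; -1 if no frame qualifies.
    best = -1
    for i, p in enumerate(posteriors):
        left = posteriors[i - 1] if i > 0 else 0.0
        right = posteriors[i + 1] if i + 1 < len(posteriors) else 0.0
        if p >= threshold and p >= left and p >= right:
            if best < 0 or p > posteriors[best]:
                best = i
    return best

probs = [0.1, 0.3, 0.8, 0.6, 0.2]  # toy posterior sequence
```

Here frame 2 is selected, since 0.8 exceeds the threshold and both neighbours.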

DIGITALLY AWARE NEURAL DICTATION INTERFACE

Systems and methods for populating the elements of content are disclosed. One method includes determining a plurality of elements of a document and receiving a first speech input from a user to enable a mode of operation. The method further includes authenticating the user by comparing the first speech input with at least one voice sample of the user and enabling the mode of operation. The method further includes receiving, in the mode of operation, a second speech input for filling out a first element of the document, determining, based on the first element, an irregularity or distortion in the second speech input, and identifying a missing or distorted syllable. The method further includes refining the second speech input into at least one matching syllable, converting the refined second speech input into text, and populating the first element with the text.

Method, apparatus and device for training network and storage medium

Embodiments of the present disclosure disclose a method, apparatus and device for training a network, and a storage medium, relating to the field of artificial intelligence technology such as deep learning and speech analysis. A semantic prediction network comprises an encoder network and at least one decoder network. A particular solution is: acquiring a first speech feature of a target speech sample, the target speech sample being a synthesized speech sample or a real speech sample, the synthesized speech sample being attached with a sample syllable label and a semantic label comprising a value of the domain, and the real speech sample being attached with a sample syllable label; and jointly training an initial semantic prediction network and a syllable classification network using the first speech feature of the target speech sample, to obtain a trained semantic prediction network.
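
The joint objective over mixed samples can be sketched with a toy squared-error stand-in. The key point from the abstract: every sample carries a syllable label, but only synthesized samples also carry a semantic label, so the semantic term is skipped for real speech. Loss form and names here are assumptions.

```python
import numpy as np

def joint_loss(sem_pred, syl_pred, syl_label, sem_label=None):
    # Syllable-classification term applies to every sample.
    syl_term = float(np.mean((syl_pred - syl_label) ** 2))
    if sem_label is None:          # real speech: no semantic label
        return syl_term
    # Synthesized speech additionally supervises the semantic network.
    sem_term = float(np.mean((sem_pred - sem_label) ** 2))
    return syl_term + sem_term
```

In training, batches of synthesized and real samples would simply contribute different subsets of the two terms.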

Verbal cues for high-speed control of a voice-enabled device

A technique for controlling a voice-enabled device using voice commands includes receiving an audio signal that is generated in response to a verbal utterance, generating a verbal utterance indicator for the verbal utterance based on the audio signal, selecting a first command for a voice-controlled application residing within the voice-enabled device based on the verbal utterance indicator, and transmitting the first command to the voice-controlled application as an input.

METHOD AND SYSTEM FOR PRESENTING A MULTIMEDIA STREAM
20220092109 · 2022-03-24 ·

A method for presenting a multimedia stream including a first audio stream and a second audio stream, comprising: receiving the first audio stream, wherein the first audio stream comprises a set of first audio slices sequentially located therein, wherein each first audio slice comprises a timestamp and a grade value; receiving the second audio stream, wherein the second audio stream comprises a set of second audio slices sequentially located in the second stream, each aligned in time with one of the first audio slices; presenting the first audio stream according to the timestamps of the first audio slices; receiving a set of control commands including a first threshold value; determining whether the first threshold value is lower than the grade value of a first audio slice; and, if so, presenting the second audio slice aligned with that first audio slice.
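
The threshold-gated presentation can be sketched as below. The slice fields come from the abstract; the assumption that both slice lists arrive in aligned order, and the `payload` field, are simplifications for illustration.

```python
from dataclasses import dataclass

@dataclass
class AudioSlice:
    timestamp: float
    grade: int
    payload: str

def present(first_slices, second_slices, threshold):
    # Play first-stream slices in timestamp order; whenever the control
    # threshold is lower than a slice's grade value, also play the
    # time-aligned second-stream slice (e.g. an extra commentary track).
    played = []
    ordered = sorted(first_slices, key=lambda s: s.timestamp)
    for a, b in zip(ordered, second_slices):
        played.append(a.payload)
        if threshold < a.grade:
            played.append(b.payload)
    return played

first = [AudioSlice(0.0, 1, "A1"), AudioSlice(1.0, 3, "A2")]
second = [AudioSlice(0.0, 0, "B1"), AudioSlice(1.0, 0, "B2")]
```

With threshold 2, only the second slice aligned with the grade-3 first slice is presented.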

Speech terminal, speech command generation system, and control method for a speech command generation system
11302318 · 2022-04-12 ·

A speech command generation system includes multiple speech terminals that communicate with each other via a network. Each terminal includes a sound pickup device and a speaker. At least one of the terminals converts locally picked-up sound data to text data while delaying output of the sound data to a remotely communicating terminal, and determines whether the text data includes a trigger word. When the text data includes the trigger word, the output of the sound data to the remotely communicating terminal is inhibited.
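
The delay-then-inhibit logic can be sketched minimally. The trigger phrase and the `speech_to_text` callable are invented stand-ins; the point is only that recognition runs on the delayed data before any forwarding decision is made.

```python
TRIGGER = "hey assistant"   # hypothetical trigger word

def forward_to_remote(sound_data, speech_to_text):
    # Sound output to the remote terminal is delayed until local
    # recognition has run; if the transcript contains the trigger
    # word, forwarding is inhibited so the wake-up command never
    # reaches the far-end terminal.
    text = speech_to_text(sound_data)
    if TRIGGER in text.lower():
        return None          # output inhibited
    return sound_data        # safe to forward
```

Ordinary conversation passes through unchanged; only trigger-word utterances are withheld from the far end.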