G10L13/02

Link-based audio recording, collection, collaboration, embedding and delivery system
11715455 · 2023-08-01

A machine has a processor and a memory connected to the processor. The memory stores instructions executed by the processor to supply a name page in response to a request from an administrator machine. Name page updates are received from the administrator machine. The name page updates include participants and associated network contact information for the participants. A code is utilized to form a link to the name page. Prompts for textual name information and audio name information are supplied to a client machine that activates the link to the name page. Textual name information and audio name information are received from the client machine. The textual name information and audio name information are stored in association with the name page. Navigation tools are supplied to facilitate access to the textual name information and audio name information.
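The flow in the abstract above can be sketched minimally: an administrator's name page gets a code that forms a shareable link, and a client who activates the link submits textual and audio name information that is stored with the page. All class, method, and URL names here are illustrative assumptions, not from the patent.

```python
import secrets

class NamePage:
    def __init__(self, administrator: str):
        self.administrator = administrator
        self.participants = {}                # participant -> network contact info
        self.entries = {}                     # participant -> (text info, audio bytes)
        self.code = secrets.token_urlsafe(6)  # code utilized to form the link

    def link(self) -> str:
        return f"https://example.invalid/name/{self.code}"

    def add_participant(self, name: str, contact: str) -> None:
        # Name page update received from the administrator machine.
        self.participants[name] = contact

    def submit(self, name: str, text_info: str, audio: bytes) -> None:
        # A client machine that activated the link supplies both the
        # textual name information and the recorded pronunciation.
        self.entries[name] = (text_info, audio)

page = NamePage("admin@example.invalid")
page.add_participant("Ada", "ada@example.invalid")
page.submit("Ada", "AY-duh", b"\x00\x01")
```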

VOICE ASSISTANT SYSTEM AND METHOD FOR PERFORMING VOICE ACTIVATED MACHINE TRANSLATION

A method for performing a query based on a natural language voice input is provided. The method includes receiving, via a microphone, a voice input of a user, and converting the voice input into a first text data object. The method further includes converting the first text data object into a first technical language object using AI, and submitting a query based on the first technical language object. A query result in a second technical language object is retrieved in response to the query, and the query result is converted into a second text data object using AI. The method further converts the second text data object into a voice data object indicating the query result, and outputs a voice signal to provide the information of the query result in a natural language form to the user.
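The staged pipeline above (voice → text → technical language → query → technical result → text → voice) can be made concrete with trivial stand-ins at each stage, so only the data flow matters. Every function here is an illustrative assumption: real systems would use ASR, an AI query translator (e.g. to SQL), and TTS.

```python
def speech_to_text(voice_input: bytes) -> str:
    return voice_input.decode("utf-8")                 # stand-in for ASR

def to_technical_language(text: str) -> str:
    # Stand-in for the AI conversion into a technical language object.
    return f"SELECT answer FROM facts WHERE question = '{text}'"

def run_query(technical_query: str) -> str:
    return "result:42"                                 # second technical language object

def to_natural_language(result: str) -> str:
    # Stand-in for the AI conversion back to a text data object.
    return f"The answer is {result.split(':')[1]}."

def text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")                        # stand-in for TTS

def answer(voice_input: bytes) -> bytes:
    text = speech_to_text(voice_input)
    query = to_technical_language(text)
    result = run_query(query)
    reply = to_natural_language(result)
    return text_to_speech(reply)                       # output voice signal
```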

SYSTEMS AND METHODS FOR AUTOMATED AUDIO TRANSCRIPTION, TRANSLATION, AND TRANSFER FOR ONLINE MEETING

The present invention discloses systems and methods for multimedia processing. For example, the present invention provides systems and methods for receiving spoken audio, converting the spoken audio to text, and transferring the text to a user. As desired, the speech or text can be translated into one or more different languages. Systems and methods for real-time conversion and transmission of speech and text are provided, including systems and methods for large scale processing of multimedia events.
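The transcribe-translate-transfer loop described above can be sketched with dictionary lookups standing in for the speech recognizer and translator. The patent covers real-time, large-scale processing of multimedia events, which this toy example does not attempt; all names are illustrative.

```python
def transcribe(audio_chunk: bytes) -> str:
    return audio_chunk.decode("utf-8")        # stand-in for speech-to-text

def translate(text: str, target: str) -> str:
    # Stand-in for machine translation into the user's language.
    phrasebook = {("hello", "es"): "hola", ("hello", "fr"): "bonjour"}
    return phrasebook.get((text, target), text)

def transfer(text: str, user: str, inbox: dict) -> None:
    inbox.setdefault(user, []).append(text)   # deliver the text to the user

inbox: dict = {}
for chunk in [b"hello"]:                      # spoken audio, chunk by chunk
    text = transcribe(chunk)
    for user, lang in [("alice", "es"), ("bob", "fr")]:
        transfer(translate(text, lang), user, inbox)
```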

APPARATUS, METHOD, AND COMPUTER PROGRAM FOR PROVIDING LIP-SYNC VIDEO AND APPARATUS, METHOD, AND COMPUTER PROGRAM FOR DISPLAYING LIP-SYNC VIDEO
20230023102 · 2023-01-26

Provided is a lip-sync video providing apparatus for providing a video in which a voice and lip shapes are synchronized. The lip-sync video providing apparatus is configured to obtain a template video including at least one frame and depicting a target object, obtain a target voice to be used as a voice of the target object, generate a lip image corresponding to the voice for each frame of the template video by using a trained first artificial neural network, and generate lip-sync data including frame identification information of a frame in the template video, the lip image, and position information regarding the lip image in a frame in the template video.
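The lip-sync data described above bundles, per frame of the template video, the frame's identification information, the generated lip image, and the position of that image within the frame. A dataclass sketch of that record (field names are assumptions, and the trained neural network that generates the lip image is stubbed out):

```python
from dataclasses import dataclass

@dataclass
class LipSyncDatum:
    frame_id: int        # frame identification information in the template video
    lip_image: bytes     # lip image from the first artificial neural network
    position: tuple      # (x, y, w, h) of the lip image within the frame

def generate_lip_sync_data(num_frames: int, target_voice: bytes):
    data = []
    for i in range(num_frames):
        lip = bytes([i % 256])    # stand-in for the trained network's output
        data.append(LipSyncDatum(frame_id=i, lip_image=lip,
                                 position=(120, 200, 64, 32)))
    return data

clip = generate_lip_sync_data(3, b"target voice")
```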

END-TO-END SPEECH CONVERSION

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for end-to-end speech conversion are disclosed. In one aspect, a method includes the actions of receiving first audio data of a first utterance of one or more first terms spoken by a user. The actions further include providing the first audio data as an input to a model that is configured to receive first given audio data in a first voice and output second given audio data in a synthesized voice without performing speech recognition on the first given audio data. The actions further include receiving second audio data of a second utterance of the one or more first terms spoken in the synthesized voice. The actions further include providing, for output, the second audio data of the second utterance of the one or more first terms spoken in the synthesized voice.
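The key property claimed above is the model's interface: audio in the first voice goes in, audio in a synthesized voice comes out, and no speech recognition happens anywhere in between. A toy stand-in that preserves exactly that interface (the fixed sample transform is an illustrative placeholder for a learned waveform mapping):

```python
def speech_conversion_model(first_audio: list) -> list:
    # No transcript is produced anywhere in this function: samples map
    # directly from the first voice to the synthesized voice.
    return [0.5 * s for s in first_audio]    # stand-in waveform mapping

def convert(first_audio: list) -> list:
    second_audio = speech_conversion_model(first_audio)
    return second_audio                      # provided for output

first_utterance = [1.0, -2.0, 0.25]          # first audio data (toy samples)
second_utterance = convert(first_utterance)  # same terms, synthesized voice
```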

AUDIO PROCESSING METHOD AND APPARATUS BASED ON ARTIFICIAL INTELLIGENCE, DEVICE, STORAGE MEDIUM, AND COMPUTER PROGRAM PRODUCT
20230230571 · 2023-07-20

Disclosed are an audio processing method performed by an electronic device, a non-transitory computer-readable storage medium, and a computer program product. The method includes: sampling multiple fragments of audio data of a target object to obtain reference audio data of the target object; performing audio encoding on the reference audio data of the target object to obtain a reference embedding vector of the reference audio data; performing tone-based attention processing on the reference embedding vector of the reference audio data to obtain a tone embedding vector of the target object, wherein the tone embedding vector is independent of the content of the audio data; and generating audio data of a target text that conforms to the tone of the target object according to the tone embedding vector of the target object.
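The tone-extraction step above can be sketched as attention pooling: reference embeddings from several sampled fragments are scored against a tone query and combined as a weighted average, so the pooled tone embedding reflects the speaker's tone rather than any one fragment's content. Plain-Python softmax attention over toy 2-D embeddings; all numbers and names are illustrative assumptions.

```python
import math

def softmax(scores):
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def tone_embedding(reference_embeddings, query):
    # Score each fragment's reference embedding against a tone query,
    # then take the attention-weighted average as the tone embedding.
    scores = [sum(q * e for q, e in zip(query, emb))
              for emb in reference_embeddings]
    weights = softmax(scores)
    dim = len(reference_embeddings[0])
    return [sum(w * emb[d] for w, emb in zip(weights, reference_embeddings))
            for d in range(dim)]

refs = [[1.0, 0.0], [0.0, 1.0]]              # embeddings of two sampled fragments
tone = tone_embedding(refs, query=[1.0, 1.0])
```

With a symmetric query, both fragments score equally, so each gets weight 0.5 and the tone embedding is their mean.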
