G10L15/25

REMOTELESS CONTROL OF DRONE BEHAVIOR
20230047759 · 2023-02-16 ·

A drone system is configured to capture an audio stream that includes voice commands from an operator, to process the audio stream for identification of the voice commands, and to perform operations based on the identified voice commands. The drone system can identify a particular voice stream in the audio stream as an operator voice, and perform the command recognition with respect to the operator voice to the exclusion of other voice streams present in the audio stream. The drone can include a directional camera that is automatically and continuously focused on the operator to capture a video stream usable in disambiguation of different voice streams captured by the drone.

METHOD AND DEVICE FOR GENERATING SPEECH VIDEO ON BASIS OF MACHINE LEARNING
20220358703 · 2022-11-10 ·

A device for generating a speech video may include a first encoder to receive a person background image corresponding to a video part of a speech video of a person and extract an image feature vector from the person background image, a second encoder to receive a speech audio signal corresponding to an audio part of the speech video and extract a voice feature vector from the speech audio signal, a combiner to generate a combined vector by combining the image feature vector output from the first encoder and the voice feature vector output from the second encoder, and a decoder to reconstruct the speech video of the person using the combined vector as an input. The person background image input to the first encoder includes a face and an upper body of the person, with a portion related to speech of the person covered with a mask.
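The data flow in this abstract (mask the speech-related region, encode image and audio separately, combine the two vectors) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the encoders are simple averaging placeholders rather than learned networks, the combiner is plain concatenation (the abstract only says "combining"), and the decoder is omitted.

```python
from typing import List

Image = List[List[float]]  # toy grayscale image: rows of pixel values


def mask_speech_region(image: Image, top: int, bottom: int) -> Image:
    """Return a copy of the image with rows top..bottom-1 zeroed out.

    Stands in for covering the speech-related portion (e.g. the mouth area).
    """
    return [
        [0.0] * len(row) if top <= r < bottom else list(row)
        for r, row in enumerate(image)
    ]


def encode_image(image: Image) -> List[float]:
    """Placeholder first encoder (hypothetical): mean value of each row."""
    return [sum(row) / len(row) for row in image]


def encode_audio(samples: List[float], frame: int = 2) -> List[float]:
    """Placeholder second encoder (hypothetical): mean |amplitude| per frame."""
    features = []
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        features.append(sum(abs(s) for s in chunk) / len(chunk))
    return features


def combine(image_vec: List[float], voice_vec: List[float]) -> List[float]:
    """Assumed combiner: concatenate the two feature vectors."""
    return image_vec + voice_vec


masked = mask_speech_region([[1, 1], [2, 2], [3, 3]], top=1, bottom=2)
combined = combine(encode_image(masked), encode_audio([1, -1, 2, -2]))
print(combined)
```

In a real system the combined vector would be passed to a trained decoder that reconstructs the masked region so that it matches the audio.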

VOICE ACTIVITY DETECTION METHOD AND APPARATUS, ELECTRONIC DEVICE AND STORAGE MEDIUM

The present disclosure discloses a voice activity detection method and apparatus, an electronic device and a storage medium, and relates to the field of artificial intelligence, such as deep learning, intelligent voices, or the like. The method may include: acquiring time-aligned voice data and video data; performing a first detection of a voice start point and a voice end point of the voice data using a voice detection model obtained by a training operation; performing a second detection of a lip movement start point and a lip movement end point of the video data; and correcting a result of the first detection using a result of the second detection, and taking a corrected result as a voice activity detection result. The solution of the present disclosure may improve accuracy of the voice activity detection result, or the like.
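The correction step can be sketched as follows. The abstract does not state the correction rule, so this example assumes one possible rule: a detected voice segment is kept only where it overlaps a detected lip-movement segment. The function name and the segment representation are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def correct_voice_segments(voice: List[Segment], lips: List[Segment]) -> List[Segment]:
    """Intersect audio-based voice segments with lip-movement segments.

    Assumed rule (not taken from the abstract): audio detected as speech
    while the lips are not moving is treated as noise and dropped.
    """
    corrected = []
    for v_start, v_end in voice:
        for l_start, l_end in lips:
            start = max(v_start, l_start)
            end = min(v_end, l_end)
            if start < end:  # the two segments overlap
                corrected.append((start, end))
    return sorted(corrected)


voice_segments = [(0.5, 2.0), (3.0, 4.5)]  # first detection (audio model)
lip_segments = [(0.7, 1.8), (5.0, 6.0)]    # second detection (video)
print(correct_voice_segments(voice_segments, lip_segments))  # [(0.7, 1.8)]
```

Here the second voice segment is discarded because no lip movement coincides with it, which is the kind of false positive the video channel is meant to catch.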

PORTABLE TERMINAL DEVICE AND INFORMATION PROCESSING SYSTEM
20230039067 · 2023-02-09 ·

A portable terminal device in an information processing system and method includes a camera and a microphone. Data of obtained images and voice are transmitted to a server that identifies operations to be executed based on the received voice and image data. The server transmits an identification of one or more results of the plurality of operations to the portable terminal device. When the portable terminal device receives only one result from the server, an operation corresponding to the one result is executed, and when a plurality of results is received, the portable terminal device displays information corresponding to the plurality of results as candidates. Additional voice is captured for selecting one of the plurality of results during the displaying of the information. A determination of one result from the plurality of results is made based on the captured voice, and an operation corresponding to the determined result is executed.
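The terminal-side branching described above might look like the sketch below. The callback names (`execute`, `display`, `capture_voice`) are hypothetical, and matching the follow-up utterance by exact, case-insensitive comparison is an assumption; the abstract does not say how the selection is determined.

```python
from typing import Callable, List, Optional


def handle_results(
    results: List[str],
    execute: Callable[[str], None],
    display: Callable[[List[str]], None],
    capture_voice: Callable[[], str],
) -> Optional[str]:
    """Execute a single result directly, or let the user pick one by voice."""
    if not results:
        return None
    if len(results) == 1:
        # Only one result came back from the server: run it immediately.
        execute(results[0])
        return results[0]
    # Several results: show them as candidates, then listen for a selection.
    display(results)
    spoken = capture_voice().strip().lower()
    chosen = next((c for c in results if c.lower() == spoken), None)
    if chosen is not None:
        execute(chosen)
    return chosen


executed: List[str] = []
handle_results(["call Alice", "call Alex"], executed.append, print, lambda: "Call Alex")
print(executed)  # ['call Alex']
```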

Automatic dialing

In general, the subject matter described in this specification can be embodied in methods, systems, and program products for providing search results automatically to a user of a computing device. A spoken input provided by a user to a computing device is received. The spoken input is transmitted to a computer server system that is remote from the computing device. Search result information that is responsive to the spoken input is received by the computing device in response to the transmitted spoken input. An alert is provided to the user that the device will connect the user to a target of the search result information if the user does not intervene to stop the connecting of the user. The user is connected to the target of the search result information based on a determination that the user has not intervened to stop the connecting of the user.
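The alert-then-connect behavior can be sketched with a cancellable grace period. This is a minimal illustration, not the patented method: the function name, the callbacks, and the fixed `grace_seconds` timeout are all assumptions, and user intervention is modeled as a `threading.Event` being set.

```python
import threading
from typing import Callable


def auto_connect(
    target: str,
    connect: Callable[[str], None],
    alert: Callable[[str], None],
    cancel_event: threading.Event,
    grace_seconds: float = 3.0,
) -> bool:
    """Alert the user, wait for a possible cancellation, then connect.

    Returns True if the connection was made, False if the user intervened.
    """
    alert(f"Connecting to {target} in {grace_seconds:g} s unless cancelled")
    # Event.wait returns True if the event was set before the timeout expired.
    if cancel_event.wait(timeout=grace_seconds):
        return False
    connect(target)
    return True


cancel = threading.Event()
dialed = []
print(auto_connect("Pizza Place", dialed.append, print, cancel, grace_seconds=0.1))
```

In a real device, a UI handler (for example a "Cancel" button or a spoken "stop") would call `cancel.set()` from another thread during the grace period.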

Lip language recognition method and mobile terminal using sound and silent modes

A lip language recognition method, applied to a mobile terminal having a sound mode and a silent mode, includes: training a deep neural network in the sound mode; collecting a user's lip images in the silent mode; and identifying content corresponding to the user's lip images with the deep neural network trained in the sound mode. The method further includes: switching from the sound mode to the silent mode when a privacy need of the user arises.