G10L13/10

SYNTHESIZED SPEECH AUDIO DATA GENERATED ON BEHALF OF HUMAN PARTICIPANT IN CONVERSATION
20230046658 · 2023-02-16 ·

Generating synthesized speech audio data on behalf of a given user in a conversation. The synthesized speech audio data includes synthesized speech that incorporates textual segment(s). The textual segment(s) can include recognized text that results from processing spoken input, of the given user, using a speech recognition model and/or can include a selection of a rendered suggestion that conveys the textual segment(s). Some implementations dynamically determine one or more prosodic properties for use in speech synthesis of the textual segment(s), and generate the synthesized speech with the one or more determined prosodic properties. The prosodic properties can be determined based on the textual segment(s) used in speech synthesis, textual segment(s) corresponding to recent spoken input of additional participant(s), attribute(s) of relationship(s) between the given user and additional participant(s) in the conversation, and/or feature(s) of a current location for the conversation.

AUTONOMOUS MOBILE BODY, INFORMATION PROCESSING METHOD, PROGRAM, AND INFORMATION PROCESSING DEVICE
20230042682 · 2023-02-09 ·

The present technology relates to an autonomous mobile body, an information processing method, a program, and an information processing device, by which a user experience based on an output sound of the autonomous mobile body can be improved. The autonomous mobile body includes a recognition section that recognizes a paired device that is paired with the autonomous mobile body, and a sound control section that changes a control method for an output sound to be outputted from the autonomous mobile body, on the basis of a recognition result of the paired device, and controls the output sound in accordance with the changed control method. The present technology is applicable to a robot, for example.
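The recognition-then-switch behavior can be made concrete with a short sketch; the class and method names below are assumptions for illustration, not the claimed implementation.

```python
# Illustrative sketch: an autonomous mobile body changes its output-sound
# control method based on whether it recognizes a paired device, then
# controls the output sound according to the changed method.

class SoundController:
    def __init__(self):
        self.method = "speaker"            # default: play from own speaker

    def on_recognition(self, paired_device_found: bool) -> None:
        # Change the control method based on the recognition result.
        self.method = "route_to_device" if paired_device_found else "speaker"

    def output(self, sound: str) -> str:
        if self.method == "route_to_device":
            return f"sending '{sound}' to paired device"
        return f"playing '{sound}' on built-in speaker"

ctrl = SoundController()
ctrl.on_recognition(paired_device_found=True)
print(ctrl.output("beep"))   # sending 'beep' to paired device
```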

REPRODUCTION CONTROL METHOD, CONTROL SYSTEM, AND PROGRAM
20230042477 · 2023-02-09 ·

A reproduction control method implemented by a computer includes receiving, from a first terminal device, a first reproduction request in accordance with an instruction from a first user, receiving, from a second terminal device, a second reproduction request in accordance with an instruction from a second user, acquiring a first acoustic signal representing a first sound in accordance with the first reproduction request, and a second acoustic signal representing a second sound that is in accordance with the second reproduction request and has acoustic characteristics that differ from acoustic characteristics of the first sound represented by the first acoustic signal, mixing the first acoustic signal and the second acoustic signal, thereby generating a third acoustic signal, and causing a reproduction system to reproduce a third sound represented by the third acoustic signal.
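The mixing step can be sketched in a few lines; here differing "acoustic characteristics" are reduced to differing gains, which is a simplifying assumption (real systems would operate on PCM buffers and may differ in timbre, spatialization, etc.).

```python
# Toy sketch of the mixing step: two requested signals with differing
# characteristics (here, just different gains) are summed into a third
# signal for reproduction.

def with_gain(signal: list[float], gain: float) -> list[float]:
    """Give a signal a distinct acoustic characteristic via amplitude."""
    return [gain * s for s in signal]

def mix(first: list[float], second: list[float]) -> list[float]:
    """Sample-wise sum of two equal-length acoustic signals."""
    return [a + b for a, b in zip(first, second)]

first = with_gain([0.1, 0.2, 0.3], 1.0)    # first user's requested sound
second = with_gain([0.1, 0.2, 0.3], 0.5)   # second user's, attenuated
third = mix(first, second)
print(third)
```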

Synthetic speech processing

A speech-processing system receives input data representing text. An input encoder processes the input data to determine first embedding data representing the text. A local attention encoder processes a subset of the first embedding data in accordance with a predicted size to determine second embedding data. An attention encoder processes the second embedding data to determine third embedding data. A decoder processes the third embedding data to determine audio data corresponding to the text.
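The four-stage data flow (input encoder, local attention encoder over a predicted-size window, attention encoder, decoder) can be sketched with toy stand-ins; every stage below is a deliberately simplified assumption (averages in place of learned networks), shown only to make the pipeline shape concrete.

```python
# Schematic sketch of the described pipeline. The real stages are learned
# models; these stand-ins only illustrate how data flows between them.

def input_encoder(text: str) -> list[float]:
    # First embedding data: one scalar "embedding" per character.
    return [float(ord(c)) for c in text]

def local_attention_encoder(emb: list[float], predicted_size: int) -> list[float]:
    # Second embedding data: attend only within a window of the predicted size.
    out = []
    for i in range(len(emb)):
        window = emb[max(0, i - predicted_size + 1): i + 1]
        out.append(sum(window) / len(window))
    return out

def attention_encoder(emb: list[float]) -> list[float]:
    # Third embedding data: global (uniform) attention mixes in the mean.
    mean = sum(emb) / len(emb)
    return [0.5 * e + 0.5 * mean for e in emb]

def decoder(emb: list[float]) -> list[float]:
    # "Audio" frames: here just a scaled copy of the third embedding data.
    return [e / 128.0 for e in emb]

audio = decoder(attention_encoder(local_attention_encoder(input_encoder("hi"), 2)))
print(audio)
```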

Generating videos with a character indicating a region of an image
11595738 · 2023-02-28 ·

Methods, systems, and computer-readable media for generating videos with characters indicating regions of images are provided. For example, an image containing a first region may be received. At least one characteristic of a character may be obtained. A script containing a first segment of the script may be received. The first segment of the script may be related to the first region of the image. The at least one characteristic of a character and the script may be used to generate a video of the character presenting the script and at least part of the image, where the character visually indicates the first region of the image while presenting the first segment of the script.
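One way to picture the alignment between script segments and image regions is a presentation timeline; the function, data shapes, and the words-per-second pacing below are hypothetical assumptions for illustration.

```python
# Hypothetical sketch: aligning script segments to image regions so the
# generated character can visually indicate the right region while
# presenting each segment.

def build_timeline(segments, segment_regions, words_per_second=2.5):
    """Return (start, end, text, region) cues for the presenter character."""
    cues, t = [], 0.0
    for text in segments:
        duration = len(text.split()) / words_per_second
        cues.append((t, t + duration, text, segment_regions.get(text)))
        t += duration
    return cues

cues = build_timeline(
    ["This lesion is visible here.", "The rest appears normal."],
    {"This lesion is visible here.": (120, 80, 40, 40)},  # region as x, y, w, h
)
print(cues)
```

A downstream renderer would use each cue's region to drive the character's pointing gesture during that time span.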

SOUND CONTROL DEVICE, SOUND CONTROL METHOD, AND SOUND CONTROL PROGRAM
20180005617 · 2018-01-04 ·

A sound control device includes: a reception unit that receives a start instruction indicating a start of output of a sound; a reading unit that reads a control parameter that determines an output mode of the sound, in response to the start instruction being received; and a control unit that causes the sound to be output in a mode according to the read control parameter.
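The three recited units (reception, reading, control) map naturally onto a small class; the structure and the parameter table below are assumptions for illustration, not the claimed implementation.

```python
# Minimal sketch of the three units: reception of a start instruction,
# reading a control parameter in response, and controlling the output
# in the mode the parameter determines.

CONTROL_PARAMS = {"chime": {"mode": "fade_in", "volume": 0.8}}

class SoundControlDevice:
    def receive(self, instruction: str) -> dict:
        # Reception unit: a start instruction triggers the parameter read.
        params = self.read_parameter(instruction)
        return self.control(instruction, params)

    def read_parameter(self, sound_id: str) -> dict:
        # Reading unit: fetch the parameter that determines the output mode.
        return CONTROL_PARAMS.get(sound_id, {"mode": "plain", "volume": 1.0})

    def control(self, sound_id: str, params: dict) -> dict:
        # Control unit: output the sound in the mode given by the parameter.
        return {"sound": sound_id, **params}

print(SoundControlDevice().receive("chime"))
```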
