Patent classifications
G10L13/08
CORRECTION METHOD OF SYNTHESIZED SPEECH SET FOR HEARING AID
A method for correcting a synthesized speech set for hearing aid according to an aspect of the present invention includes the steps of outputting first synthesized speech for testing on the basis of first synthesized speech data for testing correlated with a first phoneme label in a synthesized speech set for testing, accepting a first answer selected by a user, outputting second synthesized speech for testing on the basis of second synthesized speech data for testing correlated with a second phoneme label in the synthesized speech set for testing, accepting a second answer selected by the user, and correlating first synthesized speech data for hearing aid with the second phoneme label instead of second synthesized speech data for hearing aid in a synthesized speech set for hearing aid, in a case in which the first answer matches the second phoneme label and also the second answer does not match the second phoneme label.
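Read as pseudocode, the correction rule above is a confusion-driven substitution: if the user hears the first test phoneme as the second one, but does not recognize the second test phoneme itself, the hearing-aid set reuses the first phoneme's speech data for the second label. A minimal sketch, assuming label-to-audio dictionaries and `play`/`ask_user` callbacks that the abstract does not specify:

```python
# Minimal sketch of the substitution rule described in the abstract.
# The data structures and callbacks are assumptions, not the patent's design.

def correct_hearing_aid_set(test_set, hearing_aid_set,
                            first_label, second_label,
                            play, ask_user):
    """Plays two test sounds, collects the user's answers, and remaps the
    hearing-aid set when the first sound is heard as the second phoneme."""
    # Output first/second synthesized speech for testing, accept the answers.
    play(test_set[first_label])
    first_answer = ask_user()
    play(test_set[second_label])
    second_answer = ask_user()

    # If the first answer matches the second phoneme label while the second
    # answer does not, correlate the first hearing-aid speech data with the
    # second phoneme label instead of the second hearing-aid speech data.
    if first_answer == second_label and second_answer != second_label:
        hearing_aid_set[second_label] = hearing_aid_set[first_label]
    return hearing_aid_set
```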
Training Speech Synthesis to Generate Distinct Speech Sounds
A method (800) of training a text-to-speech (TTS) model (108) includes obtaining training data (150) including reference input text (104) that includes a sequence of characters, a sequence of reference audio features (402) representative of the sequence of characters, and a sequence of reference phone labels (502) representative of distinct speech sounds of the reference audio features. For each of a plurality of time steps, the method includes generating a corresponding predicted audio feature (120) based on a respective portion of the reference input text for the time step and generating, using a phone label mapping network (510), a corresponding predicted phone label (520) associated with the predicted audio feature. The method also includes aligning the predicted phone label with the reference phone label to determine a corresponding predicted phone label loss (622) and updating the TTS model based on the corresponding predicted phone label loss.
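The core idea is an auxiliary phone-label head on top of the predicted audio features, trained with a per-time-step phone label loss. The PyTorch sketch below only illustrates that idea; the module and function names, the L1 feature loss, and the assumption that reference phone labels are already aligned per time step are mine, not the patent's:

```python
# Illustrative sketch (not the patent's implementation): a phone-label
# mapping head over predicted audio features, trained with a phone label
# loss alongside an ordinary feature reconstruction loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhoneLabelMapping(nn.Module):
    """Maps each predicted audio feature to a distribution over phone labels."""
    def __init__(self, feature_dim: int, num_phones: int):
        super().__init__()
        self.proj = nn.Linear(feature_dim, num_phones)

    def forward(self, predicted_features):      # (batch, T, feature_dim)
        return self.proj(predicted_features)    # (batch, T, num_phones)

def training_step(tts_model, phone_head, text_ids,
                  ref_features, ref_phone_labels, optimizer):
    # Predict an audio feature for each time step from the input text.
    pred_features = tts_model(text_ids)                         # (B, T, D)
    # Predict a phone label for each predicted audio feature.
    phone_logits = phone_head(pred_features)                    # (B, T, P)
    # Feature loss plus phone-label loss; ref_phone_labels is a (B, T)
    # LongTensor assumed to be aligned with the time steps.
    feature_loss = F.l1_loss(pred_features, ref_features)
    phone_loss = F.cross_entropy(phone_logits.transpose(1, 2), ref_phone_labels)
    loss = feature_loss + phone_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```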
VOICE INFORMATION PROCESSING METHOD AND ELECTRONIC DEVICE
A voice information processing method and an electronic device are provided. The voice information processing method may include: a first device (1100) obtains first voice information, and when the first voice information includes a wakeup keyword, the first device (1100) sends a voice assistant wakeup instruction to a second device (1200), such that the second device (1200) launches a voice assistant; then the first device (1100) obtains second voice information and sends the second voice information to the second device (1200), the second device (1200) determines a voice triggered event corresponding to the second voice information by using the voice assistant, and feeds target information associated with performance of the voice triggered event back to the first device (1100), such that the first device (1100) performs the voice triggered event based on the target information. The method can reduce the computing burden of the first device (1100).
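The division of work can be pictured as a simple message flow: the first device only spots the wakeup keyword and forwards the follow-up utterance, while the second device runs the voice assistant and returns the information needed to perform the event. The class and method names below are hypothetical, purely to illustrate that split:

```python
# Rough sketch of the two-device flow described above; all names are
# illustrative placeholders, not components defined in the patent.

def contains_wakeup_keyword(voice: str) -> bool:
    return "hey assistant" in voice

class SecondDevice:
    """More capable device: runs the voice assistant and resolves the event."""
    def wake_assistant(self):
        self.assistant_ready = True

    def handle_voice(self, voice: str) -> dict:
        # Placeholder for speech recognition and intent resolution.
        return {"event": "play_music", "track": "example"}

class FirstDevice:
    """Keyword spotting only; heavy processing is offloaded to the second device."""
    def __init__(self, second_device: SecondDevice):
        self.second_device = second_device

    def on_voice(self, first_voice: str, second_voice: str):
        if contains_wakeup_keyword(first_voice):
            self.second_device.wake_assistant()
            target_info = self.second_device.handle_voice(second_voice)
            self.perform_event(target_info)

    def perform_event(self, target_info: dict):
        print("performing voice-triggered event with", target_info)

# Usage: the first device detects the keyword, the second does the rest.
dev1 = FirstDevice(SecondDevice())
dev1.on_voice("hey assistant", "play some music")
```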
Automatic Voiceover Generation
A method includes receiving a voice request to generate synthesized voiceover speech for a target advertisement having one or more advertising campaign attributes. The method also includes generating, based on the one or more advertising campaign attributes, a voiceover script that includes a sequence of text for the synthesized voiceover speech. The method also includes generating, using a text-to-speech (TTS) system, the synthesized voiceover speech. The TTS system is configured to receive, as input, the sequence of text for the voiceover script and generate, as output, the synthesized voiceover speech. Here, the synthesized voiceover speech has speech characteristics specified by a target TTS vertical. The method also includes overlaying the synthesized voiceover speech on the target advertisement.
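The pipeline has three stages: script generation from the campaign attributes, synthesis conditioned on a target TTS vertical, and overlaying the result on the advertisement. A rough sketch in which every function is a hypothetical placeholder:

```python
# High-level sketch of the described pipeline; none of these functions are
# APIs from the patent, they only mark where each stage would sit.

def generate_voiceover(ad, campaign_attributes, tts_vertical):
    # 1. Build a voiceover script (a sequence of text) from the attributes.
    script = build_script(campaign_attributes)
    # 2. Synthesize speech whose characteristics match the target TTS vertical.
    audio = synthesize_speech(script, vertical=tts_vertical)
    # 3. Overlay the synthesized voiceover on the target advertisement.
    return overlay_audio(ad, audio)

def build_script(attributes):
    # Placeholder: a real system would condition a text generator or
    # template engine on the advertising campaign attributes.
    return f"Try {attributes['product']} today. {attributes['slogan']}"

def synthesize_speech(text, vertical):
    ...  # placeholder: call a TTS system conditioned on the vertical

def overlay_audio(ad, audio):
    ...  # placeholder: mix the voiceover track into the ad's media
```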
Synthetic speech processing
A speech-processing system receives input data representing text. An input encoder processes the input data to determine first embedding data representing the text. A local attention encoder processes a subset of the first embedding data in accordance with a predicted size to determine second embedding data. An attention encoder processes the second embedding data to determine third embedding data. A decoder processes the third embedding data to determine audio data corresponding to the text.
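The abstract describes a chain of four stages: input encoder, local attention encoder over a predicted-size subset, attention encoder, and decoder. The schematic PyTorch sketch below only illustrates that data flow; the layer choices, dimensions, and the windowing by `predicted_size` are assumptions of mine, not the patent's:

```python
# Schematic sketch of the encoder/decoder chain; shapes and layer types are
# illustrative assumptions.
import torch
import torch.nn as nn

class SpeechSynthesizer(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.input_encoder = nn.Embedding(vocab_size, dim)          # text -> first embedding
        self.local_attention = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attention_encoder = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.Linear(dim, 80)   # -> audio features, e.g. mel frames (assumed)

    def forward(self, token_ids, predicted_size: int):
        first = self.input_encoder(token_ids)                       # (B, T, dim)
        # The local attention encoder processes only a subset of the first
        # embedding, whose length comes from the predicted size.
        window = first[:, :predicted_size, :]
        second, _ = self.local_attention(window, window, window)
        third = self.attention_encoder(second)
        return self.decoder(third)
```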
Generating videos with a character indicating a region of an image
Methods, systems, and computer-readable media for generating videos with characters indicating regions of images are provided. For example, an image containing a first region may be received. At least one characteristic of a character may be obtained. A script containing a first segment of the script may be received. The first segment of the script may be related to the first region of the image. The at least one characteristic of a character and the script may be used to generate a video of the character presenting the script and at least part of the image, where the character visually indicates the first region of the image while presenting the first segment of the script.
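One possible reading of that flow, with hypothetical placeholder components standing in for the character rendering and animation steps the abstract leaves unspecified:

```python
# Sketch of the described generation flow; all names are illustrative
# placeholders, not components defined in the patent.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ScriptSegment:
    text: str
    related_region: Optional[Tuple[int, int, int, int]] = None  # (x, y, w, h) or None

def generate_presentation_video(image, character_traits, segments):
    frames = []
    for segment in segments:
        # While presenting a segment tied to a region of the image, the
        # character also visually indicates that region (e.g. points at it).
        frames += render_character(character_traits, segment.text,
                                   image, indicate=segment.related_region)
    return frames

def render_character(traits, text, image, indicate=None):
    ...  # placeholder for the character animation / rendering step
    return []
```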