G10L13/07

Text-to-speech synthesis system and method

A method, computer program product, and computer system for text-to-speech synthesis are disclosed. Synthetic speech data for an input text may be generated. The synthetic speech data may be compared to recorded reference speech data corresponding to the input text. Based, at least in part, on the comparison of the synthetic speech data to the recorded reference speech data, at least one feature indicative of at least one difference between the synthetic speech data and the recorded reference speech data may be extracted. A speech gap filling model may be generated based, at least in part, on the at least one feature extracted. A speech output may be generated based, at least in part, on the speech gap filling model.
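The compare-extract-correct pipeline described above can be sketched as follows. The frame representation, the feature choice (per-band mean difference), and the additive "gap filling" correction are illustrative assumptions, not the patented method.

```python
def extract_difference_features(synthetic, reference):
    """Per-band mean difference between synthetic and reference frames."""
    n_bands = len(synthetic[0])
    diffs = [0.0] * n_bands
    for syn_frame, ref_frame in zip(synthetic, reference):
        for b in range(n_bands):
            diffs[b] += ref_frame[b] - syn_frame[b]
    n = min(len(synthetic), len(reference))
    return [d / n for d in diffs]

def build_gap_filling_model(diff_features):
    """A trivial 'model': an additive per-band correction toward the reference."""
    def apply(frames):
        return [[v + c for v, c in zip(frame, diff_features)] for frame in frames]
    return apply

# Usage: synthetic vs. reference spectra (2 frames, 3 bands each).
synthetic = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
reference = [[1.5, 2.0, 2.5], [1.5, 2.0, 2.5]]
model = build_gap_filling_model(extract_difference_features(synthetic, reference))
corrected = model(synthetic)  # each frame nudged toward the reference
```

A real system would extract richer features (spectral, prosodic) and train a statistical model rather than a fixed additive offset; the sketch only shows the data flow.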

Speech synthesis unit selection
11393450 · 2022-07-19

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for selecting units for speech synthesis. One of the methods includes determining a sequence of text units that each represent a respective portion of text for speech synthesis; and determining multiple paths of speech units that each represent the sequence of text units by selecting a first speech unit that includes speech synthesis data representing a first text unit; selecting multiple second speech units including speech synthesis data representing a second text unit based on (i) a join cost to concatenate the second speech unit with the first speech unit and (ii) a target cost indicating a degree to which the second speech unit corresponds to the second text unit; and defining paths from the selected first speech unit to each of the multiple second speech units to include in the multiple paths of speech units.
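The path construction above can be sketched as a Viterbi-style lattice search, scoring each candidate by target cost (unit-to-text fit) plus join cost (concatenation smoothness). The cost functions and the pitch-only unit representation below are toy stand-ins.

```python
def target_cost(unit, text_unit):
    # Toy fit score: distance between a unit's pitch and the text unit's target.
    return abs(unit["pitch"] - text_unit["target_pitch"])

def join_cost(prev_unit, unit):
    # Toy smoothness score: pitch discontinuity at the concatenation point.
    return abs(prev_unit["pitch"] - unit["pitch"])

def best_path(text_units, candidates):
    """Keep the cheapest path ending in each candidate, step by step."""
    paths = [([u], target_cost(u, text_units[0])) for u in candidates[0]]
    for text_unit, cands in zip(text_units[1:], candidates[1:]):
        paths = [
            min(
                ((path + [u], cost + join_cost(path[-1], u) + target_cost(u, text_unit))
                 for path, cost in paths),
                key=lambda p: p[1],
            )
            for u in cands
        ]
    return min(paths, key=lambda p: p[1])

# Two text units, two candidate speech units each.
text_units = [{"target_pitch": 100}, {"target_pitch": 110}]
candidates = [
    [{"pitch": 98}, {"pitch": 120}],
    [{"pitch": 111}, {"pitch": 90}],
]
path, cost = best_path(text_units, candidates)
```

Real unit-selection systems use multidimensional costs (spectral, prosodic, contextual) and prune the lattice, but the selection structure is the same.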

Speech synthesis method, speech synthesis device, and electronic apparatus

A speech synthesis method, a speech synthesis device, and an electronic apparatus are provided, which relate to the field of speech synthesis. A specific implementation solution is as follows: inputting text information into an encoder of an acoustic model, to output a text feature of a current time step; splicing the text feature of the current time step with a spectral feature of a previous time step to obtain a spliced feature of the current time step, and inputting the spliced feature of the current time step into a decoder of the acoustic model to obtain a spectral feature of the current time step; and inputting the spectral feature of the current time step into a neural network vocoder, to output speech.
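The autoregressive loop above can be sketched as follows: at each time step the encoder's text feature is spliced (concatenated) with the previous step's spectral feature and fed to the decoder. The encoder and decoder here are toy stand-in functions, not a real acoustic model.

```python
def encoder(text, step):
    # Toy text feature: one value derived from the character at this step.
    return [float(ord(text[step % len(text)]))]

def decoder(spliced):
    # Toy decoder: average of the spliced vector as a 1-band "spectrum".
    return [sum(spliced) / len(spliced)]

def synthesize_spectra(text, n_steps):
    spectra = []
    prev_spec = [0.0]  # zero frame bootstraps the first step
    for step in range(n_steps):
        text_feat = encoder(text, step)
        spliced = text_feat + prev_spec      # concatenation = "splicing"
        spec = decoder(spliced)
        spectra.append(spec)
        prev_spec = spec                     # feeds the next time step
    return spectra  # would then be passed to a neural vocoder

spectra = synthesize_spectra("hi", 3)
```

The feedback of each step's spectral output into the next step's input is the structural point; everything else here is placeholder arithmetic.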

Text-to-speech (TTS) processing with transfer of vocal characteristics

Audio data from a first, source speaker is received and processed to determine linguistic units and vocal characteristics corresponding to those linguistic units. The linguistic units may be determined either from received text data or from the audio data using automatic speech recognition. A model is trained using training data from a second, target speaker. The trained model concatenates the linguistic units with the vocal characteristics to produce output speech that has the “voice” of the target speaker and the vocal characteristics of the source speaker.
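The transfer above can be sketched as combining the source utterance's linguistic units and vocal characteristics with the target speaker's learned inventory. The lookup-table "trained model" and the pitch/duration characteristics below are illustrative assumptions.

```python
# Source analysis: linguistic units plus their vocal characteristics.
source = [
    {"unit": "HH", "pitch": 120, "duration": 80},
    {"unit": "AY", "pitch": 140, "duration": 150},
]

# Stand-in for the trained target-speaker model: a lookup from a
# linguistic unit to the target speaker's audio for that unit.
target_voice = {"HH": "<tgt:HH>", "AY": "<tgt:AY>"}

def transfer(source_units, target_voice):
    out = []
    for seg in source_units:
        out.append({
            "audio": target_voice[seg["unit"]],   # target speaker's "voice"
            "pitch": seg["pitch"],                # source's characteristics
            "duration": seg["duration"],          # carried over unchanged
        })
    return out

converted = transfer(source, target_voice)
```

The output keeps the source's prosody (pitch, duration) while drawing the audio itself from the target speaker, which is the split the abstract describes.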

AUDIO PROCESSING METHOD, APPARATUS, AND DEVICE, AND STORAGE MEDIUM
20220262339 · 2022-08-18

This application relates to an audio processing method, an electronic device, and a storage medium. The method includes: displaying a target audio clip and corresponding target text information having a mapping relationship between a location of an audio segment in the target audio clip and a location of text information in the corresponding target text information; receiving a selection of a location in the corresponding target text information as a to-be-processed text location; matching a to-be-processed audio location of an audio segment that has the mapping relationship with the to-be-processed text location; processing the target audio clip at the to-be-processed audio location to generate an updated target audio clip, and updating the corresponding target text information at the to-be-processed text location to generate updated target text information; and displaying the updated target audio clip and the updated target text information.
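The text-location-to-audio-location mapping above can be sketched as follows. The segment boundaries and the "processing" step (here, simply cutting the mapped span) are illustrative assumptions.

```python
# Mapping: text span (char start, end) -> audio span (sample start, end).
mapping = [
    ((0, 5), (0, 400)),      # "Hello"
    ((6, 11), (450, 900)),   # "world"
]

def audio_span_for(text_pos):
    """Find the audio segment mapped to a selected text location."""
    for (t0, t1), audio_span in mapping:
        if t0 <= text_pos < t1:
            return audio_span
    return None

def delete_segment(audio, span):
    """'Process' the target audio by cutting the mapped span (sample list)."""
    start, end = span
    return audio[:start] + audio[end:]

span = audio_span_for(7)  # user selects a character inside "world"
```

After the cut, the corresponding text span would be removed and the remaining mapping entries shifted, keeping audio and text in sync as the abstract requires.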

System and method for distributed voice models across cloud and device for embedded text-to-speech

Systems, methods, and computer-readable storage media for intelligent caching of concatenative speech units for use in speech synthesis. A system configured to practice the method can identify speech units that are required for synthesizing speech. The system can request from a server the text-to-speech units needed to synthesize the speech. The system can then synthesize speech using the text-to-speech units already stored and the text-to-speech units received from the server.
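The cache-or-fetch flow above can be sketched as follows: synthesize from units already on the device, requesting only the missing units from the server. The server stub and the string-based unit representation are illustrative.

```python
def fetch_from_server(unit_id):
    # Stand-in for a network request to the cloud voice model.
    return f"<audio:{unit_id}>"

def synthesize(unit_ids, cache):
    """Return concatenated speech and the list of units that had to be fetched."""
    missing = [u for u in unit_ids if u not in cache]
    for u in missing:                 # request only what the device lacks
        cache[u] = fetch_from_server(u)
    return "".join(cache[u] for u in unit_ids), missing

# Device cache already holds two of the three needed diphone units.
cache = {"h-e": "<audio:h-e>", "e-l": "<audio:e-l>"}
speech, fetched = synthesize(["h-e", "e-l", "l-o"], cache)
```

Because fetched units stay in the cache, repeated phrases cost at most one round trip per unit, which is the point of distributing the voice model across cloud and device.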

Voice synthesis apparatus and voice synthesis method utilizing diphones or triphones and machine learning

A voice synthesis method includes: sequentially acquiring voice units each comprising at least one of a diphone or a triphone in accordance with synthesis information for synthesizing voices; generating statistical spectral envelopes using a statistical model built by machine learning in accordance with the synthesis information for synthesizing the voices; and concatenating the sequentially acquired voice units and modifying a frequency spectral envelope of each voice unit in accordance with the generated statistical spectral envelope, thereby synthesizing a voice signal based on the concatenated voice units having the modified frequency spectra.
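The concatenate-then-reshape step above can be sketched as follows: acquired units are joined, and each frame's spectral envelope is pulled toward the statistical envelope. The frame format and the simple per-band interpolation are illustrative assumptions, not the patented modification.

```python
def concatenate(units):
    """Join the frames of sequentially acquired voice units."""
    frames = []
    for unit in units:
        frames.extend(unit)
    return frames

def apply_statistical_envelope(frames, stat_envelope, amount=1.0):
    """Pull each band toward the statistical envelope by `amount` (0..1)."""
    out = []
    for frame in frames:
        out.append([
            v + amount * (target - v)
            for v, target in zip(frame, stat_envelope)
        ])
    return out

# Two diphone-like units of one 3-band spectral frame each.
units = [[[1.0, 2.0, 3.0]], [[3.0, 2.0, 1.0]]]
stat_env = [2.0, 2.0, 2.0]
voiced = apply_statistical_envelope(concatenate(units), stat_env, amount=0.5)
```

Blending toward a statistically modeled envelope rather than replacing it outright lets the concatenated units keep some of their recorded detail while smoothing unit-to-unit mismatches.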