Patent classifications
G10L13/047
Synthetic speech processing
A speech-processing system receives input data representing text. An input encoder processes the input data to determine first embedding data representing the text. A local attention encoder processes a subset of the first embedding data in accordance with a predicted size to determine second embedding data. An attention encoder processes the second embedding data to determine third embedding data. A decoder processes the third embedding data to determine audio data corresponding to the text.
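The staged pipeline the abstract describes (input encoder, windowed local attention, global attention, decoder) can be sketched as follows. This is a minimal illustration, not the patented architecture: the fixed `window` parameter stands in for the predicted size, learned weights are replaced by random matrices, and all shapes and names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Scaled dot-product self-attention (projection weights omitted for brevity).
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def synthesize(token_ids, vocab_size=64, dim=8, window=4, seed=0):
    rng = np.random.default_rng(seed)
    embed = rng.standard_normal((vocab_size, dim))
    first = embed[token_ids]                       # input encoder: first embedding data
    # Local attention encoder: attend only within subsets of the chosen window size.
    second = np.concatenate([self_attention(first[i:i + window])
                             for i in range(0, len(first), window)])
    third = self_attention(second)                 # attention encoder: third embedding data
    # Decoder stand-in: project each frame to one audio sample.
    w_out = rng.standard_normal(dim)
    return third @ w_out                           # audio data corresponding to the text

audio = synthesize(np.array([3, 1, 4, 1, 5, 9, 2, 6]))
```

Restricting the first attention pass to windows keeps its cost linear in sequence length; the global pass then mixes information across windows.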
Digital audio method for creating and sharing audio books using a combination of virtual voices and recorded voices, customization based on characters, serialized content, voice emotions, and audio assembler module
A method includes receiving a text file of an author's book as input to a serialization process that creates a record of each paragraph of text, and creating a character file with the character attributes and information required for the recording and/or virtualization process. The method includes combining the serialized file with the character file to create a snippet file, assigning characters to snippets, and generating audio files from the snippets using text-to-speech APIs. Each snippet of text is assigned to a character and can be edited and its audio played back. The method includes sharing snippets with narrators to record specific characters not represented by text-to-speech synthesized audio, and concatenating all audio files from the snippets, with proper time spacing, into a publishable audiobook format. Audio files are created through links to text-to-speech API processes, while snippets assigned to a human narrator are shared with that narrator for recording.
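The serialize-combine-assemble flow above can be sketched as follows. Everything here is illustrative: the paragraph splitting rule, the pause marker, and the `synthesize` callback are assumptions standing in for the patent's serialization process and text-to-speech API links.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Character:
    name: str
    voice: Optional[str] = None   # TTS voice id; None means a human narrator records it

@dataclass
class Snippet:
    index: int
    text: str
    character: Character

def serialize(book_text: str) -> list:
    # Serialization process: one record per paragraph of text.
    return [p.strip() for p in book_text.split("\n\n") if p.strip()]

def build_snippets(paragraphs, assignments, characters):
    # Combine the serialized records with the character file into a snippet file.
    return [Snippet(i, p, characters[assignments[i]])
            for i, p in enumerate(paragraphs)]

def assemble(snippets, synthesize, pause="<pause 500ms>"):
    # Synthesize TTS snippets; route narrator-voiced snippets to a recording queue.
    audio, to_record = [], []
    for s in snippets:
        if s.character.voice is None:
            to_record.append(s)
        else:
            audio.append(synthesize(s.text, s.character.voice))
        audio.append(pause)       # proper time spacing between concatenated snippets
    return audio, to_record
```

A real assembler would splice the narrator recordings back in before export; the sketch only collects them.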
DYNAMIC TEMPERED SAMPLING IN GENERATIVE MODELS INFERENCE
A method of sampling output audio samples includes, during a packet loss concealment event, obtaining a sequence of previous output audio samples. At each time step during the event, the method includes generating a probability distribution over possible output audio samples for the time step, each possible sample having a respective probability indicating the likelihood that it represents a portion of an utterance at the time step. The method also includes determining a temperature sampling value based on a function of the number of time steps preceding the time step and of initial, minimum, and maximum temperature sampling values. The method also includes applying the temperature sampling value to the probability distribution to adjust the probability of selecting the possible samples, and randomly selecting one of the possible samples based on the adjusted probabilities. The method also includes generating synthesized speech using the randomly selected sample.
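The tempered-sampling step can be sketched as below. The abstract only states that the temperature is a function of the elapsed step count and of initial, minimum, and maximum values, so the linear-growth schedule and its `rate` parameter here are assumptions; the logit scaling itself is standard temperature sampling.

```python
import numpy as np

def temperature_at(step, t_init, t_min, t_max, rate=0.1):
    # Assumed schedule: grow from the initial value with the number of elapsed
    # time steps, clamped to [t_min, t_max].
    return float(np.clip(t_init + rate * step, t_min, t_max))

def tempered_sample(logits, temperature, rng):
    # Dividing logits by the temperature flattens (T > 1) or sharpens (T < 1)
    # the distribution before the random draw.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, 0.1, -1.0])
samples = [tempered_sample(logits, temperature_at(t, 1.0, 0.5, 2.0), rng)
           for t in range(16)]
```

Raising the temperature as a concealment event drags on injects more randomness, which avoids the buzzy artifacts that repeatedly picking the argmax sample tends to produce.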
PAUSE ESTIMATION MODEL LEARNING APPARATUS, PAUSE ESTIMATION APPARATUS, METHODS AND PROGRAMS FOR THE SAME
A pause estimation model learning apparatus includes: a morphological analysis unit configured to perform morphological analysis on training text data to provide M types of information, M being an integer equal to or larger than 2; a feature selection unit configured to combine N pieces of information, among the M pieces of information, into an input feature when a predetermined condition is satisfied, and to select a predetermined one of the N pieces of information as the input feature when the condition is not satisfied, N being an integer equal to or larger than 2 and equal to or smaller than M; and a learning unit configured to learn a pause estimation model by using the input feature selected by the feature selection unit and a correct pause label.
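The feature selection unit's conditional behavior can be sketched as follows. The abstract does not say which N pieces are combined or what the condition is, so the surface-form/part-of-speech pair, the joining convention, and the example condition are all assumptions.

```python
def select_input_feature(features, condition, default_key):
    # features: the M types of morphological information for one token,
    # e.g. {"surface": ..., "pos": ..., "reading": ...}.
    if condition(features):
        # Condition satisfied: combine N pieces (here N = 2) into one feature.
        n_keys = ["surface", "pos"]
        return "+".join(features[k] for k in n_keys)
    # Condition not satisfied: fall back to the single predetermined piece.
    return features[default_key]

token = {"surface": "walk", "pos": "VERB", "reading": "wawk"}
combined = select_input_feature(token, lambda f: f["pos"] == "VERB", "surface")
```

The selected feature would then be paired with a correct pause label to train the pause estimation model.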
VOICE COMMUNICATION BETWEEN A SPEAKER AND A RECIPIENT OVER A COMMUNICATION NETWORK
Voice communication, between a speaker and a recipient, either or both of which may be in a motor vehicle, is provided via a communication network. In a first step, an input speech utterance is received from the speaker. Optionally, a bandwidth of a connection to the communication network is evaluated at the side of the speaker. The input speech utterance is then converted to text. At least the text is transmitted over the communication network. In case of a sufficiently large bandwidth, the input speech utterance may be transmitted as voice and as text. The transmitted text is converted into an output speech utterance that simulates a voice of the speaker. Finally, the output speech utterance is provided to the recipient.
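The send path can be sketched as a bandwidth-adaptive choice between text-only and voice-plus-text transmission. The speech recognizer, synthesizer, and the 64 kbps threshold are placeholders, not values from the patent; in a real system the receiver would run the synthesis step, which is folded into one function here for brevity.

```python
def transmit(utterance_audio, bandwidth_kbps, asr, tts, threshold_kbps=64.0):
    # Speaker side: always convert the input speech utterance to text; attach
    # the raw voice only when the measured bandwidth is sufficiently large.
    text = asr(utterance_audio)
    payload = {"text": text}
    if bandwidth_kbps >= threshold_kbps:
        payload["voice"] = utterance_audio
    # Recipient side: play the voice if it arrived, otherwise resynthesize the
    # text in a voice that simulates the speaker.
    return payload.get("voice") or tts(payload["text"], speaker="caller")
```

Because text is orders of magnitude smaller than audio, the text-only path keeps the call intelligible even over a badly degraded in-vehicle connection.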