Patent classifications
G10L13/047
SYSTEMS AND METHODS FOR PROVIDING AUDIBLE FLIGHT INFORMATION
Disclosed are methods and systems for providing audible flight information to an operator of an aircraft. A method, for example, may include receiving flight information detected by one or more sensors positioned on the aircraft, causing an image to be displayed on a display device, the image including a plurality of text items corresponding to the flight information, receiving a first operator selection indicative of one or more of the text items, parsing the one or more text items to generate a set of intermediate data, synthesizing audio data based on the intermediate data, and causing audible content corresponding to the audio data to be emitted by one or more audio emitting devices, wherein the audible content includes speech corresponding to the flight information.
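The claimed pipeline (display text items, accept a selection, parse, synthesize, emit) can be sketched as follows. This is a minimal illustration with hypothetical names throughout; the patent does not specify a TTS engine or audio device, so synthesis and emission are stubbed.

```python
# Hypothetical sketch of the claimed flow; synthesis is stubbed because the
# abstract names no particular TTS engine.

def parse_text_items(items):
    """Parse selected text items into intermediate data (token lists)."""
    return [item.replace(":", " ").split() for item in items]

def synthesize_audio(intermediate):
    """Stand-in for a TTS engine: produce the speech string to be emitted."""
    return " ".join(" ".join(tokens) for tokens in intermediate)

def audible_flight_info(flight_info, selection):
    # 1. Render sensor-derived flight information as displayed text items.
    text_items = [f"{k}: {v}" for k, v in flight_info.items()]
    # 2. Apply the operator selection to pick one or more items.
    selected = [text_items[i] for i in selection]
    # 3. Parse the selected items into intermediate data.
    intermediate = parse_text_items(selected)
    # 4. Synthesize audio content from the intermediate data.
    return synthesize_audio(intermediate)

info = {"altitude": "31000 ft", "airspeed": "450 kt"}
spoken = audible_flight_info(info, [0])  # -> "altitude 31000 ft"
```

The two-stage split (parse to intermediate data, then synthesize) mirrors the claim language; a real system would replace `synthesize_audio` with an actual speech synthesizer and route its output to the aircraft's audio emitting devices.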
Synthetic speech processing
A speech-processing system receives input data representing text. A first encoder processes segments of the text to determine embedding data representing the text, and a second encoder processes corresponding audio data to determine prosodic data corresponding to the text. The embedding and prosodic data are processed to create output data including a representation of speech corresponding to the text and prosody.
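The two-encoder scheme can be sketched as below. Everything here is an illustrative assumption: the abstract states only that the embedding and prosodic data are processed together, so the toy feature functions and the concatenate-per-segment combination stand in for real encoders and a real decoder.

```python
# Illustrative two-encoder sketch; feature functions and the combination
# step are assumptions, not the patented architecture.

def text_encoder(text, dim=4):
    """Toy embedding: per-segment character-sum features."""
    return [[(sum(map(ord, seg)) % (i + 7)) / 7.0 for i in range(dim)]
            for seg in text.split()]

def audio_encoder(frames):
    """Toy prosodic features: mean level and range per audio frame."""
    return [[sum(f) / len(f), max(f) - min(f)] for f in frames]

def combine(embeddings, prosody):
    """Join embedding and prosodic vectors per segment for decoding."""
    return [e + p for e, p in zip(embeddings, prosody)]

emb = text_encoder("hello world")            # first encoder: text segments
pro = audio_encoder([[0.1, 0.3], [0.2, 0.6]])  # second encoder: audio
out = combine(emb, pro)  # joint representation of text and prosody
```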
Multilingual speech synthesis and cross-language voice cloning
A method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker. The target speaker includes a native speaker of a second language different from the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text sequence by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.
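The speaker-conditioning step can be sketched as below. Concatenating the speaker embedding onto each text-encoder state is one common conditioning choice, not necessarily the patented one, and all names are hypothetical; the point is that the output features carry the target speaker's voice characteristics even when the text is in a language the speaker does not natively speak.

```python
# Hypothetical sketch: speaker-embedding conditioning by concatenation.

def encode_text(text_seq):
    """Toy per-character encoder states."""
    return [[float(ord(c) % 10)] for c in text_seq]

def tts_with_speaker(text_seq, speaker_embedding):
    """Condition each encoder state on the target speaker's embedding."""
    states = encode_text(text_seq)
    # Appending the embedding lets decoded audio features reflect the
    # target speaker's voice, independent of the input text's language.
    return [s + speaker_embedding for s in states]

# Text in a first language; embedding derived from a native speaker of a
# second language (cross-language voice cloning).
features = tts_with_speaker("hola", [0.2, 0.9])
```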
Text to Speech Processing Method, Terminal, and Server
A text to speech processing method implemented by a terminal includes detecting an instruction to perform a text to speech conversion, sending text to a server, downloading, from the server, audio data based on the text, determining whether a first frame of playable audio data is downloaded within a preset duration, and continuing to download the remaining audio data when the first frame is downloaded within the preset duration.
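The client-side logic can be sketched as below. The download calls are stubbed with caller-supplied functions, and the "preset duration" check is modeled with a wall-clock deadline; all names are hypothetical.

```python
# Sketch of the terminal-side flow; network I/O is stubbed.

import time

def fetch_audio(text, download_first_frame, download_rest,
                preset_duration=2.0):
    """Download TTS audio, aborting if the first playable frame is late."""
    start = time.monotonic()
    first = download_first_frame(text)  # server synthesizes from the text
    if time.monotonic() - start > preset_duration:
        return None  # first frame missed the preset duration: stop
    # First frame arrived in time: continue downloading remaining audio.
    return [first] + download_rest(text)

frames = fetch_audio("hello",
                     download_first_frame=lambda t: b"frame0",
                     download_rest=lambda t: [b"frame1", b"frame2"])
```

The early check lets the terminal fall back (for example, to local synthesis) quickly instead of waiting on a slow connection for the full audio stream.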
Training Speech Synthesis to Generate Distinct Speech Sounds
A method (800) of training a text-to-speech (TTS) model (108) includes obtaining training data (150) including reference input text (104) that includes a sequence of characters, a sequence of reference audio features (402) representative of the sequence of characters, and a sequence of reference phone labels (502) representative of distinct speech sounds of the reference audio features. For each of a plurality of time steps, the method includes generating a corresponding predicted audio feature (120) based on a respective portion of the reference input text for the time step and generating, using a phone label mapping network (510), a corresponding predicted phone label (520) associated with the predicted audio feature. The method also includes aligning the predicted phone label with the reference phone label to determine a corresponding predicted phone label loss (622) and updating the TTS model based on the corresponding predicted phone label loss.
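The phone-label supervision can be sketched as below. The abstract does not specify the alignment procedure or the loss form, so this uses a position-wise comparison as the "alignment" and a mismatch count as the phone-label loss; the table-lookup mapping network and all names are illustrative stand-ins.

```python
# Hedged sketch of phone-label supervision; alignment and loss are
# simplified assumptions, not the patented formulation.

def predicted_phone_labels(input_text, mapping):
    """Phone-label mapping network stand-in: table lookup per character."""
    return [mapping.get(c, "sil") for c in input_text]

def phone_label_loss(predicted, reference):
    """Position-wise mismatch count between predicted and reference
    phone labels; the model would be updated to reduce this loss."""
    return sum(p != r for p, r in zip(predicted, reference))

mapping = {"c": "k", "a": "ae", "t": "t"}
pred = predicted_phone_labels("cat", mapping)    # ["k", "ae", "t"]
loss = phone_label_loss(pred, ["k", "ae", "d"])  # one mismatch -> 1
```

The idea carried over from the claim is that each predicted audio feature gets a predicted phone label, and disagreement with the reference phone labels contributes a training signal that pushes the TTS model toward distinct speech sounds.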