Patent classifications
G10L21/043
SPEECH SYNTHESIS METHOD AND APPARATUS, AND READABLE STORAGE MEDIUM
A speech synthesis method includes: converting a text input sequence into a text feature representation sequence; inputting the text feature representation sequence into an encoder including N encoding layers; the N encoding layers including an encoding layer E_i and an encoding layer E_{i+1}; the encoding layer E_{i+1} including a first multi-head self-attention network; acquiring a first attention matrix and a historical text encoded sequence outputted by the encoding layer E_i, and generating a second attention matrix of the encoding layer E_{i+1} according to residual connection between the first attention matrix and the first multi-head self-attention network and the historical text encoded sequence; and generating a target text encoded sequence of the encoding layer E_{i+1} according to the second attention matrix and the historical text encoded sequence, and generating synthesized speech data matched with the text input sequence based on the target text encoded sequence.
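The residual connection between attention matrices in consecutive encoding layers can be illustrated with a minimal sketch. It is not the patented implementation: it assumes a single head, identity query/key/value projections, and that the residual is added to the attention scores before the softmax, none of which the abstract specifies. The names `residual_attention_layer`, `prev_scores`, and `hidden` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(matrix):
    """Apply a numerically stable softmax to every row."""
    result = []
    for row in matrix:
        peak = max(row)
        exps = [math.exp(value - peak) for value in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def residual_attention_layer(prev_scores, hidden):
    """Sketch of layer E_{i+1}.

    prev_scores: attention scores from layer E_i (the "first attention matrix").
    hidden: the historical text encoded sequence output by layer E_i.
    Assumption: Q = K = V = hidden (no learned projections, one head).
    """
    dim = len(hidden[0])
    keys_t = [list(col) for col in zip(*hidden)]
    raw = matmul(hidden, keys_t)
    # Residual connection: add the previous layer's scores to this layer's scores.
    scores = [
        [raw[i][j] / math.sqrt(dim) + prev_scores[i][j] for j in range(len(hidden))]
        for i in range(len(hidden))
    ]
    encoded = matmul(softmax_rows(scores), hidden)
    return scores, encoded  # second attention matrix, target encoded sequence


# Three tokens with two features each; the first layer has no previous scores.
hidden0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
zeros = [[0.0] * 3 for _ in range(3)]
scores1, hidden1 = residual_attention_layer(zeros, hidden0)
scores2, hidden2 = residual_attention_layer(scores1, hidden1)
```

Carrying the scores forward lets each layer refine, rather than recompute, the attention pattern of the layer below it.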
SYSTEMS AND METHODS FOR PROVIDING AUDIO-FILE LOOP-PLAYBACK FUNCTIONALITY
Systems and methods for providing audio-file loop-playback functionality are provided. The system includes a processor that performs a method including setting a playback loop start-point based on a first selection of a button; setting a loop end-point, associating a loop with an audio file, and entering into the loop based on a second selection of the button; and exiting the loop based on a third selection of the button. Associating the loop with the audio file includes adding metadata to the audio file. The metadata associates the loop with a button. The method includes reentering the loop based on a fourth selection of the button and exiting the loop based on a fifth selection of the button.
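The sequence of button selections described above behaves like a small state machine, which the following sketch makes concrete. It is an illustration under assumptions: the class `LoopButton`, the metadata key format, and the use of a plain dictionary as a stand-in for the audio file's metadata are all hypothetical and do not come from the abstract.

```python
class LoopButton:
    """One button that defines, enters, exits, and re-enters a loop."""

    def __init__(self, button_id, metadata):
        self.button_id = button_id
        self.metadata = metadata  # hypothetical stand-in for audio-file metadata
        self.start = None
        self.looping = False

    def _key(self):
        return "loop:" + self.button_id

    def press(self, position):
        """Handle one selection of the button at the given playback position."""
        if self._key() in self.metadata:
            # Third and later selections toggle between exiting and re-entering.
            self.looping = not self.looping
        elif self.start is None:
            # First selection: set the loop start-point.
            self.start = position
        else:
            # Second selection: set the end-point, store the loop in the
            # metadata under this button, and enter the loop.
            self.metadata[self._key()] = (self.start, position)
            self.looping = True
        return self.looping

    def next_position(self, position):
        """Return where playback should continue from the given position."""
        if self.looping:
            start, end = self.metadata[self._key()]
            if position >= end:
                return start
        return position


file_metadata = {}
button = LoopButton("A", file_metadata)
states = [button.press(p) for p in (10.0, 20.0, 25.0, 40.0, 45.0)]
```

Because the loop lives in the metadata rather than in the button object, a later session could rebuild the same button-to-loop association from the file alone.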
COMMUNICATION APPARATUS MOUNTED WITH SPEECH SPEED CONVERSION DEVICE
In a communication apparatus, an encoder compresses telephone call voice which is transmitted from another communication apparatus. A voice accumulator preserves the telephone call voice, which is compressed by the encoder, as a message. A decoder expands the telephone call voice which is preserved in the voice accumulator. A signal memory temporarily maintains the telephone call voice which is expanded by the decoder. A speech speed convertor performs speech speed conversion on the telephone call voice, which is read from the signal memory, and outputs resulting voice from a speaker. A memory monitor temporarily stops the decoder from expanding the telephone call voice when the memory monitor determines that an idle capacity of the signal memory approaches a predetermined lower limit value.
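The memory monitor's role amounts to backpressure on the decoder, which a short simulation can show. This is a sketch under assumptions: frames are plain integers, "decoding" is just moving a frame into the buffer, and the per-tick rates, the class `SignalMemory`, and the function `play` are hypothetical.

```python
from collections import deque


class SignalMemory:
    """Bounded buffer of decoded frames with a monitor on its idle capacity."""

    def __init__(self, capacity, lower_limit):
        self.capacity = capacity
        self.lower_limit = lower_limit
        self.frames = deque()

    def idle_capacity(self):
        return self.capacity - len(self.frames)

    def decoder_may_run(self):
        # Memory monitor: pause decoding once idle capacity reaches the limit.
        return self.idle_capacity() > self.lower_limit


def play(compressed_frames, memory, decode_per_tick=2, consume_per_tick=1):
    """Decode faster than the speed convertor consumes, as in slowed playback."""
    pending = deque(compressed_frames)
    played = []
    peak = 0
    while pending or memory.frames:
        for _ in range(decode_per_tick):
            if pending and memory.decoder_may_run():
                memory.frames.append(pending.popleft())  # stand-in for decoding
        peak = max(peak, len(memory.frames))
        for _ in range(consume_per_tick):
            if memory.frames:
                played.append(memory.frames.popleft())  # stand-in for output
    return played, peak


played, peak = play(range(20), SignalMemory(capacity=8, lower_limit=2))
```

With a capacity of 8 and a lower limit of 2, the buffer in this run never holds more than 6 frames, and every frame is still played in order.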
Fast playback in media files with reduced impact to speech quality
The present invention is a computer program product and method for increasing the playback speed of audio or other media files. The computer program product and method identify pedagogic media files and add a flag to the metadata of the media file. The flag represents the number and type of pauses or silent sections in the pedagogic media file. Based on the flag, the computer program product and method may fast forward or remove a portion of the pauses and silent sections to provide a new playback speed.
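Shortening pauses rather than speeding up speech can be sketched as follows. The example is illustrative only: it treats audio as a list of sample amplitudes, detects silence with a simple amplitude threshold, and uses a hypothetical flag layout; the abstract does not describe how pauses are detected or how the flag is encoded.

```python
def find_pauses(samples, threshold, min_len):
    """Return (start, end) index pairs of silent runs at least min_len long."""
    pauses = []
    start = None
    for index, sample in enumerate(samples):
        if abs(sample) < threshold:
            if start is None:
                start = index
        else:
            if start is not None and index - start >= min_len:
                pauses.append((start, index))
            start = None
    if start is not None and len(samples) - start >= min_len:
        pauses.append((start, len(samples)))
    return pauses


def shorten_pauses(samples, pauses, keep):
    """Keep only the first `keep` samples of each pause (assumes keep <= min_len)."""
    output = []
    cursor = 0
    for start, end in pauses:
        output.extend(samples[cursor:start + keep])
        cursor = end
    output.extend(samples[cursor:])
    return output


samples = [5, 5, 0, 0, 0, 0, 5, 0, 5, 0, 0, 0]
pauses = find_pauses(samples, threshold=1, min_len=3)
# Hypothetical metadata flag recording how many pauses were found.
metadata = {"pause_flag": {"count": len(pauses)}}
faster = shorten_pauses(samples, pauses, keep=1)
```

The single zero at index 7 is shorter than `min_len`, so it is left alone; only the two long silent runs are trimmed, and the speech samples themselves are untouched.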
Transcription of audio
A method may include obtaining first features of first audio data that includes speech and obtaining second features of second audio data that is a revoicing of the first audio data. The method may further include providing the first features and the second features to an automatic speech recognition system and obtaining a single transcription generated by the automatic speech recognition system using the first features and the second features.
AUDIO DRIVEN ACCELERATED BINGE WATCH
Example embodiments provide systems and methods for accelerating digital content playback based on speech. A content acceleration system electronically accesses digital content. The system analyzes the digital content to detect at least one audio portion within the digital content, each of the at least one audio portion comprising speech. The system creates at least one digital content segment from the digital content based on the at least one audio portion, whereby a beginning of each digital content segment of the at least one digital content segment coincides with a beginning of a corresponding audio portion of the at least one audio portion. The system then accelerates playback of the digital content by fast forwarding through parts of the at least one digital content segment where speech is absent.
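The segmentation and acceleration steps can be sketched with interval arithmetic. The sketch assumes speech detection has already produced sorted, non-overlapping (start, end) times in seconds, that each segment runs until the next speech portion begins, and that content before the first speech portion is ignored. The function names and the fast-forward rate are hypothetical.

```python
def build_segments(speech, duration):
    """Create one segment per speech portion, starting where the speech starts."""
    segments = []
    for index, (start, end) in enumerate(speech):
        if index + 1 < len(speech):
            segment_end = speech[index + 1][0]
        else:
            segment_end = duration
        segments.append({"start": start, "speech_end": end, "end": segment_end})
    return segments


def accelerated_duration(segments, fast_forward_rate):
    """Play speech at normal speed and fast forward the speech-free remainder."""
    total = 0.0
    for segment in segments:
        total += segment["speech_end"] - segment["start"]
        total += (segment["end"] - segment["speech_end"]) / fast_forward_rate
    return total


speech = [(0, 10), (30, 40)]
segments = build_segments(speech, duration=60)
seconds = accelerated_duration(segments, fast_forward_rate=4)
```

In this example, 60 seconds of content with 20 seconds of speech plays back in 30 seconds when the speech-free parts run at four times normal speed.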