Patent classifications
G10L13/08
SIMPLE AFFIRMATIVE RESPONSE OPERATING SYSTEM
A simple affirmative response operating system for selecting a data item from a list of options using a unique affirmative action. Text-based labels in a listing of content are converted to speech using an embedded text-to-speech engine and an audio output of a first converted label is provided. A listening state is entered for a predefined pause time to await receipt of the simple affirmative action. If the simple affirmative action is performed during the predefined pause time, an associated content item is selected for output. If the simple affirmative action is not performed during the predefined pause time, an audio output of a next converted label in the list is provided. This protocol may be used to control a variety of computing devices safely and efficiently while a user is distracted or otherwise unable to use traditional input methods.
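The selection protocol above reduces to a loop over spoken labels with a timed listening window. A minimal sketch, in which `speak`, `wait_for_affirmation`, and the label list are hypothetical stand-ins for the embedded text-to-speech engine and the input hardware:

```python
from typing import Callable, Optional, Sequence

def select_by_affirmation(
    labels: Sequence[str],
    speak: Callable[[str], None],
    wait_for_affirmation: Callable[[float], bool],
    pause_seconds: float = 2.0,
) -> Optional[int]:
    """Speak each label in turn; return the index of the label that
    received the simple affirmative action within the pause window,
    or None if the list is exhausted without an affirmation."""
    for index, label in enumerate(labels):
        speak(label)                             # audio output of the converted label
        if wait_for_affirmation(pause_seconds):  # listening state for the pause time
            return index                         # associated content item selected
    return None                                  # no affirmative action received

# Simulated run: the user stays silent for the first label and
# affirms the second.
spoken = []
answers = iter([False, True])
choice = select_by_affirmation(
    ["Weather", "News", "Music"],
    speak=spoken.append,
    wait_for_affirmation=lambda timeout: next(answers),
)
# choice == 1; spoken == ["Weather", "News"]
```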
DIALOGUE APPARATUS, METHOD AND PROGRAM
A dialogue apparatus includes a speech recognition unit (1) configured to perform speech recognition on an input utterance to generate a text corresponding to the utterance, a speech waveform corresponding to the utterance, and information regarding the length of sound of the utterance; a language understanding unit (2) configured to understand the content of the utterance by using the text corresponding to the utterance; a dialogue management unit (3) configured to determine the content of a response corresponding to the utterance by using the content of the utterance; an utterance state extraction unit (4) configured to extract a state of the utterance by using the text, the speech waveform, and the information regarding the length of the sound of the utterance; a response state determination unit (5) configured to determine a state of the response according to the state of the utterance; a response sentence generation unit (6) configured to generate a response sentence by using the content of the response; and a speech synthesis unit (7) configured to synthesize speech corresponding to the response sentence with the state of the response taken into account.
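The data flow through units (1)-(7) can be sketched as a simple pipeline. All of the functions below are illustrative stubs; a real apparatus would back each one with recognition, understanding, and synthesis models:

```python
def recognize(audio):                     # unit (1): speech recognition
    return {"text": "hello", "waveform": audio, "duration": len(audio)}

def understand(text):                     # unit (2): language understanding
    return {"intent": "greeting"} if "hello" in text else {"intent": "unknown"}

def manage_dialogue(content):             # unit (3): dialogue management
    return {"reply": "greet_back"} if content["intent"] == "greeting" else {"reply": "clarify"}

def extract_state(text, waveform, duration):  # unit (4): utterance state extraction
    return "calm" if duration < 100 else "excited"

def decide_response_state(state):         # unit (5): response state determination
    return {"calm": "neutral", "excited": "energetic"}[state]

def generate_sentence(response):          # unit (6): response sentence generation
    return "Hello there!" if response["reply"] == "greet_back" else "Could you repeat that?"

def synthesize(sentence, style):          # unit (7): style-aware speech synthesis
    return f"<{style}> {sentence}"

def respond(audio):
    rec = recognize(audio)
    content = understand(rec["text"])
    response = manage_dialogue(content)
    state = extract_state(rec["text"], rec["waveform"], rec["duration"])
    style = decide_response_state(state)
    sentence = generate_sentence(response)
    return synthesize(sentence, style)

reply = respond([0.0] * 10)
# reply == "<neutral> Hello there!"
```

Note how the utterance state (unit 4) influences only the synthesis style, while the utterance content (units 2-3) determines what is said; the two paths are independent until the final synthesis step.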
SPEECH SYNTHESIS METHOD, AND ELECTRONIC DEVICE
The disclosure provides a speech synthesis method, and an electronic device. The technical solution is described as follows. A text to be synthesized and speech features of a target user are obtained. Predicted first acoustic features based on the text to be synthesized and the speech features are obtained. A target template audio is obtained from a template audio library based on the text to be synthesized. Second acoustic features of the target template audio are extracted. Target acoustic features are generated by splicing the first acoustic features and the second acoustic features. Speech synthesis is performed on the text to be synthesized based on the target acoustic features and the speech features, to generate a target speech of the text to be synthesized.
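The splicing step above can be sketched with frame-vector lists standing in for acoustic features. The predictor, template library, and downstream vocoder here are hypothetical placeholders, not the disclosure's actual models:

```python
# Hypothetical template audio library, keyed by text, storing
# pre-extracted acoustic feature frames.
TEMPLATE_LIBRARY = {"good morning": [[0.5, 0.5], [0.6, 0.4]]}

def predict_first_features(text, speaker_features):
    # Stand-in for an acoustic model conditioned on the target
    # user's speech features: one predicted frame per word.
    return [[speaker_features[0], 0.1] for _ in text.split()]

def extract_second_features(template_audio):
    return template_audio  # frames are already stored extracted here

def splice(first, second):
    # Target acoustic features: predicted frames followed by
    # the template's frames.
    return first + second

def synthesize(text, speaker_features):
    first = predict_first_features(text, speaker_features)
    template = TEMPLATE_LIBRARY.get(text.lower(), [])
    second = extract_second_features(template)
    target = splice(first, second)
    return target  # a vocoder would render these frames as a waveform

frames = synthesize("Good morning", [0.9])
# 2 predicted frames + 2 template frames = 4 target frames
```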
Speech-processing system
A system may include first and second speech-processing systems with corresponding first and second wakewords. An utterance may contain two or more wakewords. The system determines which wakeword was spoken first and can send data to that wakeword's speech-processing system to perform further processing.
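The arbitration rule is simply "earliest wakeword wins." A minimal sketch, assuming each detection reports its start offset within the utterance (the detection tuples and the routing key are illustrative):

```python
def route_utterance(detections, audio):
    """detections: list of (wakeword, start_time_seconds) pairs
    found in a single utterance. Returns the wakeword that was
    spoken first, identifying the speech-processing system that
    should receive `audio` for further processing."""
    if not detections:
        return None
    first_word, _ = min(detections, key=lambda d: d[1])
    return first_word

# "assistant" starts at 0.3 s, before "computer" at 1.4 s,
# so its speech-processing system handles the utterance.
system = route_utterance([("computer", 1.4), ("assistant", 0.3)], audio=b"...")
# system == "assistant"
```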
Real time correction of accent in speech audio signals
Systems and methods for real-time correction of an accent in a speech audio signal are provided. A method includes: dividing the speech audio signal into a stream of input chunks, where an input chunk includes a pre-defined number of frames of the speech audio signal; extracting, by an acoustic features extraction module, acoustic features from the input chunk and a context associated with the input chunk, where the context is a pre-determined number of frames preceding the input chunk in the stream; extracting, by a linguistic features extraction module, linguistic features from the input chunk and the context; receiving a speaker embedding for a human speaker; providing the speaker embedding, the acoustic features, and the linguistic features to a synthesis module to generate a mel-spectrogram with a reduced accent; and providing the mel-spectrogram to a vocoder to generate an output chunk of an output audio signal.
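The chunk-and-context framing can be sketched as a generator over frames; `chunk_size` and `context_size` are illustrative parameters, not values from the claims:

```python
def chunk_with_context(frames, chunk_size, context_size):
    """Yield (context, chunk) pairs from a stream of frames:
    each chunk is up to `chunk_size` frames, and each context is
    the `context_size` frames immediately preceding that chunk."""
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        context = frames[max(0, start - context_size):start]
        yield context, chunk

frames = list(range(10))
pairs = list(chunk_with_context(frames, chunk_size=4, context_size=2))
# pairs[0] == ([], [0, 1, 2, 3])        -- first chunk has no context
# pairs[1] == ([2, 3], [4, 5, 6, 7])    -- context is the 2 preceding frames
```

The empty context on the first chunk reflects the streaming setting: no frames precede the start of the signal, so feature extraction there must cope with a shortened context.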
Artificial intelligence apparatus for converting text and speech in consideration of style and method for the same
According to an embodiment of the present invention, there is provided an artificial intelligence (AI) apparatus for converting between text and speech, including: a memory configured to store a plurality of Text-To-Speech (TTS) engines; and a processor configured to: obtain image data containing a text, determine a speech style corresponding to the text, generate a speech corresponding to the text by using a TTS engine corresponding to the determined speech style among the plurality of TTS engines, and output the generated speech.
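The engine-selection step can be sketched as a lookup from an inferred style to one of several engines. The style heuristic and the engines below are illustrative placeholders, not the patent's actual models:

```python
# Hypothetical pool of TTS engines, keyed by speech style.
TTS_ENGINES = {
    "formal": lambda text: f"[formal voice] {text}",
    "casual": lambda text: f"[casual voice] {text}",
}

def determine_style(text):
    # Toy stand-in for a style classifier over the extracted text.
    return "casual" if "!" in text else "formal"

def speak_from_image_text(text):
    style = determine_style(text)   # speech style corresponding to the text
    engine = TTS_ENGINES[style]     # select the matching TTS engine
    return engine(text)             # generate and output the speech

out = speak_from_image_text("Hello!")
# out == "[casual voice] Hello!"
```

In the described apparatus the text would first be obtained from image data (e.g. by OCR); that step is omitted here so the sketch stays focused on style-driven engine selection.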