Patent classifications
G10L13/027
TEXT-TO-SPEECH SYNTHESIS METHOD AND SYSTEM, AND A METHOD OF TRAINING A TEXT-TO-SPEECH SYNTHESIS SYSTEM
A text-to-speech synthesis method includes receiving text, inputting the received text into a synthesizer that includes a prediction network configured to convert the received text into speech data having a speech attribute that includes emotion, intention, projection, pace, and/or accent, and outputting said speech data. The prediction network is obtained by: obtaining a first sub-dataset and a second sub-dataset, where the first sub-dataset and the second sub-dataset each include audio samples and corresponding text, and the speech attribute of the audio samples of the second sub-dataset is more pronounced than the speech attribute of the audio samples of the first sub-dataset; training a first model using the first sub-dataset until a performance metric reaches a first predetermined value; training a second model by further training the first model using the second sub-dataset until the performance metric reaches a second predetermined value; and selecting one trained model as the prediction network.
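The two-stage training the abstract describes (train on the less-pronounced sub-dataset to a metric threshold, then fine-tune on the more-pronounced one) can be sketched as follows. This is a minimal illustrative sketch, not the patented system: the one-parameter least-squares "model", the loss threshold used as the performance metric, and both toy sub-datasets are assumptions chosen so the example runs without any ML framework.

```python
import copy

def train(model, dataset, target_loss, lr=0.1, max_steps=10_000):
    """Gradient descent on a one-parameter least-squares 'model' until the
    mean squared error (our stand-in performance metric) reaches target_loss."""
    for _ in range(max_steps):
        loss = sum((model["w"] * x - y) ** 2 for x, y in dataset) / len(dataset)
        if loss <= target_loss:
            break
        grad = sum(2 * (model["w"] * x - y) * x for x, y in dataset) / len(dataset)
        model["w"] -= lr * grad
    return model

# First sub-dataset: the speech attribute is weakly pronounced (targets ~ 1.0 * x).
neutral = [(x, 1.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
# Second sub-dataset: the same attribute is more pronounced (targets ~ 1.5 * x).
expressive = [(x, 1.5 * x) for x in (0.5, 1.0, 1.5, 2.0)]

# Stage 1: train the first model until the metric reaches the first value.
first_model = train({"w": 0.0}, neutral, target_loss=1e-4)
# Stage 2: further train a copy of the first model on the expressive data.
second_model = train(copy.deepcopy(first_model), expressive, target_loss=1e-4)
# One trained model is then selected as the prediction network.
prediction_network = second_model
```

The deep copy keeps the first model intact so either trained model remains available for selection, matching the "selecting one trained model" step.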
MANAGEMENT APPARATUS, MANAGEMENT SYSTEM, MANAGEMENT METHOD, AND RECORDING MEDIUM
A management apparatus includes: an extractor that extracts, on the basis of a predetermined condition, one or more utterance histories from a plurality of utterance histories each indicating the content of a voice caused to be output by a first device (e.g., utterance device); and a display controller that causes a second device (e.g., display device) to display information indicating the one or more utterance histories.
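The extractor/display-controller split described above can be sketched as two small functions: one filters utterance histories by a predetermined condition, the other formats the result for the display device. All names here (`UtteranceHistory`, `extract`, `render`) are hypothetical stand-ins, not identifiers from the patent.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class UtteranceHistory:
    device_id: str   # first device (utterance device) that output the voice
    content: str     # content of the voice that was output
    spoken_at: datetime

def extract(histories, condition):
    """Extractor: keep only histories matching the predetermined condition."""
    return [h for h in histories if condition(h)]

def render(histories):
    """Display controller: format histories for the second (display) device."""
    return [f"{h.spoken_at:%H:%M} {h.device_id}: {h.content}" for h in histories]

histories = [
    UtteranceHistory("dev-1", "Door open", datetime(2024, 1, 1, 9, 30)),
    UtteranceHistory("dev-2", "Low battery", datetime(2024, 1, 1, 23, 5)),
]
# Example predetermined condition: only afternoon/evening utterances.
shown = render(extract(histories, lambda h: h.spoken_at.hour >= 12))
```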
Real-time neural text-to-speech
Embodiments of a production-quality text-to-speech (TTS) system constructed from deep neural networks are described. System embodiments comprise five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For embodiments of the segmentation model, phoneme boundary detection was performed with deep neural networks using Connectionist Temporal Classification (CTC) loss. For embodiments of the audio synthesis model, a variant of WaveNet was created that requires fewer parameters and trains faster than the original. By using a neural network for each component, system embodiments are simpler and more flexible than traditional TTS systems, where each component requires laborious feature engineering and extensive domain expertise. Inference with system embodiments may be performed faster than real time.
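The building blocks above compose into an inference pipeline: grapheme-to-phoneme conversion, duration prediction, fundamental-frequency prediction, then audio synthesis (the segmentation model is used at training time to locate phoneme boundaries for labels). The sketch below shows only this composition; every stage is a trivial stub standing in for a trained neural network, and all function names are assumptions, not the system's API.

```python
def grapheme_to_phoneme(text):
    # Real system: a neural G2P model; stub: one pseudo-phoneme per letter.
    return [c.upper() for c in text if c.isalpha()]

def predict_durations(phonemes):
    # Real system: a duration prediction model; stub: fixed 80 ms per phoneme.
    return [(p, 0.08) for p in phonemes]

def predict_f0(timed_phonemes):
    # Real system: a fundamental-frequency model; stub: flat 120 Hz contour.
    return [(p, d, 120.0) for p, d in timed_phonemes]

def synthesize_audio(features, sample_rate=16_000):
    # Real system: a WaveNet-style vocoder; stub: silence of the right length.
    total_seconds = sum(d for _, d, _ in features)
    return [0.0] * int(total_seconds * sample_rate)

def tts(text):
    # The four inference stages composed end to end.
    return synthesize_audio(predict_f0(predict_durations(grapheme_to_phoneme(text))))

samples = tts("hello")  # 5 phonemes x 80 ms at 16 kHz
```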
VOICE INTERACTION METHOD AND ELECTRONIC DEVICE
Embodiments of this application provide a voice interaction method and an electronic device, and relate to the field of artificial intelligence (AI) technologies and the field of voice processing technologies. In a specific solution, an electronic device receives first voice information sent by a second user and recognizes the first voice information in response; the first voice information is used to request a voice conversation with a first user. On the basis that the electronic device recognizes the first voice information as voice information of the second user, it may conduct a voice conversation with the second user by imitating the voice of the first user, in the mode in which the first user would converse with the second user.
Multimedia processing method and electronic system
An electronic system is provided. The electronic system includes a host and a display. The host includes an audio processing module and a smart interpreter engine. The audio processing module acquires audio data corresponding to a first language from audio streams processed by an application program executed on the host; the application program includes specific game software. The smart interpreter engine receives the audio data corresponding to the first language from the audio processing module and converts it into text data corresponding to a second language according to the game software executed on the host. The display receives the text data corresponding to the second language from the smart interpreter engine and displays it.
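The flow this abstract describes (capture in-game audio in one language, convert it to text in another, hand the text to the display) can be sketched as a chain of stubs. Every component here is a hypothetical stand-in: the hard-coded transcript and the toy English-to-German glossary are assumptions used only to make the example self-contained.

```python
def capture_audio(stream):
    # Stand-in for the audio processing module tapping the app's audio stream.
    return stream["samples"]

def recognize(audio):
    # Stand-in for speech recognition of the first-language audio.
    return "hello team"  # hypothetical recognized text

def translate(text, target_lang):
    # Stand-in for translation keyed to the running game software;
    # a real engine might use a game-specific glossary like this toy one.
    glossary = {("hello team", "de"): "hallo Team"}
    return glossary.get((text, target_lang), text)

def display(text):
    # Stand-in for handing the second-language text to the display.
    return f"[display] {text}"

stream = {"samples": [0.0] * 1600}
shown = display(translate(recognize(capture_audio(stream)), "de"))
```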
Multi-Purpose Protective Face Mask
A protective face mask implemented with a pocket located on a front surface of the mask, and a removable amplifier unit configured to be placed into the pocket. The removable amplifier unit comprises: a micro-processor configured to process voice data; a rechargeable battery coupled to the micro-processor; a Bluetooth device coupled to the micro-processor; a microphone coupled to the micro-processor and configured to provide the voice data to the micro-processor; and a speaker unit configured to output the voice data processed by the micro-processor.