G10L13/0335

UNIT-SELECTION TEXT-TO-SPEECH SYNTHESIS BASED ON PREDICTED CONCATENATION PARAMETERS

Systems and processes for performing unit-selection text-to-speech synthesis are provided. In an example process, text to be converted to speech is received. The text is represented as a sequence of target units. A plurality of candidate speech segments corresponding to the sequence of target units are selected. Predicted statistical parameters of acoustic features associated with the sequence of target units are determined. The predicted statistical parameters of acoustic features are used to determine target costs and concatenation costs associated with the plurality of candidate speech segments. Based on a combined cost determined from the target costs and concatenation costs, a subset of candidate speech segments is selected from the plurality of candidate speech segments. Speech corresponding to the received text is generated using the subset of candidate speech segments.
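The cost computation described above can be illustrated with a minimal sketch: the target cost is taken as a variance-weighted distance from each candidate's acoustic features to the predicted statistical parameters, the concatenation cost as the feature mismatch at each join, and the lowest combined-cost path through the candidate lattice is found with a Viterbi search. The function names, diagonal-covariance form, and cost weights are assumptions for illustration, not the patent's implementation.

```python
import numpy as np

def target_cost(cand_feats, pred_mean, pred_var):
    # Variance-weighted distance of a candidate's acoustic features from
    # the statistics predicted for the target unit (diagonal covariance).
    return float(np.sum((cand_feats - pred_mean) ** 2 / pred_var))

def concat_cost(left_feats, right_feats):
    # Mismatch between the end of one candidate and the start of the next.
    return float(np.sum((left_feats - right_feats) ** 2))

def viterbi_select(candidates, pred_means, pred_vars, w_t=1.0, w_c=1.0):
    """candidates[i] is a list of feature vectors for target unit i."""
    best = [np.array([w_t * target_cost(c, pred_means[0], pred_vars[0])
                      for c in candidates[0]])]
    back = []
    for i in range(1, len(candidates)):
        tc = np.array([w_t * target_cost(c, pred_means[i], pred_vars[i])
                       for c in candidates[i]])
        # Combined cost of every (previous candidate -> current candidate) join.
        cc = np.array([[w_c * concat_cost(p, c) for c in candidates[i]]
                       for p in candidates[i - 1]])
        total = best[-1][:, None] + cc + tc[None, :]
        back.append(np.argmin(total, axis=0))
        best.append(np.min(total, axis=0))
    # Trace back the lowest combined-cost path through the lattice.
    path = [int(np.argmin(best[-1]))]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return list(reversed(path))

# Toy lattice: 3 target units, 2 candidate segments each, 4-dim features.
rng = np.random.default_rng(0)
cands = [[rng.normal(size=4) for _ in range(2)] for _ in range(3)]
means = [np.zeros(4)] * 3
vars_ = [np.ones(4)] * 3
print(viterbi_select(cands, means, vars_))
```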

Real-time speech-to-speech generation (RSSG) and sign language conversion apparatus, method and a system therefore
11501091 · 2022-11-15


A real-time speech-to-speech generator and sign-gesture converter system is disclosed. Communication is still challenging for deaf or hearing-impaired people. Embodiments of the invention provide a direct speech-to-speech translation system with further conversion to sign gestures. Direct speech-to-speech translation and subsequent sign-gesture conversion use a one-tier approach, creating a unified model for the whole application. The single-model ecosystem takes audio (a MEL spectrogram) as input and produces audio (a MEL spectrogram) as output for a speech-sign converter device with a display. This solves the bottleneck problem by converting the first-language speech directly to translated speech and then to sign-language gestures, preserving emotion and phonetic information along the way. The model requires parallel audio samples in the two languages. The training methodology involves augmenting or changing both sides of the audio equally and later converting to sign gestures, which are displayed on the speech-sign converter device.
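As a rough illustration of the mel-in/mel-out idea, the sketch below wires a toy sequence-to-sequence model that consumes source-language mel frames and emits target-language mel frames; the layer choices and sizes are assumptions, not the patent's unified-model architecture. With parallel audio in two languages, such a model would be trained with a frame-wise loss (e.g., MSE) against the target-language mel spectrogram.

```python
import torch
import torch.nn as nn

class MelToMel(nn.Module):
    """Toy mel-spectrogram-in / mel-spectrogram-out translation model."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.decoder = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)  # predict target-language mel frames

    def forward(self, mel):            # mel: (batch, frames, n_mels)
        enc, _ = self.encoder(mel)     # encode source-language audio
        dec, _ = self.decoder(enc)     # decode toward the target language
        return self.out(dec)           # (batch, frames, n_mels)

model = MelToMel()
src = torch.randn(1, 120, 80)          # 120 frames of a source-language mel
tgt_mel = model(src)
print(tgt_mel.shape)                   # torch.Size([1, 120, 80])
```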

Real-time speech to singing conversion

A method of converting a frame of a voice sample to a singing frame includes obtaining a pitch value of the frame; obtaining formant information of the frame using the pitch value; obtaining aperiodicity information of the frame using the pitch value; obtaining a tonic pitch and chord pitches; using the formant information, the aperiodicity information, the tonic pitch, and the chord pitches to obtain the singing frame; and outputting or saving the singing frame.
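One way to read this per-frame pipeline is through a WORLD-style vocoder, where the pitch value, formant information (spectral envelope), and aperiodicity information map onto the analysis outputs, and the singing frame is obtained by snapping the pitch to the nearest tonic or chord pitch before resynthesis. The sketch below uses the pyworld package as a stand-in backend; the snapping rule is one reading of the abstract, not the patent's exact method.

```python
import numpy as np
import pyworld  # WORLD vocoder: a stand-in for the patent's analysis/synthesis

def speech_to_singing(x, fs, tonic_hz, chord_hz):
    x = x.astype(np.float64)
    f0, t = pyworld.harvest(x, fs)            # pitch value per frame
    sp = pyworld.cheaptrick(x, f0, t, fs)     # formant (spectral envelope) info
    ap = pyworld.d4c(x, f0, t, fs)            # aperiodicity info
    targets = np.array([tonic_hz] + list(chord_hz))
    voiced = f0 > 0
    # Snap each voiced frame's pitch to the nearest tonic/chord pitch.
    idx = np.argmin(np.abs(f0[voiced, None] - targets[None, :]), axis=1)
    f0_sing = f0.copy()
    f0_sing[voiced] = targets[idx]
    # Resynthesize the singing frames from the modified pitch track.
    return pyworld.synthesize(f0_sing, sp, ap, fs)

fs = 16000
x = np.random.randn(fs).astype(np.float64)    # stand-in for a voice sample
y = speech_to_singing(x, fs, tonic_hz=220.0, chord_hz=[277.2, 329.6])  # A major triad
print(y.shape)
```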

Systems and techniques for producing spoken voice prompts

Methods and systems are described in which spoken voice prompts can be produced in a manner such that they will most likely have the desired effect, for example to indicate empathy or to produce a desired follow-up action from a call recipient. The prompts can be produced with specific optimized speech parameters, including duration, gender of speaker, and pitch, so as to encourage participation and promote comprehension among a wide range of patients or listeners. Upon hearing such voice prompts, patients/listeners can know immediately when they are being asked questions that they are expected to answer and when they are being given information, as well as which information is considered sensitive.

Sound synthesis device, sound synthesis method and storage medium
09805711 · 2017-10-31

A sound synthesis device that includes a processor configured to perform the following: extracting intonation information from prosodic information contained in sound data and digitally smoothing the extracted intonation information to obtain smoothed intonation information; obtaining a plurality of digital sound units based on text data and concatenating the plurality of digital sound units so as to construct a concatenated series of digital sound units that corresponds to the text data; and modifying the concatenated series of digital sound units in accordance with the smoothed intonation information with respect to at least one of parameters of the concatenated series of digital sound units to generate synthesized sound data corresponding to the text data.
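A minimal sketch of the smoothing-and-modification steps, assuming a moving-average filter as the digital smoother and a blend weight that pulls the pitch parameter of the concatenated units toward the smoothed intonation contour; both choices are illustrative assumptions, not the device's actual processing.

```python
import numpy as np

def smooth_intonation(f0, win=5):
    """Digitally smooth an extracted F0 (intonation) contour (moving average)."""
    kernel = np.ones(win) / win
    return np.convolve(f0, kernel, mode="same")

def modify_units(unit_f0, smoothed_f0, alpha=0.7):
    """Pull the pitch parameter of the concatenated unit series toward the
    smoothed intonation contour; alpha is an assumed blend weight."""
    n = min(len(unit_f0), len(smoothed_f0))
    return (1 - alpha) * unit_f0[:n] + alpha * smoothed_f0[:n]

# Intonation extracted from prosodic information in recorded sound data
# (toy contour, Hz per frame):
recorded_f0 = 180 + 40 * np.sin(np.linspace(0, 3, 100))
# Flat contour of digital sound units concatenated from text data:
units_f0 = np.full(100, 170.0)
modified = modify_units(units_f0, smooth_intonation(recorded_f0))
print(modified[:5])
```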

SPEAKING-RATE NORMALIZED PROSODIC PARAMETER BUILDER, SPEAKING-RATE DEPENDENT PROSODIC MODEL BUILDER, SPEAKING-RATE CONTROLLED PROSODIC-INFORMATION GENERATION DEVICE AND PROSODIC-INFORMATION GENERATION METHOD ABLE TO LEARN DIFFERENT LANGUAGES AND MIMIC VARIOUS SPEAKERS' SPEAKING STYLES
20170309271 · 2017-10-26

A speaking-rate dependent prosodic model builder and a related method are disclosed. The proposed builder includes a first input terminal for receiving a first information of a first language spoken by a first speaker, a second input terminal for receiving a second information of a second language spoken by a second speaker and a functional information unit having a function, wherein the function includes a first plurality of parameters simultaneously relevant to the first language and the second language or a plurality of sub-parameters in a second plurality of parameters relevant to the second language alone, and the functional information unit under a maximum a posteriori condition and based on the first information, the second information and the first plurality of parameters or the plurality of sub-parameters produces speaking-rate dependent reference information and constructs a speaking-rate dependent prosodic model of the second language.
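The maximum a posteriori condition can be illustrated with the textbook MAP update for a Gaussian mean, interpolating a prior (statistics from the first speaker/language) with the sample mean of new observations (the second) according to a relevance factor; the variable names and the factor value below are assumptions, not the builder's actual formulation.

```python
import numpy as np

def map_mean(prior_mean, obs, tau=16.0):
    """Standard MAP estimate of a Gaussian mean: interpolate a prior
    (e.g., prosodic statistics learned from the first speaker/language)
    with the sample mean of new observations (the second), weighted by
    the observation count n against a relevance factor tau."""
    n = len(obs)
    return (tau * prior_mean + n * np.mean(obs, axis=0)) / (tau + n)

prior = np.array([5.2, 0.9])  # e.g., [syllable duration (frames), pitch range]
second_lang_obs = np.array([[4.1, 1.2], [4.3, 1.1], [4.0, 1.3]])
print(map_mean(prior, second_lang_obs))
```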

Unsupervised singing voice conversion with pitch adversarial network
11257480 · 2022-02-22

A method, a computer readable medium, and a computer system are provided for singing voice conversion. Data corresponding to a singing voice is received. One or more features and pitch data are extracted from the received data using one or more adversarial neural networks. One or more audio samples are generated based on the extracted pitch data and the one or more features.
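A compact sketch of the pitch-adversarial idea in PyTorch: a small discriminator tries to predict pitch from the converter's internal features while a gradient-reversal layer pushes the encoder to remove pitch from those features, leaving pitch to be supplied separately at generation time. Shapes, layer sizes, and the MSE objective are illustrative assumptions, not the patent's networks.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -grad  # reversed gradient: adversarial signal for the encoder

encoder = nn.Sequential(nn.Linear(80, 128), nn.ReLU(), nn.Linear(128, 64))
pitch_adv = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(list(encoder.parameters()) + list(pitch_adv.parameters()),
                       lr=1e-3)

mel = torch.randn(8, 80)        # toy batch of mel frames from the singing voice
pitch = torch.rand(8, 1) * 300  # toy per-frame pitch targets (Hz)

opt.zero_grad()
z = encoder(mel)                # features the converter generates audio from
pred = pitch_adv(GradReverse.apply(z))
loss = nn.functional.mse_loss(pred, pitch)
loss.backward()                 # adversary learns pitch; encoder unlearns it
opt.step()
print(float(loss))
```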

HEARING ASSISTANCE WITH AUTOMATED SPEECH TRANSCRIPTION

The assistive hearing device implementations described herein assist hearing-impaired users by using automated speech transcription to generate text representing speech received in audio signals, which can then be read out in a synthesized voice tailored to overcome a user's hearing deficiencies. A speech recognition engine recognizes speech in received audio and converts it to text. Once the speech is converted to text, a text-to-speech engine can convert the text to synthesized speech that can be enhanced and output in a voice that compensates for the hearing-loss profile of the user of the assistive hearing device. By transcribing received speech into text, these implementations eliminate background noise from the audio signal. By converting the transcribed text into a synthesized voice that is easier for hearing-impaired persons to understand, their hearing deficiencies can be remedied.
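The transcribe-then-resynthesize loop can be sketched as below, with placeholder recognize() and synthesize() stages standing in for real ASR and TTS engines, and a per-band gain stage derived from a user's hearing-loss profile; every function here is a hypothetical placeholder, not the device's API.

```python
import numpy as np

def recognize(audio, sr):
    """Placeholder ASR stage: a real device would call a speech recognition
    engine here. Returning fixed text keeps the sketch runnable."""
    return "hello how are you"

def synthesize(text, sr):
    """Placeholder TTS stage: emits a toy waveform for the text."""
    t = np.linspace(0, 0.5, int(0.5 * sr), endpoint=False)
    return 0.1 * np.sin(2 * np.pi * 220 * t)

def compensate(audio, sr, profile):
    """Boost frequency bands per the user's hearing-loss profile
    (band edges in Hz -> gain in dB); an assumed compensation scheme."""
    spec = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), 1 / sr)
    gain = np.ones_like(freqs)
    for (lo, hi), db in profile.items():
        gain[(freqs >= lo) & (freqs < hi)] = 10 ** (db / 20)
    return np.fft.irfft(spec * gain, n=len(audio))

sr = 16000
mic = np.random.randn(sr)                          # noisy received audio
text = recognize(mic, sr)                          # transcription drops the noise
clean = synthesize(text, sr)                       # re-voiced without background noise
out = compensate(clean, sr, {(2000, 8000): 12.0})  # e.g., high-frequency loss
print(text, out.shape)
```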

Speech Synthesis Device and Method
20170221470 · 2017-08-03

This invention improves technology for automatically generating a response voice to speech uttered by a speaker (user), and is characterized by controlling the pitch of the response voice in accordance with the pitch of the speaker's utterance. A voice signal of the speaker's utterance (e.g., a question) is received, and a pitch (e.g., the highest pitch) of a representative portion of the utterance is detected. Voice data of a response to the utterance is acquired, and a pitch (e.g., an average pitch) based on the acquired response voice data is determined. A pitch shift amount for shifting that pitch to a target pitch having a particular relationship to the pitch of the representative portion is determined. When a response voice is synthesized on the basis of the response voice data, the pitch of the synthesized response voice is shifted in accordance with the pitch shift amount.
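A sketch of the shift-amount computation, taking the highest voiced pitch of the question as the representative pitch and assuming (for illustration only) a target a musical fifth below it; pitch tracking with librosa.pyin and shifting with librosa.effects.pitch_shift are stand-ins for whatever the device actually uses.

```python
import numpy as np
import librosa

def shift_response(question, response, sr):
    # Representative pitch of the utterance: here, its highest voiced pitch.
    q_f0, _, _ = librosa.pyin(question, fmin=80, fmax=400, sr=sr)
    rep = np.nanmax(q_f0)
    # Target pitch: assumed here to sit a musical fifth (7 semitones)
    # below the representative pitch.
    target = rep / 2 ** (7 / 12)
    # Average pitch of the stored response voice data.
    r_f0, _, _ = librosa.pyin(response, fmin=80, fmax=400, sr=sr)
    avg = np.nanmean(r_f0)
    # Pitch shift amount, in semitones, that moves the response's
    # average pitch onto the target pitch.
    n_steps = 12 * np.log2(target / avg)
    return librosa.effects.pitch_shift(response, sr=sr, n_steps=n_steps)

sr = 16000
question = librosa.tone(260, sr=sr, duration=1.0)  # toy "question" near C4
response = librosa.tone(200, sr=sr, duration=1.0)  # toy stored response voice
out = shift_response(question, response, sr)
print(out.shape)
```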

Coordinating and mixing vocals captured from geographically distributed performers

Despite many practical limitations imposed by mobile device platforms and application execution environments, vocal musical performances may be captured and continuously pitch-corrected for mixing and rendering with backing tracks in ways that create compelling user experiences. Based on the techniques described herein, even mere amateurs are encouraged to share with friends and family or to collaborate and contribute vocal performances as part of virtual “glee clubs.” In some implementations, these interactions are facilitated through social network- and/or eMail-mediated sharing of performances and invitations to join in a group performance. Using uploaded vocals captured at clients such as a mobile device, a content server (or service) can mediate such virtual glee clubs by manipulating and mixing the uploaded vocal performances of multiple contributing vocalists.