Patent classifications
G10L2021/0135
METHOD AND APPARATUS FOR EXEMPLARY MORPHING COMPUTER SYSTEM
Method and apparatus for reducing the size of the databases required for recorded speech data.
Voice conversion system and training method therefor
The present disclosure proposes a speech conversion scheme trained on non-parallel corpora, removing the dependence on parallel text and addressing the difficulty of achieving voice conversion when resources and equipment are limited. A voice conversion system and a training method for it are included. Compared with the prior art, according to the embodiments of the present disclosure: the trained speaker-independent automatic speech recognition (ASR) model can be used with any source speaker, that is, it is speaker-independent; and the bottleneck features of the audio are more abstract than phonetic posteriorgram features. They reflect the decoupling of spoken content from the speaker's timbre, yet are not tightly bound to phoneme classes and have no strict one-to-one correspondence with them. In this way, the problem of inaccurate pronunciation caused by ASR recognition errors is relieved to some extent. The pronunciation accuracy of audio obtained by performing voice conversion with bottleneck features is markedly higher than that of a phonetic posteriorgram based method, while the timbre is not significantly different. By means of transfer learning, the dependence on training corpora can be greatly reduced.
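The bottleneck-feature idea above can be sketched minimally, with a stand-in linear encoder in place of the trained speaker-independent ASR network (all names, weights, and dimensions here are illustrative assumptions, not the patent's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def bottleneck_features(mel_frames, W):
    # Stand-in for the SI-ASR encoder: each mel frame is squeezed
    # through a low-dimensional bottleneck. Unlike a phonetic
    # posteriorgram, the output is not a distribution over phoneme
    # classes, so it has no one-to-one phoneme correspondence.
    return np.tanh(mel_frames @ W)

mel = rng.normal(size=(200, 80))       # 200 frames, 80 mel bins
W = rng.normal(size=(80, 32)) * 0.1    # 80 -> 32 bottleneck projection
bn = bottleneck_features(mel, W)
```

The 32-dimensional output is what a downstream conversion decoder would consume instead of phoneme posteriors.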
MULTIFUNCTIONAL MICROPHONE
A multifunctional microphone includes a controlling mainboard; a sound collector electrically connected with the controlling mainboard; a speaker electrically connected with the controlling mainboard; and a sound adjusting module arranged on the controlling mainboard and configured to adjust the sound collected by the sound collector.
Devices and Methods for a Speech-Based User Interface
A device may identify a plurality of sources for outputs that the device is configured to provide. The plurality of sources may include at least one of a particular application in the device, an operating system of the device, a particular area within a display of the device, or a particular graphical user interface object. The device may also assign a set of distinct voices to respective sources of the plurality of sources. The device may also receive a request for speech output. The device may also select a particular source that is associated with the requested speech output. The device may also generate speech having particular voice characteristics of a particular voice assigned to the particular source.
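The source-to-voice assignment described above amounts to a mapping that a speech request is routed through. A minimal sketch, assuming hypothetical class and source names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Voice:
    name: str
    pitch: float  # relative pitch multiplier
    rate: float   # relative speaking rate

class SpeechRouter:
    """Maps output sources (apps, OS, display areas) to distinct voices."""

    def __init__(self):
        self._assignments = {}

    def assign(self, source: str, voice: Voice) -> None:
        self._assignments[source] = voice

    def voice_for(self, source: str) -> Voice:
        # A speech request names its source; the assigned voice is used
        # to generate speech with that voice's characteristics.
        return self._assignments[source]

router = SpeechRouter()
router.assign("os.notifications", Voice("alto", pitch=1.1, rate=1.0))
router.assign("app.mail", Voice("baritone", pitch=0.9, rate=1.05))
```

A listener can then tell by ear whether an utterance came from, say, the mail application or an OS notification.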
Method and apparatus for voice modification during a call
A method for voice modification during a telephone call comprises: receiving a source audio signal associated with at least one participant, wherein the source audio signal comprises a voice of the at least one participant; detecting a source dialect of the at least one participant; selecting a target dialect based on at least a characteristic of a target participant; creating a modulated audio signal based on the source audio signal, the source dialect, and the target dialect; and transmitting the modulated audio signal to the target participant.
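The dialect-selection step can be sketched as a lookup keyed on a characteristic of the target participant, falling back to the source dialect when nothing matches (the table entries and region names are hypothetical examples, not taken from the patent):

```python
# Hypothetical mapping from a target participant's region to a dialect tag.
DIALECT_FOR_REGION = {
    "scotland": "en-GB-scottish",
    "texas": "en-US-southern",
    "mumbai": "en-IN",
}

def select_target_dialect(source_dialect: str, target_region: str) -> str:
    # When the target's region has no known dialect mapping, keep the
    # speaker's own dialect so the audio passes through unmodified.
    return DIALECT_FOR_REGION.get(target_region, source_dialect)
```

The selected tag would then drive the modulation stage that converts the source audio toward the target dialect.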
SYNTHESIZED SPEECH GENERATION
A device for speech generation includes one or more processors configured to receive one or more control parameters indicating target speech characteristics. The one or more processors are also configured to process, using a multi-encoder, an input representation of speech based on the one or more control parameters to generate encoded data corresponding to an audio signal that represents a version of the speech based on the target speech characteristics.
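One way to picture the multi-encoder conditioning is two parallel branches, with the control parameters scaling a style branch; this is a toy sketch under that assumption (the branch structure, weights, and dimensions are illustrative, not the claimed design):

```python
import numpy as np

def multi_encode(frames, control, W_content, W_style):
    # Hypothetical multi-encoder: one branch encodes content (what is
    # said), the other encodes style (how it should sound). The control
    # parameters scale the style branch so the combined encoding
    # reflects the target speech characteristics.
    content = np.tanh(frames @ W_content)
    style = np.tanh(frames @ W_style) * control
    return np.concatenate([content, style], axis=-1)

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 80))        # 50 frames of 80-dim features
W_content = rng.normal(size=(80, 16)) * 0.1
W_style = rng.normal(size=(80, 8)) * 0.1
control = np.full(8, 0.5)                 # e.g. damped style targets
encoded = multi_encode(frames, control, W_content, W_style)
```

A decoder would map `encoded` back to an audio signal carrying the requested characteristics.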
OPTIMIZING INTERACTION RESULTS USING AI-GUIDED MANIPULATED VIDEO
Real-time modification of the audio of humans allows the audio to be modified so that an expression of a subject human may be changed. Customer service agents may have more successful interactions with customers if they provide vocalization attributes in their speech that are appropriate, such as conveying a particular emotional state. By determining an appropriate vocalization attribute, and any deviation of a customer service agent's current vocalization attribute from it, a modification to the audio of the customer service agent's speech may be determined and applied. As a result, an agent whose vocalization attribute is not the one best suited to successfully resolving the purpose of the interaction may have their speech altered, so that the customer is presented with the customer service agent's speech having the best-suited vocalization attribute.
Presentation of audio and visual content at live events based on user accessibility
An information processing device includes circuitry that receives a user-input for selection of one of a visual accessibility feature and an aural accessibility feature and further receives a first audio segment from an audio capturing device at a live event. The first audio segment includes a first audio portion of the audio content and a first audio closed caption (CC) information. The circuitry controls display of first text information for the first audio portion and second text information for the first audio CC information, based on received user-input for the selection of the visual accessibility feature. The circuitry generates a second audio segment from the first audio segment based on a first audio characteristic of the first audio segment. The circuitry controls a playback of the generated second audio segment, based on the received user-input for the selection of the aural accessibility feature.
System and method for voice morphing
A system and method for masking an identity of a speaker of natural language speech, such as speech clips to be labeled by humans in a system generating voice transcriptions for training an automatic speech recognition model. The natural language speech is morphed prior to being presented to the human for labeling. In one embodiment, morphing comprises pitch shifting the speech randomly either up or down, then frequency shifting the speech, then pitch shifting the speech in a direction opposite the first pitch shift.
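The three-step morphing embodiment described above can be sketched as follows, using naive resampling for the pitch shifts and analytic-signal modulation for the frequency shift. The function names, the semitone range, and the 150 Hz offset are illustrative assumptions, not the patent's parameters:

```python
import numpy as np
from scipy.signal import hilbert, resample

def pitch_shift(x, semitones):
    # Naive pitch shift by resampling (also changes duration; a real
    # implementation would use a phase vocoder to preserve length).
    factor = 2 ** (semitones / 12)
    return resample(x, int(len(x) / factor))

def frequency_shift(x, sr, shift_hz):
    # Shift the entire spectrum by modulating the analytic signal.
    analytic = hilbert(x)
    t = np.arange(len(x)) / sr
    return np.real(analytic * np.exp(2j * np.pi * shift_hz * t))

def morph_voice(x, sr, rng=None):
    rng = rng or np.random.default_rng()
    direction = rng.choice([-1, 1])           # randomly up or down
    steps = rng.uniform(2, 4)
    y = pitch_shift(x, direction * steps)     # 1) first pitch shift
    y = frequency_shift(y, sr, 150.0)         # 2) frequency shift
    y = pitch_shift(y, -direction * steps)    # 3) opposite pitch shift
    return y
```

Because the second pitch shift opposes the first, the content stays at roughly the original pitch while the intervening frequency shift disguises the speaker's formant structure.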
Systems, methods and computer program products for generating script elements and call to action components therefor
A call to action processor receives an entity datapoint containing data related to an entity, a campaign objective datapoint containing data associated with a campaign objective, at least one definite script element based on the campaign objective, and entity metadata containing data associated with the entity. The call to action processor further generates at least one variable script element based on the entity metadata, and presents to a device the at least one definite script element and the at least one variable script element.