Patent classifications
G10L2021/0135
Contact center of celebrities
Customers can become bored during an interaction with an agent. By providing speech and/or images of a celebrity that disguise the speech and/or image of the agent, customers can appear to interact with a particular celebrity. As a result, customers are more likely to stay engaged and have a positive experience. The celebrity, or a particular persona of a celebrity, may be selected based on customer preferences and/or the purpose of the call. For example, a movie star's role may have a persona, such as a “heavy,” suitable for collection calls (audio or audio-video), whereas a scientific or technical innovator may be selected for technical support calls.
CONVERSION LEARNING APPARATUS, CONVERSION LEARNING METHOD, CONVERSION LEARNING PROGRAM AND CONVERSION APPARATUS
A conversion learning device includes a source encoding unit, a target encoding unit, an attention matrix calculation unit, a target decoding unit, and a learning execution unit. The source encoding unit uses a first machine learning model to convert a feature amount sequence of a source domain, a characteristic of the conversion-source content data, into a first internal representation vector sequence: a matrix in which internal representation vectors at the individual locations of the source feature amount sequence are arranged. The target encoding unit uses a second machine learning model to convert a feature amount sequence of a target domain, a characteristic of the conversion-target content data, into a second internal representation vector sequence, arranged analogously. The attention matrix calculation unit uses the first and second internal representation vector sequences to calculate an attention matrix that maps the individual locations of the source feature amount sequence to the individual locations of the target feature amount sequence, and calculates a third internal representation vector sequence as the product of the attention matrix and an internal representation vector sequence obtained by linear conversion of the first internal representation vector sequence. The target decoding unit applies a third machine learning model to the third internal representation vector sequence to calculate a feature amount sequence of a conversion domain, which is used to convert the source domain into the conversion domain. The learning execution unit trains at least one of the target encoding unit and the target decoding unit so that the distance between a submatrix of the target-domain feature amount sequence and a submatrix of the conversion-domain feature amount sequence becomes shorter.
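The attention step described above can be sketched in a few lines of NumPy. The dimensions, the softmax normalization, and the linear-conversion matrix W are illustrative assumptions; the abstract does not specify them.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8                  # internal representation dimension (assumed)
T_src, T_tgt = 12, 10  # lengths of the source and target feature sequences

H_src = rng.standard_normal((T_src, d))  # first internal representation sequence
H_tgt = rng.standard_normal((T_tgt, d))  # second internal representation sequence
W = rng.standard_normal((d, d))          # linear conversion of the source sequence

V = H_src @ W                            # linearly converted source representations
A = softmax(H_tgt @ V.T / np.sqrt(d))    # attention matrix: each target location
                                         # distributes weight over source locations
H3 = A @ V                               # third internal representation sequence

assert A.shape == (T_tgt, T_src)
assert np.allclose(A.sum(axis=1), 1.0)   # rows are attention distributions
assert H3.shape == (T_tgt, d)
```

The third sequence H3 is what a decoder (the third model) would consume to produce the conversion-domain feature sequence.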
NEURAL PITCH-SHIFTING AND TIME-STRETCHING
Methods for modifying audio data include operations for accessing audio data having a first prosody, receiving a target prosody differing from the first prosody, and computing acoustic features representing samples. Computing respective acoustic features for a sample includes computing a pitch feature as a quantized pitch value of the sample by assigning a pitch value, of the target prosody or the audio data, to at least one of a set of pitch bins having equal widths in cents. Computing the respective acoustic features further includes computing a periodicity feature from the audio data. The respective acoustic features for the sample include the pitch feature, the periodicity feature, and other acoustic features. A neural vocoder is applied to the acoustic features to pitch-shift and time-stretch the audio data from the first prosody toward the target prosody.
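The pitch-bin quantization can be illustrated with a short NumPy sketch. The reference frequency, bin width, and bin count below are illustrative choices, not values from the abstract; the key property is that all bins have equal width in cents.

```python
import numpy as np

def quantize_pitch_cents(f0_hz, f_ref=50.0, bin_width_cents=20.0, n_bins=360):
    """Assign each pitch value (Hz) to one of n_bins bins of equal width in cents.

    f_ref, bin_width_cents, and n_bins are illustrative assumptions.
    """
    cents = 1200.0 * np.log2(np.asarray(f0_hz) / f_ref)   # Hz -> cents above f_ref
    bins = np.floor(cents / bin_width_cents).astype(int)  # equal-width bins in cents
    return np.clip(bins, 0, n_bins - 1)

# An octave (a doubling in Hz) spans 1200 cents, i.e. 60 bins of 20 cents each,
# regardless of where in the frequency range it occurs.
b100 = quantize_pitch_cents(100.0)
b200 = quantize_pitch_cents(200.0)
assert b200 - b100 == 60
```

Quantizing in cents rather than Hz gives the bins a perceptually uniform spacing, which is what makes equal-width bins meaningful for pitch.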
COMMUNICATION APPARATUS MOUNTED WITH SPEECH SPEED CONVERSION DEVICE
In a communication apparatus, an encoder compresses telephone call voice transmitted from another communication apparatus. A voice accumulator preserves the telephone call voice compressed by the encoder as a message. A decoder expands the telephone call voice preserved in the voice accumulator. A signal memory temporarily holds the telephone call voice expanded by the decoder. A speech speed converter performs speech speed conversion on the telephone call voice read from the signal memory and outputs the resulting voice from a speaker. A memory monitor temporarily stops the decoder from expanding the telephone call voice when it determines that the idle capacity of the signal memory approaches a predetermined lower limit value.
Synthesized speech generation
A device for speech generation includes one or more processors configured to receive one or more control parameters indicating target speech characteristics. The one or more processors are also configured to process, using a multi-encoder, an input representation of speech based on the one or more control parameters to generate encoded data corresponding to an audio signal that represents a version of the speech based on the target speech characteristics.
Real-time speech-to-speech generation (RSSG) and sign language conversion apparatus, method and a system therefore
A real-time speech-to-speech generator and sign gesture converter system is disclosed. Communication remains challenging for deaf or hearing-impaired people. Embodiments of the invention provide a direct speech-to-speech translation system with further conversion to sign gestures. Direct speech-to-speech translation and the subsequent sign gesture conversion use a one-tier approach, creating a unified model for the whole application. The single-model ecosystem takes audio (a MEL spectrogram) as input and gives out audio (a MEL spectrogram) as output to a speech-sign converter device with a display. This solves the bottleneck problem by converting the translated speech directly from the first language to sign language gestures with emotion, preserving phonetic information along the way. The model needs parallel audio samples in the two languages. The training methodology involves augmenting or changing both sides of the audio equally; the output is later converted to sign gestures, which are displayed on the speech-sign converter device.
Methods and systems for image and voice processing
Systems and methods are disclosed for training an autoencoder using images that include faces, where the autoencoder comprises an input layer, an encoder configured to output a latent image from a corresponding input image, and a decoder configured to attempt to reconstruct the input image from the latent image. An image sequence of a face exhibiting a plurality of facial expressions and transitions between facial expressions is generated and accessed. Images of the facial expressions and transitions are captured from a plurality of different angles and under different lighting. An autoencoder is trained using source images that include the face with different facial expressions captured at different angles under different lighting, and using destination images that include a destination face. The trained autoencoder is used to generate an output in which the likeness of the face in the destination images is swapped with the likeness of the source face, while preserving the expressions of the destination face.
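The swap mechanism can be sketched structurally: a shared encoder with one decoder per identity is a common layout for this kind of face swap, though the abstract does not specify the architecture. Everything below (linear toy layers, sizes, names) is an illustrative assumption, with training omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
d_img, d_latent = 64, 8   # toy flattened-image and latent sizes (assumed)

# One shared encoder, two decoders: Da reconstructs source face A,
# Db reconstructs destination face B. Training would fit these weights.
E  = rng.standard_normal((d_latent, d_img)) * 0.1   # shared encoder weights
Da = rng.standard_normal((d_img, d_latent)) * 0.1   # decoder for source face A
Db = rng.standard_normal((d_img, d_latent)) * 0.1   # decoder for destination face B

def encode(x):
    """Shared encoder: maps an input image to a latent image."""
    return np.tanh(E @ x)

def decode(D, z):
    """Identity-specific decoder: reconstructs an image from the latent."""
    return D @ z

x_b  = rng.standard_normal(d_img)  # a destination-face image (flattened)
z    = encode(x_b)                 # expression/pose captured in the latent
swap = decode(Da, z)               # decoding with A's decoder swaps in A's likeness
```

Because the encoder is shared, the latent captures expression and pose common to both identities, so routing a destination-face latent through the source decoder keeps the expression while changing the likeness.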
Unsupervised singing voice conversion with pitch adversarial network
A method, a computer readable medium, and a computer system are provided for singing voice conversion. Data corresponding to a singing voice is received. One or more features and pitch data are extracted from the received data using one or more adversarial neural networks. One or more audio samples are generated based on the extracted pitch data and the one or more features.
METHOD FOR TRANSFORMING AUDIO SIGNAL, DEVICE, AND STORAGE MEDIUM
A method for transforming an audio signal comprises: obtaining a plurality of segmental original frequency-domain signal segments and a plurality of segmental target frequency-domain signal segments by segmenting and performing a Fourier transform on an original audio signal and on an initial target audio signal obtained by pitch shifting the original audio signal; obtaining a plurality of original formant envelopes by respectively filtering the segmental original frequency-domain signal segments according to a plurality of original segment window functions, and obtaining a plurality of target formant envelopes by respectively filtering the segmental target frequency-domain signal segments according to a plurality of target segment window functions; and determining a pitch-shifted audio signal based on the segmental target frequency-domain signal segments, the original formant envelopes, and the target formant envelopes.
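A per-segment version of this envelope correction can be sketched with NumPy. The moving-average envelope and the ratio-based combination below are illustrative stand-ins: the abstract does not define its segment window functions or the exact combination rule.

```python
import numpy as np

def formant_envelope(frame, win_len=32):
    """Illustrative formant envelope: smooth the magnitude spectrum with a
    moving-average 'segment window' (an assumption, not the patent's filter)."""
    mag = np.abs(np.fft.rfft(frame))
    kernel = np.ones(win_len) / win_len
    return np.convolve(mag, kernel, mode="same") + 1e-9  # avoid divide-by-zero

def formant_corrected_spectrum(orig_frame, shifted_frame):
    """Rescale the pitch-shifted segment's spectrum so its envelope matches
    the original's: one way to combine the three quantities in the abstract."""
    S_t = np.fft.rfft(shifted_frame)
    env_o = formant_envelope(orig_frame)
    env_t = formant_envelope(shifted_frame)
    return S_t * (env_o / env_t)

fs = 16000
t = np.arange(1024) / fs
orig    = np.sin(2 * np.pi * 200 * t)   # one original segment
shifted = np.sin(2 * np.pi * 300 * t)   # stand-in for its pitch-shifted version
S = formant_corrected_spectrum(orig, shifted)
```

Restoring the original envelope on the shifted spectrum is what keeps the voice timbre natural after the pitch shift, since naive shifting moves formants along with the pitch.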
Cue-aware privacy filter for participants in persistent communications
A cue, for example a facial expression or hand gesture, is identified, and a device communication is filtered according to the cue.