G10L21/013

Voice morphing apparatus having adjustable parameters
11600284 · 2023-03-07

A voice morphing apparatus having adjustable parameters is described. The disclosed system and method include a voice morphing apparatus that morphs input audio to mask a speaker's identity. Parameter adjustment uses evaluation of an objective function that is based on the input audio and output of the voice morphing apparatus. The voice morphing apparatus includes objectives that are based adversarially on speaker identification and positively on audio fidelity. Thus, the voice morphing apparatus is adjusted to reduce identifiability of speakers while maintaining fidelity of the morphed audio. The voice morphing apparatus may be used as part of an automatic speech recognition system.
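The two-part objective described in this abstract can be illustrated with a minimal sketch. The abstract does not say how fidelity or identifiability is measured, so the mean squared error, the log-probability term, and the `adv_weight` parameter below are all assumptions chosen for illustration, not the patented method.

```python
import math


def fidelity_loss(original, morphed):
    """Mean squared error between input and morphed samples.

    Assumption: MSE stands in for whatever fidelity measure is actually used.
    Lower values mean the morphed audio stays closer to the input.
    """
    return sum((a - b) ** 2 for a, b in zip(original, morphed)) / len(original)


def identification_loss(speaker_probs, true_speaker):
    """Adversarial term: log-probability a (hypothetical) speaker identifier
    assigns to the true speaker when given the morphed audio.

    Lower values mean the speaker is harder to identify.
    """
    return math.log(speaker_probs[true_speaker])


def morphing_objective(original, morphed, speaker_probs, true_speaker, adv_weight=1.0):
    """Combined objective to minimise when adjusting morphing parameters.

    It penalises loss of fidelity and penalises a confident, correct
    speaker identification.
    """
    return (fidelity_loss(original, morphed)
            + adv_weight * identification_loss(speaker_probs, true_speaker))
```

Under these assumptions, a parameter setting that leaves the identifier confident about the true speaker scores worse (higher) than one that leaves it guessing, given the same fidelity.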

COORDINATING AND MIXING AUDIOVISUAL CONTENT CAPTURED FROM GEOGRAPHICALLY DISTRIBUTED PERFORMERS
20230112247 · 2023-04-13

Audiovisual performances, including vocal music, are captured and coordinated with those of other users in ways that create compelling user experiences. In some cases, the vocal performances of individual users are captured (together with performance synchronized video) on mobile devices, television-type display and/or set-top box equipment in the context of karaoke-style presentations of lyrics in correspondence with audible renderings of a backing track. Contributions of multiple vocalists are coordinated and mixed in a manner that selects for visually prominent presentation performance synchronized video of one or more of the contributors. Prominence of particular performance synchronized video may be based, at least in part, on computationally-defined audio features extracted from (or computed over) captured vocal audio. Over the course of a coordinated audiovisual performance timeline, these computationally-defined audio features are selective for performance synchronized video of one or more of the contributing vocalists.

APPARATUS AND METHOD FOR PITCH-SHIFTING AUDIO SIGNAL WITH LOW COMPLEXITY

An apparatus and method for pitch-shifting an audio signal with low complexity are disclosed. The method includes identifying a distance between an audio object included in the audio signal and a listener, checking whether the distance between the audio object and the listener decreases, and, when the distance decreases, performing stepwise stretching pitch-shifting that repeatedly uses at least one of the frequency components of the audio signal.

Pitch emphasis apparatus, method and program for the same

Provided is pitch enhancement processing with little unnaturalness even in time segments for consonants, and with little unnaturalness to listeners caused by discontinuities even when time segments for consonants and other time segments switch frequently. A pitch emphasis apparatus carries out the following as the pitch enhancement processing: for a time segment in which a spectral envelope of a signal has been determined to be flat, obtaining an output signal for each time in the time segment, the output signal including a signal obtained by adding (1) a signal obtained by multiplying together the signal at a time T_0 samples further in the past (T_0 being the number of samples corresponding to the pitch period of the time segment), the pitch gain σ_0 of the time segment, a predetermined constant B_0, and a value greater than 0 and less than 1, to (2) the signal at the time itself.
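Read literally, the flat-envelope case adds a scaled copy of the sample one pitch period earlier to the current sample. The sketch below is one possible reading under stated assumptions: the function name, the `gamma` parameter (the "value greater than 0 and less than 1"), the use of the input signal for the past sample, and the zero fill before the start of the signal are all hypothetical, and any normalisation the full specification may apply is omitted.

```python
def emphasize_pitch_flat_segment(x, start, end, T0, sigma0, B0, gamma):
    """Pitch emphasis for samples x[start:end] of a flat-envelope segment.

    Computes y[n] = x[n] + B0 * sigma0 * gamma * x[n - T0], where
    T0 is the pitch period in samples, sigma0 the pitch gain,
    B0 a predetermined constant, and 0 < gamma < 1.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must be greater than 0 and less than 1")
    weight = B0 * sigma0 * gamma
    out = []
    for n in range(start, end):
        # Assumption: samples before the start of the signal count as zero.
        past = x[n - T0] if n - T0 >= 0 else 0.0
        out.append(x[n] + weight * past)
    return out
```

Because `gamma` is below 1, the added pitch component in a flat-envelope segment is weaker than it would be with the pitch gain and constant alone.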

APPARATUS FOR OUTPUTTING AN AUDIO SIGNAL IN A VEHICLE CABIN
20220319531 · 2022-10-06

Apparatus (2) for outputting an audio signal (3) in a vehicle cabin (4), the apparatus (2) comprising: at least one audio outputting device (6) configured to output an audio signal (3), particularly an audio signal (3) comprising at least one audio signal component containing a human voice, particularly a singer's voice, in a vehicle cabin (4); at least one audio receiving device (10) configured to receive a human voice signal (9) of at least one person (P) located in the or a vehicle cabin (4) whilst the at least one audio outputting device (6) outputs the audio signal (3) in the or a vehicle cabin (4); at least one processing device (11) configured to combine the audio signal (3) and the received human voice signal (9) so as to generate a combined audio signal containing the audio signal (3) and the received human voice signal (9) which combined audio signal is outputtable or output in the or a vehicle cabin (4) via the at least one audio outputting device (6).

Learning speech data generating apparatus, learning speech data generating method, and program

A training speech data generating apparatus includes: a voice conversion unit that converts, using fourth noise data, which is noise data based on third noise data, and speech data, the speech data so as to make the speech data clearly audible under a noise environment corresponding to the fourth noise data; and a noise superimposition unit that obtains training speech data by superimposing the third noise data and the converted speech data.
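The data flow in this abstract (derive fourth noise data from third noise data, convert the speech with respect to it, then superimpose the third noise data) can be sketched as follows. The abstract does not specify how the fourth noise data is derived or how the voice conversion works, so `derive_fourth_noise` (a scaled copy) and `convert_speech` (a simple gain boost) are hypothetical placeholders that only illustrate the order of operations.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def derive_fourth_noise(third_noise, scale=1.0):
    """Hypothetical: fourth noise data as a scaled copy of third noise data."""
    return [scale * s for s in third_noise]


def convert_speech(speech, fourth_noise, target_ratio=2.0):
    """Placeholder voice conversion: raise the gain so the speech RMS is at
    least target_ratio times the noise RMS. Real conversion is unspecified."""
    speech_rms = rms(speech)
    if speech_rms == 0.0:
        return list(speech)
    gain = max(1.0, target_ratio * rms(fourth_noise) / speech_rms)
    return [gain * s for s in speech]


def make_training_speech(speech, third_noise, scale=1.0):
    """Convert speech against the fourth noise, then superimpose third noise."""
    fourth_noise = derive_fourth_noise(third_noise, scale)
    converted = convert_speech(speech, fourth_noise)
    return [c + n for c, n in zip(converted, third_noise)]
```

Note that the conversion step sees the fourth noise data while the superimposition step uses the third noise data, matching the wording of the abstract.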

Contact center of celebrities

Customers can become bored with an interaction with an agent. By providing speech and/or images of a celebrity that disguise the speech and/or image of the agent, customers can appear to interact with a particular celebrity. As a result, customers are more likely to stay engaged and have a positive experience. The celebrity, or a particular persona of a celebrity, may be selected based on customer preferences and/or the purpose of the call. For example, a movie star's role may have a persona, such as a “heavy,” suitable for collection calls (audio or audio-video), whereas a scientific or technical innovator may be selected for technical support calls.