Patent classifications
G10L21/007
Pronunciation conversion apparatus, pitch mark timing extraction apparatus, methods and programs for the same
Provided is a system which allows a learner who is a non-native speaker of a given language to intuitively improve his or her pronunciation of that language. A pronunciation conversion apparatus includes a conversion section which converts a first feature value, corresponding to a first speech signal obtained when a first speaker who speaks a given language as his or her native language speaks another language, such that the first feature value approaches a second feature value corresponding to a second speech signal obtained when a second speaker who speaks the other language as his or her native language speaks the other language. Each of the first feature value and the second feature value is a feature value capable of representing a difference in pronunciation, and a speech signal obtained from the first feature value after the conversion is presented to the first speaker.
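The abstract does not specify the conversion rule, but one minimal way to make a learner's pronunciation features "approach" a native reference is linear interpolation between the two feature sequences. The sketch below assumes frame-aligned feature arrays; the function name, shapes, and the interpolation itself are illustrative assumptions, not the patented method.

```python
import numpy as np

def convert_features(learner_feats, native_feats, strength=0.5):
    """Move the learner's pronunciation features toward the native
    speaker's reference features by linear interpolation.

    learner_feats, native_feats: arrays of shape (frames, dims) of
    features capable of representing pronunciation differences
    (e.g. spectral-envelope coefficients), assumed frame-aligned.
    strength: 0.0 keeps the learner's features unchanged, 1.0 fully
    adopts the native reference.
    """
    learner_feats = np.asarray(learner_feats, dtype=float)
    native_feats = np.asarray(native_feats, dtype=float)
    return (1.0 - strength) * learner_feats + strength * native_feats

# Toy example: two frames of 3-dimensional features.
learner = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
native = np.array([[3.0, 2.0, 1.0], [4.0, 4.0, 4.0]])
converted = convert_features(learner, native, strength=0.5)
```

A speech signal resynthesized from `converted` (with a vocoder, not shown) would then be played back to the learner.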
User voice control system
Embodiments include techniques and objects related to a wearable audio device that includes a microphone to detect a plurality of sounds in an environment in which the wearable audio device is located. The wearable audio device further includes a non-acoustic sensor to detect that a user of the wearable audio device is speaking. The wearable audio device further includes one or more processors, communicatively coupled to the microphone and the non-acoustic sensor, to alter, based on an identification by the non-acoustic sensor that the user of the wearable audio device is speaking, one or more of the plurality of sounds to generate a sound output. Other embodiments may be described or claimed.
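The per-frame behavior described above can be sketched very simply: when the non-acoustic sensor reports that the user is speaking, attenuate the ambient sounds in the output. The function name, the boolean sensor interface, and the ducking gain are all assumptions for illustration; the abstract does not say how the sounds are altered.

```python
def process_frame(ambient_frame, user_is_speaking, duck_gain=0.25):
    """Alter ambient sounds based on a non-acoustic speaking cue.

    ambient_frame: list of sample amplitudes captured by the microphone.
    user_is_speaking: boolean derived from a non-acoustic sensor
    (e.g. a vibration or bone-conduction pickup on the wearable).
    duck_gain: attenuation applied to ambient sound while the user
    speaks, so the user's own voice is not masked by the environment.
    """
    gain = duck_gain if user_is_speaking else 1.0
    return [s * gain for s in ambient_frame]

frame = [0.5, -0.25, 1.0]
quiet = process_frame(frame, user_is_speaking=True)   # attenuated output
loud = process_frame(frame, user_is_speaking=False)   # passed through
```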
SANITIZING PERSONALLY IDENTIFIABLE INFORMATION (PII) IN AUDIO AND VISUAL DATA
Techniques for sanitizing personally identifiable information (PII) from audio and visual data are provided. For instance, in a scenario where the data comprises an audio signal with speech uttered by a person P, these techniques can include removing/obfuscating/transforming speech-related PII in the audio signal such as pitch and acoustic cues associated with P's vocal tract shape and/or vocal actuators (e.g., lips, nasal air bypass, teeth, tongue, etc.) while allowing the content of the speech to remain recognizable. Further, in a scenario where the data comprises a still image or video in which a person P appears, these techniques can include removing/obfuscating/transforming visual PII in the image or video such as P's biological features and indicators of P's location/belongings/data while allowing the general nature of the image or video to remain discernable. Through this PII sanitization process, the privacy of individuals portrayed in the audio or visual data can be preserved.
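One crude way to transform the pitch-related PII mentioned above is to resample the waveform, which shifts pitch (and, as a side effect, duration). This is only a sketch under that assumption; a real sanitization system would more likely use a vocoder (e.g. a phase vocoder or WORLD-style analysis/synthesis) so that timing and intelligibility are preserved while pitch and vocal-tract cues are altered. The function name is hypothetical.

```python
import numpy as np

def crude_pitch_shift(signal, factor):
    """Crude pitch obfuscation by linear-interpolation resampling.

    factor > 1.0 shifts pitch up and shortens the signal;
    factor < 1.0 shifts pitch down and lengthens it.  Speech content
    generally remains recognizable for moderate factors.
    """
    signal = np.asarray(signal, dtype=float)
    n_out = int(len(signal) / factor)
    # Positions in the original signal at which to read samples.
    positions = np.arange(n_out) * factor
    return np.interp(positions, np.arange(len(signal)), signal)
```

For example, `crude_pitch_shift(x, 1.3)` raises the apparent pitch enough to mask a speaker's natural fundamental frequency while leaving the utterance understandable.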
Adaptive coefficients and samples elimination for circular convolution
Technologies are disclosed for improving the efficiency of real-time audio processing, and specifically for improving the efficiency of continuously modifying a real-time audio signal. Efficiency is improved by reducing memory bandwidth requirements and by reducing the amount of processing used to modify the real-time audio signal. In some configurations, memory bandwidth requirements are reduced by selectively transferring active samples in the frequency domain, e.g., avoiding the transfer of samples with amplitudes of zero or near zero. This is particularly important when specialized hardware retrieves samples from main memory in real time. In some configurations, the amount of processing needed to modify the audio signal is reduced by omitting operations that do not meaningfully affect the output audio signal. For example, a multiplication of samples may be avoided when at least one of the samples has an amplitude of zero or near zero.
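The elimination step described above has a natural expression in frequency-domain (circular) convolution, where the output spectrum is the elementwise product of the two input spectra: bins whose magnitude is at or near zero in either operand need be neither transferred nor multiplied. The sketch below is a software illustration of that idea only; the patent's hardware and memory-transfer specifics are not modeled, and the function name and threshold are assumptions.

```python
import numpy as np

def sparse_spectral_multiply(signal_spec, filter_spec, threshold=1e-6):
    """Pointwise spectral multiplication that skips inactive bins.

    Returns the product spectrum and the number of bins actually
    multiplied.  Bins where either operand's magnitude is at or
    below `threshold` are treated as zero and skipped.
    """
    out = np.zeros_like(signal_spec)
    active = (np.abs(signal_spec) > threshold) & (np.abs(filter_spec) > threshold)
    out[active] = signal_spec[active] * filter_spec[active]
    return out, int(active.sum())

# Toy spectra: only bins 0 and 2 are active in both operands,
# so only two multiplications are performed instead of four.
sig = np.array([1.0 + 0j, 0.0, 2.0, 0.0])
filt = np.array([2.0 + 0j, 3.0, 0.5, 0.0])
product, n_mults = sparse_spectral_multiply(sig, filt)
```

For genuinely sparse filters (e.g. long reverb tails after thresholding), the multiplication count, and the samples that must be fetched from memory, can drop substantially while the output remains numerically equivalent to the full product.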
VOICE TRANSLATION AND VIDEO MANIPULATION SYSTEM
A communication modification system including an audio gathering unit that gathers an audio stream and a language detection unit that converts the audio stream into text, where the language detection unit correlates portions of the text with audio portions of the audio stream and determines first and second deviations in an audio stream portion based on the corresponding text portion and audio portion gathered by the audio gathering unit.
Generation and detection of watermark for real-time voice conversion
A method watermarks speech data by using a generator to generate speech data including a watermark. The generator is trained to generate the speech data including the watermark. The training process generates first speech data from the generator; the first speech data is configured to represent speech and includes a candidate watermark. The training also produces an inconsistency message as a function of at least one difference between the first speech data and at least authentic speech data. The training further includes transforming the first speech data, including the candidate watermark, using a watermark robustness module to produce transformed speech data including a transformed candidate watermark. The training further produces a watermark-detectability message, using a watermark detection machine learning system, relating to one or more desirable watermark features of the transformed candidate watermark.
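The three training signals named above (inconsistency with authentic speech, a robustness transform, and watermark detectability after that transform) suggest a composite training objective. The sketch below is one plausible composition under heavy assumptions: the abstract does not disclose the actual losses, the robustness module here is just additive noise, and the detector is a simple correlation rather than a machine learning system. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def robustness_transform(speech):
    """Stand-in for the watermark robustness module: a perturbation
    (here, additive noise) that the watermark must survive."""
    return speech + 0.01 * rng.standard_normal(speech.shape)

def inconsistency(generated, authentic):
    """The 'inconsistency message': penalty for generated speech
    deviating from authentic speech (mean squared error)."""
    return float(np.mean((generated - authentic) ** 2))

def detectability(transformed, watermark_template):
    """Stand-in for the 'watermark-detectability message': cosine
    similarity between the transformed speech and the watermark
    template (higher means more detectable after transformation)."""
    return float(np.dot(transformed, watermark_template)
                 / (np.linalg.norm(transformed)
                    * np.linalg.norm(watermark_template)))

def training_loss(generated, authentic, watermark_template,
                  alpha=1.0, beta=1.0):
    """Composite objective: stay close to authentic speech while
    keeping the watermark detectable after the robustness transform."""
    transformed = robustness_transform(generated)
    return (alpha * inconsistency(generated, authentic)
            - beta * detectability(transformed, watermark_template))
```

A training loop would then update the generator's parameters to minimize `training_loss`, trading off perceptual fidelity against watermark robustness via `alpha` and `beta`.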