G10L25/30

Audio stem identification systems and methods

Methods, systems and computer program products are provided for determining acoustic feature vectors of query and target items in a first vector space, and mapping the acoustic feature vectors to a second vector space having a lower dimension. The distribution of vectors in the second vector space can then be used to identify items from the same song, and/or items that are complementary. A mapping function is trained using a machine learning algorithm, such that complementary audio items are closer in the second vector space than in the first, according to a given distance metric.
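The idea of comparing items after mapping them to a lower-dimensional space can be illustrated with a minimal sketch. It assumes a hand-picked linear map `W` in place of the trained mapping function and Euclidean distance as the metric; the feature values and item names are hypothetical and are not taken from the patent.

```python
import math

def project(matrix, vector):
    """Apply a linear map (given as a list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, targets, matrix):
    """Return the name of the target closest to the query after projection."""
    q = project(matrix, query)
    return min(targets, key=lambda name: euclidean(q, project(matrix, targets[name])))

# Hypothetical 4-D acoustic feature vectors (first vector space).
query = [1.0, 0.0, 5.0, 0.0]
targets = {
    "same_song_drums": [1.0, 0.1, 0.0, 5.0],
    "other_song_bass": [0.0, 1.0, 5.0, 0.0],
}

# Assumed stand-in for the trained mapping: keep only the first two coordinates.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]]

# Identity map, used to compare against distances in the first vector space.
I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

match_first_space = nearest(query, targets, I4)
match_second_space = nearest(query, targets, W)
```

With these made-up numbers the nearest item changes once the vectors are projected, which is the effect the trained mapping is meant to produce for complementary items.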

Systems and methods for identifying an acoustic source based on observed sound

An electronic device includes a processor, and a memory containing instructions that, when executed by the processor, cause the electronic device to learn a sound emitted by a legacy device and to issue an output when the electronic device subsequently hears the sound. For example, the electronic device can receive a training input and extract a compact representation of a sound in the training input, which the device stores. The device can receive an audio signal corresponding to an observed acoustic scene and extract a representation of the observed acoustic scene from the audio signal. The electronic device can determine whether the sound is present in the observed acoustic scene at least in part from a comparison of the representation of the observed acoustic scene with the representation of the sound. The electronic device emits a selected output responsive to determining that the sound is present in the acoustic scene.
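The comparison step described above might look like the following sketch. It assumes the compact representations are plain feature vectors and uses cosine similarity with a fixed threshold; the abstract does not specify the representation or the comparison, so the function names and the threshold value are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def sound_present(stored, observed, threshold=0.9):
    """Decide whether the learned sound appears in the observed scene.

    `threshold` is an assumed tuning parameter, not a value from the source.
    """
    return cosine_similarity(stored, observed) >= threshold

# Hypothetical representation learned from the training input.
stored_sound = [1.0, 2.0, 3.0]

# Hypothetical representations of two observed acoustic scenes.
scene_with_sound = [1.1, 2.0, 2.9]
scene_without_sound = [3.0, -1.0, 0.0]

detected = sound_present(stored_sound, scene_with_sound)
not_detected = sound_present(stored_sound, scene_without_sound)
```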

Real time correction of accent in speech audio signals
11715457 · 2023-08-01

Systems and methods for real-time correction of an accent in a speech audio signal are provided. A method includes: dividing the speech audio signal into a stream of input chunks, an input chunk from the stream including a pre-defined number of frames of the speech audio signal; extracting, by an acoustic features extraction module, acoustic features from the input chunk and a context associated with the input chunk, the context being a pre-determined number of the frames preceding the input chunk in the stream; extracting, by a linguistic features extraction module, linguistic features from the input chunk and the context; receiving a speaker embedding for a human speaker; providing the speaker embedding, the acoustic features, and the linguistic features to a synthesis module to generate a mel spectrogram with a reduced accent; and providing the mel spectrogram to a vocoder to generate an output chunk of an output audio signal.
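The chunking step, where each input chunk is paired with a fixed number of preceding frames as context, can be sketched as follows. This is an illustration only: frames are represented as list items, the function name is hypothetical, and the feature extraction, synthesis, and vocoder stages are not modeled.

```python
def chunks_with_context(frames, chunk_size, context_size):
    """Yield (context, chunk) pairs from a sequence of frames.

    Each chunk holds exactly `chunk_size` frames; its context is up to
    `context_size` frames immediately preceding it. A trailing partial
    chunk is dropped in this sketch.
    """
    if chunk_size <= 0 or context_size < 0:
        raise ValueError("chunk_size must be positive and context_size non-negative")
    for start in range(0, len(frames) - chunk_size + 1, chunk_size):
        context = frames[max(0, start - context_size):start]
        chunk = frames[start:start + chunk_size]
        yield context, chunk

# Ten hypothetical frames, labeled 0..9 for readability.
frames = list(range(10))
pairs = list(chunks_with_context(frames, chunk_size=4, context_size=2))
```

The first chunk has an empty context because no frames precede it; a real-time system would more likely buffer incoming frames than slice a complete list.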

Artificial intelligence apparatus for converting text and speech in consideration of style and method for the same
11715485 · 2023-08-01

According to an embodiment of the present invention, there is provided an artificial intelligence (AI) apparatus for mutually converting a text and a speech, including: a memory configured to store a plurality of Text-To-Speech (TTS) engines; and a processor configured to: obtain image data containing a text, determine a speech style corresponding to the text, generate a speech corresponding to the text by using a TTS engine corresponding to the determined speech style among the plurality of TTS engines, and output the generated speech.
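Selecting one of several stored TTS engines by speech style amounts to a lookup keyed on the determined style. The sketch below assumes the text has already been obtained from the image data, uses a toy punctuation rule in place of the style determination, and represents each engine as a plain function returning a string; all names are hypothetical.

```python
def determine_style(text):
    """Toy stand-in for style determination (assumption: punctuation-based)."""
    return "excited" if text.rstrip().endswith("!") else "neutral"

def neutral_engine(text):
    """Placeholder TTS engine; returns a label instead of audio."""
    return "neutral-voice:" + text

def excited_engine(text):
    """Placeholder TTS engine; returns a label instead of audio."""
    return "excited-voice:" + text

# The plurality of stored TTS engines, keyed by speech style.
ENGINES = {"neutral": neutral_engine, "excited": excited_engine}

def generate_speech(text, engines=ENGINES):
    """Pick the engine matching the determined style and run it on the text."""
    style = determine_style(text)
    if style not in engines:
        raise KeyError("no TTS engine stored for style: " + style)
    return engines[style](text)

calm_output = generate_speech("Store hours are nine to five.")
loud_output = generate_speech("Sale ends today!")
```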

Audio signal processing method and apparatus
11714596 · 2023-08-01

Disclosed is an operation method of an audio signal processing device configured to process an audio signal including a first audio signal component and a second audio signal component. The operation method includes: receiving the audio signal; normalizing loudness of the audio signal, based on a pre-designated target loudness; acquiring the first audio signal component from the audio signal having the normalized loudness, by using a machine learning model; and de-normalizing loudness of the first audio signal component, based on the pre-designated target loudness.
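The normalize, separate, de-normalize sequence can be sketched as below. Several assumptions are made: loudness is approximated by RMS level in dB rather than a perceptual loudness measure, de-normalization is interpreted as undoing the gain applied during normalization, and the machine learning model is replaced by a placeholder that returns its input unchanged. Function names are hypothetical.

```python
import math

def rms_db(samples):
    """RMS level in dB (assumed proxy for loudness), floored to avoid log(0)."""
    if not samples:
        raise ValueError("samples must not be empty")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(max(rms, 1e-12))

def normalize(samples, target_db):
    """Scale samples to the target level; return the scaled samples and the gain."""
    gain = 10.0 ** ((target_db - rms_db(samples)) / 20.0)
    return [s * gain for s in samples], gain

def denormalize(samples, gain):
    """Undo the normalization gain so the output is at the original scale."""
    return [s / gain for s in samples]

def separate_first_component(samples):
    """Placeholder for the machine learning model (identity in this sketch)."""
    return list(samples)

TARGET_DB = -20.0  # assumed pre-designated target loudness
mixture = [0.1, -0.2, 0.05, 0.3]

normalized, gain = normalize(mixture, TARGET_DB)
component = separate_first_component(normalized)
restored = denormalize(component, gain)
```

Running the model on level-normalized input keeps it within the loudness range it was presumably trained on, while the final step returns the extracted component to the scale of the original signal.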

MACHINE LEARNING MODELS FOR AUTOMATED PROCESSING OF AUDIO WAVEFORM DATABASE ENTRIES
20230238019 · 2023-07-27

A computer system includes memory hardware and processor hardware configured to execute stored instructions. The instructions include training a machine learning model with historical feature vector inputs, including multiple audio data entries and multiple claims data entries, to generate a condition likelihood output indicative of a specified condition associated with one of multiple historical database entities. The instructions include, for each of a set of multiple database entities, generating a feature vector input according to the audio data and the claims data associated with the entity, processing the feature vector input with the machine learning model to generate the condition likelihood output, and assigning the database entity to an identified condition subset in response to determining that the condition likelihood output is greater than a specified likelihood threshold. The instructions include transforming a user interface to display the condition likelihood output associated with the database entity.
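The per-entity scoring and threshold assignment can be illustrated with a short sketch. It assumes a logistic model with hand-picked weights in place of the trained model, and a feature vector formed by concatenating audio and claims features; the entity names, feature values, weights, and threshold are all hypothetical.

```python
import math

def feature_vector(audio_features, claims_features):
    """Build one input vector from an entity's audio and claims features."""
    return list(audio_features) + list(claims_features)

def condition_likelihood(features, weights, bias):
    """Logistic score in (0, 1); stands in for the trained model's output."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def identify_condition_subset(entities, weights, bias, threshold):
    """Return a likelihood per entity and the ids whose likelihood exceeds the threshold."""
    likelihoods = {}
    subset = []
    for entity_id, (audio, claims) in entities.items():
        p = condition_likelihood(feature_vector(audio, claims), weights, bias)
        likelihoods[entity_id] = p
        if p > threshold:
            subset.append(entity_id)
    return likelihoods, subset

# Hypothetical entities: (audio features, claims features).
entities = {
    "entity_a": ([0.9, 0.8], [1.0]),
    "entity_b": ([0.1, 0.2], [0.0]),
}
WEIGHTS = [2.0, 2.0, 1.0]  # assumed, not learned
BIAS = -2.0                # assumed, not learned

likelihoods, condition_subset = identify_condition_subset(
    entities, WEIGHTS, BIAS, threshold=0.5
)
```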

HIERARCHICAL GENERATED AUDIO DETECTION SYSTEM

Disclosed is a hierarchical generated audio detection system comprising an audio preprocessing module, a constant-Q cepstral coefficient (CQCC) feature extraction module, a linear-frequency cepstral coefficient (LFCC) feature extraction module, a first-stage lightweight coarse-level detection model, and a second-stage fine-level deep identification model. The audio preprocessing module preprocesses collected audio or video data to obtain an audio clip whose length does not exceed a limit. The audio clip is input into the CQCC feature extraction module and the LFCC feature extraction module, respectively, to obtain a CQCC feature and an LFCC feature. The CQCC feature or the LFCC feature is input into the first-stage lightweight coarse-level detection model for first-stage screening, which separates first-stage real audio from first-stage generated audio. The CQCC feature or the LFCC feature of the first-stage generated audio is then input into the second-stage fine-level deep identification model to identify second-stage real audio and second-stage generated audio, and the second-stage generated audio is identified as generated audio.
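The two-stage flow, in which only clips flagged by the coarse model reach the fine model, can be sketched as a simple cascade. The two models below are toy threshold rules on a made-up feature list, standing in for the lightweight and deep models; CQCC or LFCC extraction is not implemented, and all names and thresholds are hypothetical.

```python
def coarse_model(features):
    """Toy first-stage screen: flags a clip when the mean feature is high."""
    return "generated" if sum(features) / len(features) > 0.3 else "real"

def fine_model(features):
    """Toy second-stage check: confirms only when the peak feature is high."""
    return "generated" if max(features) > 0.8 else "real"

def hierarchical_detect(features, coarse=coarse_model, fine=fine_model):
    """Run the coarse screen first; call the fine model only on flagged clips."""
    if coarse(features) == "real":
        return "real"
    return fine(features)

# Hypothetical feature lists for three clips.
clip_clear_real = [0.1, 0.2]   # passed as real by the coarse stage
clip_borderline = [0.4, 0.5]   # flagged by coarse, cleared by fine
clip_generated = [0.6, 0.9]    # flagged by both stages

results = [
    hierarchical_detect(clip_clear_real),
    hierarchical_detect(clip_borderline),
    hierarchical_detect(clip_generated),
]
```

The cascade keeps the expensive second-stage model off clips the cheap first-stage model already accepts as real.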