G10L25/54

HIERARCHICAL GENERATED AUDIO DETECTION SYSTEM

Disclosed is a hierarchical generated-audio detection system comprising an audio preprocessing module, a CQCC feature extraction module, an LFCC feature extraction module, a first-stage lightweight coarse-level detection model, and a second-stage fine-level deep identification model. The audio preprocessing module preprocesses collected audio or video data to obtain an audio clip whose length does not exceed a limit. The audio clip is input into the CQCC feature extraction module and the LFCC feature extraction module to obtain a CQCC feature and an LFCC feature, respectively. The CQCC feature or LFCC feature is input into the first-stage lightweight coarse-level detection model for first-stage screening, which separates first-stage real audio from first-stage generated audio. The CQCC feature or LFCC feature of the first-stage generated audio is then input into the second-stage fine-level deep identification model, which distinguishes second-stage real audio from second-stage generated audio, and the second-stage generated audio is identified as generated audio.
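The coarse-then-fine screening described above can be sketched as a simple cascade. This is an illustrative reading, not the patented implementation: the model callables, thresholds, and toy features below are all assumptions.

```python
def two_stage_detect(features, coarse_model, fine_model,
                     coarse_threshold=0.5, fine_threshold=0.5):
    """Hypothetical two-stage cascade: a lightweight coarse model screens
    every clip, and a deeper fine model re-examines only suspected clips."""
    results = []
    for feat in features:
        # Stage 1: lightweight coarse screening on a CQCC or LFCC feature.
        if coarse_model(feat) < coarse_threshold:
            results.append("real")        # first-stage real audio: stop here
            continue
        # Stage 2: fine-grained deep identification of first-stage generated audio.
        verdict = "generated" if fine_model(feat) >= fine_threshold else "real"
        results.append(verdict)
    return results

# Toy stand-in models: scores are simple statistics of the feature vector.
coarse = lambda f: sum(f) / len(f)
fine = max
labels = two_stage_detect([[0.1, 0.2], [0.8, 0.9]], coarse, fine)
```

Only clips flagged by the cheap first stage reach the expensive second stage, which is the point of the hierarchy.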

SYSTEM AND METHOD FOR GENERATING WRAP UP INFORMATION

A system for generating wrap-up information is capable of learning, using natural language processing, how interactions are transformed into contact notes and outcome codes, and can generate contact notes and outcome codes for new incoming interactions by applying prediction models trained on interaction data, contact notes, and outcome codes. The system receives interaction data including interaction audio data, interaction transcripts, associated contact notes, and associated outcome codes. The interaction transcripts are generated from previous interactions between agents and customers; the contact notes and outcome codes are generated by agents during those interactions. The system processes the interaction data to train prediction models that analyze interaction audio data and interaction transcripts and predict appropriate contact notes and outcome codes for an interaction. Once trained, the prediction models can generate appropriate contact notes and outcome codes for new interactions.
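The train-then-predict flow above can be illustrated with a deliberately trivial stand-in for the system's NLP models: a bag-of-words vote counter. All function names, labels, and sample transcripts here are hypothetical.

```python
from collections import Counter

def train_outcome_model(transcripts, outcome_codes):
    """Count word/outcome-code co-occurrences as a toy prediction model."""
    model = {}
    for text, code in zip(transcripts, outcome_codes):
        for word in text.lower().split():
            model.setdefault(word, Counter())[code] += 1
    return model

def predict_outcome(model, transcript):
    """Vote for the outcome code most associated with the transcript's words."""
    votes = Counter()
    for word in transcript.lower().split():
        votes.update(model.get(word, {}))
    return votes.most_common(1)[0][0] if votes else None

# Training pairs stand in for historical interactions wrapped up by agents.
model = train_outcome_model(
    ["customer wants refund", "schedule a callback tomorrow"],
    ["REFUND", "CALLBACK"],
)
code = predict_outcome(model, "please process my refund")
```

A real system would replace the word counter with trained language models, but the interface (historical pairs in, predicted wrap-up codes out) is the same.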

ALERT SYSTEM AND METHOD FOR VIRTUAL REALITY HEADSET

An alert method for a head-mounted display includes: identifying the current user of the head-mounted display; retrieving a speaker recognition profile for the current user; detecting audio using one or more microphones; estimating, based on the retrieved speaker recognition profile, whether the detected audio comprises speech from the current user; and, if not, relaying the detected audio comprising the speech to the current user of the head-mounted display.
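The decision step can be sketched as a small gate: speech that fails speaker verification against the wearer's profile is relayed into the headset. The `verify` and `relay` callables are assumed hooks, not APIs from the patent.

```python
def maybe_relay(audio_clip, current_profile, verify, relay):
    """Relay detected speech into the headset unless it matches the
    wearer's own speaker recognition profile (schematic sketch)."""
    if not verify(audio_clip, current_profile):
        relay(audio_clip)   # someone else is speaking: alert the wearer
        return True
    return False            # wearer's own speech: nothing to relay

# Toy verification: compare a speaker tag against the profile name.
relayed = []
is_wearer = lambda clip, profile: clip["speaker"] == profile["name"]
result = maybe_relay({"speaker": "visitor"}, {"name": "alice"},
                     is_wearer, relayed.append)
```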

LIFELOG DEVICE UTILIZING AUDIO RECOGNITION, AND METHOD THEREFOR

The present invention relates to a lifelog device utilizing audio recognition and a method therefor, and to a device capable of recording and classifying audio lifelogs by means of an artificial intelligence algorithm. To this end, the lifelog device of the present invention comprises: an input unit for inputting lifelog data including an audio signal; an analysis unit for analyzing the input data; a determination unit for classifying the data into a class on the basis of the analysis result; and a recording unit for recording the input data and the classified class of the data.
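The four units form a linear pipeline, which can be sketched as below. The analysis and classification callables are illustrative placeholders for the device's AI algorithm.

```python
def record_lifelog(sample, analyze, classify, log):
    """Pipeline mirroring the described units: input -> analysis ->
    class determination -> recording (all names are illustrative)."""
    analysis = analyze(sample)                     # analysis unit
    label = classify(analysis)                     # determination unit
    log.append({"data": sample, "class": label})   # recording unit
    return label

log = []
label = record_lifelog(
    "audio-frame-001",
    analyze=lambda s: {"energy": 0.8},
    classify=lambda a: "speech" if a["energy"] > 0.5 else "silence",
    log=log,
)
```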

DISFLUENCY REMOVAL USING MACHINE LEARNING
20230020574 · 2023-01-19

A method may include obtaining a voice transcript corpus and a chat transcript corpus, extracting voice transcript sentences from the voice transcript corpus and chat transcript sentences from the chat transcript corpus, encoding, by a series of neural network layers, the voice transcript sentences to generate voice sentence vectors, encoding, by the series of neural network layers, the chat transcript sentences to generate chat sentence vectors, determining, for each voice sentence vector, a matching chat sentence vector to obtain matching voice-chat vector pairs, and adding, to a parallel corpus, matching voice-chat sentence pairs using the matching voice-chat vector pairs. Each of the matching voice-chat sentence pairs may include a voice transcript sentence and a matching chat transcript sentence. The method may further include training a disfluency remover model using the parallel corpus.
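The vector-matching step can be sketched as a nearest-neighbor search over encoder outputs. Cosine similarity is one plausible matching criterion, assumed here; the toy 2-D vectors stand in for the neural sentence embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_voice_to_chat(voice_vecs, chat_vecs):
    """For each voice sentence vector, pick the most similar chat sentence
    vector; the index pairs then yield matching voice-chat sentence pairs
    for the parallel corpus."""
    pairs = []
    for i, v in enumerate(voice_vecs):
        j = max(range(len(chat_vecs)), key=lambda k: cosine(v, chat_vecs[k]))
        pairs.append((i, j))
    return pairs

voice = [[1.0, 0.0], [0.0, 1.0]]   # stand-ins for encoded voice sentences
chat = [[0.9, 0.1], [0.1, 0.9]]    # stand-ins for encoded chat sentences
pairs = match_voice_to_chat(voice, chat)
```

Each matched pair aligns a disfluent voice sentence with a clean chat sentence, giving the supervision signal for the disfluency remover model.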

RECOMMENDATION OF AUDIO BASED ON VIDEO ANALYSIS USING MACHINE LEARNING
20230019025 · 2023-01-19

An electronic device and method for recommendation of audio based on video analysis is provided. The electronic device receives one or more frames of a first scene of a plurality of scenes of a video. The first scene includes a set of objects. The electronic device applies a trained neural network model on the received one or more frames to detect the set of objects. The electronic device determines an impact score of each object of the detected set of objects of the first scene based on the application of the trained neural network model on the set of objects. The electronic device further selects at least one first object from the set of objects based on the impact score of each object, and recommends one or more first audio tracks as a sound effect for the first scene based on the selected at least one first object.
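The selection-and-recommendation step above can be sketched as ranking detected objects by impact score and looking up candidate tracks for the top ones. The object names, scores, and track index below are illustrative, not data from the patent.

```python
def recommend_tracks(scene_objects, impact_scores, track_index, top_k=1):
    """Select the highest-impact detected objects and look up candidate
    sound-effect tracks for them (schematic sketch)."""
    # Rank detected objects by their impact score, highest first.
    ranked = sorted(scene_objects, key=lambda o: impact_scores[o], reverse=True)
    selected = ranked[:top_k]
    # Recommend any tracks indexed under the selected objects.
    recommendations = []
    for obj in selected:
        recommendations.extend(track_index.get(obj, []))
    return selected, recommendations

objects = ["car", "dog", "tree"]
scores = {"car": 0.9, "dog": 0.7, "tree": 0.2}
index = {"car": ["engine_rev.wav"], "dog": ["bark.wav"]}
selected, recs = recommend_tracks(objects, scores, index)
```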

AUDIO MATCHING

An audio matching technique generates audio fingerprints from a captured audio signal. Coarse and fine fingerprints are generated from the captured audio. The coarse fingerprint is matched against a set of coarse fingerprints stored in a database to identify a subset of possibly matching database entries. The fine fingerprint is then used to perform a detailed comparison with the fine fingerprints associated with that subset in order to find a match for the captured audio signal.
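The two-pass lookup can be sketched as follows. The fingerprint encodings (a short coarse key, a fine bit string compared by Hamming distance) are assumptions for illustration, not the patented fingerprint format.

```python
def match_fingerprint(coarse_fp, fine_fp, db):
    """Two-pass lookup: a cheap coarse fingerprint shortlists candidates,
    then fine fingerprints decide the final match (schematic sketch)."""
    # Pass 1: coarse match narrows the database to a candidate subset.
    candidates = [entry for entry in db if entry["coarse"] == coarse_fp]

    # Pass 2: detailed comparison on fine fingerprints via Hamming distance.
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    best = min(candidates, key=lambda e: hamming(e["fine"], fine_fp), default=None)
    return best["id"] if best else None

db = [
    {"id": "song-a", "coarse": "AB", "fine": "110010"},
    {"id": "song-b", "coarse": "AB", "fine": "001101"},
    {"id": "song-c", "coarse": "CD", "fine": "111111"},
]
result = match_fingerprint("AB", "110011", db)
```

The coarse pass keeps the expensive fine comparison off most of the database, which is what makes the scheme scale.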