G10L17/02

NEURAL NETWORK-BASED SIGNAL PROCESSING APPARATUS, NEURAL NETWORK-BASED SIGNAL PROCESSING METHOD, AND COMPUTER-READABLE STORAGE MEDIUM

A spoofing detection apparatus 100 includes a multi-channel spectrogram creation unit 10 and an evaluation unit 40. The multi-channel spectrogram creation unit 10 extracts different types of spectrograms from speech data and integrates them to create a multi-channel spectrogram. The evaluation unit 40 evaluates the created multi-channel spectrogram by applying it to a classifier constructed using labeled multi-channel spectrograms as training data, and classifies it as either genuine or spoof.
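The channel-stacking step can be illustrated with a minimal sketch. The abstract does not say which spectrogram types are used, so the two channels here (magnitude and log-power) are assumptions, as are the function names `stft` and `multi_channel_spectrogram` and the frame and hop sizes. The classifier is not shown.

```python
import cmath
import math

def stft(signal, frame_len=8, hop=4):
    """Naive short-time DFT; returns a frames x bins list of complex values."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        bins = []
        for k in range(frame_len // 2 + 1):
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n, x in enumerate(frame))
            bins.append(acc)
        frames.append(bins)
    return frames

def multi_channel_spectrogram(signal):
    """Stack two spectrogram types (assumed: magnitude, log-power) as channels."""
    spec = stft(signal)
    magnitude = [[abs(c) for c in row] for row in spec]
    log_power = [[math.log(abs(c) ** 2 + 1e-10) for c in row] for row in spec]
    return [magnitude, log_power]  # layout: channels x frames x bins

# Toy input: a sinusoid that falls exactly on DFT bin 2 of an 8-sample frame.
signal = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
mcs = multi_channel_spectrogram(signal)
```

The resulting channels x frames x bins layout is the kind of input an image-style classifier would typically accept; which classifier the apparatus uses is not stated in the abstract.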

Conference Data Processing Method and Related Device
20220335949 · 2022-10-20

In a conference data processing method, a conference terminal collects audio segments at a first conference site based on sound source direction during a conference, generates first additional information for each collected audio segment, and sends the recorded conference audio together with the first additional information to a conference information processing device. The conference information processing device segments the conference audio into a plurality of audio segments and attaches second additional information to each of them, where the second additional information for each audio segment includes information used to determine the speaker identity for that segment and identification information of that segment. The conference information processing device then generates a correspondence between participants and statements based on the first additional information and the second additional information.

Biometrics-Infused Dynamic Knowledge-Based Authentication Tool
20220335433 · 2022-10-20

Aspects described herein may use behavioral biometric data to authenticate an individual who requests performance of an action related to a financial account. In response to the request, challenge questions relating to recent transactions conducted with the financial account may be generated. The challenge questions may be provided to the individual and may prompt the individual for audible responses and/or touch input responses. Behavioral biometric data may be extracted from the responses and may be used to determine a likelihood that the individual is an authorized user of the account.

SPEECH EMBEDDING APPARATUS, AND METHOD
20230109177 · 2023-04-06

A frame processor 81 calculates, from a first sequence of feature vectors, a second sequence of frame-level feature vectors. A posterior estimator 82 calculates, for each vector included in the second sequence, posterior probabilities of that vector belonging to each cluster. A statistics calculator 83 calculates a sufficient statistic used for extracting an i-vector by using the second sequence, the posterior probabilities, a mean vector of each cluster calculated at the time of learning of the frame processor 81 and the posterior estimator 82, and a global covariance matrix calculated based on the mean vector.
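As a rough illustration of the statistics step, the sketch below accumulates the zeroth-order and centered first-order statistics commonly used in i-vector extraction. This is the conventional formulation, not necessarily the one in the application: the function name `sufficient_statistics` and the toy values are hypothetical, and the normalization by the global covariance matrix mentioned in the abstract is omitted.

```python
def sufficient_statistics(features, posteriors, means):
    """Accumulate per-cluster statistics from frame-level features.

    features:   list of frame vectors x_t
    posteriors: list of per-frame cluster posteriors gamma_t(c)
    means:      list of cluster mean vectors m_c
    Returns (zeroth, first) where
      zeroth[c] = sum_t gamma_t(c)
      first[c]  = sum_t gamma_t(c) * (x_t - m_c)
    """
    num_clusters = len(means)
    dim = len(means[0])
    zeroth = [0.0] * num_clusters
    first = [[0.0] * dim for _ in range(num_clusters)]
    for x, gamma in zip(features, posteriors):
        for c in range(num_clusters):
            zeroth[c] += gamma[c]
            for d in range(dim):
                first[c][d] += gamma[c] * (x[d] - means[c][d])
    return zeroth, first

# Toy example: two 2-dimensional frames, two clusters (values are illustrative).
features = [[1.0, 2.0], [3.0, 4.0]]
posteriors = [[1.0, 0.0], [0.5, 0.5]]
means = [[1.0, 1.0], [3.0, 3.0]]
zeroth, first = sufficient_statistics(features, posteriors, means)
```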

UTILIZING MACHINE LEARNING MODELS TO PROVIDE COGNITIVE SPEAKER FRACTIONALIZATION WITH EMPATHY RECOGNITION

A device may receive audio data identifying a plurality of speakers and may process the audio data, with a plurality of clustering models, to identify a plurality of speaker segments. The device may determine a plurality of diarization error rates for the plurality of speaker segments and may identify a plurality of errors in the plurality of speaker segments. The device may select rectification models to rectify the plurality of errors and may segment and/or re-segment the audio data with the rectification models to generate re-segmented audio data. The device may determine a plurality of modified diarization error rates for the plurality of speaker segments based on the re-segmented audio data and may select one of the plurality of speaker segments based on the plurality of modified diarization error rates. The device may calculate an empathy score based on the selected speaker segment and may perform actions based on the empathy score.
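The selection step can be sketched as scoring each candidate segmentation and keeping the one with the lowest error. The sketch below uses a simplified frame-level error rate as a stand-in for the diarization error rate: it assumes a reference labeling is available and that speaker labels are already mapped between reference and hypothesis, neither of which the abstract states. The names `frame_error_rate`, `select_best`, `model_1`, and `model_2` are hypothetical.

```python
def frame_error_rate(reference, hypothesis):
    """Fraction of frames whose speaker label differs from the reference."""
    errors = sum(1 for r, h in zip(reference, hypothesis) if r != h)
    return errors / len(reference)

def select_best(reference, candidates):
    """Score every candidate segmentation and return the lowest-error one."""
    rates = {name: frame_error_rate(reference, hyp)
             for name, hyp in candidates.items()}
    best = min(rates, key=rates.get)
    return best, rates

# Per-frame speaker labels; one list per (hypothetical) clustering model.
reference = ["A", "A", "B", "B", "A", "B"]
candidates = {
    "model_1": ["A", "A", "A", "B", "A", "B"],
    "model_2": ["A", "B", "A", "B", "B", "B"],
}
best, rates = select_best(reference, candidates)
```

A full diarization error rate also accounts for missed speech, false alarms, and an optimal speaker mapping, which this sketch leaves out.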

HYPOTHESIS STITCHER FOR SPEECH RECOGNITION OF LONG-FORM AUDIO

A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting an audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis. Multiple variations are disclosed, including alignment-based stitchers and serialized stitchers, which may operate as speaker-specific stitchers or multi-speaker stitchers, and may further support multiple options for differing hypothesis configurations.
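The merge-and-insert step might look like the following minimal sketch, which concatenates per-window ASR hypotheses and places a window change symbol at each window boundary. The token spelling `<WC>` and the function name `merge_with_window_change` are assumptions; the network-based stitcher that would consume this sequence is not shown.

```python
WC = "<WC>"  # hypothetical spelling of the window change symbol

def merge_with_window_change(hypotheses):
    """Concatenate per-window word sequences, marking each window boundary."""
    merged = []
    for i, hyp in enumerate(hypotheses):
        if i > 0:
            merged.append(WC)  # boundary between window i-1 and window i
        merged.extend(hyp.split())
    return merged

# Two overlapping windows that both recognized the word "are".
merged = merge_with_window_change(["hello how are", "are you today"])
```

Keeping the duplicated word on both sides of the boundary leaves it to the downstream stitcher to decide how the overlap should be consolidated.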