G10L17/08

Method for Speaker Diarization

Disclosed is a speaker diarization process for determining which speaker is speaking at what time during a conversation. The process comprises five main parts: segmentation, in which speech/non-speech decisions are made; frame feature extraction, in which useful information is obtained from the frames; segment modeling, in which the extracted frame features are combined with segment start and end times to create segment-specific features; speaker decisions, in which the segments are clustered to create speaker models; and corrections, in which frame-level corrections are applied to the extracted information.
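The five-part pipeline above can be sketched as follows. This is an illustrative toy implementation, not the patented method: the energy gate, mean segment models, and naive k-means clustering are all assumptions, and `frames` is simply a 2-D array of per-frame feature vectors.

```python
import numpy as np

def diarize(frames, energy_threshold=0.01, n_speakers=2):
    """Toy sketch of the five-stage diarization pipeline (assumptions only)."""
    # 1. Segmentation: speech/non-speech decision per frame (energy gate).
    speech = np.linalg.norm(frames, axis=1) > energy_threshold
    # 2. Frame feature extraction: here the frames already are feature vectors.
    feats = frames[speech]
    # 3. Segment modeling: group contiguous speech frames into segments and
    #    summarize each segment by its mean feature vector.
    idx = np.flatnonzero(speech)
    breaks = np.where(np.diff(idx) > 1)[0] + 1
    segments = np.split(np.arange(len(feats)), breaks)
    seg_models = np.array([feats[s].mean(axis=0) for s in segments])
    # 4. Speaker decisions: cluster segment models (naive k-means).
    centroids = seg_models[:n_speakers].copy()
    for _ in range(10):
        labels = np.argmin(
            np.linalg.norm(seg_models[:, None] - centroids[None], axis=2), axis=1)
        for k in range(n_speakers):
            if np.any(labels == k):
                centroids[k] = seg_models[labels == k].mean(axis=0)
    # 5. Corrections: frame-level smoothing would be applied here (omitted).
    return [(idx[s[0]], idx[s[-1]], int(l)) for s, l in zip(segments, labels)]
```

Two bursts of speech separated by silence come out as two segments assigned to different speakers.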

System and method for detecting synthetic speaker verification
09812133 · 2017-11-07

Disclosed herein are systems, methods, and tangible computer-readable media for detecting synthetic speaker verification. The method comprises receiving a plurality of speech samples of the same word or phrase for verification, comparing each of the plurality of speech samples to each other, denying verification if the plurality of speech samples demonstrates little variance over time or the samples are the same, and verifying the plurality of speech samples if the plurality of speech samples demonstrates sufficient variance over time. One embodiment further adds that each of the plurality of speech samples is collected at different times or in different contexts. In other embodiments, variance is based on a pre-determined threshold, or the threshold for variance is adjusted based on a need for authentication certainty. In another embodiment, if the initial comparison is inconclusive, additional speech samples are received.
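The variance test described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the pairwise mean-squared-difference measure and the threshold value are invented here, and `samples` is a list of equal-length feature sequences, not raw audio.

```python
import statistics

def verify_liveness(samples, min_variance=1e-4):
    """Sketch of the variance check: identical or near-identical repeated
    samples suggest a replayed or synthetic voice (assumed metric)."""
    # Mean squared difference between every pair of samples.
    diffs = []
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            d = statistics.fmean(
                (a - b) ** 2 for a, b in zip(samples[i], samples[j]))
            diffs.append(d)
    variance = statistics.fmean(diffs) if diffs else 0.0
    if variance < min_variance:
        return "denied"    # too similar across takes: likely synthetic
    return "verified"      # natural variation between takes
```

Identical repeated samples are denied; naturally varying takes of the same phrase pass.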

Method and system for fraud clustering by content and biometrics analysis

A computer-implemented method for proactive fraudster exposure in a customer service center according to content analysis and voice biometrics analysis is provided herein. The computer-implemented method includes: (i) performing a first type analysis to cluster the call interactions into ranked clusters and storing the ranked clusters in a clusters database; (ii) performing a second type analysis on a predefined number of the highest-ranked clusters to produce re-ranked clusters and storing them; the first type analysis is a content analysis and the second type analysis is a voice biometrics analysis, or vice versa; (iii) retrieving from the ranked clusters a list of potential fraudsters; and (iv) transmitting the list of potential fraudsters to an application to display said list to a user via a display unit.
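The two-stage ranking in steps (i)–(iii) can be sketched as follows. Everything here is an illustrative assumption: the scoring callbacks, the one-interaction-per-cluster simplification, and the `top_n` cutoff are not from the abstract.

```python
def rank_fraud_clusters(interactions, content_score, biometric_score, top_n=3):
    """Sketch: a first (content) analysis ranks clusters, then a second
    (biometric) analysis re-ranks only the top-N of them."""
    # Stage 1: content analysis -> ranked clusters (highest score first).
    stage1 = sorted(interactions, key=content_score, reverse=True)
    # Stage 2: biometric analysis applied only to the top-N clusters.
    stage2 = sorted(stage1[:top_n], key=biometric_score, reverse=True)
    # Retrieve the list of potential fraudsters from the re-ranked clusters.
    return [c["caller"] for c in stage2]
```

Note that a caller who scores low in the first analysis never reaches the second, more expensive analysis, which is the point of the two-stage design.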

SPEECH PROCESSING DEVICE, SPEECH PROCESSING METHOD, AND NON-TRANSITORY COMPUTER READABLE MEDIUM STORING PROGRAM
20220238097 · 2022-07-28

A speech processing device includes: first segment means for dividing first speech into a plurality of first speech segments; second segment means for dividing second speech into a plurality of second speech segments; primary speaker recognition means for calculating scores indicating similarities between the plurality of first and second speech segments; threshold value calculation means for calculating a threshold value based on scores indicating similarities between the plurality of first speech segments; speaker clustering means for classifying each of the plurality of second speech segments into one or more clusters having a similarity higher than the similarity indicated by the threshold value; and secondary speaker recognition means for calculating a similarity between each of the one or more clusters and the first speech and determining based on a result of the calculation whether speech corresponding to the first speech is contained in any of the one or more clusters.
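The flow above (threshold from within-first-speech scores, clustering of the second speech, then a secondary decision) can be sketched roughly as below. The cosine similarity, the mean-based threshold, the single-cluster simplification, and the final threshold value are all assumptions for illustration.

```python
import statistics

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den

def target_speaker_present(first_segs, second_segs, final_threshold=0.9):
    """Rough sketch of the device's pipeline; metrics and thresholds assumed.
    first_segs: segments of the enrolled (first) speech;
    second_segs: segments of the test (second) speech."""
    # Threshold calculation: similarities among the first speech segments.
    within = [cosine(a, b) for i, a in enumerate(first_segs)
              for b in first_segs[i + 1:]]
    threshold = statistics.fmean(within)
    enrol = [statistics.fmean(col) for col in zip(*first_segs)]  # mean model
    # Speaker clustering: keep second segments whose similarity to the
    # enrolment model exceeds the threshold (single-cluster simplification).
    cluster = [s for s in second_segs if cosine(s, enrol) > threshold]
    if not cluster:
        return False
    # Secondary recognition: compare the cluster centroid to the first speech.
    centroid = [statistics.fmean(col) for col in zip(*cluster)]
    return cosine(centroid, enrol) > final_threshold
```

The key idea preserved here is that the decision threshold is not fixed but derived from how similar the first speech's own segments are to each other.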

VOICE AND SPEECH RECOGNITION FOR CALL CENTER FEEDBACK AND QUALITY ASSURANCE

A computer-implemented method for providing an objective evaluation to a customer service representative regarding his performance during an interaction with a customer may include receiving a digitized data stream corresponding to a spoken conversation between a customer and a representative; converting the data stream to a text stream; generating a representative transcript that includes the words from the text stream that are spoken by the representative; comparing the representative transcript with a plurality of positive words and a plurality of negative words; and generating a score that varies according to the occurrence of each word spoken by the representative that matches one of the positive words, and/or the occurrence of each word spoken by the representative that matches one of the negative words. Tone of voice, as well as response time, during the interaction may also be monitored and analyzed to adjust the score, or generate a separate score.
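The word-matching score in the abstract above reduces to a simple count. The word lists and the +1/−1 weighting below are illustrative assumptions; the abstract does not specify particular words or weights.

```python
POSITIVE = {"thank", "please", "happy", "glad", "welcome"}
NEGATIVE = {"cannot", "unfortunately", "no", "never", "impossible"}

def score_representative(transcript, positive=POSITIVE, negative=NEGATIVE):
    """Sketch of the positive/negative word-matching score (assumed lists)."""
    score = 0
    for w in transcript.lower().split():
        if w in positive:
            score += 1      # each positive-word match raises the score
        elif w in negative:
            score -= 1      # each negative-word match lowers the score
    return score
```

Per the abstract, tone of voice and response time would then adjust this score or produce a separate one; those signals are omitted here.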

User authentication by subvocalization of melody singing

A computing device (300) for authenticating a user (110), such as a mobile phone, a smartphone, a tablet, or the like, is provided. The computing device is operative to acquire a representation of a melody generated by the user, and authenticate the user in response to determining that the acquired representation of the user-generated melody and a representation of a reference melody fulfil a similarity condition. The user-generated melody may either be vocalized or subvocalized. If the melody is vocalized, the representation is derived from audio data captured by a microphone (102). If the melody is subvocalized, the representation is derived from nerve signals captured by sensors attached to the throat (111) of the user, or from a video sequence acquired from a camera (103), the video sequence capturing one or more body parts (111-115) of the user subvocalizing the melody, by magnifying motions of the one or more body parts which are correlated with the subvocalized melody.
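A similarity condition between the user-generated melody and the reference melody might look like the sketch below, which compares two pitch contours. The transposition normalization (so the user need not sing in the original key) and the error threshold are assumptions; the patent does not specify the condition.

```python
def melody_matches(user_pitch, reference_pitch, max_mean_error=1.0):
    """Toy similarity condition: compare semitone contours, ignoring
    overall transposition (assumed normalization and threshold)."""
    if len(user_pitch) != len(reference_pitch):
        return False
    # Remove transposition: compare interval shapes, not absolute pitch.
    u0, r0 = user_pitch[0], reference_pitch[0]
    errors = [abs((u - u0) - (r - r0))
              for u, r in zip(user_pitch, reference_pitch)]
    return sum(errors) / len(errors) <= max_mean_error
```

The same condition applies whether the contour came from microphone audio, throat sensors, or motion-magnified video, since all three paths yield a representation of the melody.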

METHOD, SYSTEM, AND NON-TRANSITORY COMPUTER READABLE RECORD MEDIUM FOR SPEAKER DIARIZATION COMBINED WITH SPEAKER IDENTIFICATION

Provided is a method, system, and non-transitory computer-readable record medium for speaker diarization combined with speaker identification. The speaker diarization method includes setting a reference speech in relation to an audio file received from a client as a speaker diarization target; performing speaker identification to identify the speaker of the reference speech in the audio file using the reference speech; and performing speaker diarization using clustering on the remaining utterance sections of the audio file left unidentified.
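The combined flow above can be sketched as two passes: identify the reference speaker's utterances first, then cluster what remains. The dot-product similarity, the shared threshold, and the greedy clustering are illustrative assumptions, with each utterance represented by a feature vector.

```python
def diarize_with_identification(reference, utterances, id_threshold=0.8):
    """Sketch of diarization combined with identification (assumptions only):
    step 1 matches utterances to the reference speech, step 2 clusters
    the remaining, unidentified utterances."""
    def sim(a, b):
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return sum(x * y for x, y in zip(a, b)) / (na * nb)

    # Step 1: speaker identification against the reference speech.
    identified = [u for u in utterances if sim(u, reference) >= id_threshold]
    remaining = [u for u in utterances if sim(u, reference) < id_threshold]
    # Step 2: speaker diarization by greedy clustering of the rest.
    clusters = []
    for u in remaining:
        for c in clusters:
            if sim(u, c[0]) >= id_threshold:
                c.append(u)
                break
        else:
            clusters.append([u])
    return identified, clusters
```

Anchoring the known speaker first shrinks the clustering problem, which is the advantage of combining identification with diarization.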