G10L17/02

Utilizing machine learning models to provide cognitive speaker fractionalization with empathy recognition

A device may receive audio data identifying a plurality of speakers and may process the audio data, with a plurality of clustering models, to identify a plurality of speaker segments. The device may determine a plurality of diarization error rates for the plurality of speaker segments and may identify a plurality of errors in the plurality of speaker segments. The device may select rectification models to rectify the plurality of errors and may segment and/or re-segment the audio data with the rectification models to generate re-segmented audio data. The device may determine a plurality of modified diarization error rates for the plurality of speaker segments based on the re-segmented audio data and may select one of the plurality of speaker segments based on the plurality of modified diarization error rates. The device may calculate an empathy score based on the selected speaker segment and may perform actions based on the empathy score.
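
The error-rate comparison described in this abstract can be illustrated with a minimal sketch. It makes several assumptions the abstract does not state: a reference labeling is available, audio is split into fixed-size frames, speaker labels in each candidate are already mapped to the reference labels, and `None` marks a non-speech frame. The names `frame_der` and `select_best` are hypothetical.

```python
from typing import Dict, List, Optional

Label = Optional[str]  # None marks a non-speech frame (assumption of this sketch)


def frame_der(reference: List[Label], hypothesis: List[Label]) -> float:
    """Simplified frame-level diarization error rate.

    Counts false alarms, missed speech, and speaker confusions, then
    divides by the number of reference speech frames.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must cover the same frames")
    speech_frames = sum(1 for ref in reference if ref is not None)
    if speech_frames == 0:
        raise ValueError("reference contains no speech frames")

    errors = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is None and hyp is not None:
            errors += 1  # false alarm: speech reported where there is none
        elif ref is not None and hyp is None:
            errors += 1  # missed speech
        elif ref is not None and hyp != ref:
            errors += 1  # speaker confusion
    return errors / speech_frames


def select_best(reference: List[Label], candidates: Dict[str, List[Label]]) -> str:
    """Return the name of the candidate segmentation with the lowest error rate."""
    return min(candidates, key=lambda name: frame_der(reference, candidates[name]))
```

Calling `select_best` with one candidate labeling per clustering model returns the model whose segmentation has the lowest error rate under these assumptions.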

METHOD AND ELECTRONIC DEVICE FOR IMPROVING AUDIO QUALITY

An electronic device for improving audio quality includes: a microphone configured to obtain an audio input including a voice; at least one memory; and at least one processor. The at least one processor is configured to execute one or more instructions stored in the at least one memory to: obtain a first voice fingerprint corresponding to the obtained audio input; obtain a second voice fingerprint corresponding to the voice; estimate, based on the first voice fingerprint and the second voice fingerprint, noise caused by an acoustic environment of the obtained audio input; and remove the estimated noise from the obtained audio input.

Speech recognition method, electronic device, and computer storage medium

A speech recognition method includes segmenting captured voice information to obtain a plurality of voice segments, and extracting voiceprint information of the voice segments; matching the voiceprint information of the voice segments with a first stored voiceprint information to determine a set of filtered voice segments having voiceprint information that successfully matches the first stored voiceprint information; combining the set of filtered voice segments to obtain combined voice information, and determining combined semantic information of the combined voice information; and using the combined semantic information as a speech recognition result when the combined semantic information satisfies a preset rule.
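
The filter-and-combine step in this abstract might look like the following sketch. It is an illustration under stated assumptions, not the claimed method: voiceprints are treated as plain numeric vectors, matching uses cosine similarity against a hypothetical threshold of 0.8, and each segment is represented by already-transcribed text so that combining segments is a simple join. The names `cosine_similarity` and `filter_and_combine` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_and_combine(
    segments: List[Tuple[str, Sequence[float]]],
    stored_voiceprint: Sequence[float],
    threshold: float = 0.8,  # hypothetical value, not taken from the abstract
) -> str:
    """Keep segments whose voiceprint matches the stored one, then join them in order."""
    kept = [
        text
        for text, voiceprint in segments
        if cosine_similarity(voiceprint, stored_voiceprint) >= threshold
    ]
    return " ".join(kept)
```

In this sketch, segments spoken by anyone other than the enrolled speaker are dropped before the combined text is checked against whatever preset rule applies.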

Detection of replay attack
11704397 · 2023-07-18

In order to detect a replay attack in a speaker recognition system, at least one feature is identified in a detected magnetic field. It is then determined whether the at least one identified feature of the detected magnetic field is indicative of playback of speech through a loudspeaker. If so, it is determined that a replay attack may have taken place.

System and method for predicting intelligent voice assistant content

A method includes receiving an incoming call from a calling device of a caller and determining identification information for the calling device. The method also includes receiving voice audio data of the caller from the calling device, converting the voice audio data to caller phones, and identifying a customer account associated with the identification information. The method further includes obtaining user phones for multiple candidate users associated with the identified customer account, comparing the caller phones to the user phones for the multiple candidate users, and determining the identity of the caller based on the comparison.
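
The phone comparison in this abstract can be sketched as follows. The abstract does not say how phones are compared, so this example assumes each phone sequence is a list of symbol strings and swaps in the similarity ratio from Python's standard `difflib.SequenceMatcher`. The function names and the `min_score` cutoff of 0.6 are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional


def phone_similarity(a: List[str], b: List[str]) -> float:
    """Similarity in [0, 1] between two phone sequences."""
    return SequenceMatcher(None, a, b).ratio()


def identify_caller(
    caller_phones: List[str],
    candidates: Dict[str, List[str]],
    min_score: float = 0.6,  # hypothetical cutoff, not taken from the abstract
) -> Optional[str]:
    """Return the candidate whose phones best match the caller, or None if no match is close enough."""
    if not candidates:
        return None
    best = max(candidates, key=lambda user: phone_similarity(caller_phones, candidates[user]))
    if phone_similarity(caller_phones, candidates[best]) < min_score:
        return None
    return best
```

Restricting the comparison to the candidate users on one customer account keeps the search small, which is why a simple pairwise score is plausible in a sketch like this.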

STREAMING DATA PROCESSING FOR HYBRID ONLINE MEETINGS
20230231973 · 2023-07-20

Techniques of streaming data processing for hybrid online meetings are disclosed herein. In one example, a method includes receiving, at a remote server, a video stream captured by a camera in a conference room. The video stream captures images of multiple local participants of an online meeting. The method also includes determining identities of the captured images of the multiple local participants in the received video stream using meeting information of the online meeting and generating a set of individual video streams each corresponding to one of the multiple local participants. The set of individual video streams can then be transmitted to a computing device corresponding to a remote participant of the online meeting as if the multiple local participants were virtually joining the online meeting.

MULTI-REGISTER-BASED SPEECH DETECTION METHOD AND RELATED APPARATUS, AND STORAGE MEDIUM

This application discloses a multi-sound area-based speech detection method and related apparatus, and a storage medium, which are applied to the field of artificial intelligence. The method includes: obtaining sound area information corresponding to each sound area in N sound areas; using each sound area as a target detection sound area, and generating a control signal corresponding to the target detection sound area according to sound area information corresponding to the target detection sound area; processing a speech input signal corresponding to the target detection sound area by using the control signal corresponding to the target detection sound area, to obtain a speech output signal corresponding to the target detection sound area; and generating a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area. Speech signals in different directions are processed in parallel based on a plurality of sound areas, so that in a multi-sound source scenario, the speech signals in different directions may be retained or suppressed by a control signal, to separate and enhance speech of a target detection user in real time, thereby improving the accuracy of speech detection.
