Patent classifications
G10L17/20
Speaker recognition with assessment of audio frame contribution
This application describes methods and apparatus for speaker recognition. An apparatus according to an embodiment has an analyzer for analyzing each frame of a sequence of frames of audio data which correspond to speech sounds uttered by a user to determine at least one characteristic of the speech sound of that frame. An assessment module determines, for each frame of audio data, a contribution indicator of the extent to which that frame of audio data should be used for speaker recognition processing based on the determined characteristic of the speech sound. Said contribution indicator comprises a weighting to be applied to each frame in the speaker recognition processing. In this way frames which correspond to speech sounds that are of most use for speaker discrimination may be emphasized and/or frames which correspond to speech sounds that are of least use for speaker discrimination may be de-emphasized.
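The weighting idea in this abstract can be illustrated with a minimal sketch. It assumes each frame has already been labeled with a speech-sound class and given a per-frame speaker score. The class names, the weight values in `CLASS_WEIGHTS`, and both function names are hypothetical and are not taken from the application.

```python
from typing import Dict, List, Sequence

# Hypothetical contribution weights per speech-sound class (illustrative values only).
CLASS_WEIGHTS: Dict[str, float] = {
    "vowel": 1.0,
    "nasal": 0.8,
    "plosive": 0.2,
    "silence": 0.0,
}


def contribution_weights(frame_classes: Sequence[str]) -> List[float]:
    """Map each frame's sound class to a contribution weight (0.5 if unknown)."""
    return [CLASS_WEIGHTS.get(c, 0.5) for c in frame_classes]


def weighted_score(frame_scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-frame scores; frames with weight 0 are ignored."""
    if len(frame_scores) != len(weights):
        raise ValueError("frame_scores and weights must have the same length")
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(s * w for s, w in zip(frame_scores, weights)) / total


classes = ["vowel", "silence", "plosive"]
scores = [2.0, -5.0, 1.0]
result = weighted_score(scores, contribution_weights(classes))
```

Here the silent frame has no influence on the result, and the plosive frame counts for one fifth as much as the vowel frame.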
Electronic apparatus and method of controlling the same
An electronic apparatus includes an interface configured to connect with an external apparatus, and a processor. The processor is configured to, in response to a first user speech received by the electronic apparatus including a trigger word, identify a first noise level corresponding to the first user speech received by the electronic apparatus. The processor is configured to identify a first recognition apparatus among a plurality of recognition apparatuses having a highest first noise level corresponding to the first user speech. The plurality of recognition apparatuses identify the first user speech as the trigger word and include the electronic apparatus and the external apparatus. The processor is configured to perform a control operation to implement a function corresponding to a second user speech in response to identifying a second recognition apparatus as having a highest second noise level corresponding to the second user speech among the plurality of recognition apparatuses.
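One way to picture the selection step is a small arbitration routine, sketched below. It assumes every apparatus that recognized the trigger word reports a single numeric level for the utterance, and that the apparatus with the highest reported level is the one selected. The device names and the functions `select_apparatus` and `should_respond` are hypothetical, and how the level is measured is not specified here.

```python
from typing import Dict


def select_apparatus(levels: Dict[str, float]) -> str:
    """Return the id of the apparatus that reported the highest level."""
    if not levels:
        raise ValueError("no apparatus reported a level")
    return max(levels, key=lambda device_id: levels[device_id])


def should_respond(self_id: str, levels: Dict[str, float]) -> bool:
    """True if this apparatus is the one selected for the utterance."""
    return select_apparatus(levels) == self_id


# Hypothetical levels reported for one utterance.
reported = {"tv": 0.4, "speaker": 0.7}
chosen = select_apparatus(reported)
```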
Method and apparatus for establishing voiceprint model, computer device, and storage medium
A method and apparatus for establishing a voiceprint model, a computer device, and a storage medium are described herein. The method includes: collecting speech acoustic features in a speech signal to form a plurality of cluster structures; calculating an average value and a standard deviation of the plurality of cluster structures and then performing coordinate transformation and activation function calculation to obtain a feature vector; and obtaining a voiceprint model based on the feature vector.
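The pooling and transformation steps can be sketched as follows, under several assumptions: each cluster structure is represented as a fixed-length vector, the "coordinate transformation" is treated as an affine map, and the activation function is taken to be ReLU. None of these choices is stated in the abstract, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_std(vectors: Sequence[Vector]) -> Tuple[Vector, Vector]:
    """Per-dimension mean and (population) standard deviation."""
    n = len(vectors)
    dim = len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((v[d] - means[d]) ** 2 for v in vectors) / n)
        for d in range(dim)
    ]
    return means, stds


def affine_relu(x: Vector, weights: Sequence[Vector], bias: Vector) -> Vector:
    """Affine map followed by ReLU (assumed transformation and activation)."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def feature_vector(clusters: Sequence[Vector], weights: Sequence[Vector], bias: Vector) -> Vector:
    """Concatenate mean and standard deviation, then transform and activate."""
    means, stds = mean_std(clusters)
    return affine_relu(means + stds, weights, bias)


clusters = [[1.0, 2.0], [3.0, 6.0]]
weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, -1.0]]
bias = [0.0, 1.0]
fv = feature_vector(clusters, weights, bias)
```

In a trained system the weights and bias would be learned; the values above are placeholders chosen only to make the arithmetic easy to follow.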
Features search and selection techniques for speaker and speech recognition
With recent real-world applications of speaker and speech recognition systems, robust features for degraded speech have become a necessity. In general, degraded speech results in poor performance of any speech-based system. This poor performance can be attributed to the feature extraction stage of a speech-based system, which takes an input speech file and converts it into a representation called a feature. Embodiments of the present disclosure provide systems and methods that compute the distance between each degraded speech feature extracted from an input speech signal and each clean speech feature stored in a memory of the system to obtain a set of matched clean speech features. At least a subset of the clean speech features is dynamically selected based on a pre-defined threshold and the computed distance, and statistics are computed for the dynamically selected set for use in at least one of a speech recognition system and a speaker recognition system.
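The distance-and-threshold selection can be sketched as below. The sketch assumes Euclidean distance, assumes that a clean feature is selected when it lies within the threshold of at least one degraded feature, and uses the per-dimension mean as the computed statistic. The abstract does not fix any of these choices, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_clean(degraded: Sequence[Vector], clean_bank: Sequence[Vector], threshold: float) -> List[Vector]:
    """Keep clean features within `threshold` of at least one degraded feature."""
    return [
        c for c in clean_bank
        if any(euclidean(d, c) <= threshold for d in degraded)
    ]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Per-dimension mean of the selected features."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


degraded = [[0.0, 0.0], [10.0, 10.0]]
clean_bank = [[0.0, 1.0], [5.0, 5.0], [10.0, 9.0]]
selected = select_clean(degraded, clean_bank, threshold=1.5)
stats = mean_vector(selected)
```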
Information processing apparatus and non-transitory computer readable medium
An information processing apparatus includes: a receiver configured to receive an utterance content of a speaker, a processing structure of a work in which the speaker utters, the work including plural processing units, and a processing unit in execution in the processing structure; an extraction unit configured to extract a related document including a sentence whose similarity to the utterance content of the speaker received by the receiver is equal to or higher than a threshold, from among related documents that are associated in advance with at least one processing unit including the processing unit in execution received by the receiver; and a setting unit configured to set a processing unit from which the extraction unit extracts a related document next, according to the processing structure received by the receiver.
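The extraction step can be sketched as a threshold test on sentence similarity. The sketch uses Jaccard overlap of lowercase word sets as a stand-in similarity measure, which is an assumption, since the abstract does not say how similarity is computed. The document titles, the processing unit name, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Each document is (title, list of sentences); documents are grouped by processing unit.
Document = Tuple[str, List[str]]


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two sentences (assumed similarity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def extract_related(utterance: str, unit: str, docs_by_unit: Dict[str, List[Document]], threshold: float) -> List[str]:
    """Titles of documents for `unit` with a sentence at or above the threshold."""
    hits = []
    for title, sentences in docs_by_unit.get(unit, []):
        if any(jaccard(utterance, s) >= threshold for s in sentences):
            hits.append(title)
    return hits


docs_by_unit = {
    "inspection": [
        ("Valve checklist", ["check the valve pressure", "record the reading"]),
        ("Safety notes", ["wear gloves at all times"]),
    ]
}
found = extract_related("check valve pressure", "inspection", docs_by_unit, threshold=0.5)
```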
Apparatus and method for residential speaker recognition
A home assistant device captures voice signals uttered by users in the home and extracts vocal features from the captured recordings. The device collects data about the current context in the home and requests from an aggregator the background model that is best adapted to that context. This background model is obtained and used locally by the home assistant device to perform speaker recognition. Home assistant devices from a plurality of homes contribute to a database of background models by aggregating vocal features, clustering them according to context, and computing background models for the different contexts. These background models are then collected, clustered according to their contexts, and aggregated by an aggregator in the database. Any home assistant device can then request from the aggregator the background model that best fits its current context, thus improving speaker recognition.
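The request for a best-fitting background model can be sketched as a lookup on the aggregator side. The sketch assumes a context is a small dictionary of attributes and that "best fit" means the largest number of matching attributes. The attribute names, the model names, and the function `best_background_model` are hypothetical.

```python
from typing import Any, Dict, List

Context = Dict[str, Any]


def best_background_model(current: Context, models: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Return the stored model whose context shares the most attributes with `current`."""
    if not models:
        raise ValueError("the aggregator has no background models")

    def match_count(ctx: Context) -> int:
        return sum(1 for key, value in current.items() if ctx.get(key) == value)

    # max() keeps the first model on ties, so order in the database matters.
    return max(models, key=lambda m: match_count(m["context"]))


database = [
    {"name": "quiet_evening", "context": {"tv_on": False, "occupants": 1}},
    {"name": "family_tv", "context": {"tv_on": True, "occupants": 4}},
]
chosen_model = best_background_model({"tv_on": True, "occupants": 3}, database)
```

A real aggregator would likely use a graded distance between contexts instead of exact attribute matches; the counting rule is used here only to keep the example short.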
Speaker recognition method and apparatus
A speaker recognition method and apparatus receives a first voice signal of a speaker, generates a second voice signal by enhancing the first voice signal through speech enhancement, generates a multi-channel voice signal by associating the first voice signal with the second voice signal, and recognizes the speaker based on the multi-channel voice signal.
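The step of associating the two signals can be sketched as stacking the original and enhanced signals as two channels. The enhancement used here is a simple noise gate chosen only as a placeholder, since the abstract does not specify the speech enhancement method. The function names and the `floor` parameter are hypothetical, and the recognition step itself is omitted.

```python
from typing import List


def enhance(signal: List[float], floor: float = 0.05) -> List[float]:
    """Placeholder enhancement: zero out samples whose magnitude is below `floor`."""
    return [s if abs(s) >= floor else 0.0 for s in signal]


def to_multichannel(first: List[float]) -> List[List[float]]:
    """Stack the first (original) and second (enhanced) signals as two channels."""
    second = enhance(first)
    return [list(first), second]


signal = [0.01, 0.5, -0.02, -0.7]
channels = to_multichannel(signal)
```

Both channels have the same length, so a downstream recognizer can treat the pair as a two-channel input aligned sample by sample.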