Patent classifications
G10L17/04
Enrollment with an automated assistant
Techniques are described herein for dialog-based enrollment of individual users for single- and/or multi-modal recognition by an automated assistant, as well as determining how to respond to a particular user's request based on the particular user being enrolled and/or recognized. Rather than requiring operation of a graphical user interface for individual enrollment, dialog-based enrollment enables users to enroll themselves (or others) by way of a human-to-computer dialog with the automated assistant.
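To picture the turn-by-turn flow the abstract describes, here is a minimal Python sketch of dialog-based enrollment; the prompts, the EnrollmentProfile structure, and the embed_voice() placeholder are hypothetical stand-ins, not details from the patent.

# A minimal sketch of dialog-based enrollment: the assistant drives a
# short dialog instead of a GUI form, collecting voice samples per turn.
from dataclasses import dataclass, field

PROMPTS = [
    "What's your name?",
    "Please repeat: 'my voice is my passport'",
    "One more time, a bit closer to the microphone.",
]

@dataclass
class EnrollmentProfile:
    name: str
    embeddings: list = field(default_factory=list)

def embed_voice(utterance: str) -> list:
    # Placeholder: a real system would run a speaker-encoder model here.
    return [float(ord(c) % 7) for c in utterance[:8]]

def enroll_via_dialog(get_user_reply) -> EnrollmentProfile:
    """Drive a turn-by-turn enrollment dialog with the user."""
    name = get_user_reply(PROMPTS[0])
    profile = EnrollmentProfile(name=name)
    for prompt in PROMPTS[1:]:
        reply = get_user_reply(prompt)
        profile.embeddings.append(embed_voice(reply))
    return profile

if __name__ == "__main__":
    canned = iter(["Alice", "my voice is my passport", "my voice is my passport"])
    profile = enroll_via_dialog(lambda prompt: next(canned))
    print(profile.name, len(profile.embeddings), "samples enrolled")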
Graph-based approach for voice authentication
Methods for voice authentication include receiving a plurality of mono telephonic interactions between customers and agents; creating a mapping of the plurality of mono telephonic interactions that illustrates which agent interacted with which customer in each of the interactions; determining how many agents each customer interacted with; identifying one or more customers an agent has interacted with that have the fewest interactions with other agents; and selecting a predetermined number of interactions of the agent with each of the identified customers. In some embodiments, the methods further include creating a voice print from first and second speaker components of each interaction; comparing the voice prints of a first selected interaction to the voice prints from a second selected interaction; calculating a similarity score between the voice prints; aggregating scores; and identifying the voice prints that are associated with the agent.
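The selection step can be read as a small graph computation over (agent, customer, call) records: prefer customers who spoke with the fewest other agents, since their recordings most cleanly isolate this agent's voice. The Python below is an illustrative reading of that step; the tuple format, the select_interactions() name, and the n_per_customer parameter are assumptions.

# A sketch of graph-based interaction selection for one agent.
from collections import defaultdict

def select_interactions(interactions, agent, n_per_customer=2):
    # Map: which agents did each customer speak with?
    agents_per_customer = defaultdict(set)
    calls = defaultdict(list)          # (agent, customer) -> call ids
    for a, c, call_id in interactions:
        agents_per_customer[c].add(a)
        calls[(a, c)].append(call_id)

    # This agent's customers, ordered by fewest other-agent contacts.
    customers = sorted(
        (c for (a, c) in calls if a == agent),
        key=lambda c: len(agents_per_customer[c]),
    )
    selected = []
    for c in customers:
        selected.extend(calls[(agent, c)][:n_per_customer])
    return selected

if __name__ == "__main__":
    data = [("A1", "C1", 1), ("A1", "C2", 2), ("A2", "C2", 3), ("A1", "C1", 4)]
    print(select_interactions(data, "A1"))  # prefers C1, who only spoke with A1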
System and method to determine outcome probability of an event based on videos
System and method for determining an outcome probability of an event based on videos are disclosed. The method includes receiving the videos of an event, creating a building block model, extracting at least one of audio content or video content from the videos, analysing the extracted content, generating an analysis result, analysing engagement between a speaker and participants of the event, generating a data lake comprising a keyword library, computing the outcome probability of the event, enabling the building block model to learn from the data lake and the computed outcome probability, and representing the outcome probability in a pre-defined format.
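The overall pipeline shape can be sketched as follows in Python. Everything here is illustrative: the feature extractors are stand-ins for the real audio/video analysis, and the keyword-plus-engagement weighting is an assumed scoring rule, not the patented model.

# A rough sketch of the pipeline: extract content, match against the
# keyword library in the data lake, fold in engagement, output a probability.
def extract_content(video):
    # Stand-in for real audio/video content analysis.
    return video.get("transcript", "").lower().split()

def engagement_score(video):
    # Stand-in for speaker/participant engagement analysis.
    return video.get("questions_asked", 0) / max(video.get("minutes", 1), 1)

def outcome_probability(videos, keyword_library):
    hits, total, engagement = 0, 0, 0.0
    for v in videos:
        words = extract_content(v)
        total += len(words) or 1
        hits += sum(1 for w in words if w in keyword_library)
        engagement += engagement_score(v)
    keyword_rate = hits / total
    # Illustrative weighting of keyword rate vs. engagement.
    return min(1.0, 0.6 * keyword_rate + 0.4 * (engagement / len(videos)))

if __name__ == "__main__":
    lake = {"budget", "timeline", "agree"}
    vids = [{"transcript": "we agree on the budget",
             "questions_asked": 3, "minutes": 10}]
    print(round(outcome_probability(vids, lake), 3))  # -> 0.36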
LEARNING APPARATUS, ESTIMATION APPARATUS, METHODS AND PROGRAMS FOR THE SAME
A learning apparatus includes: a speaker vector learning unit configured to learn a speaker vector extraction parameter λ based on one or more items of learning speech voice data in a speaker vector voice database; a non-speaker-individuality sound model learning unit configured to create a probability distribution model using a frequency component of one or more items of non-speaker-individuality sound data in a non-speaker-individuality sound database and calculate internal parameters μ and Σ of the probability distribution model; and an age level estimation model learning unit configured to extract a speaker vector from voice data in an age level estimation model-learning voice database using the speaker vector extraction parameter λ, calculate a non-speaker-individuality sound likelihood vector from voice data in the age level estimation model-learning voice database using the internal parameters μ and Σ, and learn, with input of the speaker vector and the non-speaker-individuality sound likelihood vector, a parameter Ω of an age level estimation model that outputs an estimated value of an age level of a corresponding speaker.
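The estimation-time flow implied by the abstract can be sketched in Python as below. The extractor, the diagonal Gaussian standing in for the μ/Σ model, and the linear Ω model are all assumptions for illustration only.

# A sketch of estimation: concatenate the speaker vector with the
# non-speaker-individuality sound likelihood vector, then apply Omega.
import math

def speaker_vector(frames, lam=0.5):
    # Stand-in extractor with learned parameter lambda: scaled frame mean.
    return [lam * sum(d) / len(frames) for d in zip(*frames)]

def log_likelihood_vector(frames, mu, sigma):
    # Per-dimension Gaussian log-likelihood of the mean frame (mu, Sigma).
    mean = [sum(d) / len(frames) for d in zip(*frames)]
    return [
        -0.5 * (math.log(2 * math.pi * s) + (x - m) ** 2 / s)
        for x, m, s in zip(mean, mu, sigma)
    ]

def estimate_age_level(frames, mu, sigma, omega):
    # Linear model Omega as a stand-in for the learned age-level estimator.
    features = speaker_vector(frames) + log_likelihood_vector(frames, mu, sigma)
    return sum(w * f for w, f in zip(omega, features))

if __name__ == "__main__":
    frames = [[1.0, 2.0], [1.2, 1.8]]
    mu, sigma = [1.0, 2.0], [0.5, 0.5]
    omega = [0.1, 0.2, 0.05, 0.05]
    print(round(estimate_age_level(frames, mu, sigma, omega), 3))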
VOICE INTERACTION METHOD AND ELECTRONIC DEVICE
Embodiments of this application provide a voice interaction method and an electronic device, relating to the field of artificial intelligence (AI) technologies and the field of voice processing technologies. In a specific solution, an electronic device receives first voice information sent by a second user and recognizes the first voice information in response. The first voice information is used to request a voice conversation with a first user. On the basis that the electronic device recognizes the first voice information as voice information of the second user, the electronic device may conduct a voice conversation with the second user by imitating the voice of the first user, in the mode in which the first user has voice conversations with the second user.
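The call flow can be sketched in Python as follows. The recognize_speaker() and synthesize_in_voice() functions are hypothetical stand-ins for on-device speaker recognition and voice imitation; none of these names come from the patent.

# A sketch of the flow: verify the caller first, then reply on the
# owner's behalf in the owner's voice.
def recognize_speaker(audio, enrolled):
    # Stand-in: match the audio against enrolled voice samples.
    return next((u for u, sample in enrolled.items() if sample == audio), None)

def synthesize_in_voice(text, voice_of):
    return f"[{voice_of}'s voice] {text}"

def handle_incoming_voice(audio, enrolled, owner, reply_policy):
    caller = recognize_speaker(audio, enrolled)
    if caller is None:
        return None  # Unknown speaker: do not imitate the owner's voice.
    # Converse with the recognized caller, imitating the owner's voice
    # and the style the owner uses with this caller.
    return synthesize_in_voice(reply_policy(caller, audio), voice_of=owner)

if __name__ == "__main__":
    enrolled = {"Bob": "bob_sample"}
    policy = lambda caller, audio: f"Hi {caller}, Alice is busy right now."
    print(handle_incoming_voice("bob_sample", enrolled, "Alice", policy))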
Hotword-based speaker recognition
Systems, methods performed by data processing apparatus, and computer storage media encoded with computer programs are provided for: receiving an utterance from a user in a multi-user environment, each user having an associated set of available resources; determining that the received utterance includes at least one predetermined word; comparing speaker identification features of the uttered predetermined word with speaker identification features of each of a plurality of previous utterances of the predetermined word, the previous utterances corresponding to different known users in the multi-user environment; attempting to identify the user associated with the uttered predetermined word as matching one of the known users; and, based on a result of the attempt, selectively providing the user with access to one or more resources associated with the corresponding known user.
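The matching step can be sketched as nearest-template comparison over hotword feature vectors. In the Python below, the fixed-length vectors and the cosine-similarity threshold are illustrative choices, not specified by the abstract.

# A sketch of hotword-based speaker matching against enrolled templates.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def identify_user(hotword_features, templates, threshold=0.85):
    """templates: user -> feature vectors from prior hotword utterances."""
    best_user, best_score = None, threshold
    for user, vectors in templates.items():
        score = max(cosine(hotword_features, v) for v in vectors)
        if score > best_score:
            best_user, best_score = user, score
    return best_user  # None means: do not grant access to personal resources.

if __name__ == "__main__":
    templates = {"ana": [[0.9, 0.1, 0.2]], "ben": [[0.1, 0.9, 0.3]]}
    print(identify_user([0.88, 0.12, 0.21], templates))  # -> ana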
Transcription System with Contextual Automatic Speech Recognition
An automated speech recognition (“ASR”) system with an audio processing engine and a contextual transcription engine on a computing device is provided. The audio processing engine determines audio segmentation corresponding with multiple identified speakers of audio data. The contextual transcription engine generates a text file from the audio data as a legally-formatted transcript using one or more AI/ML models. Embodiments of the ASR system provide results that comply with most of the stenographic standards for legal transcription out of the box, without further setup or tuning.
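The two-stage shape can be sketched in Python as below: segment the audio by speaker, then emit transcript lines in a Q/A style typical of legal transcripts. The Segment type, transcribe() stand-in, and role labels are illustrative assumptions, not the system's actual interfaces.

# A sketch of diarized-audio-to-legal-transcript formatting.
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # e.g. "SPEAKER_1" from the audio processing engine
    start: float
    end: float
    audio: bytes

def transcribe(audio: bytes) -> str:
    return audio.decode()  # Stand-in: real ASR models would run here.

ROLE_LABELS = {"SPEAKER_1": "Q.", "SPEAKER_2": "A."}  # examiner / witness

def legal_transcript(segments):
    lines = []
    for seg in sorted(segments, key=lambda s: s.start):
        label = ROLE_LABELS.get(seg.speaker, seg.speaker)
        lines.append(f"    {label}  {transcribe(seg.audio)}")
    return "\n".join(lines)

if __name__ == "__main__":
    segs = [
        Segment("SPEAKER_1", 0.0, 2.1, b"Please state your name."),
        Segment("SPEAKER_2", 2.3, 4.0, b"Jane Doe."),
    ]
    print(legal_transcript(segs))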
Method for reduced computation of T-matrix training for speaker recognition
A system and method for improving T-matrix training for speaker recognition, comprising: receiving an audio input divisible into a plurality of audio frames, including at least an audio sample of a human speaker; generating, for each audio frame, a feature vector; generating, for a first plurality of feature vectors, centered statistics of at least the zeroth and first order; generating a first i-vector, the first i-vector representing the human speaker; and generating an optimized T-matrix training sequence computation based on at least the first i-vector.
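For context, the zeroth- and first-order centered statistics named here are the standard Baum-Welch statistics that feed i-vector and T-matrix training. The Python sketch below computes them against a toy diagonal-covariance GMM acting as the universal background model; the T-matrix optimization itself is omitted, and the toy UBM is an assumption.

# A sketch of zeroth/first-order centered statistics for i-vector training.
import numpy as np

def centered_stats(frames, means, covs, weights):
    """frames: (T, D); means/covs: (C, D); weights: (C,)."""
    # Per-frame Gaussian responsibilities under a diagonal-covariance GMM.
    diff = frames[:, None, :] - means[None, :, :]          # (T, C, D)
    log_p = -0.5 * (np.sum(diff**2 / covs, axis=2)
                    + np.sum(np.log(2 * np.pi * covs), axis=1))
    log_p += np.log(weights)
    gamma = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)              # (T, C)

    N = gamma.sum(axis=0)                                  # zeroth order, (C,)
    F = gamma.T @ frames - N[:, None] * means              # centered first order
    return N, F

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(100, 4))
    means = rng.normal(size=(3, 4))
    covs = np.ones((3, 4))
    weights = np.array([0.5, 0.3, 0.2])
    N, F = centered_stats(frames, means, covs, weights)
    print(N.shape, F.shape)  # (3,) (3, 4)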