Patent classifications
G10L17/08
SYNTHETIC OVERSAMPLING TO ENHANCE SPEAKER IDENTIFICATION OR VERIFICATION
An apparatus for oversampling audio signals is described herein. The apparatus includes one or more microphones to receive audio signals and an extractor to extract a set of feature points from the audio signals. The apparatus also includes a processing unit to determine a distance between each pair of feature points and an oversampling unit to generate a plurality of new feature points based on the distance between each pair of feature points.
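The distance-based generation of new feature points described above can be illustrated with a minimal interpolation sketch (in the spirit of SMOTE-style oversampling, not the patented apparatus): compute pairwise distances, then synthesize new points between each point and its nearest neighbour. All function and parameter names here are illustrative.

```python
import numpy as np

def oversample(features: np.ndarray, n_new: int, rng=None) -> np.ndarray:
    """Generate synthetic feature points by interpolating between
    each existing point and its nearest neighbour (illustrative sketch)."""
    rng = np.random.default_rng(rng)
    # Pairwise Euclidean distances between every pair of feature points.
    diff = features[:, None, :] - features[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)   # ignore self-pairs
    nearest = dist.argmin(axis=1)    # nearest neighbour of each point
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(features))
        j = nearest[i]
        t = rng.random()             # random interpolation factor in [0, 1)
        new_points.append(features[i] + t * (features[j] - features[i]))
    return np.stack(new_points)
```

Each synthetic point lies on the line segment between a real point and its nearest neighbour, so the augmented set stays inside the original feature distribution.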
WEARABLE APPARATUS AND METHODS FOR PROVIDING TRANSCRIPTION AND/OR SUMMARY
System and methods for processing audio signals are disclosed. In one implementation, a system may include a wearable apparatus including an image sensor to capture images from an environment of a user; an audio sensor to capture an audio signal from the environment of the user; and at least one processor. The processor may be programmed to receive the audio signal captured by the audio sensor; identify at least one segment including speech in the audio signal; receive an image including a representation of a code; analyze the code to determine whether the code is associated with the user and/or the wearable apparatus; and after determining that the code is associated with the user and/or the wearable apparatus, transmit at least one segment of the audio signal, at least one of the captured images, and/or other information to a computing platform.
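The code-gated transmission step in this abstract can be sketched as a simple authorization check: speech segments are forwarded only after the scanned code is matched to the user or device. The code format (`user:<id>` / `device:<id>`) and all names are hypothetical, not from the patent.

```python
def process_capture(audio_segments, code, user_id, device_id, transmit):
    """Gate transmission on the scanned code: speech segments are sent
    onward only after the code is tied to this user or this device.
    Hypothetical code format: 'user:<id>' or 'device:<id>'."""
    authorized = code in (f"user:{user_id}", f"device:{device_id}")
    if authorized:
        for seg in audio_segments:
            transmit(seg)   # e.g. upload to the computing platform
    return authorized
```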
SAMPLE-EFFICIENT REPRESENTATION LEARNING FOR REAL-TIME LATENT SPEAKER STATE CHARACTERIZATION
Systems, methods, and non-transitory computer-readable media can provide audio waveform data that corresponds to a voice sample to a temporal convolutional network for evaluation. The temporal convolutional network can pre-process the audio waveform data and can output an identity embedding associated with the audio waveform data. The identity embedding associated with the voice sample can be obtained from the temporal convolutional network. Information describing a speaker associated with the voice sample can be determined based at least in part on the identity embedding.
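The pipeline above (waveform in, fixed-length identity embedding out) can be illustrated with a tiny pure-NumPy stand-in for a temporal convolutional network: stacked 1-D convolutions over the raw waveform followed by mean pooling over time. This is a didactic sketch, not the patented network; layer shapes and names are assumptions.

```python
import numpy as np

def conv1d(x, w):
    """Valid-mode 1-D convolution with ReLU.
    x: (channels_in, time); w: (channels_out, channels_in, kernel)."""
    c_out, c_in, k = w.shape
    t_out = x.shape[1] - k + 1
    out = np.empty((c_out, t_out))
    for t in range(t_out):
        out[:, t] = np.tensordot(w, x[:, t:t + k], axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)

def identity_embedding(waveform, layers):
    """Stack conv layers over the raw waveform, then mean-pool over
    time to produce a fixed-length identity embedding."""
    x = waveform[None, :]   # one input channel
    for w in layers:
        x = conv1d(x, w)
    return x.mean(axis=1)   # (embedding_dim,)
```

Mean pooling is what makes the embedding length independent of the input duration, which is the property a real-time speaker characterization system needs.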
Systems, devices, software, and methods for identity recognition and verification based on voice spectrum analysis
Hardware and/or software systems, devices, networks, and methods for identity recognition and verification based on vocal spectrum analysis. The system includes one or more processors coupled to memory/storage to collect audio samples sufficient to generate a speaker identification reference pattern and a speaker identification verification pattern, generate a speaker identification reference pattern from the audio samples and a speaker identification verification pattern from other audio samples, compare the speaker identification verification pattern with the speaker identification reference pattern, and provide a response indicating whether the speaker identification verification pattern and the speaker identification reference pattern were generated based on audio samples from the same person. The system may be employed on a mobile phone in near-field communication with a control system and may include a management platform.
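The compare-and-respond step described above reduces to scoring a verification pattern against the enrolled reference pattern and thresholding the score. A minimal sketch using cosine similarity (the patent does not specify the comparison metric; the threshold value is an assumption):

```python
import numpy as np

def verify_speaker(reference, verification, threshold=0.8):
    """Compare a verification pattern against the enrolled reference
    pattern; return (same_person?, similarity score). Cosine similarity
    and the threshold are illustrative choices, not the patented method."""
    sim = np.dot(reference, verification) / (
        np.linalg.norm(reference) * np.linalg.norm(verification) + 1e-12)
    return bool(sim >= threshold), float(sim)
```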
Apparatus for classifying speakers using a feature map and method for operating the same
A method and apparatus for processing voice data of a speech received from a speaker are provided. The method includes extracting a speaker feature vector from the voice data of the speech received from a speaker, generating a speaker feature map by positioning the extracted speaker feature vector at a specific position on a multi-dimensional vector space, forming a plurality of clusters indicating features of voices of a plurality of speakers by grouping at least one speaker feature vector positioned on the speaker feature map, and classifying the plurality of speakers according to the plurality of clusters.
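The map-then-cluster flow above (position speaker vectors in a vector space, group them into clusters, classify speakers by cluster) can be sketched with a plain k-means loop. This is an illustrative stand-in; the patent does not specify the clustering algorithm, and the deterministic initialization here is an assumption made for reproducibility.

```python
import numpy as np

def cluster_speaker_map(vectors, k, iters=20):
    """Group speaker feature vectors on the feature map into k clusters
    with a basic k-means loop; cluster labels classify the speakers."""
    # Deterministic init: pick k vectors spread across the index range.
    idx = np.linspace(0, len(vectors) - 1, k).astype(int)
    centroids = vectors[idx].astype(float).copy()
    for _ in range(iters):
        # Assign each vector to its nearest centroid.
        d = np.linalg.norm(vectors[:, None] - centroids[None], axis=-1)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned vectors.
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = vectors[labels == c].mean(axis=0)
    return labels, centroids
```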
SPEAKER EMBEDDING APPARATUS AND METHOD
An input unit 81 inputs an observation at the current time step. A frame alignment unit 82 computes a frame alignment at the current time step by using the input observation. An i-vector computation unit 83 computes an i-vector and a precision matrix by using the computed frame alignment, the input observation, and a product obtained when computing the i-vector at the previous time step. An output unit 84 outputs the computed i-vector and precision matrix.
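The per-time-step update described above can be illustrated with the standard i-vector posterior equations: accumulate zeroth- and first-order statistics one observation at a time, then solve L w = b where L = I + Σ_c N_c T_cᵀ Σ_c⁻¹ T_c is the precision matrix. This sketch uses the textbook formulation, not necessarily the patented recursion; all shapes and names are assumptions.

```python
import numpy as np

class OnlineIVector:
    """Streaming i-vector sketch: accumulate Baum-Welch statistics one
    observation at a time and recompute the i-vector and its precision
    matrix after each update."""
    def __init__(self, T, means, covs):
        self.T = T                    # (n_comp, feat_dim, iv_dim) total variability
        self.means = means            # (n_comp, feat_dim) UBM component means
        self.inv_covs = 1.0 / covs    # (n_comp, feat_dim) diagonal covariances
        n_comp, feat_dim, iv_dim = T.shape
        self.N = np.zeros(n_comp)                # zeroth-order stats
        self.F = np.zeros((n_comp, feat_dim))    # centered first-order stats
        self.iv_dim = iv_dim

    def step(self, obs, gamma):
        """gamma: frame alignment (posterior over components) for obs."""
        self.N += gamma
        self.F += gamma[:, None] * (obs - self.means)
        # Precision matrix L = I + sum_c N_c T_c' Sigma_c^{-1} T_c
        L = np.eye(self.iv_dim)
        b = np.zeros(self.iv_dim)
        for c in range(len(self.N)):
            TS = self.T[c].T * self.inv_covs[c]    # (iv_dim, feat_dim)
            L += self.N[c] * TS @ self.T[c]
            b += TS @ self.F[c]
        return np.linalg.solve(L, b), L            # i-vector, precision
```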
Voice-assistant activated virtual card replacement
A device may receive a command associated with identifying a merchant for a virtual card swap procedure, wherein the virtual card swap procedure is to replace a credit card of a user with a virtual card corresponding to the credit card. The device may identify the merchant for the virtual card swap procedure based on the command. The device may obtain the virtual card for the user. The device may determine a virtual card swap procedure template for the merchant. The device may perform the virtual card swap procedure based on the virtual card swap procedure template.
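The template-driven swap described above can be sketched as looking up a merchant-specific ordered list of steps and executing them in sequence. The template shape (named callables) is a hypothetical illustration, not the patented procedure.

```python
def perform_card_swap(merchant, templates, old_card, new_card):
    """Run the merchant-specific swap template in order; each step is a
    (name, action) pair, and the call returns an audit log of results.
    Hypothetical template shape for illustration only."""
    log = []
    for name, action in templates[merchant]:
        log.append((name, action(old_card, new_card)))
    return log
```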
Sound detection
A method for generating a health indicator for at least one person of a group of people, the method comprising: receiving, at a processor, captured sound, where the captured sound is sound captured from the group of people; comparing the captured sound to a plurality of sound models to detect at least one non-speech sound event in the captured sound, each of the plurality of sound models associated with a respective health-related sound type; determining metadata associated with the at least one non-speech sound event; assigning the at least one non-speech sound event and the metadata to at least one person of the group of people; and outputting a message identifying the at least one non-speech sound event and the metadata to a health indicator generator module to generate a health indicator for the at least one person to whom the at least one non-speech sound event is assigned.
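The compare-to-sound-models step can be sketched as matching per-frame features against a template per health-related sound type and emitting events with simple metadata. Cosine similarity, the threshold, and the metadata fields are illustrative assumptions, not the claimed matching method.

```python
import numpy as np

def detect_events(frame_features, sound_models, threshold=0.9):
    """Match each captured frame against health-related sound models
    (e.g. cough, sneeze) by cosine similarity; return detected non-speech
    sound events with simple metadata (frame index, type, score)."""
    events = []
    for t, frame in enumerate(frame_features):
        for name, template in sound_models.items():
            score = np.dot(frame, template) / (
                np.linalg.norm(frame) * np.linalg.norm(template) + 1e-12)
            if score >= threshold:
                events.append({"frame": t, "type": name, "score": float(score)})
    return events
```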
Method for Speaker Diarization
Disclosed is a speaker diarization process for determining which speaker is speaking at what time during the course of a conversation. The process can be described in five main parts: segmentation, where speech/non-speech decisions are made; frame feature extraction, where useful information is obtained from the frames; segment modeling, where the information from frame feature extraction is combined with segment start and end time information to create segment-specific features; speaker decisions, where the segments are clustered to create speaker models; and corrections, where frame-level corrections are applied to the extracted information.
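The five-part flow above can be condensed into a toy end-to-end sketch: energy-thresholded segmentation, segments modeled by their mean feature, nearest-centroid clustering into speakers, and frame-level relabeling. Every modeling choice here (energy threshold, mean-feature segment model, first-k centroid init) is an illustrative assumption, not the disclosed process.

```python
import numpy as np

def diarize(frames, energies, speech_thresh, n_speakers):
    """Toy sketch of the five-part flow: (1) mark speech frames by
    energy, (2) take frame features as given, (3) model each contiguous
    speech segment by its mean feature, (4) cluster segment models into
    speakers, (5) relabel each frame from its segment's speaker."""
    speech = energies > speech_thresh                  # 1. segmentation
    segments, start = [], None
    for t, s in enumerate(speech):
        if s and start is None:
            start = t
        elif not s and start is not None:
            segments.append((start, t)); start = None
    if start is not None:
        segments.append((start, len(speech)))
    models = np.array([frames[a:b].mean(axis=0) for a, b in segments])  # 3.
    centroids = models[:n_speakers]                    # 4. naive init
    labels = np.linalg.norm(models[:, None] - centroids[None], axis=-1).argmin(1)
    out = np.full(len(frames), -1)                     # -1 = non-speech
    for (a, b), lab in zip(segments, labels):
        out[a:b] = lab                                 # 5. frame labels
    return out
```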