Patent classifications
G10L15/12
System and method of automated evaluation of transcription quality
Systems and methods automatically evaluate transcription quality. Audio data is obtained. The audio data is segmented into a plurality of utterances with a voice activity detector operating on a computer processor. The plurality of utterances are transcribed into at least one word lattice with a large vocabulary continuous speech recognition system operating on the processor. A minimum Bayes risk decoder is applied to the at least one word lattice to create at least one confusion network. At least one conformity ratio is calculated from the at least one confusion network.
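The final step of this abstract, deriving a score from a confusion network, can be illustrated with a minimal sketch. The abstract does not define the conformity ratio, so the code below does not attempt to reproduce it. As a stand-in it computes the share of slots whose best word posterior reaches a threshold. The slot representation, the function names, the threshold, and the example posteriors are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical representation: a confusion network is a list of slots, and
# each (non-empty) slot maps candidate words to posterior probabilities.
ConfusionNetwork = List[Dict[str, float]]


def best_path(network: ConfusionNetwork) -> List[str]:
    """Pick the highest-posterior word in every slot."""
    return [max(slot, key=slot.get) for slot in network]


def confident_slot_ratio(network: ConfusionNetwork, threshold: float = 0.8) -> float:
    """Stand-in quality score: share of slots whose best posterior >= threshold.

    Illustrative assumption only; this is not the patent's conformity ratio.
    """
    if not network:
        return 0.0
    confident = sum(1 for slot in network if max(slot.values()) >= threshold)
    return confident / len(network)


# Made-up posteriors for a four-word utterance.
network = [
    {"please": 0.95, "police": 0.05},
    {"hold": 0.55, "old": 0.45},
    {"the": 0.90, "a": 0.10},
    {"line": 0.85, "lime": 0.15},
]

print(best_path(network))
print(confident_slot_ratio(network))
```

A score of this kind needs no reference transcript, which is the property that makes automated quality evaluation possible; the actual metric in the patent may be computed quite differently.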
Communication system
Systems and methods for responding to spoken language input or multi-modal input are described herein. More specifically, one or more user intents are determined or inferred from the spoken language input or multi-modal input to determine one or more user goals via a dialogue belief tracking system. The systems and methods disclosed herein utilize the dialogue belief tracking system to perform actions based on the determined one or more user goals and allow a device to engage in human-like conversation with a user over multiple turns of a conversation. Sparing the user from having to explicitly state each intent and desired goal, while still delivering the desired goal from the device, improves the user's ability to accomplish tasks, perform commands, and get desired products and/or services. Additionally, the improved response to spoken language inputs from a user improves user interactions with the device.
Systems and Methods for Speech Validation
Systems and methods for speech validation in accordance with embodiments of the invention are illustrated. One embodiment includes a method for validating speech. The method includes steps for encoding a set of audio data, processing a set of target data, wherein the target data includes a sequence of target elements associated with the set of audio data, computing a set of one or more alignment probabilities for each target element of the sequence of target elements, and performing temporal resolution based on the computed set of alignment probabilities to determine an alignment between the set of target data and the set of audio data.
Unsupervised keyword spotting and word discovery for fraud analytics
Embodiments described herein provide for a computer that detects one or more keywords of interest using acoustic features, to detect or query commonalities across multiple fraud calls. Embodiments described herein may implement unsupervised keyword spotting (UKWS) or unsupervised word discovery (UWD) in order to identify commonalities across a set of calls, where both UKWS and UWD employ Gaussian Mixture Models (GMM) and one or more dynamic time-warping algorithms. A user may indicate a training exemplar or occurrence of call-specific information, referred to herein as “a named entity,” such as a person's name, an account number, account balance, or order number. The computer may perform a redaction process that computationally nullifies the import of the named entity in the modeling processes described herein.
Diagnostic techniques based on speech-sample alignment
A method includes obtaining a first sequence of reference-sample feature vectors that quantify acoustic features of different respective portions of at least one reference speech sample, which was produced by a subject at a first time while a physiological state of the subject was known, and a second sequence of test-sample feature vectors that quantify the acoustic features of different respective portions of at least one test speech sample, which was produced by the subject at a second time while the physiological state of the subject was unknown. The test-sample feature vectors are mapped to respective ones of the reference-sample feature vectors, under predefined constraints, such that a total distance between the test-sample feature vectors and the respective ones of the reference-sample feature vectors is minimized. In response to the mapping, an output indicating the physiological state of the subject at the second time is generated.
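The mapping step described in this abstract, pairing test-sample vectors with reference-sample vectors so that the total distance is minimized under constraints, resembles dynamic time warping (DTW). The sketch below is a minimal DTW under the usual monotonicity and continuity step constraints with Euclidean distance. Those choices, the function names, and the toy one-dimensional feature vectors are assumptions; the abstract does not state which constraints or distance the method uses.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_align(test: List[Vector], ref: List[Vector]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the minimal total distance and the (test_idx, ref_idx) mapping.

    Both sequences are assumed to be non-empty.
    """
    n, m = len(test), len(ref)
    inf = float("inf")
    # cost[i][j] = minimal total distance aligning test[:i] with ref[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(test[i - 1], ref[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end to recover which vectors were paired.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # diagonal: advance both
            (cost[i - 1][j], i - 1, j),          # advance test only
            (cost[i][j - 1], i, j - 1),          # advance reference only
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


# Toy data: the reference repeats one frame, as if spoken more slowly.
test_vectors = [[0.0], [1.0], [2.0], [3.0]]
ref_vectors = [[0.0], [1.0], [1.0], [2.0], [3.0]]

total, path = dtw_align(test_vectors, ref_vectors)
print(total)
print(path)
```

In this toy case the second test vector is paired with both repeated reference frames, so the total distance stays at zero despite the difference in length. A downstream step would then turn the distance, or the per-pair distances, into the output indicating physiological state.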
Apparatuses and Methods for Querying and Transcribing Video Resumes
Aspects relate to apparatuses and methods for generating queries and transcribing video resumes. An exemplary apparatus includes at least a processor and a memory communicatively connected to the processor, the memory containing instructions configuring the processor to receive, from a posting generator, a plurality of posting inputs from a plurality of postings, receive a video resume from a user, generate a plurality of queries as a function of the video resume based on a plurality of posting categories, transcribe, as a function of the plurality of queries, a plurality of user inputs from the video resume, wherein the plurality of user inputs is related to attributes of a user, and classify the plurality of user inputs to the plurality of posting inputs to match the user to the plurality of postings.
Augmentation of Audiographic Images for Improved Machine Learning
Generally, the present disclosure is directed to systems and methods that generate augmented training data for machine-learned models via application of one or more augmentation techniques to audiographic images that visually represent audio signals. In particular, the present disclosure provides a number of novel augmentation operations which can be performed directly upon the audiographic image (e.g., as opposed to the raw audio data) to generate augmented training data that results in improved model performance. As an example, the audiographic images can be or include one or more spectrograms or filter bank sequences.
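Operating directly on the audiographic image rather than on raw audio can be illustrated with a small sketch. The abstract does not name its augmentation operations, so the example below assumes one common kind: masking a random contiguous band of frequency rows or time columns in a spectrogram. The function name, parameters, fill value, and the constant toy spectrogram are hypothetical.

```python
import random
from typing import List

# Rows are frequency bins, columns are time frames (assumed layout).
Spectrogram = List[List[float]]


def mask_band(spec: Spectrogram, max_width: int, axis: int,
              rng: random.Random, fill: float = 0.0) -> Spectrogram:
    """Return a copy of a non-empty spectrogram with one random band masked.

    axis=0 masks a band of frequency rows; axis=1 masks a band of time columns.
    The band width is drawn uniformly from 0..max_width (capped at the axis size).
    """
    out = [row[:] for row in spec]  # copy so the input stays untouched
    size = len(out) if axis == 0 else len(out[0])
    width = rng.randint(0, min(max_width, size))
    start = rng.randint(0, size - width)
    for r in range(len(out)):
        for c in range(len(out[0])):
            idx = r if axis == 0 else c
            if start <= idx < start + width:
                out[r][c] = fill
    return out


rng = random.Random(0)  # seeded so runs are repeatable
spectrogram = [[1.0] * 6 for _ in range(4)]  # 4 frequency bins x 6 frames

# Apply a frequency mask, then a time mask, to create one augmented example.
augmented = mask_band(spectrogram, max_width=2, axis=0, rng=rng)
augmented = mask_band(augmented, max_width=3, axis=1, rng=rng)

for row in augmented:
    print(row)
```

Because the masks are applied to the image, each training pass can draw new bands cheaply without recomputing the spectrogram from audio.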