G10L15/063

DATA SORTING FOR GENERATING RNN-T MODELS
20230237987 · 2023-07-27 ·

A computer-implemented method for preparing training data for a speech recognition model is provided including obtaining a plurality of sentences from a corpus, dividing each phoneme in each sentence of the plurality of sentences into three hidden states, calculating, for each sentence of the plurality of sentences, a score based on a variation in duration of the three hidden states of each phoneme in the sentence, and sorting the plurality of sentences by using the calculated scores.
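The sorting step described above can be sketched as follows. The scoring formula here (population variance of the three state durations, averaged over the phonemes of a sentence) is an assumption for illustration, not the patented formula, and the data structures are invented.

```python
from statistics import pvariance

def sentence_score(phoneme_state_durations):
    """phoneme_state_durations: one (d1, d2, d3) tuple of hidden-state
    durations per phoneme; higher score means more duration variation."""
    return sum(pvariance(d) for d in phoneme_state_durations) / len(phoneme_state_durations)

def sort_sentences(sentences):
    """sentences: list of (text, phoneme_state_durations) pairs,
    sorted from least to most duration variation."""
    return sorted(sentences, key=lambda s: sentence_score(s[1]))

corpus = [
    ("uniform timing", [(5, 5, 5), (4, 4, 4)]),  # zero variation
    ("varied timing",  [(1, 9, 2), (8, 1, 7)]),  # high variation
]
ranked = sort_sentences(corpus)
```

Sentences with uniform state durations sort first under this toy score; the patent leaves the exact use of the sorted order (e.g., curriculum ordering for RNN-T training) to the claims.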

Artificial Intelligence Based Technologies for Improving Patient Appointment Scheduling and Inventory Management
20230238119 · 2023-07-27 ·

Artificial intelligence (AI) based technologies for improving patient appointment scheduling and inventory management are disclosed herein. An example method includes receiving, at a server including a natural language processing (NLP) model, an appointment request from a user. The example method further includes initiating, based on the appointment request, a patient appointment data stream including verbal responses from the user regarding an appointment of the user. The example method further includes applying, while simultaneously receiving the patient appointment data stream, the NLP model to the verbal responses from the user to output (i) textual transcriptions and (ii) intent interpretations. The example method further includes querying a scheduling database to determine a matching appointment that satisfies a distance threshold, a date threshold, a service threshold, and an inventory threshold. The example method further includes causing a user device of the user to convey the matching appointment to the user.
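The four-threshold database match can be sketched as a simple filter. The field names and threshold semantics below are assumptions for illustration; the abstract does not define how the thresholds are encoded.

```python
def find_matching_appointment(candidates, *, max_distance_km, max_days_out,
                              required_service, min_inventory):
    """Return the first candidate slot satisfying all four thresholds,
    or None if no slot matches."""
    for appt in candidates:
        if (appt["distance_km"] <= max_distance_km
                and appt["days_out"] <= max_days_out
                and appt["service"] == required_service
                and appt["inventory"] >= min_inventory):
            return appt
    return None

slots = [
    {"distance_km": 30, "days_out": 2, "service": "eye_exam", "inventory": 5},
    {"distance_km": 8,  "days_out": 5, "service": "eye_exam", "inventory": 12},
]
match = find_matching_appointment(slots, max_distance_km=15, max_days_out=7,
                                  required_service="eye_exam", min_inventory=10)
```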

METHOD AND SYSTEM FOR SPEECH DETECTION AND SPEECH ENHANCEMENT
20230005469 · 2023-01-05 ·

A method of speech detection and speech enhancement in a speech detection and speech enhancement unit of a Multipoint Conferencing Node (MCN), and a method of training the same. The method comprises receiving input audio segments; determining an acoustic environment based on input audio auxiliary information; extracting T-F-domain features from the received input audio segments; determining whether each of the received input audio segments is speech by inputting the T-F-domain features into a speech detection classifier trained for the determined acoustic environment; determining, when one of the received input audio segments is speech, whether the received audio segment is noisy speech by inputting the T-F-domain features into a noise classifier using a statistical generative model representing the probability distributions of the T-F-domain features of noisy speech trained for the determined acoustic environment; and applying a noise reduction mask to the received input audio segments according to the determination that the received audio segment is noisy speech.
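The two-stage decision flow (speech classifier gates a noise classifier, which gates the noise-reduction mask) can be sketched as below. The classifiers are stand-in callables; the patent's classifiers are trained per acoustic environment and operate on T-F-domain features.

```python
def process_segment(features, speech_clf, noise_clf, apply_mask):
    """Classify a segment and apply the noise-reduction mask
    only when it is judged to be noisy speech."""
    if not speech_clf(features):
        return "non-speech", features
    if noise_clf(features):
        return "noisy-speech", apply_mask(features)
    return "clean-speech", features

label, out = process_segment(
    [0.9, 0.4],
    speech_clf=lambda f: f[0] > 0.5,           # toy speech detector
    noise_clf=lambda f: f[1] > 0.3,            # toy noise detector
    apply_mask=lambda f: [x * 0.5 for x in f]  # toy noise-reduction mask
)
```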

METHOD FOR PROCESSING AN AUDIO STREAM AND CORRESPONDING SYSTEM

A method and a system for processing an audio stream are described, wherein at least one database of classified voices and at least one database of classified background sounds are provided, and a comparison is carried out between these classified voices and background sounds and the voices and sounds extracted from a suitably re-processed audio stream in order to identify possible matches.

Method and device for user registration, and electronic device

Provided in embodiments of the present application are a method and apparatus for user registration, and an electronic device. The method includes: after obtaining a wake-up voice of a user each time, extracting and storing a first voiceprint feature corresponding to the wake-up voice; clustering the stored first voiceprint features to divide the stored first voiceprint features into at least one category, wherein each of the at least one category includes at least one first voiceprint feature belonging to the same user; assigning one category identifier to each category; and storing each category identifier in correspondence to the at least one first voiceprint feature corresponding to this category identifier to complete user registration. The embodiments of the present application can simplify the user operation and improve the user experience.
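The clustering/registration step can be sketched with a naive greedy distance-threshold clusterer over voiceprint vectors; the abstract does not name a particular clustering algorithm, so this choice is an assumption.

```python
import math
import uuid

def cluster_voiceprints(features, threshold=1.0):
    """Greedily group voiceprint vectors: a vector joins the first
    cluster whose seed member is within `threshold` (Euclidean),
    otherwise it starts a new cluster under a fresh category id."""
    clusters = {}  # category identifier -> list of feature vectors
    for f in features:
        for members in clusters.values():
            if math.dist(f, members[0]) <= threshold:
                members.append(f)
                break
        else:
            clusters[str(uuid.uuid4())] = [f]
    return clusters

prints = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0)]
registry = cluster_voiceprints(prints)
```

Each key of `registry` plays the role of a category identifier stored in correspondence with its voiceprint features.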

System and method for defining dialog intents and building zero-shot intent recognition models
11568855 · 2023-01-31 ·

A system and method of creating the natural language understanding component of a speech/text dialog system. The method involves a first step of defining user intents in the form of intent flow graphs. Next, (context, intent) pairs are created from each of the plurality of intent flow graphs and stored in a training database. A paraphrase task is then generated from each (context, intent) pair and also stored in the training database. A zero-shot intent recognition model is trained using the plurality of (context, intent) pairs in the training database to recognize user intents from the plurality of paraphrase tasks in the training database. Once trained, the zero-shot intent recognition model is applied to user queries to generate semantic outputs.
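The derivation of (context, intent) pairs from an intent flow graph can be sketched with the graph as an adjacency dict: each node becomes an intent, and the path leading to it becomes its context. The node names are invented for illustration.

```python
def context_intent_pairs(flow_graph, root):
    """Walk an intent flow graph and emit one (context, intent) pair
    per node, where context is the path of ancestor intents."""
    pairs, stack = [], [(root, [])]
    while stack:
        node, context = stack.pop()
        pairs.append((tuple(context), node))
        for child in flow_graph.get(node, []):
            stack.append((child, context + [node]))
    return pairs

graph = {"book_flight": ["choose_seat", "add_bag"], "choose_seat": []}
pairs = context_intent_pairs(graph, "book_flight")
```

Each pair would then seed a paraphrase task stored alongside it in the training database.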

Machine learning method, audio source separation apparatus, and electronic instrument
11568857 · 2023-01-31 ·

A machine learning method for training a learning model includes: transforming a first audio type of audio data into a first image type of image data, wherein a first audio component and a second audio component are mixed in the first audio type of audio data, and the first image type of image data corresponds to the first audio type of audio data; transforming a second audio type of audio data into a second image type of image data, wherein the second audio type of audio data includes the first audio component without mixture of the second audio component, and the second image type of image data corresponds to the second audio type of audio data; and performing machine learning on the learning model with training data including sets of the first image type of image data and the second image type of image data.
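The training-pair construction can be sketched as follows: the mixed audio and the isolated target component are each transformed into an image-like representation (here a toy framed magnitude sum standing in for a spectrogram; the patent specifies only an audio-to-image transform, not this particular one).

```python
def to_image(audio, frame=4):
    """Toy audio-to-image transform: one magnitude value per frame."""
    return [sum(abs(x) for x in audio[i:i + frame])
            for i in range(0, len(audio), frame)]

def make_training_pair(mixed_audio, target_audio):
    """(input image from the mixture, target image from the
    isolated first audio component)."""
    return (to_image(mixed_audio), to_image(target_audio))

mix    = [1, -1, 2, -2, 1, 1, 0, 0]   # first + second components mixed
vocals = [1, -1, 1, -1, 0, 0, 0, 0]   # first component alone
x, y = make_training_pair(mix, vocals)
```

Sets of such (x, y) image pairs form the training data on which the separation model learns.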

Intent authoring using weak supervision and co-training for automated response systems

A combination of propagation operations and learning algorithms is applied, using a selected set of labeled conversational logs retrieved from a subset of a plurality of conversational logs, to a remaining corpus of the plurality of conversational logs to train an automated response system according to an intent associated with each of the conversational logs. The combination of propagation operations and learning algorithms may include defining the labels by a user for the selected set of the subset of the plurality of conversational logs; training a probabilistic classifier using the defined labels of features of the selected set, wherein the probabilistic classifier produces labeling decisions for the subset of conversational logs; weighting the features of the selected set in a model optimization process; and/or training an additional classifier using the weighted features of the selected set and applying the additional classifier to the remaining corpus.
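The label-propagation idea can be sketched minimally: a classifier trained on the small labeled subset of logs assigns intent labels to the remaining corpus. The nearest-centroid "classifier" below is a stand-in for the probabilistic classifier in the abstract, and the feature vectors and intent names are invented.

```python
import math

def train_centroids(labeled):
    """labeled: list of (feature_vector, intent); returns the mean
    vector (centroid) per intent."""
    sums, counts = {}, {}
    for vec, intent in labeled:
        acc = sums.setdefault(intent, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[intent] = counts.get(intent, 0) + 1
    return {intent: [v / counts[intent] for v in acc]
            for intent, acc in sums.items()}

def propagate(centroids, unlabeled):
    """Label each remaining log with the intent of its nearest centroid."""
    def nearest(vec):
        return min(centroids, key=lambda c: math.dist(vec, centroids[c]))
    return [(vec, nearest(vec)) for vec in unlabeled]

seed = [([0.0, 1.0], "refund"), ([1.0, 0.0], "order_status")]
centroids = train_centroids(seed)
labeled_rest = propagate(centroids, [[0.1, 0.9], [0.9, 0.2]])
```

In the co-training variant, a second classifier would be trained on re-weighted features of the seed set and its decisions combined with these.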

Z-vectors: speaker embeddings from raw audio using sincnet, extended CNN architecture and in-network augmentation techniques

Described herein are systems and methods for improved audio analysis using a computer-executed neural network having one or more in-network data augmentation layers. The systems described herein help ease or avoid unwanted strain on computing resources by employing the data augmentation techniques within the layers of the neural network. The in-network data augmentation layers produce various types of simulated audio data when the computer applies the neural network to an inputted audio signal during a training phase, enrollment phase, and/or testing phase. Subsequent layers of the neural network (e.g., convolutional layer, pooling layer, data augmentation layer) ingest the simulated audio data and the inputted audio signal and perform various operations.
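A toy sketch of such an in-network augmentation layer: during training it emits both the original signal and a noise-perturbed copy for the subsequent layers; at inference it passes the signal through unchanged. The additive-Gaussian-noise model is an assumption, and the SincNet-style filtering of the title is not reproduced here.

```python
import random

def augmentation_layer(signal, training, noise_std=0.1, seed=0):
    """Return the list of signals handed to the next layer: the
    original, plus a simulated noisy copy when in training mode."""
    if not training:
        return [signal]
    rng = random.Random(seed)  # seeded for reproducibility
    noisy = [x + rng.gauss(0, noise_std) for x in signal]
    return [signal, noisy]

outputs = augmentation_layer([0.2, -0.1, 0.4], training=True)
```

Because the augmentation happens inside the network, no enlarged copy of the dataset has to be materialized on disk, which is the resource saving the abstract points to.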

Skill shortlister for natural language processing

Devices and techniques are generally described for application determination in speech processing. Input data corresponding to a spoken utterance may be received. Speech recognition processing may be performed on the input data to generate text data. A machine learning encoder may generate a vector representation of the input data. A first binary classifier may determine a first probability that the input data corresponds to a first speech-processing application. A second binary classifier may determine a second probability that the input data corresponds to a second speech-processing application. A selection between the first speech-processing application and the second speech-processing application may be made based at least in part on the first probability and the second probability.
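The shortlisting decision can be sketched as one binary classifier per skill scoring the utterance vector, with the highest-probability skill selected. The dot-product-plus-sigmoid scorers and skill names below are toy assumptions standing in for the trained binary classifiers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def shortlist(vector, skill_classifiers):
    """vector: encoder output for the utterance.
    skill_classifiers: dict of skill name -> weight vector, one
    binary classifier per speech-processing application."""
    probs = {skill: sigmoid(sum(v * w for v, w in zip(vector, weights)))
             for skill, weights in skill_classifiers.items()}
    return max(probs, key=probs.get), probs

skill, probs = shortlist([1.0, 0.5],
                         {"weather": [2.0, 0.0], "music": [-1.0, 1.0]})
```

Because each skill has an independent binary classifier, new applications can be added without retraining a single monolithic multi-class head.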