Patent classifications
G10L15/12
Multi-stage hotword detection
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for multi-stage hotword detection are disclosed. In one aspect, a method includes the actions of receiving, by a second stage hotword detector of a multi-stage hotword detection system that includes at least a first stage hotword detector and the second stage hotword detector, audio data that corresponds to an initial portion of an utterance. The actions further include determining a likelihood that the initial portion of the utterance includes a hotword. The actions further include determining that the likelihood that the initial portion of the utterance includes the hotword satisfies a threshold. The actions further include, in response to determining that the likelihood satisfies the threshold, transmitting a request for the first stage hotword detector to cease providing additional audio data that corresponds to one or more subsequent portions of the utterance.
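The hand-off described in this abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the class names, the `toy_score` function, the threshold value, and the reading of "satisfies a threshold" as "greater than or equal to" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FirstStageDetector:
    """Hypothetical first stage: keeps forwarding audio until asked to stop."""
    streaming: bool = True

    def cease_streaming(self) -> None:
        # Models the "request to cease providing additional audio data".
        self.streaming = False


@dataclass
class SecondStageDetector:
    """Hypothetical second stage: scores the initial portion of an utterance."""
    first_stage: FirstStageDetector
    score_fn: Callable[[List[float]], float]  # assumed to return a likelihood in [0, 1]
    threshold: float = 0.8  # assumed value, not taken from the abstract

    def receive(self, initial_audio: List[float]) -> bool:
        likelihood = self.score_fn(initial_audio)
        # Assumption: "satisfies the threshold" means likelihood >= threshold.
        if likelihood >= self.threshold:
            self.first_stage.cease_streaming()
            return True
        return False


def toy_score(audio: List[float]) -> float:
    """Stand-in for a real hotword model: mean absolute amplitude, capped at 1."""
    if not audio:
        return 0.0
    return min(1.0, sum(abs(x) for x in audio) / len(audio))
```

In this sketch, a quiet initial portion leaves the first stage streaming, while a portion that scores above the threshold causes the second stage to tell the first stage to stop sending audio.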
Method of automatically classifying speaking rate and speech recognition system using the same
Provided are a method of automatically classifying a speaking rate and a speech recognition system using the method. The speech recognition system using automatic speaking rate classification includes a speech recognizer configured to extract word lattice information by performing speech recognition on an input speech signal, a speaking rate estimator configured to estimate word-specific speaking rates using the word lattice information, a speaking rate normalizer configured to normalize a word-specific speaking rate into a normal speaking rate when the word-specific speaking rate deviates from a preset range, and a rescoring section configured to rescore the speech signal whose speaking rate has been normalized.
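The estimate-then-normalize step can be sketched as below. The abstract does not say how a speaking rate is measured or how normalization is carried out, so this example assumes a rate in phones per second computed from lattice word boundaries, and expresses normalization as a time-stretch factor per word. `WordArc`, the range limits, and the normal rate are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordArc:
    """Hypothetical word entry taken from a word lattice."""
    word: str
    start: float      # seconds
    end: float        # seconds
    num_phones: int


def speaking_rate(arc: WordArc) -> float:
    """Word-specific speaking rate in phones per second (assumed measure)."""
    duration = arc.end - arc.start
    if duration <= 0:
        raise ValueError(f"non-positive duration for word {arc.word!r}")
    return arc.num_phones / duration


def stretch_factors(arcs: List[WordArc], low: float = 8.0, high: float = 16.0,
                    normal: float = 12.0) -> List[float]:
    """Return a duration multiplier per word.

    Words whose rate lies inside [low, high] are left alone (factor 1.0).
    Words outside the range get the factor that would bring them to `normal`:
    stretching the duration by rate / normal yields exactly the normal rate.
    """
    factors = []
    for arc in arcs:
        rate = speaking_rate(arc)
        if rate < low or rate > high:
            factors.append(rate / normal)
        else:
            factors.append(1.0)
    return factors
```

A factor above 1.0 lengthens a word spoken too fast, and a factor below 1.0 shortens one spoken too slowly; the time-scaled signal would then be passed to the rescoring step.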
Analog-digital converter and analog-to-digital conversion method
An ADC and an analog-to-digital conversion method are provided. The ADC includes: a clock generator including M transmission gates, where M is an integer greater than or equal to 2 and the M transmission gates are configured to receive a periodically sent first clock signal and separately perform gating control on it, so as to generate M second clock signals; M ADC channels configured in a time-interleaved manner, which receive one analog signal and, under the control of the M second clock signals, separately perform sampling and analog-to-digital conversion on the analog signal to obtain M digital signals, where each ADC channel corresponds to one of the M second clock signals; and an adder configured to add the M digital signals together in the digital domain to obtain a digital output signal.
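A behavioral sketch of the time-interleaved structure is given below. It rests on assumptions the abstract does not state: gate k passes every M-th tick of the first clock (tick n when n mod M equals k), a channel outputs zero on ticks it does not sample, and each channel is an ideal uniform quantizer. All function names are hypothetical.

```python
from typing import List


def quantize(x: float, bits: int = 8, full_scale: float = 1.0) -> int:
    """Ideal uniform quantizer mapping [-full_scale, full_scale) to integer codes."""
    levels = 2 ** bits
    code = int((x + full_scale) / (2 * full_scale) * levels)
    return max(0, min(levels - 1, code))  # clip out-of-range inputs


def gated_clocks(num_ticks: int, m: int) -> List[List[int]]:
    """Derive M second clocks from one first clock (assumed round-robin gating)."""
    return [[1 if n % m == k else 0 for n in range(num_ticks)] for k in range(m)]


def interleaved_adc(samples: List[float], m: int, bits: int = 8) -> List[int]:
    """Run M channels under their gated clocks and add the outputs digitally."""
    clocks = gated_clocks(len(samples), m)
    # Each channel converts only on its own clock ticks and outputs 0 otherwise.
    channels = [
        [quantize(x, bits) if tick else 0 for x, tick in zip(samples, clocks[k])]
        for k in range(m)
    ]
    # The adder sums the M digital signals tick by tick.
    return [sum(channel[n] for channel in channels) for n in range(len(samples))]
```

Under these assumptions exactly one channel is active on each tick, so the adder output equals what a single quantizer running at the full rate would produce, while each channel only has to convert at 1/M of that rate.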
Speaker dependent voiced sound pattern template mapping
Various implementations disclosed herein include a training module configured to produce a set of segment templates from a concurrent segmentation of a plurality of vocalization instances of a VSP vocalized by a particular speaker, who is identifiable by a corresponding set of vocal characteristics. Each segment template provides a stochastic characterization of how each of one or more portions of a VSP is vocalized by the particular speaker in accordance with the corresponding set of vocal characteristics. Additionally, in various implementations, the training module includes systems, methods and/or devices configured to produce a set of VSP segment maps that each provide a quantitative characterization of how respective segments of the plurality of vocalization instances vary in relation to a corresponding one of a set of segment templates.
System and method of automated evaluation of transcription quality
Systems and methods automatically evaluate transcription quality. Audio data is obtained. The audio data is segmented into a plurality of utterances with a voice activity detector operating on a computer processor. The plurality of utterances are transcribed into at least one word lattice with a large vocabulary continuous speech recognition system operating on the processor. A minimum Bayes risk decoder is applied to the at least one word lattice to create at least one confusion network. At least one conformity ratio is calculated from the at least one confusion network.
Order statistic techniques for neural networks
According to some aspects, a method of classifying speech recognition results is provided, using a neural network comprising a plurality of interconnected network units, each network unit having one or more weight values, the method comprising using at least one computer, performing acts of providing a first vector as input to a first network layer comprising one or more network units of the neural network, transforming, by a first network unit of the one or more network units, the input vector to produce a plurality of values, the transformation being based at least in part on a plurality of weight values of the first network unit, sorting the plurality of values to produce a sorted plurality of values, and providing the sorted plurality of values as input to a second network layer of the neural network.
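The transform-sort-forward sequence in this abstract can be sketched in a few lines. This is an illustration under assumptions: the network unit is modeled as a set of weight vectors that each yield one value via a dot product, the sort is descending, and the second layer is a plain linear layer. None of these details are specified in the abstract, and the function names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def transform(weights: Matrix, x: Vector) -> Vector:
    """Produce one value per weight vector (dot product with the input)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def order_statistic_unit(weights: Matrix, x: Vector) -> Vector:
    """Transform the input, then sort the resulting values (descending, by assumption)."""
    return sorted(transform(weights, x), reverse=True)


def forward(unit_weights: Matrix, second_layer_weights: Matrix, x: Vector) -> Vector:
    """Feed the sorted values of the first-layer unit into a second layer."""
    sorted_values = order_statistic_unit(unit_weights, x)
    return transform(second_layer_weights, sorted_values)
```

Because the values are sorted before being passed on, each second-layer weight is tied to a rank (largest, second largest, and so on) rather than to a fixed position in the unit's output.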
System and method for extracting and using prosody features
A system for carrying out voice pattern recognition and a method for achieving the same are provided. The system includes an arrangement for acquiring an input voice, a signal processing library for extracting acoustic and prosodic features of the acquired voice, a database for storing a recognition dictionary, and at least one instance of a prosody detector for carrying out a prosody detection process on the respective extracted prosodic features, the prosody detector communicating with an end user application to apply control thereto.