Patent classifications
G10L25/78
Speech feature extraction apparatus, speech feature extraction method, and computer-readable storage medium
A speech feature extraction apparatus 100 includes a voice activity detection unit 103 that drops non-voice frames from the frames corresponding to an input speech utterance and calculates, for each frame, a posterior of being voiced; a voice activity detection process unit 106 that calculates, from a given voice activity detection posterior, function values used as weights when pooling frames to produce an utterance-level feature; and an utterance-level feature extraction unit 112 that extracts an utterance-level feature from the frames, on the basis of multiple frame-level features, using the function values.
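A minimal sketch of the pooling step described in this abstract, assuming the function values act as a posterior-weighted average over frame-level feature vectors (the threshold and the name weighted_pool are illustrative, not from the patent):

```python
import numpy as np

def weighted_pool(frame_features: np.ndarray, vad_posteriors: np.ndarray) -> np.ndarray:
    """Pool frame-level features into one utterance-level feature,
    weighting each frame by its voice-activity posterior.

    frame_features: (num_frames, feature_dim)
    vad_posteriors: (num_frames,) posterior of each frame being voiced
    """
    # Drop clearly non-voice frames (threshold is an illustrative choice).
    keep = vad_posteriors > 0.1
    feats, weights = frame_features[keep], vad_posteriors[keep]
    # Normalize the posteriors so they act as pooling weights.
    weights = weights / weights.sum()
    # Weighted average over the retained frames -> utterance-level feature.
    return (weights[:, None] * feats).sum(axis=0)
```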
Emitting word timings with end-to-end models
A method includes receiving a training example that includes audio data representing a spoken utterance and a ground truth transcription. For each word in the spoken utterance, the method also includes inserting a placeholder symbol before the respective word identifying a respective ground truth alignment for a beginning and an end of the respective word, determining a beginning word piece and an ending word piece, and generating a first constrained alignment for the beginning word piece and a second constrained alignment for the ending word piece. The first constrained alignment is aligned with the ground truth alignment for the beginning of the respective word and the second constrained alignment is aligned with the ground truth alignment for the ending of the respective word. The method also includes constraining an attention head of a second pass decoder by applying the first and second constrained alignments.
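A rough sketch of one way the constrained alignments could be expressed as a per-token attention constraint for the second-pass decoder; the tolerance window and all names are assumptions, not taken from the abstract:

```python
import numpy as np

def constrained_attention_mask(num_frames: int,
                               word_boundaries: list[tuple[int, int]],
                               tolerance: int = 2) -> np.ndarray:
    """Build attention constraints for the begin/end word pieces of each word.

    word_boundaries: ground-truth (start_frame, end_frame) for each word.
    Returns a (2 * num_words, num_frames) boolean mask: row 2i constrains the
    beginning word piece of word i, row 2i+1 constrains its ending word piece.
    """
    masks = []
    for start, end in word_boundaries:
        # Beginning word piece may only attend near the ground-truth start frame.
        begin_mask = np.zeros(num_frames, dtype=bool)
        begin_mask[max(0, start - tolerance):start + tolerance + 1] = True
        # Ending word piece may only attend near the ground-truth end frame.
        end_mask = np.zeros(num_frames, dtype=bool)
        end_mask[max(0, end - tolerance):end + tolerance + 1] = True
        masks.extend([begin_mask, end_mask])
    return np.stack(masks)
```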
SELECTIVELY ACTIVATING ON-DEVICE SPEECH RECOGNITION, AND USING RECOGNIZED TEXT IN SELECTIVELY ACTIVATING ON-DEVICE NLU AND/OR ON-DEVICE FULFILLMENT
Implementations can reduce the time required to obtain responses from an automated assistant by, for example, obviating the need to provide an explicit invocation to the automated assistant, such as by saying a hot-word/phrase or performing a specific user input, prior to speaking a command or query. In addition, the automated assistant can optionally receive, understand, and/or respond to the command or query without communicating with a server, thereby further reducing the time in which a response can be provided. Implementations only selectively initiate on-device speech recognition responsive to determining that one or more conditions are satisfied. Further, in some implementations, on-device NLU, on-device fulfillment, and/or resulting execution occur only responsive to determining, based on recognized text from the on-device speech recognition, that such further processing should occur. Thus, through selective activation of on-device speech processing, and/or selective activation of on-device NLU and/or on-device fulfillment, various client device resources are conserved.
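A minimal sketch of the selective-activation control flow described above, with the conditions, recognizer, NLU, and fulfillment passed in as hypothetical callables (none of these names come from the patent):

```python
from typing import Callable, Optional

def handle_audio(audio: bytes,
                 conditions: list[Callable[[], bool]],
                 asr: Callable[[bytes], str],
                 should_process: Callable[[str], bool],
                 nlu: Callable[[str], dict],
                 fulfill: Callable[[dict], str]) -> Optional[str]:
    """Selective on-device pipeline: run ASR only if some condition holds,
    and run NLU/fulfillment only if the recognized text warrants it."""
    if not any(cond() for cond in conditions):
        return None                  # keep ASR off; conserve device resources
    text = asr(audio)                # on-device speech recognition
    if not should_process(text):
        return None                  # text does not look like an assistant request
    return fulfill(nlu(text))        # on-device NLU and on-device fulfillment
```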
SYSTEM AND METHOD FOR GENERATING, TRIGGERING, AND PLAYING AUDIO CUES IN REAL TIME USING A PERSONAL AUDIO DEVICE
A system and method for generating, triggering, and playing a sequence of audio files with cues for delivering a presentation for a presenter using a personal audio device coupled to a computing device. The system comprises a computing device that is coupled to a presentation data analysis server through a network. The method includes (i) generating a sequence of audio files with cues for delivering a presentation, (ii) triggering playing of an audio file from the sequence of audio files, and (iii) playing the sequence of audio files one by one, on the computing device, using the personal audio device coupled to the computing device, to enable the presenter to recall and speak the content based on the sequence of the audio files.
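A brief sketch of the trigger-then-play loop in step (iii), assuming the trigger and the audio-output routine are supplied as hooks (these are illustrative interfaces, not defined in the patent):

```python
import time
from typing import Callable, Iterable

def play_cue_sequence(audio_files: Iterable[str],
                      trigger: Callable[[], bool],
                      play: Callable[[str], None],
                      poll_interval: float = 0.1) -> None:
    """Play presentation-cue audio files one by one: wait for a trigger,
    then send the next cue to the personal audio device."""
    for cue in audio_files:
        while not trigger():         # block until the next cue is requested
            time.sleep(poll_interval)
        play(cue)                    # play this cue on the personal audio device
```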
MICROPHONE ARRAY SYSTEM WITH SOUND WIRE INTERFACE AND ELECTRONIC DEVICE
A microphone array system comprises N microphones, including a first microphone . . . an Nth microphone, wherein N is a natural number greater than 2. Each of the N microphones is provided with: an acoustic transducer for picking up a sound signal and converting the sound signal into an electric signal; a voice activation detector, connected to the corresponding acoustic transducer, and configured to perform voice activation detection on the electric signal and form an activation signal; a buffer memory, connected to the acoustic transducer, and configured to store a 1/N electric signal of a predetermined segment; and a sound wire interface, connected to the corresponding acoustic transducer, the buffer memory, and the voice activation detector, wherein the sound wire interface is connected to an external master chip via a sound wire bus for outputting the activation signal to the external master chip.
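An illustrative software model of one microphone channel in such an array, with an energy-based detector standing in for the voice activation detector; the class name, threshold, and buffer sizing are assumptions, not taken from the patent:

```python
from collections import deque
import numpy as np

class MicChannel:
    """One microphone of the array: buffers its 1/N share of a predetermined
    segment and raises an activation flag when voice activity is detected,
    before anything is sent over the sound wire bus to the master chip."""

    def __init__(self, buffer_samples: int, vad_threshold: float = 0.02):
        self.buffer = deque(maxlen=buffer_samples)   # per-mic buffer memory
        self.vad_threshold = vad_threshold

    def process_block(self, samples: np.ndarray) -> bool:
        """Store the block and return True if voice activity is detected."""
        self.buffer.extend(samples.tolist())
        # Simple energy-based check stands in for the voice activation detector.
        return float(np.mean(samples ** 2)) > self.vad_threshold ** 2
```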
METHOD AND SYSTEM FOR DYNAMIC VOICE ENHANCEMENT
The present disclosure provides a method and system for voice enhancement. The method and system of the present disclosure may simultaneously perform two-path signal processing on an input signal. The first path signal processing includes receiving an audio source input and performing dynamic loudness balancing on the audio source input based on a first gain control parameter. The second path signal processing includes: performing voice detection on the audio source input and calculating a detection confidence; and calculating a second gain control parameter based on the detection confidence. The first path signal processing and the second path signal processing may be synchronous or asynchronous. The method of the present disclosure also includes updating the first gain control parameter with the second gain control parameter calculated by the second processing path and performing the first path signal processing based on the updated first gain control parameter.
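A minimal sketch of the two-path gain update, assuming the detection confidence is mapped to a gain via a linear dB boost (the mapping, the block-wise processing, and all names are assumptions, not specified in the abstract):

```python
import numpy as np

def dynamic_voice_enhancement(audio_blocks, detect_voice_confidence,
                              base_gain: float = 1.0, boost_db: float = 6.0):
    """Two-path processing: the second path turns voice-detection confidence
    into a new gain control parameter, which updates the gain used by the
    first path's dynamic loudness balancing."""
    gain = base_gain
    for block in audio_blocks:
        # Second path: voice detection confidence -> second gain control parameter.
        confidence = detect_voice_confidence(block)             # expected in [0, 1]
        gain = base_gain * 10 ** ((confidence * boost_db) / 20)  # update first-path gain
        # First path: dynamic loudness balancing using the updated gain.
        yield np.clip(block * gain, -1.0, 1.0)
```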