G10L25/87

LOW-LATENCY INTELLIGENT AUTOMATED ASSISTANT
20230072481 · 2023-03-09

Systems and processes for operating a digital assistant are provided. In an example process, low-latency operation of a digital assistant is provided. In this example, natural language processing, task flow processing, dialogue flow processing, speech synthesis, or any combination thereof can be at least partially performed while awaiting detection of a speech end-point condition. Upon detection of a speech end-point condition, results obtained from performing the operations can be presented to the user. In another example, robust operation of a digital assistant is provided. In this example, task flow processing by the digital assistant can include selecting a candidate task flow from a plurality of candidate task flows based on determined task flow scores. The task flow scores can be based on speech recognition confidence scores, intent confidence scores, flow parameter scores, or any combination thereof. The selected candidate task flow is executed and corresponding results presented to the user.
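
The task flow selection step described above can be sketched as follows. This is only an illustrative sketch: the abstract does not say how the three scores are combined, so the weighted average, the `CandidateTaskFlow` class, and every field and function name here are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CandidateTaskFlow:
    """Hypothetical container for one candidate task flow and its scores (all in 0..1)."""
    name: str
    asr_confidence: float      # speech recognition confidence score
    intent_confidence: float   # intent confidence score
    parameter_score: float     # flow parameter score


def task_flow_score(candidate, weights=(1.0, 1.0, 1.0)):
    """Combine the three scores with a weighted average (an assumed combination rule)."""
    w_asr, w_intent, w_param = weights
    total = w_asr + w_intent + w_param
    return (
        w_asr * candidate.asr_confidence
        + w_intent * candidate.intent_confidence
        + w_param * candidate.parameter_score
    ) / total


def select_task_flow(candidates, weights=(1.0, 1.0, 1.0)):
    """Return the candidate with the highest task flow score."""
    if not candidates:
        raise ValueError("at least one candidate task flow is required")
    return max(candidates, key=lambda c: task_flow_score(c, weights))


candidates = [
    CandidateTaskFlow("call_contact", 0.9, 0.8, 0.5),
    CandidateTaskFlow("send_message", 0.7, 0.9, 1.0),
]
selected = select_task_flow(candidates)
print(selected.name)
```

In this sketch a candidate with a slightly weaker recognition score can still win when its intent and parameter scores are stronger, which is one way to read the robustness the abstract describes.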

INFORMATION PROCESSING DEVICE AND INFORMATION PROCESSING METHOD

The disclosure aims to enable a plurality of speeches by a user to be appropriately concatenated. An information processing device according to the present disclosure includes: an acquisition unit (131) that acquires first speech information indicating a first speech by a user, second speech information indicating a second speech by the user after the first speech, and respiration information regarding respiration of the user; and an execution unit (134) that concatenates the first speech and the second speech by executing voice interaction control according to a respiratory state of the user, the respiratory state being based on the respiration information acquired by the acquisition unit.

Stereo Audio Signal Delay Estimation Method and Apparatus
20230154483 · 2023-05-18

A stereo audio signal delay estimation method includes obtaining a current frame of a stereo audio signal, where the current frame includes a first channel audio signal and a second channel audio signal. The method further includes estimating an inter-channel time difference (ITD) of the current frame using a first algorithm when a signal type of a noise signal included in the current frame is a coherent noise signal type, or estimating the ITD using a second algorithm when the signal type of the noise signal is a diffuse noise signal type. The first algorithm includes weighting a frequency domain cross power spectrum based on a first weighting function that includes a first construction factor. The second algorithm includes weighting the frequency domain cross power spectrum based on a second weighting function that includes a second construction factor different from the first construction factor.
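
The idea of weighting a frequency domain cross power spectrum before picking the delay can be sketched as below. The abstract does not disclose the weighting functions or construction factors, so this example substitutes a well-known stand-in, the PHAT-beta weighting `1 / |cross|**beta`, and treats the exponent `beta` as the "construction factor". The function names and the per-noise-type values in `BETA_BY_NOISE_TYPE` are hypothetical placeholders, not values from the patent.

```python
import cmath
import math

# Hypothetical placeholder exponents; the patent's actual factors are not given in the abstract.
BETA_BY_NOISE_TYPE = {"coherent": 0.6, "diffuse": 1.0}


def dft(signal):
    """Naive O(n^2) discrete Fourier transform, kept dependency-free for illustration."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft_real(spectrum):
    """Real part of the naive inverse DFT."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def estimate_itd(left, right, beta, max_lag):
    """Estimate the inter-channel time difference in samples.

    A positive result means the left channel lags the right channel by that
    many samples; a negative result means the right channel lags the left.
    """
    n = len(left)
    if len(right) != n:
        raise ValueError("both channels must have the same frame length")
    weighted = []
    for a, b in zip(dft(left), dft(right)):
        cross = a * b.conjugate()          # cross power spectrum bin
        magnitude = abs(cross)
        # PHAT-beta weighting (assumed stand-in for the patent's weighting function)
        weighted.append(cross / magnitude ** beta if magnitude > 1e-12 else 0j)
    correlation = inverse_dft_real(weighted)
    # Negative lags wrap around to the end of the circular correlation.
    return max(range(-max_lag, max_lag + 1), key=lambda lag: correlation[lag % n])


def estimate_itd_for_noise_type(left, right, noise_type, max_lag=8):
    """Pick the weighting exponent from the (hypothetical) noise-type table."""
    return estimate_itd(left, right, BETA_BY_NOISE_TYPE[noise_type], max_lag)
```

Because the lag search uses a circular correlation, this sketch suits short frames where wrap-around is acceptable; a practical implementation would normally zero-pad the frame and use an FFT.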

Direction based end-pointing for speech recognition

A speech recognition system utilizes automatic speech recognition techniques, such as end-pointing, in conjunction with beamforming and/or signal processing to isolate speech from one or more speaking users in multiple received audio signals and to detect the beginning and/or end of the speech based at least in part on the isolation. Audio capture devices such as microphones may be arranged in a beamforming array to receive the multiple audio signals. Multiple audio sources including speech may be identified in different beams and processed.
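
Per-beam end-pointing can be illustrated with a simple energy-based detector. This is a sketch under stated assumptions: the abstract does not say how the beginning and end of speech are detected, so the energy threshold, the trailing-silence rule, and the names `detect_endpoints` and `detect_endpoints_per_beam` are all hypothetical, and the beamforming step is assumed to have already produced one sequence of frame energies per beam.

```python
def detect_endpoints(frame_energies, threshold, min_silence_frames):
    """Find the first speech segment in one beam's frame energies.

    Returns None if no frame reaches the threshold, (start, end) once
    `min_silence_frames` consecutive quiet frames follow the speech, or
    (start, None) if speech began but its end has not been confirmed yet.
    """
    start = None
    last_speech = None
    silence_run = 0
    for index, energy in enumerate(frame_energies):
        if energy >= threshold:
            if start is None:
                start = index          # beginning of speech
            last_speech = index
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:
                return (start, last_speech)   # end-point confirmed
    if start is None:
        return None
    return (start, None)


def detect_endpoints_per_beam(beam_energies, threshold, min_silence_frames):
    """Run the detector independently on every beam (dict of name -> energies)."""
    return {
        name: detect_endpoints(energies, threshold, min_silence_frames)
        for name, energies in beam_energies.items()
    }


beams = {
    "beam_0": [0.1, 0.2, 5.0, 6.0, 7.0, 0.1, 4.0, 0.1, 0.1, 0.1, 0.2],
    "beam_1": [0.1, 0.1, 0.2, 0.1, 0.1, 0.1, 0.1, 0.2, 0.1, 0.1, 0.1],
}
print(detect_endpoints_per_beam(beams, threshold=1.0, min_silence_frames=3))
```

Running the detector on each beam separately means a talker isolated in one beam can be end-pointed without being masked by noise or speech arriving from other directions.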

Emitting Word Timings with End-to-End Models

A method includes receiving a training example that includes audio data representing a spoken utterance and a ground truth transcription. For each word in the spoken utterance, the method also includes inserting a placeholder symbol before the respective word, identifying a respective ground truth alignment for a beginning and an end of the respective word, determining a beginning word piece and an ending word piece, and generating a first constrained alignment for the beginning word piece and a second constrained alignment for the ending word piece. The first constrained alignment is aligned with the ground truth alignment for the beginning of the respective word and the second constrained alignment is aligned with the ground truth alignment for the ending of the respective word. The method also includes constraining an attention head of a second pass decoder by applying the first and second constrained alignments.

Fully supervised speaker diarization
11688404 · 2023-06-27

A method includes receiving an utterance of speech and segmenting the utterance of speech into a plurality of segments. For each segment of the utterance of speech, the method also includes extracting a speaker-discriminative embedding from the segment and predicting a probability distribution over possible speakers for the segment using a probabilistic generative model configured to receive the extracted speaker-discriminative embedding as a feature input. The probabilistic generative model is trained on a corpus of training speech utterances, each segmented into a plurality of training segments. Each training segment includes a corresponding speaker-discriminative embedding and a corresponding speaker label. The method also includes assigning a speaker label to each segment of the utterance of speech based on the probability distribution over possible speakers for the corresponding segment.

SPEECH PROCESSING SYSTEM AND SPEECH PROCESSING METHOD

A speech intelligibility enhancing system for enhancing speech, the system comprising: a speech input for receiving speech to be enhanced; an enhanced speech output to output the enhanced speech; and a processor configured to convert speech received by the speech input to enhanced speech to be output by the enhanced speech output, the processor being configured to: extract a portion of the speech received by the speech input; calculate the power of the portion; estimate a contribution due to late reverberation to the power of the portion of the speech when reverberated; calculate a target late reverberation power; determine a time t_i for the estimated contribution due to late reverberation to decay to the target late reverberation power; calculate a pause duration, wherein the pause duration is calculated using the time t_i; and insert a pause having the calculated duration into the speech received by the speech input at a first location, wherein the first location is followed by the portion.
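
The decay-time and pause-duration steps can be sketched numerically. The abstract does not give the decay model, so this example assumes late reverberation power falls exponentially at a rate fixed by the room's RT60 (a 60 dB drop in RT60 seconds), and that the pause equals the decay time capped at a maximum. The function names, the `rt60` and `max_pause` parameters, and the cap itself are assumptions for illustration.

```python
import math


def decay_time(late_power, target_power, rt60):
    """Time in seconds for late reverberation power to decay to the target power.

    Assumes an exponential decay of 60 dB per `rt60` seconds.
    """
    if late_power <= 0 or target_power <= 0 or rt60 <= 0:
        raise ValueError("powers and rt60 must be positive")
    if late_power <= target_power:
        return 0.0                      # already at or below the target
    drop_db = 10.0 * math.log10(late_power / target_power)
    return rt60 * drop_db / 60.0


def pause_duration(late_power, target_power, rt60, max_pause=0.3):
    """Pause length in seconds: the decay time, capped at `max_pause` (assumed rule)."""
    return min(decay_time(late_power, target_power, rt60), max_pause)


# A 20 dB drop in a room with RT60 = 0.6 s takes one third of the RT60.
print(round(pause_duration(1.0, 0.01, 0.6), 3))
```

Capping the pause is a design choice of this sketch, made so that very reverberant conditions do not produce pauses long enough to disrupt the rhythm of the speech.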
