Patent classifications
G10L25/87
METHOD AND APPARATUS FOR RECONSTRUCTING VOICE CONVERSATION
A voice conversation reconstruction method performed by a voice conversation reconstruction apparatus is disclosed. The method includes acquiring speaker-specific voice recognition data about a voice conversation, dividing the speaker-specific voice recognition data into a plurality of blocks at token boundaries according to a predefined division criterion, arranging the plurality of blocks in chronological order irrespective of speaker, merging blocks from a continuous utterance of the same speaker among the arranged blocks, and reconstructing the merged blocks in a conversation format, in chronological order and by speaker.
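The pipeline in the abstract above (divide, arrange chronologically, merge same-speaker runs, reconstruct) can be sketched in Python. The `Block` record and `reconstruct_conversation` name are illustrative, assuming each block carries a speaker label, a start time, and its text:

```python
from dataclasses import dataclass

@dataclass
class Block:
    speaker: str   # speaker label from speaker-specific recognition
    start: float   # utterance start time in seconds
    text: str      # recognised tokens in this block

def reconstruct_conversation(blocks):
    """Arrange blocks chronologically irrespective of speaker, merge
    consecutive blocks from the same speaker, and render a
    speaker-labelled transcript."""
    ordered = sorted(blocks, key=lambda b: b.start)
    merged = []
    for b in ordered:
        if merged and merged[-1].speaker == b.speaker:
            # Continuous utterance of the same speaker: merge blocks.
            merged[-1].text += " " + b.text
        else:
            merged.append(Block(b.speaker, b.start, b.text))
    return [f"{b.speaker}: {b.text}" for b in merged]
```

A stable sort on start time keeps same-timestamp blocks in input order, so interleaved recognisers do not scramble ties.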
Digital assistant and a corresponding method for voice-based interactive communication based on detected user gaze indicating attention
Method for voice-based interactive communication using a digital assistant. The method comprises: an attention detection step, in which the digital assistant detects user attention and is consequently set into a listening mode; a speaker detection step, in which the digital assistant detects the user as the current speaker; a speech sound detection step, in which the digital assistant detects and records speech uttered by the current speaker, and which further comprises a lip movement detection step, in which the digital assistant detects lip movement of the current speaker; a speech analysis step, in which the digital assistant parses the recorded speech and extracts speech-based verbal informational content from it; and a subsequent response step, in which the digital assistant provides feedback to the user based on the recorded speech.
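A minimal sketch of that step sequence, with the detectors, recogniser, and responder passed in as callables. All names are hypothetical, and frames are plain dicts here:

```python
def assistant_turn(frames, detect_gaze, detect_lips, transcribe, respond):
    """One interaction turn: enter listening mode only when user
    attention (gaze) is detected, keep audio only for frames where the
    current speaker's lips are moving, then parse and answer."""
    # Attention detection step: without detected gaze, stay idle.
    if not any(detect_gaze(f) for f in frames):
        return None
    # Speech sound + lip movement detection: keep lip-moving frames.
    speech = [f["audio"] for f in frames if detect_lips(f)]
    if not speech:
        return None
    # Speech analysis step: extract verbal content.
    text = transcribe(speech)
    # Response step: feedback based on the recorded speech.
    return respond(text)
```

Gating recording on lip movement is what lets the assistant ignore background speech from people it is not looking at.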
Dynamic contextual dialog session extension
A dialog system is described that is capable of maintaining a single dialog session covering multiple user utterances, which may be separated by pauses or time gaps, and that continuously determines intent across the multiple utterances within a session.
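The gap-tolerant session idea can be sketched as a pure function over timestamped utterances; `gap_limit` and the function name are assumptions, not the patent's terms:

```python
def session_context(utterances, gap_limit=8.0):
    """Accumulate (text, time) utterances into one dialog session
    across pauses; a pause longer than gap_limit seconds starts a
    fresh session.  Returns the combined text of the latest session,
    over which intent would be continuously re-evaluated."""
    session = []
    prev_time = None
    for text, t in utterances:
        if prev_time is not None and t - prev_time > gap_limit:
            session = []          # gap too long: new session
        session.append(text)
        prev_time = t
    return " ".join(session)
```

Intent classification over the joined session text, rather than the last utterance alone, is what lets "the lights" resolve against the earlier "turn on".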
SYSTEMS AND METHODS FOR VIRTUAL MEETING SPEAKER SEPARATION
A computer-implemented machine learning method for improving speaker separation is provided. The method comprises processing audio data to generate prepared audio data and determining feature data and speaker data from the prepared audio data through a clustering iteration to generate an audio file. The method further comprises re-segmenting the audio file to generate a speaker segment and causing the speaker segment to be displayed through a client device.
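The clustering iteration over per-frame speaker features might look like the toy k-means pass below (pure Python; real systems would cluster learned speaker embeddings and then apply the re-segmentation step, which is omitted here):

```python
import math
import random

def cluster_speakers(embeddings, k, iters=10, seed=0):
    """Assign each frame embedding to one of k speaker clusters by a
    plain k-means iteration; the returned labels act as speaker ids."""
    rng = random.Random(seed)
    centroids = rng.sample(embeddings, k)
    labels = [0] * len(embeddings)
    for _ in range(iters):
        # Assignment: nearest centroid per embedding.
        labels = [min(range(k), key=lambda c: math.dist(e, centroids[c]))
                  for e in embeddings]
        # Update: move each centroid to the mean of its members.
        for c in range(k):
            members = [e for e, lab in zip(embeddings, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels
```

Consecutive frames with the same label would then be joined into the displayed speaker segments.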
Processing and visualising audio signals
The present disclosure relates to methods, computer programs, and computer-readable media for processing a voice audio signal. A method includes receiving, at an electronic device, a voice audio signal; identifying spoken phrases within the voice audio signal based on the detection of voice activity or inactivity; dividing the voice audio signal into a plurality of segments based on the identified spoken phrases; and, in accordance with a determination that a selected segment of the plurality of segments has a duration T_seg longer than a threshold duration T_thresh, identifying the most likely location of a breath in the audio associated with the selected segment and dividing the selected segment into sub-segments based on that location.
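Under the simplifying assumption that the most likely breath location is the quietest analysis frame, the duration check and split can be sketched as follows (the names and the 50 ms frame size are illustrative):

```python
def split_long_segment(samples, rate, t_thresh, frame_s=0.05):
    """Return the segment unchanged if its duration T_seg is at most
    t_thresh seconds; otherwise split it at the lowest-energy frame,
    taken here as the most likely breath location."""
    if len(samples) / rate <= t_thresh:
        return [samples]
    n = max(1, int(frame_s * rate))
    frames = [samples[i:i + n] for i in range(0, len(samples), n)]
    energies = [sum(s * s for s in f) for f in frames]
    cut = energies.index(min(energies)) * n
    left, right = samples[:cut], samples[cut:]
    # Fall back to no split if the quietest frame is at the very start.
    return [left, right] if left else [samples]
```

Applying the same check recursively to each sub-segment would continue splitting until every piece fits under the threshold.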
DETECTION APPARATUS, METHOD AND PROGRAM FOR THE SAME
A detection device includes: a labeling acoustic feature calculation unit configured to calculate a labeling acoustic feature from voice data; a time information acquisition unit configured to acquire a label with time information corresponding to the voice data, from a label with no time information and the labeling acoustic feature, using a labeling acoustic model that receives a label with no time information and a labeling acoustic feature as inputs and outputs a label with time information; an acoustic feature prediction unit configured to predict an acoustic feature corresponding to the label with time information and acquire a predicted value, using an acoustic model that receives a label with time information as an input and outputs an acoustic feature; an acoustic feature calculation unit configured to calculate an acoustic feature from the voice data; a difference calculation unit configured to determine an acoustic difference between the calculated acoustic feature and the predicted value; and a detection unit configured to detect a labeling error based on whether the difference exceeds a predetermined threshold value.
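Given time-aligned predicted and measured acoustic feature vectors, the final comparison step reduces to a thresholded difference. The mean-absolute-difference metric below is an assumption, since the abstract only specifies "an acoustic difference":

```python
def detect_labeling_errors(predicted, measured, threshold):
    """Return the indices of labels whose predicted acoustic features
    differ from the measured ones by more than the threshold, i.e. the
    suspected labeling errors."""
    flagged = []
    for i, (p, m) in enumerate(zip(predicted, measured)):
        # Mean absolute difference across the feature dimensions.
        diff = sum(abs(x - y) for x, y in zip(p, m)) / len(p)
        if diff > threshold:
            flagged.append(i)
    return flagged
```

The intuition: if the label were correct, an acoustic model conditioned on it should predict features close to what was actually measured, so a large gap points at a mislabelled span.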
Dialogue processing apparatus, a vehicle including the same, and a dialogue processing method
A dialogue processing apparatus includes: a speech input device configured to receive a speech signal of a user; a first buffer configured to store the received speech signal; an output device; and a controller. The controller is configured to: detect an utterance end time point based on the stored speech signal; generate a second speech recognition result corresponding to the speech signal after the utterance end time point, depending on whether an intention of the user can be identified from a first speech recognition result corresponding to the speech signal before the utterance end time point; and control the output device to output a response corresponding to the intention of the user determined based on the first speech recognition result, the second speech recognition result, or both.
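The abstract does not say how the utterance end time point is detected; a common energy-based sketch over buffered frames, with `silence_level` and `min_silence` as assumed parameters, is:

```python
def find_utterance_end(frame_energies, silence_level=0.01, min_silence=3):
    """Return the index of the utterance end time point: the first
    frame followed by min_silence consecutive low-energy frames.
    Speech before this index would feed the first recognition result,
    speech after it the second."""
    quiet = 0
    for i, energy in enumerate(frame_energies):
        if energy < silence_level:
            quiet += 1
            if quiet >= min_silence:
                return i - min_silence + 1
        else:
            quiet = 0
    return len(frame_energies)
```

If intent is already identifiable from the speech before this point, the controller can respond immediately; otherwise it keeps recognising the buffered audio after the end point.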