G10L15/05

Agent system and information processing method

An agent system includes: a recognizer configured to recognize speech, including the speech contents of an occupant in a mobile object; an acquirer configured to acquire an image including the occupant; and an estimator configured to compare the wording in the speech contents recognized by the recognizer with unclear information stored in a storage, the unclear information including wording that makes speech contents unclear. When the speech contents of the occupant include unclear wording, the estimator estimates, on the basis of the image acquired by the acquirer, a first direction, which is the sight direction of the occupant, or a second direction, which is a direction indicated by the occupant, and estimates an object located in the estimated first or second direction. The recognizer is configured to recognize the speech contents of the occupant on the basis of the object estimated by the estimator.
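The estimator's flow can be sketched roughly as follows. This is an illustrative assumption of how the comparison and resolution might work; the vague-word list `UNCLEAR_WORDS`, the `gaze_direction` input, and the `objects_by_direction` map are all invented stand-ins for the stored unclear information, the image-based direction estimate, and the object estimate.

```python
# Hypothetical sketch: detect vague wording in the utterance, then
# resolve it to the object lying in the occupant's gaze (or pointing)
# direction. Names and data shapes are illustrative, not from the patent.

UNCLEAR_WORDS = {"that", "this", "there", "it"}

def resolve_utterance(words, gaze_direction, objects_by_direction):
    """Replace vague wording with the object found along the gaze direction."""
    if not any(w in UNCLEAR_WORDS for w in words):
        return words  # utterance is already clear; no image lookup needed
    target = objects_by_direction.get(gaze_direction)
    if target is None:
        return words  # no object estimated in that direction
    return [target if w in UNCLEAR_WORDS else w for w in words]

print(resolve_utterance(["what", "is", "that"], "left",
                        {"left": "the tower"}))
```

The resolved word sequence ("what is the tower") is what the recognizer would then re-interpret, per the last sentence of the abstract.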

CONTEXTUAL SUPPRESSION OF ASSISTANT COMMAND(S)

Some implementations process, using warm word model(s), a stream of audio data to determine a portion of the audio data that corresponds to particular word(s) and/or phrase(s) (e.g., a warm word) associated with an assistant command, process, using an automatic speech recognition (ASR) model, a preamble portion of the audio data (e.g., that precedes the warm word) and/or a postamble portion of the audio data (e.g., that follows the warm word) to generate ASR output, and determine, based on processing the ASR output, whether a user intended the assistant command to be performed. Additional or alternative implementations can process the stream of audio data using a speaker identification (SID) model to determine whether the audio data is sufficient to identify the user that provided a spoken utterance captured in the stream of audio data, and determine if that user is authorized to cause performance of the assistant command.
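A minimal sketch of the suppression decision, under stated assumptions: the warm word has already been spotted, the preamble and postamble have already been transcribed by ASR, and speaker identification has produced a speaker label. The negating-context phrases and the function names are illustrative, not from the publication.

```python
# Hedged sketch: execute the warm-word command only if (a) the surrounding
# ASR text does not indicate the user wasn't addressing the assistant
# (e.g. reported speech), and (b) the identified speaker is authorized.
# NEGATING_CONTEXT is an invented, simplified heuristic.

NEGATING_CONTEXT = {"don't", "do not", "she said", "he said"}

def should_execute(preamble, warm_word, postamble, speaker, authorized):
    context = f"{preamble} {postamble}".lower()
    if any(phrase in context for phrase in NEGATING_CONTEXT):
        return False  # contextual suppression: command likely not intended
    if speaker not in authorized:
        return False  # SID check: speaker may not trigger this command
    return True

print(should_execute("he said", "stop", "the music", "alice", {"bob"}))
print(should_execute("please", "stop", "the music", "bob", {"bob"}))
```

In practice the intent decision would come from a model over the ASR output rather than a phrase list; the sketch only shows where the two gates sit relative to the warm-word trigger.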

Appropriate utterance estimate model learning apparatus, appropriate utterance judgement apparatus, appropriate utterance estimate model learning method, appropriate utterance judgement method, and program

Provided is technology for assessing whether uttered speech detected from input speech is speech suited to a prescribed purpose. A method comprises detecting, from input speech including speech uttered by a speaker and noise, the uttered speech corresponding to the speech uttered by the speaker, extracting an acoustic feature of the uttered speech, generating, from the uttered speech, a speech recognition result set with a recognition score, generating, from the speech recognition result set with the recognition score, a speech recognition result word vector expression set and a speech recognition result part-of-speech vector expression set, generating a target utterance estimation model, providing, using the target utterance estimation model, a probability of the uttered speech being suited to the prescribed purpose, and outputting the uttered speech and the speech recognition result set with the recognition score when the uttered speech is judged suitable for the prescribed purpose.
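The final scoring step can be illustrated as a simple probabilistic model over the extracted features. This is a toy sketch only: the abstract does not specify the model family, so a logistic function over a concatenation of acoustic features and an averaged word-vector representation is assumed here, with invented weights.

```python
import math

# Minimal sketch, assuming a logistic model over [acoustic features;
# averaged word vector]. Feature meanings and weights are illustrative.

def estimate_target_probability(acoustic_feats, word_vec, weights, bias):
    features = acoustic_feats + word_vec
    score = sum(w * f for w, f in zip(weights, features)) + bias
    # Sigmoid: probability that the utterance suits the prescribed purpose.
    return 1.0 / (1.0 + math.exp(-score))

p = estimate_target_probability(
    acoustic_feats=[0.8, 0.1],   # e.g. energy, pitch variance (assumed)
    word_vec=[0.3, 0.5],         # averaged recognition-result word vector
    weights=[1.0, -0.5, 0.7, 0.2],
    bias=-0.4)
print(round(p, 3))
```

An utterance would be output (with its scored recognition result set) only when this probability clears some threshold.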

Method and apparatus for outputting information

A method and an apparatus for outputting information are provided. The method includes acquiring voice information received within a preset time period before a device is awakened, where the device is provided with a wake-up model for outputting preset response information when a preset wake-up word is received; performing speech recognition on the voice information to obtain a recognition result; extracting feature information of the voice information in response to determining that the recognition result does not include the preset wake-up word; generating a counterexample training sample according to the feature information; and training the wake-up model using the counterexample training sample, and outputting the trained wake-up model.
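The counterexample-mining step can be sketched as follows. `transcribe` and `extract_features` are hypothetical placeholders for the speech recognition and feature-extraction stages, and the wake word is invented; the point is only the gating logic: pre-wake audio whose transcript lacks the wake word becomes a negative (label 0) sample for retraining the wake-up model.

```python
# Hedged sketch of counterexample generation for a wake-up model.

WAKE_WORD = "hey assistant"  # illustrative wake word

def make_counterexample(voice_info, transcribe, extract_features):
    transcript = transcribe(voice_info)
    if WAKE_WORD in transcript:
        return None  # genuine wake-up; not a false-trigger counterexample
    return {"features": extract_features(voice_info), "label": 0}

# A false trigger: no wake word in the recognized text -> negative sample.
sample = make_counterexample(
    "clip-02",
    transcribe=lambda _: "hey what is the weather",
    extract_features=lambda clip: [0.4, 0.6])
print(sample)
```

Accumulated samples of this shape would then be fed into the wake-up model's training loop as negatives.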

Cascade Architecture for Noise-Robust Keyword Spotting
20230097197 · 2023-03-30

A method (400) includes receiving, at a first processor (110) of a user device (102), streaming multi-channel audio (118) captured by an array of microphones (107), each channel (119) including respective audio features. For each channel, the method also includes processing, by the first processor, using a first stage hotword detector (210), the respective audio features to determine whether a hotword is detected. When the first stage hotword detector detects the hotword, the method also includes the first processor providing chomped raw audio data (212) to a second processor that processes, using a first noise cleaning algorithm (250), the chomped raw audio data to generate a clean monophonic audio chomp (260). The method also includes processing, by the second processor using a second stage hotword detector (220), the clean monophonic audio chomp to detect the hotword.
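The cascade's control flow can be sketched with stubbed-out components. The detectors and the cleaning step here are invented toy functions; the sketch only shows the division of labor: a cheap first-stage check runs per channel on the first processor, and only a hit there pays for noise cleaning and the stronger second-stage detector on the second processor.

```python
# Illustrative two-stage hotword cascade with placeholder stubs.

def cascade_detect(channels, stage1, clean, stage2):
    """channels: per-microphone audio feature streams (one list per mic)."""
    if not any(stage1(ch) for ch in channels):
        return False              # first processor: no candidate hotword
    mono = clean(channels)        # noise-clean into one monophonic chomp
    return stage2(mono)           # second processor confirms or rejects

hit = cascade_detect(
    channels=[[0.9, 0.8], [0.2, 0.1]],
    stage1=lambda ch: max(ch) > 0.5,                       # coarse, low-power
    clean=lambda chs: [sum(x) / len(x) for x in zip(*chs)],  # toy "cleaning"
    stage2=lambda mono: sum(mono) / len(mono) > 0.4)       # stronger check
print(hit)
```

The asymmetry is the design point: the always-on stage must be cheap enough to run continuously, while the accurate stage only wakes on candidates.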

Enhancing ASR System Performance for Agglutinative Languages

A training-stage technique trains a language model for use in an ASR system. The technique includes: obtaining a training corpus that includes a sequence of terms; determining that an original term in the training corpus is not present in a dictionary resource; segmenting the original term into two or more sub-terms using a segmentation resource; determining that the segmentation of the original term into the two or more sub-terms is a valid segmentation, based on two or more validity tests; and training the language model based on the terms that have been identified. A computer-implemented inference-stage technique applies the language model to produce ASR output results. The inference-stage technique merges a sub-term with a preceding term if these two terms are separated by no more than a prescribed interval of time.
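The inference-stage merge rule lends itself to a short sketch. The token representation and the 50 ms gap threshold are assumptions for illustration (the abstract only says "no more than a prescribed interval of time"); the example words are an invented Turkish-style agglutinative split.

```python
# Hedged sketch of the inference-stage merge: glue a recognized token onto
# the preceding one when the time gap between them is at most max_gap
# seconds. Tokens are (text, start_time, end_time) tuples (assumed shape).

def merge_subterms(tokens, max_gap=0.05):
    merged = []
    for text, start, end in tokens:
        if merged and start - merged[-1][2] <= max_gap:
            prev_text, prev_start, _ = merged[-1]
            merged[-1] = (prev_text + text, prev_start, end)  # fuse sub-term
        else:
            merged.append((text, start, end))
    return [text for text, _, _ in merged]

# Sub-terms 20 ms apart are rejoined; the next word, 300 ms later, is not.
print(merge_subterms([("kitap", 0.00, 0.30),
                      ("larımızdan", 0.32, 0.80),
                      ("geldi", 1.10, 1.50)]))
```

This recovers whole surface words at output time even though the language model was trained on segmented sub-terms.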