G10L15/05

PERSONALIZED SPEECH QUERY ENDPOINTING BASED ON PRIOR INTERACTION(S)
20230230578 · 2023-07-20 ·

A personalized endpointing measure can be used to determine whether a user has finished speaking a spoken utterance. Various implementations include using the personalized endpointing measure to determine whether a candidate endpoint indicates a user has finished speaking the spoken utterance or whether the user has paused and has not finished speaking the spoken utterance. Various implementations include determining the personalized endpointing measure based on a portion of a text representation of the spoken utterance immediately preceding the candidate endpoint and a user-specific measure. Additionally or alternatively, the user-specific measure can be based on the text representation immediately preceding the candidate endpoint and one or more historical interactions with the user. In various implementations, each of the historical interactions is specific to the text representation and the user, and indicates whether a previous instance of the text representation was a previous endpoint for the user.
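The mechanism described above can be sketched as follows. This is a minimal illustration of the idea, not the patented implementation: a generic endpointer score is blended with a user-specific measure computed from whether the same preceding text historically marked a true endpoint for this user. All function names and the interpolation rule are assumptions.

```python
# Illustrative sketch only: personalize an endpoint decision by combining a
# generic endpointer score with a per-user measure derived from historical
# interactions keyed on the text immediately preceding the candidate endpoint.

def user_specific_measure(history, preceding_text):
    """Fraction of prior interactions in which `preceding_text` preceded a
    true endpoint. `history` maps a text prefix to a list of booleans, one
    per prior interaction (True = that instance really was an endpoint)."""
    outcomes = history.get(preceding_text)
    if not outcomes:
        return 0.5  # no history for this text: stay neutral
    return sum(outcomes) / len(outcomes)

def personalized_endpoint_score(generic_score, history, preceding_text, weight=0.5):
    """Interpolate the generic score with the user-specific measure."""
    return (1 - weight) * generic_score + weight * user_specific_measure(
        history, preceding_text)

# A user who usually keeps speaking after saying "call" pulls the score down,
# so the assistant waits instead of cutting the utterance short.
history = {"call": [False, False, True, False]}
score = personalized_endpoint_score(0.8, history, "call")  # 0.5*0.8 + 0.5*0.25
```

A score below a decision threshold would then be treated as a pause rather than a finished utterance.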

SYSTEMS AND METHODS FOR GENERATING DISAMBIGUATED TERMS IN AUTOMATICALLY GENERATED TRANSCRIPTIONS INCLUDING INSTRUCTIONS WITHIN A PARTICULAR KNOWLEDGE DOMAIN
20230230579 · 2023-07-20 ·

A system and method for generating disambiguated terms in automatically generated transcriptions, including instructions within a knowledge domain, are disclosed. Exemplary implementations may: obtain a set of transcripts representing various speech from users; obtain indications of correlated correct and incorrect transcriptions of spoken terms within the knowledge domain; obtain a vector generation model that generates vectors for individual instances of the transcribed terms in the set of transcripts that are part of the lexicography of the knowledge domain; use the vector generation model to generate the vectors such that a first set of vectors and a second set of vectors are generated that represent the instances of the first correctly transcribed term and the first incorrectly transcribed term, respectively; and train the vector generation model to reduce spatial separation of vectors generated for instances of correlated correct and incorrect transcriptions of spoken terms within the knowledge domain.
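The training objective above can be illustrated with a hedged sketch: pull the vector for an incorrectly transcribed term toward the vector for its correlated correct term, reducing their Euclidean separation. The update rule, the example drug-name terms, and the two-dimensional embeddings are all assumptions for illustration, not the patent's actual model.

```python
# Sketch of the "reduce spatial separation" objective only: iteratively move
# the embeddings of a correlated correct/incorrect transcription pair toward
# each other.
import math

def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pull_together(embeddings, correct, incorrect, lr=0.1, steps=10):
    """Move both term vectors a fraction `lr` toward each other per step."""
    for _ in range(steps):
        u, v = embeddings[correct], embeddings[incorrect]
        embeddings[correct] = [a + lr * (b - a) for a, b in zip(u, v)]
        embeddings[incorrect] = [b + lr * (a - b) for a, b in zip(u, v)]
    return embeddings

# Hypothetical medical-domain pair: a drug name and a common mistranscription.
emb = {"metoprolol": [1.0, 0.0], "metroprolol": [0.0, 1.0]}
before = distance(emb["metoprolol"], emb["metroprolol"])
pull_together(emb, "metoprolol", "metroprolol")
after = distance(emb["metoprolol"], emb["metroprolol"])
```

After training, nearby vectors let a downstream recognizer map the incorrect transcription to the correct in-domain term.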

Adaptive batching to reduce recognition latency

Acoustic features are batched into two different batches. The second batch of the two batches is made in response to a detection of a word hypothesis output by a speech recognition network that received the first batch. The number of acoustic feature frames in the second batch is equal to a second batch size that is greater than the first batch size. The second batch is also provided to the speech recognition network for processing.
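A minimal sketch of this batching policy, assuming a recognizer callable that returns a word hypothesis (or None) per batch of frames: start with a small batch size for low first-word latency, then grow the batch size once a word hypothesis appears. The sizes and the toy recognizer are illustrative assumptions.

```python
# Illustrative sketch only: adaptive batch sizing keyed on the first word
# hypothesis from the speech recognition network.

def adaptive_batches(frames, recognizer, first_size=2, second_size=6):
    """Yield (batch, hypothesis) pairs, switching to the larger second batch
    size after the recognizer emits its first word hypothesis."""
    size = first_size
    i = 0
    while i < len(frames):
        batch = frames[i:i + size]
        i += len(batch)
        hyp = recognizer(batch)
        yield batch, hyp
        if hyp is not None:
            size = second_size  # larger batches amortize per-batch overhead

# Toy recognizer: emits a hypothesis once it has seen at least 4 frames total.
seen = 0
def recognizer(batch):
    global seen
    seen += len(batch)
    return "hello" if seen >= 4 else None

batches = list(adaptive_batches(list(range(10)), recognizer))
sizes = [len(b) for b, _ in batches]  # small, small, then large
```

Small early batches get the first word out quickly; the larger later batches trade that latency back for throughput once recognition is underway.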

Dynamic voice input detection for conversation assistants

A processor may receive data regarding a context for a first dialog turn. The processor may monitor a voice input from a user for the first dialog turn. The processor may detect a first pause in the voice input, the first pause having a duration that satisfies a time threshold. The processor may receive, based on the first pause, first voice input data. The processor may analyze the first voice input data. The processor may determine that additional time is recommended for the voice input to be provided by the user.
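The decision described above can be sketched as a single function: after a pause that satisfies the time threshold, analyze the partial input against the dialog context and decide whether more time is recommended. The slot-based completeness heuristic below is a placeholder assumption, not the abstract's actual analysis.

```python
# Hedged sketch only: grant extra listening time when a qualifying pause
# arrives but the voice input still looks incomplete for the dialog context.

def extend_listening(context, partial_text, pause_ms, threshold_ms=700):
    """Return True when the pause meets the threshold but expected
    information for this dialog turn appears to be missing."""
    if pause_ms < threshold_ms:
        return False  # pause too short to count as a candidate stop
    expected = context.get("expected_slots", [])
    # Placeholder completeness check: were all expected slot words mentioned?
    missing = [slot for slot in expected if slot not in partial_text.lower()]
    return bool(missing)

# Hypothetical booking turn that expects both a date and a time.
ctx = {"expected_slots": ["date", "time"]}
more_time = extend_listening(ctx, "book a table for a date at", 900)
```

When `more_time` is true, the assistant keeps listening instead of ending the turn at the pause.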

Hindrance speech portion detection using time stamps

A computer-implemented method of detecting a portion of audio data to be removed is provided. The method includes obtaining a recognition result of audio data. The recognition result includes recognized text data and time stamps. The method also includes extracting one or more candidate phrases from the recognition result using n-gram counts. The method further includes, for each candidate phrase, making pairs of same phrases with different time stamps and clustering the pairs of the same phrase by using differences in time stamps. The method includes further determining a portion of the audio data to be removed using results of the clustering.
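A rough sketch of the clustering step, under simplifying assumptions: a phrase that recurs at near-regular time intervals (for example, a repeated hold announcement) is flagged as a hindrance portion to remove. The tolerance-based interval check below stands in for the full pairing-and-clustering procedure, and the example phrases are hypothetical.

```python
# Illustrative sketch only: flag phrases whose repeated instances are spaced
# at roughly constant intervals, using ASR time stamps.
from collections import defaultdict

def recurring_phrases(segments, min_gaps=2, tolerance=1.0):
    """`segments` is a list of (phrase, start_time) tuples from a recognition
    result. Return phrases whose consecutive occurrences have nearly equal
    time-stamp differences, suggesting non-content audio to remove."""
    by_phrase = defaultdict(list)
    for phrase, ts in segments:
        by_phrase[phrase].append(ts)
    flagged = []
    for phrase, stamps in by_phrase.items():
        stamps.sort()
        gaps = [b - a for a, b in zip(stamps, stamps[1:])]
        if len(gaps) >= min_gaps and max(gaps) - min(gaps) <= tolerance:
            flagged.append(phrase)
    return flagged

# Hypothetical call-center recording: "please hold" repeats every ~30 s.
segments = [("please hold", 0.0), ("thanks", 4.2), ("please hold", 30.1),
            ("please hold", 60.0), ("order status", 45.0)]
flagged = recurring_phrases(segments)
```

The time spans of the flagged phrase's instances would then mark the audio portions to remove.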

Speech command verification

A system and method performs speech command verification to determine if audio data includes a representation of a speech command. A first neural network may process portions of the audio data before and after a representation of a wake trigger in the audio data. A second neural network may process the audio data using a recurrent neural network to determine if the audio data includes a representation of a wake trigger.
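The two-verifier structure can be sketched as below. This shows only the control flow implied by the abstract; the scoring lambdas are crude stand-ins for the two trained neural networks, and the frame windows, threshold, and combination rule are all assumptions.

```python
# Structural sketch only: combine a score from a model that examines audio
# just before and after the wake trigger with a score from a recurrent-style
# model over the whole clip.

def verify_command(audio, trigger_start, trigger_end,
                   context_model, recurrent_model,
                   context_frames=5, threshold=0.6):
    """Return True when the averaged verifier scores clear the threshold."""
    before = audio[max(0, trigger_start - context_frames):trigger_start]
    after = audio[trigger_end:trigger_end + context_frames]
    context_score = context_model(before, after)
    sequence_score = recurrent_model(audio)
    return (context_score + sequence_score) / 2 >= threshold

# Stand-in models: quiet context around the trigger and high overall energy
# count as evidence of a deliberate spoken command.
ctx = lambda before, after: 1.0 if sum(before) + sum(after) < 1.0 else 0.2
rnn = lambda audio: 0.8 if sum(audio) > 2.0 else 0.1

audio = [0.0, 0.1, 2.0, 2.5, 0.1, 0.0]  # trigger occupies frames 2..3
ok = verify_command(audio, 2, 4, ctx, rnn)
```

Requiring agreement between the context window and the full-sequence model is what lets the verifier reject accidental trigger-like audio.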