G10L2015/226

Subtitle generation using background information

A video is received. One or more subtitles are determined for the video. Whether a word found in a background of the video is similar to a word found in the one or more subtitles is determined. Responsive to determining the word found in the background of the video is similar to the word found in the one or more subtitles, one or more updated subtitles are generated. The one or more updated subtitles include the word found in the background of the video and remove the word found in the one or more subtitles that is similar. A metric for the one or more updated subtitles is calculated. Whether the metric is larger than a threshold is determined. Responsive to determining the metric is larger than the threshold, the video is updated to include the one or more updated subtitles.
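The claimed flow (compare words, swap in the background spelling, gate the result on a metric) can be sketched as follows. The function name, the similarity measure, the thresholds, and the confidence metric are all hypothetical illustrations, not the patented implementation:

```python
from difflib import SequenceMatcher

def update_subtitles(subtitles, background_words,
                     similarity_threshold=0.8, metric_threshold=0.9):
    """Swap subtitle words for similar words read from the video background."""
    updated = list(subtitles)
    for bg_word in background_words:
        for i, sub_word in enumerate(updated):
            ratio = SequenceMatcher(None, bg_word.lower(), sub_word.lower()).ratio()
            if ratio >= similarity_threshold and bg_word != sub_word:
                updated[i] = bg_word  # keep the background spelling, drop the similar one
    # hypothetical metric: fraction of subtitle words confirmed by the background
    metric = sum(w in background_words for w in updated) / max(len(updated), 1)
    # update the video's subtitles only when the metric exceeds the threshold
    return updated if metric > metric_threshold else subtitles
```

For example, a misrecognized station name in the subtitles could be corrected from signage visible in the frame, while a low metric leaves the original subtitles untouched.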

Dynamic speech recognition methods and systems with user-configurable performance

Methods and systems are provided for assisting operation of a vehicle using speech recognition. One method involves identifying a user-configured speech recognition performance setting value selected from among a plurality of speech recognition performance setting values, selecting a speech recognition model configuration corresponding to the user-configured speech recognition performance setting value from among a plurality of speech recognition model configurations, where each speech recognition model configuration of the plurality of speech recognition model configurations corresponds to a respective one of the plurality of speech recognition performance setting values, and recognizing an audio input as an input state using the speech recognition model configuration corresponding to the user-configured speech recognition performance setting value.
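The one-to-one mapping between setting values and model configurations might be sketched as a lookup table. The setting names, the configuration fields, and their values here are illustrative assumptions, not the claimed system:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpeechModelConfig:
    vocabulary_size: int  # larger vocabulary: better accuracy, more latency
    beam_width: int       # wider decoding beam: better accuracy, more compute

# hypothetical mapping: each performance setting value has its own configuration
CONFIGS = {
    "fast":     SpeechModelConfig(vocabulary_size=5_000,  beam_width=4),
    "balanced": SpeechModelConfig(vocabulary_size=20_000, beam_width=8),
    "accurate": SpeechModelConfig(vocabulary_size=80_000, beam_width=16),
}

def select_config(user_setting: str) -> SpeechModelConfig:
    """Select the model configuration matching the user-configured setting."""
    return CONFIGS[user_setting]
```

Audio input would then be recognized using whichever configuration the user's setting selects.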

System for modifying speech recognition and beamforming using a depth image

A system includes a speech recognition processor, a depth sensor coupled to the speech recognition processor, and an array of microphones coupled to the speech recognition processor. The depth sensor is operable to calculate a distance and a direction from the array of microphones to a source of audio data. The speech recognition processor is operable to select an acoustic model as a function of the distance and the direction from the array of microphones to the source of audio data. The speech recognition processor is operable to apply the distance measure in the microphone array beam formation so as to boost portions of the signals originating from the source of audio data and to suppress portions of the signals resulting from noise.
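Selecting an acoustic model as a function of distance and direction could look like the table lookup below. The distance bands, angle sectors, and model names are assumptions for illustration only:

```python
# hypothetical acoustic models keyed by (distance band, direction sector)
ACOUSTIC_MODELS = {
    ("near", "front"): "near_front_model",
    ("near", "side"):  "near_side_model",
    ("far",  "front"): "far_front_model",
    ("far",  "side"):  "far_side_model",
}

def select_acoustic_model(distance_m: float, azimuth_deg: float) -> str:
    """Pick an acoustic model from the depth sensor's distance and direction."""
    band = "near" if distance_m < 2.0 else "far"        # assumed 2 m cutoff
    sector = "front" if abs(azimuth_deg) < 45.0 else "side"
    return ACOUSTIC_MODELS[(band, sector)]
```

The same distance and direction would also steer the beamformer toward the speaker, boosting signal from that bearing and suppressing off-axis noise.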

CROSS-LINGUAL SPEECH RECOGNITION
20220383862 · 2022-12-01


Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for cross-lingual speech recognition are disclosed. In one aspect, a method includes the actions of determining a context of a second computing device. The actions further include identifying, by a first computing device, an additional pronunciation for a term of multiple terms. The actions further include including the additional pronunciation for the term in the lexicon. The actions further include receiving audio data of an utterance. The actions further include generating a transcription of the utterance by using the lexicon that includes the multiple terms and the pronunciation for each of the multiple terms and the additional pronunciation for the term. The actions further include after generating the transcription of the utterance, removing the additional pronunciation for the term from the lexicon. The actions further include providing, for output, the transcription.
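The add-then-remove lexicon lifecycle maps naturally onto a context manager; a minimal sketch, assuming a dictionary-backed lexicon and illustrative phone strings (neither is from the patent):

```python
from contextlib import contextmanager

# hypothetical lexicon: term -> list of pronunciations (phone strings illustrative)
lexicon = {"Paris": ["P EH R IH S"]}

@contextmanager
def temporary_pronunciation(term, pronunciation):
    """Add a pronunciation for one transcription pass, then remove it again."""
    lexicon.setdefault(term, []).append(pronunciation)
    try:
        yield lexicon  # decode the utterance while the extra pronunciation exists
    finally:
        lexicon[term].remove(pronunciation)

with temporary_pronunciation("Paris", "P AA R IY"):   # cross-lingual variant
    assert "P AA R IY" in lexicon["Paris"]            # usable during decoding
assert lexicon["Paris"] == ["P EH R IH S"]            # removed after transcription
```

Removing the pronunciation after the transcription keeps a context-specific variant (e.g., triggered by the second device's context) from polluting recognition in other contexts.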

Distributed identification in networked system
11683320 · 2023-06-20

The present disclosure is generally directed to a data processing system for customizing content in a voice activated computer network environment. With user consent, the data processing system can improve the efficiency and effectiveness of auditory data packet transmission over one or more computer networks by, for example, increasing the accuracy of the voice identification process used in the generation of customized content. The present solution can make accurate identifications while generating fewer audio identification models, which are computationally intensive to generate.

Language models using non-linguistic context
09842592 · 2017-12-12

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for language models using non-linguistic context. In some implementations, context data indicating non-linguistic context for the utterance is received. Based on the context data, feature scores for one or more non-linguistic features are generated. The feature scores for the non-linguistic features are provided to a language model trained to process scores for non-linguistic features. The output from the language model is received, and a transcription for the utterance is determined using the output of the language model.
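One plausible reading of the pipeline, turning non-linguistic context into feature scores and feeding them to a language model, is sketched below. The one-hot features, the linear rescoring model, and all names are hypothetical illustrations:

```python
def non_linguistic_feature_scores(context):
    """One-hot feature scores over assumed non-linguistic context values."""
    known_apps = ["maps", "email", "search"]   # hypothetical context vocabulary
    return [1.0 if context.get("app") == app else 0.0 for app in known_apps]

def pick_transcription(candidates, feature_scores, weights):
    """Rescore candidate transcriptions with a context-conditioned LM term."""
    def lm_bonus(word):
        # stand-in for a trained language model consuming the feature scores
        return sum(f * w for f, w in zip(feature_scores, weights[word]))
    return max(candidates, key=lambda w: candidates[w] + lm_bonus(w))
```

Under this sketch, an utterance spoken inside a maps application would favor navigation vocabulary over an acoustically similar but contextually unlikely alternative.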

METHOD, APPARATUS AND COMPUTER-READABLE MEDIA FOR TOUCH AND SPEECH INTERFACE WITH AUDIO LOCATION

Method, apparatus, and computer-readable media for touch and speech interface, with audio location, includes structure and/or function whereby at least one processor: (i) receives a touch input from a touch device; (ii) establishes a touch-speech time window; (iii) receives a speech input from a speech device; (iv) determines whether the speech input is present in a global dictionary; (v) determines a location of a sound source from the speech device; (vi) determines whether the touch input and the location of the speech input are both within a same region; (vii) if the speech input is in the dictionary, determines whether the speech input has been received within the window; and (viii) if the speech input has been received within the window, and the touch input and the speech input are both within the same region, activates an action corresponding to both the touch input and the speech input.
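The gating logic of the enumerated steps, known word, within the time window, co-located with the touch, can be condensed into one check. The data shapes, field names, and 2-second window are assumptions for illustration:

```python
def touch_speech_action(touch, speech, global_dictionary, window_s=2.0):
    """Fire an action only if speech is known, timely, and co-located with touch."""
    if speech["text"] not in global_dictionary:
        return None                                     # unknown word: ignore
    in_window = abs(speech["time"] - touch["time"]) <= window_s
    same_region = speech["region"] == touch["region"]   # from audio localization
    if in_window and same_region:
        return (touch["region"], speech["text"])        # action bound to both inputs
    return None
```

A "undo" spoken near the touched panel within the window triggers the action; the same word spoken too late, or from a different region, does not.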

SPECIFYING PREFERRED INFORMATION SOURCES TO AN ASSISTANT
20230186908 · 2023-06-15

Implementations relate to interactions between a user and an automated assistant during a dialog between the user and the automated assistant. Some implementations relate to processing received user request input to determine that it is of a particular type that is associated with a source parameter rule and, in response, causing one or more sources indicated as preferred by the source parameter rule and one or more additional sources not indicated by the source parameter rule to be searched based on the user request input. Further, those implementations relate to identifying search results of the search(es), and generating, in dependence on the search results, a response to the user request that includes content from search result(s) of the preferred source(s) and/or content from search result(s) of the additional source(s). Generating the response further includes providing, in the response, an indication of whether the source parameter rule was followed or violated in generating the response.

Methods and apparatus for detecting a voice command

According to some aspects, a method of monitoring an acoustic environment of a mobile device, at least one computer-readable medium encoded with instructions that, when executed, perform such a method, and/or a mobile device configured to perform such a method is provided. The method comprises receiving acoustic input from the environment of the mobile device while the mobile device is operating in a low power mode, detecting whether the acoustic input includes a voice command based on performing a plurality of processing stages on the acoustic input, wherein at least one of the plurality of processing stages is performed while the mobile device is operating in the low power mode, and using at least one contextual cue to assist in detecting whether the acoustic input includes a voice command.
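The staged cascade, where cheap early stages run in low power and only surviving input wakes heavier stages, might be sketched like this. The stage implementations (an energy gate, a stubbed length check standing in for keyword matching) and the single low-power stage count are assumptions:

```python
def detect_voice_command(frames, stages, low_power_stages=1):
    """Run cascaded stages; the first stages execute in low power mode."""
    power_mode = "low"
    for i, stage in enumerate(stages):
        if i >= low_power_stages:
            power_mode = "full"          # later stages wake the main processor
        if not stage(frames):
            return False, power_mode     # rejected early: no voice command
    return True, power_mode

# hypothetical stages: an energy gate, then a (stubbed) keyword check
stages = [lambda f: max(f) > 0.1, lambda f: len(f) >= 4]
```

Silence is rejected by the first stage without ever leaving low power, which is the point of the cascade.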

CONTEXTUAL BIASING FOR SPEECH RECOGNITION

A method includes receiving audio data encoding an utterance and obtaining a set of bias phrases corresponding to a context of the utterance. Each bias phrase includes one or more words. The method also includes processing, using a speech recognition model, acoustic features derived from the audio data to generate an output from the speech recognition model. The speech recognition model includes a first encoder configured to receive the acoustic features, a first attention module, a bias encoder configured to receive data indicating the obtained set of bias phrases, a bias attention module, and a decoder configured to determine likelihoods of sequences of speech elements based on output of the first attention module and output of the bias attention module. The method also includes determining a transcript for the utterance based on the likelihoods of sequences of speech elements.
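The bias attention module's core operation, attending over encoded bias phrases with the decoder state as the query, can be shown in miniature. This is plain dot-product attention over toy vectors, not the patented architecture; the embeddings would come from the bias encoder in the real model:

```python
import math

def softmax(xs):
    m = max(xs)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def bias_attention(decoder_state, bias_embeddings):
    """Dot-product attention over encoded bias phrases -> bias context vector."""
    scores = [sum(q * k for q, k in zip(decoder_state, emb))
              for emb in bias_embeddings]
    weights = softmax(scores)
    dim = len(bias_embeddings[0])
    return [sum(w * emb[d] for w, emb in zip(weights, bias_embeddings))
            for d in range(dim)]
```

The resulting context vector leans toward whichever bias phrase the decoder state matches, nudging the decoder's likelihoods toward contextually relevant words.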