G10L2015/025

Recommending Results In Multiple Languages For Search Queries Based On User Profile
20230237098 · 2023-07-27

Systems and methods for a media guidance application that generates results in multiple languages for search queries. In particular, the media guidance application resolves multiple language barriers by taking automatic and manual user language settings and applying those settings to a variety of potential search results.

MINIMUM WORD ERROR RATE TRAINING FOR ATTENTION-BASED SEQUENCE-TO-SEQUENCE MODELS

Methods, systems, and apparatus, including computer programs encoded on computer-readable storage media, for speech recognition using attention-based sequence-to-sequence models. In some implementations, audio data indicating acoustic characteristics of an utterance is received. A sequence of feature vectors indicative of the acoustic characteristics of the utterance is generated. The sequence of feature vectors is processed using a speech recognition model that has been trained using a loss function that uses a set of speech recognition hypothesis samples, the speech recognition model including an encoder, an attention module, and a decoder. The encoder and decoder each include one or more recurrent neural network layers. A sequence of output vectors representing distributions over a predetermined set of linguistic units is obtained. A transcription for the utterance is obtained based on the sequence of output vectors. Data indicating the transcription of the utterance is provided.
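As a rough illustration of a loss that uses a set of speech recognition hypothesis samples, the sketch below computes an expected word error count over a small hypothesis list. The abstract does not give the exact form of the loss, so the renormalization over the sampled hypotheses and the names `word_errors` and `expected_word_error_loss` are assumptions made for illustration only.

```python
import math


def word_errors(hyp, ref):
    """Word-level edit distance between a hypothesis and a reference."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def expected_word_error_loss(hyps, log_probs, ref):
    """Expected number of word errors over the sampled hypotheses.

    Assumption: model scores are renormalized over the samples with a
    softmax; a real training setup may weight or combine terms differently.
    """
    m = max(log_probs)
    weights = [math.exp(lp - m) for lp in log_probs]
    total = sum(weights)
    probs = [w / total for w in weights]
    return sum(p * word_errors(h, ref) for p, h in zip(probs, hyps))
```

Shifting probability mass toward hypotheses with fewer word errors lowers this value, which is the behavior such a training objective is meant to encourage.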

DATA SORTING FOR GENERATING RNN-T MODELS
20230237987 · 2023-07-27

A computer-implemented method for preparing training data for a speech recognition model is provided, including obtaining a plurality of sentences from a corpus, dividing each phoneme in each sentence of the plurality of sentences into three hidden states, calculating, for each sentence of the plurality of sentences, a score based on a variation in duration of the three hidden states of each phoneme in the sentence, and sorting the plurality of sentences by the calculated scores.
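The scoring and sorting steps might look like the following minimal sketch. It assumes each sentence already comes with per-phoneme durations for the three hidden states (for example, frame counts from a forced alignment), uses the population standard deviation as the measure of variation, and sorts in ascending order. None of these choices is stated in the abstract, and the names `sentence_score` and `sort_sentences` are hypothetical.

```python
from statistics import mean, pstdev


def sentence_score(phoneme_state_durations):
    """Average, over phonemes, of the spread of the three state durations.

    Each element is a 3-tuple of durations, one per hidden state.
    """
    return mean(pstdev(states) for states in phoneme_state_durations)


def sort_sentences(corpus):
    """Sort (sentence, durations) pairs by score, lowest variation first."""
    return sorted(corpus, key=lambda item: sentence_score(item[1]))


corpus = [
    ("uneven timing", [(1, 5, 9), (2, 2, 2)]),
    ("even timing", [(3, 3, 3), (2, 2, 2)]),
]
ordered = sort_sentences(corpus)
```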

AUDIO CONFIGURATION SWITCHING IN VIRTUAL REALITY

Various aspects of the subject technology relate to systems, methods, and machine-readable media for communication in a shared artificial reality environment. Various aspects may include receiving an indication of artificial reality location information for a user. Aspects may also include determining an audio configuration for the user based on the artificial reality location information or an application. Aspects may also include determining a switch point for changing the audio configuration for audio between the user and another user, such as based on the location of the other user. Aspects may also include changing the audio configuration to another audio configuration based on the switch point. Aspects may include outputting audio based on the other audio configuration.

Segment-based speaker verification using dynamically generated phrases
11568879 · 2023-01-31

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for verifying an identity of a user. The methods, systems, and apparatus include actions of receiving a request for a verification phrase for verifying an identity of a user. Additional actions include, in response to receiving the request for the verification phrase for verifying the identity of the user, identifying subwords to be included in the verification phrase and, in response to identifying the subwords to be included in the verification phrase, obtaining a candidate phrase that includes at least some of the identified subwords as the verification phrase. Further actions include providing the verification phrase as a response to the request for the verification phrase for verifying the identity of the user.
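One simple way to picture the step of obtaining a candidate phrase that includes at least some of the identified subwords is a coverage-based choice, sketched below. The abstract does not say how candidates are ranked, so picking the phrase that covers the most wanted subwords, the dictionary layout, and the name `pick_verification_phrase` are all assumptions.

```python
def pick_verification_phrase(wanted_subwords, candidates):
    """Return the candidate phrase covering the most wanted subwords.

    `candidates` maps each phrase to the subwords it contains.
    Returns None when no candidate contains any wanted subword.
    """
    wanted = set(wanted_subwords)
    best_phrase, best_cover = None, 0
    for phrase, subwords in candidates.items():
        cover = len(wanted & set(subwords))
        if cover > best_cover:
            best_phrase, best_cover = phrase, cover
    return best_phrase


candidates = {
    "peanut butter": ["pea", "nut", "but", "ter"],
    "donut shop": ["do", "nut", "shop"],
}
phrase = pick_verification_phrase(["nut", "but", "ter"], candidates)
```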

Real time correction of accent in speech audio signals
11715457 · 2023-08-01

Systems and methods for real-time correction of an accent in a speech audio signal are provided. A method includes dividing the speech audio signal into a stream of input chunks, an input chunk from the stream of input chunks including a pre-defined number of frames of the speech audio signal; extracting, by an acoustic features extraction module from the input chunk and a context associated with the input chunk, acoustic features, the context being a pre-determined number of the frames preceding the input chunk in the stream; extracting, by a linguistic features extraction module from the input chunk and the context, linguistic features; receiving a speaker embedding for a human speaker; providing the speaker embedding, the acoustic features, and the linguistic features to a synthesis module to generate a melspectrogram with a reduced accent; and providing the melspectrogram to a vocoder to generate an output chunk of an output audio signal.
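The chunking step, where each input chunk is paired with a fixed number of preceding frames as context, can be sketched as follows. The function name `chunk_with_context` and the handling of the first chunk (shorter or empty context) and of the last chunk (possibly fewer frames) are assumptions. Feature extraction, synthesis, and vocoding are outside this sketch.

```python
def chunk_with_context(frames, chunk_size, context_size):
    """Split frames into chunks, each paired with its preceding context.

    Returns a list of (context, chunk) pairs. The context holds up to
    `context_size` frames that come immediately before the chunk.
    """
    chunks = []
    for start in range(0, len(frames), chunk_size):
        context = frames[max(0, start - context_size):start]
        chunk = frames[start:start + chunk_size]
        chunks.append((context, chunk))
    return chunks


# Ten dummy frames, four frames per chunk, two frames of context.
pairs = chunk_with_context(list(range(10)), chunk_size=4, context_size=2)
```

In a streaming setting the same logic would be applied incrementally as frames arrive, keeping only the last `context_size` frames in a buffer.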

INCREASING USER ENGAGEMENT THROUGH QUERY SUGGESTION
20230022515 · 2023-01-26

Systems and methods are presented herein for increasing user engagement with an interface by suggesting commands or queries for the user. A plurality of content items available for consumption are identified and metadata for each of the plurality of content items is retrieved. One or more candidate voice commands are generated from a plurality of voice command templates, based on a target verb and a subset of the metadata corresponding to the plurality of content items available for consumption. A recall score is generated for each candidate voice command based at least in part on a detection of phonetic features that match between clauses of each candidate voice command. At least the candidate voice command with the highest recall score is selected and output using a suggestion system.

Analysis of an automatically generated transcription
11562743 · 2023-01-24

There is provided a computer-implemented method of aligning an automatically generated transcription of an audio recording to a manually generated transcription of the audio recording, comprising: identifying non-aligned text fragments, each located between respective two non-continuous aligned text fragments of the automatically generated transcription, each aligned text fragment matching words of the manually generated transcription; and, for each respective non-aligned text fragment: mapping a target keyword of the manually generated transcription to phonemes; mapping the respective non-aligned text fragment to a corresponding audio fragment of the audio recording; mapping the audio fragment to phonemes; identifying at least some of the phonemes of the audio fragment that correspond to the phonemes of the target keyword; and mapping the identified phonemes of the audio fragment to a corresponding word of the automatically generated transcription, wherein the corresponding word is an incorrect automated transcription of the target keyword appearing in the manually generated transcription.
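The first step, finding non-aligned text fragments that sit between aligned fragments, can be approximated at the word level with Python's standard `difflib`. This is only a stand-in for whatever alignment the method actually uses, the name `non_aligned_fragments` is hypothetical, and the phoneme and audio mapping steps are not covered here.

```python
from difflib import SequenceMatcher


def non_aligned_fragments(auto_words, manual_words):
    """Return (auto_fragment, manual_fragment) pairs for mismatched spans
    that lie between two matching (aligned) spans."""
    matcher = SequenceMatcher(a=auto_words, b=manual_words, autojunk=False)
    opcodes = matcher.get_opcodes()
    fragments = []
    for k, (tag, a0, a1, b0, b1) in enumerate(opcodes):
        # Skip mismatches at the very start or end: they are not
        # enclosed by aligned fragments on both sides.
        if tag == "replace" and 0 < k < len(opcodes) - 1:
            fragments.append((auto_words[a0:a1], manual_words[b0:b1]))
    return fragments


auto = "the patient has a cute bronchitis today".split()
manual = "the patient has acute bronchitis today".split()
gaps = non_aligned_fragments(auto, manual)
```

Here the automatic words "a cute" are flagged against the manual keyword "acute", which is the kind of fragment the later phoneme-level steps would examine.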

SYSTEMS AND METHODS FOR GENERATING DISAMBIGUATED TERMS IN AUTOMATICALLY GENERATED TRANSCRIPTIONS INCLUDING INSTRUCTIONS WITHIN A PARTICULAR KNOWLEDGE DOMAIN
20230230579 · 2023-07-20

A system and method for generating disambiguated terms in automatically generated transcriptions including instructions within a knowledge domain, and for employing the system, are disclosed. Exemplary implementations may: obtain a set of transcripts representing various speech from users; obtain indications of correlated correct and incorrect transcriptions of spoken terms within the knowledge domain; obtain a vector generation model that generates vectors for individual instances of the transcribed terms in the set of transcripts that are part of the lexicography of the knowledge domain; use the vector generation model to generate the vectors such that a first set of vectors and a second set of vectors are generated that represent the instances of the first correctly transcribed term and the first incorrectly transcribed term, respectively; and train the vector generation model to reduce spatial separation of vectors generated for instances of correlated correct and incorrect transcriptions of spoken terms within the knowledge domain.

Stylizing text-to-speech (TTS) voice response for assistant systems

In one embodiment, a method includes receiving a voice input from a user and determining a first style of the voice input, based on first features extracted from the voice input. A second style for a voice response having second features may then be determined based on the first style. Finally, the voice response may be generated based on the second features of the second style, and this voice response may be provided in response to the voice input.