G10L15/083

ELECTRONIC APPARATUS AND METHOD OF CONTROLLING THE SAME
20220189478 · 2022-06-16

An electronic apparatus includes an interface to communicate with an external apparatus and a processor configured to identify a command for the external apparatus based on a second audio signal received after a first audio signal identified as corresponding to a trigger command for the external apparatus, identify a state of the external apparatus based on whether the identified command is capable of being performed by the external apparatus, and transmit information, which corresponds to a function to be performed by the external apparatus in response to the identified command, to the external apparatus through the interface based on the identified state of the external apparatus.
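
A minimal sketch of the control flow the abstract describes, assuming a simple keyword match for the trigger; the ExternalApparatus class, the handle_audio function, and the "hey tv" trigger are hypothetical stand-ins, not the patent's implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExternalApparatus:
    name: str
    supported_commands: set
    powered_on: bool = False

def handle_audio(first_signal: str, second_signal: str, device: ExternalApparatus,
                 trigger_word: str = "hey tv") -> Optional[dict]:
    """Identify a command after a trigger and forward it only if the device can run it."""
    # 1. The first audio signal must correspond to the trigger command.
    if first_signal.strip().lower() != trigger_word:
        return None
    # 2. The second audio signal carries the actual command for the external apparatus.
    command = second_signal.strip().lower()
    # 3. Identify the device's state via whether it can perform the identified command.
    capable = command in device.supported_commands
    state = "ready" if capable and device.powered_on else "unavailable"
    # 4. Transmit function information only when the identified state allows it.
    if state == "ready":
        return {"target": device.name, "function": command}
    return None

tv = ExternalApparatus("living-room-tv", {"volume up", "mute"}, powered_on=True)
print(handle_audio("hey tv", "volume up", tv))    # forwarded to the external apparatus
print(handle_audio("hey tv", "open garage", tv))  # None: device cannot perform it
```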

Conditional wake word eventing based on environment
11361756 · 2022-06-14

In one aspect, a playback device includes at least one microphone configured to detect sound. The playback device detects sound via the at least one microphone and determines whether (i) the detected sound includes a voice input, (ii) the detected sound excludes background speech, and (iii) the voice input includes a command keyword. In response to the determining, the playback device performs a playback function corresponding to the command keyword.
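
A rough sketch of the three-part gate before a command keyword triggers playback; the detector helpers (is_voice_input, has_background_speech, find_command_keyword) are hypothetical placeholders for real voice-activity and keyword-spotting models.

```python
from typing import Optional

def is_voice_input(sound: dict) -> bool:
    return sound.get("has_speech", False)      # placeholder for a voice-activity detector

def has_background_speech(sound: dict) -> bool:
    return sound.get("background_speech", False)

def find_command_keyword(sound: dict, keywords=("play", "pause", "skip")) -> Optional[str]:
    for word in sound.get("words", []):        # placeholder for keyword spotting
        if word in keywords:
            return word
    return None

def on_sound_detected(sound: dict) -> Optional[str]:
    """Perform a playback function only if all three conditions hold."""
    if not is_voice_input(sound):
        return None                            # (i) must include a voice input
    if has_background_speech(sound):
        return None                            # (ii) must exclude background speech
    keyword = find_command_keyword(sound)
    if keyword is None:
        return None                            # (iii) must include a command keyword
    return f"playback_function:{keyword}"

print(on_sound_detected({"has_speech": True, "background_speech": False,
                         "words": ["please", "pause"]}))
```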

Streaming Action Fulfillment Based on Partial Hypotheses
20220180868 · 2022-06-09

A method for streaming action fulfillment receives audio data corresponding to an utterance, where the utterance includes a query to perform an action that requires performance of a sequence of sub-actions in order to fulfill the action. While receiving the audio data, but before receiving an end of speech condition, the method processes the audio data to generate intermediate automated speech recognition (ASR) results and performs partial query interpretation on the intermediate ASR results to determine whether they identify an application type needed to perform the action. When the intermediate ASR results identify a particular application type, the method performs a first sub-action in the sequence of sub-actions by launching a first application, associated with that application type, to execute on the user device. In response to receiving an end of speech condition, the method fulfills performance of the action.
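
A simplified sketch of the streaming idea, assuming partial hypotheses arrive as text: the application type is inferred from intermediate results and the app is launched before end of speech. The APP_TYPES table and the launch_app helper are invented for illustration.

```python
from typing import Optional

APP_TYPES = {"text": "messaging", "message": "messaging", "navigate": "maps"}

def partial_interpretation(partial_text: str) -> Optional[str]:
    """Return an application type if the partial hypothesis already implies one."""
    for word, app_type in APP_TYPES.items():
        if word in partial_text.lower():
            return app_type
    return None

def launch_app(app_type: str) -> str:
    return f"{app_type}-app"                    # stand-in for starting the app on the device

def stream_fulfillment(partial_hypotheses, final_text):
    launched = None
    # While audio is still arriving, act on intermediate ASR results.
    for hypothesis in partial_hypotheses:
        app_type = partial_interpretation(hypothesis)
        if app_type and launched is None:
            launched = launch_app(app_type)     # first sub-action, performed early
    # End-of-speech condition: complete the remaining sub-actions.
    return {"app": launched, "fulfilled_query": final_text}

print(stream_fulfillment(
    ["send a", "send a text to", "send a text to alex saying hi"],
    "send a text to alex saying hi"))
```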

Dialog system with automatic reactivation of speech acquiring mode

Embodiments of the disclosure generally relate to a dialog system that can automatically reactivate a speech acquiring mode after the dialog system delivers a response to a user request. The reactivation parameters, such as a delay, depend on a number of predetermined factors and conversation scenarios. The embodiments further provide a method of operating the dialog system. An exemplary method comprises the steps of: activating a speech acquiring mode, receiving a first input of a user, deactivating the speech acquiring mode, obtaining a first response associated with the first input, delivering the first response to the user, determining that a conversation mode is activated, and, based on the determination, automatically reactivating the speech acquiring mode within a first predetermined time period after delivery of the first response to the user.
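
A minimal sketch of the reactivation loop, assuming a fixed delay and a canned response lookup; the DialogSystem class and its timing are illustrative, not the patent's actual parameters.

```python
import time

class DialogSystem:
    def __init__(self, conversation_mode: bool, reactivation_delay: float = 0.1):
        self.conversation_mode = conversation_mode
        self.reactivation_delay = reactivation_delay
        self.listening = False

    def activate_listening(self):
        self.listening = True

    def deactivate_listening(self):
        self.listening = False

    def respond(self, user_input: str) -> str:
        return f"response to '{user_input}'"     # stand-in for the real dialog backend

    def handle_turn(self, user_input: str) -> str:
        self.activate_listening()                # speech acquiring mode on
        self.deactivate_listening()              # off once the input is captured
        response = self.respond(user_input)      # obtain and deliver the first response
        if self.conversation_mode:               # only re-open the microphone in conversation mode
            time.sleep(self.reactivation_delay)  # first predetermined time period
            self.activate_listening()
        return response

system = DialogSystem(conversation_mode=True)
print(system.handle_turn("what's the weather?"), "| listening again:", system.listening)
```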

Speech-processing system

A system may include first and second speech-processing systems with corresponding first and second wakewords. An utterance may contain two or more wakewords. The system determines which speech-processing system to use to perform further audio processing and to determine a response to the utterance.
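
A toy sketch of wakeword-based routing between two speech-processing systems; the wakewords and the earliest-match rule are assumptions made for illustration.

```python
from typing import Optional

WAKEWORDS = {"alexa": "system_a", "computer": "system_b"}   # two speech-processing systems

def route_utterance(utterance: str) -> Optional[str]:
    """Pick the speech-processing system whose wakeword appears first in the utterance."""
    for token in utterance.lower().split():
        word = token.strip(",.?!")
        if word in WAKEWORDS:                   # earliest wakeword wins in this sketch
            return WAKEWORDS[word]
    return None

print(route_utterance("Computer, ask Alexa to play jazz"))  # routed to system_b
```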

A METHOD AND SYSTEM FOR CONTENT INTERNATIONALIZATION & LOCALISATION
20220172709 · 2022-06-02

A method of processing a video file to generate a modified video file, the modified video file including translated audio content of the video file, the method comprising: receiving the video file; accessing a facial model or a speech model for a specific speaker, wherein the facial model maps speech to facial expressions and the speech model maps text to speech; receiving reference content of the video file for the specific speaker; generating modified audio content for the specific speaker and/or a modified facial expression for the specific speaker; and modifying the video file in accordance with the modified audio content and/or the modified facial expression to generate the modified video file.
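
A sketch of the pipeline the claim outlines, with every model call replaced by a hypothetical placeholder for the speaker-specific speech and facial models.

```python
def translate(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"                 # stand-in for machine translation

def speech_model(text: str, speaker_id: str) -> bytes:
    return f"audio({speaker_id}:{text})".encode()    # stand-in for speaker-specific TTS

def facial_model(audio: bytes, speaker_id: str) -> list:
    return [f"{speaker_id} viseme frame for {len(audio)} audio bytes"]  # stand-in for lip-sync

def localize_video(video: dict, reference_text: str, speaker_id: str, target_lang: str) -> dict:
    translated = translate(reference_text, target_lang)
    new_audio = speech_model(translated, speaker_id)      # modified audio content
    new_faces = facial_model(new_audio, speaker_id)       # modified facial expressions
    return {**video, "audio": new_audio, "frames": new_faces}

print(localize_video({"title": "demo.mp4"}, "Hello everyone", "speaker-1", "fr"))
```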

PHONEME-BASED CONTEXTUALIZATION FOR CROSS-LINGUAL SPEECH RECOGNITION IN END-TO-END MODELS

A method includes receiving audio data encoding an utterance spoken by a native speaker of a first language, and receiving a biasing term list including one or more terms in a second language different than the first language. The method also includes processing, using a speech recognition model, acoustic features derived from the audio data to generate speech recognition scores for both wordpieces and corresponding phoneme sequences in the first language. The method also includes rescoring the speech recognition scores for the phoneme sequences based on the one or more terms in the biasing term list, and executing, using the speech recognition scores for the wordpieces and the rescored speech recognition scores for the phoneme sequences, a decoding graph to generate a transcription for the utterance.
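
A toy illustration of the rescoring step: phoneme-sequence hypotheses that match a term in the biasing list receive a score boost before decoding. The phoneme inventory and boost value are made up for the example.

```python
BIAS_TERMS = {"zurich": ["ts", "y", "r", "i", "c"]}   # terms in the second language
BOOST = 2.0                                           # made-up score boost

def rescore_phoneme_hypotheses(hypotheses):
    """hypotheses: list of (phoneme_sequence, score); boost sequences matching a bias term."""
    rescored = []
    for phonemes, score in hypotheses:
        matches_bias = any(phonemes == seq for seq in BIAS_TERMS.values())
        rescored.append((phonemes, score + BOOST if matches_bias else score))
    return rescored

hyps = [(["ts", "y", "r", "i", "c"], 1.3), (["z", "u", "r", "i", "k"], 1.5)]
print(rescore_phoneme_hypotheses(hyps))   # the biased hypothesis now outranks the other
```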

Systems and Methods for Automatic Candidate Assessments in an Asynchronous Video Setting

In an illustrative embodiment, systems and methods for automating recorded candidate assessments include receiving a submission for an available position including a question response recording for each of one or more interview questions. For each question response recording, a transcript can be generated by applying a speech-to-text algorithm to an audio portion of the recording. The systems and methods can detect, within the transcript, identifiers each associated with a personality aspect of a personality model by applying a natural language classifier trained to detect words and phrases associated with the personality aspects of the model. Scores may be calculated for each of the personality aspects based on the relevance of the respective personality aspect to the respective interview question and on the detected identifiers. The scores can be presented within a user interface screen responsive to receiving a request to view interview results.
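
A minimal sketch of the scoring step, assuming the classifier reduces to keyword matching: identifiers detected per personality aspect are weighted by the aspect's relevance to the question. The keyword sets and weights are illustrative only.

```python
ASPECT_KEYWORDS = {
    "teamwork":   {"we", "together", "collaborated"},
    "initiative": {"started", "proposed", "led"},
}

def score_response(transcript: str, relevance: dict) -> dict:
    words = set(transcript.lower().split())
    scores = {}
    for aspect, keywords in ASPECT_KEYWORDS.items():
        hits = len(words & keywords)                      # detected identifiers for the aspect
        scores[aspect] = hits * relevance.get(aspect, 0)  # weighted by relevance to the question
    return scores

print(score_response("We collaborated and I proposed the fix",
                     {"teamwork": 0.7, "initiative": 0.3}))
```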

Methods and apparatus to determine audio source impact on an audience of media

Methods, apparatus, systems and articles of manufacture to determine audio source impact on an audience of media are disclosed. A disclosed example method includes dividing monitored audio into successive audio segments including a first audio segment and a second audio segment. The example method also includes generating a first confidence value from the first audio segment and a second confidence value from the second audio segment, the first confidence value associated with a presence of a first audio source in the first audio segment, the second confidence value associated with a presence of the first audio source in the second audio segment. The example method includes identifying whether the monitored audio is associated with a presentation of first media based on the first confidence value and the second confidence value.
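
A sketch of the segment-and-score idea: the audio is split into fixed windows, each scored for the presence of a source, and the media association is decided from those confidences. The scoring function and threshold are placeholders, not the disclosed detector.

```python
def split_segments(samples, segment_len):
    return [samples[i:i + segment_len] for i in range(0, len(samples), segment_len)]

def source_confidence(segment, source_signature) -> float:
    # stand-in for a real detector: fraction of samples matching the signature
    return sum(1 for s in segment if s == source_signature) / max(len(segment), 1)

def is_media_present(samples, source_signature, segment_len=4, threshold=0.5) -> bool:
    confidences = [source_confidence(seg, source_signature)
                   for seg in split_segments(samples, segment_len)]
    # associate the audio with the media when the segments are confident enough on average
    return sum(confidences) / len(confidences) >= threshold

print(is_media_present([1, 1, 0, 1, 1, 1, 1, 0], source_signature=1))  # True
```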

GENERATION OF INTERACTIVE AUDIO TRACKS FROM VISUAL CONTENT
20220157300 · 2022-05-19

Generating audio tracks is provided. The system selects a digital component object having a visual output format. The system determines to convert the digital component object into an audio output format. The system generates text for the digital component object. The system selects, based on context of the digital component object, a digital voice to render the text. The system constructs a baseline audio track of the digital component object with the text rendered by the digital voice. The system generates, based on the digital component object, non-spoken audio cues. The system combines the non-spoken audio cues with the baseline audio track of the digital component object to generate an audio track of the digital component object. The system provides the audio track of the digital component object to a computing device for output via a speaker of the computing device.
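
A rough sketch of the conversion pipeline, assuming text generation, voice selection, and cue generation reduce to simple rules; the function names and cue strings are invented for illustration.

```python
def generate_text(component: dict) -> str:
    return component.get("headline", "") + ". " + component.get("body", "")

def select_voice(context: str) -> str:
    return "calm_voice" if context == "relaxation" else "upbeat_voice"

def render_tts(text: str, voice: str) -> list:
    return [f"{voice}:{text}"]                      # baseline audio track (stand-in for TTS)

def non_spoken_cues(component: dict) -> list:
    return ["intro_chime"] if component.get("interactive") else []   # non-spoken audio cues

def build_audio_track(component: dict) -> list:
    text = generate_text(component)
    voice = select_voice(component.get("context", ""))
    baseline = render_tts(text, voice)
    return non_spoken_cues(component) + baseline    # combine cues with the baseline track

print(build_audio_track({"headline": "Spring sale", "body": "Ends Friday",
                         "context": "relaxation", "interactive": True}))
```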