Patent classifications
G10L15/005
REAL-TIME SPEECH-TO-SPEECH GENERATION (RSSG) AND SIGN LANGUAGE CONVERSION APPARATUS, METHOD AND A SYSTEM THEREFORE
A real-time speech-to-speech generator and sign gesture converter system is disclosed. Existing systems remain challenging for deaf or hearing-impaired people. Embodiments of the invention provide a direct speech-to-speech translation system with further conversion to sign gestures. Direct speech-to-speech translation and further sign gesture conversion use a one-tier approach, creating a unified model for the whole application. The single-model ecosystem takes in audio (MEL spectrogram) as input and gives out audio (MEL spectrogram) as output to a speech-sign converter device with a display. This solves the bottleneck problem by converting the translated speech directly to sign language gestures from the first language, with emotion, by preserving phonetic information along the way. This model needs parallel audio samples in two languages. The training methodology involves augmenting or changing both sides of the audio equally; the output is later converted to sign gestures, which are displayed on the speech-sign converter device.
AUTOMATED CALL REQUESTS WITH STATUS UPDATES
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, relating to synthetic call status updates. In some implementations, a method includes determining, by a task manager module, that a triggering event has occurred to provide a current status of a user call request. The method may then determine, by the task manager module, the current status of the user call request. A representation of the current status of the user call request is generated. Then, the generated representation of the current status of the user call request is provided to the user.
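The four steps in this abstract (detect a triggering event, determine the status, generate a representation, provide it to the user) can be sketched as follows. This is only an illustrative sketch: the `CallRequest` fields, the two trigger conditions (an explicit user query or a period without updates), and the `max_silence` default are assumptions that do not come from the abstract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallRequest:
    """Hypothetical record of a user call request tracked by a task manager."""
    task: str
    status: str
    seconds_since_update: float


def triggering_event(request: CallRequest, user_asked: bool,
                     max_silence: float = 600.0) -> bool:
    # Assumed triggers: the user asked for the status, or too much time
    # has passed since the last update.
    return user_asked or request.seconds_since_update >= max_silence


def status_update(request: CallRequest, user_asked: bool = False) -> Optional[str]:
    """Return a text representation of the current status, or None if no trigger fired."""
    if not triggering_event(request, user_asked):
        return None
    return f"Your request to {request.task} is currently {request.status}."
```

For example, `status_update(CallRequest("book a table", "in progress", 30.0), user_asked=True)` yields a status sentence, while the same request without a user query and within the silence window yields `None`.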
DYNAMIC LANGUAGE SELECTION OF AN AI VOICE ASSISTANCE SYSTEM
The computer-implemented method provides for a digital virtual assistant (DVA) receiving input spoken in a first language by a user. The DVA determines a context of a current situation based on language and identity of individuals within a proximity of the DVA. The DVA determines whether the context of the current situation includes providing a response using a second language. In response to determining the context of the current situation calls for providing the response in the second language, the DVA determines the second language based on the context, and the DVA responds to the input spoken in the first language by the user, such that the response includes a dynamic selection of the second language and is based on an interaction context of the user and the DVA, and reference to a corpus of interaction context usage of the second language in a historically similar situation.
METHOD AND APPARATUS FOR MULTILINGUAL SPEECH RECOGNITION BASED ON ARTIFICIAL INTELLIGENCE MODELS
A method for automatic multilingual speech recognition may comprise: recognizing input audio data by a speech recognizer; classifying the audio data by a speech language classifier; activating, by an output layer selector coupled to the speech recognizer, any one projection output layer of a plurality of projection output layers respectively connected to the speech recognizer according to language classification information received from the speech language classifier, an output unit of the activated projection output layer being configured as several bytes; and recombining the outputs produced in units of the several bytes by the activated projection output layer, and outputting the recombined output as an automatic speech recognition result for the audio data.
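The selection and recombination steps described above can be illustrated with a toy sketch. It assumes that each projection output layer can be modelled as a function returning fixed-size byte chunks of UTF-8 text. The names `make_fixed_layer` and `recognize`, and the stub classifier used in the example, are hypothetical and stand in for trained models.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in: a projection output layer maps audio data to byte chunks.
ProjectionLayer = Callable[[bytes], List[bytes]]


def make_fixed_layer(text: str, unit: int) -> ProjectionLayer:
    """Build a stub layer that always emits `text`, split into `unit`-byte chunks."""
    encoded = text.encode("utf-8")

    def layer(_audio: bytes) -> List[bytes]:
        return [encoded[i:i + unit] for i in range(0, len(encoded), unit)]

    return layer


def recognize(audio: bytes,
              classify: Callable[[bytes], str],
              layers: Dict[str, ProjectionLayer]) -> str:
    language = classify(audio)    # language classification information
    layer = layers[language]      # activate exactly one projection output layer
    chunks = layer(audio)         # outputs arrive in units of several bytes
    # Recombine before decoding: a chunk may end in the middle of a character.
    return b"".join(chunks).decode("utf-8")
```

Decoding only after the chunks are joined matters because a multi-byte character may be split across two output units.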
VOICE-BASED CONTROL OF SEXUAL STIMULATION DEVICES
A system and method for voice-based control of sexual stimulation devices. In some configurations, the system and method involve receiving voice data, analyzing the voice data to detect spoken commands, and generating control signals based on the commands. In some configurations, the system and method involve receiving voice data, analyzing the voice data for non-speech vocalizations, detecting voice stress patterns, and generating control signals based on the detected patterns. In some configurations, the analyses of the voice data are performed by machine learning algorithms which may be trained on associations between speech and non-speech vocalizations of a user while the user engages in one or more voice-based training tasks, associating speech and non-speech vocalizations with controls of the sexual stimulation device. In some configurations, machine learning algorithms are used to make the associations. In some configurations, data from other biometric sensors is included in the associations.
COMPUTER SYSTEMS EXHIBITING IMPROVED COMPUTER SPEED AND TRANSCRIPTION ACCURACY OF AUTOMATIC SPEECH TRANSCRIPTION (AST) BASED ON A MULTIPLE SPEECH-TO-TEXT ENGINES AND METHODS OF USE THEREOF
In some embodiments, an exemplary inventive system for improving computer speed and accuracy of automatic speech transcription includes at least components of: a computer processor configured to perform: generating a recognition model specification for a plurality of distinct speech-to-text transcription engines, where each distinct speech-to-text transcription engine corresponds to a respective distinct speech recognition model; receiving at least one audio recording representing a speech of a person; segmenting the audio recording into a plurality of audio segments; determining a respective distinct speech-to-text transcription engine to transcribe a respective audio segment; receiving, from the respective transcription engine, a hypothesis for the respective audio segment; accepting the hypothesis to remove a need to submit the respective audio segment to another distinct speech-to-text transcription engine, resulting in the improved computer speed and accuracy of automatic speech transcription; and generating a transcript of the audio recording from respective accepted hypotheses for the plurality of audio segments.
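A minimal sketch of the per-segment accept-or-fall-through idea follows. The abstract does not say how a hypothesis is accepted; the confidence threshold used here is an assumption, as are the `Hypothesis` type, the `transcribe` function, and the use of strings as stand-ins for audio segments.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Hypothesis:
    text: str
    confidence: float


# Hypothetical engine interface: takes one audio segment, returns a hypothesis.
Engine = Callable[[str], Hypothesis]


def transcribe(segments: Sequence[str], engines: Sequence[Engine],
               threshold: float = 0.8) -> Tuple[str, int]:
    """Return the transcript and the number of engine calls that were made."""
    if not engines:
        raise ValueError("at least one engine is required")
    accepted = []
    calls = 0
    for segment in segments:
        best = None
        for engine in engines:
            hypothesis = engine(segment)
            calls += 1
            if best is None or hypothesis.confidence > best.confidence:
                best = hypothesis
            if hypothesis.confidence >= threshold:
                break  # accepted: no need to submit the segment to another engine
        accepted.append(best.text)
    return " ".join(accepted), calls
```

The call count shows where the speed gain would come from: a segment whose first hypothesis is accepted costs one engine call instead of one per engine.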
VIDEO HIGHLIGHT EXTRACTION METHOD AND SYSTEM, AND STORAGE MEDIUM
The present disclosure relates to a video highlight extraction method and system, and a storage medium. The method includes: obtaining a to-be-processed online class video and a teacher-student interaction feature and dividing the to-be-processed online class video into a plurality of target videos; respectively performing analysis on pictures corresponding to all frames of a target video, to obtain a visual feature set of a student and a visual feature set of a teacher in the pictures corresponding to the frames; determining timeliness of student feedback; performing speech recognition on the speech segment corresponding to the student and the speech segment corresponding to the teacher and extracting a key word, to determine fluency of language of the teacher, fluency of language of the student, and correctness of teaching knowledge; and determining a highlight in the to-be-processed online class video according to priorities of the target videos.
Dynamic language and command recognition
Systems and methods are described for processing and interpreting audible commands spoken in one or more languages. Speech recognition systems disclosed herein may be used as a stand-alone speech recognition system or comprise a portion of another content consumption system. A requesting user may provide audio input (e.g., command data) to the speech recognition system via a computing device to request an entertainment system to perform one or more operational commands. The speech recognition system may analyze the audio input across a variety of linguistic models, and may parse the audio input to identify a plurality of phrases and corresponding action classifiers. In some embodiments, the speech recognition system may utilize the action classifiers and other information to determine the one or more identified phrases that appropriately match the desired intent and operational command associated with the user's spoken command.
Large-scale multilingual speech recognition with a streaming end-to-end model
A method of transcribing speech using a multilingual end-to-end (E2E) speech recognition model includes receiving audio data for an utterance spoken in a particular native language, obtaining a language vector identifying the particular language, and processing, using the multilingual E2E speech recognition model, the language vector and acoustic features derived from the audio data to generate a transcription for the utterance. The multilingual E2E speech recognition model includes a plurality of language-specific adaptor modules that include one or more adaptor modules specific to the particular native language and one or more other adaptor modules specific to at least one other native language different than the particular native language. The method also includes providing the transcription for output.
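The roles of the language vector and the language-specific adaptor modules can be shown with a toy sketch. It assumes a one-hot language vector and a simple residual scaling as the adaptor. The language list, the scale values, and the `encode` function are illustrative assumptions, not the model described in the abstract.

```python
from typing import Callable, Dict, List

LANGUAGES = ["en", "hi", "fr"]  # assumed language inventory


def language_vector(language: str) -> List[float]:
    """One-hot vector identifying the language of the utterance."""
    return [1.0 if language == name else 0.0 for name in LANGUAGES]


def make_adapter(scale: float) -> Callable[[List[float]], List[float]]:
    """Toy residual adaptor: output = x + scale * x."""
    def adapter(features: List[float]) -> List[float]:
        return [x + scale * x for x in features]
    return adapter


# One adaptor module per language; a trained model would learn these.
ADAPTERS: Dict[str, Callable[[List[float]], List[float]]] = {
    "en": make_adapter(0.0),
    "hi": make_adapter(0.5),
    "fr": make_adapter(1.0),
}


def encode(features: List[float], lang_vec: List[float]) -> List[float]:
    # Route the acoustic features through the adaptor selected by the
    # language vector, then append the vector so later layers can see it.
    language = LANGUAGES[lang_vec.index(1.0)]
    return ADAPTERS[language](features) + lang_vec
```

Only the adaptor for the identified language touches the features, so adaptors for other languages can coexist in the same model without interfering.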
Systems, methods, and apparatus for language acquisition using socio-neurocognitive techniques
Provided are various mechanisms and processes for language acquisition using socio-neurocognitive techniques.