Patent classifications
G10L15/04
Systems and methods for automatic speech translation
A method for providing automatic interpretation may include: receiving, by a processor, audible speech from a speech source; generating, by the processor, in real time, a speech transcript by applying an automatic speech recognition model to the speech; segmenting, by the processor, the speech transcript into speech segments based on the content of the speech by applying a segmenter model to the speech transcript; compressing, by the processor, the speech segments based on the content of the speech by applying a compressor model to the speech segments; generating, by the processor, a translation of the speech by applying a machine translation model to the compressed speech segments; and generating, by the processor, audible translated speech based on the translation of the speech by applying a text-to-speech model to the translation.
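A minimal sketch of the claimed five-stage pipeline is shown below, with each stage as an opaque callable; the parameter names (asr_model, segmenter_model, and so on) are hypothetical stand-ins, not the patent's actual implementation.

```python
# Sketch only: each stage is a black-box model supplied by the caller.
from typing import Callable, List

def automatic_interpretation(
    audio: bytes,
    asr_model: Callable[[bytes], str],
    segmenter_model: Callable[[str], List[str]],
    compressor_model: Callable[[List[str]], List[str]],
    mt_model: Callable[[List[str]], str],
    tts_model: Callable[[str], bytes],
) -> bytes:
    """Run the five claimed stages in order and return translated audio."""
    transcript = asr_model(audio)             # speech -> text, in real time
    segments = segmenter_model(transcript)    # content-based segmentation
    compressed = compressor_model(segments)   # content-based compression
    translation = mt_model(compressed)        # translate compressed segments
    return tts_model(translation)             # text -> audible translated speech
```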
Online Interview Method and System
One or more embodiments of the present specification relate to an online interview method and system. The method includes: establishing a communication connection between an interviewing terminal and an interviewed terminal over a network; and acquiring communication information exchanged between the interviewing terminal and the interviewed terminal. The communication information includes one or more of audio information, video information, and text information. The interviewing terminal includes one or more of a first host terminal, a second host terminal, and a text processing terminal. The first host terminal is configured to host the interview and displays an interview outline and/or information about the interviewed terminal. The second host terminal is configured to host the interview and/or participate in discussion of the interview questions. The text processing terminal converts the audio information and/or the audio in the video information into corresponding text information.
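As a rough illustration of the terminal roles above, the sketch below models the session and terminals as plain Python objects; all class names, methods, and the placeholder transcription are assumptions, not details from the specification.

```python
# Illustrative sketch only: real terminals would communicate over a network.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Message:
    kind: str        # "audio", "video", or "text"
    payload: bytes

@dataclass
class InterviewSession:
    """Connection between the interviewing and interviewed terminals."""
    log: List[Message] = field(default_factory=list)

    def acquire(self, msg: Message) -> None:
        # "Acquiring communication information": record each exchanged message.
        self.log.append(msg)

class FirstHostTerminal:
    """Hosts the interview; displays the outline and interviewee details."""
    def display(self, outline: str, interviewee_info: str) -> str:
        return f"{outline}\n{interviewee_info}"

class TextProcessingTerminal:
    """Converts audio (standalone or extracted from video) into text."""
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8", errors="replace")  # stand-in for real ASR
```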
RAPID GENERATION OF VISUAL CONTENT FROM AUDIO
A video is generated from an audio file by transcribing the audio file into text and breaking the audio file into one or more segments, or shots, used as scenes. A media piece is then matched to each shot; the media pieces are contextualized based on the text or attributes of the audio associated with the shot, the overall script or theme, an intended audience, or other factors. The resulting video is created by stitching the media pieces together.
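The flow can be sketched as below, assuming opaque helpers for transcription, shot detection, media matching, and stitching; every function name here is a hypothetical placeholder rather than the disclosed implementation.

```python
# Sketch only: the helpers are supplied by the caller, not defined here.
from typing import Callable, List, Tuple

Shot = Tuple[float, float, str]   # (start_sec, end_sec, transcript_text)

def audio_to_video(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    split_into_shots: Callable[[bytes, str], List[Shot]],
    match_media: Callable[[Shot, str], bytes],   # shot + theme -> media piece
    stitch: Callable[[List[bytes]], bytes],
    theme: str = "",
) -> bytes:
    transcript = transcribe(audio)                  # audio -> text
    shots = split_into_shots(audio, transcript)     # segments used as scenes
    media = [match_media(s, theme) for s in shots]  # contextualized per shot
    return stitch(media)                            # assemble the final video
```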
VOICE TRANSLATION AND VIDEO MANIPULATION SYSTEM
A communication modification system includes an audio gathering unit that gathers an audio stream and a language detection unit that converts the audio stream into text. The language detection unit correlates portions of the text with portions of the audio stream, and determines first and second deviations in an audio stream portion based on the corresponding text portion and the audio portion gathered by the audio gathering unit.
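The abstract leaves the nature of the two deviations open; the sketch below assumes, purely for illustration, that each deviation is a mismatch score computed from an aligned (text, audio) pair, with both metrics invented for the example.

```python
# Sketch only: the deviation metrics here are assumptions, not the patent's.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AlignedPortion:
    text: str
    audio: bytes
    start_sec: float
    end_sec: float

def find_deviations(portions: List[AlignedPortion]) -> List[Tuple[int, float, float]]:
    """Return (index, first_deviation, second_deviation) per aligned portion."""
    results = []
    for i, p in enumerate(portions):
        span = p.end_sec - p.start_sec
        # Hypothetical metrics: timing deviation vs. an assumed speaking rate,
        # and byte-length deviation vs. the aligned time span (assuming
        # 16 kHz, 16-bit mono audio, i.e. 32,000 bytes per second).
        expected_duration = len(p.text.split()) * 0.4
        first = abs(span - expected_duration)
        second = abs(len(p.audio) - 32000 * span)
        results.append((i, first, second))
    return results
```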
AUDIO FILTER EFFECTS VIA SPATIAL TRANSFORMATIONS
An audio system of a client device applies transformations to audio received over a computer network. The transformations (e.g., HRTFs) effect changes in apparent source positions of the received audio, or of segments thereof. Such transformations may be used to achieve “animation” of audio, in which the source positions of the audio or audio segments appear to change over time (e.g., circling around the listener). Additionally, segmentation of audio into distinct semantic audio segments, and application of separate transformations for each audio segment, can be used to intuitively differentiate the different audio segments by causing them to sound as if they emanated from different positions around the listener.
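One way to picture the "animation" effect is the sketch below: a real system would convolve the audio with measured HRTFs, but here a constant-power stereo pan stands in for the spatial transformation, an assumption made only to keep the example self-contained.

```python
# Sketch only: a pan law approximates the apparent source position rotating
# around the listener; real spatialization would use HRTF convolution.
import math
from typing import List, Tuple

def circle_pan(samples: List[float], sample_rate: int,
               revolutions_per_sec: float = 0.25) -> List[Tuple[float, float]]:
    """Map a mono signal to (left, right) pairs whose apparent azimuth rotates."""
    out = []
    for n, x in enumerate(samples):
        azimuth = 2 * math.pi * revolutions_per_sec * n / sample_rate
        # Constant-power pan law: the position swings with sin(azimuth).
        pan = (math.sin(azimuth) + 1) / 2   # 0 = hard left, 1 = hard right
        left = x * math.cos(pan * math.pi / 2)
        right = x * math.sin(pan * math.pi / 2)
        out.append((left, right))
    return out
```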
Networked devices, systems, and methods for intelligently deactivating wake-word engines
In one aspect, a playback device is configured to identify in an audio stream, via a second wake-word engine, a false wake word for a first wake-word engine that is configured to receive as input sound data based on sound detected by a microphone. The first and second wake-word engines are configured according to different sensitivity levels for false positives of a particular wake word. Based on identifying the false wake word, the playback device is configured to (i) deactivate the first wake-word engine and (ii) cause at least one network microphone device to deactivate a wake-word engine for a particular amount of time. While the first wake-word engine is deactivated, the playback device is configured to cause at least one speaker to output audio based on the audio stream. After a predetermined amount of time has elapsed, the playback device is configured to reactivate the first wake-word engine.
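A behavioral sketch of the deactivation logic follows, assuming each engine is exposed as a detect predicate and the device tracks a reactivation deadline; the names and structure are illustrative, not drawn from the patent.

```python
# Sketch only: engine predicates and timing values are invented placeholders.
import time
from typing import Callable

class PlaybackDevice:
    def __init__(self, mic_engine: Callable[[bytes], bool],
                 stream_engine: Callable[[bytes], bool],
                 deactivation_secs: float = 5.0):
        self.mic_engine = mic_engine        # first engine: microphone sound
        self.stream_engine = stream_engine  # second engine: scans the stream
        self.deactivation_secs = deactivation_secs
        self.reactivate_at = 0.0            # first engine starts active

    def on_stream_chunk(self, chunk: bytes) -> None:
        """Scan outgoing audio; a false wake word suspends mic detection."""
        if self.stream_engine(chunk):
            # Deactivate the first engine (and, per the claim, tell networked
            # microphone devices to do the same) before the audio is played.
            self.reactivate_at = time.monotonic() + self.deactivation_secs
        self.play(chunk)                    # output continues while deactivated

    def on_mic_sound(self, sound: bytes) -> bool:
        """Run the first engine only once the deactivation window has elapsed."""
        if time.monotonic() < self.reactivate_at:
            return False                    # deactivated: ignore wake words
        return self.mic_engine(sound)

    def play(self, chunk: bytes) -> None:
        pass                                # hand off to at least one speaker
```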