Patent classifications
G06F16/7834
System and method for using multimedia content as search queries
There is provided a method for searching a plurality of information sources using a multimedia element. The method may include receiving at least one multimedia element; generating, by a signature generator, for the at least one multimedia element at least one signature that is unidirectional and yields compression; generating at least one textual search query using the at least one signature; wherein the generating of the textual search query comprises: (a) searching for at least one matching stored signature that matches one or more of the at least one signature; and (b) using a mapping between stored signatures and textual search queries, selecting at least one textual search query mapped to at least one matching stored signature; searching the plurality of information sources using the at least one textual search query; and causing a display of search results retrieved from the plurality of information sources.
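As a rough illustration of the query-generation flow described above, the sketch below stands in a content hash for the unidirectional, compressing signature and a plain dictionary for the stored signature-to-query mapping; `SIGNATURE_TO_QUERY`, the digest length, and the exact-match rule are hypothetical simplifications, not the patented signature scheme.

```python
import hashlib

# Hypothetical mapping from stored signatures to textual search queries.
SIGNATURE_TO_QUERY = {
    "a1b2": "red sports car",
    "c3d4": "golden retriever puppy",
}

def generate_signature(multimedia_bytes: bytes, length: int = 4) -> str:
    """One-way, lossy digest standing in for the multimedia signature."""
    return hashlib.sha256(multimedia_bytes).hexdigest()[:length]

def search_with_multimedia(multimedia_bytes: bytes, sources):
    """Map the element's signature to textual queries, then search each source."""
    signature = generate_signature(multimedia_bytes)
    results = []
    for stored_signature, query in SIGNATURE_TO_QUERY.items():
        if stored_signature == signature:     # toy exact-match rule
            for search_source in sources:     # each source: query -> list
                results.extend(search_source(query))
    return results                            # results would then be displayed
```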
Using Video Clips as Dictionary Usage Examples
Implementations are provided for automatically mining corpus(es) of electronic video files for video clips that contain spoken utterances that are suitable usage examples to accompany or complement dictionary definitions. These video clips may then be associated with target n-grams in a searchable database, such as a database underlying an online dictionary. In various implementations, a set of candidate video clips in which a target n-gram is uttered in a target context may be identified from a corpus of electronic video files. For each candidate video clip of the set, pre-existing manual subtitles associated with the candidate video clip may be compared to text generated based on speech recognition processing of an audio portion of the candidate video clip. Based at least in part on the comparing, a measure of suitability as a dictionary usage example may be calculated for the candidate video clip.
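A minimal sketch of the suitability measure, assuming it reduces to transcript agreement via `difflib`; the clip fields, threshold, and scoring here are hypothetical, and a real implementation would be considerably more involved.

```python
from difflib import SequenceMatcher

def suitability_score(manual_subtitles: str, asr_text: str) -> float:
    """Agreement between pre-existing subtitles and speech-recognition output."""
    return SequenceMatcher(None, manual_subtitles.lower(), asr_text.lower()).ratio()

def select_usage_examples(candidates, target_ngram: str, threshold: float = 0.85):
    """Keep clips that utter the target n-gram and transcribe consistently."""
    selected = []
    for clip in candidates:  # each clip: {"id", "subtitles", "asr_text"}
        if target_ngram.lower() in clip["subtitles"].lower():
            score = suitability_score(clip["subtitles"], clip["asr_text"])
            if score >= threshold:
                selected.append((clip["id"], score))
    return sorted(selected, key=lambda pair: -pair[1])
```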
Tagging an image with audio-related metadata
In one aspect, an example method to be performed by a computing device includes (a) receiving a request to use a camera of the computing device; (b) in response to receiving the request, (i) using a microphone of the computing device to capture audio content and (ii) using the camera of the computing device to capture an image; (c) identifying reference audio content that has at least a threshold extent of similarity with the captured audio content; and (d) outputting an indication of the identified reference audio content while displaying the captured image.
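To make step (c) concrete, here is a toy similarity check; a real system would use audio fingerprinting, whereas this sketch assumes hypothetical fixed-length feature vectors and a cosine-similarity threshold.

```python
import math

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def identify_reference_audio(captured_features, reference_library, threshold=0.9):
    """Return the best-matching reference ID that clears the threshold, else None."""
    best_id, best_score = None, threshold
    for ref_id, ref_features in reference_library.items():
        score = cosine_similarity(captured_features, ref_features)
        if score >= best_score:
            best_id, best_score = ref_id, score
    return best_id   # an indication of this ID would be shown with the image
```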
Model-based dubbing to translate spoken audio in a video
Model-based dubbing techniques are implemented to generate a translated version of a source video. Spoken audio portions of a source video may be extracted and semantic graphs generated that represent the spoken audio portions. The semantic graphs may be used to produce translations of the spoken portions. A machine learning model may be implemented to generate replacement audio for the spoken portions using the translation of the spoken portion. A machine learning model may be implemented to generate modifications to facial image data for a speaker of the replacement audio.
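The skeleton below traces only the described data flow; every function body is a trivial string-based stand-in for what would be a trained model, and none of it reflects the actual model architectures.

```python
from dataclasses import dataclass

@dataclass
class SpokenSegment:
    speaker_id: str
    text: str      # stand-in for the spoken audio waveform
    frames: list   # stand-in for the speaker's facial image data

def build_semantic_graph(text: str) -> list:
    """Toy 'semantic graph': just the token list."""
    return text.split()

def translate_graph(graph: list, target_language: str) -> str:
    """Stand-in for graph-based translation."""
    return f"[{target_language}] " + " ".join(graph)

def synthesize_speech(text: str, voice: str) -> str:
    """Stand-in for the replacement-audio generation model."""
    return f"<audio voice={voice}: {text}>"

def adapt_facial_frames(frames: list, audio: str) -> list:
    """Stand-in for the facial-image modification model (lip sync)."""
    return [f"{frame}+lips" for frame in frames]

def dub_segment(segment: SpokenSegment, target_language: str):
    graph = build_semantic_graph(segment.text)
    translated = translate_graph(graph, target_language)
    audio = synthesize_speech(translated, voice=segment.speaker_id)
    return audio, adapt_facial_frames(segment.frames, audio)

print(dub_segment(SpokenSegment("spk0", "hello world", ["f0", "f1"]), "es"))
```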
System and method for physiological monitoring and feature set optimization for classification of physiological signal
This disclosure relates generally to physiological monitoring, and more particularly to feature set optimization for classification of physiological signals. In one embodiment, a method for physiological monitoring includes identifying a clean physiological signal training set from an input physiological signal based on Dynamic Time Warping (DTW) of segments associated with the physiological signal. An optimal feature set is extracted from the clean physiological signal training set based on a Maximum Consistency and Maximum Dominance (MCMD) property of the feature set, which strictly optimizes the objective function, conditional likelihood maximization, over different selection criteria such that diverse properties of the different selection parameters are captured and Pareto-optimality is achieved. The input physiological signal is classified into normal signal components and abnormal signal components using the optimal feature set.
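The DTW screen for the clean training set can be sketched as follows, assuming 1-D segments, a hypothetical clean template, and an arbitrary threshold; the MCMD feature-selection stage is omitted here.

```python
def dtw_distance(a, b) -> float:
    """Classic dynamic-programming Dynamic Time Warping distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # insertion
                                    cost[i][j - 1],      # deletion
                                    cost[i - 1][j - 1])  # match
    return cost[n][m]

def clean_training_set(segments, template, threshold=5.0):
    """Keep segments whose DTW distance to a clean template is small."""
    return [seg for seg in segments if dtw_distance(seg, template) <= threshold]
```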
Methods, systems, and apparatuses to respond to voice requests to play desired video clips in streamed media based on matched closed-caption and subtitle text
Methods, systems, and apparatuses are described to implement voice search in media content: for requesting media content of a video clip of a scene contained in the media content streamed to the client device; for capturing the voice request for the media content of the video clip to display at the client device, wherein the streamed media content is a selected video streamed from a video source; for applying an NLP solution to convert the voice request to text for matching to a set of one or more words contained in at least closed-caption text of the selected video; for associating matched words in the closed-caption text with a start index and an end index of the video clip contained in the selected video; and for streaming the video clip to the client device based on the start index and the end index associated with the matched closed-caption text.
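A toy version of the caption-matching step, assuming the voice request has already been converted to text and captions are modeled as (start, end, text) cues; the overlap rule and field layout are illustrative only.

```python
def find_clip(caption_cues, query_text: str):
    """Return (start, end) spanning the cues that share words with the query."""
    query_words = set(query_text.lower().split())
    matched = [
        (start, end)
        for start, end, text in caption_cues
        if query_words & set(text.lower().split())
    ]
    if not matched:
        return None
    return min(s for s, _ in matched), max(e for _, e in matched)

cues = [
    (12.0, 15.5, "I'll be back"),
    (40.2, 44.0, "Hasta la vista baby"),
]
print(find_clip(cues, "hasta la vista"))   # (40.2, 44.0) -> stream this span
```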
Manufacture of NFTs from film libraries
Methods and processes for manufacture of an image product from a digital image. An object in the digital image is detected and recognized. Object metadata is assigned to the object, the object metadata linking the object in the digital image to the sound it produced. At least one cryptographic hash of the object metadata is generated, and the hash is written to a node of a transaction processing network.
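A minimal sketch of the hashing step, assuming detection has already produced the metadata; the field names are hypothetical, and a plain list stands in for a node of the transaction-processing network.

```python
import hashlib
import json

def hash_object_metadata(metadata: dict) -> str:
    """Deterministic cryptographic hash over canonicalized object metadata."""
    canonical = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

ledger = []   # stand-in for a node of the transaction-processing network

metadata = {
    "object": "steam locomotive",             # recognized object
    "linked_sound": "whistle_001.wav",        # sound the object produced
    "source_frame": "film_reel_07/frame_3412",
}
ledger.append({"tx": hash_object_metadata(metadata)})
print(ledger)
```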
Text-driven video synthesis with phonetic dictionary
Presented herein are novel approaches to synthesizing video of speech from text. In a training phase, embodiments build a phoneme-pose dictionary and train a generative neural network model using a generative adversarial network (GAN) to generate video from interpolated phoneme poses. In deployment, the trained generative neural network, in conjunction with the phoneme-pose dictionary, converts an input text into a video of a person speaking the words of the input text. Compared to audio-driven video generation approaches, the embodiments herein have a number of advantages: 1) they need only a fraction of the training data used by an audio-driven approach; 2) they are more flexible and not subject to vulnerability due to speaker variation; and 3) they significantly reduce the preprocessing, training, and inference times.
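The deployment path can be caricatured as below, with a tiny hand-made phoneme-pose dictionary and linear interpolation standing in for the real dictionary and the trained generative network; the pose values are invented.

```python
PHONEME_POSE = {  # hypothetical dictionary: phoneme -> 2-D mouth-pose vector
    "AA": (0.9, 0.4), "B": (0.1, 0.1), "IY": (0.5, 0.8),
}

def interpolate(p, q, steps: int = 3):
    """Linearly interpolate between two consecutive phoneme poses."""
    return [
        tuple(a + (b - a) * t / steps for a, b in zip(p, q))
        for t in range(steps)
    ]

def poses_for_phonemes(phonemes):
    """Interpolated pose frames that would condition the generative model."""
    key_poses = [PHONEME_POSE[ph] for ph in phonemes if ph in PHONEME_POSE]
    frames = []
    for p, q in zip(key_poses, key_poses[1:]):
        frames.extend(interpolate(p, q))
    return frames

print(poses_for_phonemes(["B", "AA", "IY"]))
```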
Playback of audio content along with associated non-static media content
This disclosure concerns the provision of media, and more particularly the streaming of media. In particular, one aspect herein relates to a method performed by a server system of streaming an audio content item to an electronic device. In response to receiving a request message from the electronic device, a selected audio content item is retrieved from a first storage. Descriptive metadata including an origin-ID associated with the retrieved audio content item is determined. A second storage is browsed utilizing said metadata including the origin-ID to locate non-static media content item(s) associated with the origin-ID. In response to finding a non-static media content item associated with the origin-ID, the selected audio content item is sent along with the located non-static media content item to the electronic device for simultaneous presentation of the audio content item and the located non-static media content item.
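A minimal sketch of the server-side lookup, with dictionaries standing in for the two storages; field names such as "origin_id" are assumptions, not the disclosed schema.

```python
AUDIO_STORE = {   # first storage: audio content items with descriptive metadata
    "track-1": {"audio": b"...", "metadata": {"origin_id": "album-42"}},
}
MEDIA_STORE = {   # second storage: non-static media keyed by origin-ID
    "album-42": "canvas_loop_album-42.mp4",
}

def handle_request(track_id: str):
    """Return the audio item plus any non-static media sharing its origin-ID."""
    item = AUDIO_STORE[track_id]
    origin_id = item["metadata"]["origin_id"]
    visual = MEDIA_STORE.get(origin_id)        # browse the second storage
    return item["audio"], visual               # sent together when visual exists

print(handle_request("track-1"))
```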
Method and apparatus for representing space of interest of audio scene
Aspects of the disclosure include methods, apparatuses, and non-transitory computer-readable storage mediums for representing a space of interest of an audio scene. One apparatus includes processing circuitry that decodes audio scene data for the audio scene. The audio scene data includes (i) audio content for a plurality of items representing the audio scene and (ii) a first syntax element indicating a type of a subset of the plurality of items. The subset of the plurality of items represents the space of interest of the audio scene. The processing circuitry determines a part of the audio content for the subset of the plurality of items based on the type of the subset of the plurality of items indicated in the first syntax element. The processing circuitry renders the determined part of the audio content.
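As a toy rendering of the decode path, the sketch below models the audio scene data as a dictionary and the first syntax element as a plain string; item types and field names are invented for illustration.

```python
scene_data = {
    "items": [
        {"id": 1, "type": "object",  "audio": "footsteps.pcm"},
        {"id": 2, "type": "channel", "audio": "ambience.pcm"},
        {"id": 3, "type": "object",  "audio": "dialog.pcm"},
    ],
    "space_of_interest_type": "object",   # first syntax element
}

def render_space_of_interest(data):
    """Render only the items whose type matches the signaled subset type."""
    subset_type = data["space_of_interest_type"]
    for item in data["items"]:
        if item["type"] == subset_type:   # part of the space of interest
            print(f"rendering item {item['id']}: {item['audio']}")

render_space_of_interest(scene_data)
```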