Patent classifications
G06V20/46
Thumbnail Image Replacement
Methods for recognizing thumbnails may include operations of receiving an identification of a thumbnail source for content, receiving the thumbnail, computing a hash value for the thumbnail, and associating the hash value with the thumbnail. Operations for content characterization may include launching an image analysis application, selecting a top level category to apply to a thumbnail, providing the thumbnail to the image analysis application, applying the selected top level category to the thumbnail to determine if the thumbnail satisfies the top level category, if satisfied, associating the top level category with the thumbnail, and repeating one or more of the above operations with respect to a second category. Operations may include receiving an identification of a node to receive a thumbnail, obtaining a node selected category, receiving a proposed thumbnail to provide to the node, and determining if the proposed thumbnail has been previously recognized and categorized.
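The hash-and-associate flow described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `ThumbnailRegistry` class, its method names, and the choice of SHA-256 as the hash function are all assumptions for the example.

```python
import hashlib

def compute_thumbnail_hash(thumbnail_bytes: bytes) -> str:
    # Compute a hash value for the thumbnail's raw bytes
    # (SHA-256 chosen here; the abstract does not specify a hash).
    return hashlib.sha256(thumbnail_bytes).hexdigest()

class ThumbnailRegistry:
    """Hypothetical store that associates hash values with thumbnails
    and their categories, so a proposed thumbnail can be checked for
    prior recognition."""

    def __init__(self):
        self._by_hash = {}  # hash -> {"source": ..., "categories": [...]}

    def register(self, source_id: str, thumbnail_bytes: bytes) -> str:
        # Receive the thumbnail, compute its hash, associate the two.
        h = compute_thumbnail_hash(thumbnail_bytes)
        self._by_hash.setdefault(h, {"source": source_id, "categories": []})
        return h

    def add_category(self, h: str, category: str) -> None:
        # Associate a satisfied top level category with the thumbnail.
        self._by_hash[h]["categories"].append(category)

    def is_recognized(self, thumbnail_bytes: bytes) -> bool:
        # A proposed thumbnail is "previously recognized" if its hash is known.
        return compute_thumbnail_hash(thumbnail_bytes) in self._by_hash
```

Hashing the bytes rather than comparing images directly makes the recognition check a constant-time dictionary lookup.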
FEW-SHOT ACTION RECOGNITION
Methods and systems of training a neural network include training a feature extractor and a classifier using a first set of training data that includes one or more base cases. The classifier is trained with few-shot adaptation using a second set of training data, smaller than the first set of training data, while keeping parameters of the feature extractor constant.
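The key idea, a classifier adapted from a small support set while the feature extractor's parameters stay frozen, can be illustrated with a toy nearest-centroid classifier. The feature function and class names here are stand-ins invented for the sketch; the abstract does not specify the classifier type.

```python
def extract_features(x):
    # Stand-in for a pretrained feature extractor; during few-shot
    # adaptation its parameters are held constant (here it has none).
    return [x[0] + x[1], x[0] - x[1]]

class CentroidClassifier:
    """Hypothetical classifier adapted with few-shot data: it simply
    stores one feature-space centroid per class."""

    def __init__(self):
        self.centroids = {}

    def adapt(self, support_set):
        # support_set: a small second set of (input, label) pairs,
        # much smaller than the base training data.
        sums = {}
        for x, label in support_set:
            f = extract_features(x)
            s, n = sums.get(label, ([0.0] * len(f), 0))
            sums[label] = ([a + b for a, b in zip(s, f)], n + 1)
        self.centroids = {lbl: [v / n for v in s] for lbl, (s, n) in sums.items()}

    def predict(self, x):
        # Assign the label of the nearest class centroid.
        f = extract_features(x)
        dist = lambda c: sum((a - b) ** 2 for a, b in zip(f, c))
        return min(self.centroids, key=lambda lbl: dist(self.centroids[lbl]))
```

Freezing the extractor means only the lightweight classifier state changes during adaptation, which is what makes learning from a handful of examples feasible.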
METHOD FOR OPTIMIZING PROCESS OF DISPLAYING VIDEO STREAMS WITH SPECIFIED EVENT, APPARATUS EMPLOYING METHOD, AND COMPUTER READABLE STORAGE MEDIUM
A method for optimizing a process of displaying video streams with a specified event receives video streams. The video streams are sequenced based on a specified arrangement rule to form a video stream queue. Each video stream is analyzed to determine whether it includes a specified event. If no video stream includes the specified event, the video streams are output based on the video stream queue. If a video stream includes the specified event, that video stream is given priority in the video stream queue, and the video streams are output based on the updated video stream queue. Video streams with the specified event can thus be processed and attended to first. A video stream processing apparatus and a computer readable storage medium applying the method are also provided.
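The queue update described above amounts to a stable partition: streams with the event move to the front, everything else keeps its original order. A minimal sketch, with the predicate and function names invented for illustration:

```python
def prioritize_streams(streams, has_specified_event):
    """Reorder a video stream queue so streams containing the specified
    event are output first; all other streams keep their original
    sequence (the 'specified arrangement rule')."""
    with_event = [s for s in streams if has_specified_event(s)]
    without_event = [s for s in streams if not has_specified_event(s)]
    return with_event + without_event
```

Because both list comprehensions preserve arrival order, the result is a stable reordering, only the event-bearing streams jump the queue.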
Person replacement utilizing deferred neural rendering
Techniques are disclosed for performing video synthesis of audiovisual content. In an example, a computing system may determine first parameters of a face and body of a source person from a first frame in a video shot. The system also determines second parameters of a face and body of a target person. The system determines that the target person is a replacement for the source person in the first frame. The system generates third parameters of the target person based on merging the first parameters with the second parameters. The system then performs deferred neural rendering of the target person based on a neural texture that corresponds to a texture space of the video shot. The system then outputs a second frame that shows the target person as the replacement for the source person.
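The parameter merge at the heart of this technique, motion-like parameters from the source person in the frame, identity-like parameters from the target person, can be sketched with plain dictionaries. Which keys count as "motion" versus "identity" is an assumption for this example; the abstract does not enumerate them.

```python
def merge_parameters(source_params, target_params):
    """Produce third parameters for rendering: pose/expression/location
    follow the source person's performance in the frame, while
    identity-defining parameters stay those of the target person."""
    motion_keys = {"body_pose", "face_expression", "location"}  # assumed split
    merged = dict(target_params)  # start from the target's identity
    for k in motion_keys:
        if k in source_params:
            merged[k] = source_params[k]
    return merged
```

The merged parameters would then drive the deferred neural rendering step, with the neural texture supplying the target's appearance detail.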
Scene-aware video encoder system and method
Embodiments of the present disclosure disclose a scene-aware video encoder system. The scene-aware encoder system transforms a sequence of video frames of a video of a scene into a spatio-temporal scene graph. The spatio-temporal scene graph includes nodes representing one or multiple static and dynamic objects in the scene. Each node of the spatio-temporal scene graph describes an appearance, a location, and/or a motion of each of the objects (static and dynamic objects) at different time instances. The nodes of the spatio-temporal scene graph are embedded into a latent space using a spatio-temporal transformer encoding different combinations of different nodes of the spatio-temporal scene graph corresponding to different spatio-temporal volumes of the scene. Each node of the different nodes encoded in each of the combinations is weighted with an attention score determined as a function of similarities of spatio-temporal locations of the different nodes in the combination.
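The final step, turning similarities of spatio-temporal locations into per-node attention weights, is commonly done with a softmax, as in standard transformer attention. A minimal sketch under that assumption (the abstract says only "a function of similarities"):

```python
import math

def attention_scores(similarities):
    """Convert a list of similarity scores (one per node in a
    combination) into normalized attention weights via a numerically
    stable softmax."""
    m = max(similarities)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - m) for s in similarities]
    total = sum(exps)
    return [e / total for e in exps]
```

Nodes whose spatio-temporal locations are more similar to the rest of the combination receive proportionally larger weights, and the weights always sum to one.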
Methods, systems, and media for adaptive presentation of a video content item based on an area of interest
Methods, systems, and media for adaptive presentation of a video content item based on an area of interest are provided. In some embodiments, the method comprises: causing a video content item to be presented within a viewport having first dimensions in connection with a web page, wherein the video content item is associated with area of interest information corresponding to one or more frames of the video content item; determining that the first dimensions associated with the viewport have changed in which the viewport is currently associated with second dimensions; determining that a modified video content item should be presented within the viewport having the second dimensions in response to determining that the first dimensions associated with the viewport have changed, wherein the modified video content item includes an area of interest based on the area of interest information associated with the video content item and wherein portions of at least one frame of the modified video content item are removed based on the second dimensions of the viewport; and causing the modified video content item to be presented within the viewport having the second dimensions.
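The geometric core of this method, choosing which portions of a frame to remove so the area of interest survives a viewport resize, can be sketched as computing a crop rectangle at the new viewport's aspect ratio, centered on the area of interest and clamped to the frame. The function name and the centering/clamping policy are assumptions for the example.

```python
def crop_to_area_of_interest(frame_w, frame_h, aoi, viewport_w, viewport_h):
    """Compute a crop rectangle (x, y, w, h) matching the viewport's new
    aspect ratio, containing the area of interest (x, y, w, h), centered
    on it, and clamped to the frame bounds. Portions of the frame
    outside this rectangle would be removed."""
    ax, ay, aw, ah = aoi
    target_ratio = viewport_w / viewport_h
    # Expand the AOI box to the viewport aspect ratio.
    if aw / ah < target_ratio:
        crop_w, crop_h = ah * target_ratio, ah
    else:
        crop_w, crop_h = aw, aw / target_ratio
    # Center the crop on the AOI, then clamp to the frame.
    cx, cy = ax + aw / 2, ay + ah / 2
    x = min(max(cx - crop_w / 2, 0), frame_w - crop_w)
    y = min(max(cy - crop_h / 2, 0), frame_h - crop_h)
    return (x, y, crop_w, crop_h)
```

For example, a 200x200 area of interest in a 1920x1080 frame, shown in a viewport that becomes portrait (ratio 1:2), yields a 200x400 crop around the same region.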
System for deliverables versioning in audio mastering
Some implementations of the disclosure relate to using a model trained on mixing console data of sound mixes to automate the process of sound mix creation. In one implementation, a non-transitory computer-readable medium has executable instructions stored thereon that, when executed by a processor, cause the processor to perform operations comprising: obtaining a first version of a sound mix; extracting first audio features from the first version of the sound mix; obtaining mixing metadata; automatically calculating with a trained model, using at least the mixing metadata and the first audio features, mixing console features; and deriving a second version of the sound mix using at least the mixing console features calculated by the trained model.
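The pipeline shape, features plus metadata in, console features out, new version derived, can be sketched as below. Everything here is illustrative: the trained model is stubbed as a callable, and reducing "mixing console features" to a single gain is a simplification for the example.

```python
def derive_second_mix(samples, audio_features, mixing_metadata, model):
    """Sketch of the versioning pipeline: a trained model maps the first
    version's audio features and the mixing metadata to mixing console
    features (here just a gain), which are applied to derive the second
    version of the sound mix."""
    console_features = model(audio_features, mixing_metadata)
    gain = console_features["gain"]  # assumed single console feature
    return [s * gain for s in samples]
```

In practice the console features would cover many channels and processing parameters (fader levels, EQ, dynamics), and the metadata might encode the deliverable target, e.g. a theatrical versus a streaming version.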
Generating structured data from screen recordings
Generating structured data from screen recordings is disclosed, including: obtaining, from a client device, a screen recording of a user's activities on the client device with respect to a task; performing, at a server, video validation on the screen recording, including by determining whether the screen recording matches a set of validation parameters associated with the task; and generating a set of structured data based at least in part on the video validation.
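The server-side validation step, checking a recording against a task's validation parameters, can be sketched as a set of named checks whose results themselves form structured data. The specific parameters (duration window, expected app, resolution) are assumptions; the abstract does not enumerate them.

```python
def validate_recording(recording_meta, validation_params):
    """Check a screen recording's metadata against the task's validation
    parameters. Returns (passed, per-check detail) so the detail can
    feed into the generated structured data."""
    checks = {
        "duration_ok": (
            validation_params["min_duration_s"]
            <= recording_meta["duration_s"]
            <= validation_params["max_duration_s"]
        ),
        "app_ok": recording_meta["app"] == validation_params["expected_app"],
        "resolution_ok": recording_meta["resolution"]
            in validation_params["allowed_resolutions"],
    }
    return all(checks.values()), checks
```

Returning the per-check breakdown, rather than a bare boolean, makes it easy to tell the client exactly which validation parameter a rejected recording failed.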
System and method for providing unsupervised domain adaptation for spatio-temporal action localization
A system and method for providing unsupervised domain adaptation for spatio-temporal action localization that includes receiving video data associated with a source domain and a target domain that are associated with a surrounding environment of a vehicle. The system and method also include analyzing the video data associated with the source domain and the target domain and determining a key frame of the source domain and a key frame of the target domain. The system and method additionally include completing an action localization model to model a temporal context of actions occurring within the key frame of the source domain and the key frame of the target domain and completing an action adaptation model to localize individuals and their actions and to classify the actions based on the video data. The system and method further include combining losses to complete spatio-temporal action localization of individuals and actions.
Video visual relation detection methods and systems
Methods and systems for detecting visual relations in a video are disclosed. A method comprises: decomposing the video sequence into a plurality of segments; for each segment, detecting objects in frames of the segment; tracking the detected objects over the segment to form a set of object tracklets for the segment; for the detected objects, extracting object features; for pairs of object tracklets of the set of object tracklets, extracting relativity features indicative of a relation between the objects corresponding to the pair of object tracklets; forming relation feature vectors for pairs of object tracklets using the object features of objects corresponding to respective pairs of object tracklets and the relativity features of the respective pairs of object tracklets; generating a set of segment relation prediction results from the relation feature vectors; generating a set of visual relation instances for the video sequence by merging the segment prediction results from different segments; and generating a set of visual relation detection results from the set of visual relation instances.
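The relation-feature-vector step above can be sketched as concatenating the two objects' features with pairwise "relativity" features derived from their tracklets. Representing a tracklet as a frame-to-box mapping and using mean relative displacement as the relativity feature are assumptions for this example.

```python
def relativity_features(track_a, track_b):
    """Relativity features for a tracklet pair: the mean relative
    position of object b with respect to object a over their
    co-occurring frames. Each tracklet maps frame -> (x, y, w, h)."""
    common = sorted(set(track_a) & set(track_b))
    dx = sum(track_b[f][0] - track_a[f][0] for f in common) / len(common)
    dy = sum(track_b[f][1] - track_a[f][1] for f in common) / len(common)
    return [dx, dy]

def relation_feature_vector(obj_feat_a, obj_feat_b, track_a, track_b):
    """Form the relation feature vector for a tracklet pair by
    concatenating both objects' features with their relativity features."""
    return obj_feat_a + obj_feat_b + relativity_features(track_a, track_b)
```

A downstream predictor would map such vectors to (subject, predicate, object) triples per segment, which are then merged across segments into video-level relation instances.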