Patent classifications
G06V20/47
Techniques for dense video descriptions
Techniques and apparatus for generating dense natural language descriptions for video content are described. In one embodiment, for example, an apparatus may include at least one memory and logic, at least a portion of the logic comprised in hardware coupled to the at least one memory, the logic to receive a source video comprising a plurality of frames, determine a plurality of regions for each of the plurality of frames, generate at least one region-sequence connecting the determined plurality of regions, and apply a language model to the at least one region-sequence to generate description information comprising a description of at least a portion of content of the source video. Other embodiments are described and claimed.
GENERATING A VIDEO SEGMENT OF AN ACTION FROM A VIDEO
A computer-implemented method includes receiving a video that includes multiple frames. The method further includes identifying a start time and an end time of each action in the video based on application of one or more of an audio classifier, an RGB classifier, and a motion classifier. The method further includes identifying video segments from the video that include frames between the start time and the end time for each action in the video. The method further includes generating a confidence score for each of the video segments based on a probability that a corresponding action corresponds to one or more of a set of predetermined actions. The method further includes selecting a subset of the video segments based on the confidence score for each of the video segments.
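The final selection step in the abstract above can be sketched as follows. This is a minimal illustration, not the patented method: the `Segment` type, the `select_segments` helper, and the 0.5 threshold are all assumptions, and the classifiers that would produce the confidence scores are omitted.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float       # start time of the action, in seconds
    end: float         # end time of the action, in seconds
    confidence: float  # probability that the action matches a predetermined action


def select_segments(segments, min_confidence=0.5):
    """Keep segments at or above the (assumed) threshold, highest confidence first."""
    kept = [s for s in segments if s.confidence >= min_confidence]
    return sorted(kept, key=lambda s: s.confidence, reverse=True)


# Hypothetical scores, as if produced by upstream classifiers.
segments = [
    Segment(0.0, 2.5, 0.91),
    Segment(4.0, 6.0, 0.32),
    Segment(7.5, 9.0, 0.67),
]
selected = select_segments(segments)
```

A fixed threshold is only one way to pick a subset; keeping the top few segments by score would fit the same description.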
Systems and Methods for Extracting and Matching Descriptors from Data Structures Describing an Image Sequence
A compact image sequence descriptor (101), used for describing an image sequence, comprises a segment global descriptor (113) for at least one segment within the sequence, which includes global descriptor information for respective images, relating to interest points within the video content of the images. The segment global descriptor (113) includes a base descriptor (121), which is a global descriptor associated with a representative frame (120) of the image sequence, and a number of relative descriptors (125). Each relative descriptor contains information of a respective global descriptor relative to the base descriptor, allowing reconstruction of an exact or approximate global descriptor associated with a respective image of the image sequence. The image sequence descriptor (101) may further include a segment local descriptor (114) for a segment, comprising a set of encoded local feature descriptors.
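The base-plus-relative layout described above can be illustrated with a small sketch. The abstract does not say how a relative descriptor is computed; element-wise difference from the base is an assumption made here for illustration, and the function names `encode_segment` and `decode_segment` are hypothetical.

```python
def encode_segment(descriptors, base_index=0):
    """Split per-frame global descriptors into a base and relative descriptors.

    Assumption: a relative descriptor is the element-wise difference
    between a frame's descriptor and the base descriptor.
    """
    base = list(descriptors[base_index])
    relatives = [
        [value - b for value, b in zip(descriptor, base)]
        for i, descriptor in enumerate(descriptors)
        if i != base_index
    ]
    return base, relatives


def decode_segment(base, relatives, base_index=0):
    """Rebuild every frame's global descriptor from the base and the relatives."""
    decoded = [[b + r for b, r in zip(base, relative)] for relative in relatives]
    decoded.insert(base_index, list(base))  # put the representative frame back in place
    return decoded


# Toy integer descriptors for three frames; the middle frame is the representative one.
frames = [[1, 2], [3, 4], [5, 6]]
base, relatives = encode_segment(frames, base_index=1)
restored = decode_segment(base, relatives, base_index=1)
```

With exact differences the reconstruction is lossless; quantizing the relative descriptors would give the approximate reconstruction the abstract also allows.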
AUTOMATIC ANIMATION TRIGGERING FROM VIDEO
A computer-implemented method includes identifying interesting moments from a video. The video is received and includes image frames. Continual motion of one or more objects in the video is identified based on identifying foreground motion in the image frames. Video segments from the video that include the continual motion are generated. A segment score for each of the video segments is generated based on animation criteria. Responsive to one or more of the segment scores exceeding a threshold animation score, one or more corresponding video segments are selected. An animation is generated based on the one or more corresponding video segments.
System and method for processing a video stream to extract highlights
With the widespread availability of video cameras, we face an enormous, ever-growing collection of unedited and unstructured video data. Because there is no automatic way to generate highlights from this large collection of video streams, these videos can be tedious and time-consuming to index or search. The present invention is a novel method of online video highlighting, a principled way of generating a short video highlight summarizing the most important and interesting contents of a potentially very long video, which would be costly in both time and money to process manually. Specifically, the method learns a dictionary from a given video using group sparse coding, and updates atoms in the dictionary on the fly. A highlight of the given video is then generated by combining segments that cannot be sparsely reconstructed using the learned dictionary. The online fashion of the method enables it to process arbitrarily long videos and to start generating highlights before seeing the end of the video, both attractive characteristics for practical applications.
Storage system of original frame of monitor data and storage method thereof
A storage system of original frames of monitor data and a storage method thereof are provided. The storage system includes a monitor sensor, an event marking circuit, a data storage circuit and a frame processing circuit. The monitor sensor provides a plurality of original frames. The event marking circuit has an input terminal coupled to the monitor sensor and an output terminal, and is used for determining an event intensity of a corresponding one of the original frames and marking the event intensity on the corresponding original frame. The data storage circuit is coupled to the output terminal and is used for completely storing the original frames. The frame processing circuit is coupled to the data storage circuit and is used for checking, according to the event intensities, whether the original frames within the data storage circuit are to be deleted.
Systems and Methods for Identifying Activities in Media Contents Based on Prediction Confidences
There is provided a system comprising a memory and a processor configured to receive a media content depicting an activity, extract a first plurality of features from a first segment of the media content, make a first prediction that the media content depicts a first activity based on the first plurality of features, wherein the first prediction has a first confidence level, extract a second plurality of features from a second segment of the media content, the second segment temporally following the first segment in the media content, make a second prediction that the media content depicts the first activity based on the second plurality of features, wherein the second prediction has a second confidence level, and determine that the media content depicts the first activity based on the first prediction and the second prediction, wherein the second confidence level is at least as high as the first confidence level.
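The decision rule at the end of the abstract above, where the later prediction must be at least as confident as the earlier one, can be sketched as follows. The `confirm_activity` name and the `(label, confidence)` tuple shape are assumptions; feature extraction and the predictor itself are left out.

```python
def confirm_activity(first_prediction, second_prediction):
    """Return the activity label if two consecutive predictions support it.

    Each prediction is an assumed (label, confidence) pair. The activity is
    confirmed only when both segments predict the same label and the
    confidence of the later segment is at least that of the earlier one.
    """
    first_label, first_confidence = first_prediction
    second_label, second_confidence = second_prediction
    if first_label == second_label and second_confidence >= first_confidence:
        return first_label
    return None


# Hypothetical predictions for two temporally consecutive segments.
rising = confirm_activity(("running", 0.6), ("running", 0.8))
falling = confirm_activity(("running", 0.8), ("running", 0.6))
mismatch = confirm_activity(("running", 0.6), ("jumping", 0.9))
```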
EVENT/OBJECT-OF-INTEREST CENTRIC TIMELAPSE VIDEO GENERATION ON CAMERA DEVICE WITH THE ASSISTANCE OF NEURAL NETWORK INPUT
An apparatus includes an interface and a processor. The interface may be configured to receive pixel data generated by a capture device. The processor may be configured to generate video frames in response to the pixel data, perform computer vision operations on the video frames to detect objects, perform a classification of the objects detected based on characteristics of the objects, determine whether the classification of the objects corresponds to a user-defined event, and generate encoded video frames from the video frames. The encoded video frames may be communicated to a cloud storage service. The encoded video frames may comprise a first sample of the video frames selected at a first rate when the user-defined event is not detected and a second sample of the video frames selected at a second rate while the user-defined event is detected. The second rate may be greater than the first rate.
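The two-rate frame selection described above can be sketched in a few lines. This is an illustration under stated assumptions only: the `select_frames` helper, the per-frame boolean event flags, and the step sizes of 30 and 5 frames are all hypothetical, and object detection, classification, and encoding are omitted.

```python
def select_frames(event_flags, idle_step=30, event_step=5):
    """Return indices of frames to keep for the timelapse.

    event_flags[i] is True when the user-defined event is detected in frame i.
    Frames are kept every `idle_step` frames when no event is detected and
    every `event_step` frames during an event, so the event rate is higher.
    """
    selected = []
    for index, event_detected in enumerate(event_flags):
        step = event_step if event_detected else idle_step
        if index % step == 0:
            selected.append(index)
    return selected


# 60 idle frames, 20 frames with the event detected, then 10 idle frames.
flags = [False] * 60 + [True] * 20 + [False] * 10
kept = select_frames(flags)
```

Here the event portion contributes four frames out of twenty, while the idle portions contribute two out of seventy, which mirrors the "second rate greater than the first rate" condition.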
Associative object tracking systems and methods
Systems and methods track a first object when continuous tracking information for the first object is not available. The systems and methods detect when the tracking information for the first object is not available. A last time of a last determined location of the first object is determined and a second object closest to the last determined location at the last time is determined. The location of the first object is associated with a location of the second object if tracking information for the first object is not available.
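The association step described above can be sketched as follows. The data layout (dictionaries mapping time to an `(x, y)` location), the `estimate_location` name, and the use of Euclidean distance for "closest" are assumptions made for illustration.

```python
import math


def estimate_location(first_track, other_tracks, now):
    """Estimate where the first object is at time `now`.

    first_track: dict mapping time -> (x, y) for the first object.
    other_tracks: dict mapping object id -> dict of time -> (x, y).
    If the first object has no fix at `now`, borrow the current location of
    the object that was closest to it at its last known time.
    """
    if now in first_track:
        return first_track[now]
    known_times = [t for t in first_track if t <= now]
    if not known_times:
        return None  # the first object was never located
    last_time = max(known_times)
    last_location = first_track[last_time]
    # Only objects located both at the last known time and now can stand in.
    candidates = {
        object_id: track
        for object_id, track in other_tracks.items()
        if last_time in track and now in track
    }
    if not candidates:
        return last_location
    nearest = min(
        candidates,
        key=lambda object_id: math.dist(candidates[object_id][last_time], last_location),
    )
    return candidates[nearest][now]


# Hypothetical tracks: the first object's tracking information stops after time 1.
first = {0: (0.0, 0.0), 1: (1.0, 0.0)}
others = {
    "a": {1: (1.0, 1.0), 2: (5.0, 5.0)},
    "b": {1: (4.0, 4.0), 2: (9.0, 9.0)},
}
estimate = estimate_location(first, others, now=2)
```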
IMAGE PROCESSING APPARATUS, IMAGE PROCESSING METHOD, AND STORAGE MEDIUM
An image processing apparatus that selects images for digest reproduction from a plurality of images, comprises: an information acquisition unit configured to acquire, for every image, shooting information generated at a time of shooting; an image evaluation unit configured to derive evaluation values for images based on the shooting information and an evaluation criterion; and an image selection unit configured to select images for digest reproduction by ranking images based on the evaluation values, wherein the image evaluation unit changes the evaluation criterion based on information on a lens used in shooting the images.