G06F16/7343

Systems and methods for video retrieval and grounding

Methods and systems are described for performing video retrieval together with video grounding. A word-based query for a video is received and encoded into a query representation using a trained query encoder. One or more video representations that are similar to the query representation are identified from a plurality of stored video representations. Each similar video representation represents a respective relevant video. A grounding is generated for each relevant video by forward propagating each respective similar video representation together with the query representation through a trained grounding module. The relevant videos, or identifiers of the relevant videos, are outputted together with the grounding generated for each relevant video.
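The retrieve-then-ground pipeline above can be sketched in a few lines. This is a minimal illustration, not the patented method: the trained encoders are stubbed out as fixed vectors, and the function names (`retrieve`, `ground`) and the cosine-similarity ranking are assumptions for the sake of the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense representations."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, video_vecs, top_k=2):
    """Rank stored video representations by similarity to the query
    representation and return the identifiers of the most similar videos."""
    scored = sorted(video_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [vid for vid, _ in scored[:top_k]]

def ground(query_vec, segment_vecs):
    """Toy stand-in for the grounding module: score each temporal segment
    of a retrieved video against the query and return the best segment."""
    scores = [cosine(query_vec, s) for s in segment_vecs]
    return scores.index(max(scores))
```

In a real system the grounding module would be a trained network taking both representations, but the shape of the computation — retrieve first, then localize within each retrieved video — is the same.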

Image-based query language system for performing database operations on images and videos
11526548 · 2022-12-13

A user device transmits a natural language query with a request for a description of videos stored in a media repository. A query system receives the query and determines a command associated with obtaining the requested description. The determined command corresponds to an image analysis to perform on at least a portion of the stored videos. The query system determines, based at least in part on the determined command, an artificial intelligence model to execute on at least the portion of the stored videos. The query system determines, by executing the determined artificial intelligence model, a model output that includes the requested description of the videos stored in the media repository. The query system provides a response to the query, which includes the requested description.
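The query-to-command-to-model dispatch described above can be sketched as a lookup pipeline. Everything here is an invented placeholder — the keyword matching, command names, and stub models are assumptions standing in for real intent parsing and real vision models.

```python
# Stub "AI models": each maps a list of video identifiers to an output.
MODELS = {
    "count_objects": lambda videos: {v: 0 for v in videos},  # dummy count
    "describe":      lambda videos: {v: f"description of {v}" for v in videos},
}

# Keyword -> command table, a stand-in for real natural-language parsing.
COMMANDS = {
    "how many": "count_objects",
    "describe": "describe",
}

def answer(query, videos):
    """Determine a command from the query, select the corresponding
    model, execute it on the videos, and return the model output."""
    command = next((c for kw, c in COMMANDS.items() if kw in query.lower()),
                   "describe")
    model = MODELS[command]
    return model(videos)
```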

Managing data queries

One method includes receiving a database query, receiving information about a database table in data storage populated with data elements, producing a structural representation of the database table that includes a formatted data organization reflective of the database table and is absent the data elements of the database table, and providing the structural representation and the database query to a plan generator capable of producing a query plan representing operations for executing the database query on the database table. Another method includes receiving a query plan from a plan generator, the plan representing operations for executing a database query on a database table, and producing a dataflow graph from the query plan, wherein the dataflow graph includes at least one node that represents at least one operation represented by the query plan, and includes at least one link that represents at least one dataflow associated with the query plan.
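The second method above — producing a dataflow graph whose nodes represent plan operations and whose links represent dataflows — can be sketched for the simple case of a linear plan. The plan format (an ordered list of operation names) is an assumption for illustration; real query plans are trees or DAGs.

```python
def plan_to_dataflow(plan):
    """Turn an ordered query plan into (nodes, links):
    each operation becomes a node, and each handoff between
    consecutive operations becomes a dataflow link."""
    nodes = [{"id": i, "op": op} for i, op in enumerate(plan)]
    links = [(i, i + 1) for i in range(len(plan) - 1)]
    return nodes, links
```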

Multi-detector probabilistic reasoning for natural language queries

Systems and methods for solving queries on image data are provided. The system includes a processor device coupled to a memory device. The system includes a detector manager with a detector application programming interface (API) to allow external detectors to be inserted into the system by exposing capabilities of the external detectors and providing a predetermined way to execute the external detectors. An ontology manager exposes knowledge bases regarding ontologies to a reasoning engine. A query parser transforms a natural language query into a query directed acyclic graph (DAG). The system includes a reasoning engine that uses the query DAG, the ontology manager and the detector API to plan an execution list of detectors. The reasoning engine uses the query DAG, a scene representation DAG produced by the external detectors and the ontology manager to answer the natural language query.
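Planning an execution list of detectors from a query DAG amounts to ordering detectors so that each runs after its prerequisites — a topological sort. The sketch below assumes the DAG is given as a mapping from each detector to its prerequisite detectors; the detector names are illustrative.

```python
def plan_detectors(dag):
    """dag: {detector: [prerequisite detectors]} -> execution list
    in which every detector appears after all of its prerequisites."""
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in dag.get(node, []):
            visit(dep)          # run prerequisites first
        order.append(node)

    for node in dag:
        visit(node)
    return order
```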

METHOD AND SYSTEM FOR RETRIEVING VIDEO SEGMENT BY A SEMANTIC QUERY

Provided is a method of detecting a semantic segment in a video. The method includes extracting all video features by inputting an inputted video to a pre-trained first deep neural network algorithm; extracting a query sentence feature by inputting an inputted query sentence to a pre-trained second deep neural network algorithm; generating video-query relation integration feature information, in which all of the video features and the query sentence feature have been integrated, by inputting all of the video features and the query sentence feature to a plurality of scaled-dot-product attention layers; and estimating a video segment corresponding to the query sentence in the video based on the video-query relation integration feature information.
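The fusion operation the abstract relies on is scaled dot-product attention, attention(Q, K, V) = softmax(QKᵀ / √d)·V. A pure-Python sketch of one such layer is below; real systems stack several layers over learned video and query features rather than raw lists.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists-of-lists Q, K, V."""
    d = len(K[0])
    out = []
    for q in Q:
        # dot(q, k) / sqrt(d) for every key k
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        w = softmax(scores)
        # weighted sum of value rows
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out
```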

Verbal queries relative to video content
11665406 · 2023-05-30

Disclosed are various embodiments for processing verbal queries relative to video content. A verbal query that is associated with a portion of video content is received. The verbal query specifies a relative frame location. An action is performed based at least in part on the portion of the video content at the relative frame location.

Method and system for retrieving video temporal segments

A method and a system for retrieving video temporal segments are provided. In the method, a video is analyzed to obtain frame feature information of the video; the frame feature information is input into an encoder to output first data relating to temporal information of the video; the first data and a retrieval description for retrieving video temporal segments of the video are input into a decoder to output second data; attention computation training is conducted according to the first data and the second data; video temporal segments of the video corresponding to the retrieval description are determined according to the attention computation training.

VIDEO EVENT RECOGNITION METHOD, ELECTRONIC DEVICE AND STORAGE MEDIUM

Technical solutions for video event recognition relate to the fields of knowledge graphs, deep learning and computer vision. A video event graph is constructed, and each event in the video event graph includes: M argument roles of the event and respective arguments of the argument roles, with M being a positive integer greater than one. For a to-be-recognized video, respective arguments of the M argument roles of a to-be-recognized event corresponding to the video are acquired. According to the arguments acquired, an event is selected from the video event graph as a recognized event corresponding to the video.
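Selecting a recognized event from the video event graph by matching extracted arguments can be sketched as an overlap score over argument roles. The graph contents, role names, and the simple exact-match scoring below are invented examples, not the patented construction.

```python
# Toy video event graph: event -> {argument role: argument}.
EVENT_GRAPH = {
    "wedding": {"location": "church", "actor": "couple"},
    "concert": {"location": "stage",  "actor": "band"},
}

def recognize(extracted_args):
    """Pick the event whose role->argument map best matches the
    arguments extracted from the to-be-recognized video."""
    def overlap(event_args):
        return sum(extracted_args.get(role) == arg
                   for role, arg in event_args.items())
    return max(EVENT_GRAPH, key=lambda e: overlap(EVENT_GRAPH[e]))
```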

METHOD OF LIVE VIDEO EVENT DETECTION BASED ON NATURAL LANGUAGE QUERIES, AND AN APPARATUS FOR THE SAME

A method of real-time video event detection includes: obtaining, based on a natural language query, a query vector; performing multimodal feature extraction on a video stream to obtain a video vector; obtaining a similarity score by comparing the query vector to the video vector; comparing the similarity score to a predetermined threshold; and activating, based on the similarity score being above the predetermined threshold, an action trigger. The multimodal feature extraction is performed using a plurality of overlapping windows that include sequential frames of the video stream.
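The detection loop above can be sketched as sliding overlapping windows over the frame stream, embedding each window, and firing when similarity clears the threshold. The window size, stride, threshold, and the mean-pooling "embedding" are assumptions standing in for real multimodal feature extraction.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def detect(frames, query_vec, size=4, stride=2, threshold=0.9):
    """Return start indices of overlapping windows whose similarity
    to the query vector exceeds the threshold (trigger points)."""
    hits = []
    for start in range(0, len(frames) - size + 1, stride):
        window = frames[start:start + size]
        # Stub embedding: mean of the frame feature vectors in the window.
        emb = [sum(f[j] for f in window) / size
               for j in range(len(window[0]))]
        if cosine(emb, query_vec) > threshold:
            hits.append(start)
    return hits
```

Because consecutive windows overlap (stride < size), an event straddling a window boundary is still covered by some window, which is the point of the overlapping-window design.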

Method and system for generating elements of recorded information in response to a secondary user's natural language input
11163826 · 2021-11-02

The invention relates to a computerized method and computer-based system for generating elements of recorded information for a secondary user in response to the secondary user's natural language input. The recorded information could be in the form of, for example, video, audio, audiovisual, text files, or other recordable media. The method and system of the invention permit a secondary user to access, in real time, information of an original source (e.g., allows a descendant to obtain a multimedia response stored by or on behalf of an ancestor) via a computer network, with the response being accessible via a television, audio player, Bluetooth or wireless device, or any other electronic and digital system. The access to such information can be initiated by the secondary user's input provided through use of, for example, voice response technology, including speech recognition and natural language software. The ability to access the information as recorded by the original source increases the perceived and, hopefully the actual, level of validity and accuracy, while also simulating, with multiple secondary user communication entries and responses, a ‘face-to-face conversation’ between the secondary user and the original source.