Patent classifications
G06V10/86
DIGITAL IMAGE ANNOTATION AND RETRIEVAL SYSTEMS AND METHODS
In a digital image annotation and retrieval system, a machine learning model identifies an image feature in an image and generates a plurality of question prompts for the feature. For a particular feature, a feature annotation is generated, which can include capturing a narrative, determining a plurality of narrative units, and mapping a particular narrative unit to the identified image feature. An enriched image is generated using the generated feature annotation. The enriched image includes searchable metadata comprising the feature annotation and the plurality of question prompts.
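The annotation flow in this abstract can be pictured as a small data structure. The following is only a minimal sketch under stated assumptions, not the patented method: the class names `FeatureAnnotation` and `EnrichedImage`, the helper `annotate`, the sentence-based narrative split, and the substring matching rule are all hypothetical stand-ins for the machine learning model and narrative processing the abstract describes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FeatureAnnotation:
    feature_label: str                 # label of the identified image feature
    narrative_unit: str                # narrative unit mapped to that feature
    question_prompts: List[str] = field(default_factory=list)


@dataclass
class EnrichedImage:
    image_id: str
    annotations: List[FeatureAnnotation] = field(default_factory=list)

    def search(self, term: str) -> List[FeatureAnnotation]:
        """Return annotations whose label, narrative unit, or prompts contain term."""
        term = term.lower()
        return [
            a for a in self.annotations
            if term in a.feature_label.lower()
            or term in a.narrative_unit.lower()
            or any(term in q.lower() for q in a.question_prompts)
        ]


def split_narrative(narrative: str) -> List[str]:
    # Assumption: one narrative unit per sentence (naive split on periods).
    return [s.strip() for s in narrative.split(".") if s.strip()]


def annotate(image_id: str, features: Dict[str, List[str]], narrative: str) -> EnrichedImage:
    """Map the first narrative unit that mentions each feature label to that feature.

    `features` maps a feature label to its question prompts; in the abstract both
    would come from a machine learning model, here they are supplied by the caller.
    """
    units = split_narrative(narrative)
    enriched = EnrichedImage(image_id)
    for label, prompts in features.items():
        for unit in units:
            if label.lower() in unit.lower():
                enriched.annotations.append(FeatureAnnotation(label, unit, prompts))
                break
    return enriched
```

For example, `annotate("img1", {"dog": ["Who owns the dog?"]}, "The dog was named Rex.")` yields one annotation that a later `search("rex")` call can find through the stored narrative unit.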
SYSTEM AND METHOD FOR VERIFYING USER BY SECURITY TOKEN COMBINED WITH BIOMETRIC DATA PROCESSING TECHNIQUES
Embodiments of the inventive concept provide a system and method that verify a user through a security token combined with biometric information processing techniques, which allow the conventional encryption key to be changed and canceled without storing biometric data corresponding to the user's personal information.
SYSTEMS AND METHODS FOR IMAGE PROCESSING USING NATURAL LANGUAGE
Embodiments of the disclosure provide a machine learning model for generating a predicted executable command for an image. The machine learning model includes an interface configured to obtain an utterance indicating a request associated with the image, an utterance sub-model, a visual sub-model, an attention network, and a selection gate. The machine learning model generates a segment of the predicted executable command from weighted probabilities of each candidate token in a predetermined vocabulary, determined based on the visual features, the concept features, the current command features, and the utterance features extracted from the utterance or the image.
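One way to read "weighted probabilities of each candidate token" is as a gated mixture of per-source token distributions. The sketch below assumes exactly that and nothing more: the function `next_token`, the softmax gate, and the idea that each feature source produces its own logits over the vocabulary are illustrative assumptions, not details taken from the abstract.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_token(
    vocab: Sequence[str],
    source_logits: Sequence[Sequence[float]],
    gate_scores: Sequence[float],
) -> Tuple[str, List[float]]:
    """Pick the next command token from a gate-weighted mix of distributions.

    source_logits holds one logit vector over `vocab` per feature source
    (for example visual, concept, current command, utterance).
    gate_scores holds one raw score per source; a softmax turns them into
    mixing weights (the assumed "selection gate").
    """
    gate = softmax(gate_scores)
    mixed = [0.0] * len(vocab)
    for weight, logits in zip(gate, source_logits):
        for i, p in enumerate(softmax(logits)):
            mixed[i] += weight * p
    best = max(range(len(vocab)), key=mixed.__getitem__)
    return vocab[best], mixed
```

Because the gate weights and each source distribution sum to one, the mixed vector is itself a probability distribution over the vocabulary, so the token can be chosen greedily as here or sampled.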
METHOD FOR VISUAL LOCALIZATION AND RELATED APPARATUS
A visual localization method and related apparatus are disclosed. In the method, a first candidate image sequence is determined from an image library, the image library being configured to construct an electronic map, the image frames in the first candidate image sequence being arranged in order of their degree of matching with a first image, and the first image being an image collected by a camera; the order of the image frames in the first candidate image sequence is adjusted according to a target window to obtain a second candidate image sequence, the target window being multiple successive image frames that are determined from the image library and include a target image frame, the target image frame being the image in the image library that matches a second image collected by the camera before the first image; and a target posture of the camera at the time the first image is collected is determined according to the second candidate image sequence.
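The reordering step can be illustrated with a short sketch. The abstract does not say how the order is adjusted, so the rule used here (promote candidates that fall inside the target window to the front, keeping their relative match order) is an assumption, as are the function name `rerank_by_window` and the representation of frames as integer positions in the image library.

```python
from typing import List


def rerank_by_window(candidates: List[int], target_index: int, half_width: int) -> List[int]:
    """Reorder candidate frames using a window of successive library frames.

    candidates:   library frame indices, best match with the current image first
    target_index: library index of the frame that matched the previous image
    half_width:   number of frames kept on each side of the target frame
    """
    window = range(target_index - half_width, target_index + half_width + 1)
    # Assumed rule: candidates inside the window move ahead of the rest,
    # and each group keeps its original match-based order.
    inside = [f for f in candidates if f in window]
    outside = [f for f in candidates if f not in window]
    return inside + outside
```

With candidates `[40, 12, 11, 90, 13]`, a target frame at index 12, and a half width of 1, the frames 12, 11, and 13 move ahead of 40 and 90, which reflects the intuition that the camera is probably still near where the previous image was matched.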
SELF-SUPERVISED VISUAL-RELATIONSHIP PROBING
Methods and systems disclosed herein relate generally to generating visual relationship graphs that identify relationships between objects depicted in an image. A vision-language application uses transformer encoders to generate a graph structure that represents a dependency between a first region and a second region of an image. The dependency indicates that a contextual representation of the first region was derived, at least in part, by processing the second region. The contextual representation identifies a predicted identity of an image object depicted in the first region. The predicted identity is determined at least in part by identifying a relationship between the first region and other data objects associated with various modalities.
OPEN VOCABULARY INSTANCE SEGMENTATION
Systems and methods for image processing are described. Embodiments of the present disclosure receive a training image and a caption for the training image, wherein the caption includes text describing an object in the training image; generate a pseudo mask for the object using a teacher network based on the text describing the object; generate a mask for the object using a student network; and update parameters of the student network based on the mask and the pseudo mask.
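The teacher-student update described above can be sketched on toy data. Everything concrete here is an assumption made for illustration: the teacher is reduced to thresholding a hypothetical text-conditioned activation map, the student is a two-parameter per-pixel logistic model, and the loss is binary cross-entropy. The abstract itself names none of these choices.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def teacher_pseudo_mask(activation: List[float], threshold: float = 0.5) -> List[float]:
    # Hypothetical teacher: threshold an activation map that is assumed to
    # score how well each pixel matches the object text in the caption.
    return [1.0 if a >= threshold else 0.0 for a in activation]


def student_mask(pixels: List[float], w: float, b: float) -> List[float]:
    # Toy student: per-pixel logistic model on pixel intensity.
    return [sigmoid(w * x + b) for x in pixels]


def bce(pred: List[float], target: List[float]) -> float:
    """Mean binary cross-entropy between the student mask and the pseudo mask."""
    eps = 1e-9
    return -sum(
        t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
        for p, t in zip(pred, target)
    ) / len(pred)


def train_step(
    pixels: List[float], pseudo: List[float], w: float, b: float, lr: float = 0.5
) -> Tuple[float, float]:
    """One gradient step that moves the student mask toward the pseudo mask."""
    pred = student_mask(pixels, w, b)
    n = len(pixels)
    grad_w = sum((p - t) * x for p, t, x in zip(pred, pseudo, pixels)) / n
    grad_b = sum(p - t for p, t in zip(pred, pseudo)) / n
    return w - lr * grad_w, b - lr * grad_b
```

Only the student's parameters `w` and `b` change during training; the pseudo mask is treated as a fixed target, which mirrors the abstract's statement that the student network is updated based on the mask and the pseudo mask.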
AUTOMATED DIGITAL DOCUMENT GENERATION FROM DIGITAL VIDEOS
Techniques are described that support automated generation of a digital document from digital videos using machine learning. The digital document includes textual components that describe a sequence of entity and action descriptions from the digital video. These techniques are usable to generate a single digital document based on a plurality of digital videos as well as incorporate user-specified constraints in the generation of the digital document.