Patent classifications
G06V20/635
System and method for training a model using localized textual supervision
Systems and methods for training a model are described herein. In one example, a system for training the model includes a processor and a memory, in communication with the processor, having a training module. The training module has instructions that cause the processor to determine a contrastive loss using a self-supervised contrastive loss function and adjust, based on the contrastive loss, model weights of a visual backbone that generated feature maps and/or a textual backbone that generated feature vectors. The training module also has instructions that cause the processor to determine a localized loss using a supervised loss function that compares an image-caption attention map with visual identifiers and adjust, based on the localized loss, the model weights of the visual backbone and/or the textual backbone.
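As an illustrative sketch of the contrastive step described in this abstract, the following computes a symmetric InfoNCE-style loss over matched image/text embedding batches. The loss form, temperature value, and all function names are assumptions for illustration; the patent does not specify the exact contrastive loss function.

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss over matched
    image/text embedding pairs (a common formulation, assumed here;
    the abstract does not fix the exact loss)."""
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (N, N) similarity matrix
    labels = np.arange(len(img))         # i-th image matches i-th caption

    def xent(l):
        # cross-entropy of the softmax rows against the diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # average of image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

The gradient of this scalar with respect to the embeddings is what would be backpropagated to adjust the visual and/or textual backbone weights.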
SYSTEMS AND METHODS FOR MANAGING CAPTIONS
The present disclosure generally relates to embodiments for a live communication interface for managing captions.
SEMANTICALLY-GUIDED TEMPLATE GENERATION FROM IMAGE CONTENT
Techniques for template generation from image content include extracting information associated with an input image. The information comprises: 1) layout information indicating positions of content corresponding to a content type of a plurality of content types within the input image; and 2) text attributes indicating at least a font of text included in the input image. A user-editable template having the characteristics of the input image is generated based on the layout information and the text attributes.
DISPLAY CONTROL INTEGRATED CIRCUIT APPLICABLE TO PERFORMING REAL-TIME VIDEO CONTENT TEXT DETECTION AND SPEECH AUTOMATIC GENERATION IN DISPLAY DEVICE
A display control integrated circuit (IC) applicable to performing real-time video content text detection and speech automatic generation in a display device may include a pre-processing circuit, a character recognition circuit and a post-processing circuit. The pre-processing circuit may receive a video signal to obtain real-time video content carried by the video signal, and perform preliminary text detection on the real-time video content to generate a series of segmented character images to indicate a subtitle. The character recognition circuit may perform character recognition on the series of segmented character images to generate a series of characters, respectively. The post-processing circuit may perform vocabulary correction on the series of characters to selectively replace any erroneous character with a correct character to generate one or more vocabularies, for performing speech automatic generation.
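One simple way to realize the post-processing circuit's vocabulary correction is an edit-distance lookup against a known vocabulary, as sketched below. The one-edit acceptance threshold and all names here are assumptions, not the patented circuit's actual policy.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def correct_word(word, vocabulary):
    """Selectively replace a recognized word with a vocabulary entry
    within one edit; otherwise keep it unchanged (assumed threshold)."""
    if word in vocabulary:
        return word
    for candidate in vocabulary:
        if edit_distance(word, candidate) == 1:
            return candidate
    return word
```

In a hardware realization the same lookup would typically be bounded to a small candidate set per word to meet real-time constraints.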
INTERACTIVE VIEWING EXPERIENCES BY DETECTING ON-SCREEN TEXT
Systems, methods, and devices for an interactive viewing experience by detecting on-screen data are disclosed. One or more frames of video data are analyzed to detect regions in the visual video content that contain text. A character recognition operation can be performed on the regions to generate textual data. Based on the textual data and the regions, a graphical user interface (GUI) definition can be generated. The GUI definition can be used to generate a corresponding GUI superimposed onto the visual video content to present users with controls and functionality with which to interact with the text or enhance the video content. Context metadata can be determined from external sources or by analyzing the continuity of audio and visual aspects of the video data. The context metadata can then be used to improve the character recognition or inform the generation of the GUI.
Language agnostic drift correction
Systems, methods, and computer-readable media are disclosed for language-agnostic subtitle drift detection and correction. A method may include determining subtitles and/or captions from media content (e.g., videos), the subtitles and/or captions corresponding to dialog in the media content. The subtitles may be broken up into segments, which may be analyzed to determine a likelihood of drift (e.g., a likelihood that the subtitles are out of synchronization with the dialog in the media content) for each segment. For segments with a high likelihood of drift, the subtitles may be incrementally adjusted to determine an adjustment that eliminates and/or reduces the amount of drift, and the drift in the segment may be corrected based on the drift amount detected. A linear regression model and/or blocks determined by human operators may be used to further optimize drift correction.
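The incremental-adjustment idea in this abstract can be sketched as a search over candidate time shifts for a segment, keeping the shift that minimizes misalignment against detected dialog times. The millisecond search range, step size, and cost function are assumptions for illustration; the patent does not specify them.

```python
def best_offset(subtitle_times, dialog_times,
                candidates=range(-5000, 5001, 250)):
    """Try incremental shifts (ms) of the subtitle timestamps and
    return the shift minimizing mean absolute misalignment with the
    corresponding dialog times (a simplified per-segment drift search)."""
    n = len(subtitle_times)

    def cost(offset):
        return sum(abs(s + offset - d)
                   for s, d in zip(subtitle_times, dialog_times)) / n

    return min(candidates, key=cost)
```

The returned offset plays the role of the detected drift amount; applying it to the segment's cues corrects that segment's drift.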
ELECTRONIC DEVICE AND CONTROL METHOD THEREFOR
A control method for an electronic device includes displaying a video in a first area of a display; identifying a writing area, including at least one piece of writing information, from the video displayed in the first area; acquiring a first image corresponding to the identified writing area, so as to display the first image in a second area of the display; based on a change being detected in the at least one piece of writing information, identifying a type of the change; and acquiring a second image formed by correcting the first image based on the type of the identified change, so as to display the second image in the second area.
SYSTEMS AND METHODS FOR OPEN VOCABULARY OBJECT DETECTION
Embodiments described herein provide methods and systems for open vocabulary object detection of images. Given a pre-trained vision-language model and an image-caption pair, an activation map may be computed in the image that corresponds to an object of interest mentioned in the caption. The activation map is then converted into a pseudo bounding-box label for the corresponding object category. The open vocabulary detector is then directly supervised by these pseudo box-labels, which enables training object detectors with no human-provided bounding-box annotations.
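The activation-map-to-pseudo-box conversion can be realized in many ways; one minimal sketch is to threshold the normalized map and take the tight box around the surviving activations. The threshold value and box-extraction rule below are assumptions, not the patented method.

```python
import numpy as np

def activation_to_pseudo_box(act_map, threshold=0.5):
    """Convert a 2-D image-caption activation map into a pseudo
    bounding-box label (x0, y0, x1, y1): normalize to [0, 1],
    threshold, and take the tight box around surviving pixels."""
    span = act_map.max() - act_map.min()
    a = (act_map - act_map.min()) / (span + 1e-8)  # normalize to [0, 1]
    ys, xs = np.nonzero(a >= threshold)
    if xs.size == 0:
        return None  # caption object not localized in this image
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

Each resulting box, paired with the caption's object category, then serves as a pseudo label to supervise the open vocabulary detector in place of human annotations.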
SYSTEMS AND METHODS FOR AUGMENTED REALITY APPLICATION FOR ANNOTATIONS AND ADDING INTERFACES TO CONTROL PANELS AND SCREENS
Example implementations described herein involve systems and methods for providing a platform to facilitate augmented reality (AR) overlays, which can involve stabilizing video received from a first device for display on a second device and, for input made to a portion of the stabilized video at the second device, generating an AR overlay on a display of the first device corresponding to the portion of the stabilized video.
IMAGE PROCESSING APPARATUS, IMAGE PROCESSING METHOD, AND STORAGE MEDIUM
A method of controlling an image processing apparatus includes: obtaining a candidate image group including a plurality of images; determining a specific condition for preferentially selecting an image from the candidate image group; analyzing the images in the candidate image group; analyzing captions attached to the images in the candidate image group; and selecting a specific image from the candidate image group based on results of the determining the specific condition, the analyzing the images, and the analyzing the captions.