Patent classifications
G06V30/19127
Cross-Modal Weak Supervision For Media Classification
Methods, systems, and storage media for classifying content across media formats based on weak supervision and cross-modal training are disclosed. The system can maintain a first feature classifier and a second feature classifier that classify features of content having a first media format and a second media format, respectively. The system can extract a feature space from a content item using the first feature classifier and the second feature classifier. The system can apply a set of content rules to the feature space to determine content metrics. The system can correlate a set of known labelled data to the feature space to construct determinative training data. The system can train a discrimination model using the content item and the determinative training data. The system can classify content using the discrimination model to assign a content policy to a second content item.
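The rule-application step this abstract describes resembles labeling-function voting as used in weak supervision: each content rule inspects the extracted feature space and either votes a label or abstains, and the votes are aggregated into determinative training labels. A minimal sketch, assuming a dict-based feature space, illustrative rule definitions, and majority-vote aggregation (none of which are the patent's specifics):

```python
def weak_labels(feature_spaces, rules):
    """Aggregate content-rule votes into training labels.

    Each rule maps a feature space to a label or None (abstain);
    the majority vote over non-abstaining rules becomes the label.
    """
    labels = []
    for fs in feature_spaces:
        votes = [v for v in (rule(fs) for rule in rules) if v is not None]
        labels.append(max(set(votes), key=votes.count) if votes else None)
    return labels

# Illustrative content rules over a toy feature space (a dict).
rules = [
    lambda fs: "restricted" if fs.get("violence_score", 0) > 0.8 else None,
    lambda fs: "safe" if fs.get("violence_score", 1) < 0.2 else None,
    lambda fs: "restricted" if "weapon" in fs.get("objects", []) else None,
]
```

Items on which every rule abstains receive no label and would simply be excluded from the determinative training data.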
CONTINUOUS LEARNING FOR DOCUMENT PROCESSING AND ANALYSIS
A document processing method includes receiving one or more documents, performing optical character recognition on the one or more documents to detect words comprising symbols in the one or more documents, and determining an encoding value for each of the symbols. It further includes applying a first hash function to each encoding value to generate a first set of hashed symbol values, applying a second hash function to each hashed symbol value to generate a vector array including a second set of hashed symbol values, and applying a linear transformation to each value of the second set of hashed symbol values of the vector array. The method also includes applying an irreversible non-linear activation function to the vector array to obtain abstract values associated with the symbols and saving the abstract values to train a neural network to detect fields in an input document.
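The per-symbol pipeline this abstract describes (encoding value → first hash → second hash → linear transformation → irreversible non-linear activation) can be sketched as follows. The specific hash functions, the linear weight and bias, and the choice of tanh as the activation are illustrative assumptions, not the patent's specifics:

```python
import math

def _hash1(v):
    # First hash function (illustrative): multiplicative hash.
    return (v * 2654435761) % (2 ** 32)

def _hash2(v):
    # Second hash function (illustrative): xor-shift style mixing.
    v ^= v >> 16
    return (v * 0x45d9f3b) % (2 ** 32)

def abstract_values(text, weight=1e-9, bias=0.0):
    """Per symbol: encoding value -> hash1 -> hash2 -> linear
    transform -> irreversible non-linear activation (tanh)."""
    vector = [_hash2(_hash1(ord(ch))) for ch in text]  # vector array
    linear = [weight * v + bias for v in vector]       # linear transform
    return [math.tanh(x) for x in linear]              # activation
```

Because tanh is many-to-one once composed with the hashing and scaling, the saved abstract values do not reveal the original symbols, yet they are deterministic, so they can serve as stable training inputs.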
Method and system for document data extraction
Certain aspects of the present disclosure provide techniques for extracting data from a document. An example method generally includes identifying a bounding polygon of a region in an electronic image of the document and extracting data from within the bounding polygon of the region. The method further includes generating revised extracted data based on the extracted data, and combining the revised extracted data with other data extracted from the electronic image of the document to generate input data for a data processing application.
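A rough sketch of the extract-then-revise flow, assuming an axis-aligned crop of the bounding polygon, a pluggable `ocr` callable, and a `revise` callable that normalises the raw extraction (all names and signatures are hypothetical):

```python
def bounding_box(polygon):
    """Axis-aligned bounding box (x0, y0, x1, y1) of polygon vertices."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def extract_field(image, polygon, ocr, revise):
    """Crop the region's bounding polygon from the image (a 2D grid),
    run OCR over it, and return the revised extracted data."""
    x0, y0, x1, y1 = bounding_box(polygon)
    region = [row[x0:x1 + 1] for row in image[y0:y1 + 1]]
    return revise(ocr(region))
```

The revised output of each region would then be merged with data from other regions to build the input record for the downstream data processing application.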
Character encoding and decoding for optical character recognition
The present disclosure provides techniques for encoding and decoding characters for optical character recognition. The techniques involve determining sets of numbers for encoding a character set, where each number in a particular set of numbers for encoding a particular character is mapped to a graphical unit (e.g., radical) of the particular character. A mapping between each set of numbers in the possible encodings and the character set may be determined based on the closest character already encoded. A machine learning model may be trained to perform optical character recognition using training data labeled using the set of encodings and the mappings.
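The encoding idea can be sketched with a toy radical inventory: each character becomes the tuple of numbers assigned to its graphical units, and an arbitrary tuple decodes to the closest already-encoded character by radical overlap. The inventory, the decomposition table, and the overlap-count notion of "closest" are illustrative assumptions:

```python
# Hypothetical radical inventory and decompositions; a real system
# would use a full CJK decomposition resource.
RADICALS = ["木", "林", "氵", "每", "日", "月"]
DECOMPOSITION = {
    "海": ("氵", "每"),
    "明": ("日", "月"),
    "梅": ("木", "每"),
}

def encode(char):
    """Encode a character as the numbers mapped to its radicals."""
    return tuple(RADICALS.index(r) for r in DECOMPOSITION[char])

def decode(numbers):
    """Map a (possibly unseen) encoding to the closest already-encoded
    character, here measured by overlap of radical indices."""
    target = set(numbers)
    return max(DECOMPOSITION, key=lambda c: len(target & set(encode(c))))
```

Labeling training data with such structured encodings lets an OCR model share signal between visually related characters instead of treating each character class as independent.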
CONTENT-AWARE BIFURCATED UPSCALING
Certain aspects of the present disclosure provide a method, including: receiving input image data in a first resolution, wherein the input image data comprises text data and graphic data; generating scaled graphic data at a second resolution based on the graphic data at the first resolution and a first scaling factor, wherein the second resolution is based on the first resolution and the first scaling factor; generating scaled text data based on the text data and a second scaling factor; and generating output image data in the second resolution based on the scaled text data and the scaled graphic data.
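The bifurcation can be sketched on 2D grids: text and graphic layers are scaled separately and then composited, with text pixels taking precedence. For simplicity this sketch scales both layers by the same integer factor using nearest-neighbour sampling; the patent's separate text scaling factor, and any text-specialised (e.g. edge-preserving) scaler, are abstracted away:

```python
def nearest_scale(img, factor):
    """Nearest-neighbour upscaling of a 2D grid by an integer factor."""
    h, w = len(img), len(img[0])
    return [[img[r // factor][c // factor]
             for c in range(w * factor)] for r in range(h * factor)]

def bifurcated_upscale(text_layer, graphic_layer, factor):
    """Scale text and graphics separately, then composite:
    non-None text pixels overwrite the scaled graphics."""
    scaled_graphic = nearest_scale(graphic_layer, factor)
    scaled_text = nearest_scale(text_layer, factor)
    return [[t if t is not None else g
             for t, g in zip(trow, grow)]
            for trow, grow in zip(scaled_text, scaled_graphic)]
```

Scaling the two content types independently is what lets text stay crisp while graphics use a smoother interpolation path.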
Digital Content Layout Encoding for Search
Digital content layout encoding techniques for search are described. In these techniques, a layout representation is generated (using machine learning, automatically and without user intervention) that describes a layout of elements included within the digital content. In an implementation, the layout representation includes a description of both spatial and structural aspects of the elements in relation to each other. To do so, a two-pathway pipeline is used that models layout from both spatial and structural aspects using a spatial pathway and a structural pathway, respectively. In one example, this is also performed through use of multi-level encoding and fusion to generate a layout representation.
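A toy version of the two-pathway idea: the spatial pathway encodes each element's normalised bounding box, the structural pathway encodes its position in the element tree (depth here, as a stand-in), and the two are fused into one layout representation. Element tuples, the depth feature, and concatenation-as-fusion are illustrative assumptions, not the described multi-level encoders:

```python
def spatial_encoding(elements):
    """Spatial pathway: each element's normalised (x, y, w, h) box."""
    return [(x, y, w, h) for (_, x, y, w, h) in elements]

def structural_encoding(elements, parent_of):
    """Structural pathway: depth of each element in the layout tree."""
    def depth(i):
        d = 0
        while parent_of.get(i) is not None:
            i, d = parent_of[i], d + 1
        return d
    return [depth(i) for i in range(len(elements))]

def fuse(spatial, structural):
    """Fuse the pathways by concatenation into one representation."""
    return [s + (d,) for s, d in zip(spatial, structural)]
```

A search index built over such fused vectors can then retrieve content whose elements are arranged similarly both geometrically and hierarchically.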
METHOD AND APPARATUS FOR EDITING AN IMAGE AND METHOD AND APPARATUS FOR TRAINING AN IMAGE EDITING MODEL, DEVICE AND MEDIUM
A method for training an image editing model includes steps described below. Covering processing is performed on a region of interest determined in an original image so that a background image sample is formed, and content corresponding to the region of interest is determined as a sample of content of interest; the background image sample and the sample of the content of interest are input into an image editing model; fusion processing is performed on a background image feature and a feature of the region of interest by using the image editing model so that a fusion feature is formed; an image reconstruction operation is performed according to the fusion feature by using the image editing model so that a reconstructed image is output; and optimization training is performed on the image editing model according to a loss relationship between the reconstructed image and the original image.
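The sample-construction and loss steps above can be sketched on plain 2D grids: covering the region of interest yields the background sample, its content becomes the content-of-interest sample, and training minimises a loss between the model's reconstruction and the original. The zero-fill covering, the (r0, c0, r1, c1) box convention, and squared-error as the loss are illustrative assumptions; the model itself is omitted:

```python
def split_sample(image, box):
    """Cover the region of interest to form the background sample and
    extract its content as the content-of-interest sample."""
    r0, c0, r1, c1 = box
    content = [row[c0:c1] for row in image[r0:r1]]
    background = [[0 if r0 <= r < r1 and c0 <= c < c1 else px
                   for c, px in enumerate(row)]
                  for r, row in enumerate(image)]
    return background, content

def l2_loss(reconstructed, original):
    """Optimisation target: pixel-wise squared error between the
    reconstructed image and the original image."""
    return sum((a - b) ** 2
               for ra, rb in zip(reconstructed, original)
               for a, b in zip(ra, rb))
```

Because the original image itself supplies the supervision signal, the editing model can be trained without any manually produced target images.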
Feature compression and localization for autonomous devices
Systems, methods, tangible non-transitory computer-readable media, and devices associated with object localization and generation of compressed feature representations are provided. For example, a computing system can access source data and target data. The source data can include a source representation of an environment including a source object. The target data can include a compressed target feature representation of the environment. The compressed target feature representation can be based on compression of a target feature representation of the environment produced by machine-learned models. A source feature representation can be generated based on the source representation and the machine-learned models. The machine-learned models can include machine-learned feature extraction models or machine-learned attention models. A localized state of the source object with respect to the environment can be determined based on the source feature representation and the compressed target feature representation.
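In one dimension, the matching step can be illustrated as follows: the compressed target features are recovered from a codebook, and the source feature vector is slid across them to find the best-matching offset, which stands in for the localized state. The codebook compression scheme, squared-distance matching, and 1D features are all toy assumptions in place of the learned feature-extraction and attention models:

```python
def decompress(codes, codebook):
    """Recover target features from a compressed codebook-index form."""
    return [codebook[i] for i in codes]

def localize(source, target):
    """Slide the source feature vector over the target feature map and
    return the offset with the smallest squared distance."""
    n = len(source)
    def dist(off):
        return sum((s - t) ** 2 for s, t in zip(source, target[off:off + n]))
    return min(range(len(target) - n + 1), key=dist)
```

Storing only codebook indices rather than raw feature maps is what makes it practical to keep large-area target representations on a vehicle for repeated localization queries.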
Information processing device, information processing method and computer readable storage medium
An information processing device, an information processing method, and a computer readable storage medium are provided. The information processing device comprises processing circuitry configured to: construct, for each of a plurality of indexes, a sample unit set for the index based on a plurality of minimum labeled sample units related to the index which are obtained and labeled from an original sample set; and extract, for at least a part of the constructed plurality of sample unit sets, a minimum labeled sample unit from each sample unit set, and generate a labeled training sample based on the extracted minimum labeled sample unit. A sample unit set is constructed based on minimum labeled sample units that are labeled manually, and a labeled training sample is generated automatically from such sample unit sets, thereby generating labeled training samples automatically to a certain degree and reducing manual participation.
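The two steps can be sketched directly: manually labeled minimum units are grouped into per-index sample unit sets, and new labeled training samples are assembled by drawing one unit per index. Representing units as (index, value) pairs and combining by random choice are illustrative assumptions:

```python
import random

def build_sample_unit_sets(labeled_units):
    """Group minimum labeled sample units into per-index sets."""
    sets = {}
    for index, unit in labeled_units:
        sets.setdefault(index, []).append(unit)
    return sets

def generate_sample(sets, rng):
    """Assemble a new labeled training sample by drawing one
    minimum labeled sample unit from each sample unit set."""
    return {index: rng.choice(units) for index, units in sets.items()}
```

A handful of manually labeled units per index can thus be recombined into a combinatorially larger pool of training samples.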
End to end trainable document extraction
A processor may receive an image and identify a plurality of characters in the image using a machine learning (ML) model. The processor may generate at least one word-level bounding box indicating one or more words including at least a subset of the plurality of characters and/or may generate at least one field-level bounding box indicating at least one field including at least a subset of the one or more words. The processor may overlay the at least one word-level bounding box and the at least one field-level bounding box on the image to form a masked image including a plurality of optically-recognized characters and one or more predicted fields for at least a subset of the plurality of optically-recognized characters.