Patent classifications
G06V30/18057
Convex representation of objects using neural network
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating a convex decomposition of objects using neural network models. One of the methods includes receiving an input that depicts an object. The input is processed using a neural network to generate an output that defines a convex representation of the object. The output includes, for each of a plurality of convex elements, respective parameters that define a position of the convex element in the convex representation of the object.
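The abstract does not fix what the per-element parameters are; in a common formulation, each convex element is an intersection of half-spaces defined by hyperplane normals and offsets, and a smooth indicator scores whether a point lies inside. A minimal sketch under that assumption (parameter names and the sigmoid sharpness are illustrative, not from the patent):

```python
import numpy as np

def convex_indicator(point, normals, offsets, sharpness=75.0):
    """Smooth inside/outside score for one convex element.

    The element is the intersection of half-spaces n_i . x + d_i <= 0.
    A sigmoid over the maximum signed distance yields a differentiable
    indicator in (0, 1): ~1 inside the element, ~0 outside.
    """
    signed = normals @ point + offsets   # signed distance to each half-space
    worst = np.max(signed)               # positive => outside some half-space
    return 1.0 / (1.0 + np.exp(sharpness * worst))

# Unit square as four half-spaces: |x| <= 1, |y| <= 1
normals = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
offsets = np.array([-1.0, -1.0, -1.0, -1.0])
```

A network predicting such (normals, offsets) per element would compose the full convex representation as a union of these elements.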
METHOD AND SYSTEM FOR TABLE STRUCTURE RECOGNITION VIA DEEP SPATIAL ASSOCIATION OF WORDS
State-of-the-art techniques that use spatial-association-based Table Structure Recognition (TSR) are limited in selecting a minimal but maximally informative set of word pairs for generating a digital table representation. Embodiments herein provide a method and system for TSR from a table image via deep spatial association of words using an optimal number of word pairs, analyzed by a single classifier to determine word association. The optimal number of word pairs is identified by using an immediate-left-neighbor and immediate-top-neighbor approach followed by redundant word pair elimination, thus enabling accurate capture of the structural features of even complex table images via minimal word pairs. The reduced number of word pairs, in combination with the single classifier trained to classify word associations into the classes same cell, same row, same column, and unrelated, provides a TSR pipeline with reduced computational complexity that consumes fewer resources while generating a more accurate digital representation of complex tables.
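The immediate-left-neighbor and immediate-top-neighbor selection with duplicate elimination can be sketched as follows. This is an illustrative reading of the abstract, not the patented implementation; the overlap tests and box format (x0, y0, x1, y1) are assumptions:

```python
def candidate_word_pairs(boxes):
    """Select minimal word pairs for table structure recognition.

    Each word is paired only with its immediate left neighbor
    (nearest vertically-overlapping box ending to its left) and its
    immediate top neighbor (nearest horizontally-overlapping box
    ending above it); duplicate pairs are eliminated.
    boxes: list of word bounding boxes (x0, y0, x1, y1).
    """
    pairs = set()
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        left = top = None
        for j, (a0, b0, a1, b1) in enumerate(boxes):
            if j == i:
                continue
            # left neighbor: ends before x0, vertical ranges overlap
            if a1 <= x0 and not (b1 <= y0 or b0 >= y1):
                if left is None or a1 > boxes[left][2]:
                    left = j
            # top neighbor: ends above y0, horizontal ranges overlap
            if b1 <= y0 and not (a1 <= x0 or a0 >= x1):
                if top is None or b1 > boxes[top][3]:
                    top = j
        for j in (left, top):
            if j is not None:
                pairs.add(tuple(sorted((i, j))))
    return sorted(pairs)
```

For a 2x2 grid of word boxes this yields exactly the four row/column-adjacent pairs, rather than all six possible pairs, which is the source of the claimed reduction in classifier workload.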
Content capturing system and content capturing method
A content capturing system is suitable for capturing content in an image of a document. The content capturing system includes a processor and a storage device. The processor accesses the program stored in the storage device to implement a cutting module and a processing module. The cutting module receives a corrected image. The content in the corrected image includes a plurality of text areas, and the cutting module inputs the corrected image or a first text area into a convolutional neural network. The convolutional neural network outputs the coordinates of the first text area. The cutting module cuts the first text area according to the coordinates of the first text area. The cutting module inputs the cut first text area into a text recognition system and obtains a plurality of first characters in the first text area from the text recognition system.
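The cutting step the abstract describes — cropping a text area out of the corrected image using the coordinates the convolutional network outputs — reduces to a simple slice once the coordinates are known. A minimal sketch (the (x0, y0, x1, y1) coordinate format is an assumption; the patent does not fix one):

```python
def cut_text_area(image, coords):
    """Crop a text area from a corrected document image.

    image: 2D structure of pixel rows (list of lists or similar).
    coords: (x0, y0, x1, y1) box predicted for the text area.
    Returns the cropped region, ready to hand to a text
    recognition system.
    """
    x0, y0, x1, y1 = coords
    return [row[x0:x1] for row in image[y0:y1]]
```

In the described system this crop would then be passed to the text recognition module, which returns the characters found inside the area.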
Generating responses to queries about videos utilizing a multi-modal neural network with attention
The present disclosure relates to systems, methods, and non-transitory computer-readable media for generating a response to a question received from a user during display or playback of a video segment by utilizing a query-response-neural network. The disclosed systems can extract a query vector from a question corresponding to the video segment using the query-response-neural network. The disclosed systems further generate context vectors representing both visual cues and transcript cues corresponding to the video segment using context encoders or other layers from the query-response-neural network. By utilizing additional layers from the query-response-neural network, the disclosed systems generate (i) a query-context vector based on the query vector and the context vectors, and (ii) candidate-response vectors representing candidate responses to the question from a domain-knowledge base or other source. To respond to a user's question, the disclosed systems further select a response from the candidate responses based on a comparison of the query-context vector and the candidate-response vectors.
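The final step — selecting a response by comparing the query-context vector against the candidate-response vectors — is typically a nearest-neighbor lookup in embedding space. A sketch of just that comparison, assuming cosine similarity (the network layers that produce the vectors are elided, and the similarity measure is an assumption):

```python
import numpy as np

def select_response(query_context, candidate_vectors, responses):
    """Return the candidate response whose vector is most similar
    (by cosine similarity) to the query-context vector.

    query_context: 1D array from the query-response network.
    candidate_vectors: 2D array, one row per candidate response.
    responses: candidate response texts, aligned with the rows.
    """
    q = query_context / np.linalg.norm(query_context)
    c = candidate_vectors / np.linalg.norm(candidate_vectors, axis=1, keepdims=True)
    return responses[int(np.argmax(c @ q))]
```

Other comparison functions (dot product, a learned bilinear score) would slot in the same way without changing the surrounding pipeline.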
Recurrent Deep Neural Network System for Detecting Overlays in Images
In one aspect, an example method includes a processor (1) applying a feature map network to an image to create a feature map comprising a grid of vectors characterizing at least one feature in the image and (2) applying a probability map network to the feature map to create a probability map assigning a probability to the at least one feature in the image, where the assigned probability corresponds to a likelihood that the at least one feature is an overlay. The method further includes the processor determining that the probability exceeds a threshold, and responsive to the processor determining that the probability exceeds the threshold, performing a processing action associated with the at least one feature.
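The thresholding step at the end of the described method — flagging features whose overlay probability exceeds a threshold so a processing action can be taken — can be sketched over a grid-shaped probability map (the networks that produce the map are elided; the 0.5 default is an assumption):

```python
import numpy as np

def detect_overlays(probability_map, threshold=0.5):
    """Return grid positions whose overlay probability exceeds the
    threshold; each position corresponds to a feature-map vector
    characterizing a feature in the image.
    """
    ys, xs = np.where(probability_map > threshold)
    return list(zip(ys.tolist(), xs.tolist()))
```

Downstream, each returned position would trigger the processing action associated with that feature (e.g. masking or logging the overlay region).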
DOCUMENT SEGMENTATION FOR OPTICAL CHARACTER RECOGNITION
An approach to identifying text within an image is presented. The approach can receive an image. The approach can classify an image on a pixel-by-pixel basis as to whether each pixel is text. The approach can generate bounding boxes around groups of pixels that are classified as text. The approach can mask sections of an image where pixels are not classified as text. The approach may be used as a pre-processing technique for optical character recognition in documents, scanned images, or still images.
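Given a per-pixel text classification, the bounding-box and masking steps are mechanical. A sketch assuming the classification is available as a boolean mask (the classifier itself is elided, and treating all text pixels as one group is a simplification of the per-group boxes the abstract describes):

```python
import numpy as np

def mask_non_text(image, text_mask):
    """Zero out pixels not classified as text, as OCR pre-processing.
    text_mask: boolean array, same shape as image."""
    return np.where(text_mask, image, 0)

def bounding_box(text_mask):
    """Axis-aligned box (x0, y0, x1, y1) around the text pixels."""
    ys, xs = np.nonzero(text_mask)
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1
```

A full implementation would first split the mask into connected components and emit one box per component.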
METHOD OF TRAINING AN IMAGE CLASSIFICATION MODEL
A method of training a neural network for classifying an image into one of a plurality of classes, the method comprising: extracting, from the neural network, a plurality of subclass center vectors for each class; inputting an image into the neural network, wherein the image is associated with a predetermined class; generating, using the neural network, an embedding vector corresponding to the input image; determining a similarity score between the embedding vector and each of the plurality of subclass center vectors; updating parameters of the neural network in dependence on a plurality of the similarity scores using an objective function; extracting a plurality of updated parameters from the neural network; and updating each subclass center vector in dependence on the extracted updated parameters.
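The similarity-scoring and objective steps can be sketched as follows. The abstract does not specify the similarity measure, the per-class reduction, or the objective; cosine similarity, a max over each class's subclass centers, and softmax cross-entropy are assumptions chosen for illustration:

```python
import numpy as np

def subclass_loss(embedding, subclass_centers, true_class):
    """Objective over an embedding and per-class subclass centers.

    subclass_centers: array (num_classes, num_subclasses, dim).
    Computes cosine similarity to every subclass center, reduces to
    a per-class score via the best-matching subclass, then applies
    softmax cross-entropy against the predetermined class.
    """
    c = subclass_centers / np.linalg.norm(subclass_centers, axis=-1, keepdims=True)
    e = embedding / np.linalg.norm(embedding)
    sims = c @ e                         # (num_classes, num_subclasses)
    class_scores = sims.max(axis=1)      # best subclass per class
    logits = class_scores - class_scores.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[true_class]
```

Gradients of this loss would drive both the network-parameter update and, through the extracted parameters, the subclass-center update the method describes.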
2D document extractor
There is provided a 2D document extractor for extracting entities from a structured document. The 2D document extractor includes a first convolutional neural network (CNN), a second CNN, and a third recurrent neural network (RNN). A plurality of text sequences, and structural elements indicative of the locations of the text sequences in the document, are received. The first CNN encodes the text sequences and structural elements to obtain a 3D encoded image indicative of semantic characteristics of the text sequences and having the structure of the document. The second CNN compresses the 3D encoded image to obtain a feature vector, the feature vector being indicative of a combination of spatial and semantic characteristics of the 3D encoded image. The third RNN decodes the feature vector to extract the text entities, a given text entity being associated with a text sequence.
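The first two stages of the described pipeline can be sketched with stand-ins for the learned networks: scattering text embeddings onto a grid preserves the document's 2D structure (first CNN's output format), and pooling collapses it to a feature vector (second CNN's role). Shapes, the scatter layout, and mean-pooling are all illustrative assumptions; the RNN decoding stage is elided:

```python
import numpy as np

def encode_2d(text_embeddings, positions, height, width, dim):
    """Stand-in for the first CNN's output: place each text
    sequence's embedding at its (row, col) location on a
    (height, width, dim) grid, keeping the document structure."""
    grid = np.zeros((height, width, dim))
    for emb, (y, x) in zip(text_embeddings, positions):
        grid[y, x] = emb
    return grid

def compress(grid):
    """Stand-in for the second CNN: reduce the 3D encoded image to
    a single feature vector (mean-pooling replaces the learned
    compression)."""
    return grid.mean(axis=(0, 1))
```

The third stage would then decode this feature vector with an RNN, emitting one extracted entity per step.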