G06V30/412

Table item information extraction with continuous machine learning through local and global models

A bipartite application implements a table auto-completion (TAC) algorithm on the client side and the server side. A client module runs a local model of the TAC algorithm on a user device and a server module runs a global model of the TAC algorithm on a server machine. The local model is continuously adapted through on-the-fly training, with as few as one negative example, to perform TAC on the client side, one document at a time. Knowledge thus learned by the local model is used to improve the global model on the server side. The global model can be utilized to automatically and intelligently extract table information from a large number of documents with significantly improved accuracy, requiring minimal human intervention even on complex tables.

SYSTEM AND METHOD FOR FORMAT-AGNOSTIC DOCUMENT INGESTION
20230237101 · 2023-07-27

A system for format-agnostic document ingestion including a document ingestion server and a database is disclosed. The server is configured to receive an image of a document comprising text in an unknown format and to convert the image, using OCR, into a plurality of text elements, each having a content, a size, and an absolute position. The server is also configured to retrieve data detectors from the database, each associated with a data type anticipated to be in the document, and comprising at least one identifier and direction, and at least one validation criterion. The server is also configured to identify a potential descriptor by comparing the content of each text element with the at least one identifier, and then determine if the text element pointed to by the data detector meets the validation criteria. Finally, the server is configured to associate the validated text element with the data detector, and store the content.
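
The detector lookup described in this abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the publication: the `TextElement` and `DataDetector` classes, the two supported directions ("right" and "below"), the nearest-neighbour rule for "pointed to", and the use of a regular expression as the validation criterion.

```python
import re
from dataclasses import dataclass


@dataclass
class TextElement:
    content: str
    width: float
    height: float
    x: float  # absolute left position on the page
    y: float  # absolute top position on the page


@dataclass
class DataDetector:
    data_type: str
    identifiers: tuple   # label strings that may announce the value
    direction: str       # assumed: "right" or "below"
    validation: str      # assumed: a regular expression


def pointed_element(anchor, elements, direction):
    """Return the nearest element in `direction` from `anchor`, or None."""
    if direction == "right":
        cands = [e for e in elements if e is not anchor
                 and e.x > anchor.x and abs(e.y - anchor.y) <= anchor.height]
        return min(cands, key=lambda e: e.x - anchor.x, default=None)
    if direction == "below":
        cands = [e for e in elements if e is not anchor
                 and e.y > anchor.y and abs(e.x - anchor.x) <= anchor.width]
        return min(cands, key=lambda e: e.y - anchor.y, default=None)
    raise ValueError(f"unsupported direction: {direction}")


def ingest(elements, detectors):
    """Associate validated text elements with detectors; return stored content."""
    stored = {}
    for det in detectors:
        for el in elements:
            # A potential descriptor is an element whose content matches an identifier.
            if not any(i.lower() in el.content.lower() for i in det.identifiers):
                continue
            target = pointed_element(el, elements, det.direction)
            if target is not None and re.fullmatch(det.validation, target.content):
                stored[det.data_type] = target.content
                break
    return stored


elements = [
    TextElement("Invoice No.", 80, 12, 20, 30),
    TextElement("INV-00417", 70, 12, 120, 31),
    TextElement("Date", 30, 12, 300, 30),
    TextElement("2024-01-15", 70, 12, 300, 48),
    TextElement("Total", 40, 12, 20, 200),
    TextElement("see terms", 60, 12, 120, 201),
]
detectors = [
    DataDetector("invoice_number", ("invoice no",), "right", r"INV-\d+"),
    DataDetector("date", ("date",), "below", r"\d{4}-\d{2}-\d{2}"),
    DataDetector("total", ("total",), "right", r"\d+\.\d{2}"),
]
result = ingest(elements, detectors)
print(result)
```

In this sample the "total" detector finds its label, but the element to its right fails validation, so nothing is stored for that data type.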

AUTOMATIC SELECTION OF TEMPLATES FOR EXTRACTION OF DATA FROM ELECTRONIC DOCUMENTS

A computer-implemented method for automatic template selection for extracting data from an input electronic document is provided. The method includes receiving a first set of candidate templates and an input electronic document. For each candidate template, a template similarity ratio value is calculated that represents a similarity of the candidate template to the input electronic document. The first set of candidate templates is ranked according to the template similarity ratios and then matched to the input electronic document, generating a normalized similarity score for each candidate template. Differences in normalized similarity scores of successive pairs of the candidate templates are determined and a breaking point is established. A second set of candidate templates is formed by selecting candidate templates that are ranked above the breaking point. Data from the input electronic document is extracted using the second set of candidate templates.
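
The breaking-point step can be sketched in Python as follows. The abstract does not say how the breaking point is chosen, so this sketch assumes it is the largest drop between successive normalized similarity scores. The function name `select_templates` and the sample template names and scores are hypothetical, and the earlier similarity-ratio ranking and matching steps are not modeled.

```python
def select_templates(scores):
    """Keep the templates ranked above the largest gap in successive scores.

    `scores` maps a template name to its normalized similarity score.
    Assumption: the breaking point is the largest difference between
    successive scores in the ranked list.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return [name for name, _ in ranked]
    # Differences in score for each successive pair of ranked templates.
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    break_idx = max(range(len(gaps)), key=gaps.__getitem__)
    return [name for name, _ in ranked[:break_idx + 1]]


scores = {
    "invoice_a": 0.92,
    "invoice_b": 0.88,
    "receipt": 0.41,
    "po_form": 0.37,
    "letter": 0.10,
}
selected = select_templates(scores)
print(selected)
```

With these sample scores the largest gap falls between 0.88 and 0.41, so only the two invoice templates form the second set.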

AUTOMATED DOCUMENT PROCESSING FOR DETECTING, EXTRACTING, AND ANALYZING TABLES AND TABULAR DATA

According to one embodiment, a computer-implemented method for detecting and classifying columns of tables and/or tabular data arrangements within image data includes: detecting one or more tables and/or one or more tabular data arrangements within the image data; extracting the one or more tables and/or the one or more tabular data arrangements from the processed image data; and classifying either: a plurality of columns of the one or more extracted tables; a plurality of columns of the one or more extracted tabular data arrangements; or both the columns of the one or more extracted tables and the columns of the one or more extracted tabular data arrangements. Corresponding systems and computer program products are also disclosed.

Image processing apparatus, image processing method, and storage medium
11568623 · 2023-01-31

An image processing apparatus obtains a read image of a document including a handwritten character, generates a first image formed by pixels of the handwritten character by extracting the pixels of the handwritten character from pixels of the read image using a first learning model for extracting the pixels of the handwritten character, estimates a handwriting area including the handwritten character using a second learning model for estimating the handwriting area, and performs handwriting OCR processing based on the generated first image and the estimated handwriting area.

Apparatus and methods for extracting data from lineless table using Delaunay triangulation and excess edge removal

A method for extracting data from lineless tables includes storing an image including a table in a memory. A processor operably coupled to the memory identifies a plurality of text-based characters in the image, and defines multiple bounding boxes based on the characters. Each of the bounding boxes is uniquely associated with at least one of the text-based characters. A graph including multiple nodes and multiple edges is generated based on the bounding boxes, using a graph construction algorithm. At least one of the edges is identified for removal from the graph, and removed from the graph to produce a reduced graph. The reduced graph can be sent to a neural network to predict row labels and column labels for the table.
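
The graph construction and edge removal steps can be sketched in Python under several assumptions. The title names Delaunay triangulation, but the abstract does not say which edges count as excess. This sketch therefore assumes an edge is excess when it is diagonal (its endpoints share neither a row nor a column) or when another box lies between its endpoints. It uses bounding-box centers as nodes and a brute-force triangulation that only suits tiny inputs. The function names, the tolerance `tol`, and the sample coordinates are hypothetical, and the neural network stage is not modeled.

```python
from itertools import combinations


def orientation(a, b, c):
    """Twice the signed area of triangle abc (zero when collinear)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_circumcircle(a, b, c, p):
    """True if p lies strictly inside the circle through a, b and c."""
    ax, ay = a[0] - p[0], a[1] - p[1]
    bx, by = b[0] - p[0], b[1] - p[1]
    cx, cy = c[0] - p[0], c[1] - p[1]
    det = ((ax * ax + ay * ay) * (bx * cy - cx * by)
           - (bx * bx + by * by) * (ax * cy - cx * ay)
           + (cx * cx + cy * cy) * (ax * by - bx * ay))
    # The determinant's sign depends on the triangle's orientation.
    return det * orientation(a, b, c) > 0


def delaunay_edges(points):
    """Brute-force Delaunay: keep triangles whose circumcircle is empty."""
    n, edges = len(points), set()
    for i, j, k in combinations(range(n), 3):
        a, b, c = points[i], points[j], points[k]
        if orientation(a, b, c) == 0:
            continue  # collinear points do not form a triangle
        if any(in_circumcircle(a, b, c, points[m])
               for m in range(n) if m not in (i, j, k)):
            continue
        edges.update({(i, j), (j, k), (i, k)})
    return edges


def is_excess(i, j, points, tol):
    """Assumed rule: diagonal edges and edges that skip over a node are excess."""
    (x1, y1), (x2, y2) = points[i], points[j]
    dx, dy = abs(x2 - x1), abs(y2 - y1)
    if dx > tol and dy > tol:
        return True
    for k, (x, y) in enumerate(points):
        if k in (i, j):
            continue
        if dy <= tol and abs(y - y1) <= tol and min(x1, x2) < x < max(x1, x2):
            return True
        if dx <= tol and abs(x - x1) <= tol and min(y1, y2) < y < max(y1, y2):
            return True
    return False


# Centers of six bounding boxes: two rows, three columns, slightly misaligned.
centers = [(10, 10), (52, 11), (95, 9), (12, 40), (50, 41), (97, 39)]
raw = delaunay_edges(centers)
reduced = {e for e in raw if not is_excess(e[0], e[1], centers, tol=5)}
print(sorted(reduced))
```

In the reduced graph each node is linked only to its immediate row and column neighbours, which is the kind of structure from which row and column labels could then be predicted.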