Patent classifications
G06V30/413
Method and apparatus for automatically extracting information from unstructured data
Various methods, apparatuses/systems, and media for automatically extracting information from unstructured data are provided. A receiver receives digitized data of a document having an unstructured data format. A processor applies machine learning models for sectioning the digitized data. An OCR device applies OCR processing to the sectioned digitized data. The processor matches the sectioned digitized data to patterns and rules; applies classification models to the matched digitized data to identify entities and events from the sectioned digitized data; automatically links each entity with its corresponding event in a hierarchical format to generate a document having a structured data format; and outputs the document having the structured data, with metadata having the linked entity and corresponding event in the hierarchical format, to downstream applications.
ENHANCING DOCUMENTS PORTRAYED IN DIGITAL IMAGES
The present disclosure is directed toward systems and methods that efficiently and effectively generate an enhanced document image of a displayed document in an image frame captured from a live image feed. For example, systems and methods described herein apply a document enhancement process to a displayed document in an image frame that results in an enhanced document image that is cropped, rectified, un-shadowed, and with dark text against a mostly white background. Additionally, systems and methods described herein determine whether a stored digital content item includes a displayed document. In response to determining that a stored digital content item does include a displayed document, systems and methods described herein generate an enhanced document image of the displayed document included in the stored digital content item.
SYSTEM AND METHOD FOR FORMAT-AGNOSTIC DOCUMENT INGESTION
A system for format-agnostic document ingestion including a document ingestion server and a database is disclosed. The server is configured to receive an image of a document comprising text in an unknown format and convert the image, using OCR, into a plurality of text elements, each having a content, a size, and an absolute position. The server is also configured to retrieve data detectors from the database, each associated with a data type anticipated to be in the document and comprising at least one identifier and direction and at least one validation criterion. The server is also configured to identify a potential descriptor by comparing the content of each text element with the at least one identifier, and then determine whether the text element pointed to by the data detector meets the validation criterion. Finally, the server is configured to associate the validated text element with the data detector and store the content.
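The detector lookup described in this abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `TextElement` and `DataDetector` classes, the two supported directions ("right" and "below"), the nearest-neighbor rule, and the use of a regular expression as the validation criterion.

```python
import re
from dataclasses import dataclass


@dataclass
class TextElement:
    # Hypothetical OCR output: content plus absolute position and size.
    content: str
    x: float
    y: float
    width: float
    height: float


@dataclass
class DataDetector:
    # Hypothetical detector record as it might be stored in the database.
    data_type: str
    identifier: str   # label text that marks a potential descriptor
    direction: str    # where the value sits relative to the label
    validation: str   # regular expression the value must fully match


def neighbor(anchor, elements, direction):
    """Return the nearest element in the given direction, or None."""
    if direction == "right":
        candidates = [e for e in elements
                      if e is not anchor and e.x > anchor.x
                      and abs(e.y - anchor.y) <= anchor.height]
        return min(candidates, key=lambda e: e.x - anchor.x, default=None)
    if direction == "below":
        candidates = [e for e in elements
                      if e is not anchor and e.y > anchor.y
                      and abs(e.x - anchor.x) <= anchor.width]
        return min(candidates, key=lambda e: e.y - anchor.y, default=None)
    raise ValueError(f"unsupported direction: {direction}")


def extract(elements, detectors):
    """Associate validated text elements with their detectors."""
    results = {}
    for det in detectors:
        for el in elements:
            # Step 1: find a potential descriptor by comparing content
            # with the detector's identifier.
            if det.identifier.lower() not in el.content.lower():
                continue
            # Step 2: follow the direction and validate the target element.
            target = neighbor(el, elements, det.direction)
            if target is not None and re.fullmatch(det.validation, target.content):
                results[det.data_type] = target.content
                break
    return results
```

Because the detectors rely on relative position instead of a fixed template, the same detector records could in principle be applied to documents with different layouts, which is the sense of "format-agnostic" suggested by the abstract.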
MULTI-PAGE DOCUMENT RECOGNITION IN DOCUMENT CAPTURE
Techniques to capture document data are disclosed. It is determined that a sequence of pages in a stream of document page images comprises a single multi-page document. Data is extracted from two or more different pages included in the sequence. The data extracted from the two or more different pages is used to populate a data entry form associated with the multi-page document.
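The abstract does not say how the page sequence is recognized, so the following Python sketch is only one possible illustration. It assumes, hypothetically, that each page's OCR text carries a "Page N of M" marker and that form fields can be located with regular expressions; the function names and patterns are illustrative and not taken from the patent.

```python
import re

# Assumed page marker; real documents may use other cues entirely.
PAGE_MARKER = re.compile(r"Page (\d+) of (\d+)", re.IGNORECASE)


def group_pages(page_texts):
    """Split a stream of page texts into multi-page documents.

    A page numbered 1 (or a page with no marker) starts a new document.
    """
    documents, current = [], []
    for text in page_texts:
        match = PAGE_MARKER.search(text)
        number = int(match.group(1)) if match else 1
        if number == 1 and current:
            documents.append(current)
            current = []
        current.append(text)
    if current:
        documents.append(current)
    return documents


def populate_form(pages, field_patterns):
    """Fill one form from fields that may sit on different pages."""
    form = {}
    for name, pattern in field_patterns.items():
        for text in pages:
            match = re.search(pattern, text)
            if match:
                form[name] = match.group(1)
                break
    return form
```

In this sketch a single form draws its name from one page and its total from another, which mirrors the abstract's point that data from two or more pages populates one data entry form.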
METHOD FOR GENERATING A HANDWRITING VECTOR
One variation of a method includes: accessing a handwriting sample comprising a set of user glyphs handwritten by a user; for each character in a set of characters, identifying a subset of user glyphs corresponding to the character in the handwriting sample, characterizing variability of a set of spatial features across the subset of user glyphs, and storing variability of the set of spatial features across the subset of user glyphs in a character container corresponding to the character; and compiling the set of character containers into a handwriting model for the user. The method further includes: accessing a text string comprising a combination of characters in the set of characters; for each instance of each character in the text string, inserting a set of variability parameters into the handwriting model to generate a synthetic glyph representing the character; and assembling the set of synthetic glyphs into a print file.
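As a rough illustration of the character containers and synthetic glyph generation described above, the following Python sketch stores a mean and standard deviation per spatial feature for each character, then samples new feature values for each character instance in a text string. The feature names, the choice of mean and standard deviation as the variability measure, and Gaussian sampling are all assumptions made here; the abstract does not specify them.

```python
import random
import statistics
from collections import defaultdict


def build_handwriting_model(glyphs):
    """Build per-character containers of (mean, std) for each feature.

    glyphs: iterable of (character, {feature_name: value}) pairs, standing in
    for user glyphs already identified in a handwriting sample.
    """
    by_char = defaultdict(lambda: defaultdict(list))
    for char, features in glyphs:
        for name, value in features.items():
            by_char[char][name].append(value)
    return {
        char: {name: (statistics.mean(values), statistics.pstdev(values))
               for name, values in feats.items()}
        for char, feats in by_char.items()
    }


def synthesize(text, model, rng):
    """Sample one set of feature values per character instance in text."""
    glyphs = []
    for char in text:
        if char not in model:
            continue  # characters absent from the sample are skipped
        params = {name: rng.gauss(mean, std)
                  for name, (mean, std) in model[char].items()}
        glyphs.append((char, params))
    return glyphs
```

Sampling per instance means that two occurrences of the same character generally receive different parameters, which is one way to reproduce the natural variation of a writer's hand. Rendering the sampled parameters into a print file is outside the scope of this sketch.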
AUTOMATED DOCUMENT PROCESSING FOR DETECTING, EXTRACTING, AND ANALYZING TABLES AND TABULAR DATA
According to one embodiment, a computer-implemented method for detecting and classifying columns of tables and/or tabular data arrangements within image data includes: detecting one or more tables and/or one or more tabular data arrangements within the image data; extracting the one or more tables and/or the one or more tabular data arrangements from the processed image data; and classifying either: a plurality of columns of the one or more extracted tables; a plurality of columns of the one or more extracted tabular data arrangements; or both the columns of the one or more extracted tables and the columns of the one or more extracted tabular data arrangements. Corresponding systems and computer program products are also disclosed.
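The column classification step can be sketched in Python under strong simplifying assumptions: the table has already been detected and extracted into rows of cell strings, the only classes are "date", "numeric", "text", and "empty", and a column takes a class when at least 80% of its non-empty cells match that class's pattern. None of these choices comes from the patent; they are illustrative only.

```python
import re

# Assumed cell patterns, checked in order; real systems would need far more.
CELL_PATTERNS = {
    "date": r"\d{4}-\d{2}-\d{2}",
    "numeric": r"[-+]?\d[\d,]*(\.\d+)?",
}


def classify_column(cells, threshold=0.8):
    """Label a column by the pattern most of its non-empty cells match."""
    non_empty = [c.strip() for c in cells if c.strip()]
    if not non_empty:
        return "empty"
    for label, pattern in CELL_PATTERNS.items():
        hits = sum(1 for c in non_empty if re.fullmatch(pattern, c))
        if hits / len(non_empty) >= threshold:
            return label
    return "text"


def classify_table(rows):
    """Classify every column of an already-extracted table."""
    columns = zip(*rows)  # transpose rows into columns
    return [classify_column(col) for col in columns]
```

The threshold tolerates a few OCR errors per column, so a single misread cell does not change the column's class.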