Patent classifications
G06F40/177
Information processing apparatus for complementing a heading of a table
An embodiment of the present invention provides an information processing apparatus capable of complementarily adding an attribute not included in a table in order to detect tables having a corresponding relationship. An information processing apparatus as an embodiment of the present invention includes a complementer. The complementer complementarily adds an attribute not included in a first table based on a content of at least one of the first table and an electronic document including the first table.
APPLICATION-SPECIFIC OPTICAL CHARACTER RECOGNITION CUSTOMIZATION
A method for customizing an optical character recognition system is disclosed. The optical character recognition system includes a general-purpose decoder configured to convert character images, recognized in a digital image, into text based on a general-purpose text structure. An application-specific customization is received. The application-specific customization includes an application-specific text structure that differs from the general-purpose text structure. A customized model is generated based on the application-specific customization. An enhanced application-specific decoder is generated by modifying the general-purpose decoder to, during run-time execution of the optical character recognition system, leverage the customized model to convert character images demonstrating the application-specific text structure into text.
TECHNIQUES FOR IMAGE CONTENT EXTRACTION
Embodiments are directed to techniques for image content extraction. Some embodiments include extracting contextually structured data from document images, such as by automatically identifying document layout, document data, document metadata, and/or correlations therebetween in a document image, for instance. Some embodiments utilize breakpoints to enable the system to match different documents with internal variations to a common template. Several embodiments include extracting contextually structured data from table images, such as gridded and non-gridded tables. Many embodiments are directed to generating and utilizing a document template database for automatically extracting document image contents into a contextually structured format. Several embodiments are directed to automatically identifying and associating document metadata with corresponding document data in a document image to generate a machine-facilitated annotation of the document image. In some embodiments, the machine-facilitated annotation may be used to generate a template for the template database.
Statistical join methods for data management
Aspects of the disclosure relate to joining data tables. A computing platform may input two or more tables into a statistical join function, which may initiate execution of the statistical join function, and where executing the statistical join function comprises applying one or more of: an end condition function, a partition tables function, or an outer join function to generate a new table that includes information from the two or more tables. The computing platform may send, to a user device, the new table and one or more commands directing the user device to display the new table, which may cause the user device to display the new table.
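The abstract names an outer join function as one possible building block but does not define it, nor the end condition or partition functions. As a minimal sketch of the generic operation only, the following shows a full outer join of two small tables. The function name `outer_join`, the list-of-dicts table representation, and the assumption that key values are unique within each table are all illustrative and not taken from the disclosure.

```python
def outer_join(left, right, key):
    """Full outer join of two tables (lists of dict rows) on `key`.

    Assumption: each key value appears at most once per table.
    """
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}

    # Collect every column name, preserving first-seen order.
    columns = []
    for row in left + right:
        for name in row:
            if name not in columns:
                columns.append(name)

    # Keys from the left table first, then keys found only on the right.
    all_keys = list(left_by_key) + [k for k in right_by_key if k not in left_by_key]

    joined = []
    for k in all_keys:
        merged = dict.fromkeys(columns)        # missing values become None
        merged.update(left_by_key.get(k, {}))
        merged.update(right_by_key.get(k, {}))
        joined.append(merged)
    return joined


left = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
right = [{"id": 2, "city": "Oslo"}, {"id": 3, "city": "Rome"}]
new_table = outer_join(left, right, "id")
```

Rows that appear in only one input are kept, with `None` in the columns the other table would have supplied.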
Table header detection using global machine learning features from orthogonal rows and columns
A method, system and computer-usable medium for detecting headers in various documents, such as PDF and HTML files. The files are converted to a two-dimensional array or table having orthogonal rows and columns. Either rows or columns are determined to include headers. To determine whether rows include headers, for each row in the array or table, a pairwise comparison is performed for each cell of each column that is orthogonal to that row. The pairwise comparison scores or values are summed for each column orthogonal to that row, and the sum across all the orthogonal columns provides a score or value for that row. Row scores are evaluated relative to one another to determine the likelihood of headers in the row. To determine whether columns have headers, a similar calculation is performed between columns and their orthogonal rows.
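The row-scoring step described in this abstract can be sketched roughly as follows. The abstract does not say what the pairwise comparison measures, so this example assumes a simple one: two cells score 1 if one looks numeric and the other does not, and 0 otherwise. The names `cell_kind`, `row_scores`, and `likely_header_row` are hypothetical, and picking the single highest-scoring row is just one possible way to evaluate rows relative to one another.

```python
def cell_kind(value):
    """Coarse cell type used for the comparison (an assumed feature)."""
    try:
        float(value.replace(",", ""))
        return "number"
    except ValueError:
        return "text"


def row_scores(table):
    """Score each row by how much its cells differ from the rest of their columns."""
    scores = []
    for r, row in enumerate(table):
        total = 0
        for c, cell in enumerate(row):
            # Pairwise comparison of this cell with every other cell
            # in the column orthogonal to the row, summed per column.
            total += sum(
                cell_kind(cell) != cell_kind(other_row[c])
                for i, other_row in enumerate(table)
                if i != r
            )
        scores.append(total)  # sum across all orthogonal columns
    return scores


def likely_header_row(table):
    """Return the index of the row whose score is highest relative to the others."""
    scores = row_scores(table)
    return max(range(len(scores)), key=scores.__getitem__)


table = [
    ["Name", "Age", "Salary"],
    ["Ann", "31", "52,000"],
    ["Bob", "45", "61,500"],
]
scores = row_scores(table)
header_index = likely_header_row(table)
```

Scoring columns against their orthogonal rows would follow the same pattern on the transposed table, for example `list(zip(*table))`.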
Information extraction from open-ended schema-less tables
Systems and methods for generating and annotating cell documents include extracting tables from a document using a table extraction engine. Headers are extracted for each of the tables using a header detection engine. Cells are extracted from each of the tables using a cell extraction engine. A cell document is generated for each of the cells which are each correlated to corresponding portions of the headers, each cell document recording the correlation between the cells and the headers. Each cell document is annotated to generate annotated cell documents with a cell recognition model trained to perform natural language processing on the cell documents by classifying each term in each of the cell documents and extracting relationships between the terms of each of the cell documents.
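As a rough illustration of the cell-document idea, the sketch below builds one record per cell and stores the headers that cell is correlated with. It assumes a table whose first row holds column headers and whose first column holds row headers; the `CellDocument` class and `build_cell_documents` function are hypothetical names, and the annotation step with a trained cell recognition model is not shown.

```python
from dataclasses import dataclass


@dataclass
class CellDocument:
    """One cell plus the header portions it is correlated with."""
    row_header: str
    column_header: str
    value: str


def build_cell_documents(table):
    """Generate a cell document for every data cell in the table.

    Assumption: table[0] holds column headers and the first entry
    of each later row is that row's header.
    """
    column_headers = table[0][1:]
    documents = []
    for row in table[1:]:
        row_header, *cells = row
        for column_header, value in zip(column_headers, cells):
            documents.append(CellDocument(row_header, column_header, value))
    return documents


table = [
    ["", "2022", "2023"],
    ["Revenue", "10", "12"],
    ["Cost", "7", "8"],
]
documents = build_cell_documents(table)
```

Keeping the header correlation inside each record means a downstream model can process every cell independently without access to the original table layout.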