Patent classifications
G06F40/163
System And Method For Extracting Structured Information From Implicit Tables
A system and method for extracting structured information from an implicit table is disclosed. The system and method provide a way to locate and categorize structured information from an implicit table. More specifically, the system and method provide a way of determining which part of an input image document includes a dominant table and which parts of the dominant table make up rows and columns. These details give meaning to the structured information of the implicit table. These details can be used to properly place the structured information from the implicit table into a two-dimensional data structure, such as a data structure in a relational database. In other words, the structured information from a scanned or digital Portable Document Format (PDF) document can be extracted and placed into a useful format, such as a relational database.
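The row and column assignment described above can be illustrated with a minimal sketch. It assumes that word positions are already available as hypothetical `(x, y, text)` tuples, for example from an OCR step, and that a fixed pixel `tolerance` is enough to separate rows and columns. Neither assumption comes from the abstract, which does not say how the dominant table is located.

```python
from typing import List, Tuple

Word = Tuple[float, float, str]  # hypothetical (x, y, text) for one recognized word

def cluster(values, tolerance: float) -> List[List[float]]:
    """Group 1-D coordinates; a new group starts when the gap exceeds tolerance."""
    groups: List[List[float]] = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tolerance:
            groups[-1].append(v)
        else:
            groups.append([v])
    return groups

def index_of(value: float, groups: List[List[float]]) -> int:
    """Return the index of the group that contains value."""
    for i, group in enumerate(groups):
        if value in group:
            return i
    raise ValueError(f"{value} not found in any group")

def words_to_grid(words: List[Word], tolerance: float = 5.0) -> List[List[str]]:
    """Place each word into a two-dimensional grid of rows and columns."""
    rows = cluster({y for _, y, _ in words}, tolerance)
    cols = cluster({x for x, _, _ in words}, tolerance)
    grid = [["" for _ in cols] for _ in rows]
    for x, y, text in words:
        grid[index_of(y, rows)][index_of(x, cols)] = text
    return grid

words = [(10, 100, "Item"), (200, 101, "Price"), (11, 130, "Pen"), (201, 129, "1.50")]
grid = words_to_grid(words)
```

The resulting `grid` is a list of rows that could be written to a relational table one row at a time.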
Method for inferring blocks of text in electronic documents
A method for processing an electronic document with characters includes adjusting the characters to identify lines and words; generating a cluster encompassing all of the lines and the words; setting the cluster as a target; determining whether the target can be divided; in response to determining that the target can be divided, dividing the target into a first plurality of sub-clusters; identifying blocks of text based on the first sub-clusters; and generating a new electronic document with paragraphs and sections based on the blocks of text.
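The divide-the-target loop in this abstract resembles a recursive top-down split. The sketch below is a simplified one-dimensional illustration, not the patented method. It assumes each line is a hypothetical `(top, bottom)` vertical extent, that lines do not overlap vertically, and that a target "can be divided" when its widest vertical gap is at least `min_gap`.

```python
from typing import List, Tuple

Line = Tuple[float, float]  # hypothetical (top, bottom) extent of one text line

def split_cluster(lines: List[Line], min_gap: float) -> List[List[Line]]:
    """Recursively divide a cluster of lines at its widest vertical gap."""
    ordered = sorted(lines)
    # Find the widest gap between consecutive lines.
    best_gap, best_idx = 0.0, -1
    for i in range(len(ordered) - 1):
        gap = ordered[i + 1][0] - ordered[i][1]
        if gap > best_gap:
            best_gap, best_idx = gap, i
    # The target cannot be divided further, so it is treated as one block of text.
    if best_idx < 0 or best_gap < min_gap:
        return [ordered]
    upper = ordered[: best_idx + 1]
    lower = ordered[best_idx + 1:]
    return split_cluster(upper, min_gap) + split_cluster(lower, min_gap)

page = [(0, 10), (12, 22), (40, 50), (52, 62), (90, 100)]
blocks = split_cluster(page, min_gap=15)
```

Here `blocks` holds three groups of lines, which a later step could turn into paragraphs or sections.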
Methods for Processing and Verifying a Document
Embodiments described herein provide a computer-implemented method of creating a digest of a document. The document to be processed and analysed may be a physical document, or it may already be in a digital form. In the case of starting from a physical document, the document is first scanned, so as to obtain an image of the document. The digital document is then processed using an algorithm or function to obtain one or more datasets comprising a plurality of position independent values. Each of the datasets may correspond to a different line of text or field of text within the document. The one or more datasets are then encoded, and the encoded data is used to generate a digest associated with the document, the digest comprising a plurality of short hashes corresponding to the datasets. The generated digest can then be used to print a digital signature on the document, which can be used to later verify the authenticity of the document or a copy thereof.
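A digest made of one short hash per line can be sketched as follows. The abstract does not say how the position independent values are computed or which hash is used, so this example makes its own assumptions: whitespace-normalized, lower-cased text stands in for the position independent value, and a truncated SHA-256 stands in for the short hash. The function names are hypothetical.

```python
import hashlib
from typing import List

def short_hash(text: str, length: int = 8) -> str:
    """Hash the normalized text of one line and keep only the first hex digits."""
    # Assumption: collapsing whitespace and case removes layout-dependent detail.
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:length]

def make_digest(lines: List[str]) -> List[str]:
    """Build the digest as one short hash per line of text."""
    return [short_hash(line) for line in lines]

def changed_lines(lines: List[str], digest: List[str]) -> List[int]:
    """Return indices of lines whose short hash no longer matches the digest."""
    return [i for i, (line, h) in enumerate(zip(lines, digest)) if short_hash(line) != h]

original = ["Invoice No. 1001", "Total:  250.00 EUR"]
digest = make_digest(original)

# A copy with different spacing and case on line 0, and an altered amount on line 1.
copy = ["invoice no. 1001", "Total: 950.00 EUR"]
tampered = changed_lines(copy, digest)
```

Keeping one hash per line lets a verifier point at the line that changed, at the cost of a higher collision chance than a full-length hash.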
AUTOMATICALLY GENERATING A WEBSITE SPECIFIC TO AN INDUSTRY
Systems and methods of the present invention provide for one or more server computers communicatively coupled to a network and configured to: store data records associated with an industry, with tags defining the content, layout or style of a website; aggregate industry related data records via data entry or extraction; receive a request to automatically generate a website in a specific industry; query a database for the most frequently occurring website features; and automatically generate the website according to the most frequently occurring website features.
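The query for the most frequently occurring website features can be illustrated with a small in-memory example. The record layout (`industry` and `features` keys) and the function name are assumptions made for this sketch. The abstract describes a database query, which is replaced here by a plain count.

```python
from collections import Counter
from typing import Dict, List

def most_frequent_features(records: List[Dict], industry: str, top_n: int = 3) -> List[str]:
    """Return the top_n features that appear on the most sites in one industry."""
    counts: Counter = Counter()
    for record in records:
        if record["industry"] == industry:
            # Count each feature once per site, even if it is listed twice.
            counts.update(set(record["features"]))
    # Sort by descending count, then by name so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [feature for feature, _ in ranked[:top_n]]

records = [
    {"industry": "bakery", "features": ["menu", "gallery", "contact"]},
    {"industry": "bakery", "features": ["menu", "contact"]},
    {"industry": "bakery", "features": ["menu", "blog"]},
    {"industry": "law", "features": ["blog", "contact"]},
]
template = most_frequent_features(records, "bakery", top_n=2)
```

The returned feature list would then drive the choice of content, layout or style for the generated site.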
BLOCKWISE EXTRACTION OF DOCUMENT METADATA
Methods, computer program products, and systems are presented. The methods include, for instance: obtaining a document image, wherein the document image includes a plurality of objects; identifying a plurality of macroblocks within the document image; performing microblock processing within macroblocks of the plurality of macroblocks, wherein the microblock processing includes examining content of microblocks within a macroblock for extraction of key-value pairs, the examining content including performing an ontological analysis of microblocks, wherein the microblock processing includes associating confidence levels to the extracted key-value pairs; and outputting metadata based on the performing microblock processing within macroblocks of the plurality of macroblocks.
Device and method for providing recommended words for character input
A device and method for providing recommended words for a character input by a user are provided. The method by which the device provides recommended words includes: receiving an input for inputting a character in a character input window; recommending at least one pseudo-morpheme including the input character by analyzing the input character; recommending at least one extended word including a selected pseudo-morpheme in response to receiving an input for selecting one of the at least one pseudo-morpheme; and displaying a selected extended word in response to receiving an input for selecting one of the at least one extended word.
Column Inferencer
A method is disclosed for processing an electronic document (ED) to infer columns in the ED, where the ED comprises a plurality of characters. The method includes: generating a mark-up version of the ED having text-layout attributes of the characters in the ED, where the characters are grouped into paragraphs based on the text-layout attributes and each paragraph has a bounding box surrounding it; generating border pieces by initiating a pair of left and right scans from each paragraph bounding box to identify any adjacent paragraph bounding box; and generating, based at least on the border pieces, column borders for use in inferring the columns in the ED, where at least one column has a vertically aligned portion of the paragraphs.
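The scan from each paragraph bounding box can be sketched in a reduced form. This example performs only the right scan (the left scan would be symmetric) and places a border piece at the midpoint of the gap to the nearest box on the right. The `Box` type, the midpoint rule, and the function names are assumptions of this sketch and are not taken from the abstract.

```python
from typing import List, NamedTuple, Optional

class Box(NamedTuple):
    """Hypothetical paragraph bounding box in page coordinates."""
    left: float
    top: float
    right: float
    bottom: float

def overlaps_vertically(a: Box, b: Box) -> bool:
    """True when the two boxes share some vertical extent."""
    return a.top < b.bottom and b.top < a.bottom

def right_neighbor(box: Box, boxes: List[Box]) -> Optional[Box]:
    """Right scan: find the nearest box that starts at or beyond this box's right edge."""
    candidates = [b for b in boxes if b.left >= box.right and overlaps_vertically(box, b)]
    return min(candidates, key=lambda b: b.left, default=None)

def border_positions(boxes: List[Box]) -> List[float]:
    """Collect the x positions of border pieces between horizontally adjacent boxes."""
    xs = set()
    for box in boxes:
        neighbor = right_neighbor(box, boxes)
        if neighbor is not None:
            xs.add((box.right + neighbor.left) / 2)
    return sorted(xs)

boxes = [Box(0, 0, 100, 50), Box(0, 60, 100, 120), Box(120, 0, 220, 120)]
borders = border_positions(boxes)
```

Border pieces that share an x position, as the two left-hand paragraphs do here, would be merged into a single column border.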
Document Analyzer, Document Analysis Method, and Computer-Readable Storage Medium Storing Program
The document analyzer includes a hardware processor. The hardware processor analyzes a construction of a passage with multiple techniques, thereby obtaining multiple analysis results. For each of unit segments related to the construction of the passage, the hardware processor identifies segment areas with the respective techniques based on the analysis results. For each of the unit segments, the hardware processor selects a segment area based on the analysis results from the segment areas identified with the respective techniques.
AUTOMATIC DETECTION AND REMOVAL OF TYPOGRAPHIC RIVERS IN ELECTRONIC DOCUMENTS
Embodiments are disclosed for removing typographic rivers from electronic documents. The method may include receiving an electronic document including a plurality of words for automatic typographic correction. A typographic river is identified in the electronic document, the typographic river including a plurality of nodes, each node including an empty glyph. A candidate adjustment that removes the first node of the plurality of nodes is identified and the candidate adjustment is applied to the electronic document.
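River identification can be illustrated for the simplest case. The sketch assumes monospaced text, so that a character's column index stands in for its horizontal position, and treats each space character as an empty-glyph node. A river is taken to be a chain of spaces on consecutive lines whose columns differ by at most `tolerance`. These choices and the function name are assumptions. The sketch covers detection only, not the candidate adjustment step.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (line index, column index) of one space character

def find_rivers(lines: List[str], min_length: int = 3, tolerance: int = 1) -> List[List[Node]]:
    """Find chains of near-aligned spaces running down consecutive lines."""
    rivers: List[List[Node]] = []
    active: List[List[Node]] = []  # chains that reached the previous line
    for row, line in enumerate(lines):
        spaces = [col for col, ch in enumerate(line) if ch == " "]
        used = set()
        next_active: List[List[Node]] = []
        for chain in active:
            last_col = chain[-1][1]
            # Extend the chain with the first unused space close to its last column.
            match = next((c for c in spaces
                          if abs(c - last_col) <= tolerance and c not in used), None)
            if match is not None:
                used.add(match)
                next_active.append(chain + [(row, match)])
            elif len(chain) >= min_length:
                rivers.append(chain)  # the chain ended and is long enough to report
        # Every space that did not extend a chain starts a new one.
        next_active.extend([(row, c)] for c in spaces if c not in used)
        active = next_active
    rivers.extend(chain for chain in active if len(chain) >= min_length)
    return rivers

text = ["aaa bbbb", "cccc ddd", "eee ffff", "gggggggg"]
rivers = find_rivers(text)
```

For proportional fonts the column index would have to be replaced by the measured horizontal position of each glyph.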
Intent classification using non-correlated features
A system for classifying a language sample intent by receiving a language sample including a set of features, identifying language sample features, determining a tokenization score for the language sample according to the language sample features, eliminating duplicate features according to the tokenization score, determining a term frequency (tf) according to the identified features and the tokenization score, determining an inverse document frequency (idf) according to the identified features and the tokenization score, and generating a term frequency-inverse document frequency (tf-idf) matrix for the identified features.
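The tf-idf matrix at the end of this pipeline can be sketched with the standard definitions. The tokenization score and duplicate-feature elimination mentioned in the abstract are not specified there and are not modeled here. Whitespace tokenization, tf as a relative frequency, and idf as `log(N / df)` are assumptions of this example.

```python
import math
from collections import Counter
from typing import List, Tuple

def tfidf_matrix(samples: List[str]) -> Tuple[List[str], List[List[float]]]:
    """Return the sorted vocabulary and one tf-idf row per language sample."""
    tokenized = [sample.lower().split() for sample in samples]
    vocab = sorted({token for doc in tokenized for token in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many samples each token appears.
    df = Counter(token for doc in tokenized for token in set(doc))
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        row = []
        for token in vocab:
            tf = counts[token] / len(doc) if doc else 0.0
            idf = math.log(n_docs / df[token])
            row.append(tf * idf)
        matrix.append(row)
    return vocab, matrix

samples = ["book a flight", "cancel a flight", "book a hotel"]
vocab, matrix = tfidf_matrix(samples)
```

A token such as "a" that occurs in every sample receives weight zero, while a token found in only one sample, such as "cancel", receives the largest weight in its row.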