Apparatus and Method for Recognizing Image-Based Content Presented in a Structured Layout
20210295101 · 2021-09-23
CPC classification
G06N7/01
G06V30/18057
G06V30/414
G06F18/2148
Abstract
A method for extracting information from a table includes steps as follows. Characters of a table are extracted. The characters are merged into n-gram characters. The n-gram characters are merged into words and text lines through a two-stage GNN mode. The two-stage GNN mode comprises sub-steps as follows: spatial features, semantic features, and CNN image features are extracted from a target source; a first GNN stage is processed to output graph embedding spatial features from the spatial features; and a second GNN stage is processed to output graph embedding semantic features and graph embedding CNN image features from the semantic features and the CNN image features, respectively. The text lines are merged into cells. The cells are grouped into rows, columns, and key-value pairs based on one or more adjacency matrices, a row relationship among the cells, a column relationship among the cells, and a key-value relationship among the cells.
Claims
1. A method for recognizing and extracting information from data presented in a structured layout in a target source document, comprising: providing a character classifier; providing a graph neural network (GNN) having a pretrained feature embedding layer and a two-stage GNN mode; extracting text characters in the structured layout in the target source document by the character classifier; merging the text characters with two-dimensional positions thereof into n-gram characters by the character classifier; extracting semantic features from the target source document by the pretrained feature embedding layer of the GNN, wherein the semantic features comprise word meanings; manually defining spatial features of the target source document, wherein the spatial features comprise geometric features of text bounding boxes, such as coordinates, heights, widths, and aspect ratios, in the document; using a convolution neural network (CNN) layer to obtain CNN image features of the target source document, wherein the CNN image features represent features of the mid-point of a text box of the document and comprise one or more of font sizes and font types of the text characters, and explicit separators in the text of the document; merging the n-gram characters into words and text lines by the GNN; wherein the two-stage GNN mode has a first GNN stage and a second GNN stage; wherein the first GNN stage comprises: generating graph embedding spatial features from the spatial features; wherein the second GNN stage comprises: generating graph embedding semantic features and graph embedding CNN image features from the semantic features and the CNN image features, respectively; merging the text lines into cells by the GNN; grouping the cells into rows, columns, and key-value pairs by the GNN, wherein results of the grouping are represented by one or more adjacency matrices, and a row relationship among the cells, a column relationship among the cells, and a key-value relationship among the cells.
2. The method of claim 1, further comprising: generating content of the table in a form of editable electronic data according to the row relationship among the cells, the column relationship among the cells, and the key-value relationship among the cells.
3. The method of claim 2, wherein the content of the table includes at least one data set having a key and at least one value that matches the key.
4. The method of claim 2, further comprising preserving the content of the table into extensible markup language (XML).
5. The method of claim 1, wherein the first GNN stage further comprises: generating, from the spatial features by a first GNN, a first weight matrix for the semantic features and a second weight matrix for the CNN image features.
6. The method of claim 5, wherein the second GNN stage further comprises: generating the graph embedding semantic features from the semantic features and the first weight matrix for the semantic features by a second GNN configured by the first weight matrix; and generating the graph embedding CNN image features from the CNN image features and the second weight matrix for the CNN image features by a third GNN configured by the second weight matrix.
7. The method of claim 1, wherein the merging of the n-gram characters into the words and the text lines comprises: generating a word probability matrix during the merging of the n-gram characters into the words; and introducing the word probability matrix during the merging of the words into the text lines, wherein the word probability matrix serves as a weight matrix to the GNN; and wherein the one or more adjacency matrices comprise a word adjacency matrix obtained by applying an argmax function to the word probability matrix.
8. The method of claim 1, wherein the merging of the text lines into the cells comprises generating a cell probability matrix; wherein the grouping of the cells into the rows, the columns, and the key-value pairs comprises: introducing the cell probability matrix into the grouping to serve as a weight matrix to the GNN; and wherein the one or more adjacency matrices comprise a cell adjacency matrix obtained by applying an argmax function to the cell probability matrix.
9. The method of claim 1, further comprising: capturing an image of the structured layout by using an optical scanner, wherein the text characters are extracted from the image by the character classifier.
10. The method of claim 1, wherein the merging of the text characters with two-dimensional positions thereof into n-gram characters by the character classifier uses one of the Docstrum, Voronoi, and X-Y Cut algorithms.
11. An apparatus for recognizing and extracting information from data presented in a structured layout in a target source document, comprising: a character classifier configured to: extract text characters in the structured layout in the target source document; merge the text characters with two-dimensional positions thereof into n-gram characters; a convolution neural network (CNN) layer configured to obtain CNN image features of the target source document, wherein the CNN image features represent features of the mid-point of a text box of the document and comprise one or more of font sizes and font types of the text characters, and explicit separators in the text of the document; and a graph neural network (GNN) having a two-stage GNN mode; wherein the two-stage GNN mode has a pretrained feature embedding layer, a first GNN stage, and a second GNN stage; wherein the pretrained feature embedding layer is configured to extract semantic features from the target source document, wherein the semantic features comprise word meanings; wherein the first GNN stage comprises: generating graph embedding spatial features from spatial features of the target source document, the spatial features being manually defined and comprising geometric features of text bounding boxes, such as coordinates, heights, widths, and aspect ratios, in the target source document; wherein the second GNN stage comprises: generating graph embedding semantic features and graph embedding CNN image features from the semantic features and the CNN image features, respectively; wherein the GNN is configured to: merge the n-gram characters into words and text lines; merge the text lines into cells; and group the cells into rows, columns, and key-value pairs, wherein results of the grouping are represented by one or more adjacency matrices, a row relationship among the cells, a column relationship among the cells, and a key-value relationship among the cells.
12. The apparatus of claim 11, wherein the GNN is further configured to generate content of the table in a form of editable electronic data according to the adjacency matrices.
13. The apparatus of claim 12, wherein the content of the table includes at least one data set having a key and at least one value that matches the key.
14. The apparatus of claim 12, wherein the apparatus is further configured to store the content of the table into extensible markup language (XML).
15. The apparatus of claim 11, wherein the first GNN stage further comprises: generating, from the spatial features by a first GNN, a first weight matrix for the semantic features and a second weight matrix for the CNN image features.
16. The apparatus of claim 15, wherein the second GNN stage further comprises: generating the graph embedding semantic features from the semantic features and the first weight matrix for the semantic features by a second GNN configured by the first weight matrix; and generating the graph embedding CNN image features from the CNN image features and the second weight matrix for the CNN image features by a third GNN configured by the second weight matrix.
17. The apparatus of claim 11, wherein the merging of the n-gram characters into the words and the text lines comprises: generating a word probability matrix during the merging of the n-gram characters into the words; and introducing the word probability matrix during the merging of the words into the text lines, wherein the word probability matrix serves as a weight matrix to the GNN; and wherein the one or more adjacency matrices comprise a word adjacency matrix obtained by applying an argmax function to the word probability matrix.
18. The apparatus of claim 11, wherein the merging of the text lines into the cells comprises generating a cell probability matrix; wherein the grouping of the cells into the rows, the columns, and the key-value pairs comprises: introducing the cell probability matrix into the grouping to serve as a weight matrix to the GNN; and wherein the one or more adjacency matrices comprise a cell adjacency matrix obtained by applying an argmax function to the cell probability matrix.
19. The apparatus of claim 11, further comprising an optical scanner, wherein the text characters are extracted by the character classifier from an image captured by the optical scanner.
20. The apparatus of claim 11, wherein the character classifier is further configured to merge the text characters with two-dimensional positions thereof into n-gram characters using one of the Docstrum, Voronoi, and X-Y Cut algorithms.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] Embodiments of the invention are described in more detail hereinafter with reference to the drawings.
DETAILED DESCRIPTION
[0019] In the following description, methods and apparatuses for extracting information from image-based content in a structured layout, and the like, are set forth as preferred examples. It will be apparent to those skilled in the art that modifications, including additions and/or substitutions, may be made without departing from the scope and spirit of the invention. Specific details may be omitted so as not to obscure the invention; however, the disclosure is written to enable one skilled in the art to practice the teachings herein without undue experimentation.
[0020] The present invention provides a method and an apparatus for image-based structured layout content recognition, which can convert structured layout information of an electronic or physical document into editable electronic data and then store the editable electronic data. A structured layout is for texts to be distributed on a page of a document with certain arrangements, such as a table. In accordance with one embodiment of the present invention, an image-based table content recognition method is executed by at least two logical components: a character classifier and a multi-task GNN. An ordinarily skilled person in the art may easily envision and realize the logical components by implementing in software, firmware, and/or machine instructions executable in one or more computer processors, specially configured processors, or combinations thereof.
[0021] In accordance with one embodiment, the character classifier is a natural language processing (NLP) based character classifier for language character recognition. At design time, the character classifier is trained with a training data set containing characters of a selected language. For example, in the case where English is the selected language, the training data set may contain characters A-Z and a-z. During training, a usable number of images of each character in different handwriting styles/forms or in different print fonts (e.g. 100 images per character) are fed to the character classifier, such that the training of the character classifier constructs a character feature database, so as to make the character classifier recognize the characters of the selected language. In various embodiments, the character classifier is constructed based on a neural network, such as a convolutional neural network (CNN). In various embodiments, the character classifier also comprises an OCR engine for performing conversion of images of typed, handwritten, or printed characters into machine codes. In still other embodiments, a number of the process steps in the methods may be performed by one or more classifiers of various types and/or implementations made suitable to perform the tasks in the process steps.
[0022] In general, a GNN is a connectionist model that can capture the dependence structure of a graph via message passing between its nodes and can update the hidden state of each node by a weighted sum of the states of its neighborhood, so as to learn the distribution of large experimental data. Accordingly, GNNs are able to model the relationships between nodes in a graph and produce a numeric representation of them. One of the reasons for choosing GNNs is that much readily available real-world data can be represented in graph form.
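The message-passing update described above can be sketched in plain Python. This is a minimal illustration of the general mechanism (weighted sum of neighbour states followed by a ReLU nonlinearity), not the patent's specific architecture:

```python
def gnn_layer(h, adj, weights):
    """One message-passing step over a graph.

    h       : list of node state vectors (lists of floats)
    adj     : adjacency matrix (1 = edge, self-loops included)
    weights : per-edge weights used in the weighted sum

    Each node's new state is the weighted sum of its neighbours'
    states (including its own, via the self-loop), passed through
    a ReLU nonlinearity.
    """
    n = len(h)
    out = []
    for i in range(n):
        acc = [0.0] * len(h[0])
        for j in range(n):
            if adj[i][j]:
                w = weights[i][j]
                acc = [a + w * x for a, x in zip(acc, h[j])]
        out.append([max(0.0, a) for a in acc])  # ReLU
    return out
```

In practice the weighted sum would also involve learned projection matrices; they are omitted here to keep the update rule visible.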
[0024] The spatial feature 12 represents geometric features of the text bounding box, such as coordinates, height, width, and height width ratio (a.k.a. aspect ratio); the semantic feature 14 represents n-gram character embedding, word embedding, or text line embedding from a pretrained database (e.g. millions of raw data and text documents); and the CNN image feature 16 represents CNN/image features of the mid-point of the text bounding box, which may contain information of font size, font type, and explicit separator.
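The manually defined spatial features 12 can be illustrated with a small sketch; the normalisation by page size and the exact feature ordering are assumptions for the example, not prescribed by the method:

```python
def spatial_features(box, page_w, page_h):
    """Geometric features of a text bounding box: normalised
    coordinates, width, height, and aspect (width/height) ratio.

    box is (x0, y0, x1, y1) in page pixel coordinates.
    """
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    return [x0 / page_w, y0 / page_h, w / page_w, h / page_h, w / h]
```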
[0025] In one embodiment, the GNN is separated into three sub-networks: a first GNN 24, a second GNN 30, and a third GNN 32. In another embodiment, the GNN is configured differently at different processing steps or stages, such that the differently configured GNNs are labeled: a first GNN 24, a second GNN 30, and a third GNN 32. In the first GNN stage 20, the spatial features 12 are input into the first GNN 24, such that graph embedding spatial features, a first weight matrix for the semantic features 26, and a second weight matrix for the CNN image features 28 can be output from the first GNN 24.
[0026] In the second GNN stage 22, the semantic features 14 and the CNN image features 16 are processed in a parallel manner. That is, the semantic features 14 and the CNN image features 16 may be fed to different GNNs.
[0027] In the two-stage GNN mode, the second GNN stage 22 is executed after the generation of the first weight matrix for the semantic features 26 and the second weight matrix for the CNN image features 28. As such, the first weight matrix for the semantic features 26 and the second weight matrix for the CNN image features 28 can be separated out, so that the semantic and CNN image features 14 and 16 can be further processed while being prevented from exerting any influence on each other.
[0028] After the second GNN stage 22, in addition to the spatial, semantic, and CNN image features 12, 14, and 16 obtained prior to the first and second GNN stages 20 and 22, the graph embedding spatial features, the graph embedding semantic features, and the graph embedding CNN image features are further obtained. More specifically, compared with sequential modeling, GNN can learn the importance among text blocks more flexibly and precisely. The degree of importance among text blocks is used to generate text block representation that incorporates the context. Briefly, by processing the spatial, semantic, and CNN image features 12, 14, and 16 in the two-stage GNN mode, these features 12, 14, and 16 can be integrated to output the respective graph embedding features, which will be advantageous to accurately recognize a table content.
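The dataflow of the two-stage mode might be sketched as follows, where `first_gnn`, `second_gnn`, and `third_gnn` are hypothetical callables standing in for the first GNN 24, second GNN 30, and third GNN 32:

```python
def two_stage_gnn(spatial, semantic, image, first_gnn, second_gnn, third_gnn):
    """Dataflow sketch of the two-stage GNN mode (assumed interfaces).

    Stage 1: the first GNN turns the spatial features into graph
    embedding spatial features plus two weight matrices.
    Stage 2: the second and third GNNs embed the semantic and CNN
    image features in parallel, each configured by its own weight
    matrix, so the two modalities do not influence each other.
    """
    spatial_emb, w_semantic, w_image = first_gnn(spatial)
    semantic_emb = second_gnn(semantic, w_semantic)  # stage-2 branch 1
    image_emb = third_gnn(image, w_image)            # stage-2 branch 2
    return spatial_emb, semantic_emb, image_emb
```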
[0029] The following further describes the workflow for the table content recognition.
[0030] In S10, an image of a table in an electronic or physical document is captured. In various embodiments, the table-recognition system 100 may further include an optical scanner 102 electrically coupled to the character classifier 110 and the GNN 120, so as to capture the image and transmit it to either the character classifier 110 or the GNN 120. To illustrate, a table image 200 is taken as the running example in the following description.
[0031] After capturing the table image, the method continues with S20. In S20, the image is transmitted to the character classifier 110 for character extraction. The character classifier 110 obtains the extracted information from characters in the table image 200. Specifically, the extracted information may include the text and the coordinates of each of the characters. In various embodiments, the character classifier 110 extracts information via OCR with a predetermined language. For example, an OCR engine for English can be selected.
[0032] After obtaining the extracted information, the method continues with S30. In S30, the extracted characters with their two-dimensional positions (i.e. the coordinates thereof) are merged into n-gram characters.
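A toy version of this merging step can be sketched as a left-to-right proximity merge. This simplified stand-in groups characters that share a baseline and sit within a small horizontal gap; the actual method may use the Docstrum, Voronoi, or X-Y Cut layout-analysis algorithms named in the claims:

```python
def merge_chars(chars, gap=5):
    """Merge extracted characters into n-gram tokens by proximity.

    Each char is a dict with its text, top-left coordinates (x, y),
    and width w. Characters on the same baseline whose horizontal
    gap is at most `gap` pixels are joined into one token.
    """
    chars = sorted(chars, key=lambda c: (c["y"], c["x"]))
    grams, current = [], None
    for c in chars:
        if (current and c["y"] == current["y"]
                and c["x"] - current["x_end"] <= gap):
            current["text"] += c["text"]           # extend the token
            current["x_end"] = c["x"] + c["w"]
        else:
            if current:
                grams.append(current)              # close previous token
            current = {"text": c["text"], "y": c["y"],
                       "x": c["x"], "x_end": c["x"] + c["w"]}
    if current:
        grams.append(current)
    return grams
```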
[0034] In step S44, the n-gram-characters spatial features, semantic features, and CNN image features 212, 214, and 216 are processed by the GNN through a two-stage GNN mode, thereby integrating them into n-gram-characters graph embedding spatial features, semantic features, and CNN image features.
[0035] The graph embedding features are used to serve as merging materials to obtain words 220 of the table image.
[0036] Then, continuing with step S46, the words 220 are merged into the text lines 224 by the GNN with the two-stage GNN mode. In one embodiment, a text line probability matrix is introduced into the merging to serve as a weight matrix for obtaining the merging result to the text lines 224. Similarly, the text line probability matrix acts as an adjacency matrix for the words 220, and the text lines 224 are the “argmax set” of the text line probability matrix. To obtain the merging result to the text lines 224, cliques of an adjacency matrix for the words 220 are found, and the words in each clique are merged into “a text line”.
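The argmax-and-clique merging described above can be illustrated with a minimal sketch. Here the per-pair argmax over (no-link, link) probabilities is reduced to a 0.5 threshold, and cliques are approximated by connected components of the resulting adjacency matrix; both are simplifying assumptions for the example:

```python
def merge_by_adjacency(items, prob):
    """Merge items whose pairwise link probability indicates an edge.

    prob[i][j] is the probability that items i and j belong to the
    same group; thresholding at 0.5 plays the role of the argmax
    over the two-class link probabilities. Connected groups of the
    resulting adjacency matrix are merged into single strings.
    """
    n = len(items)
    adj = [[1 if prob[i][j] >= 0.5 else 0 for j in range(n)] for i in range(n)]
    seen, groups = set(), []
    for i in range(n):
        if i in seen:
            continue
        stack, comp = [i], []
        while stack:                       # depth-first traversal
            k = stack.pop()
            if k in seen:
                continue
            seen.add(k)
            comp.append(k)
            stack.extend(j for j in range(n) if adj[k][j] and j not in seen)
        groups.append(" ".join(items[k] for k in sorted(comp)))
    return groups
```

The same routine sketches the later grouping steps as well, since rows, columns, and key-value pairs are likewise the argmax sets of their probability matrices.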
[0038] In S54, the text line spatial features, semantic features, and CNN image features 230, 232, and 234 are processed by the GNN through a two-stage GNN mode, thereby integrating them into text line graph embedding spatial features, semantic features, and CNN image features. Herein, the two-stage GNN mode is the same as described above.
[0039] Next, these graph embedding features are used to serve as merging materials for the cells 240, wherein each “cell” has meaningful sets of characters and/or words and forms an element of the table.
[0040] Then, after obtaining the cells 240, the method continues with S60 for grouping the cells into rows, columns, and key-value pairs.
[0041] The grouping of the cells 240 is executed based on the semantic features thereof. The reason is that no matter how the table layout changes, semantics are coherent within a cell and similar within a column or row. As such, when the table recognition faces the case of segmenting a table having a complex layout (e.g. nested rows, nested columns, overlapping columns, or an irregular format), a reduction in the accuracy of grouping the cells of the table can be avoided by employing the semantic features of the text lines. Moreover, for the case of a table having a row spanning several columns or a column spanning several rows, considering the semantic features of the text lines likewise avoids low accuracy.
[0042] In various embodiments, row, column, and key-value pair probability matrices are introduced into the grouping to serve as weight matrices for obtaining the grouping results to the rows 250, the columns 252, and the key-value pairs 254, respectively. Similarly, these probability matrices act as adjacency matrices for the cells 240, and the rows 250, the columns 252, and the key-value pairs 254 are the “argmax sets” of the corresponding probability matrices, respectively. To obtain the merging result to the rows 250, the columns 252, or the key-value pairs 254, cliques of the corresponding adjacency matrix for the cells 240 are found, and the cells in each clique are merged into “a row”, “a column”, or “a key-value pair”.
[0043] Thereafter, according to the obtained rows 250, columns 252, and key-value pairs 254, a row relationship among the cells, a column relationship among the cells, and a key-value relationship among the cells can be determined and obtained.
[0045] S80 follows S70; S80 preserves the structured data. In S80, according to the adjacency matrices, the table layout can be identified by the GNN 120, such that the GNN 120 can generate content of the table in a form of editable electronic data. Specifically, the statement “the table layout can be identified by the GNN 120” means the GNN 120 can extract information from the table with the correct reading order. As such, the generated content of the table may include at least one data set having a key and at least one value, in which the value matches the key. Herein, the phrase “the value matches the key” means the value is linked to the key based on the image features, semantic features, and/or spatial features. At the end of S80, by the afore-described features and adjacency matrices, the content of the table can be extracted as the structured data and preserved in XML, which is advantageous for constructing indexes to help searching and for providing quantitative data.
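A minimal sketch of preserving the grouped content as XML with Python's standard library follows; the `<table>/<row>/<cell>` schema and the use of a `key` attribute are illustrative choices, not prescribed by the method:

```python
import xml.etree.ElementTree as ET

def table_to_xml(rows):
    """Serialise grouped table content as XML.

    rows is a list of rows, each row a list of (key, value) pairs
    taken from the key-value relationships among the cells. One
    <row> element is emitted per row and one <cell> per pair, with
    the key stored as an attribute and the value as text.
    """
    table = ET.Element("table")
    for cells in rows:
        row = ET.SubElement(table, "row")
        for key, value in cells:
            cell = ET.SubElement(row, "cell", key=key)
            cell.text = value
    return ET.tostring(table, encoding="unicode")
```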
[0046] The electronic embodiments disclosed herein may be implemented using computing devices, computer processors, or electronic circuitries including but not limited to application specific integrated circuits (ASIC), field programmable gate arrays (FPGA), and other programmable logic devices configured or programmed according to the teachings of the present disclosure. Computer instructions or software codes running in the computing devices, computer processors, or programmable logic devices can readily be prepared by practitioners skilled in the software or electronic art based on the teachings of the present disclosure.
[0047] All or portions of the electronic embodiments may be executed in one or more computing devices including server computers, personal computers, laptop computers, mobile computing devices such as smartphones and tablet computers.
[0048] The electronic embodiments include computer storage media having computer instructions or software codes stored therein which can be used to program computers or microprocessors to perform any of the processes of the present invention. The storage media can include, but are not limited to, floppy disks, optical discs, Blu-ray Disc, DVD, CD-ROMs, and magneto-optical disks, ROMs, RAMs, flash memory devices, or any type of media or devices suitable for storing instructions, codes, and/or data.
[0049] Various embodiments of the present invention also may be implemented in distributed computing environments and/or Cloud computing environments, wherein the whole or portions of machine instructions are executed in distributed fashion by one or more processing devices interconnected by a communication network, such as an intranet, Wide Area Network (WAN), Local Area Network (LAN), the Internet, and other forms of data transmission medium.
[0050] The foregoing description of the present invention has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations will be apparent to the practitioner skilled in the art.
[0051] The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to understand the invention for various embodiments and with various modifications that are suited to the particular use contemplated.