Efficient use of training data in data capture for Commercial Documents
20240005689 · 2024-01-04
CPC classification
G06V30/18143 (PHYSICS)
G06V30/414 (PHYSICS)
G06V30/19147 (PHYSICS)
International classification
G06V30/414 (PHYSICS)
Abstract
An automated method is described for capturing data from electronic images of commercial documents such as invoices, bills of lading, explanations of benefits, etc. An optimal mapping is defined between the fields of interest in an image of a page of a document and the corresponding fields of a pre-trained image of a page of a similar document. This mapping allows automatic, precise extraction of data from the fields of interest in an image regardless of the distortions the image is subjected to during scanning.
Claims
1. A method of automatic data capture from commercial documents such as invoices, wherein an input image and a training image originate from the same source and the values and locations of fields of interest are known for said training image, using a computer performing the steps of:
automatically obtaining the salient features of the training document image, said features consisting of words, their lengths, their constituent characters, and positions of geometric lines in the training document image, said lines being horizontal and vertical;
automatically obtaining said features in the input document image;
calculating the optimal correspondence between lines in the training image and the input image, said lines being horizontal and vertical;
defining and calculating distances between horizontal and vertical lines and fields of interest in the training image, and using these distances to calculate the positions of the fields of interest in the input image;
defining and calculating combinations of geometric and string distances between words of the training image and the input image;
automatically mapping the words and fields of the training image onto words and fields of the input image, providing an optimal assignment of these words and fields;
automatically capturing the words of interest in input images; and
automatically capturing the single-word fields of interest in the input image.
2. A method according to claim 1 of automatic data capture for multi-word fields, according to which the coordinates of multi-word fields in an image are calculated using a computer performing the steps of:
computing the coordinates of single-word fields in the image by using corresponding coordinates of the single-word fields in the training image;
computing the displacement of the input image relative to the training image; and
applying said displacement to the coordinates of multi-word fields in the training image to obtain the coordinates of the multi-word fields in the input image.
3. (canceled)
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
[0011] In what follows, the system operates with two images from the same source: the image I from which the data is to be captured and the image T on which the system has been trained and has learned all the data of interest.
[0012] The first step according to the preferred embodiment of the present invention is, for any image I of a page of the document, to find all the words in that image together with their bounding rectangles and their OCR identities. The geometric distance between a word W captured in the training image T and a candidate word w in the image I is defined as
GeoDist(W,w)=|x1-x3|+|y1-y3|+|x2-x4|+|y2-y4|,
where (x1, y1) and (x2, y2) are the Cartesian coordinates of the upper-left corner and the lower-right corner of word W, and (x3, y3) and (x4, y4) are the corresponding coordinates of the corners of word w. Combining this geometric distance with a string (edit) distance StringDist(W,w) between the recognized characters of the two words gives the overall word distance
WordDistance(W,w)=u GeoDist(W,w)+v StringDist(W,w),
for each W and for each candidate word w, where u and v are appropriate weights. So, if there are k fields/words W captured in image T, k different distances are used.
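The two distances above can be sketched in a few lines of Python. This is an illustrative sketch only: the box representation (x1, y1, x2, y2), the edit-distance implementation, and the default weight values u and v are assumptions, not taken from the specification.

```python
def geo_dist(W, w):
    """L1 distance between corresponding corners of two bounding boxes.

    Each box is (x1, y1, x2, y2): upper-left and lower-right corners.
    """
    return sum(abs(a - b) for a, b in zip(W, w))

def string_dist(s, t):
    """Standard string (edit/Levenshtein) distance between two OCR'd words."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def word_distance(W_box, W_text, w_box, w_text, u=1.0, v=10.0):
    """WordDistance(W,w) = u*GeoDist(W,w) + v*StringDist(W,w)."""
    return u * geo_dist(W_box, w_box) + v * string_dist(W_text, w_text)
```

The weights u and v trade off geometric proximity against textual similarity; their tuning is application-dependent.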
[0013] Once the distance between W and w has been defined, a matrix of pair-wise distances WordDistance(W,w) is obtained for pairs of words (W,w) in the two images I and T. The preferred embodiment of the present invention utilizes assignment algorithms that calculate the optimal correspondence/mapping of words (W,w) (matching in the sense of the shortest distance) based on the distance described above. Assignment algorithms are described in R. Burkard, M. Dell'Amico, S. Martello, Assignment Problems, SIAM, 2009, incorporated by reference herein. The net result of this mapping is the captured set of fields in image I: the desired subset X of words w that is in one-to-one correspondence with the words W.
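The optimal assignment over the distance matrix can be illustrated as follows. A production system would use a strongly polynomial assignment (Hungarian-type) algorithm as cited above; the brute-force search over permutations shown here is only a small self-contained stand-in for illustration.

```python
from itertools import permutations

def optimal_assignment(dist):
    """Given a k x n distance matrix (k <= n) of WordDistance values
    between k training words W and n candidate words w, return the
    tuple of column indices assigning each W to a distinct w so that
    the total distance is minimized."""
    k, n = len(dist), len(dist[0])
    return min(permutations(range(n), k),
               key=lambda cols: sum(dist[i][c] for i, c in enumerate(cols)))
```

For the matrix [[1, 5, 3], [4, 1, 9]] this returns (0, 1): the first training word maps to candidate 0 and the second to candidate 1, for a total distance of 2.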
[0014] If the same two permanent legends K and k can be found automatically and correlated in images I and T (such as the unique words Invoice Number in both of them), then in another embodiment of the present invention it may be sufficient to calculate the displacements of all the words W relative to K and apply the same displacements to find the words X relative to the legend k. It is not always possible to find permanent legends in images, since they can be printed in a very noisy fashion, printed negatively, or obscured by lines or other obstacles. However, the images I and T are most frequently shifted as a whole relative to one another, producing largely the same displacement of the fields of interest in the two images. This circumstance also allows an independent verification of the results of the assignment method described above. The assignment algorithm runs in strongly polynomial time, making it an efficient method of using learning for data capture. If the displacement can be estimated from K and k, only the words w having approximately the same displacement would participate in the calculations.
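The legend-based displacement check can be sketched as follows. The box format (x1, y1, x2, y2) and the tolerance value are assumptions for illustration, not part of the specification.

```python
def displacement(box_K, box_k):
    """Shift of legend k in image I relative to legend K in image T,
    measured between the upper-left corners of their bounding boxes."""
    return (box_k[0] - box_K[0], box_k[1] - box_K[1])

def candidates_with_similar_shift(W_box, candidate_boxes, shift, tol=15):
    """Keep only candidate words w whose displacement relative to the
    training word W approximately equals the global shift K -> k."""
    dx, dy = shift
    return [w for w in candidate_boxes
            if abs((w[0] - W_box[0]) - dx) <= tol
            and abs((w[1] - W_box[1]) - dy) <= tol]
```

Restricting the candidates in this way both verifies the assignment result and shrinks the distance matrix the assignment algorithm must process.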
[0015] A modification of this method would utilize the same word distance as defined above, but with the standard string (edit) distance between the legends K and k, to arrive at the optimal correspondence of legends even if some of them are corrupted or only partially recognizable. This optimal correspondence of legends immediately allows the calculation of the displacement vector s between the images I and T, since all the legends and the corresponding fields are typically shifted in unison, barring more severe non-linear distortions that are rarely observed outside of fax images. In essence, this is a process of automatic registration of images. If the scanning process is sufficiently accurate, only vertical and horizontal shifts will be present, so that the application of the displacement vector s is sufficient. If skew or more severe affine distortions are present, this method applied to three or more legends will provide the parameters of the full affine transformation that converts the coordinates of the fields in image I to the coordinates of corresponding fields in image T. The application of the assignment algorithm with WordDistance as defined above to all the pairs of training image fields of interest and all the candidate words in image I transformed via the displacement vector s (or affine transformed if need be) will result in the capture of all the fields of interest in the image I.
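Recovering the affine transformation from three legend correspondences can be sketched as below. The pure-Python 3x3 solver is used only for self-containment; a real system would use a linear-algebra library and, for robustness, a least-squares fit over more than three legends.

```python
def _solve3(A, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(3):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [m - f * p for m, p in zip(M[r], M[col])]
    return [M[i][3] / M[i][i] for i in range(3)]

def affine_from_legends(src, dst):
    """Given three (x, y) legend positions in one image (src) and their
    matches in the other image (dst), return (a, b, c, d, e, f) such that
    x' = a*x + b*y + c and y' = d*x + e*y + f."""
    A = [[x, y, 1.0] for x, y in src]
    a, b, c = _solve3(A, [x for x, _ in dst])
    d, e, f = _solve3(A, [y for _, y in dst])
    return a, b, c, d, e, f
```

In the pure-translation case the recovered parameters reduce to the identity matrix plus the displacement vector s, consistent with the simpler shift-only embodiment.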
[0016] Some fields of interest are multi-word fields, such as addresses. The coordinates and extents of such fields are precisely known in the image T. Typically, the printing program allocates a fixed amount of real estate to each address. Once the correspondence of single-word fields has been established, it is possible to calculate the displacement of all multi-word fields in I relative to the corresponding fields in the image T and thus capture them accurately.
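Transferring a multi-word field from the training image to the input image via the displacement of the already-matched single-word fields can be sketched as follows; the box format and the use of an average shift are illustrative assumptions.

```python
def mean_shift(matched_pairs):
    """Average displacement over matched (training_box, input_box) pairs,
    measured between upper-left corners of the bounding boxes."""
    n = len(matched_pairs)
    dx = sum(w[0] - W[0] for W, w in matched_pairs) / n
    dy = sum(w[1] - W[1] for W, w in matched_pairs) / n
    return dx, dy

def shift_box(box, shift):
    """Apply the displacement to a multi-word field box from image T to
    estimate its location in image I."""
    dx, dy = shift
    x1, y1, x2, y2 = box
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)
```

Because the printing program allocates a fixed region to each address, shifting the training-image box by the common displacement is sufficient to locate the whole multi-word field.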
[0017] All geometric lines are known in the training image T, including those that potentially border the fields of interest. The lines in the image I corresponding to the lines in the image T can be used to provide the positions of fields in the image I. Horizontal and vertical geometric line distances, and the optimal correspondence of these lines in two images, were defined in U.S. Pat. No. 8,831,361 B2, which is incorporated by reference herein. While there are several ways to define distances between geometric lines, any good distance will provide a suitable measure of proximity between lines. In images with close layouts the corresponding distances between the lines bordering fields in the images I and T are designed to be the same; therefore the knowledge of these distances in T provides the knowledge of the corresponding distances in I, thus providing the positions of the sought fields. Namely, a distance between a horizontal line and a word can be defined as the vertical distance between the upper-left corner of the bounding box of the word and the ordinate of the horizontal line. Similarly, a distance between a vertical line and a word can be defined as the horizontal distance between the upper-left corner of the bounding box of the word and the abscissa of the vertical line. Measuring these distances in the image T provides estimates of the corresponding distances in the image I.
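The line-to-word distances just described can be sketched as follows; the box format and the corner-based estimate are illustrative assumptions.

```python
def dist_to_hline(word_box, line_y):
    """Vertical distance from the word's upper-left corner to a
    horizontal line at ordinate line_y."""
    return word_box[1] - line_y

def dist_to_vline(word_box, line_x):
    """Horizontal distance from the word's upper-left corner to a
    vertical line at abscissa line_x."""
    return word_box[0] - line_x

def estimate_corner(hline_y, vline_x, d_h, d_v):
    """Estimated upper-left corner of a field in image I, given the
    corresponding lines found in I and the distances d_h, d_v measured
    from the field to those lines in the training image T."""
    return (vline_x + d_v, hline_y + d_h)
```

Measuring d_h and d_v once in the training image T and re-applying them to the matched lines in the input image I yields the expected position of the field even when the page as a whole has shifted.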