COMPUTER VISION METHOD FOR DETECTING DOCUMENT REGIONS THAT WILL BE EXCLUDED FROM AN EMBEDDING PROCESS AND COMPUTER PROGRAMS THEREOF
20220392240 · 2022-12-08
Assignee
Inventors
- Aruna Prem BIANZINO (Madrid, ES)
- Juan ELOSUA TOME (Madrid, ES)
- Diego PEREZ VIEITES (Madrid, ES)
- Sergio DE LOS SANTOS VILCHEZ (Madrid, ES)
Cpc classification
G06V10/273
PHYSICS
G06T1/0028
PHYSICS
G06V30/414
PHYSICS
G06V10/48
PHYSICS
G06T2201/0065
PHYSICS
G06V30/413
PHYSICS
International classification
Abstract
A method and computer programs for detecting document regions that will be excluded from a watermark embedding process are disclosed. The method comprises converting, by an adapter module, at least one page of a received document into a visual representation thereof, the visual representation keeping the position of the characters of the at least one page; receiving, by a text detector, the visual representation; processing, by the text detector, the visual representation using one or more artificial intelligence algorithms, and returning a list of invalid regions with their associated page positions as a result, wherein each invalid region of the list of invalid regions may have associated thereto a confidence score; and using, by a watermark embedding module or by a watermark extracting module, the list of invalid regions to provide a watermarked document or a message embedded in the document.
Claims
1. A method for detecting document regions that will be excluded from an embedding process, comprising: converting, by an adapter module, at least one page of a received document into a visual representation thereof, the visual representation keeping the position of the characters of the at least one page; receiving, by a text detector, the visual representation; processing, by the text detector, the visual representation using one or more artificial intelligence algorithms, and returning a list of invalid regions with their associated page positions as a result; and using, by a watermark embedding module or by a watermark extracting module, the list of invalid regions to provide a watermarked document or a message embedded in the document.
2. The method of claim 1, wherein the document comprises a digital or a digitalized document, and wherein each invalid region of the list of invalid regions has associated thereto a confidence score.
3. The method of claim 1, wherein the visual representation comprises an image.
4. The method of claim 3, wherein the image is a binary image where pixels associated to the characters are represented using a first color and where pixels associated to background are represented using a second color.
5. The method of claim 4, wherein the first color is white, and the second color is black.
6. The method of claim 1, wherein the at least one page is in image format, and wherein the converting step further comprises correcting a skew of the page by means of using a recognition algorithm.
7. The method of claim 6, wherein the recognition algorithm comprises a Hough Transform.
8. The method of claim 6, wherein before performing the correcting step, the method comprises preprocessing the at least one page by means of applying a Canny Edge Detection algorithm and different morphological operations on the at least one page.
9. The method of claim 8, further comprising resizing the at least one page to a lower resolution.
10. The method of claim 6, further comprising obtaining the position of the characters by means of using an image-based segmentation algorithm, the latter comprising a projection-based algorithm.
11. The method of claim 1, wherein the artificial intelligence algorithms comprise a deep neural network, and optionally a supervised learning algorithm.
12. The method of claim 1, wherein the list of invalid regions comprises a set of boxes where each box is defined by: an ‘x’ value indicating a left-up corner of the box in a horizontal coordinate, a ‘y’ value indicating a left-up corner of the box in a vertical coordinate, a ‘w’ value indicating a width of the box in pixels, and an ‘h’ value indicating a height of the box in pixels.
13. The method of claim 11, wherein the deep neural network is based on a detection-based architecture.
14. The method of claim 11, wherein the deep neural network is based on a segmentation-based architecture
15. A non-transitory computer-readable medium containing computer instructions stored therein for causing a computer processor to perform a method according to claim 1.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0031] The previous and other advantages and features will be more fully understood from the following detailed description of embodiments, with reference to the attached figures, which must be considered in an illustrative and non-limiting manner, in which:
[0032]
[0033]
[0034]
[0035]
[0036]
[0037]
[0038]
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
[0039]
[0040] Similarly,
[0041] That is, the adapter module 18 is used to adapt the document 16 to the required input for the text detector 12. In some embodiments, the input (i.e., the visual representation 10) can be an image format, and the content of this image can be any kind of information that represents the position of the words. It can be the visual page as is or can be another type of representation.
[0042] In an embodiment, the use of abstract maps is proposed as visual representation 10. Particularly, these abstract maps are binary images with background to black pixels and words represented as white rectangles.
[0043]
[0044] When the page is in image format, there are multiple manners to obtain the position of the words using an image-based segmentation approach 34. The OCR methods usually return the position of the words, but in this embodiment a method based on projections is preferably proposed since it gives better time responses. Before applying the method based on projections the skew 32 of the image is corrected using a recognition algorithm, for example the Hough Transform.
[0045]
g(r,c)=G(r,c)*l(r,c),
[0046] where the Gaussian filter G is defined as:
[0047] The Gradient of g(r,c) is computed as:
M(r,c)=√{square root over (g.sub.r.sup.2(r,c)+g.sub.c.sup.2(r,c))},
and
θ(r,c)=tan.sup.−1[g.sub.c(r,c)+g.sub.r(r,c)].
[0048] A threshold T is applied to M(r,c):
[0049] where T is chosen so that all edge elements are kept while most of the noise is suppressed.
[0050] Then, the method suppresses non-maxima pixels in the edges in the MT(r,c) obtained above to thin the edge ridges. To do so, the method checks whether each non-zero MT (r,c) is greater than its two neighbors along the gradient direction θ(r, c). If so, MT (r,c) is unchanged, otherwise, is set to 0.
[0051] The previous result can be thresholded by two different thresholds ζ.sub.1 and ζ (where ζ.sub.1<ζ.sub.2) to obtain two binary images T.sub.1 and T.sub.2. Note that T.sub.2 with greater ζ.sub.2 has less noise and fewer false edges but greater gaps between edge segments, when compared to T.sub.1 with smaller ζ.sub.1.
[0052] The method links edge segments in T.sub.2 to form continuous edges. To do so, the method traces each segment in T.sub.2 to its end and then searches for its neighbors in T.sub.1 to find any edge segment in T.sub.1 to bridge the gap until reaching another edge segment in T.sub.2. The output of the Canny Edge Detection algorithm 40 is denoted here as I.sup.c(r,c).
[0053] The next steps are morphological operations in order to join and fill the gaps between chars and words and trying to convert text lines into straight lines. This is done at first with a dilate operation 42, and later with an erode operation 44. These morphological operations provides a cleaner image I.sup.m(r,c) with straight lines easier to be processed with the Hough Transform.
[0054] As the Hough Transform can be slow, the image is preferably resized 46 to work with a lower resolution image (75 dpi works well). This allows achieving the correct angle to de-skew the image at the same time that reduces the time required to get the result.
[0055] The Hough transform 50 is applied over the resized image in order to detect the angles of the text lines. The Hough Transform 50 converts the Euclidian space to the Hough Space. The Hough Space is a 2D plane that has a horizontal axis representing the slope and the vertical axis representing the intercept of a line on the edge image. A line on an edge image is represented in the form of y=ax+b (Hough, 1962). One line on the edge image produces a point on the Hough Space since a line is characterized by its slope and intercept b. Usually the Hough Space uses a different form of representing the straight lines called normal lines that passes through the origin and perpendicular to that straight line. The form of the normal line is ρ=x cos(θ)+y sin(θ) where ρ is the length of the normal line and θ is the angle between the normal line and the x axis. In this way each edge point (x.sub.i, y.sub.i) now generates a cosine curve in the Hough Space instead of a straight line. One line on the edge image still produces a point on the Hough Space. The lines are detected looking for these points in the Hough Space, for this task a threshold Th should be set. If the threshold Th is low then the number of lines obtained will be high, and it could be difficult to determine the more probable direction of the real lines. To avoid this problem, a recursive strategy is proposed. The threshold is set to a high value, if the lines found are less than 10, the process is repeated with a new threshold calculated as Th_i=¾*Th_{i−1}, where i indicates the number of the iteration. Other values different to 10 can be used, but it has been found empirically that this number works really well. When the number of lines is higher than 10, the process is stopped. The stop condition 62 also needs to avoid infinite loops (it is possible to find less than 10 lines in a page), so the method only allows to repeat the process a finite number of times.
[0056] In a normal case, after applying the Hough Transform 50, almost ten lines are obtained. The method obtains the vector of angles θs of these lines and calculates the histogram of these angles. The maximum of the histogram 54 is taken as the estimated angle (skew angle) θf.
[0057] The de-skew 60 is made just taking −θf and rotating the image. After the rotation, the obtained image Irot(r′,c′) has more width and height than the input image. To correct this distortion the crop operation is applied over the image. An easy way to do that is to calculate the new origin point (r0, c0) as:
[0058] where R′ and C′ are the number of rows and columns of the image Irot(r′,c′), and R and C are the number of rows and columns of the input image I(r,c). The crop is done over the image Irot(r′,c′) in the region delimited by the rows [r.sub.0, R] and the columns [c.sub.0, C].
[0059] After this crop operation the output of the skew correction module 32 is obtained; i.e. an image I′(r,c) where the rotation of the page was corrected and with the same size of the original image. This allows to the image-based segmentation achieve a high accuracy.
[0060] The performance of this skew correction module 32 was compared with other methods of the state of the art, such as the Leptonica method (Dan S. Bloomberg et al. Measuring document image skew and orientation. In Document Recognition II (pp. 302-316). SPIE), showing better performance in the precision of the angle obtained, the time required to make the estimation and the range of angles that the algorithm can correct.
[0061] After correcting the skew of the image, image-based segmentation 34 can be applied. As explained before, a method based on projections can be used. The method starts by binarizing the bitmap format document I′(r,c) and summing the pixels in horizontal direction, the horizontal projection Ph(r) is obtained as:
Ph(r)=Σ.sub.cI.sup.b(r,c),
[0062] where the binarized image is denoted as I.sup.b(r,c), r is the r-th row, c is the c-th column, and I(r,c) is the binarized document image.
[0063] By setting a proper threshold value, the bitmap format document can be segmented in text lines, finding optimal binarization thresholds from the image histogram. Once the projections are computed, and text lines are identified, each text line is segmented. For each line, its vertical projection Pvi(c) corresponds to:
P.sub.vi(C)=Σ.sub.rI.sup.b(r,c),
[0064] where, .sub.rI.sub.i.sup.b (r,c) is the binarized document image cropped to the i-th line.
[0065] Applying the same technique used to isolate text line using the horizontal projection Ph(r), spaces in each line can be identified and measured applying a proper threshold to the vertical projection Pvi(c).
[0066] The output of the segmentation 34 is a set of word lengths and a set of their corresponding locations in the document, denoted as set W and set L respectively 38.
[0067] If the document is in vectorial format, the system can read the information from the Page Description Language (PDL) based on the standard ISO 32000 known as PDF using a PDF interpreter 36. A PDF document consists of a collection of objects that together describe the appearance of one or more pages, possibly accompanied by additional interactive elements and higher-level application data. The system reads the trailer section of the file, gets the root information, and then reads the pages object. Inside of page objects the contents objects of the page can be found. The contents may have text objects that consist of operators that can show text strings, move the text position, and set text state and certain other parameters. The system is able to read and interpret all text state operators, text-positioning operators, and text-showing operators in order to obtain the set of words lengths and a set of their corresponding locations in the document, denoted as set W and L, respectively. In this case, the operation results faster. The values W and L 38 are used to create a (binary) map image M(r, c) 10 in the binary image creation module 39 where the pixels of the regions associated with the location L and word lengths W are set to white, and the rest of the pixels are set to black. L is a vector of pixel positions {(Ix1,Iy1), (Ix2, Iy2) . . . (Ix_n,Iy_n)} indicating the left-bottom corner of the word, and W is a vector of pair of values indicating the width and high of a word {(w1,h1), (w2,h2) . . . (w_n,h_n)} where the length of the vector L and W are the same since each pair is related to a word of the page. The map M(r,c) is created with the same size as the input image I(r,c) and with all their pixels with value zero (black pixels). Then, for each value i⊂[1,n] the regions delimited by [L.sub.i,L.sub.i+W.sub.j] are set to white (this is the same as the region [(Ix.sub.i,Iy.sub.i),(w.sub.i,h.sub.i)] for all i⊂[1,n]). An example of the input map can be seen on
[0068] Once the input to the text detector 12 is generated, in an embodiment, a deep neural network is used to predict in the inference process the regions corresponding to the invalid text. Particularly, the use of a Feature Pyramid Network (FPN) is proposed, although other networks can be equally used such as a Single Shot MultiBox Detector (SSD), among others. The FPN network works well with different scale objects, which can be useful for the data processed here, where the size of the object depends on the size of the list, table, etc. Of course, these models have to be trained to learn the weights of the network before being applied in the inference step. This is done using supervised learning, so the data used should be annotated in previous steps.
[0069] A possible implementation of the invention can be based on the use of the Focal Loss and the use of Transfer Learning tasks, using the COCO dataset as a reference. The Focal loss prevents the vast number of easy negatives from overwhelming the text detector 12 during training. The invention can also use the Efficientnet or the Mobilenet architectures as a backbone network in order to reduce the computation terms.
[0070] The output of the network will be the set of invalid regions 14 (a set of boxes representing invalid regions) B that represents multiple detections, where each box is defined by {x,y, w, h} values and x indicates the left-up corner in the horizontal coordinate, y indicates the left-up corner of the box in the vertical coordinate, w indicates the width in pixels, and h indicates the height of the box in pixels. The output also returns a score associated to each box. This score indicates the confidence level of the detection (
[0071] It should be noted that the problem can be solved as an object detection or an image segmentation problem. In this case, particularly, the detection object scenario is considered. Image segmentation is a further extension of object detection in which the method marks the presence of an object through pixel-wise masks generated for each object in the image. This technique is more granular than bounding box generation because this can help in determining the shape of each object present in the image. This granularity helps in various fields such as medical image processing, satellite imaging, etc., but it is less relevant for the tasks related here, where rectangle bounding boxes are enough to cover the goals of the system since we have corrected the geometric distortion of the pages.
[0072] One of the most important contributions of present invention is when these predictions are used as part of a digital text watermarking process based on shifts of the text elements. The solution may also be used for detecting specific objects in text documents, like tables etc., similarly, to state of the art algorithms, but with much faster processing time. The solution can be adapted for different algorithms as unwanted areas are labelled correctly in the train step of the neural network. A specific adaptation can be made using the watermarking method described in EP3477578B1, of the same authors of present invention.
[0073] With regard to
[0074] The present invention has been described in particular detail with respect to specific possible embodiments. Those of skill in the art will appreciate that the invention may be practiced in other embodiments. For example, the nomenclature used for components, capitalization of component designations and terms, the attributes, data structures, or any other programming or structural aspect is not significant, mandatory, or limiting, and the mechanisms that implement the invention or its features can have various different names, formats, and/or protocols. Further, the system and/or functionality of the invention may be implemented via various combinations of software and hardware, as described, or entirely in software elements. Also, particular divisions of functionality between the various components described herein are merely exemplary, and not mandatory or significant. Consequently, functions performed by a single component may, in other embodiments, be performed by multiple components, and functions performed by multiple components may, in other embodiments, be performed by a single component.
[0075] Certain aspects of the present invention include process steps or operations and instructions described herein in an algorithmic and/or algorithmic-like form. It should be noted that the process steps and/or operations and instructions of the present invention can be embodied in software, firmware, and/or hardware, and when embodied in software, can be downloaded to reside on and be operated from different platforms used by real-time network operating systems.
[0076] The scope of the present invention is defined in the following set of claims.