DISTRIBUTED COMPUTER SYSTEM FOR DOCUMENT AUTHENTICATION
20220237937 · 2022-07-28
Inventors
Cpc classification
G06V10/762
PHYSICS
G06V10/26
PHYSICS
G06V30/19093
PHYSICS
International classification
G06V30/413
PHYSICS
G06V10/26
PHYSICS
G06V10/762
PHYSICS
Abstract
Methods and distributed computer devices for automatically determining whether a document is genuine. The method involves generating an image of the document, pre-processing of the image to obtain at least one segment of the image with an area of interest and dividing the at least one segment into portions containing single characters and/or combinations of characters. A validation of at least two single characters and/or at least two combinations of characters is performed for each of the single character and/or character combinations for at least two different categories. Score values are created for each category for each validated single character and/or character combination. Feature vectors are created for each single character and/or character combination, with the respective score values for each category as components. The method involves classifying the feature vectors to determine whether the single character or character combination to which the feature vector is associated is genuine.
Claims
1. A method comprising: generating an image of a document to be audited; pre-processing the image to obtain at least one segment of the image with an area of interest; dividing the at least one segment into portions containing single characters and/or combinations of characters; performing a validation of a plurality of single characters and/or a plurality of combinations of characters, wherein the validation is carried out for each of the single character and/or character combinations for at least two different categories; creating score values for each category for each validated single character and/or each character combination; creating feature vectors for each single character and/or each character combination, wherein components of the feature vectors are the score values for the single character and/or the character combination for each respective category; and classifying the feature vectors to determine whether the single character or the character combination to which the feature vector is associated is genuine.
2. The method of claim 1, wherein the validation of the single characters and/or the character combinations and/or associated scoring of values for each category and/or the classification of the feature vectors is performed using an artificial neural network.
3. The method of claim 1, wherein the at least two different categories used for the validation of each single character and/or each character combination include font, overlay, background and foreground, font alignment, readability, completeness, usage of artificial filters, steganographic manipulation, or a combination thereof.
4. The method of claim 3, wherein validation of each single character and/or each character combination according to the background and foreground category comprises a bonding analysis of a character in a portion.
5. The method of claim 3, wherein the validation of each single character and/or each character combination according to the font alignment category comprises obtaining a distance between two adjacent characters and/or two adjacent combinations of characters.
6. The method of claim 3, wherein the validation of each single character and/or each character combination according to the artificial filter category comprises passing each character through an analysis dedicated to the identification of manipulation caused by artificial filter use.
7. The method of claim 3, wherein the validation of each single character and/or each character combination according to the steganographic manipulation category comprises an error level analysis applied to the document, wherein the error level analysis comprises a comparison of the image with a compressed version of the image.
8. The method of claim 1, wherein the classification involves a cluster analysis, wherein a single-character cluster analysis is performed for each feature vector associated with a corresponding single character, a multi-character cluster analysis is performed for feature vectors associated with a plurality of characters, and a document-wide cluster analysis is performed for all feature vectors associated with the characters of the document.
9. The method of claim 8, wherein the single-character cluster analysis comprises obtaining a similarity indication between at least two feature vectors associated with single characters, wherein, when the similarity indication obtained violates a defined threshold, the single character associated with the corresponding dissimilar feature vector is considered to be non-genuine; and wherein the multi-character cluster analysis comprises obtaining a similarity indication between at least two feature vectors associated with a combination of characters, wherein, when the similarity indication obtained violates a defined threshold, the plurality of characters associated with the corresponding dissimilar feature vector are considered to be non-genuine.
10. The method of claim 8, wherein the document-wide cluster analysis comprises obtaining a similarity indication between a feature vector associated with a single character or a combination of characters, and an aggregated mean feature vector associated with all characters in the entire document, wherein, when the similarity indication obtained violates a defined threshold, the single character associated with the corresponding dissimilar feature vector is considered to be non-genuine.
11. The method of claim 10, wherein obtaining the similarity indication comprises calculating a cosine similarity, wherein calculating the cosine similarity comprises calculating a dot product between at least two feature vectors and a magnitude of the at least two feature vectors.
12. The method of claim 11, wherein the defined similarity threshold lies between 0 and 1, and the threshold is violated if the similarity indication is equal to or lower than the defined similarity threshold.
13. A computer device comprising: at least one processor; and at least one non-volatile memory comprising executable instructions that, when executed by the at least one processor, cause the at least one processor to: generate an image of a document to be audited; pre-process the image to obtain at least one segment of the image with an area of interest; divide the at least one segment into portions containing single characters and/or combinations of characters; perform a validation of at least two single characters and/or at least two combinations of characters, wherein the validation is carried out for each of the single character and/or the character combinations for at least two different categories; create score values for each category for each validated single character and/or each validated character combination; create feature vectors for each single character and/or character combination, wherein components of the feature vectors are the score values for the single character and/or the character combination for each respective category; and classify the feature vectors to determine whether the single character or the character combination to which the feature vector is associated is genuine.
14. The computer device of claim 13, wherein the validation of the single characters and/or the character combinations and/or associated scoring of values for each category and/or the classification of the feature vectors is performed using an artificial neural network.
15. A computer program product comprising: a non-transitory computer-readable storage medium including program code instructions, wherein the program code instructions comprise: generate an image of a document to be audited; pre-process the image to obtain at least one segment of the image with an area of interest; divide the at least one segment into portions containing single characters and/or combinations of characters; perform a validation of at least two single characters and/or at least two combinations of characters, wherein the validation is carried out for each of the single character and/or the character combinations for at least two different categories; create score values for each category for each validated single character and/or each validated character combination; create feature vectors for each single character and/or each character combination, wherein components of the feature vectors are the score values for the single character and/or character combination for each respective category; and classify the feature vectors to determine whether the single character or the character combination to which the feature vector is associated is genuine.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0042] Examples of the invention are now described, also with reference to the accompanying drawings.
[0062] The drawings and the description of the drawings are of examples of the invention and are not of the invention itself. Like reference signs refer to like elements throughout the following description of examples.
DETAILED DESCRIPTION
[0063] An example of a mobile device 1, in this example a mobile phone, which is scanning a receipt 2 is illustrated by
[0064] A flow chart illustrating activities of the method of automatically auditing a document to determine whether the document is genuine, is illustrated by
[0065] In an activity 200, an image of the document to be audited is generated. In a subsequent activity 201, the image is pre-processed to obtain at least one segment of the image with an area of interest. In a subsequent activity 202, the at least one segment is divided into portions containing single characters and/or combinations of characters. In a subsequent activity 203, a validation of at least two single characters and/or at least two combinations of characters is performed. The validation is carried out for each of the single characters and/or character combinations for at least two different categories. In a next activity 204, score values are created for each category for each validated single character and/or character combination. In a subsequent activity 205, feature vectors for each single character and/or character combination are created. The components of these feature vectors are the score values for the single character and/or character combination for the respective category. In a subsequent activity 206, the feature vectors are classified to determine whether the single character or character combination with which the feature vector is associated is genuine.
[0066] Two different examples for an auditing result are illustrated by
[0067] Examples of receipt features that are analysed and possible audit recommendations following from the analysis are illustrated by
[0068] Three different examples of manipulated and/or altered receipts 21, 22, 23 are shown in
[0069] Three examples of degraded receipts are shown in
[0070] A variety of categories, dependent on which single characters and/or combinations of characters from an amount field 11 of a receipt 2 are validated are shown in
[0071] An example for identifying areas of interest on a receipt is illustrated by
[0072] A schematic block diagram of an example for a process flow from text localization to the presentation of audit results is provided by
[0073] This analysis results in a score value for each category. In the font category 10, there is a score value 40 with the assumed value 0. For the overlay category 101, the corresponding score value 50 assumes the value 0. In the background (colour) vs foreground (colour) category 102, the corresponding score value 60 also assumes the value 0. The validation in the OCR readability category 104 results in a corresponding score value 70 with a value of 1. A corresponding score value 75 in the completeness category 105 has a value of 1. Finally, the validation in the artificial filter category 106 yields a corresponding score value 80 with a value of 0. The validation and scoring performed in this first step may be a single character validation. The overall result is a feature vector 160 with the respective score values 40, 50, 60, 65, 70, 75, 80 for each category as its components. The feature vector of this example has the following shape in component representation: (0,0,0,0,1,1,0).
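The assembly of the score values into a feature vector can be sketched as follows. This is a minimal illustration, not the implementation of the described system: the category names, their order, and the function name `build_feature_vector` are hypothetical, and the assignment of the fourth component (score value 65) to a font alignment category is an assumption, since the paragraph above lists seven score values but names only six categories.

```python
from typing import Dict, Tuple

# Assumed category order; the names are illustrative, not taken from the source.
CATEGORY_ORDER = (
    "font",
    "overlay",
    "background_foreground",
    "font_alignment",  # assumption: the category behind score value 65
    "ocr_readability",
    "completeness",
    "artificial_filter",
)


def build_feature_vector(scores: Dict[str, int]) -> Tuple[int, ...]:
    """Arrange per-category score values into a fixed-order feature vector."""
    missing = [c for c in CATEGORY_ORDER if c not in scores]
    if missing:
        raise KeyError(f"missing score values for: {missing}")
    return tuple(scores[c] for c in CATEGORY_ORDER)


# Score values from the example above (the font_alignment value is assumed).
example_scores = {
    "font": 0,
    "overlay": 0,
    "background_foreground": 0,
    "font_alignment": 0,
    "ocr_readability": 1,
    "completeness": 1,
    "artificial_filter": 0,
}

feature_vector = build_feature_vector(example_scores)
print(feature_vector)  # (0, 0, 0, 0, 1, 1, 0)
```

Fixing the component order once means that vectors of different characters can be compared component by component in the later classification step.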
[0074] Either based on the scores for each character, word aggregate scores (mean) 33 are calculated in the example illustrated by
[0075] In the example of
[0076] An example for a concrete scoring of concrete passages on the receipt 2 is provided by
[0077] A plurality of feature vectors resulting from a single character and a combination of character analysis of the amount field 11 with the entry “€ 0.25” is schematically illustrated in
[0078] An example for a validation of single characters in the pricing field 11 in the background and foreground (bonding) category 102 (see
[0079] The background and foreground analysis/prediction 120 as shown in
[0080] An example for a validation of single characters in the pricing field 11 in the artificial filter category 106 (see
[0081] An example for a validation of single characters in the pricing field 11 in the character distance category 140 is shown in
[0082] A set of feature vectors 150 to 155 obtained from a pricing field 11′ are shown in
[0083] The first feature vector 150 is associated with the entire character combination “€ 0.35”, including a manipulated character “3”, that should actually read “2” (see
[0084] The alignment of the feature vectors 151, 152, 153, associated with single characters “€”, “0”, “.” respectively, and of feature vector 150 associated with the untampered character combination “€ 0.25” is illustrated in a feature space (x,y) coordinate system in the lower left corner of
[0085] As depicted in the lower right corner of
[0086] A component representation of feature vectors 170 to 174 associated with combinations of characters 11″, 181 to 184, and the schematic alignment of these vectors in an n-dimensional feature space is illustrated by
[0087] The feature vector 170 is associated with the character combination “€ 0.39” 11″, of which the single characters “3” and “9” are manipulated. The feature vector 171 is associated with the character combination “€ 0” 181, the feature vector 172 is associated with the character combination “0.” 182, the feature vector 173 is associated with the partly tampered character combination “0.3” 183 and the feature vector 174 is associated with the entirely tampered character combination “39” 184.
[0088] Like in
[0089] As can be seen in the feature space representation shown in the lower right corner of
[0090] Examples for feature vectors associated with characters combinations 191, 192, 193 as well as an aggregated mean feature vector associated with all characters in the entire document 190 are depicted in
[0091] As mentioned above, creating the aggregated mean feature vector involves, for example, calculating a mean value of all score values for a particular category to obtain a mean-score value for this category. This mean value is then the corresponding component of the feature vector associated with all characters in the entire document. Alternatively, the single-character and/or multi-character score values are, for example, aggregated by summing up every feature vector component to obtain a respective component of the feature vector associated with all characters.
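Both aggregation variants can be sketched in a few lines. The function names `aggregate_mean` and `aggregate_sum` and the four-component example vectors are hypothetical and chosen only for illustration; they are not values from the figures.

```python
from typing import List, Sequence


def _check_same_length(vectors: Sequence[Sequence[float]]) -> int:
    """Return the common vector length, or raise if the input is unusable."""
    if not vectors:
        raise ValueError("at least one feature vector is required")
    n = len(vectors[0])
    if any(len(v) != n for v in vectors):
        raise ValueError("all feature vectors must have the same length")
    return n


def aggregate_mean(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean: one mean-score value per category."""
    n = _check_same_length(vectors)
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(n)]


def aggregate_sum(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise sum, the alternative aggregation mentioned above."""
    n = _check_same_length(vectors)
    return [sum(v[i] for v in vectors) for i in range(n)]


# Hypothetical score vectors for four characters and four categories.
character_vectors = [
    (0, 0, 1, 1),
    (0, 1, 1, 1),
    (0, 0, 1, 0),
    (0, 1, 1, 0),
]

print(aggregate_mean(character_vectors))  # [0.0, 0.5, 1.0, 0.5]
print(aggregate_sum(character_vectors))   # [0, 2, 4, 2]
```

The mean and the sum differ only by a constant factor, so both point in the same direction in feature space; a purely angle-based comparison such as the cosine similarity would treat them alike.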
[0092] As in
[0093] In the example illustrated by
[0094] A cluster 198 is defined that lies outside the aggregated mean feature vector associated with all characters in the entire document 190. It can be seen in the feature space representation in the lower left corner of
[0095] As can be seen in the depiction in the lower right-hand side of
[0096] Whether or not a feature vector lies within a cluster 168, 188, 198, 169, 189, 199 may depend on the similarity between two feature vectors.
[0097] A schematic flow diagram of an example for a method of calculating such a similarity indication of feature vectors is shown in
[0098] In an activity 400, to find the similarity s between two vectors A = [a_1, a_2, ..., a_n] and B = [b_1, b_2, ..., b_n], the cosine similarity of these two vectors A and B (or more precisely the cosine of the angle θ between the two vectors A and B, which represents the similarity score) is calculated using the following formula:
$$s = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}$$
[0099] wherein ‖A‖ and ‖B‖ correspond to the (Euclidean) l² norms of the vectors A and B, and the similarity score s lies between 0 and 1 (given feature vectors with non-negative components). The calculation of these norms involves the calculation of the dot products A·A and B·B.
[0100] Subsequently, a threshold t is defined in an activity 410, wherein the threshold lies between 0 and 1. Thereafter, in a comparison activity 420, it is checked whether the similarity score s is equal to or smaller than the defined threshold t.
[0101] In response to the comparison activity 420 yielding the result that the similarity score s is equal to or smaller than the threshold t, the feature vectors A and B are considered to be dissimilar in activity 440. In response to the comparison activity 420 resulting in the finding that the similarity score s is greater than the threshold t, the feature vectors A and B are considered to be similar in activity 430.
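Activities 400 to 440 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the direction of the comparison follows claim 12 (the threshold is violated when the similarity is equal to or lower than it), and rejecting zero vectors is a choice made here because the cosine is undefined for them.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b: dot product over product of norms."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))  # square root of the dot product A.A
    norm_b = math.sqrt(sum(y * y for y in b))  # square root of the dot product B.B
    if norm_a == 0 or norm_b == 0:
        # Assumption: an all-zero vector is rejected rather than scored.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def is_dissimilar(a: Sequence[float], b: Sequence[float], threshold: float) -> bool:
    """Per claim 12: the threshold is violated if the similarity is <= threshold."""
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must lie between 0 and 1")
    return cosine_similarity(a, b) <= threshold


def flag_dissimilar(
    vectors: Sequence[Sequence[float]],
    reference: Sequence[float],
    threshold: float,
) -> List[int]:
    """Indices of the vectors that violate the threshold against a reference vector."""
    return [i for i, v in enumerate(vectors) if is_dissimilar(v, reference, threshold)]


# Hypothetical vectors: the third one is orthogonal to the reference.
reference_vector = (0, 0, 1, 1)
candidates = [(0, 0, 1, 1), (0, 0, 2, 2), (1, 1, 0, 0)]
print(flag_dissimilar(candidates, reference_vector, threshold=0.8))  # [2]
```

The reference vector may be another character's feature vector or an aggregated vector for the whole document. Raising the threshold toward 1 makes the similarity criterion more restrictive, so more vectors are flagged.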
[0102] By choosing the threshold value t accordingly, a more or less restrictive similarity criterion can be set. As mentioned above, the similarity score s, corresponding to the cosine of the angle between two feature vectors may define the size of a cluster just as those described in conjunction with
[0103] A diagrammatic representation of an exemplary computer system 500 is shown in
[0104] The computer system 500 includes a processor 502, a main memory 504 and a network interface 508. The main memory 504 includes a user space, which is associated with user-run applications, and a kernel space, which is reserved for operating-system- and hardware-associated applications. The computer system 500 further includes a static memory 506, e.g. non-removable flash and/or solid state drive and/or a removable Micro or Mini SD card, which permanently stores software enabling the computer system 500 to execute functions of the computer system 500. Furthermore, it may include a video display 510, a user interface control module 514 and/or an alpha-numeric and cursor input device 512. Optionally, additional I/O interfaces 516, such as card reader and USB interfaces may be present. The computer system components 502 to 516 are interconnected by a data bus 518.
[0105] In some exemplary embodiments the software programmed to carry out the method described herein is stored on the static memory 506; in other exemplary embodiments external databases are used.
[0106] An executable set of instructions (i.e. software) 503 embodying any one, or all, of the methodologies described above, resides completely, or at least partially, permanently in the non-volatile memory 506. When being executed, process data resides in the main memory 504 and/or the processor 502.