SYSTEM AND METHOD FOR AUTOMATICALLY TAGGING DOCUMENTS

20230028664 · 2023-01-26

    Abstract

    System and methods (100) for automatically tagging electronic documents are disclosed. An input module receives (102) an electronic document to be tagged. A preprocessing module then preprocesses (104) the electronic document to be tagged. The preprocessing of the electronic document comprises extracting a text from the electronic document to be tagged, replacing a number or a date in the extracted text with a predetermined symbol, and tokenizing the extracted text with the predetermined symbol into a plurality of tokens. After the preprocessing (104), a deep learning module determines (106) a tag for at least one of the plurality of tokens. The determined tag for the at least one token is then output (108) by an output module.

    Claims

    1. A computer-implemented method for tagging electronic documents, the method comprising: receiving, by an input module, an electronic document to be tagged; preprocessing, by a preprocessing module, the electronic document to be tagged, the preprocessing comprising: extracting a text from the electronic document to be tagged; replacing a number or a date in the extracted text with a predetermined symbol; and tokenizing the extracted text with the predetermined symbol into a first plurality of tokens; determining, by a deep learning module, a tag for at least one of the first plurality of tokens; and outputting, by an output module, the determined tag for the at least one of the first plurality of tokens.

    2. The method of claim 1, wherein tokenizing the extracted text into the first plurality of tokens processes the predetermined symbol as an undividable and/or unsplittable expression.

    3. The method of claim 1, wherein tokenizing the extracted text into the first plurality of tokens transfers the predetermined symbol into a single token.

    4. The method of claim 1, wherein the extracted text comprises more than one number and/or more than one date to be replaced, and wherein the same predetermined symbol is used for all numbers and/or all dates in the extracted text.

    5. The method of claim 1, wherein the predetermined symbol represents a shape, a format and/or a magnitude of the number and/or of the date to be replaced.

    6. The method of claim 1, wherein the deep learning module comprises an artificial neural network.

    7. The method of claim 1, wherein the deep learning module comprises a transformer-based deep learning model, and wherein the transformer-based deep learning model is based on a Bidirectional Encoder Representations from Transformers, BERT, model.

    8. The method of claim 1, wherein determining a tag for at least one of the first plurality of tokens comprises: transforming, by a neural network encoder, the first plurality of tokens into a plurality of numerical vectors in a latent space; and mapping, by a decoder, the plurality of numerical vectors into tags, the decoder comprising a dense neural network layer with a softmax activation function.

    9. The method of claim 1, wherein the determined tag is for a token representing the number and/or date in the electronic document to be tagged.

    10. The method of claim 1, wherein the electronic document to be tagged is a document including financial information and/or the electronic document to be tagged is from a financial domain; and/or wherein the determined tag is a tag from the eXtensible Business Reporting Language, XBRL.

    11. The method of claim 1, wherein the deep learning module is trained for determining the tag by: receiving, by the input module, a plurality of electronic documents as training dataset, the plurality of electronic documents comprising tags associated with text elements in the plurality of electronic documents; preprocessing, by the preprocessing module, the plurality of electronic documents, wherein each of the plurality of electronic documents is preprocessed by: extracting a text from each of the plurality of electronic documents; replacing a number or a date in the extracted text with the predetermined symbol; and tokenizing the extracted text with the predetermined symbol into a second plurality of tokens, wherein at least some of the second plurality of tokens are associated with one or more tags; and training, by a training module, the deep learning module with the second plurality of tokens along with the one or more tags.

    12. A computer-implemented method for training a deep learning module, the method comprising: receiving, by an input module, a plurality of electronic documents as training dataset, the plurality of electronic documents comprising tags associated with text elements in the plurality of electronic documents; preprocessing, by a preprocessing module, the plurality of electronic documents, wherein each of the plurality of electronic documents is preprocessed by: extracting a text from each of the plurality of electronic documents; replacing a number or a date in the extracted text with a predetermined symbol; and tokenizing the extracted text with the predetermined symbol into a plurality of tokens, wherein at least some of the plurality of tokens are associated with the tags; and training, by a training module, the deep learning module with the plurality of tokens along with the tags.

    13. The method of claim 12, wherein tokenizing the extracted text into the plurality of tokens processes the predetermined symbol as an undividable and/or unsplittable expression.

    14. The method of claim 12, wherein tokenizing the extracted text into the plurality of tokens transfers the predetermined symbol into a single token.

    15. The method of claim 12, wherein the extracted text comprises more than one number and/or more than one date to be replaced, and wherein the same predetermined symbol is used for all numbers and/or all dates in the extracted text.

    16. The method of claim 12, wherein the predetermined symbol represents a shape, a format and/or a magnitude of the number and/or of the date to be replaced.

    17. The method of claim 12, wherein the deep learning module comprises an artificial neural network.

    18. The method of claim 12, wherein the deep learning module comprises a transformer-based deep learning model, and wherein the transformer-based deep learning model is based on a Bidirectional Encoder Representations from Transformers, BERT, model.

    19. The method of claim 12, wherein the plurality of electronic documents includes financial information and/or the plurality of electronic documents is from a financial domain; and/or wherein the tags associated with text elements in the plurality of electronic documents are tags from the eXtensible Business Reporting Language, XBRL.

    20. A data processing apparatus, comprising: at least one processor; memory storing computer-readable instructions that, when executed by the at least one processor, cause the data processing apparatus to: receive, by an input module of the data processing apparatus, an electronic document to be tagged; preprocess, by a preprocessing module of the data processing apparatus, the electronic document to be tagged, the preprocessing comprising: extracting a text from the electronic document to be tagged; replacing a number or a date in the extracted text with a predetermined symbol; and tokenizing the extracted text with the predetermined symbol into a first plurality of tokens; determine, by a deep learning module of the data processing apparatus, a tag for at least one of the first plurality of tokens; and output, by an output module of the data processing apparatus, the determined tag for the at least one of the first plurality of tokens.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0123] The foregoing and other features and advantages of the invention will become further apparent from the following detailed description read in conjunction with the accompanying drawings. In the drawings, like reference numbers refer to like elements.

    [0124] FIG. 1 is a flowchart illustrating a method for tagging electronic documents according to an embodiment of the invention.

    [0125] FIG. 2 is a flowchart illustrating preprocessing steps according to an embodiment of the invention.

    [0126] FIG. 3 is a schematic drawing illustrating a deep learning module according to an embodiment of the invention.

    [0127] FIGS. 4A and 4B are schematic drawings illustrating two examples for tokenizing numeric expressions.

    [0128] FIGS. 5A and 5B are schematic drawings illustrating tagging processes for two exemplary sentences.

    [0129] FIG. 6 is a schematic drawing illustrating an exemplary output of a tagging process according to an embodiment of the invention.

    [0130] FIG. 7 is a flowchart illustrating a method for training a deep learning module according to an embodiment of the invention.

    [0131] FIG. 8 is a diagrammatic representation of a data processing system for performing the methods disclosed herein.

    DETAILED DESCRIPTION

    [0132] In the following, embodiments of the invention will be described in detail with reference to the accompanying drawings. It is to be understood that the following description of the embodiments is given only for the purpose of illustration and is not to be taken in a limiting sense. It should be noted that the drawings are to be regarded as being schematic representations only, and elements in the drawings are not necessarily to scale with each other. Rather, the representation of the various elements is chosen such that their function and general purpose become apparent to a person skilled in the art.

    [0133] FIG. 1 shows a flowchart illustrating a method 100 for tagging electronic documents according to an embodiment of the invention.

    [0134] At step 102, an input module receives a document that needs to be tagged with XBRL tags. The document is a company filing, which the input module receives as a PDF file. Then, at step 104, the received document is preprocessed by a preprocessing module. The preprocessing module extracts text from the document, replaces numbers and dates in the extracted text with predetermined symbols, and then tokenizes the text with the predetermined symbols into text tokens.

    [0135] At step 106, a deep learning module uses a neural network encoder to transform the text tokens into numerical vectors in a latent space. Based on the numerical vectors in the latent space, tags are then determined by a decoder of the deep learning module. For example, the method 100 predicts that the expression “30.2$” included in the received document should be labeled with the XBRL tag “Revenue”.

    [0136] At step 108, the recommendations of the deep learning module are displayed on a user's display. Also at step 108, the user can store the new, XBRL-tagged document in an electronic storage in order to proceed to submit the document to a local securities commission.

    [0137] FIG. 2 shows a preprocessing process 200. Optional steps of the preprocessing process 200 are indicated by dashed boxes.

    [0138] The preprocessing process 200 receives a document as input. At step 202, text is extracted from the document. At optional step 204, the preprocessing process 200 then detects and removes tables from the document if the tables do not need to be tagged. Then, at optional step 206, the preprocessing process 200 extracts specific sections of the document that a user is interested in tagging. The specific sections are split into sentences and the sentences are normalized at optional step 208. At step 210, numbers in the sentences are then replaced with predetermined symbols so that financial amounts in the sentences are represented by undividable expressions indicating the magnitude and/or shape of the numbers. This avoids “overfragmentation” by a classic tokenizer. The modified sentences are then tokenized into tokens at step 212. At optional step 214, unneeded content is filtered out by a heuristic classifier.
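    As an illustration only (not the claimed implementation), the replacement step 210 and the tokenization step 212 can be sketched as follows; the regular expression and the digit-to-“X” symbol scheme are assumptions made for this sketch:

```python
import re

# Matches integers, thousands-separated numbers, and decimals, e.g. "55,333.2".
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

def shape_symbol(number: str) -> str:
    """Replace every digit with 'X', keeping separators, so that only the
    shape/magnitude of the value survives (e.g. '55,333.2' -> 'XX,XXX.X')."""
    return re.sub(r"\d", "X", number)

def preprocess_sentence(sentence: str) -> list:
    """Replace numbers with shape symbols (step 210), then tokenize on
    whitespace (step 212) so each symbol stays one undividable token."""
    replaced = NUMBER.sub(lambda m: shape_symbol(m.group()), sentence)
    return replaced.split()

print(preprocess_sentence("revenue was 50.2 dollars"))
# -> ['revenue', 'was', 'XX.X', 'dollars']
```

    A production preprocessor would additionally need date patterns and subword handling; the sentence splitting and normalization of optional step 208 are omitted here.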

    [0139] FIG. 3 is a schematic drawing illustrating a deep learning module 300 according to an embodiment of the invention.

    [0140] The deep learning module 300 takes a sequence of N tokens 302.sub.1, 302.sub.2, . . . , 302.sub.n-1, 302.sub.n and predicts N tags 304.sub.1, 304.sub.2, . . . , 304.sub.n-1, 304.sub.n. The deep learning module 300 uses a neural network encoder 306 to encode the tokens 302.sub.1, 302.sub.2, . . . , 302.sub.n-1, 302.sub.n into contextualized representations. The neural network encoder 306 is a specialized domain BERT model. In order to produce the tags 304.sub.1, 304.sub.2, . . . , 304.sub.n-1, 304.sub.n, the deep learning module 300 employs a task-specific decoder 308 after the neural network encoder 306, which includes a dense neural network layer with a softmax activation function. The decoder 308 is used to finally map the vector representations received from the neural network encoder 306 into the tags 304.sub.1, 304.sub.2, . . . , 304.sub.n. The number of units of the decoder 308 can be adjusted according to the task specification.
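    A minimal sketch of the decoder 308 described above, assuming a small invented tag set and random stand-in vectors in place of the encoder 306's output (a real system would obtain these contextualized vectors from a BERT-style model):

```python
import math
import random

TAGS = ["O", "Revenue", "CashAndCashEquivalents"]  # assumed label set

def softmax(logits):
    """Turn raw logits into a probability distribution over the tags."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def dense(vector, weights, bias):
    """One dense layer: logits[j] = sum_i vector[i] * weights[i][j] + bias[j]."""
    return [sum(v * w[j] for v, w in zip(vector, weights)) + bias[j]
            for j in range(len(bias))]

def decode(vectors, weights, bias):
    """Map each encoder output vector to its most probable tag."""
    tags = []
    for vec in vectors:
        probs = softmax(dense(vec, weights, bias))
        tags.append(TAGS[probs.index(max(probs))])
    return tags

# Toy demo with a 4-dimensional "latent space" and random weights; the
# encoder 306 would produce these vectors in a real pipeline.
random.seed(0)
dim = 4
weights = [[random.uniform(-1, 1) for _ in TAGS] for _ in range(dim)]
bias = [0.0] * len(TAGS)
vectors = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
predicted = decode(vectors, weights, bias)  # one tag per token
```

    The number of output units (here `len(TAGS)`) is what the description says can be adjusted per task.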

    [0141] The output of the deep learning module 300 is heavily dependent on the proper tokenization of numeric expressions.

    [0142] FIGS. 4A and 4B illustrate examples for tokenizing two numeric expressions.

    [0143] One example for a tokenization algorithm is WordPiece. The WordPiece tokenization algorithm is a subword tokenization algorithm and is used, for example, for BERT, DistilBERT, and Electra. Given a text, the WordPiece tokenization algorithm produces tokens (i.e., subword units).

    [0144] However, such a tokenization algorithm is not suitable for processing financial data (e.g., numerical expressions or date expressions) since it produces multiple, meaningless subword units when tokenizing. This happens because the tokenization algorithm splits tokens based on a vocabulary of commonly seen words. Since exact numbers (like 55,333.2) are not common, they are split into multiple, fragmented chunks. It is not possible to have an infinite vocabulary of all continuous numerical values. As can be seen in FIG. 4A, input expression 402 for the word “dollars” produces output token 406. However, input expression 404 for the numerical value “50.2” produces three split output tokens 408, 410, and 412 associated with “50”, “.”, and “2”. This fragmentation of numerical or date expressions has a negative impact on the performance of classifying tokens as tags.

    [0145] In order to avoid this overfragmentation and to improve the performance, input tokens representing numerical values (e.g., financial amounts) or date values (e.g., deadlines) are pre-processed or special rules are applied so that such values are not split into multiple tokens.

    [0146] For the tagging of documents, the exact values of the numerical expressions and/or of the date expressions are not of particular relevance. The magnitude and/or the shape of the numerical values or of the date values are usually sufficient. As shown in FIG. 4B, input token 454 is normalized by its magnitude/shape and then transformed into output token 458 having the value “XX.X”. Thus, a generalization of the numerical expression is used. This generalization improves the performance of the tagging task. As in the example shown in FIG. 4A, input expression 452 for the word “dollars” still produces a single output token 456.
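    The behavior illustrated in FIGS. 4A and 4B can be reproduced with a toy greedy longest-match tokenizer in the style of WordPiece; the vocabulary below is invented purely for illustration:

```python
def wordpiece(word, vocab):
    """Greedy longest-match subword tokenization, as used by WordPiece:
    repeatedly take the longest vocabulary entry matching at the cursor,
    prefixing continuation pieces with '##'."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = ("##" if start > 0 else "") + word[start:end]
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no vocabulary piece matched at all
        pieces.append(match)
        start = end
    return pieces

# Tiny illustrative vocabulary: a common word, some number fragments, and
# the shape symbol "XX.X" -- but not the exact number "50.2" itself.
vocab = {"dollars", "50", "##.", "##2", "XX.X"}

print(wordpiece("dollars", vocab))  # -> ['dollars']          (token 406/456)
print(wordpiece("50.2", vocab))     # -> ['50', '##.', '##2'] (overfragmentation)
print(wordpiece("XX.X", vocab))     # -> ['XX.X']             (single token 458)
```

    Because the shape symbol is itself a vocabulary entry, it survives tokenization as one undividable token, which is exactly the effect the preprocessing aims for.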

    [0147] FIG. 5A shows different phases of a tagging process 500 for a first exemplary sentence 502.

    [0148] The tagging process 500 receives the first exemplary sentence 502 as input. Numbers and dates in the sentence 502 are replaced with special symbols to create a preprocessed sentence 504. In the embodiment shown in FIG. 5A, the numeric expression “24.8” is replaced with symbol “XX.X”. Further, the numeric expressions “31” and “2020” in a date are replaced with symbol “XX” and “XXXX”, respectively. The preprocessed sentence 504 is then tokenized into a plurality of tokens 506. Due to the use of the symbol “XX.X”, the number “24.8” is transformed into a single token. The numeric expressions in the date are transformed into tokens 506 representing the symbols “XX” and “XXXX”. After that, the tokens 506 pass through the encoder 306 and the decoder 308 as described above. The tagging process 500 generates one class prediction 508 for each token 506. The “O” class is used to refer to tokens that represent no specific class. The numeric expression “24.8” is classified as “Cash and Cash Equivalents”.

    [0149] A tagging process 500′ for a second exemplary sentence 502′ is shown in FIG. 5B.

    [0150] The tagging process 500′ is similar to the tagging process 500 shown in FIG. 5A. The second exemplary sentence 502′ is from the legal domain. The sentence 502′ includes a date expression, namely “Oct. 16, 2021”. A modified sentence 504′ is generated in which the date expression is normalized. The numeric expressions “16” and “2021” in the date are replaced with symbol “XX” and “XXXX”, respectively. The modified sentence 504′ is tokenized, by a classic tokenizer, into tokens 506′. The tagging process 500′ then generates one class prediction 508′ for each token 506′. The “O” class is again used to refer to tokens 506′ that represent no specific class. As can be seen in FIG. 5B, the tagging process 500′ classifies the token 506′ for the word “lease” as “Contract Type”. The tokens 506′ for the expression “Oct. 16, 2021” are all classified as “Termination Date”.

    [0151] FIG. 6 shows an example for a possible output 600 of a tagging process.

    [0152] Annotated tokens 602 are classified as specific XBRL tags 604 that represent financial attributes. The output 600 also illustrates how the XBRL tags 604 are heavily dependent on the context of sequences. For example, while tokens 602.sub.2 and 602.sub.3 for “50.0” and “50.3” are similar text-wise, they are associated with XBRL tag 604.sub.2 and XBRL tag 604.sub.3 which are different due to the context in which the tokens 602.sub.2 and 602.sub.3 are used.

    [0153] FIG. 7 shows a flowchart illustrating a method 700 for training a deep learning module.

    [0154] At step 706, the method 700 receives documents that already contain XBRL tags. Initially, the deep learning module is trained on a large first collection of labeled documents 702 and may later be updated via a second collection of labeled documents 704. At step 708, the labeled documents 702, 704 are preprocessed by a preprocessing module to generate tokens along with their labels. At step 710, the tokens and the labels are then used to train the deep learning module. In each epoch, at step 712, the deep learning module attempts to predict N correct tags for a sequence of N tokens. At step 714, the deep learning module then compares its predictions (i.e., outputs) with the correct labels included in the labeled documents 702, 704 and calculates a loss function (e.g., cross entropy). An optimizer (e.g., Adam) then updates the weights of a neural network inside the deep learning module in order to produce better predictions in the future.
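    The loss computation of step 714 can be sketched as follows; the per-token tag probabilities below are invented for illustration, and the optimizer update is only indicated in a comment:

```python
import math

def cross_entropy(pred_probs, gold_index):
    """Negative log-likelihood of the correct tag for one token."""
    return -math.log(pred_probs[gold_index])

def sequence_loss(batch_probs, gold_indices):
    """Average cross-entropy over the N tokens of a sequence (step 714)."""
    losses = [cross_entropy(p, g) for p, g in zip(batch_probs, gold_indices)]
    return sum(losses) / len(losses)

# Toy example: three tokens, three tags ('O', 'Revenue', 'TerminationDate').
probs = [
    [0.7, 0.2, 0.1],   # model favors 'O'
    [0.1, 0.8, 0.1],   # model favors 'Revenue'
    [0.3, 0.3, 0.4],   # model is uncertain
]
gold = [0, 1, 2]       # correct tag indices from the labeled documents
loss = sequence_loss(probs, gold)  # loss ≈ 0.499 for these toy probabilities
# An optimizer such as Adam would then update the network weights to reduce this loss.
```

    In practice the probabilities come from the softmax decoder described for FIG. 3, and the gradient of this loss drives the weight updates.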

    [0155] FIG. 8 depicts a diagrammatic representation of a data processing system 800 for implementing a method for tagging electronic documents and/or for training a deep learning module as described herein.

    [0156] As shown in FIG. 8, the data processing system 800 includes one or more central processing units (CPU) or processors 801 coupled to one or more user input/output (I/O) devices 802 and memory devices 803. Examples of I/O devices 802 may include, but are not limited to, keyboards, displays, monitors, touch screens, printers, electronic pointing devices such as mice, trackballs, styluses, touch pads, or the like. Examples of memory devices 803 may include, but are not limited to, hard drives (HDs), magnetic disk drives, optical disk drives, magnetic cassettes, tape drives, flash memory cards, random access memories (RAMs), read-only memories (ROMs), smart cards, etc. The data processing system 800 is coupled to a display 806, an information device 807, and various peripheral devices (not shown), such as printers, plotters, speakers, etc., through the I/O devices 802. The data processing system 800 is also coupled to external computers or other devices through a network interface 804, a wireless transceiver 805, or other means coupled to a network such as a local area network (LAN), wide area network (WAN), or the Internet.

    [0157] It will also be appreciated that one or more of the elements depicted in the drawings/figures can also be implemented in a more separated or integrated manner, or even removed or rendered as inoperable in certain cases, as is useful in accordance with a particular application.

    [0158] As used herein, the terms “comprises”, “comprising”, “includes”, “including”, “has”, “having”, or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a method, process, product, article, or apparatus that comprises a list of elements is not necessarily limited only to those elements but may include other elements not expressly listed or inherent to such method, process, product, article, or apparatus.

    [0159] Furthermore, the term “or” as used herein is generally intended to mean “and/or” unless otherwise indicated. For example, a condition A or B (as well as a condition A and/or B) is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).

    [0160] As used herein, including the claims that follow, a term preceded by “a” or “an” (and “the” or “said” when antecedent basis is “a” or “an”) includes both singular and plural of such term, unless clearly indicated within the claim otherwise (i.e., that the reference “a” or “an” clearly indicates only the singular or only the plural). Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.

    [0161] While specific embodiments are disclosed herein, various changes and modifications can be made without departing from the scope of the invention. For example, one or more steps may be performed in other than the recited order, and one or more depicted steps may be optional in accordance with aspects of the disclosure. The present embodiments are to be considered in all respects as illustrative and non-restrictive, and all changes coming within the meaning and equivalency range of the appended claims are intended to be embraced therein.