SYSTEM AND METHOD FOR AUTOMATICALLY TAGGING DOCUMENTS
20230028664 · 2023-01-26
Inventors
- Eleftherios Panagiotis Loukas (Agia Paraskevi, GR)
- Eirini Spyropoulou (Agia Paraskevi, GR)
- Prodromos Malakasiotis (Agia Paraskevi, GR)
- Emmanouil Fergadiotis (Agia Paraskevi, GR)
- Ilias Chalkidis (Agia Paraskevi, GR)
- Ioannis Androutsopoulos (Agia Paraskevi, GR)
- Georgios Paliouras (Agia Paraskevi, GR)
CPC classification
G06F40/117
PHYSICS
G06F40/143
PHYSICS
International classification
G06F40/117
PHYSICS
G06F40/143
PHYSICS
Abstract
System and methods (100) for automatically tagging electronic documents are disclosed. An input module receives (102) an electronic document to be tagged. A preprocessing module then preprocesses (104) the electronic document to be tagged. The preprocessing of the electronic document comprises extracting a text from the electronic document to be tagged, replacing a number or a date in the extracted text with a predetermined symbol, and tokenizing the extracted text with the predetermined symbol into a plurality of tokens. After the preprocessing (104), a deep learning module determines (106) a tag for at least one of the plurality of tokens. The determined tag for the at least one token is then output (108) by an output module.
Claims
1. A computer-implemented method for tagging electronic documents, the method comprising: receiving, by an input module, an electronic document to be tagged; preprocessing, by a preprocessing module, the electronic document to be tagged, the preprocessing comprising: extracting a text from the electronic document to be tagged; replacing a number or a date in the extracted text with a predetermined symbol; and tokenizing the extracted text with the predetermined symbol into a first plurality of tokens; determining, by a deep learning module, a tag for at least one of the first plurality of tokens; and outputting, by an output module, the determined tag for the at least one of the first plurality of tokens.
2. The method of claim 1, wherein tokenizing the extracted text into the first plurality of tokens processes the predetermined symbol as an undividable and/or unsplittable expression.
3. The method of claim 1, wherein tokenizing the extracted text into the first plurality of tokens transfers the predetermined symbol into a single token.
4. The method of claim 1, wherein the extracted text comprises more than one number and/or more than one date to be replaced, and wherein the same predetermined symbol is used for all numbers and/or all dates in the extracted text.
5. The method of claim 1, wherein the predetermined symbol represents a shape, a format and/or a magnitude of the number and/or of the date to be replaced.
6. The method of claim 1, wherein the deep learning module comprises an artificial neural network.
7. The method of claim 1, wherein the deep learning module comprises a transformer-based deep learning model, and wherein the transformer-based deep learning model is based on a Bidirectional Encoder Representations from Transformers, BERT, model.
8. The method of claim 1, wherein determining a tag for at least one of the first plurality of tokens comprises: transforming, by a neural network encoder, the first plurality of tokens into a plurality of numerical vectors in a latent space; and mapping, by a decoder, the plurality of numerical vectors into tags, the decoder comprising a dense neural network layer with a softmax activation function.
9. The method of claim 1, wherein the determined tag is for a token representing the number and/or date in the electronic document to be tagged.
10. The method of claim 1, wherein the electronic document to be tagged is a document including financial information and/or the electronic document to be tagged is from a financial domain; and/or wherein the determined tag is a tag from the eXtensive Business Reporting Language, XBRL.
11. The method of claim 1, wherein the deep learning module is trained for determining the tag by: receiving, by the input module, a plurality of electronic documents as training dataset, the plurality of electronic documents comprising tags associated with text elements in the plurality of electronic documents; preprocessing, by the preprocessing module, the plurality of electronic documents, wherein each of the plurality of electronic documents is preprocessed by: extracting a text from each of the plurality of electronic documents; replacing a number or a date in the extracted text with the predetermined symbol; and tokenizing the extracted text with the predetermined symbol into a second plurality of tokens, wherein at least some of the second plurality of tokens are associated with one or more tags; and training, by a training module, the deep learning module with the second plurality of tokens along with the one or more tags.
12. A computer-implemented method for training a deep learning module, the method comprising: receiving, by an input module, a plurality of electronic documents as training dataset, the plurality of electronic documents comprising tags associated with text elements in the plurality of electronic documents; preprocessing, by a preprocessing module, the plurality of electronic documents, wherein each of the plurality of electronic documents is preprocessed by: extracting a text from each of the plurality of electronic documents; replacing a number or a date in the extracted text with a predetermined symbol; and tokenizing the extracted text with the predetermined symbol into a plurality of tokens, wherein at least some of the plurality of tokens are associated with the tags; and training, by the training module, the deep learning module with the plurality of tokens along with the tags.
13. The method of claim 12, wherein tokenizing the extracted text into the plurality of tokens processes the predetermined symbol as an undividable and/or unsplittable expression.
14. The method of claim 12, wherein tokenizing the extracted text into the plurality of tokens transfers the predetermined symbol into a single token.
15. The method of claim 12, wherein the extracted text comprises more than one number and/or more than one date to be replaced, and wherein the same predetermined symbol is used for all numbers and/or all dates in the extracted text.
16. The method of claim 12, wherein the predetermined symbol represents a shape, a format and/or a magnitude of the number and/or of the date to be replaced.
17. The method of claim 12, wherein the deep learning module comprises an artificial neural network.
18. The method of claim 12, wherein the deep learning module comprises a transformer-based deep learning model, and wherein the transformer-based deep learning model is based on a Bidirectional Encoder Representations from Transformers, BERT, model.
19. The method of claim 12, wherein the plurality of electronic documents includes financial information and/or the plurality of electronic documents is from a financial domain; and/or wherein the tags associated with text elements in the plurality of electronic documents are tags from the eXtensive Business Reporting Language, XBRL.
20. A data processing apparatus, comprising: at least one processor; memory storing computer-readable instructions that, when executed by the at least one processor, cause the data processing apparatus to: receive, by an input module of the data processing apparatus, an electronic document to be tagged; preprocess, by a preprocessing module of the data processing apparatus, the electronic document to be tagged, the preprocessing comprising: extracting a text from the electronic document to be tagged; replacing a number or a date in the extracted text with a predetermined symbol; and tokenizing the extracted text with the predetermined symbol into a first plurality of tokens; determine, by a deep learning module of the data processing apparatus, a tag for at least one of the first plurality of tokens; and output, by an output module of the data processing apparatus, the determined tag for the at least one of the first plurality of tokens.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0123] The foregoing and other features and advantages of the invention will become further apparent from the following detailed description read in conjunction with the accompanying drawings. In the drawings, like reference numbers refer to like elements.
DETAILED DESCRIPTION
[0132] In the following, embodiments of the invention will be described in detail with reference to the accompanying drawings. It is to be understood that the following description of the embodiments is given only for the purpose of illustration and is not to be taken in a limiting sense. It should be noted that the drawings are to be regarded as being schematic representations only, and elements in the drawings are not necessarily to scale with each other. Rather, the representation of the various elements is chosen such that their function and general purpose become apparent to a person skilled in the art.
[0134] At step 102, an input module receives a document that needs to be XBRL tagged. The document is a company filing. The input module receives the document as PDF file. Then, at step 104, the received document is preprocessed by a preprocessing module. The preprocessing module extracts text from the document, replaces numbers and dates in the extracted text with predetermined symbols, and then tokenizes the text with the predetermined symbols into text tokens.
[0135] At step 106, a deep learning module uses a neural network encoder in order to transform the text tokens into numerical vectors in a latent space. Based on the numerical vectors in the latent space, tags are then determined by a decoder of the deep learning module. For example, method 100 predicts that the expression “30.2$” included in the received document should be labeled with XBRL tag “Revenue”.
[0136] Recommendations of the deep learning module are displayed in step 108 on a user's display. Also at step 108, the user can store the new, XBRL-tagged document in an electronic storage in order to proceed to submit the document to a local securities commission.
[0138] The preprocessing process 200 receives a document as input. At step 202, text is extracted from the document. At optional step 204, the preprocessing process 200 then detects and removes tables from the document if the tables do not need to be tagged. Then, at optional step 206, the preprocessing process 200 extracts specific sections of the document that a user is interested in tagging. The specific sections are split into sentences and the sentences are normalized at optional step 208. At step 210, numbers in the sentences are then replaced with predetermined symbols so that financial amounts in the sentences are represented by undividable expressions indicating the magnitude and/or shape of the numbers. This avoids “overfragmentation” by a classic tokenizer. The modified sentences are then tokenized into tokens at step 212. At optional step 214, unneeded content is filtered out by a heuristic classifier.
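The preprocessing pipeline of process 200 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes plain text has already been extracted from the document (step 202), stubs out table removal and section extraction (steps 204-206), and uses a simple hypothetical shape-based pseudo-token format (digits replaced with "X") for step 210.

```python
import re

def preprocess(text):
    """Illustrative sketch of preprocessing steps 208-212.

    Table removal (204) and section extraction (206) are omitted;
    the pseudo-token format '[XX.X]' is a hypothetical example.
    """
    # Step 208: split the extracted text into sentences and trim whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    processed = []
    for sent in sentences:
        # Step 210: replace each number with a shape-preserving pseudo-token
        # so the tokenizer later treats it as a single, unsplittable unit.
        sent = re.sub(r"\d[\d,]*(?:\.\d+)?",
                      lambda m: "[" + re.sub(r"\d", "X", m.group(0)) + "]",
                      sent)
        # Step 212: whitespace splitting stands in for a real subword
        # tokenizer configured to keep pseudo-tokens unsplit.
        processed.append(sent.split())
    return processed
```

For example, `preprocess("Revenue was $50.3 million.")` yields one token sequence in which the amount appears as the single pseudo-token `$[XX.X]` rather than as several numeric fragments.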
[0140] The deep learning module 300 takes a sequence of N tokens 302.sub.1, 302.sub.2, . . . , 302.sub.n-1, 302.sub.n and predicts N tags 304.sub.1, 304.sub.2, . . . , 304.sub.n-1, 304.sub.n. The deep learning module 300 uses a neural network encoder 306 to encode the tokens 302.sub.1, 302.sub.2, . . . , 302.sub.n-1, 302.sub.n into contextualized representations. The neural network encoder 306 is a specialized domain BERT model. In order for the deep learning module 300 to be able to produce the tags 304.sub.1, 304.sub.2, . . . , 304.sub.n-1, 304.sub.n, the deep learning module 300 employs a task-specific decoder 308 after the neural network encoder 306, which includes a dense neural network layer with a softmax activation function. The decoder 308 is used to finally map vector representations received from the neural network encoder 306 into the tags 304.sub.1, 304.sub.2, . . . , 304.sub.n. The number of units of the decoder 308 can be changed according to the task specification each time.
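The task-specific decoder 308 described above can be sketched as a dense layer followed by a softmax, mapping each contextualized token vector to a probability distribution over the tag set. This is a simplified NumPy illustration with hypothetical shapes, not the patented model; the encoder outputs are stand-ins for what a BERT-style encoder 306 would produce.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability before exponentiating.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def decode(hidden_states, W, b):
    """Dense layer + softmax, as in decoder 308 (illustrative sketch).

    hidden_states: (N, d) contextualized vectors for N tokens
    W: (d, num_tags) weight matrix, b: (num_tags,) bias
    Returns an (N, num_tags) matrix of tag probabilities; the
    argmax of each row is the predicted tag for that token.
    """
    logits = hidden_states @ W + b
    return softmax(logits)
```

The number of columns of `W` (the decoder's units) is what changes with the task specification, i.e., with the size of the tag set.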
[0141] The output of the deep learning module 300 is heavily dependent on the proper tokenization of numeric expressions.
[0143] One example for a tokenization algorithm is WordPiece. The WordPiece tokenization algorithm is a subword tokenization algorithm and is used, for example, for BERT, DistilBERT, and Electra. Given a text, the WordPiece tokenization algorithm produces tokens (i.e., subword units).
[0144] However, such a tokenization algorithm is not suitable for processing financial data (e.g., numerical expressions or date expressions) since it produces multiple, meaningless subword units when tokenizing. This happens because the tokenization algorithm splits tokens based on a vocabulary of commonly seen words. Since exact numbers (like 55,333.2) are not common, they are split into multiple, fragmented chunks. It is not possible to have an infinite vocabulary of all continuous numerical values. As it can be seen in
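The overfragmentation problem can be demonstrated with a toy greedy longest-match-first subword tokenizer in the spirit of WordPiece (this is not the real WordPiece implementation, and the vocabulary below is a hypothetical miniature one):

```python
def wordpiece_tokenize(word, vocab):
    """Toy greedy longest-match-first subword tokenizer (WordPiece-style).

    Repeatedly takes the longest prefix of the remaining string that is
    in the vocabulary; non-initial pieces carry the '##' continuation
    prefix. Returns ['[UNK]'] if no piece matches at some position.
    """
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

# A vocabulary of common words plus short numeric chunks: the exact
# number 55,333.2 is not in it, so it shatters into many fragments.
vocab = {"revenue", "was", "55", "##,", "##3", "##33", "##.", "##2"}
```

With this vocabulary, `wordpiece_tokenize("55,333.2", vocab)` produces six meaningless fragments (`55`, `##,`, `##33`, `##3`, `##.`, `##2`), while the common word `revenue` survives as a single token.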
[0145] In order to avoid this overfragmentation and to improve the performance, input tokens representing numerical values (e.g., financial amounts) or date values (e.g., deadlines) are pre-processed or special rules are applied so that such values are not split into multiple tokens.
[0146] For the tagging of documents, the exact value or the exact numbers of the numerical expressions and/or of the date expressions are not of particular relevance. The magnitude and/or the shape of the numerical values or of the date values are usually sufficient. As shown in
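A shape-preserving replacement can be sketched as follows. The bracketed `[XX,XXX.X]` format is an illustrative assumption, not the patent's actual symbol set: the point is that every number with the same shape and magnitude maps to the same pseudo-token, so a finite vocabulary covers all such values.

```python
import re

def shape_token(value):
    """Map a numeric or date expression to a pseudo-token that keeps
    only its shape/magnitude (hypothetical format): each digit becomes
    'X' while separators and punctuation are preserved."""
    return "[" + re.sub(r"\d", "X", value) + "]"

shape_token("55,333.2")    # -> "[XX,XXX.X]"
shape_token("12,847.9")    # -> "[XX,XXX.X]"  (same shape, same token)
shape_token("2021-12-31")  # -> "[XXXX-XX-XX]"
```

Because `55,333.2` and `12,847.9` collapse to the same symbol, the tokenizer never needs an entry per exact value, which is what makes the vocabulary finite.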
[0148] The tagging process 500 expects the first exemplary sentence 502 as input. Numbers and dates in the sentence 502 are replaced with special symbols to create a preprocessed sentence 504. In the embodiment shown in
[0149] A tagging process 500′ for a second exemplary sentence 502′ is shown in
[0150] The tagging process 500′ is similar to the tagging process 500 shown in
[0152] Annotated tokens 602 are classified as specific XBRL tags 604 that represent financial attributes. The output 600 also illustrates how the XBRL tags 604 are heavily dependent on the context of sequences. For example, while tokens 602.sub.2 and 602.sub.3 for “50.0” and “50.3” are similar text-wise, they are associated with XBRL tag 604.sub.2 and XBRL tag 604.sub.3, respectively, which are different due to the context in which the tokens 602.sub.2 and 602.sub.3 are used.
[0154] At step 706, the method 700 receives documents that already contain XBRL tags. Initially, the method 700 is trained on a large first collection of labeled documents 702. The method 700 may later be updated via a second collection of labeled documents 704. At step 708, the labeled documents 702, 704 are preprocessed by a preprocessing module to generate tokens along with their labels. At step 710, the tokens and the labels are then used to train the deep learning module. In each epoch, the deep learning module tries at step 712 to predict N correct tags for a sequence of N tokens. At step 714, the deep learning module then compares its predictions (i.e., outputs) with the correct labels included in the labeled documents 702, 704 and calculates a loss function (e.g., Cross Entropy). Then, an optimizer (e.g., Adam) updates the weights of a neural network inside the deep learning module in order to adjust and produce better predictions in the future.
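The loss computation of step 714 can be sketched as follows. This is a minimal NumPy illustration of cross-entropy over predicted tag distributions; the Adam weight update itself is omitted, and the probability values below are made-up examples.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean cross-entropy between predicted tag distributions and the
    correct labels (step 714).

    probs: (N, num_tags) softmax outputs, one row per token
    labels: (N,) integer indices of the correct tags
    """
    n = probs.shape[0]
    # Pick out the predicted probability of each correct tag; the small
    # epsilon guards against log(0).
    return float(-np.log(probs[np.arange(n), labels] + 1e-12).mean())

good = np.array([[0.9, 0.05, 0.05]])  # confident, correct prediction
bad = np.array([[0.2, 0.4, 0.4]])     # uncertain, mostly wrong prediction
labels = np.array([0])
cross_entropy(good, labels)  # ≈ 0.105 (small loss)
cross_entropy(bad, labels)   # ≈ 1.609 (large loss drives bigger weight updates)
```

The optimizer then adjusts the network weights in the direction that reduces this loss on the training tokens.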
[0156] As shown in
[0157] It will also be appreciated that one or more of the elements depicted in the drawings/figures can also be implemented in a more separated or integrated manner, or even removed or rendered as inoperable in certain cases, as is useful in accordance with a particular application.
[0158] As used herein, the terms “comprises”, “comprising”, “includes”, “including”, “has”, “having”, or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a method, process, product, article, or apparatus that comprises a list of elements is not necessarily limited only to those elements but may include other elements not expressly listed or inherent to such method, process, product, article, or apparatus.
[0159] Furthermore, the term “or” as used herein is generally intended to mean “and/or” unless otherwise indicated. For example, a condition A or B (as well as a condition A and/or B) is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).
[0160] As used herein, including the claims that follow, a term preceded by “a” or “an” (and “the” or “said” when antecedent basis is “a” or “an”) includes both singular and plural of such term, unless clearly indicated within the claim otherwise (i.e., that the reference “a” or “an” clearly indicates only the singular or only the plural). Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.
[0161] While specific embodiments are disclosed herein, various changes and modifications can be made without departing from the scope of the invention. For example, one or more steps may be performed in other than the recited order, and one or more depicted steps may be optional in accordance with aspects of the disclosure. The present embodiments are to be considered in all respects as illustrative and non-restrictive, and all changes coming within the meaning and equivalency range of the appended claims are intended to be embraced therein.