SYSTEM FOR THE EXTRACTION OF INFORMATION FROM DOCUMENTS

20250077554 · 2025-03-06

Abstract

The invention pertains to a system for the extraction of information from documents, in particular natural language documents, the system comprising an encoder with a neural network and a retriever that is configured as a reasoning engine. The system is configured such that it supports user-defined queries for at least two pieces of information; the encoder is applied to the documents, in particular the natural language documents, to generate document encodings, and to the user-defined queries, to generate encoded instructions; and the document encodings are queried by the retriever in look-up steps based on the encoded instructions.

Claims

1. A system for the extraction of information from documents, the system comprising: i) an encoder with a neural network; and ii) a retriever that is configured as a reasoning engine; the system being configured such that: it supports user-defined queries for at least two pieces of information; the encoder is applied to the documents, to generate document encodings; the user-defined queries, to generate encoded instructions; the document encodings are queried by the retriever in look-up steps based on the encoded instructions.

2. The system according to claim 1, wherein two or more encoders are used to generate the document encodings.

3. The system according to claim 1, wherein the document encodings are based on embeddings, and key/value pairs derived therefrom.

4. The system according to claim 1, configured such that the retriever is only used once for each encoded instruction.

5. The system according to claim 1, configured such that the document encodings are generated only once, for the document encodings to be queried by the retriever to answer the user-defined queries for the at least two pieces of information.

6. The system according to claim 1, configured such that the retriever i) in the look-up steps searches for information that is most relevant for encoded instructions; ii) reasons over the findings of the look-up steps; and iii) produces a condensed representation of the output of step ii).

7. The system according to claim 6, configured such that i) and ii) may be repeated for each encoded instruction before the condensed representation of the output is produced in iii).

8. The system according to claim 1, configured such that the retriever produces a query that can be used to search for yet missing information.

9. The system according to claim 1, further comprising a generator that creates a response to the user-defined query based on the output representation of the retriever.

10. The system according to claim 9, configured such that the generator is only used once for creating a response to each encoded instruction.

11. A computer-implemented method for extraction of information from documents based on user-defined queries for at least two pieces of information, comprising the steps of: i) providing at least one document; ii) applying an encoder with a neural network, to a. the documents, to generate document encodings; b. the user-defined queries, to generate encoded instructions; iii) querying the document encodings by a retriever in look-up steps based on the encoded instructions, wherein the retriever is configured as a reasoning engine.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0053] The invention will now be described by way of more specific embodiments. These embodiments are not intended to limit the gist of the invention in any way, but rather serve to ease understanding of the invention.

[0054] FIG. 1: Overall model architecture;

[0055] FIG. 2: Exemplary document understanding task;

[0056] FIG. 3: Exemplary position embedding (bounding boxes);

[0057] FIG. 4: Exemplary use of position embeddings with the embeddings of corresponding words;

[0058] FIG. 5: General outline of information retrieval;

[0059] FIG. 6: Exemplary retrieval diagram.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0060] As outlined in FIG. 1, according to an exemplary overall model architecture, a natural language document (NLD) is provided as the information source. Encoders (E.sub.1, E.sub.2, E.sub.3) run through the natural language document (NLD) and generate document encodings (DE)/vectors (V), such as e.g. keys and values. Further, the system comprises yet another encoder (E.sub.4) that receives user-defined queries (Q.sub.1, . . . ) and generates encoded instructions (EI.sub.1, . . . ) therefrom. The document encodings (DE) are then queried by the retriever (R) in response to the user-defined queries (Q.sub.1, . . . ) and their respective encoded instructions (EI.sub.1, . . . ).

[0061] FIG. 2 shows a sales slip, by way of example, as an input document for a document understanding task. From top to bottom, the following information (answers) can be deduced from the boxed areas of the sales slip:

TABLE-US-00001
Query      Answer
company    STARBUCKS STORE #10208
date       Dec. 7, 2014
address    11302 EUCLID AVENUE, CLEVELAND, oh (216)229-0749
total      4.95

[0062] FIG. 3 illustrates how position embeddings can be generated on the basis of bounding boxes. For each word or token, the corresponding coordinates (based on the upper and lower start and end points) are detected, thereby providing a bounding box (defined by the coordinates x.sub.start, y.sub.start, x.sub.end and y.sub.end) for the word or token. Further, selected metadata, such as the page number (Page #) and the document number (Doc #), can also be included in the position embedding (illustrated on the right side of FIG. 3).
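By way of illustration only, the raw inputs of such a position embedding could be collected as in the following minimal sketch. The names TokenBox and position_features are hypothetical, and the scaling of the coordinates by the page size is an assumption of this sketch that is not taken from FIG. 3.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TokenBox:
    """Bounding box of one word or token plus the metadata named for FIG. 3."""
    x_start: float
    y_start: float
    x_end: float
    y_end: float
    page: int  # page number (Page #)
    doc: int   # document number (Doc #)


def position_features(box: TokenBox, page_width: float, page_height: float) -> List[float]:
    """Return [x_start, y_start, x_end, y_end, page, doc] as floats.

    Coordinates are divided by the page size so that they fall into [0, 1];
    this normalisation is an assumption made for the sketch.
    """
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    return [
        box.x_start / page_width,
        box.y_start / page_height,
        box.x_end / page_width,
        box.y_end / page_height,
        float(box.page),
        float(box.doc),
    ]


# Hypothetical token on page 1 of document 0, on a page of 1000 x 500 units.
features = position_features(TokenBox(100, 50, 300, 80, page=1, doc=0), 1000, 500)
```

A learned embedding layer would typically map such a feature list to a vector; that step is omitted here because it is not specified above.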

[0063] FIG. 4 illustrates how the position embeddings are used together with the embeddings of the respective word or token, i.e. as vector representations of the word (or token). The vector representations are run through the system to predict the output text, i.e. the output of the generator (PDF BOOKMARK SAMPLE in this example). Optionally, the system may also provide the sequence of the input text from where the answer was taken as output positions, i.e. a bounding box indicating where the information for the answer came from. Accordingly, it is possible to generate not just the answer but also a bounding box for the text that led the system to generate the answer.
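The two ideas of FIG. 4 can be sketched as follows, purely as an illustration. How the word embedding and the position embedding are combined is not stated above, so the concatenation in combine is an assumption; enclosing_box assumes that all answer tokens lie on the same page. Both function names and the example coordinates are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_start, y_start, x_end, y_end)


def combine(word_embedding: Sequence[float], position_embedding: Sequence[float]) -> List[float]:
    """Join both embeddings into one token vector (concatenation is assumed here)."""
    return list(word_embedding) + list(position_embedding)


def enclosing_box(token_boxes: Sequence[Box]) -> Box:
    """Smallest box that contains every token box of the answer span."""
    if not token_boxes:
        raise ValueError("at least one token box is required")
    return (
        min(b[0] for b in token_boxes),
        min(b[1] for b in token_boxes),
        max(b[2] for b in token_boxes),
        max(b[3] for b in token_boxes),
    )


# Hypothetical boxes for the three tokens of "PDF BOOKMARK SAMPLE" on one line.
answer_boxes = [
    (40.0, 100.0, 70.0, 112.0),
    (75.0, 100.0, 150.0, 112.0),
    (155.0, 99.0, 210.0, 113.0),
]
answer_box = enclosing_box(answer_boxes)
```

With these example values, the output position reported next to the answer text would be the single box spanning all three tokens.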

[0064] FIG. 5 is a general outline of the information retrieval in a system according to the invention. In a first step, with the use of encoders, embeddings are generated. Next, values and keys for each embedding are generated. Thereafter, a key vs. query comparison is carried out, e.g. by way of a K-Nearest Neighbours (KNN) approach.
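The outline of FIG. 5 can be illustrated with the following minimal sketch. It assumes that keys and values are obtained from each embedding by linear projections and that the KNN comparison uses the Euclidean distance; neither detail is specified above, and the names project, make_key_value_pairs and knn are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def project(matrix: Sequence[Vector], vec: Vector) -> List[float]:
    """Multiply a (rows x len(vec)) matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def make_key_value_pairs(
    embeddings: Sequence[Vector],
    w_key: Sequence[Vector],
    w_value: Sequence[Vector],
) -> List[Tuple[List[float], List[float]]]:
    """Produce one (key, value) pair per embedding via assumed linear projections."""
    return [(project(w_key, e), project(w_value, e)) for e in embeddings]


def knn(query_key: Vector, keys: Sequence[Vector], k: int) -> List[int]:
    """Indices of the k keys closest to query_key (Euclidean), nearest first."""
    order = sorted(range(len(keys)), key=lambda i: math.dist(query_key, keys[i]))
    return order[:k]


# Toy data: three 2-d embeddings, identity key projection, doubling value projection.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pairs = make_key_value_pairs(embeddings, [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]])
keys = [key for key, _ in pairs]
nearest = knn([0.9, 0.1], keys, k=2)
retrieved_values = [pairs[i][1] for i in nearest]
```

For larger documents, an approximate nearest-neighbour index would normally replace the full sort used here.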

[0065] FIG. 6 shows an exemplary retrieval diagram.

[0066] In step (1), input text is embedded, i.e. words are turned into vector representations, and a key value pair is produced for every single embedding.

[0067] In step (2), a set of queries is produced from an embedded text input, which contains a question to answer or an instruction to execute. Key value pairs are produced for the queries, too.

[0068] In step (3), the keys that were generated for the input text are compared with the keys that were generated for the queries to identify the most similar matches, using a K-Nearest Neighbour distance measure for the similarity measurement.

[0069] In step (4), the values belonging to the keys that were identified as most similar are taken. These values are passed on for further processing.

[0070] In step (5), the status of the retriever is updated by jointly reasoning over the previously held information and the new information that was gathered in step (4).

[0071] In step (6), it is checked whether the information is (already) sufficient to answer the initial question or to execute the initial instruction. It is then determined whether the process ends or whether another set of (sub-) queries must be answered to conclude the reasoning process.
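Steps (3) to (6) can be sketched as a loop, purely for illustration. The reasoning of the retriever is represented by three caller-supplied functions (update_state, is_sufficient and next_query), all of which are hypothetical stand-ins; the cap max_steps is likewise an assumption of this sketch, as is the use of the Euclidean distance.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
KeyValue = Tuple[Vector, Vector]


def lookup(query_key: Vector, memory: Sequence[KeyValue], k: int) -> List[Vector]:
    """Steps (3) and (4): return the values of the k keys nearest to query_key."""
    ranked = sorted(memory, key=lambda kv: math.dist(query_key, kv[0]))
    return [value for _, value in ranked[:k]]


def retrieve(
    first_query: Vector,
    memory: Sequence[KeyValue],
    update_state: Callable[[List[Vector], List[Vector]], List[Vector]],
    is_sufficient: Callable[[List[Vector]], bool],
    next_query: Callable[[List[Vector]], Vector],
    k: int = 1,
    max_steps: int = 8,
) -> List[Vector]:
    """Repeat look-up and state update until the state is judged sufficient."""
    state: List[Vector] = []
    query = first_query
    for _ in range(max_steps):
        values = lookup(query, memory, k)    # steps (3) and (4)
        state = update_state(state, values)  # step (5)
        if is_sufficient(state):             # step (6)
            break
        query = next_query(state)            # follow-up (sub-)query
    return state


# Toy run: the state is "sufficient" once two values have been gathered.
memory = [([0.0], [10.0]), ([1.0], [20.0]), ([2.0], [30.0])]
result = retrieve(
    first_query=[0.1],
    memory=memory,
    update_state=lambda state, values: state + values,
    is_sufficient=lambda state: len(state) >= 2,
    next_query=lambda state: [2.0],
)
```

In the toy run, the first look-up finds the value stored under the key nearest to the initial query, and the follow-up query retrieves a second value before the loop ends.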

Speedup Test

[0072] The advantages of a system according to the invention are apparent from the following test. The execution times of a system according to the invention were compared with a standard Albert base model (arXiv: 1909.11942v6).

[0073] For the Albert model, a question answering head and a sliding window approach were used. The same procedure was also implemented by the authors of the Albert paper referred to above (cf. https://github.com/google-research/albert). For the system according to the invention, the model and the tokenizer from the Hugging Face Transformers library were used (arXiv: 1910.03771v5).

[0074] The inference process was performed as outlined above.

[0075] As the input text for the models, the text of 100 random SQuAD v2 (arXiv: 1806.03822v1) samples was concatenated, resulting in a text of 69505 characters. The queries were sampled from the questions of the selected SQuAD samples. The same input was used for both models.

[0076] All measurements were conducted on an NVIDIA DGX-1, using one GPU. The results were as follows:

TABLE-US-00002
Number of queries    Albert    New model    Speedup
1                    6.8       2.46         2.76
5                    11.59     2.62         4.42
10                   17.87     2.69         6.64
20                   27.22     2.8          9.72
40                   52.38     3.02         17.34
80                   95.93     3.41         28.13

[0077] The above results show that, in this test, the system according to the invention is considerably more efficient than the prior art system. Moreover, the speedup grows with the number of queries, from 2.76 for a single query to 28.13 for 80 queries, which is of utmost practical relevance.
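The speedup column of TABLE-US-00002 can be recomputed from the two timing columns as the ratio of the Albert time to the time of the new model. The following sketch does so and, as an additional illustration that is not part of the test itself, estimates the average increase in execution time per additional query from the first and last rows. The function names are hypothetical.

```python
from typing import List, Tuple

# (number of queries, Albert, new model, listed speedup), copied from TABLE-US-00002.
ROWS: List[Tuple[int, float, float, float]] = [
    (1, 6.8, 2.46, 2.76),
    (5, 11.59, 2.62, 4.42),
    (10, 17.87, 2.69, 6.64),
    (20, 27.22, 2.8, 9.72),
    (40, 52.38, 3.02, 17.34),
    (80, 95.93, 3.41, 28.13),
]


def speedup(baseline_time: float, new_time: float) -> float:
    """Ratio of the baseline execution time to the new execution time."""
    return baseline_time / new_time


def time_per_extra_query(first_n: int, first_time: float, last_n: int, last_time: float) -> float:
    """Average increase in execution time per additional query."""
    return (last_time - first_time) / (last_n - first_n)


recomputed = [round(speedup(albert, new), 2) for _, albert, new, _ in ROWS]
albert_slope = time_per_extra_query(ROWS[0][0], ROWS[0][1], ROWS[-1][0], ROWS[-1][1])
new_slope = time_per_extra_query(ROWS[0][0], ROWS[0][2], ROWS[-1][0], ROWS[-1][2])
```

Under this endpoint estimate, the Albert time rises by roughly 1.13 per additional query, against roughly 0.012 for the new model, in the time unit of the table. This is consistent with the document encodings being generated only once and reused for every query.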