SYSTEM FOR THE EXTRACTION OF INFORMATION FROM DOCUMENTS
20250077554 · 2025-03-06
Abstract
The invention pertains to a system for the extraction of information from documents, in particular natural language documents, the system comprising an encoder with a neural network, and a retriever that is configured as a reasoning engine. The system is configured such that it supports user-defined queries for at least two pieces of information; the encoder is applied to the documents, in particular the natural language documents, to generate document encodings, and to the user-defined queries, to generate encoded instructions; and the document encodings are queried by the retriever in look-up steps based on the encoded instructions.
Claims
1. A system for the extraction of information from documents, the system comprising: i) an encoder with a neural network; and ii) a retriever that is configured as a reasoning engine; the system being configured such that: it supports user-defined queries for at least two pieces of information; the encoder is applied to the documents, to generate document encodings; the user-defined queries, to generate encoded instructions; the document encodings are queried by the retriever in look-up steps based on the encoded instructions.
2. The system according to claim 1, wherein two or more encoders are used to generate the document encodings.
3. The system according to claim 1, wherein the document encodings are based on embeddings, and key/value pairs derived therefrom.
4. The system according to claim 1, configured such that the retriever is only used once for each encoded instruction.
5. The system according to claim 1, configured such that the document encodings are generated only once, for the document encodings to be queried by the retriever to answer the user-defined queries for the at least two pieces of information.
6. The system according to claim 1, configured such that the retriever i) in the look-up steps searches for information that is most relevant for encoded instructions; ii) reasons over the findings of the look-up steps; and iii) produces a condensed representation of the output of step ii).
7. The system according to claim 6, configured such that i) and ii) may be repeated for each encoded instruction before the condensed representation of the output is produced in iii).
8. The system according to claim 1, configured such that the retriever produces a query that can be used to search for yet missing information.
9. The system according to claim 1, further comprising a generator that creates a response to the user-defined query based on the output representation of the retriever.
10. The system according to claim 9, configured such that the generator is only used once for creating a response to each encoded instruction.
11. A computer-implemented method for extraction of information from documents based on user-defined queries for at least two pieces of information, comprising the steps of: i) providing at least one document; ii) applying an encoder with a neural network, to a. the documents, to generate document encodings; b. the user-defined queries, to generate encoded instructions; iii) querying the document encodings by a retriever in look-up steps based on the encoded instructions, wherein the retriever is configured as a reasoning engine.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0053] The invention will now be described by way of more specific embodiments. These embodiments are not intended to limit the gist of the invention in any way, but rather serve to ease understanding of the invention.
[0054]
[0055]
[0056]
[0057]
[0058]
[0059]
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0060] As outlined in
[0061]
TABLE-US-00001
Query      Answer
company    STARBUCKS STORE #10208
date       Dec. 7, 2014
address    11302 EUCLID AVENUE, CLEVELAND, oh (216)229-0749
total      4.95
[0062]
[0063]
[0064]
[0065]
[0066] In step (1), input text is embedded, i.e. words are turned into vector representations, and a key/value pair is produced for every single embedding.
[0067] In step (2), a set of queries is produced from an embedded text input, which contains a question to answer or an instruction to execute. Key/value pairs are produced for the queries, too.
[0068] In step (3), the keys that were generated for the input text are compared with the keys that were generated for the queries to identify the most similar matches, with a k-nearest-neighbour search serving as the similarity measurement.
[0069] In step (4), the values belonging to the keys that were identified as most similar are taken. These values are passed on for further processing.
[0070] In step (5), the status of the retriever is updated by jointly reasoning over the previously held information and the new information that was gathered in step (4).
[0071] In step (6), it is checked whether the information is (already) sufficient to answer the initial question or to execute the initial instruction. It is then determined whether the process ends or whether another set of (sub-) queries must be answered to conclude the reasoning process.
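Steps (3) to (6) can be illustrated with a minimal Python sketch. It is not the implementation of the system: it assumes that steps (1) and (2) have already produced keys and values, it uses cosine similarity with an exhaustive top-k search as a stand-in for the k-nearest-neighbour comparison, and the callables update_state, is_sufficient and next_queries are hypothetical placeholders for the learned reasoning of the retriever. All names and the toy data are illustrative assumptions.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lookup(query_key: Vector, doc_keys: List[Vector],
           doc_values: List[str], k: int) -> List[str]:
    """Steps (3) and (4): rank document keys by similarity to the
    query key and return the values of the k most similar keys."""
    ranked = sorted(range(len(doc_keys)),
                    key=lambda i: cosine(query_key, doc_keys[i]),
                    reverse=True)
    return [doc_values[i] for i in ranked[:k]]


def retrieve(query_keys: List[Vector], doc_keys: List[Vector],
             doc_values: List[str],
             update_state: Callable[[List[str], List[str]], List[str]],
             is_sufficient: Callable[[List[str]], bool],
             next_queries: Callable[[List[str]], List[Vector]],
             k: int = 1, max_rounds: int = 4) -> List[str]:
    """Repeat look-up steps until the gathered information suffices."""
    state: List[str] = []
    for _ in range(max_rounds):
        found: List[str] = []
        for query_key in query_keys:
            found.extend(lookup(query_key, doc_keys, doc_values, k))
        state = update_state(state, found)   # step (5), placeholder logic
        if is_sufficient(state):             # step (6), placeholder check
            break
        query_keys = next_queries(state)     # new set of (sub-)queries
    return state


# Toy data (assumed): three document keys with their associated values.
doc_keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
doc_values = ["company", "total", "date"]

result = retrieve(
    query_keys=[[1.0, 0.1]],
    doc_keys=doc_keys,
    doc_values=doc_values,
    update_state=lambda state, found: state + [v for v in found if v not in state],
    is_sufficient=lambda state: len(state) >= 2,
    next_queries=lambda state: [[0.0, 1.0]],
)
print(result)
```

In this sketch the document keys and values are computed once and reused in every round, so only the query keys change between look-up steps.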
Speedup Test
[0072] The advantages of a system according to the invention are apparent from the following test. The execution times of a system according to the invention were compared with a standard Albert base model (arXiv: 1909.11942v6).
[0073] For the Albert model, a question answering head and a sliding window approach were used. The same procedure was also implemented by the authors of the Albert paper referred to above (cf. https://github.com/google-research/albert). For the system according to the invention, the model and the tokenizer from the Hugging Face Transformers library were used (arXiv: 1910.03771v5).
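A sliding window approach of the kind mentioned for the baseline can be sketched as follows. This is a generic illustration, not the code used in the test: the function name, the whitespace-free toy tokens, the window size and the stride are all assumptions chosen for readability.

```python
from typing import List, Sequence


def sliding_windows(tokens: Sequence[str], window: int, stride: int) -> List[List[str]]:
    """Split a token sequence into overlapping windows.

    Consecutive windows start `stride` tokens apart; the last window
    may be shorter than `window` tokens.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if stride > window:
        raise ValueError("stride larger than window would skip tokens")
    if not tokens:
        return []
    windows: List[List[str]] = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the current window already reaches the end of the text
        start += stride
    return windows


# Toy example (assumed values): 8 tokens, window of 4, stride of 2.
example = sliding_windows(list("abcdefgh"), window=4, stride=2)
print(len(example), example)
```

In a typical question answering setup of this kind, each question is paired with every window, so the work grows with the number of questions times the number of windows. That pairing is an assumption about the baseline and is not stated in the test description.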
[0074] The inference process was performed as outlined above.
[0075] As the input text for the models, the text of 100 random SQuAD v2 (arXiv: 1806.03822v1) samples was concatenated, resulting in a text consisting of 69505 characters. For the queries, questions from the selected SQuAD samples were sampled. The same input was used for both models.
[0076] All measurements were conducted on an Nvidia DGX-1, using one GPU. The results were as follows:
TABLE-US-00002
# queries    Albert    New model    Speedup
1            6.8       2.46         2.76
5            11.59     2.62         4.42
10           17.87     2.69         6.64
20           27.22     2.8          9.72
40           52.38     3.02         17.34
80           95.93     3.41         28.13
[0077] The above results show that the system according to the invention was considerably more efficient than the prior art system in every tested configuration. Moreover, the speedup increases with the number of queries, from 2.76 for a single query to 28.13 for 80 queries, which is of utmost practical relevance.
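The speedup column can be checked from the two execution time columns, since each speedup value is the Albert time divided by the time of the new model. The short Python sketch below recomputes it from the values in the table; the variable names are illustrative, and the additional time per extra query is a simple average between the first and the last row, not a figure reported in the test.

```python
# Rows from the table: (number of queries, Albert time, new model time, reported speedup)
rows = [
    (1, 6.8, 2.46, 2.76),
    (5, 11.59, 2.62, 4.42),
    (10, 17.87, 2.69, 6.64),
    (20, 27.22, 2.8, 9.72),
    (40, 52.38, 3.02, 17.34),
    (80, 95.93, 3.41, 28.13),
]

# Speedup = Albert execution time / new model execution time.
speedups = [albert / new for _, albert, new, _ in rows]

for (n_queries, _, _, reported), computed in zip(rows, speedups):
    print(f"{n_queries:>2} queries: computed {computed:.2f}, reported {reported}")

# Average additional time per extra query between 1 and 80 queries.
extra_queries = rows[-1][0] - rows[0][0]
albert_per_query = (rows[-1][1] - rows[0][1]) / extra_queries
new_per_query = (rows[-1][2] - rows[0][2]) / extra_queries
print(f"Albert: {albert_per_query:.3f} per extra query")
print(f"New model: {new_per_query:.3f} per extra query")
```

The much smaller increase per additional query for the new model is consistent with the document encodings being generated only once and reused for all queries.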