Computer-Implemented Method of Domain-Specific Full-Text Document Search

20200210491 · 2020-07-02

Abstract

A computer-implemented method for domain-specific full-text document search, comprising a set of document indexing steps and a set of document querying steps. Three main processes are involved: preparation of embeddings, indexing of a set of relevant documents, and querying of the indexed documents.

Claims

1. A computer-implemented method for domain-specific full-text document search including indexing of documents set of steps and querying documents set of steps, characterized in that during indexing of documents set of steps:

In step 1, text analysis from segmentation to basic syntactic dependencies and morphological features using a neural network trained previously on an unrelated training corpus T1, coming from a similar domain or a general domain, on a large corpus C in the language of the documents to be later indexed and containing also the documents to be indexed, if available, is performed resulting in a corpus R1;

In step 2, semantic analysis is performed with use of neural networks either after the text analysis with use of the corpus R1 as an input, or as an alternative, step 2 is performed jointly with the text analysis, taking input to step 1 directly, resulting in corpus R2, while the semantic processing engine used is trained on a corpus T2 which has to contain semantic relations in the form of directed dependencies between content words, manually prepared, and extended to multilingual cases by known multitask techniques;

Next in step 3, linking of all named and other non-verb entities in the corpus R2 to any large or small scale ontology O3B that contains at least a paragraph-long description of the ontology entry is performed resulting in Verb entities (predicates) linked to semantic classes O3B consisting of multilingual sets of normalized predicates, while the semantic classes are created based on extraction from parallel corpora and manual pruning using multiple annotation and majority voting technologies and pre-prepared data T3A for sense-based verb classification and T3B for named entity recognition;

Next in step 4, the corpus R2 and the corpus R3 are merged, resulting in a corpus R4, where the corpus R3 is fully grounded, while the merge is performed as straightforward substitution of entities from the corpus R3 to labelled graphs containing the semantic analysis in the corpus R2;

Next in step 5, word-, lemma- and nametype- and grounded entities embeddings are created from the corpus R4 based on their local and global context within R4, as expressed in the semantic structure contained in R4 resulting in a set of tables E5;

while during indexing of documents set of steps, steps 1 to 4 are performed on every document Di to be indexed, resulting in an annotated document DiR4, followed next by mapping all entities in DiR4 of every document Di processed to embeddings using the set of tables E5, while the resulting embeddings are stored with the document and text positions in the form of a multidimensional index X, the dimensions of which will be determined at indexing time by minimizing the cost of access, using an optimizing technique called Minimum Description Length method, resulting in documents indexed by entity embeddings taken from E5;

while during querying documents set of steps, a user input query Q inserted into simple full-text window is analyzed with use of steps from 1 to 4, as if the query is a document itself resulting in an annotated query Q4 and then entities identified in the annotated query Q4 are mapped to embeddings using tables E5, resulting in a set of embeddings Q5;

embeddings Q5 are used in an approximate search performed by multidimensional search methods through index X resulting in a set A of documents found, each associated with a real number representing similarity to the query Q;

returned documents and positions in them matching the query are pruned to a predefined number of outputs set by the user at query time;

and returned documents are ranked by similarity and presented on the computer screen together with additional information on a total number of documents found.

Description

LIST OF DRAWINGS OF EXEMPLARY EMBODIMENTS

[0045] The attached schemes serve to illustrate the invention, where

[0046] FIG. 1 Scheme of set of steps to create set of embedding tables

[0047] FIG. 2 Scheme of indexing document set of steps

[0048] FIG. 3 Scheme of querying documents set of steps

EXEMPLARY EMBODIMENT

[0049] The three main processes involved in the present method are shown in the attached schemes.

[0050] FIG. 1 shows an example of the creation of the set of embedding mapping tables E5 from a large corpus C in the same language as the set of documents to be later indexed. The documents themselves may or may not be part of this corpus; however, more accurate results are obtained if they are included in C.

[0051] FIG. 2 shows an example of indexing a single document D.sub.i. The process depicted in FIG. 2 has to be performed for every document in the collection to be indexed, so that the document is available for search at query time.

[0052] FIG. 3 shows an example of processing a query at query time, i.e. when a user searches for a document. A query Q may be expressed as a single word, as a sequence of a few words, as a textual description of what the user wants to search for, or as a transcript of what the user said if the system uses automatic speech recognition so that the user can talk instead of typing. In all such cases, the text of the query is processed by the steps depicted in FIG. 3, and the resulting documents, with the positions matching the query Q color coded or otherwise highlighted, are presented to the user who posed the query Q.

[0053] In a preferred embodiment scenario, the following concrete implementation pipelines (sequences of processing modules) are used. The referenced modules are assumed to already contain all necessary models in order to perform the respective step; these models are either available with the individual components directly, or they can be trained (learned from data, for example for a different language or domain) in a way described also with the individual components through the references.

[0054] 1. In the creation of embeddings set of steps (FIG. 1), the Basic Linguistic Analysis step is performed on a large collection of documents in the language of interest (not necessarily only the set of documents to be indexed later; general sets can be used, e.g. corpora collected from the internet) by using the UDPipe tools (Straka et al., 2016), resulting in R1. The Semantic Analysis steps are performed on R1 by Treex, a modular framework for deep language analysis (https://lindat.mff.cuni.cz/services/treex), using the t-layer analysis scenario, as available, e.g. at http://lindat.mff.cuni.cz/services/treex-web/run, resulting in R2. While the t-layer analysis scenario can also be used for the Basic Linguistic Analysis, better results are obtained by first running the UDPipe tools and then, after a simple conversion, processing the data by Treex using the t-layer analysis scenario, starting with the A2T::CS::MarkEdgesToCollapse module, as described at http://lindat.mff.cuni.cz/services/treex-web/run. For the Named Entity Recognition and Linking steps, two successive sub-steps are required. First, a named entity module must process the result of the semantic analysis module (R2), identify the spans of named entities and assign each of them a type; for this purpose, the NameTag tool (https://lindat.mff.cuni.cz/en/services#NameTag) is used. Its output is then fed directly to a Named Entity Linking (grounding) sub-step, which is implemented by (Taufer, 2016), and results in R3. R4 is then produced by simply merging R2 and R3 based on the position of the individual words in the text by using stand-off annotation, which is a standard technique for text annotation.
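The position-based merge of R2 and R3 can be illustrated by the following minimal sketch. It is not the implementation used by the referenced tools: the record layout (the keys `lemma`, `ne_type`, `entity_id`, and entity spans with an exclusive `end`), the identifier `onto:Prague` and the function name `merge_standoff` are assumptions made only for this example.

```python
from typing import Dict, List


def merge_standoff(tokens: List[str],
                   r2: Dict[int, dict],
                   r3: List[dict]) -> List[dict]:
    """Merge semantic annotation (r2) and entity links (r3) by token position.

    r2 maps a token index to its semantic attributes (hypothetical layout).
    r3 lists entity spans as {"start", "end", "type", "entity_id"}, where
    "end" is exclusive (hypothetical layout).
    """
    merged = []
    for i, form in enumerate(tokens):
        record = {"pos": i, "form": form, "ne_type": None, "entity_id": None}
        record.update(r2.get(i, {}))  # attach the semantic attributes, if any
        merged.append(record)
    for span in r3:
        # Stand-off annotation: the span refers to positions, not to text.
        for i in range(span["start"], span["end"]):
            merged[i]["ne_type"] = span["type"]
            merged[i]["entity_id"] = span["entity_id"]
    return merged


tokens = ["Prague", "hosts", "the", "court"]
r2 = {0: {"lemma": "Prague"}, 1: {"lemma": "host"},
      2: {"lemma": "the"}, 3: {"lemma": "court"}}
r3 = [{"start": 0, "end": 1, "type": "LOCATION", "entity_id": "onto:Prague"}]
r4 = merge_standoff(tokens, r2, r3)
```

Because both annotation layers refer to the same token positions, neither layer has to be rewritten to accommodate the other.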

[0055] Embeddings are created in the final step. First, the following data streams are created by extraction from R4, based on the annotation attributes: word sequence, lemma sequence, sequence of typed named entities and sequence of grounded entities. These sequences are then fed to an embedding-creating subsystem, which is implemented by a Deep Artificial Neural network, as described in (Mikolov et al., 2013), where the word is replaced by the respective unit (word, lemma, NE, grounded NE) of each of the four data streams. The result is E5, a set of embedding tables mapping the four types of units to real-valued vectors of a predefined length (as described in (Mikolov et al., 2013)).
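The extraction of the four data streams, and the kind of training pairs a skip-gram model in the style of (Mikolov et al., 2013) consumes, can be sketched as follows. The attribute names, the choice to skip positions with no value, and the window size are assumptions of this example; the network training itself is omitted.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical attribute names for the four streams extracted from R4.
STREAM_KEYS = ("form", "lemma", "ne_type", "entity_id")


def extract_streams(r4: List[dict]) -> Dict[str, List[str]]:
    """Split annotated records into one unit sequence per attribute.

    Positions with no value for an attribute (e.g. a word that is not a
    named entity) are skipped in that stream; this is an assumption.
    """
    return {key: [rec[key] for rec in r4 if rec.get(key) is not None]
            for key in STREAM_KEYS}


def skipgram_pairs(sequence: List[str],
                   window: int = 2) -> Iterator[Tuple[str, str]]:
    """Yield (center, context) pairs within a symmetric window."""
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, sequence[j]


r4 = [
    {"form": "Courts", "lemma": "court", "ne_type": None, "entity_id": None},
    {"form": "in", "lemma": "in", "ne_type": None, "entity_id": None},
    {"form": "Prague", "lemma": "Prague", "ne_type": "LOCATION",
     "entity_id": "onto:Prague"},
]
streams = extract_streams(r4)
pairs = list(skipgram_pairs(streams["lemma"], window=1))
```

Each stream is processed independently, so one embedding table per unit type is obtained, which matches the structure of E5 described above.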

[0056] 2. In the document indexing series of steps (FIG. 2), the same sequence of steps up to the merging step (Step No. 4) has to be performed on every document to be indexed (the documents are numbered by index i, ranging from 1 to k, where k is the number of documents to process in one run). If any of the steps was replaced by a different embodiment of the same text processing step during embedding table creation, best results are achieved if the same replacement is used for document indexing. For the preferred embodiment of the embedding creation step, the sequence is as follows. The Basic Linguistic Analysis step is performed on every document D.sub.i by using the UDPipe tools (Straka et al., 2016), resulting in D.sub.iR1. The Semantic Analysis steps are performed on D.sub.iR1 by Treex, a modular framework for deep language analysis (https://lindat.mff.cuni.cz/services/treex), using the t-layer analysis scenario, as available e.g. at http://lindat.mff.cuni.cz/services/treex-web/run, resulting in D.sub.iR2. While the t-layer analysis scenario can also be used for the Basic Linguistic Analysis, better results are obtained by first running the UDPipe tools and then, after a simple conversion, processing the data by Treex using the t-layer analysis scenario, starting with the A2T::CS::MarkEdgesToCollapse module, as described at http://lindat.mff.cuni.cz/services/treex-web/run.

[0057] For the Named Entity Recognition and Linking steps, two successive sub-steps are required. First, a named entity module must process the result of the semantic analysis module (D.sub.iR2), identify the spans of named entities and assign each of them a type; for this purpose, NameTag (https://lindat.mff.cuni.cz/en/services#NameTag) is used. Its output is then fed directly to a Named Entity Linking (grounding) sub-step, which is implemented by (Taufer, 2016), and results in D.sub.iR3. D.sub.iR4 is then produced by simply merging D.sub.iR2 and D.sub.iR3 based on the position of the individual words in the text by using stand-off annotation, which is a standard technique for text annotation.

[0058] All four attributes of the resulting annotation in D.sub.iR4, namely words, lemmas, named entities and grounded entities, are then mapped to embeddings using the corresponding table from E5. These entities are then associated with the document D.sub.i in its (inverted) index X, and a position in the document is attached to each embedding for targeted display to the user at query time if the document is selected. In addition, the embeddings for a given position in a document, concatenated to form a single vector, are taken as descriptors for the similarity search procedure according to (Nalepa et al., 2018) and processed to create the necessary indexing structures for search at query time.
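The mapping of one annotated position to a concatenated descriptor, stored together with the document and the position, can be sketched as follows. This is a simplified illustration: E5 is modelled as a dictionary of dictionaries, the index X as a flat list, and a unit missing from its table contributes a zero vector. All three choices are assumptions of the example and do not reproduce the indexing structures of (Nalepa et al., 2018).

```python
from typing import Dict, List, Tuple

Vector = List[float]
Entry = Tuple[str, int, Vector]  # (document id, token position, descriptor)

# Hypothetical attribute names for the four annotation attributes.
STREAM_KEYS = ("form", "lemma", "ne_type", "entity_id")


def descriptor(record: dict, e5: Dict[str, Dict[str, Vector]],
               dim: int) -> Vector:
    """Concatenate the four embeddings of one annotated position.

    A missing attribute, or a unit absent from its table, contributes a
    zero vector of length `dim` (an assumption of this sketch).
    """
    parts: Vector = []
    for key in STREAM_KEYS:
        unit = record.get(key)
        parts.extend(e5[key].get(unit, [0.0] * dim))
    return parts


def index_document(doc_id: str, d_r4: List[dict],
                   e5: Dict[str, Dict[str, Vector]], dim: int,
                   index: List[Entry]) -> None:
    """Append one (doc_id, position, descriptor) entry per position."""
    for pos, record in enumerate(d_r4):
        index.append((doc_id, pos, descriptor(record, e5, dim)))


e5 = {
    "form": {"Prague": [1.0, 0.0]},
    "lemma": {"Prague": [0.5, 0.5]},
    "ne_type": {"LOCATION": [0.0, 1.0]},
    "entity_id": {"onto:Prague": [1.0, 1.0]},
}
doc = [
    {"form": "Prague", "lemma": "Prague", "ne_type": "LOCATION",
     "entity_id": "onto:Prague"},
    {"form": "hosts", "lemma": "host", "ne_type": None, "entity_id": None},
]
index: List[Entry] = []
index_document("D1", doc, e5, dim=2, index=index)
```

Keeping the position next to each descriptor is what later allows the matching passage, and not only the document, to be shown to the user.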

[0059] An additional document or set of documents may be added to the index X at any time by following all the steps described here and in FIG. 2.

[0060] 3. At query time, the user enters a query Q in the form of text, or a spoken query is transcribed to text by an automatic speech recognition module (not shown in FIG. 3, since it is a standard optional extension in full-text search). The query can be of any length, from a single word to a text describing the user's search goal. The query then undergoes the same steps as in component 2 (document indexing), including the final mapping of the annotated query to the precomputed embeddings (cf. FIG. 3). That is, the query is first processed with the Basic Linguistic Analysis step by using the UDPipe tools (Straka et al., 2016), resulting in the annotated query QR1. The Semantic Analysis steps are performed on QR1 by Treex, a modular framework for deep language analysis (https://lindat.mff.cuni.cz/services/treex), using the t-layer analysis scenario, as available e.g. at http://lindat.mff.cuni.cz/services/treex-web/run, resulting in QR2. While the t-layer analysis scenario can also be used for the Basic Linguistic Analysis, better results are obtained by first running the UDPipe tools and then, after a simple conversion, processing the data by Treex using the t-layer analysis scenario, starting with the A2T::CS::MarkEdgesToCollapse module, as described at http://lindat.mff.cuni.cz/services/treex-web/run. For the Named Entity Recognition and Linking steps, two successive sub-steps are required. First, a named entity module must process the result of the semantic analysis module (QR2), identify the spans of named entities and assign each of them a type; for this purpose, the NameTag tool (https://lindat.mff.cuni.cz/en/services#NameTag) is used. Its output is then fed directly to a Named Entity Linking (grounding) sub-step, which is implemented by (Taufer, 2016), and results in QR3. QR4 is then produced by simply merging QR2 and QR3 based on the position of the individual words in the text by using stand-off annotation, which is a standard technique for text annotation.

[0061] All four attributes of the resulting annotation in QR4, namely words, lemmas, named entities and grounded entities, are then mapped to embeddings using the corresponding table from E5, forming a set of embeddings to be used as descriptors in the similarity search procedure as described in (Nalepa et al., 2018).

[0062] The similarity search procedure (Nalepa et al., 2018) is run against the set of documents D.sub.i as indexed in X, using the embeddings extracted from the query by the above procedure as descriptors. It results in a set of documents D.sub.j and a set of positions {p.sub.jx} within each such document, ranked by similarity. These documents are displayed in a compact form to the user who posed the query Q, with a reference to the full document (and a position in it).
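The ranking, pruning and reporting of the total number of documents found can be illustrated with an exhaustive cosine-similarity scan. This brute-force scan is only a stand-in and does not reproduce the procedure of (Nalepa et al., 2018); the flat index layout, the `threshold` parameter and the function names are assumptions of the example.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Entry = Tuple[str, int, Vector]  # (document id, token position, descriptor)
Hit = Tuple[str, float, List[int]]  # (document id, best score, positions)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vectors: List[Vector], index: List[Entry],
           top_n: int, threshold: float = 0.5) -> Tuple[List[Hit], int]:
    """Return (hits ranked by similarity and pruned to top_n, total found).

    A position matches when its similarity to any query descriptor
    reaches `threshold` (a hypothetical cut-off).
    """
    if not query_vectors:
        return [], 0
    best: Dict[str, float] = {}
    positions: Dict[str, List[int]] = {}
    for doc_id, pos, vec in index:
        score = max(cosine(q, vec) for q in query_vectors)
        if score >= threshold:
            best[doc_id] = max(best.get(doc_id, 0.0), score)
            positions.setdefault(doc_id, []).append(pos)
    ranked = sorted(best, key=lambda d: best[d], reverse=True)
    hits = [(d, best[d], positions[d]) for d in ranked[:top_n]]
    return hits, len(ranked)


index = [
    ("D1", 0, [1.0, 0.0]),
    ("D1", 1, [0.0, 1.0]),
    ("D2", 0, [0.8, 0.6]),
    ("D3", 0, [0.0, 1.0]),
]
hits, total = search([[1.0, 0.0]], index, top_n=1)
```

In this example two documents pass the threshold, and only the best one is returned because `top_n` is 1; the total is still reported so that it can be shown alongside the pruned result list.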

REFERENCES

[0063] Timothy Dozat, Christopher D. Manning: Deep Biaffine Attention For Neural Dependency Parsing. https://arxiv.org/pdf/1611.01734.pdf. 2018

[0064] A. Grover and J. Leskovec: node2vec: Scalable feature learning for networks, in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp. 855-864, 2016.

[0065] Hajič Jan, Panevová Jarmila, Hajičová Eva, Sgall Petr, Pajas Petr, Štěpánek Jan, Havelka Jiří, Mikulová Marie, Žabokrtský Zdeněk, Ševčíková-Razímová Magda, Urešová Zdeňka: Prague Dependency Treebank 2.0. Software prototype, Linguistic Data Consortium, Philadelphia, Pa., USA, ISBN 1-58563-370-4, http://www.ldc.upenn.edu, July 2006

[0066] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781v3 [cs.CL]

[0067] Filip Nalepa, Michal Batko, Pavel Zezula (2018): Towards Faster Similarity Search by Dynamic Reordering of Streamed Queries. T. Large-Scale Data- and Knowledge-Centered Systems 38: 61-88 (2018)

[0068] Page, Larry, PageRank: Bringing Order to the Web. Archived from the original on May 6, 2002. Retrieved Sep. 11, 2016, Stanford Digital Library Project, talk. Aug. 18, 1997 (archived 2002)

[0069] The Apache Software Foundation, Welcome to Apache Lucene. Lucene News section. Archived from the original on 21 Dec. 2017. Retrieved 21 Dec. 2017.

[0070] Straka Milan, Hajič Jan, Straková Jana: UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, ISBN 978-2-9517408-9-1, pp. 4290-4297, 2016

[0071] Keet Sugathadasa, Buddhi Ayesha, Nisansa de Silva, Amal Shehan Perera, Vindula Jayawardana, Dimuthu Lakmal, Madhavi Perera: Legal Document Retrieval using Document Vector Embeddings and Deep Learning. https://arxiv.org/pdf/1805.10685.pdf, May 27, 2018, retrieved Dec. 10, 2018.

Taufer, Pavel: Named Entity Linking. Diploma thesis, MFF UK, 2016.

[0072] Urešová Zdeňka, Fučíková Eva, Hajičová Eva, Hajič Jan: Creating a Verb Synonym Lexicon Based on a Parallel Corpus. In: Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), Paris, France, ISBN 979-10-95546-00-9, pp. 1432-1437, 2018

[0073] Wolters Kluwer. On ASPI. https://www.wolterskluwer.cz/cz/aspi/o-aspi/o-aspi.c-24.html Archived from the original on Dec. 2, 2018.