System and method for performing a search in a vector space based search engine

20230138014 · 2023-05-04

    Abstract

    The invention provides a relevance feedback system and computer-implemented method for performing a search in a vector space comprising a first number of target vectors. The method comprises forming a first search query vector, determining a second number of first search hit vectors among the first number of target vectors based on the first search query vector using a first distance function, determining a third number of flagged vectors, determining a vector subspace spanned by the flagged vectors and/or a second distance function by utilizing the flagged vectors, and determining a plurality of second hit vectors among the target vectors based on the first search query vector and the vector subspace and/or the second distance function.

    Claims

    1. A computer-implemented method of performing a search in a vector space based search engine, the vector space comprising a first number of target vectors among which one or more search hit vectors are determined, the method comprising: forming a first search query vector based on search query data, determining a second number of first search hit vectors among the first number of target vectors based on the first search query vector using a first search space distance function, the second number being smaller than the first number, determining a third number of flagged search hit vectors, the third number being smaller than the second number, determining at least one of a vector subspace spanned by the flagged search hit vectors and a second search space distance function by utilizing dimension-specific standard deviation of the flagged search hit vectors, and determining a plurality of second search hit vectors among the target vectors based on the first search query vector and at least one of the vector subspace and the second search space distance function.

    2. The method according to claim 1, comprising: determining said vector subspace, determining a second search query vector based on the vector subspace and the first search query vector, and determining the second search query vector such that it is located closer to that subspace than the first search query vector.

    3. The method according to claim 2, wherein the second query vector is located on a line passing through the first search query vector and being perpendicular to the subspace.

    4. The method according to claim 1, wherein the vector subspace has N−1 dimensions, where N is the number of flagged search hit vectors.

    5. The method according to claim 1, further comprising: determining a second search query vector based on the vector subspace and the first search query vector, and determining the first search hit vectors and second search hit vectors using the second search query vector and the first search space distance function.

    6. The method according to claim 1, further comprising: determining the first search hit vectors using the first search space distance function and the first search query vector, the first distance function typically being a spherical distance function, creating said second search space distance function based on the flagged search hit vectors and the first search query vector, the second distance function typically being an ellipsoidal distance function, and determining the second search hit vectors using the second search space distance function and the first search query vector.

    7. The method according to claim 6, wherein the second search space distance function is created by dividing the distance for each dimension of said vector space by the ratio of change in the standard deviation between the flagged search hit vectors and the first search hit vectors for that dimension.

    8. The method according to claim 1, wherein the flagged search hit vectors are determined by receiving first search hit flagging data from a user via user interface means.

    9. The method according to claim 1, wherein the flagged search hit vectors are determined automatically using additional information linked with the target vectors.

    10. The method according to claim 1, wherein: the search query data comprises natural language data in graph format, such as tree format, and the first search query vector is formed by embedding the graph into the first search query vector using an at least partly neural network-based algorithm, for example by first embedding the nodes of the graph into node vector values and subsequently embedding the graph using the node vector values using a neural network.

    11. The method according to claim 10, wherein the search query data comprises natural language data units arranged as graph nodes according to meronymity and/or hyponymity relationships between the data units, as inferred from a natural language-containing document.

    12. The method according to claim 10, wherein said embedding is carried out using a graph based neural network model which has been trained using supervised machine learning so as to minimize angles between the vectors of graphs with technically similar content.

    13. A system for determining a subset of documents among a set of documents, the system comprising: a vector processing unit adapted to convert the set of documents into a first number of target vectors in a vector space and an initial search query data into a first search query vector, and a search unit adapted to: determine a second number of first search hit vectors among the first number of target vectors based on the first search query vector using a first search space distance function, the second number being smaller than the first number, determine a third number of flagged search hit vectors, the third number being smaller than the second number, determine at least one of a vector subspace spanned by the flagged search hit vectors and a second search space distance function by utilizing dimension-specific standard deviation of the flagged search hit vectors, and determine a plurality of second search hit vectors among the target vectors based on the first search query vector and at least one of the vector subspace and the second search space distance function, the second search hit vectors corresponding to said subset of documents.

    14. The system according to claim 13, being adapted to perform: forming the first search query vector based on the initial search query data, determining the second number of first search hit vectors among the first number of target vectors based on the first search query vector using the first search space distance function, the second number being smaller than the first number, determining the third number of flagged search hit vectors, the third number being smaller than the second number, determining at least one of the vector subspace spanned by the flagged search hit vectors and the second search space distance function by utilizing dimension-specific standard deviation of the flagged search hit vectors, and determining the plurality of second search hit vectors among the target vectors based on the first search query vector and at least one of the vector subspace and the second search space distance function.

    15. Use of a vector subspace spanned by a plurality of vectors in an original vector space for fine-tuning the search results of a vector space based search engine.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0052] FIG. 1 shows a block chart of an exemplary neural network trained, vector based text document search engine utilizing graph conversion and graph embedding.

    [0053] FIG. 2 illustrates the subspace based query vector amendment.

    [0054] FIG. 3A shows a block diagram of an exemplary graph with meronym/holonym edge relations.

    [0055] FIG. 3B shows a block diagram of an exemplary graph with meronym/holonym edge relations and hyponym/hypernym edge relations.

    [0056] FIG. 4 shows a flow chart of an exemplary graph parsing algorithm.

    [0057] FIG. 5 shows a block diagram of patent search neural network training using patent search/citation data as training data.

    DETAILED DESCRIPTION OF EMBODIMENTS

    Definitions

    [0058] “Natural language unit” herein means a chunk of text or, after embedding, a vector representation of a chunk of text, i.e. a sentence vector descriptive of the chunk. The chunk can be a single word or a multi-word sub-concept appearing once or more in the original text, stored in computer-readable form. The natural language units may be presented as a set of character values (known usually as “strings” in computer science) or numerically as multi-dimensional vector values, or references to such values. E.g. bag-of-words or Recurrent Neural Network approaches can be used to produce sentence vectors.

    [0059] “Block of natural language” refers to a data instance containing a linguistically meaningful combination of natural language units, for example one or more complete or incomplete sentences of a language, such as English. The block of natural language can be expressed, for example as a single string and stored in a file in a file system and/or displayed to the user via the user interface.

    [0060] “Patent document” refers to the natural language content of a patent application or granted patent. Patent documents are associated in the present system with a publication number that is assigned by a recognized patent authority, such as the EPO, WIPO or USPTO, or another national or regional patent office of another country or region. The term “claim” refers to the essential content of a claim, in particular an independent claim, of a patent document. The term “specification” refers to the content of a patent document covering at least a portion of the description of the patent document. A specification can also cover other parts of the patent document, such as the abstract or the claims. Claims and specifications are examples of blocks of natural language.

    [0061] “Claim” is herein defined as a block of natural language which would be considered as a claim by the European Patent Office on the effective date of this patent application.

    [0062] “Edge relation” herein may be in particular a technical relation extracted from a block and/or a semantic relation derived using the semantics of the natural language units concerned. In particular, the edge relation can be:

    [0063] a meronym relation (also: meronym/holonym relation); meronym: X is part of Y; holonym: Y has X as part of itself; for example: “wheel” is a meronym of “car”,

    [0064] a hyponym relation (also: hyponym/hypernym relation); hyponym: X is a subordinate of Y; hypernym: X is a superordinate of Y; for example: “electric car” is a hyponym of “car”, or

    [0065] a synonym relation: X is the same as Y.

    [0066] In some embodiments, the edge relations are defined between successive nodes of a recursive graph, each node containing a natural language unit as node value.

    [0067] Further possible technical relations include thematic relations, referring to the role that a sub-concept of a text plays with respect to one or more other sub-concepts, other than the abovementioned relations. At least some thematic relations can be defined between successive units. In one example, the thematic relation of a parent unit is defined in the child unit. An example of thematic relations is the role class “function”. For example, the function of “handle” can be “to allow manipulation of an object”. Such thematic relation can be stored as a child unit of the “handle” unit, the “function” role being associated with the child unit. A thematic relation may also be a general-purpose relation which has no predefined class (or has a general class such as “relation”), but the user may define the relation freely. For example, a general-purpose relation between a handle and a cup can be “[handle] is attached to [cup] with adhesive”. Such thematic relation can be stored as a child unit of either the “handle” unit or the “cup” unit, or both, preferably with inter-reference to each other.

    [0068] “Graph” or “data graph” refers to a data instance that follows a generally recursive and/or network data schema, like a tree schema. The present system is capable of simultaneously containing several different graphs that follow the same data schema and whose data originates from and/or relates to different sources. The graph can in practice be stored in any suitable text or binary format that allows storage of data items recursively and/or as a network. The graph is in particular a semantic and/or technical graph (describing semantic and/or technical relations between the node values), as opposed to a syntactic graph (which describes only linguistic relations between node values). The graph can be a tree-form graph. Forest form graphs including a plurality of trees are considered tree-form graphs herein. In particular, the graphs can be technical tree-form graphs.
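    As a purely illustrative sketch (the class and field names below are assumptions, not taken from the document), a node of such a tree-form graph, carrying a natural language unit as node value and an edge relation label towards its parent, could be represented as follows:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class GraphNode:
    """One node of a tree-form technical graph: a natural language unit,
    the edge relation to the parent node, and the child nodes."""
    unit: str
    edge_relation: Optional[str] = None  # e.g. "meronym" or "hyponym"; None for the root
    children: List["GraphNode"] = field(default_factory=list)

# "wheel" is a meronym of "car"; "electric car" is a hyponym of "car".
root = GraphNode("car", children=[
    GraphNode("wheel", edge_relation="meronym"),
    GraphNode("electric car", edge_relation="hyponym"),
])
```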

    [0069] “Data schema” refers to the rules according to which data, in particular natural language units and data associated therewith, such as information of the technical relation between the units, are organized.

    [0070] “(Natural language) token” refers to a word or multi-word chunk in a larger block of natural language. A token may contain also metadata relating to the word or word chunk, such as the part-of-speech (POS) label or syntactic dependency tag. A “set” of natural language tokens refers in particular to tokens that can be grouped based on their text value, POS label or dependency tag, or any combination of these according to predetermined rules or fuzzy logic.

    [0071] The terms “data storage unit/means”, “processing unit/means” and “user interface unit/means” refer primarily to software means, i.e. computer-executable code, that are adapted to carry out the specified functions, that is, storing of digital data, allowing the user to interact with the data, and processing the data, respectively. All of these components of the system can be carried out in software run by either a local computer or a web server, through a locally installed web browser, for example, supported by suitable hardware for running the software components.

    [0072] It should also be noted that herein, using the initial, i.e. first, search query vector equals using the initial search query data, which may be at least partly in natural language form, together with the vector embedder.

    Description of Selected Embodiments

    System Overview

    [0073] FIG. 1 shows an exemplary vector based search engine system for text document search. The system is particularly suitable for searching technical documents, such as patent documents, or scientific documents. The system comprises a document store 10A, which contains a plurality of natural language documents, and a graph parser 12, which is adapted to read documents from the document store 10A and to convert part or all of the contents of the documents into graph format. The converted graphs are stored in a graph store 10B.

    [0074] The system comprises a neural network trainer unit 14, which receives as training data a set of parsed graphs from the graph store, as well as some information about their relations to each other, which is used to form a training sample set for supervised machine learning. In this case, there is provided a document reference data store 10C, including e.g. citation data and/or novelty search results regarding the documents. The trainer unit 14 runs a graph-based neural network algorithm that is trained using the training samples to form a neural network model suitable for embedding graphs into vector form by a graph embedder 15. The graphs from the graph store 10B are embedded into a vector index 16B, to constitute the searchable vector space.

    [0075] The search engine 16A is capable of finding the nearest neighbour vectors in the vector index 16B for a given search query vector. The search query, which may be a document or a graph, obtained through the user interface 18, is also embedded into vector form by the graph embedder 15 to obtain the query vector. If the user input is in text format, it can first be converted to graph format by the graph parser 12.
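    As an illustration only, a minimal brute-force version of such a nearest-neighbour lookup could look as follows (a sketch under the assumption that similarity is measured by vector angle; a production vector index would typically use an approximate nearest-neighbour structure, and the function name is an assumption):

```python
import numpy as np

def nearest_neighbours(query_vec: np.ndarray, index: np.ndarray, k: int = 10) -> np.ndarray:
    """Return the indices of the k index vectors closest to the query
    vector by cosine similarity, i.e. by smallest vector angle."""
    # Normalize so that the dot product equals the cosine of the angle.
    q = query_vec / np.linalg.norm(query_vec)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    sims = m @ q                   # cosine similarity of each index vector to the query
    return np.argsort(-sims)[:k]   # best k matches first
```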

    [0076] In some embodiments, the embedding is carried out using a graph based neural network model which has been trained using supervised machine learning so as to minimize angles between the vectors of graphs with technically similar content, such as patent claim graphs and patent specification graphs that are known to form novelty bars for the respective claim graphs.

    [0077] The system and embodiments above are described as non-limiting exemplary embodiments. The invention can be used in connection with any nearest neighbour vector based search engine.

    [0078] However, the invention provides particular advantages with supervised machine learning vectorization engines, in particular those with complex input, such as natural language, preferably natural language in graph format, that are trained with human-labelled training samples. An example is a natural language document search system. In these systems, there are usually at least one million searchable documents and/or training samples, and re-training or new vector embedding is very time-consuming.

    [0079] In the following description, a vector based document search engine is used as the primary example.

    Amendment of the Search Results

    [0080] When a user is shown the documents most related to the search query, the user may find some of the results more relevant than others. The user can flag the most relevant results and perform the search again. When the search is performed with the query and the flagged results, a new search query vector is computed by moving the original query vector according to the flagged results, and the new search results are the documents closest to the new query vector.

    [0081] This process may be repeated several times, i.e. the user can, optionally from the updated search results, again flag the most relevant results to specify in more detail the results the user is looking for.

    [0082] Instead of having the flagged results provided by the user, they can also be selected automatically or semi-automatically using other kinds of additional information. As an example, if some kind of document classification is available (for instance a patent classification in the case of a patent search engine), the user can select the class that he is most interested in. All or some of the documents in that class and found in the initial search results are then flagged, and the new query vector is computed by moving the original query vector accordingly.

    [0083] In case the flagged results represent results that are not interesting to the user, the new query vector can be moved further away from the flagged results.

    [0084] It is also possible to flag both desired and undesired results and then move the query vector closer to the desired ones while making sure not to move it too close to the undesired results.

    [0085] Next, different realizations of the search result amendment are discussed.

    Vector Subspace Based Amendment

    [0086] With reference to FIG. 2, in one embodiment the new query vector B is determined in three stages using the original query vector A and the vectors corresponding to the flagged documents (D1, D2, . . . , Dn):

    [0087] 1. The affine vector subspace S spanned by the vectors corresponding to the flagged documents (D1, D2, . . . , Dn) is calculated.

    [0088] 2. The closest vector C to the original query vector in the subspace S is calculated.

    [0089] This vector will be along the line L perpendicular to the subspace S, passing through the original vector A.

    [0090] 3. Finally, the new query vector B is determined by moving the original vector A closer to the subspace S along the line L. The magnitude of movement (“temperature”) can be chosen to be e.g. 1-100%, typically 25-100%, of the distance between the original query vector A and the closest vector C in the subspace S. The magnitude can be predetermined or dynamically adjusted.

    [0091] The subspace S will have n−1 dimensions (if the vectors (D1, D2, . . . , Dn) are linearly independent), where n is the number of flagged results. For instance, if there are two flagged results, the subspace S is the unique line passing through the two vectors, and if three results are flagged, the subspace S is the unique plane spanned by the three vectors. In the special case where there is only one flagged result, the subspace S is just a single vector V.

    [0092] In the exemplary graph-based document search system, where the nodes of the graph represent features of the contents of the documents, the idea is that the subspace S describes the common features of the flagged documents (D1, D2, . . . , Dn). Thus, the new query vector represents the original query in the context of the relevant results flagged by the user.

    [0093] If the document graphs are ordered at least partly e.g. according to meronymity of technical features described in the document, the user can flag documents containing features of particular interest and the search engine can fine-tune the search results to include more documents with similar features.

    [0094] The closest vector C in the subspace S to the original query vector A can be calculated by using for instance the Gram-Schmidt process to find an orthogonal basis of S and then calculating C as the sum of the projections of A on the basis vectors.
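    A minimal NumPy sketch of this calculation, assuming the flagged document vectors are given as the rows of a matrix D (all names are illustrative, not from the document):

```python
import numpy as np

def closest_point_in_affine_subspace(A: np.ndarray, D: np.ndarray) -> np.ndarray:
    """Project the query vector A onto the affine subspace S spanned by the
    flagged vectors D (shape: n_flagged x dim), using D[0] as the origin of S
    and the Gram-Schmidt process on the direction vectors D[i] - D[0]."""
    origin = D[0]
    basis = []
    for d in D[1:]:
        v = d - origin
        for b in basis:
            v = v - (v @ b) * b       # remove components along earlier basis vectors
        norm = np.linalg.norm(v)
        if norm > 1e-12:              # skip linearly dependent directions
            basis.append(v / norm)
    # C is the origin plus the sum of the projections of (A - origin) on the basis.
    rel = A - origin
    return origin + sum((rel @ b) * b for b in basis)
```

    With a single flagged result the basis is empty and the function returns that flagged vector itself, matching the special case described above.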

    [0095] The new query vector B can be calculated by the formula


    B = tC + (1 − t)A

    [0096] Here t is a real number between 0 and 1 (inclusive), called the temperature. Temperature 1 means the new query vector B is equal to the closest vector C, and temperature 0 means that the new query vector is the same as the original vector A. The value 0.5 indicates that the new query vector B is halfway between the original vector A and the closest vector C in the subspace S. In FIG. 2, the value 0.5 is used.

    [0097] The closer the new query vector B resides to the closest vector C, the more the search results will change, as the new search is carried out in the surroundings of the new query vector B. The optimal distance to move the vector depends on the specific search case, especially on the amount of flagged results and the variance of the flagged results. If more flagged results are provided then the subspace S can be considered to be a better estimate of the desired results, and thus a larger temperature can be used. Also if the variance of the flagged results is small then these can be assumed to provide a better estimate of the desired results. This allows for a larger temperature. A good rule of thumb is to start by placing the new query vector B halfway between the original vector A and the closest vector C in the subspace S, i.e. to use temperature 0.5.
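    Continuing the sketch above, the interpolation of the formula B = tC + (1 − t)A is a one-liner; the default temperature 0.5 reflects the rule of thumb mentioned here:

```python
def amended_query_vector(A, C, t: float = 0.5):
    """New query vector B = t*C + (1 - t)*A for a temperature t in [0, 1]."""
    assert 0.0 <= t <= 1.0, "temperature must be between 0 and 1 (inclusive)"
    return t * C + (1.0 - t) * A

# C as computed by closest_point_in_affine_subspace(A, D) above:
# B = amended_query_vector(A, C, t=0.5)   # halfway between A and C, as in FIG. 2
```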

    [0098] The vector subspace based amendment is efficient as a fine-tuning method in high-dimensional natural language embedded vector spaces, as it allows for finding technical and/or semantic similarities very efficiently. Also, a single incorrect flagged result does not affect the results adversely as much as e.g. in vector averaging based methods.

    Distance Function Change Based Amendment

    [0099] Another way of finding results that are more like the flagged results is to keep the query vector in the same place and instead modify the way the distance is calculated between the different embeddings. In one embodiment, the new search results are determined as follows:

    [0100] 1. The ratio of change R_i in standard deviation for each dimension i between the set of vectors of the flagged documents (D1, D2, . . . , Dn) and the set of full search result vectors is calculated. This gives information on how different the flagged results are from the full result set.

    [0101] 2. The original distance function is modified by dividing the distance for each dimension by the ratio of change R_i for that dimension (calculated in step 1).

    [0102] 3. The new search results are generated by searching for the nearest neighbors to the original query vector according to the new distance function created in step 2.

    [0103] The result of the distance function modification is that the distance along some of the dimensions in the vector space is weighed less than along other dimensions. The idea is that if the flagged results have a large variance in some dimension N1 but a small one in another dimension N2, it is beneficial to look further away in dimension N1 than in dimension N2, since there one may find other documents more like the flagged ones. In effect, one looks for the neighbors nearest to the query vector inside a multidimensional ellipsoid instead of inside a multidimensional sphere.

    [0104] In step 1, the ratio of change R_i for each dimension can be modified by a temperature exponent t, where t is a real number larger than 0. A larger temperature results in a more drastic change to the distance function.

    [0105] In case the Euclidean distance function is used, the new distance function created in step 2 above looks as follows:

    d(x, y) = sqrt( Σ_{i=1..n} ( (x_i − y_i) / R_i )^2 ),

    where R_i is the ratio of change in standard deviation for the dimension i and n is the dimension of the vector embeddings.
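    A sketch of steps 1 and 2 in NumPy, assuming the flagged and full search result vectors are given as row matrices (the epsilon guard against zero standard deviations is an implementation assumption, not from the document):

```python
import numpy as np

def ratio_of_change(flagged: np.ndarray, full: np.ndarray,
                    t: float = 1.0, eps: float = 1e-12) -> np.ndarray:
    """Dimension-wise ratio R_i of the standard deviation of the flagged
    result vectors to that of the full result set, raised to a temperature
    exponent t > 0 (t = 1 leaves the ratios unmodified)."""
    R = (flagged.std(axis=0) + eps) / (full.std(axis=0) + eps)
    return R ** t

def modified_euclidean(x: np.ndarray, y: np.ndarray, R: np.ndarray) -> float:
    """Ellipsoidal distance: the difference in each dimension is divided by R_i."""
    return float(np.sqrt(np.sum(((x - y) / R) ** 2)))
```

    A dimension in which the flagged results vary more than the full result set gets R_i > 1, so distances along it shrink and the search looks further away in that dimension, as intended.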

    [0106] In summary, the distance function based method comprises the steps of creating a query vector using the search query, performing an initial search with the created query vector, flagging the most relevant results from the initial search, modifying the search space distance function using the flagged relevant results and performing a new search with the query vector using the new (modified) distance function. In one embodiment the new distance function is created by dividing the distance for each dimension by the ratio of change in the standard deviation between the flagged relevant results and the full search results for that dimension.
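    Reusing the two helpers from the previous sketch, the summarized method could be wired together as follows (again only a sketch; the brute-force search and all names are assumptions):

```python
import numpy as np

def search_with_feedback(query_vec, index, flagged_ids, k=100):
    """End-to-end distance-function-based relevance feedback: spherical
    initial search, R_i from the flagged hits, ellipsoidal re-search."""
    # Initial search with the unmodified (spherical) Euclidean distance.
    dists = np.linalg.norm(index - query_vec, axis=1)
    first_hits = np.argsort(dists)[:k]
    # Ratio of change between the flagged hits and the full initial result set.
    R = ratio_of_change(index[flagged_ids], index[first_hits])
    # New search with the modified (ellipsoidal) distance; the query vector is unchanged.
    new_dists = np.array([modified_euclidean(query_vec, v, R) for v in index])
    return np.argsort(new_dists)[:k]
```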

    [0107] Instead of an ellipsoid-type distance function, it can also be another type of anisotropic distance function, in contrast to the isotropic, spherical distance function that is typically used for the initial search.

    Applications in a Graph Based Document Search System

    [0108] Next, a tree-form graph structure, applicable in particular to a patent search system, is described with reference to FIGS. 3A and 3B.

    [0109] FIG. 3A shows a tree-form graph with only meronym relations as edge relations. Text units A-D are arranged as linearly recursive nodes 30, 32, 34, 36 of the graph, stemming from the root node 30, and text unit E is arranged as a child node 38 of node 32, as derived from the block of natural language shown. Herein, the meronym relations are detected from the meronym/holonym expressions “comprises”, “having”, “is contained in” and “includes”.

    [0110] FIG. 3B shows another tree-form graph with two different edge relations, in this example meronym relations (first relation) and hyponym relations (second relation). Text units A-C are arranged as linearly recursive nodes 30, 32, 34 with meronym relation. Text unit D is arranged as a child node 36 of parent node 34 with hyponym relation. Text unit E is arranged as a child node 34 of parent node 32 with hyponym relation. Text unit F is arranged as a child node 38 of node 34 with meronym relation. Herein, the meronym and hyponym relations are detected from the meronym/holonym expressions “comprises”, “having” and hyponym/hypernym expressions “such as” and “is for example”.

    [0111] According to one embodiment, the graph conversion subsystem is adapted to convert the blocks to graphs by first identifying from the blocks a first set of natural language tokens (e.g. nouns and noun chunks) and a second set of natural language tokens (e.g. meronym and holonym expressions) different from the first set of natural language tokens. Then, a matcher is executed utilizing the first set of tokens and the second set of tokens for forming matched pairs of first set tokens (e.g. “body” and “member” from “body comprises member”). Finally, the first set of tokens is arranged as nodes of said graphs utilizing said matched pairs (e.g. “body”-(meronym edge)-“member”).
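    In miniature, and with hypothetical token labels (the tag names and the rigid three-token pattern are simplifications; a real matcher would use the dependency parse as in FIG. 4), such a matcher could look like this:

```python
def match_meronym_pairs(tokens):
    """Scan a labelled token stream for the pattern
    NOUN - meronym/holonym expression - NOUN and emit (holonym, meronym)
    pairs, e.g. ("body", "member") from "body comprises member"."""
    pairs = []
    for i in range(len(tokens) - 2):
        (w1, t1), (w2, t2), (w3, t3) = tokens[i], tokens[i + 1], tokens[i + 2]
        if t1 == "NOUN" and t2 == "MERONYM_EXPR" and t3 == "NOUN":
            pairs.append((w1, w3))    # w1 has w3 as a part
    return pairs

# match_meronym_pairs([("body", "NOUN"), ("comprises", "MERONYM_EXPR"),
#                      ("member", "NOUN")])  ->  [("body", "member")]
```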

    [0112] In one embodiment, at least meronym edges are used in the graphs, whereby the respective nodes contain natural language units having a meronym relation with respect to each other, as derived from said blocks.

    [0113] In one embodiment, hyponym edges are used in the graph, whereby the respective nodes contain natural language units having a hyponym relation with respect to each other, as derived from the blocks of natural language.

    [0114] In one embodiment, edges are used in the graph, at least one of the respective nodes of which contains a reference to one or more nodes in the same graph and additionally at least one natural language unit derived from the respective block of natural language (e.g. “is below” [node id: X]). This way, graph space is saved and a simple, e.g. tree-form, graph structure can be maintained, while still allowing expressive data content in the graphs.

    [0115] In some embodiments, the graphs are tree-form graphs, whose node values contain words or multi-word chunks derived from said blocks of natural language, typically utilizing parts-of-speech and syntactic dependencies of the words by the graph converting unit, or vectorized forms thereof.

    [0116] FIG. 4 shows in detail an example of how the text-to-graph conversion can be carried out in the graph conversion subsystem. First, the text is read in step 41, and a first set of natural language tokens, such as nouns, and a second set of natural language tokens, such as tokens indicating meronymity or holonymity (like “comprising”), are detected from the text. This can be carried out by tokenizing the text in step 42, part-of-speech (POS) tagging the tokens in step 43, and deriving their syntactic dependencies in step 44. Using that data, the noun chunks can be determined in step 45 and the meronym and holonym expressions in step 46. In step 47, matched pairs of noun chunks are formed utilizing the meronym and holonym expressions. The noun chunk pairs form, or can be used to infer, the meronym relation edges of a graph.

    [0117] In one embodiment, as shown in step 48, the noun chunk pairs are arranged as tree-form graphs, in which the meronyms are children of the corresponding holonyms. The graphs can be saved in step 49 in the graph store for further use, as discussed above.
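    A compressed sketch of steps 41-48, here assuming spaCy for the tokenizing, POS tagging, dependency parsing and noun chunking steps (the marker word list and the subject/object pairing rule are illustrative assumptions, not the document's algorithm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")                             # steps 42-44: tokens, POS, dependencies
MERONYM_MARKERS = {"comprise", "include", "have", "contain"}   # step 46, illustrative list

def text_to_meronym_pairs(text: str):
    """Pair each meronym marker's subject noun chunk (holonym)
    with its object noun chunk (meronym), as in step 47."""
    doc = nlp(text)
    # Map each token to the noun chunk containing it (step 45).
    chunk_of = {tok.i: chunk.text for chunk in doc.noun_chunks for tok in chunk}
    pairs = []
    for tok in doc:
        if tok.pos_ == "VERB" and tok.lemma_ in MERONYM_MARKERS:
            subjects = [c for c in tok.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in tok.children if c.dep_ in ("dobj", "obj")]
            pairs += [(chunk_of[s.i], chunk_of[o.i]) for s in subjects for o in objects
                      if s.i in chunk_of and o.i in chunk_of]
    return pairs

# text_to_meronym_pairs("The body comprises a member.") -> [("The body", "a member")]
```

    The resulting (holonym, meronym) pairs can then be arranged into a tree as in step 48 by making each meronym a child node of its holonym.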

    [0118] In one embodiment, the graph-forming step involves the use of a probabilistic graphical model (PGM), such as a Bayesian network, for inferring a preferred graph structure. For example, different edge probabilities of the graph can be computed according to a Bayesian model, after which the likeliest graph form is computed using the edge probabilities.

    [0119] In one embodiment, the graph-forming step comprises feeding the text, typically in tokenized, POS tagged, dependency parsed and/or noun chunked form, into a neural network based technical parser, which extracts the desired edge relations of the chunks, such as meronym relations and/or hyponym relations.

    [0120] In one embodiment, the graph is a tree-form graph comprising edge relations arranged recursively according to a tree data schema, being acyclic. This allows for efficient tree-based neural network models of the recurrent or non-recurrent type to be used. An example is the Tree-LSTM model.

    [0121] In another embodiment, the graph is a network graph allowing cycles, i.e. edges between branches. This has the benefit of allowing complex edge relations to be expressed.

    [0122] FIG. 5 shows an example of training the neural network, in particular for patent search purposes.

    [0123] For a generic document search engine case, the term “patent document” can be replaced with “document” (with unique computer-readable identifier among other documents in the system). “Claim” can be replaced with “first computer-identifiable block” and “specification” with “second computer-identifiable block at least partially different from the first block”.

    [0124] In the embodiment of FIG. 5, a plurality of claim graphs 51A and corresponding close prior art specification graphs 52A for each claim graph, as related by the reference data, are used by the neural network trainer 54A as the training data. These form positive training cases, indicating that a low vector angle between such graphs is to be achieved.

    [0125] In addition, negative training cases, i.e. one or more distant prior art graphs for each claim graph, can be used as part of the training data. A high vector angle between such graphs is to be achieved. The negative training cases can be e.g. picked randomly from the full set of graphs.

    [0126] According to one embodiment, in at least one phase of the training, as carried out by the neural network trainer 54A, a plurality of negative training cases are selected from a subset of all possible training cases which are harder than the average of all possible negative training cases. For example, the hard negative training cases can be selected such that both the claim graph and the description graph are from the same patent class (up to a predetermined classification level) or such that the neural network has previously been unable to correctly classify the description graph as a negative case (with predetermined confidence).
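    A sketch of such training in PyTorch, assuming a `model` that embeds batches of graphs into vectors; the cosine-based loss, the margin, and the hard-negative threshold are assumptions, since the document only specifies that low vector angles are to be achieved for positive pairs and high angles for negative pairs:

```python
import torch
import torch.nn.functional as F

def contrastive_step(model, claim_graphs, pos_spec_graphs, neg_spec_graphs,
                     optimizer, margin: float = 0.5):
    """One training step: pull claim/close-prior-art pairs together (low
    vector angle) and push negative pairs apart (high vector angle)."""
    optimizer.zero_grad()
    c = model(claim_graphs)        # (batch, dim) claim graph embeddings
    p = model(pos_spec_graphs)     # matching close prior art specification graphs
    n = model(neg_spec_graphs)     # distant prior art specification graphs
    pos_loss = (1.0 - F.cosine_similarity(c, p)).mean()
    neg_loss = F.relu(F.cosine_similarity(c, n) - margin).mean()
    loss = pos_loss + neg_loss
    loss.backward()
    optimizer.step()
    return loss.item()

def hard_negative_mask(model, claim_graphs, candidate_graphs, threshold: float = 0.0):
    """Mark candidates the current model still scores as too similar to the
    claim, i.e. negatives it cannot yet classify correctly (with confidence)."""
    with torch.no_grad():
        sims = F.cosine_similarity(model(claim_graphs), model(candidate_graphs))
    return sims > threshold
```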

    [0127] According to one embodiment, which can also be implemented independently of the other method and system parts described herein, training of the present neural network-based patent search or novelty evaluation system is carried out by providing a plurality of patent documents each having a computer-identifiable claim block and specification block, the specification block including at least part of the description of the patent document. The method also comprises providing a neural network model and training the neural network model using a training data set comprising data from said patent documents for forming a trained neural network model. The training comprises using pairs of claim blocks and specification blocks originating from the same patent document as training samples of said training data set.

    [0128] Typically, these intra-document positive training samples form a fraction, such as 1-25%, of all training samples of the training, the rest containing e.g. search report (examiner novelty citation) training samples.
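    For example, the mixing of intra-document (claim, specification) positives with search report citation pairs could be sketched as follows (the 15% default is just one point inside the 1-25% range mentioned above; all names are illustrative):

```python
import random

def build_training_pairs(citation_pairs, intra_doc_pairs, intra_fraction: float = 0.15):
    """Return citation pairs plus enough intra-document pairs that the
    latter form roughly `intra_fraction` of all training samples."""
    n_intra = int(len(citation_pairs) * intra_fraction / (1.0 - intra_fraction))
    sampled = random.sample(intra_doc_pairs, min(n_intra, len(intra_doc_pairs)))
    return citation_pairs + sampled
```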

    [0129] Vectors obtained from natural language (e.g. patent) documents via the graph conversion and using a supervised neural network model as discussed above form a complex high-dimensional data set. In such sets, the dimensions of the vector space encode (technical) information which the presently described relevance feedback methods can maximally utilize for fast search result adaptation. It should, however, be noted that although described herein as part of a natural language document search system, the present approach can also be used independently of such a system and generally in nearest neighbor based vector search engines.