Content sensitive document ranking method by analyzing the citation contexts

Abstract

This invention relates to a method which provides, showing as well the relevant documents to the user even if the said documents that are closely related to the subject do not contain the keywords that are entered for search in their content, by analyzing citation contexts of every document in a data pool containing documents that are citing or linking to at least one document. In an alternative embodiment of this method, in the case when the documents cited by using the entered keywords are cited by other documents using other keywords, these said other keywords are considered as similar terms and the search is conducted as well by including these said similar terms.

Claims

1. A citation context sensitive search method for accessing to related documents by chosen keywords, the method comprising: extracting citation context of each document, that cites at least one document, in a database containing documents, the citation context is determined as a text in a window of predetermined size around a citation marker in a citing document, wherein the citation context is determined as a text around the citation marker inside the window; identifying meaningful keywords from the citation context of the citing document for each cited document; forming a citation network, wherein the citing document and the cited documents form nodes of the citation network, a first node corresponding to the citing document is connected to all the nodes corresponding to the cited documents forming edges of the citation network, wherein each edge of the citation network is labelled with an inferred keyword, the inferred keyword is an identified meaningful keyword from the citation context of the citing document corresponding to the cited document forming the edge; constructing a table T comprising sets of the citing document, the cited document, and the inferred keywords in the citation, receiving keywords to initiate a search, identifying citing and cited documents by searching the keywords in the table T, forming a related document pool by adding the citing and the cited documents identified in the table T and related citation networks for the inferred keywords corresponding to the keywords, ranking the documents of the related document pool by using a graph based ranking algorithm running on the union of the related citation networks.

2. The method of claim 1, wherein in the step of identifying meaningful keywords from the citation context of each document, every term in the citation context is the inferred keyword.

3. The method of claim 1, wherein in the step of identifying meaningful keywords from the citation context of each document, every term in the citation context is used as a definitive term for articles in the citation context.

4. The method of claim 1, wherein in the step of identifying meaningful keywords from the citation context of each document, the meaningful keywords are single words, or bigrams.

5. The method of claim 1, wherein in the step of identifying citing and cited documents by searching the keywords in the table T, the keywords are searched in the table T constructed by using the inferred keywords of the citation network.

6. The method of claim 1, wherein in the step of identifying citing and cited documents by searching the keywords in a table T, for a keyword for the search, looking up the keyword in the table T and identifying citing and cited documents corresponding to the keyword and constructing a term- specific citation network, wherein the citation and cited documents corresponding to the keyword form nodes of the term- specific citation network and edges from the citing documents to the cited documents are labelled with term .

7. The method of claim 1, wherein in the step of ranking the documents of the related document pool by using a graph based ranking algorithm running on the union of the related citation networks, a simple ranking module which takes keywords, constructs the union of term labeled citation networks for each keyword and the graph based ranking algorithm is run on the union of the citation networks to generate a ranked list of the related documents.

8. A method for finding documents which are closely related to a subject, by using other keywords along with chosen keywords, in the case where a same document is referred to by multiple documents and by multiple keywords, the method comprising: extracting citation context of each document, that cites at least one document, in a database containing documents, the citation context is determined as a text in a window of predetermined size around a citation marker in a citing document, wherein the citation context is determined as a text around the citation marker inside the window; identifying meaningful keywords from the citation context of the citing document for each cited document; forming a citation network, wherein the citing document and the cited documents form nodes of the citation network, a first node corresponding to the citing document is connected to all the nodes corresponding to the cited documents forming edges of the citation network, wherein each edge of the citation network is labelled with an inferred keyword, the inferred keyword is an identified meaningful keyword from the citation context of the citing document corresponding to the cited document forming the edge; constructing a table T comprising sets of the citing document, the cited document, and the inferred keywords in the citation, receiving keywords from a user to initiate a search, inferring terms which are similar to the keywords, identifying the citing and cited documents in the table T by searching the keywords and the inferred similar terms in the table T, forming a relating document pool for all results in the table T corresponding to both the keywords and the inferred similar terms, ranking the documents of the relating document pool by using a graph based ranking algorithm running on citation networks of the set of the keywords and the similar terms.

9. The method of claim 8, wherein in the step of inferring terms which are similar to the keywords, the terms which are similar to the keywords are terms that are frequently used together with the keywords in the citations.

10. The method of claim 8, wherein in the step of inferring terms which are similar to the keywords, as a measure towards a power linear correlation between two sample terms, Pearson Correlation Coefficient is used.

11. The method, of claim 8, wherein the keywords in the step identifying the citing and cited documents in the table T by searching the keywords and the inferred similar terms in the table T are searched in the table T documents that cite using the keywords and the similar terms and the documents that are cited using the keywords and similar terms are identified.

12. The method of claim 8, wherein after forming the citation network of the similar terms for a given term in the step identifying the citing and cited documents in the table T by searching the keywords and the inferred similar terms in the table T, for a keyword or the similar terms for the search, looking up the keyword in the table T and identifying citing and cited documents corresponding to the keyword and constructing a term- specific citation network, wherein the citation and cited documents corresponding to the keyword form nodes of the term- specific citation network and edges from the citing documents to the cited documents are labelled with term ; and then using the term- specific citation network for graph ranking.

13. The method of claim 9, wherein in the step of inferring terms which are similar to the keywords, for finding distinguishing terms which are used for defining smaller article sets that are present as similar terms, in order to lower a term weight of a term by a factor increasing with a frequency of appearance of the term in the citation contexts, a term frequency-inverse document frequency technique is used.

14. The method of claim 8, wherein in the step of ranking the documents of the relating document pool by using a graph based ranking algorithm running on citation networks of the set of the keywords and the similar terms, a simple ranking module which takes keywords, constructs the union of term labeled citation networks for each keyword and similar terms and the graph based ranking algorithm is run on the union of the citation networks to generate a ranked list of the related documents.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) Methods in order to fulfill the objects of the present invention are illustrated in the attached figures, where:

(2) FIG. 1. Citation network of the exemplary pseudo-articles

(3) FIG. 2. Citation Networks

(4) FIG. 2a. Term labeled citation network G(A,C).

(5) FIG. 2b. Term-1 specific citation network G.sub.1(A, C.sub.1)

(6) FIG. 2c. Term-4 specific citation network G.sub.4(A, C.sub.4)

(7) FIG. 2d. Like term citation network. G.sub.S1(A, C.sub.S1) for set of like terms S.sub.1={.sub.1, .sub.4}

(8) FIG. 2e. Bipartite graph of documents versus terms

(9) FIG. 2f. Matrix F of terms versus documents

(10) FIG. 3. Intersecting terms

(11) FIG. 4. Steps of the context sensitive search method.

(12) FIG. 5. Steps of the method of searching with like terms.

ELEMENTS SHOWN IN THE FIGURES ARE NUMBERED AS FOLLOWS

(13) 100. Context sensitive search method 110. Separating the citation contexts 120. Identifying the meaningful keywords/word groups from the citation contexts 130. Forming the citation network, as a directed graph, from the citing article to the cited articles 140. Writing the keywords, citing and cited documents in a table T 150. Entering the keywords/word groups to start a search 160. Searching the keywords word groups in the table T 170. Taking the cited documents corresponding to the keywords in a pool of related documents 180. Ranking using a ranking algorithm

(14) Method of searching with like terms 210. Entering the keywords/word groups 220. Inferring the terms which are similar to the entered keywords 230. Conducting a search in the table T for the first keywords/word groups entered. 240. Conducting a search in the table T for the inferred similar terms 250. Forming a relating document pool for all results in the table T corresponding to the entered keywords and similar terms 260. Ranking the documents taken from the document pool using a ranking algorithm

DETAILED DESCRIPTION OF THE INVENTION

(15) Citation context sensitive search method (100), which provides access to the related documents by the chosen keywords, essentially comprises the steps of;

(16) extracting the citation contexts of each document, that cites at least one document, in a database containing the documents (110),

(17) identifying meaningful keywords/word groups from the citation context of each document (120),

(18) forming the citation network, an edge-labeled directed graph, where nodes are the documents. There is an edge from the citing article to the cited article and the edge is labeled by the inferred keywords (130),

(19) constructing a table T (see Table 1) which contains the citing document, the cited document, and the inferred keywords in the citation (140),

(20) entering the keywords/word groups to initiate a search (150),

(21) identifying citing and cited documents by searching the entered keywords in the table T (160),

(22) forming a relating document pool by adding the cited documents identified in (160), (170),

(23) ranking the documents taken to the related document pool by using any ranking algorithm (180).

(24) In the subject matter of citation context sensitive searching method (100), in order to provide access to the related documents via selected keywords, firstly, a table T consisting of keywords used for citing, along with citing and cited documents based on the keywords that are already present in the citation context, needs to be constructed. Once the table T is formed, all searches are conducted through this table T.

(25) In order to form the table T, first of all, the citation context of every document, that cite another document, is extracted (110), and then meaningful keywords/word groups (terms) are inferred from the citation context (120).

(26) The best articles are cited by numerous articles with relevant terms in the citation context. This, in turn, shows that the cited document is relevant to the subject. For this, in the method subject to the invention, a simple method is followed for determining/defining the citation terms in the citation context. The terms used for explaining the cited article stand close to the citation point. As also shown in FIG. 1, citation context is the text around the citation marker that is present inside the window, the size of which is predefined. The size of this text can be defined by a specific sentence, word or character number around the citation point. As a result of this, as present in the previous studies, the method subject to the invention also takes the citation context as a window with a fixed size, for example 400 characters in length, around the citation point.

(27) Once citation context is obtained, the next step is the identification of meaningful keywords/word groups (terms) from the citation context of each document (120). A word or a word group, which states something or some concept in a specific field is referred to as term. Every bigram, that is present in citation context, is used as a definitive term for the cited article. For example, if three articles are cited in the same citation context, every bigram it this content is taken as definitive terms for the three articles. The number of cords, taken as a term, is one or two but it can be expanded for using n-grams of any n. However, bigrams are preferred since they are the most used n-grams for explaining specific terms such as scale free, map reduce, and preferential attachment. While this situation helps inferring meaningful terms from the citation context, it eliminates problems such as synonymous words in the case of single words.

(28) After inferring meaningful terms from the citation context of each document (120), a citation network is formed (130). A citation network is an edge-labeled directed graph, where nodes are the documents. There is a directed edge from the citing article to the cited article. The edge is labeled by the inferred keywords. An edge in a citation network carries a lot more information than merely a binary relationship. The terms that the citing author used for explaining the cited document could be taken from the citation context.

(29) In FIG. 1, an example of a small citation network, formed by six pseudo-articles is given. The citation contexts are underlined, cited articles indicated in brackets, such as [2], together with the terms taken from the citation context are emphasized by bold font, such as term-1.

(30) For example, the corresponding term labeled citation network of FIG. 1 is given in FIG. 2a, where the set of vertices is A={a.sub.1, a.sub.2, . . . , a.sub.|A|}, and T={.sub.1, .sub.2, . . . , .sub.|T|} is the set of terms. The directed edge from a.sub.1 to a.sub.3 is labeled with terms in set T.sub.1,3={.sub.1, .sub.4, .sub.5}. The edge a.sub.1, a.sub.2 is not there since there is no such citation, that is T.sub.1,2 is the empty set. In other words, edges are labeled with the terms used in the citation context of the citing document while citing. The conventions for A and T are given below.

(31) A={a.sub.1, a.sub.2, . . . , a.sub.|A|} is the set of all documents. Lower case letters of the Latin alphabet such as i, jA are used to denote the members of A.

(32) T={.sub.1, .sub.2, . . . , .sub.|T|} is the set of all terms used in all documents in A. Letters of the Greek alphabet such as , T are used to denote the members of T.

(33) The edge from i to j is labeled by the terms in T.sub.ij. Set T.sub.ijT is the set of all terms, which are in at least one citation context of article i citing article j. If no citation is made from article i to article j, then T.sub.ij is the empty set. In the situations where article i cites article j, however there are no terms inside the citation context, then T.sub.ij again the empty set. It is also possible that the article i cites article j more than once and a term might be presented in each of these citations. Having used at least once is enough for the term to be in T.sub.ij.

(34) More formally, the term labeled citation network G(A,C), shown in FIG. 2a, is a directed graph by CAA where the edge is (i, j)C, if and only if article i contains at least on citation referring to article j. The edge (i, j)C is labeled with all the terms inside T.sub.ij.

(35) In the exemplary term labeled citation network G(A,C) given in FIG. 2a, the directed edges used for forming citation network between documents a.sub.1, a.sub.2, a.sub.3, a.sub.4, a.sub.5 and a.sub.6 are labeled with the terms .sub.1, .sub.2, .sub.3, .sub.4 and .sub.5. Document a.sub.1 cites document a.sub.3. In the citation context it terms uses .sub.1, .sub.4 and 5 In the citation context for a.sub.4, there are terms .sub.1 and .sub.2. Similarly, document a.sub.2 uses terms .sub.1 for a.sub.4 and .sub.1 and .sub.3 for a.sub.6.

(36) After constructing the citation network (130), table T constructed T (140). For every citing and cited document pair, there is a row. The terms used in the citation context are inserted into the corresponding row. These terms, though they change according to the content of the cited document, in one example of the invention are scale free, preferential attachment, map reduce. The directed edges used between the citing documents and cited documents are labeled with the terms used in the citation context. An exemplary table T formed for FIG. 2a is shown below in the exemplary table T shown below, .sub.1, .sub.2, .sub.3, .sub.4 and .sub.5 each represent a term.

(37) TABLE-US-00001 TABLE 1 The table T which is formed for the G(A, C) term labeled citation network given in FIG. 2a. Contains the citing and cited documents and the terms used in the citation context. Cited Words used Citing documents in citation context documents Document a.sub.3 .sub.1, .sub.4, .sub.5 Document a.sub.1 Document a.sub.4 .sub.1, .sub.2 Document a.sub.1 Document a.sub.6 .sub.2 Document a.sub.1 Document a.sub.4 .sub.1 Document a.sub.2 Document a.sub.6 .sub.1, .sub.3 Document a.sub.2 Document a.sub.5 .sub.4, .sub.5 Document a.sub.3 Document a.sub.6 .sub.3 Document a.sub.4

(38) In the inventive citation context sensitive searching method (100), after forming the table T containing citing and cited documents and relating terms, a content sensitive search can be initiated.

(39) In the method, in order to initiate a search, first of all keywords/word groups/terms of the subject to be searched are entered (150). In the preferred embodiment of the invention, the entered terms are searched in a table T (160).

(40) In another embodiment of the invention, for the terms entered for search (for example for ) a term- specific citation network is formed and after that, citing documents, cited documents and entered in this citation network are written in a table and citing documents corresponding to this value are determined (160). In an exemplary embodiment of the invention, an example of the term- specific citation network for .sub.1 and .sub.4 as entered terms, are shown in FIG. 2b and FIG. 2c respectively.

(41) Suppose .sub.1 is the term/keyword for the search. Term .sub.1 is searched inside the table T. All the documents corresponding to .sub.1 in table T are selected and considered as the related document pool (170). Thereby, not only the documents that contain .sub.1, documents related to the subject but do not contain .sub.1 are also selected. Hence access to all documents closely relating to the subject .sub.1 are provided.

(42) After gathering all documents relation to .sub.1 in a document pool, the documents taken from the document pool can be ranked by using any ranking algorithm (180).

(43) In one embodiment of the invention, for ranking the documents related to the subject (180), a simple ranking module, which takes a bigram and gives a ranked list of scientific articles in return is used.

(44) So far search related to term is considered (100). Another method (200), which provides access to the relevant documents by using similar terms, is explained below. Here not only documents gathered in (100), but also documents related to the terms that are similar to the entered terms are considered. So that documents, which are closely related to the subject but which do not contain the keywords can also be reached.

(45) A searching method with similar terms (200), which enables finding the documents, which are closely related to the subject, by using other keywords along with the chosen keywords, reuses steps (110) trough (140) and essentially comprises the steps of;

(46) extracting the citation contexts of each document, that cites at least one document, in a database containing the documents (110),

(47) identifying meaningful keywords/word groups from the citation context of each document (120),

(48) forming the citation network, an edge-labeled directed graph, where nodes are the documents. There is an edge from the citing article to the cited article and the edge is labeled by the inferred keywords (130),

(49) constructing a table T (see Table 1) which contains the citing document, the cited document, and the inferred keywords in the citation (140),

(50) entering the keywords/word groups to initiate a search (210),

(51) inferring the terms which are similar to the entered keywords (220),

(52) as in the case of (160) identifying citing and cited documents in the table T by searching the entered keywords (230),

(53) as in the case (160), identifying citing and cited documents in the table T by searching the inferred similar terms (240),

(54) forming a relating document pool for all results in the table T corresponding to both the entered keywords and the similar terms (250),

(55) as in the case of (180) ranking the documents taken from the document pool using any ranking algorithm (260).

(56) One of the main approaches of the inventive searching method with similar terms (200) is using both the entered and the similar terms in the process of network forming. This helps to expand the selected document set to include documents related to the similar terms.

(57) In scientific publications, one term generally is not sufficient to explain a subject by itself and usage of only a single term is prone to noises because of the natural usage of the language such as synonymous words. For every term, there is a set of articles that contain it. In FIG. 3, as can be seen related to the terms law of force, scale free and preferential attachment, these article sets substantially coincide for some terms.

(58) In the searching method with similar terms (200), table T is constructed by means of sequence of (110) through (140). After that, in order to initialize the searching process keywords/word groups/terms are entered (210) and the terms that are similar to the entered keywords are inferred (220). In principle two terms are similar if they appear together in a considerable number of citation. Given a term, inferring similar terms requires some tools.

(59) In order to infer the terms which are similar to the entered keywords (220), first a term-article matrix is formed. Similar terms for the given term , is the set of terms which have the article scope which substantially coincides with the article scope of the term .

(60) Term frequencies are related to the articles by a document matrix F=[f.sub.j] which has a size of |T||A|. In this matrix, the entry f.sub.j is the count regarding how many articles use the term in the citation context that cites article j. That is, f.sub.j is the in-degree of article j in term- graph G.sub.. Therefore F is actually the adjacency matrix of the weighted bipartite graph between the article nodes and term nodes.

(61) An example the bipartite graph given in FIG. 2e is obtained from term labeled citation network as in FIG. 2a. The corresponding related term document matrix F is shown in FIG. 2f.

(62) In the inventive searching method with similar terms (200), there are distinguishing terms which are used for especially defining smaller article sets are present as similar terms. Simple term frequency has a problem of assuming every one of the terms to have the same importance, however some terms have very little or no distinguishing power. For example, it is possible for almost the entire citation context of an article collection on the topic cancer to contain the term cancer. For this, the weights of the terms, which are present in numerous citation contexts, are lowered. In principle, the idea is reducing term frequency weight of a term by a factor that grows with its citation context frequency it appears. Term frequency-inverse document frequency (tf-idf) is a technique which is based on this idea. This method is widely used in information retrieval and text mining and it reflects how important a word is to a document in a collection. For this reason, in the inventive method (200), this technique is used for weighting the term frequencies.

(63) The inverse document frequency for term is defined by g(),

(64) $g () = \log \frac{.Math. A .Math.}{{.Math.}_{j = 1}^{.Math. A .Math.} sgn (f_{j})}$

(65) where sgn(x) is a signum function designed as

(66) $sgn (x) = {\begin{matrix} 1, & x > 0, \\ 0, & x = 0, \\ - 1, & x < 0. \end{matrix}$

(67) Afterwards, let us assume that D=[d.sub.] is a |T||T| diagonal matrix defined with below:

(68) $d_{} = {\begin{matrix} g (), & =, \\ 0, & . \end{matrix}$

(69) We define the weighted term document Matrix N=[n.sub.] of size |T||A| with N=DF.

(70) Afterwards, a relationship between the terms is established. and are assumed, to be the .sup.th and .sup.th row vectors of N respectively, and the members and show the related weighted term frequencies of and for the articles inside the data set. In order to learn how much of the scopes of the articles of these terms coincide, comparison of the corresponding row vectors of and is realized. For this, in a preferred embodiment of the invention, as a measure towards the power linear correlation between two sample terms. Pearson Correlation Coefficient, which is widely used in the field of science, is used.

(71) Afterwards, a Sample Pearson Correlation Matrix P=[p.sub.] of size |T||T| is defined and P.sub.

(72) $p_{} = \frac{{.Math.}_{i = 1}^{.Math. A .Math.} (_{i} - \overline{}) (_{i} - \overline{})}{\sqrt{{.Math.}_{i = 1}^{.Math. A .Math.} {(_{i} - \overline{})}^{2}} \sqrt{{.Math.}_{i = 1}^{.Math. A .Math.} {(_{i} - \overline{})}^{2}}}$

(73) is the Sample Pearson Correlation between term and where and are the .sup.th .sup.th row vectors of N. The vector is the average of the entrance of the vector .

(74) The sample Pearson Correlation Coefficient is the measure of the linear correlation between two samples X and Y, and it can give a value between 1 and 1 (including 1 and 1). A value of 1 means that a linear equation defines the relationship between X and Y, and all data points are located on a line where Y increases with increasing X. A value of 1 means that all data points are located on a line where Y decreases with increasing X. This case is irrelevant for our data set, because, in order to take the value 1 for two terms and , they need to be complementary to each other. This is not possible for large article collections. The value 0 means that there is no linear correlation present between the samples.

(75) For a given term , the similar term set S.sub. is defined as S.sub.={T|p.sub.>} for some value 0<<1. is the cross validation parameter and the value of changes between topics. Additionally, similarity point p.sub. for term equals to 1. For this reason, since S.sub., S.sub. is not empty.

(76) Weighted citation network which takes term as basis and which is directed from the term set S.sub. is defined as follows:

(77) The sub graphic G.sub.S (A, C.sub.S) of G(A, C) is named as the citation network of the set of similar terms, where (i) C.sub.S=.sub.SC.sub. (ii) the weight of the edge (i, j)C.sub.S equals to the sum of weights of the edges combined w.sub.if=.sub.(ij)T.sub.ij.sub.S.sub.P.sub..

(78) For example from FIG. 2a, suppose S.sub.1={.sub.1, .sub.4} is the similarity set for a given term .sub.1 and . Then, the network of similar terms set G.sub.S1 (A, C.sub.S1) for term .sub.1 is shown in FIG. 2d.

(79) The keywords entered, in the inventive method (200), are searched in table T and related documents are identified (230).

(80) In another embodiment of the invention, the citation network of terms similar to given term is formed in (220). Therefore table T contains the documents related to similar terms to the entered keyword.

(81) In another embodiment of the invention, after forming the citation network of similar terms set is formed for a given term , terms (for example .sub.1 and .sub.4) similar to the entered keyword (), the documents citing using these similar terms in the citation context and cited documents are written in the table T. Thus, cited documents corresponding to the entered keyword () and the terms that are similar to the entered keyword/keywords (.sub.1 and .sub.4), in other words documents closely related to the subject are determined.

(82) Thus, documents closely related to the subject can be determined by means of another search in the table T for the terms similar to the entered (240). Thus, the documents related to the entered keywords, and the documents related to similar terms, that is, closely related to the subject, are determined (250).

(83) After the cited documents corresponding to the entered keyword and the similar terms are collected in a data pool (250).

(84) The related documents, in the sense of both entered keywords and the similar terms, taken into the data pool. Then so selected documents are ranked by using any ranking algorithm (260).

Content sensitive document ranking method by analyzing the citation contexts

Assignee

Inventors

Cpc classification

Classification Explorer

G06F16/243

PHYSICS

Classification Explorer

G06F40/284

PHYSICS

Classification Explorer

G06F16/382

PHYSICS

Classification Explorer

G06F16/24578

PHYSICS

International classification

Classification Explorer

G06F17/30

PHYSICS

Classification Explorer

G06F17/27

PHYSICS

Abstract

Claims

Description