Computer-Implemented System And Method For Identifying Relevant Documents
20170262455 ยท 2017-09-14
Inventors
Cpc classification
Y10S707/99936
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
Y10S707/99932
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
Y10S707/99943
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
G06F16/955
PHYSICS
Y10S707/99945
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
International classification
Abstract
A computer-implemented system and method for identifying relevant documents is provided. A set of documents each associated with one or more concepts is identified. At least a portion of the documents in the set are clustered based on the concepts. A matrix that provides a summary of the documents most relevant to one such concept is generated by determining for each document a measure of similarity between a concept frequency occurrence and concept weights of each cluster. The matrix is populated with the calculated measures of similarity. Those documents associated with a threshold measure of similarity are identified as the most relevant documents to one such concept.
Claims
1. A computer-implemented system for identifying relevant documents, comprising: a database to store a set of documents each associated with one or more concepts; and a server comprising a central processing unit, memory, an input port to receive the document set, and an output port, wherein the central processing unit is configured to: cluster at least a portion of the documents in the set based on the concepts; and generate a matrix that provides a summary of those documents that are most relevant to one such concept by determining for each document a measure of similarity between a concept frequency occurrence and concept weights of each cluster and populating the matrix with the calculated measures of similarity; and identify those documents associated with a threshold measure of similarity as the most relevant documents to one such concept.
2. A system according to claim 1, wherein the central processing unit removes those documents from the matrix with zero values for one or more concept frequency occurrences.
3. A system according to claim 1, wherein the central processing unit calculates the similarity according to the following equation:
4. A system according to claim 1, wherein the central processing unit updates the concept weights of the clusters until the clusters settle.
5. A system according to claim 1, wherein the central processing unit removes duplicate documents from the identified documents that satisfy the threshold similarity.
6. A system according to claim 1, wherein the central processing unit identifies the documents for clustering by creating a histogram for each document that maps a relative frequency of occurrence for at least a portion of terms in that document, creating a corpus graph that maps the concepts for the documents in the set based on the terms, defining predefined thresholds for the corpus graph, and selecting those documents with terms that satisfy the thresholds for clustering.
7. A system according to claim 6, wherein the central processing unit removes from the set, those documents with terms that fail to satisfy the threshold.
8. A system according to claim 6, wherein the thresholds are set at 1% to 15% for shorter documents.
9. A system according to claim 6, wherein the central processing unit sets tighter thresholds for larger documents.
10. A system according to claim 1, wherein the central processing unit extracts terms from the documents in the set and generates a record for each term.
11. A computer-implemented method for identifying relevant documents, comprising: identifying a set of documents each associated with one or more concepts; clustering at least a portion of the documents in the set based on the concepts; generating a matrix that provides a summary of those documents that are most relevant to one such concept, comprising: determining for each document a measure of similarity between a concept frequency occurrence and concept weights of each cluster; and populating the matrix with the calculated measures of similarity; and identifying those documents associated with a threshold measure of similarity as the most relevant documents to one such concept.
12. A method according to claim 11, further comprising: removing those documents from the matrix with zero values for one or more concept frequency occurrences.
13. A method according to claim 11, further comprising: calculating the similarity according to the following equation:
14. A method according to claim 11, further comprising: updating the concept weights of the clusters until the clusters settle.
15. A method according to claim 11, further comprising: removing duplicate documents from the identified documents that satisfy the threshold similarity.
16. A method according to claim 11, further comprising: identifying the documents for clustering, comprising: creating a histogram for each document that maps a relative frequency of occurrence for at least a portion of terms in that document; creating a corpus graph that maps the concepts for the documents in the set based on the terms; defining predefined thresholds for the corpus graph; selecting those documents with terms that satisfy the thresholds for clustering.
17. A method according to claim 16, further comprising: removing from the set, those documents with terms that fail to satisfy the threshold.
18. A method according to claim 16, wherein the thresholds are set at 1% to 15% for shorter documents.
19. A method according to claim 16, further comprising: setting tighter thresholds for larger documents.
20. A method according to claim 11, further comprising: extracting terms from the documents in the set; and generating a record for each term.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0014]
[0015]
[0016]
[0017]
[0018]
[0019]
[0020]
[0021]
[0022]
[0023]
[0024]
[0025]
[0026]
[0027]
DETAILED DESCRIPTION
Glossary
[0028] Keyword: A literal search term which is either present or absent from a document. Keywords are not used in the evaluation of documents as described herein. [0029] Term: A root stem of a single word appearing in the body of at least one document. [0030] Phrase: Two or more words co-occurring in the body of a document. A phrase can include stop words. [0031] Concept: A collection of terms or phrases with common semantic meanings. [0032] Theme: Two or more concepts with a common semantic meaning. [0033] Cluster: All documents for a given concept or theme.
The foregoing terms are used throughout this document and, unless indicated otherwise, are assigned the meanings presented above.
[0034]
[0035] The document analyzer 12 analyzes documents retrieved from a plurality of local sources. The local sources include documents 17 maintained in a storage device 16 coupled to a local server 15 and documents 20 maintained in a storage device 19 coupled to a local client 18. The local server 15 and local client 18 are interconnected to the system 11 over an intranetwork 21. In addition, the document analyzer 12 can identify and retrieve documents from remote sources over an internetwork 22, including the Internet, through a gateway 23 interfaced to the intranetwork 21. The remote sources include documents 26 maintained in a storage device 25 coupled to a remote server 24 and documents 29 maintained in a storage device 28 coupled to a remote client 27.
[0036] The individual documents 17, 20, 26, 29 include all forms and types of unstructured data, including electronic message stores, such as electronic mail (email) folders, word processing documents or Hypertext documents, and could also include graphical or multimedia data. Notwithstanding, the documents could be in the form of structured data, such as stored in a spreadsheet or database. Content mined from these types of documents does not require preprocessing, as described below.
[0037] In the described embodiment, the individual documents 17, 20, 26, 29 include electronic message folders, such as maintained by the Outlook and Outlook Express products, licensed by Microsoft Corporation, Redmond, Wash. The database is an SQL-based relational database, such as the Oracle database management system, release 8, licensed by Oracle Corporation, Redwood Shores, Calif.
[0038] The individual computer systems, including system 11, server 15, client 18, remote server 24 and remote client 27, are general purpose, programmed digital computing devices consisting of a central processing unit (CPU), random access memory (RAM), non-volatile secondary storage, such as a hard drive or CD ROM drive, network interfaces, and peripheral devices, including user interfacing means, such as a keyboard and display. Program code, including software programs, and data are loaded into the RAM for execution and processing by the CPU and results are generated for display, output, transmittal, or storage.
[0039]
[0040] During text analysis, the text analyzer 42 identifies terms and phrases and extracts concepts in the form of noun phrases that are stored in a lexicon 18 maintained in the database 30. After normalizing the extracted concepts, the text analyzer 42 generates a frequency table 46 of concept occurrences, as further described below with reference to
[0041] Each module is a computer program, procedure or module written as source code in a conventional programming language, such as the C++ programming language, and is presented for execution by the CPU as object or byte code, as is known in the art. The various implementations of the source code and object and byte codes can be held on a computer-readable storage medium or embodied on a transmission medium in a carrier wave. The document analyzer 12 operates in accordance with a sequence of process steps, as further described below with reference to
[0042]
[0043]
[0044] Once identified and retrieved, the set of documents 44 is analyzed (block 73), as further described below with reference to
[0045]
[0046] Following preprocessing, a histogram 48 of the frequency of terms (shown in
[0047] Next, a document reference frequency (corpus) graph 49, as further described below with reference to
[0048] The selected set of terms and phrases falling within the thresholds are used to generate themes (and concepts) (block 85) based on correlations between normalized terms and phrases in the documents set. In the described embodiment, themes are primarily used, rather than individual concepts, as a single co-occurrence of terms or phrases carries less semantic meaning than multiple co-occurrences. As used herein, any reference to a theme or concept will be understood to include the other term, except as specifically indicated otherwise.
[0049] Next, clusters are created (block 86) from groups of highly-correlated concepts and themes. Individual concepts and themes are categorized based on, for example, Euclidean distances calculated between each pair of concepts and themes and defined within a pre-specified range of variance, such as described in commonly-assigned U.S. Pat. No. 6,778,995, issued Aug. 17, 2004, the disclosure of which is incorporated by reference.
[0050] A matrix 47 of the documents 44 is created (block 87), as further described below with reference to
[0051]
[0052] Initially, noun phrases are extracted (block 91) from each document 44. In the described embodiment, concepts are defined on the basis of the extracted noun phrases, although individual nouns or tri-grams (word triples) could be used in lieu of noun phrases. In the described embodiment, the noun phrases are extracted using the LinguistX product licensed by Inxight Software, Inc., Santa Clara, Calif.
[0053] Once extracted, the individual terms or phrases are loaded into records stored in the database 30 (shown in
[0054]
[0055]
[0056] Referring back to
[0057]
[0058]
[0059]
[0060] A median value 145 is selected and edge conditions 146a-b are established to discriminate between concepts which occur too frequently versus concepts which occur too infrequently. Those documents falling within the edge conditions 146a-b form a subset of documents containing latent concepts. In the described embodiment, the median value 145 is document-type dependent. For efficiency, the upper edge condition 146b is set to 70% and the 64 concepts immediately preceding the upper edge condition 146b are selected, although other forms of threshold discrimination could also be used.
[0061]
[0062]
[0063] For a set of n documents, the distance d.sub.cluster is calculated by taking the sum of products (inner product) by terms between document concept frequency occurrences and cluster concept weightings, using the following equation:
where doc.sub.term represents the frequency of occurrence for a given term i in the selected document and cluster.sub.term represents the weight of a given cluster for a given term i. The weights of the individual inner products are iteratively updated until the clusters settle. The goal is to calculate the minimum distances between as few clusters as possible until the rate of change goes constant. The rate of change can be calculated, for example, by taking the first derivative of the inner products over successive iterations.
[0064]
[0065] Satisfactory results are shown when a meaningful cluster of documents is found. Objectively, each document within a given theme will have an inner product falling within a pre-defined variance of other related documents, thereby reflecting a set amount of similarity. The cluster itself represents a larger grouping of document sets based on related, but not identical, themes.
[0066] If necessary, the results are re-run (block 182). One reason to re-run the results set would be to re-center the median value 145 of the corpus graph 140 (shown in
[0067] While the invention has been particularly shown and described as referenced to the embodiments thereof, those skilled in the art will understand that the foregoing and other changes in form and detail may be made therein without departing from the spirit and scope of the invention.