Computer-implemented system and method for generating document groupings for display
09619551 ยท 2017-04-11
Assignee
Inventors
Cpc classification
Y10S707/99936
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
Y10S707/99932
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
Y10S707/99943
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
G06F16/955
PHYSICS
Y10S707/99945
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
International classification
Abstract
A computer-implemented system and method for generating document groupings is provided. A lexicon of terms extracted from a set of documents is generated. The lexicon includes a frequency of each extracted term within each document in the set. Concepts each having two or more of the extracted terms are generated. A subset of the documents in the set is selected based on the term frequencies. The subset of documents is grouped into clusters based on the concepts. A similarity of each document cluster is calculated with one or more documents based on a distance by summing the frequency of each term in that document and a weight of the cluster for each of the terms. The weights are updated until a rate of change for each cluster becomes constant.
Claims
1. A computer-implemented system for generating document groupings, comprising: a database to store a set of document, a lexicon of terms extracted from the set of documents and comprising a frequency of each extracted term within each document, and concepts each comprising two or more of the extracted terms; and a server comprising a central processing unit, memory, an input port to receive the documents, lexicon and concepts from the database, and an output port, wherein the central processing unit is configured to: select a subset of the documents in the set based on the term frequencies; group the subset of documents into clusters based on the concepts; calculate a similarity of each document cluster with at least one document based on a distance by summing the frequency of each term in that document and a weight of the cluster for each of the terms; and update the weights until a rate of change for each cluster becomes constant.
2. A system according to claim 1, wherein the central processing unit further identifies one or more of the clusters as meaningful by defining a variance, selects those clusters having one or more documents with inner products that fall within the variance as meaningful, and presents those documents identified as meaningful.
3. A system according to claim 1, wherein the central processing unit further represents semantic content of each document by mapping the terms in that document in order of decreasing frequency.
4. A system according to claim 1, wherein the central processing unit further represents latent semantics of the document set by mapping each term in the document set against a total frequency occurrence, wherein the total frequency occurrence is calculated as a sum of the term frequencies within each document in the set.
5. A system according to claim 4, wherein the central processing unit further selects a median value and edge conditions for the total frequency occurrences of the terms and generates the subset of documents from those documents in the set that satisfy the edge conditions.
6. A system according to claim 5, wherein the central processing unit further re-centers the median value and to generate a different subset of documents for grouping.
7. A system according to claim 5, wherein the central processing unit further sets the edge conditions based on a size of the documents.
8. A system according to claim 7, wherein larger documents have tighter edge conditions than shorter documents.
9. A system according to claim 1, wherein the central processing unit further calculates the distance using the following equation:
10. A system according to claim 1, wherein the central processing unit further calculates the rate of change by determining a first derivative of the inner products over successive iterations.
11. A computer-implemented method for generating document groupings, comprising: generating a lexicon of terms extracted from a set of documents and comprising a frequency of each extracted term within each document; generating concepts each comprising two or more of the extracted terms; selecting a subset of the documents in the set based on the term frequencies; grouping the subset of documents into clusters based on the concepts; calculating a similarity of each document cluster with at least one document based on a distance by summing the frequency of each term in that document and a weight of the cluster for each of the terms; and updating the weights until a rate of change for each cluster becomes constant.
12. A method according to claim 11, further comprising: identifying one or more of the clusters as meaningful, comprising: defining a variance; and selecting those clusters having one or more documents with inner products that fall within the variance as meaningful; and displaying those documents identified as meaningful.
13. A method according to claim 11, further comprising: representing semantic content of each document by mapping the terms in that document in order of decreasing frequency.
14. A method according to claim 11, further comprising: representing latent semantics of the document set by mapping each term in the document set against a total frequency occurrence, wherein the total frequency occurrence is calculated as a sum of the term frequencies within each document in the set.
15. A method according to claim 14, further comprising: selecting a median value and edge conditions for the total frequency occurrences of the terms; and generating the subset of documents from those documents in the set that satisfy the edge conditions.
16. A method according to claim 15, further comprising: re-centering the median value; and generating a different subset of documents for grouping.
17. A method according to claim 15, further comprising: setting the edge conditions based on a size of the documents.
18. A method according to claim 17, wherein larger documents have tighter edge conditions than shorter documents.
19. A method according to claim 11, further comprising: calculating the distance using the following equation:
20. A method according to claim 11, further comprising: calculating the rate of change by determining a first derivative of the inner products over successive iterations.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
DETAILED DESCRIPTION
Glossary
(15) Keyword: A literal search term which is either present or absent from a document. Keywords are not used in the evaluation of documents as described herein. Term: A root stem of a single word appearing in the body of at least one document. Phrase: Two or more words co-occurring in the body of a document. A phrase can include stop words. Concept: A collection of terms or phrases with common semantic meanings Theme: Two or more concepts with a common semantic meaning Cluster: All documents for a given concept or theme.
The foregoing terms are used throughout this document and, unless indicated otherwise, are assigned the meanings presented above.
(16)
(17) The document analyzer 12 analyzes documents retrieved from a plurality of local sources. The local sources include documents 17 maintained in a storage device 16 coupled to a local server 15 and documents 20 maintained in a storage device 19 coupled to a local client 18. The local server 15 and local client 18 are interconnected to the system 11 over an intranetwork 21. In addition, the document analyzer 12 can identify and retrieve documents from remote sources over an internetwork 22, including the Internet, through a gateway 23 interfaced to the intranetwork 21. The remote sources include documents 26 maintained in a storage device 25 coupled to a remote server 24 and documents 29 maintained in a storage device 28 coupled to a remote client 27.
(18) The individual documents 17, 20, 26, 29 include all forms and types of unstructured data, including electronic message stores, such as electronic mail (email) folders, word processing documents or Hypertext documents, and could also include graphical or multimedia data. Notwithstanding, the documents could be in the form of structured data, such as stored in a spreadsheet or database. Content mined from these types of documents does not require preprocessing, as described below.
(19) In the described embodiment, the individual documents 17, 20, 26, 29 include electronic message folders, such as maintained by the Outlook and Outlook Express products, licensed by Microsoft Corporation, Redmond, Wash. The database is an SQL-based relational database, such as the Oracle database management system, release 8, licensed by Oracle Corporation, Redwood Shores, Calif.
(20) The individual computer systems, including system 11, server 15, client 18, remote server 24 and remote client 27, are general purpose, programmed digital computing devices consisting of a central processing unit (CPU), random access memory (RAM), non-volatile secondary storage, such as a hard drive or CD ROM drive, network interfaces, and peripheral devices, including user interfacing means, such as a keyboard and display. Program code, including software programs, and data are loaded into the RAM for execution and processing by the CPU and results are generated for display, output, transmittal, or storage.
(21)
(22) During text analysis, the text analyzer 42 identifies terms and phrases and extracts concepts in the form of noun phrases that are stored in a lexicon 18 maintained in the database 30. After normalizing the extracted concepts, the text analyzer 42 generates a frequency table 46 of concept occurrences, as further described below with reference to
(23) Each module is a computer program, procedure or module written as source code in a conventional programming language, such as the C++ programming language, and is presented for execution by the CPU as object or byte code, as is known in the art. The various implementations of the source code and object and byte codes can be held on a computer-readable storage medium or embodied on a transmission medium in a carrier wave. The document analyzer 12 operates in accordance with a sequence of process steps, as further described below with reference to
(24)
(25)
(26) Once identified and retrieved, the set of documents 44 is analyzed (block 73), as further described below with reference to
(27)
(28) Following preprocessing, a histogram 48 of the frequency of terms (shown in
(29) Next, a document reference frequency (corpus) graph 49, as further described below with reference to
(30) The selected set of terms and phrases falling within the thresholds are used to generate themes (and concepts) (block 85) based on correlations between normalized terms and phrases in the documents set. In the described embodiment, themes are primarily used, rather than individual concepts, as a single co-occurrence of terms or phrases carries less semantic meaning than multiple co-occurrences. As used herein, any reference to a theme or concept will be understood to include the other term, except as specifically indicated otherwise.
(31) Next, clusters are created (block 86) from groups of highly-correlated concepts and themes. Individual concepts and themes are categorized based on, for example, Euclidean distances calculated between each pair of concepts and themes and defined within a pre-specified range of variance, such as described in commonly-assigned U.S. Pat. No. 6,778,995, issued Aug. 17, 2004, the disclosure of which is incorporated by reference.
(32) A matrix 47 of the documents 44 is created (block 87), as further described below with reference to
(33)
(34) Initially, noun phrases are extracted (block 91) from each document 44. In the described embodiment, concepts are defined on the basis of the extracted noun phrases, although individual nouns or tri-grams (word triples) could be used in lieu of noun phrases. In the described embodiment, the noun phrases are extracted using the LinguistX product licensed by Inxight Software, Inc., Santa Clara, Calif.
(35) Once extracted, the individual terms or phrases are loaded into records stored in the database 30 (shown in
(36)
(37)
(38) Referring back to
(39)
(40)
(41)
(42) A median value 145 is selected and edge conditions 146a-b are established to discriminate between concepts which occur too frequently versus concepts which occur too infrequently. Those documents falling within the edge conditions 146a-b form a subset of documents containing latent concepts. In the described embodiment, the median value 145 is document-type dependent. For efficiency, the upper edge condition 146b is set to 70% and the 64 concepts immediately preceding the upper edge condition 146b are selected, although other forms of threshold discrimination could also be used.
(43)
(44)
(45) For a set of n documents, the distance d.sub.cluster is calculated by taking the sum of products (inner product) by terms between document concept frequency occurrences and cluster concept weightings, using the following equation:
(46)
where doc.sub.term represents the frequency of occurrence for a given term i in the selected document and cluster.sub.term represents the weight of a given cluster for a given term i. The weights of the individual inner products are iteratively updated until the clusters settle. The goal is to calculate the minimum distances between as few clusters as possible until the rate of change goes constant. The rate of change can be calculated, for example, by taking the first derivative of the inner products over successive iterations.
(47)
(48) Satisfactory results are shown when a meaningful cluster of documents is found. Objectively, each document within a given theme will have an inner product falling within a pre-defined variance of other related documents, thereby reflecting a set amount of similarity. The cluster itself represents a larger grouping of document sets based on related, but not identical, themes.
(49) If necessary, the results are re-run (block 182). One reason to re-run the results set would be to re-center the median value 145 of the corpus graph 140 (shown in
(50) While the invention has been particularly shown and described as referenced to the embodiments thereof, those skilled in the art will understand that the foregoing and other changes in form and detail may be made therein without departing from the spirit and scope of the invention.