METHOD OF GRAPH MODELING ELECTRONIC DOCUMENTS WITH AUTHOR VERIFICATION
20230004583 · 2023-01-05
Inventors
- Haralambos Marmanis (Waltham, MA, US)
- Robin James Bramley (Worcestershire, GB)
- Matthew Kleiderman (Salem, MA, US)
Abstract
A method for generating a graphical model of a plurality of electronic documents establishes connections between individual electronic documents with common authorship even if the spelling of the name of the author varies amongst the documents, for instance, due to the use of abbreviations, pseudonyms, misspellings, and the like. The graphical model is generated by ingesting data from the electronic documents and constructing a base graphical model using the processed data. Thereafter, as part of a disambiguation step, similar authors amongst the plurality of electronic documents are identified and clustered to yield an author similarity graph, which is preferably refined over time. A degree of belief, or similarity inference, is then calculated for documents determined to have common authorship and, in turn, incorporated into the base graphical model. As a result, an inference of the accuracy of linked information in the graphical model can be established.
Claims
1. A computer-implemented method for generating a graphical model of a plurality of electronic documents, each electronic document comprised of data which includes identifying information, the identifying information including authorship, the method comprising the steps of: (a) ingesting the data from each of the plurality of electronic documents; (b) constructing a base graphical model using the data from the plurality of electronic documents; (c) disambiguating any relatedness of identifying information between select pairs of the plurality of electronic documents; and (d) calculating a degree of belief of relatedness of identifying information between select pairs of electronic documents, wherein the degree of belief of relatedness of identifying information between select pairs of electronic documents is incorporated into the base graphical model.
2. The method as claimed in claim 1 wherein, as part of the disambiguating step, common authorship between select pairs of the plurality of electronic documents is identified.
3. The method as claimed in claim 2 wherein, as part of the disambiguating step, common authorship between select pairs of electronic documents is identified even with variances in spelling.
4. The method as claimed in claim 3 wherein, as part of the disambiguating step, pairs of electronic documents identified as having common authorship are linked.
5. The method as claimed in claim 4 wherein, as part of the calculating step, the degree of belief of common authorship between select pairs of electronic documents is assigned a numerical value of probability.
6. The method as claimed in claim 5 wherein, as part of the calculating step, the numerical value is calculated through a pair prediction algorithmic process.
7. The method as claimed in claim 6 wherein, as part of the ingesting step, data from the plurality of electronic documents is compiled and processed for data modeling.
8. The method as claimed in claim 7 wherein the ingesting step produces a table of data fragments from each of the plurality of electronic documents.
9. The method as claimed in claim 8 wherein, as part of the constructing step, the base graphical model is constructed using the table of data fragments from each of the plurality of electronic documents.
10. The method as claimed in claim 9 wherein, as part of the constructing step, the table of data fragments from each of the plurality of electronic documents is processed to yield a set of tables comprising: (a) a document node table, which associates each electronic document with a corresponding node in the base graphical model; (b) an author node table, which lists the authorship of each electronic document; and (c) graph edge tables, which list relationships between nodes in the base graphical model.
11. The method as claimed in claim 10 wherein the disambiguation step comprises: (a) a linking phase in which the author node table is processed to identify similarities in author names and thereby allow for the construction of a collaboration graph, the collaboration graph comprising author nodes, article nodes, contribution edges and citation edges; (b) a clustering phase in which a similarity graph is constructed using the author nodes and similar person edges, including those derived from the collaboration graph; and (c) a refinement phase in which clustering results are examined to resolve variances in author names.
12. The method as claimed in claim 11 wherein the similar person edges are created using at least one technique from the group consisting of name matching, author identification code matching, and collaboration graph construction.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0017] In the drawings, wherein like reference numerals represent like parts:
[0018]
[0019]
[0020]
[0021]
[0022]
[0023]
[0024]
[0025]
[0026]
DETAILED DESCRIPTION OF THE INVENTION
Graph Modeling Method 211
[0027] Referring now to
[0028] In the description that follows, method 211 is illustrated verifying the accuracy of the listed authorship for electronic documents in the graphical model. However, it should be noted that the principles of the present invention are not limited to incorporating information related to the certainty of authorship into a graphical model. Rather, it is to be understood that method 211 could be utilized to integrate a degree of belief, or confidence, of the veracity of any type of information, or data, included in the graphical model without departing from the spirit of the present invention.
[0029] As defined herein, the term “document” denotes any electronic record, or work. In the description that follows, documents are represented primarily as articles, such as scientific publications. However, it is to be understood that use of the term “document” herein is not intended to be limited to scientific publications or other similar types of articles. Rather, use of the term “document” is meant to encompass any/all forms of electronic records (e.g., arbitrary, text-based, information records) derived from any source, including literature, online news stories, and even database records, without departing from the spirit of the present invention.
[0030] As seen in
Data Ingestion Step 213
[0031] As referenced above, data ingestion step 213 involves acquiring, processing, and storing data from a set of electronic documents into a designated data pipeline. The frequency of data acquisition and ingestion is preferably dependent upon the volume and release dates of the electronic documents in the designated pipeline. As previously noted, the ingested data from the set of electronic documents is subsequently utilized for data graph modeling.
[0032] Preferably, data ingestion step 213 is implemented entirely through a cloud computing services platform, thereby only requiring compute resources when processing data. For example, an Amazon Web Services (AWS) cloud computing services platform could be utilized to implement step 213, thereby allowing for an optimized selection and configuration of web services tools. For instance, data acquisition could be implemented using Python programming scripts on AWS-based Simple Storage Service (S3), processed using AWS-based Elastic MapReduce (EMR), and stored in a column-oriented file structure on the AWS-based Simple Storage Service (S3).
[0033] However, it should be noted that the use of a primarily AWS-based cloud computing services platform is provided for illustrative purposes only. Rather, step 213 could be similarly implemented using alternative cloud computing services platforms, such as the Microsoft Azure cloud computing services platform, without departing from the spirit of the present invention.
[0034] Referring now to
[0035] As seen in
[0036] As seen in
[0037] In the second stage of step 213, article data (i.e., the content of each electronic document) is ingested into the data pipeline. As a novel feature of ingestion process 213, the present invention is designed to support any updates for article data ingested into the pipeline. For instance,
[0038] Thereafter, through a consolidation process 253, data tables 245 and 251 are combined to create an intermediate data table 255 of article fragments, which represents the partially processed documents. Each record in table 255 preferably includes specifically extracted document properties, plus front and back matter fragments (e.g., the metadata and references). Where multiple overlapping sources are used for a single entity/node type, then a further consolidation/disambiguation step is required (not illustrated).
Graph Construction Step 215
[0039] As referenced above, graph construction step 215 involves the creation of a base graph model using the data tables generated from data ingestion step 213, the model allowing for the visualization of relationships (shown as vectors) between various nodes (e.g., authors, article content, and general article reference data). Graph construction step 215 and disambiguation step 217 are preferably implemented using any suitable distributed data processing framework, such as Apache Spark. As such, graph data can be exported in a suitable format for a graph database management system such as Neo4j.
[0040] Referring now to
[0041] It should be noted that graph edge tables 269 may represent, inter alia, (i) relationships between different node types (e.g., contributions, or linking, between article nodes and author nodes), (ii) relationships between nodes of the same type (e.g., a citation reference, or linking, between multiple article nodes), and (iii) connections to reference data nodes (e.g., linking a scientific article with a particular journal in which it was published).
[0042] For optimal performance, edge construction preferably relies on well-known identifying information, or identifiers, whenever available. The use of well-known identifiers eliminates the need to perform a look-up of the target at write time. Depending on the quality of the input data source and/or implementor preferences or constraints, integrity checking may be required at construction time, prior to projection of the data, or delegated to a downstream graph database management system.
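By way of illustration only, the projection of article fragments into a document node table, an author node table, and graph edge tables might be sketched as follows. The field names (`doi`, `title`, `authors`, `cites`), the `Tables` container, and the per-byline author identifier scheme are assumptions made for this sketch and are not specified in the description. Citation edges are keyed directly on a well-known identifier, so no look-up of the target is performed at write time, and a separate integrity check reports edges whose target is unknown.

```python
from dataclasses import dataclass, field


@dataclass
class Tables:
    """Hypothetical container for the node and edge tables."""
    document_nodes: dict = field(default_factory=dict)       # doc id -> properties
    author_nodes: list = field(default_factory=list)         # one row per byline entry
    contribution_edges: list = field(default_factory=list)   # (author row id, doc id)
    citation_edges: list = field(default_factory=list)       # (citing doc id, cited doc id)


def build_tables(fragments):
    """Project article fragments into node and edge tables.

    Each fragment is assumed to be a dict with the keys 'doi', 'title',
    'authors' (list of name strings) and 'cites' (list of DOIs).
    """
    tables = Tables()
    for frag in fragments:
        doc_id = frag["doi"]  # well-known identifier used directly as the key
        tables.document_nodes[doc_id] = {"title": frag["title"]}
        for position, name in enumerate(frag["authors"]):
            author_id = f"{doc_id}#{position}"  # assumed scheme: one node per byline entry
            tables.author_nodes.append({"id": author_id, "name": name})
            tables.contribution_edges.append((author_id, doc_id))
        for cited in frag["cites"]:
            # The cited DOI is itself the edge target, so no look-up is needed here.
            tables.citation_edges.append((doc_id, cited))
    return tables


def dangling_citations(tables):
    """Integrity check: citation edges whose target is not a known document."""
    return [e for e in tables.citation_edges if e[1] not in tables.document_nodes]
```

Whether the integrity check runs at construction time or is delegated to a downstream graph database is left open here, as in the description.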
Disambiguation Step 217
[0043] Author name disambiguation step 217 is a multi-step process designed to ensure, or verify, that the proper author is associated with each electronic document in the graphical model. As noted previously, the applicant has recognized that inconsistencies in the spelling of the name of an author amongst different document resources often result in the incorrect identification of an individual as a document author. As a result, the accuracy of a graph model generated for a collection of electronic documents can be significantly compromised. Accordingly, the process by which disambiguation step 217 ensures, or certifies, that proper authorship is associated with a collection of scientific articles forms a critical aspect of the present invention.
[0044] Disambiguation step 217 is split into the following sequence of phases: (i) a linking phase in which author node records are processed to identify similarities in author names and paths occurring within a collaboration graph are analyzed, (ii) a clustering phase in which a similarity graph is constructed using author nodes and similar person edges produced from the linking phase, and (iii) a refinement phase in which clustering results are examined to resolve author ambiguities created through the use of, inter alia, homonyms and synonyms. Each phase referenced above is described in further detail below.
[0045] Referring now to
[0046] The collaboration graph construction technique is represented on the left-hand side of
[0047] Self-citation is one example of a common pattern of relationships which can be identified through a graphing process. A self-citation occurs when an author cites another document previously written by the same author, which is common in the scientific community. Through self-citation graphing, two authors of articles can be linked by a citation. Through additional filtering and comparison of the author names, preferably using a last name and a first name initial, author synonyms can be discovered and, in turn, used to construct similarity edges between the citing author and the cited author. This is represented generally on the right-hand side of
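The self-citation pattern described above might be sketched, under stated assumptions, as follows. The sketch assumes names are written in "First ... Last" order, that author records are available per document, and that the edge label `self_citation` is a hypothetical name; the additional filtering applied in practice is not detailed in the description and is reduced here to a comparison of last name and first initial.

```python
def name_key(full_name):
    """Return (last name, first initial), lower-cased, for a 'First ... Last' name."""
    parts = full_name.replace(".", " ").split()
    if len(parts) < 2:
        return None  # too little information to compare
    return (parts[-1].lower(), parts[0][0].lower())


def self_citation_edges(authors_by_doc, citations):
    """Build similarity edges between citing and cited author records.

    authors_by_doc: dict mapping doc id -> list of (author record id, name)
    citations: iterable of (citing doc id, cited doc id)
    """
    edges = []
    for citing, cited in citations:
        for a_id, a_name in authors_by_doc.get(citing, []):
            for b_id, b_name in authors_by_doc.get(cited, []):
                key_a, key_b = name_key(a_name), name_key(b_name)
                # Assumption: identical spellings are left to exact name matching,
                # so only differing spellings (synonyms) produce an edge here.
                if key_a is not None and key_a == key_b and a_name != b_name:
                    edges.append((a_id, b_id, "self_citation"))
    return edges
```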
[0048] Fuzzy name matching is another example of a common pattern of relationships which can be improved through the application of a graphing process. Within the collaboration graph, communities, or cliques, within the graph model can be detected. Specifically in this case, linking phase 277 runs a connected components algorithm, which is an implementation of the alternating, big-star, little-star algorithm. Once the components have been allocated, the giant component is removed from consideration. Frequently, the remaining components have been found to be highly cohesive. Then, candidates (i.e., authors) are considered that are within the same component and share the same last name. If the initials and forename match exactly, the candidate is discarded since it should have already been identified through exact name (hash) matching. If candidates pass a high threshold, a name proximity similarity edge row is constructed in table 279. If candidates pass a lower threshold but the affiliation of the author passes a secondary threshold, an author name and affiliation proximity similarity edge row is constructed in table 279.
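A minimal single-machine sketch of the fuzzy name matching steps above is given below. It substitutes a union-find routine for the distributed connected components algorithm, uses the `difflib.SequenceMatcher` ratio as the name proximity measure, and treats the collaboration graph as direct co-author edges between author records. These choices, the threshold values, and the edge labels are all assumptions of the sketch, not details taken from the description.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def connected_components(nodes, edges):
    """Union-find components (a stand-in for a distributed algorithm)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = defaultdict(list)
    for n in nodes:
        groups[find(n)].append(n)
    return list(groups.values())


def similarity(a, b):
    """Assumed proximity measure: difflib ratio on lower-cased strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_edges(authors, collab_edges, high=0.85, low=0.70, affil_min=0.80):
    """authors: dict id -> {'last', 'first', 'affiliation'}; thresholds are illustrative."""
    components = connected_components(list(authors), collab_edges)
    if components:
        components.remove(max(components, key=len))  # drop the giant component
    rows = []
    for component in components:
        for a, b in combinations(sorted(component), 2):
            ra, rb = authors[a], authors[b]
            if ra["last"].lower() != rb["last"].lower():
                continue  # candidates must share the same last name
            if ra["first"].lower() == rb["first"].lower():
                continue  # exact match: already covered by hash matching
            score = similarity(ra["first"], rb["first"])
            if score >= high:
                rows.append((a, b, "name_proximity"))
            elif score >= low and similarity(ra["affiliation"], rb["affiliation"]) >= affil_min:
                rows.append((a, b, "name_affiliation_proximity"))
    return rows
```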
[0049] Fields of study matching is another example of a common pattern of relationships which can be identified through a graphing process. With fields of study matching, graphical paths are used to construct “fields of study” vectors that represent the specific topics on which an article author has been published. These fields of study vectors are then compared for candidate matches to reinforce similarity edges.
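As a hedged illustration, a fields of study vector can be represented as a count of topics over the articles reachable from an author node, with two vectors compared by cosine similarity. The choice of cosine similarity, the function names, and the input mappings (`authored`, `topics_of`) are assumptions of this sketch; the description does not specify the comparison measure.

```python
import math
from collections import Counter


def field_vector(author_id, authored, topics_of):
    """Count topics over the articles reachable from an author node.

    authored: dict author id -> list of doc ids
    topics_of: dict doc id -> list of topic labels
    """
    vec = Counter()
    for doc in authored.get(author_id, []):
        vec.update(topics_of.get(doc, []))
    return vec


def cosine(u, v):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0
```

A candidate pair whose vectors score highly could then have its similarity edge reinforced, while a pair with disjoint topics would not.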
[0050] Finally, there is a mechanism to support certain known corrections in authorship. Specifically, as shown in
[0051] As referenced briefly above, upon completion of the linking phase of disambiguation step 217, a clustering phase is undertaken to construct a similarity graph using author nodes and similar person edges produced from the linking phase. Referring now to
[0052] Preferably, an iterative graph algorithm is executed as part of process 305 to identify clusters of potentially common authors. Specifically, any graph clustering algorithm (such as a connected components algorithm) can be run to allocate clusters to each author node. The clusters are then processed and the name of the distinct author is selected on the basis of a general criterion of utility such as the longest name amongst all the names in the cluster or the most frequent occurrence of a name if all names are of approximately the same length.
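The clustering and name selection just described might be sketched as follows. The sketch uses an iterative minimum-label propagation as one simple connected components variant, and interprets "approximately the same length" as a character-count spread within a hypothetical `tolerance` parameter; both are assumptions of the sketch.

```python
from collections import Counter, defaultdict


def cluster_authors(author_ids, similar_edges):
    """Iterative min-label propagation over similar person edges."""
    label = {a: a for a in author_ids}
    changed = True
    while changed:
        changed = False
        for a, b in similar_edges:
            low = min(label[a], label[b])
            if label[a] != low or label[b] != low:
                label[a] = label[b] = low
                changed = True
    clusters = defaultdict(list)
    for author, cluster_label in label.items():
        clusters[cluster_label].append(author)
    return dict(clusters)


def pick_name(names, tolerance=2):
    """Select the distinct author name for a cluster.

    If all names are within `tolerance` characters of each other in length,
    the most frequent name is chosen; otherwise the longest name is chosen.
    """
    lengths = [len(n) for n in names]
    if max(lengths) - min(lengths) <= tolerance:
        return Counter(names).most_common(1)[0][0]
    return max(names, key=len)
```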
[0053] It is important for downstream consumers of the data pipeline that distinct authors maintain stable identifiers (e.g., to allow for the augmentation of data). In other words, as articles within the data pipeline are added and/or removed, clusters in turn can grow (e.g., form new clusters), shrink (e.g., delete existing clusters), or remain the same. Accordingly, the members, or components, within each cluster may migrate between clusters, form a new cluster, or be permanently deleted.
[0054] Therefore, the clustering phase is preferably designed with logic to ensure that cluster identifiers remain stable. Referring now to
[0055] Thereafter, the identified author clusters in table 313 are processed to produce distinct, or verified, author nodes. Specifically, a distinct author construction process 315 is applied to the clusters in table 313 to produce a distinct author node table 317. Subsequently, process 315 produces a disambiguated edge table 319 that links cluster members (i.e., author nodes defined in table 301) with the distinct author nodes defined in table 317, thereby facilitating the reconciliation of authorship in the author nodes. Additionally, process 315 can be used to produce edge tables 321 which link the distinct author nodes defined in table 317 to other entities, such as articles, topics, collaborators, and the like.
[0056] As the final phase of disambiguation step 217, an optional refinement phase is undertaken in which clustering results are examined to resolve author ambiguities. Notably, due to the use of author homonyms and synonyms, clustering results can suffer from lumping and splitting errors. The identification of such errors can be accomplished using a classifier model trained on a labelled data set, as will be explained further below.
[0057] Specifically, as part of the refinement stage, a decision-tree classifier is trained using labelled data and, in turn, is utilized to refine author clusters. In other words, the decision-tree classifier is used to make predictions on whether cluster data (i.e., cluster members) represents the same person. Based on the pair-level predictions which can be interpreted as a distance measure, the cluster members can be re-clustered using a distributed clustering algorithm to produce refined clusters.
[0058] Referring now to
[0059] Input data 401 is preferably labelled data that was intentionally withheld from the similarity, or cluster, model. Model training is computationally intensive in nature and involves (i) a preparation step 405, in which input data 401 is split into training pair data 407 and test pair data 409, (ii) a training process 411 in which features data 403 from the similarity model and training pair data 407 are used to create a trained classifier model 413, and (iii) a testing process 415 in which data from trained classifier model 413 and similarity model features table 403 is utilized to evaluate the clustering, or linking, of the ‘gold standard’ test pair data 409. If the results of evaluation process 415 exceed those of the previous model, the refined similarity, or cluster, model is deployed.
[0060] The trained decision-tree classifier 413 is then utilized to predict pairwise whether members of the cluster refer to the same person. Pair predictions are then processed to split clusters when necessary.
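As a minimal sketch of how pair predictions could be used to split a cluster, the function below regroups cluster members so that only members joined by a positive prediction remain together. The callable `predict_same` is a hypothetical stand-in for the trained classifier, and the single-machine regrouping shown here is an assumption of the sketch rather than the distributed clustering algorithm referred to in the description.

```python
from itertools import combinations


def split_cluster(members, predict_same):
    """Split one cluster into groups linked by positive pair predictions.

    members: list of author record ids in the cluster
    predict_same: callable (a, b) -> bool, standing in for the classifier
    """
    groups = [{m} for m in members]  # start with every member on its own
    for a, b in combinations(members, 2):
        if predict_same(a, b):
            group_a = next(g for g in groups if a in g)
            group_b = next(g for g in groups if b in g)
            if group_a is not group_b:
                group_a |= group_b  # merge the two groups
                groups = [g for g in groups if g is not group_b]
    return [sorted(g) for g in groups]
```

A cluster for which every pair is predicted to be the same person is returned unchanged as a single group.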
[0061] Referring now to
Degree of Belief Calculation Step 219
[0062] As the final step of method 211, an inferencing, or degree of belief (DoB) calculation, step 219 is implemented in order to infer the level of confidence in matched, or linked, authors. The application of the inferencing is not limited to authors, however, and can be determined along multiple dimensions for other node or edge types within the knowledge graph. In turn, any author matching inferences are incorporated as additional information, or knowledge, into the graph model and can thereby ensure proper authorship of each electronic document. Using belief calculation step 219, probabilities within the graphical model can be fed into inference algorithms to determine the likelihood of other relationships being true.
[0063] Referring now to
[0064] As an alternative example of calculating a degree of belief measure, the confidence in the volume of output created by an author may be calculated using the following formula:
$$\frac{1 - e^{-\alpha(x-\beta)}}{1 + e^{-\alpha(x-\beta)}}$$
[0065] where alpha (α) and beta (β) are controllable parameters, for which α=β=2 are sensible values, and x is the logarithm of the number of identified matches (or “duplicates”) minus some sensible upper limit of publications that an individual can produce within the specified time period. For values of x close to zero, the function is approximately 1, whereas for larger values it drops rapidly (exponentially) to zero.
[0066] In terms of defining values, after processing metadata, all articles are identified in which a particular author is listed, either identically or according to an accepted set of variations for equivalence (e.g., Ralph Stephen Baric==R. S. Baric==Ralph A. Baric). These articles will span a time period of T years. Assuming an upper limit of articles published per year (PPYL), X is defined as:
$$X = \log_{10}\left(\text{Number of articles} - \mathrm{PPYL} \times T\right)$$
[0067] PPYL is provided with a value of 50 to start and the results are examined. Because that value is high, only the truly most prolific authors would be able to match. Depending on the outcome, the value can be recalibrated, as needed, as well as the alpha and beta parameters of the DoB calculation equation.
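For illustration only, the two expressions above can be evaluated literally as shown below. The function names are hypothetical, and the default values (α=β=2, PPYL=50) are the starting values mentioned in the text. Evaluated exactly as written, the first expression is algebraically equal to tanh(α(x−β)/2). The sketch makes no attempt to recalibrate either expression, and it raises an error where the logarithm is undefined, i.e., where the number of articles does not exceed PPYL multiplied by T.

```python
import math


def degree_of_belief(x, alpha=2.0, beta=2.0):
    """Evaluate (1 - e^(-alpha(x-beta))) / (1 + e^(-alpha(x-beta))) as written."""
    z = math.exp(-alpha * (x - beta))
    return (1.0 - z) / (1.0 + z)


def x_value(num_articles, years, ppyl=50):
    """X = log10(num_articles - PPYL * T); defined only for a positive argument."""
    excess = num_articles - ppyl * years
    if excess <= 0:
        raise ValueError("log argument is not positive; X is undefined as written")
    return math.log10(excess)
```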
[0068] The invention described in detail above is intended to be merely exemplary and those skilled in the art shall be able to make numerous variations and modifications to it without departing from the spirit of the present invention. All such variations and modifications are intended to be within the scope of the present invention as defined in the appended claims.