Determining similarity between documents
11636167 · 2023-04-25
Assignee
Inventors
Cpc classification
G06F16/9535
PHYSICS
International classification
Abstract
Method and system for processing digital works, the method comprising the steps of identifying terms within each digital work of a plurality of digital works, wherein the terms are words and/or phrases. Determining a number of times that the identified terms occur within each digital work of the plurality of digital works. Generating a fingerprint for each digital work of the plurality of digital works, the generated fingerprint based on the identified terms and the number of times that the identified terms occur within each digital work. Using a neural network to find an encoding function, g, that encodes a higher dimensionality space, x, of each fingerprint into a lower dimensionality space, y. Applying the encoding function to each fingerprint of the plurality of digital works to reduce their dimensionality. Determining a similarity between a first fingerprint and one or more dimensionality reduced fingerprints.
Claims
1. A method for processing digital works, the method comprising steps of: identifying terms within each digital work of a plurality of digital works, wherein the terms are words and/or phrases; determining a number of times that the identified terms occur within each digital work of the plurality of digital works; generating a fingerprint for each digital work of the plurality of digital works, the generated fingerprint based on the identified terms and the number of times that the identified terms occur within each digital work; using a neural network to find an encoding function, g, that encodes a higher dimensionality space, x, of each fingerprint into a lower dimensionality space, y; applying the encoding function to each fingerprint of the plurality of digital works to reduce the dimensionality of each fingerprint; determining a similarity between a first fingerprint and one or more dimensionality reduced fingerprints; and generating an output of data indicating digital works having fingerprints with a similarity to a first dimensionality reduced fingerprint exceeding a threshold, wherein the output is generated in response to receiving a query, and wherein a similarity between two or more dimensionality reduced fingerprints is determined based on a scoring function sim(.), as:
2. The method of claim 1, wherein the query includes one or more terms and the first fingerprint is formed from the query.
3. The method of claim 2 further comprising the step of receiving from a user the query formed from a selection of the one or more terms.
4. The method according to claim 2 further comprising the step of applying the encoding function to the first fingerprint to reduce the dimensionality of the first fingerprint before determining the similarity between the first fingerprint and one or more dimensionality reduced fingerprints.
5. The method according to claim 1, wherein the similarity is based on a co-occurrence of terms within the dimensionality reduced fingerprints.
6. The method according to claim 1 further comprising the steps: of receiving a new digital work; and adding the new digital work to the plurality of digital works.
7. The method according to claim 1, wherein the encoding function, g, is found according to the function:
g: x.fwdarw.y by estimating an original higher dimensionality space, {circumflex over (x)}, by inverting the function, g, according to:
g.sup.−1: y.fwdarw.{circumflex over (x)}.
8. The method according to claim 1 further comprising the step of storing the plurality fingerprints and data identifying associated digital works in a repository.
9. The method of claim 1, wherein an expectation of a score between the two terms, i and j is evaluated according to:
10. The method of claim 1 further comprising the step of sorting fingerprints by respective similarity values, computed by the scoring function.
11. The method according to claim 1, wherein identifying terms within each digital work further comprises identifying terms based on a look up table or a dictionary or by tagging terms previous identified in previous processing.
12. The method according to claim 1 further comprising the step of limiting a number of identified terms in each digital work.
13. The method of claim 12, wherein the limited number of identified terms is a predetermined limit.
14. The method according to claim 1, wherein the digital works are any one or more of: text, documents, web pages, articles, news stories, books, newspaper content, social media content, and/or publications.
15. A data processing apparatus comprising a processor adapted to perform the steps of: identifying terms within each digital work of a plurality of digital works, wherein the terms are words and/or phrases; determining a number of times that the identified terms occur within each digital work of the plurality of digital works; generating a fingerprint for each digital work of the plurality of digital works, the generated fingerprint based on the identified terms and the number of times that the identified terms occur within each digital work; using a neural network to find an encoding function, g, that encodes a higher dimensionality space, x, of each fingerprint into a lower dimensionality space, y; applying the encoding function to each fingerprint of the plurality of digital works to reduce the dimensionality of each fingerprint; determining a similarity between a first fingerprint and one or more dimensionality reduced fingerprints; and generating an output of data indicating digital works having fingerprints with a similarity to a first dimensionality reduced fingerprint exceeding a threshold, wherein the output is generated in response to receiving a query, wherein the similarity between two or more dimensionality reduced fingerprints is determined based on a scoring function sim(.), as:
16. A non-transitory computer-readable medium comprising instructions, which when executed by a computer, cause the computer to carry out the steps of: identifying terms within each digital work of a plurality of digital works, wherein the terms are words and/or phrases; determining a number of times that the identified terms occur within each digital work of the plurality of digital works; generating a fingerprint for each digital work of the plurality of digital works, the generated fingerprint based on the identified terms and the number of times that the identified terms occur within each digital work; using a neural network to find an encoding function, g, that encodes a higher dimensionality space, x, of each fingerprint into a lower dimensionality space, y; applying the encoding function to each fingerprint of the plurality of digital works to reduce the dimensionality of each fingerprint; determining a similarity between a first fingerprint and one or more dimensionality reduced fingerprints; and generating an output of data indicating digital works having fingerprints with a similarity to a first dimensionality reduced fingerprint exceeding a threshold, wherein the output is generated in response to receiving a query, wherein the similarity between two or more dimensionality reduced fingerprints is determined based on a scoring function sim(.), as:
Description
BRIEF DESCRIPTION OF THE FIGURES
(1) The present invention may be put into practice in a number of ways and embodiments will now be described by way of example only and with reference to the accompanying drawings, in which:
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9) It should be noted that the figures are illustrated for simplicity and are not necessarily drawn to scale. Like features are provided with the same reference numerals.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
(10) The method involves identifying and retrieving digital patterns, based on a similar querying pattern, that can be user or machine generated. The method may be implemented on a computer system, network or virtual server. The machine generated patterns/fingerprints may be constructed by aggregating data from different sources and transformed to sparse distributed representations of the highest occurring terms (e.g. words or phrases). A fingerprint may be described as a collection of terms that defines particular properties of a digital work, wherein these properties are exhibited throughout the digital work. In one example implementation, these terms are stored in a lookup table along with their occurrences (i.e. the number of times that each term appears in the digital work). Other storage formats may be used for the fingerprints. The term list can be expanded with terms identified in previous processing. A fingerprint may be generated either through an automatic process from a digital work or it can be manually created by a user who can select terms from a list of available terms. In this case, the fingerprint may be used within a search query to find digital works that have matching fingerprints (e.g. exactly or within a tolerance). Therefore, a fingerprint is not necessarily unique but can be shared by different digital works.
(11) The method includes a dimensionality reduction engine 102 that performs a non-linear mapping of the terms to a lower dimensional representation. Dimensionality reduction is achieved through the use of a neural network that minimises the number of dimensions required to represent the data, while ensuring that essential information is preserved. The method includes a similarity engine 104 that operates by computing a qualitative similarity score between the query and the available patterns based on the normalised co-occurrence of terms embedded in the lower dimension representations. The method ranks qualifying patterns based on a similarity score and retrieves the top matching patterns based on a threshold, for example.
(12) Aggregating potential patterns over large data sets leads to a very high number of observed patterns. Finding similar patterns for new data can be cumbersome with a large corpus of historic patterns. In other words, given a large enough corpus of historic patterns to compare against, the timely retrieval of similar patterns may entail an impractical computational cost.
(13) In text based applications, patterns can be observed at an abstract content level, with an individual term level and at all levels in-between. Depending on the intended granularity, thousands of patterns can emerge from a single piece of digital content. Such applications may involve hashes based on term analysis, which often do not best describe the original content. One advantage of the present system and method is that it provides a universal process to efficiently retrieve patterns that can be generalised for digital content based applications. This provides additional value by consolidating dimensionality reduction techniques based on artificial intelligence and an optimal similarity engine, ensuring the retrieval of similar patterns in a timely manner. This may occur even if the patterns are not directly mentioned in text.
(14)
(15) At step 60, the encoding function is applied to the fingerprint of the digital work to reduce its dimensionality. There may be a repository of fingerprints having reduced dimensionality generated in a similar way and according to the method 10. At step 70, similar fingerprints within the repository may be matched and step 80 provides an output of the matches so that digital works having similar fingerprints and therefore similar content may be provided.
(16) There may be a threshold applied to the matches so that only the best matches are output, which reduce the number of documents provided to only those most relevant digital works. Alternatively or additionally, matches may be limited to a particular or predetermined number of matches.
(17) The system therefore recommends digital works based on fingerprints of tangible and intangible entities that result from machine learning techniques and aggregation of structured/unstructured text. This in turn may be based on a query fingerprint generated using the same or a similar process (e.g. to find digital works based on a single digital work) or by a query of terms manually constructed by a user. In an example implementation, the system and method may assign a newly generated fingerprint (or digital work) to a more generalised/abstract fingerprint from a predetermined set of fingerprints (e.g. having a similar theme). Similarity comparisons are preceded by the dimensionality reduction engine 102, where a fingerprint pattern is compacted in a lower dimensional space. The system output may comprise recommendations of the most relevant entities to the newly generated fingerprint by means of a similarity scoring function. An overview of this particular example fingerprint recommendation system and method 100 is depicted in
(18) The method 100 of
(19) A schematic diagram of one such example implementation system 200 is shown in
(20) The process may be abstracted or summarised as follows: 1. Receive the digital content (or selection of terms forming a query); 2. Determine a fingerprint for the digital content; 3. Perform dimensionality reduction to reduce the data space; 4. Compare the determined fingerprint against a repository of pre-processed digital content and calculate the similarity score; and 5. Provide a list of the closest candidates from that comparison based on the similarity score and a numerical similarity threshold.
(21) An example implementation may come in the form of a news article classifier. Given a specific news article (e.g. one chosen by the user), the user can be presented with other articles relevant to the one being read. Additionally, the fingerprint recommendation system can be used to associate the article to entities, whose fingerprints are relevant to the digital content, even if the entities were not directly mentioned in the article. For example, an article about improvements to city waste management can be associated with cities that struggle with the subject, without those cities being directly mentioned in the article.
(22) The system receives the digital works or content before processing can commence. The digital content can be hosted on the web, the user's computer or on a digital storage medium, for example and it may be passed to the fingerprint recommendation system 200.
(23) The system 200 parses the digital content and identifies prominent terms either using a look up dictionary of terms of interest or by tagging terms that were deemed important in previously generated fingerprints. In this way, the number of terms processed may be limited but this set of terms may evolve with new terms added over time, for example. A single term may be found within the digital works in several different forms. For example, some terms may include titles, middle names, singular and plural forms. A de-duplication algorithm is preferably employed to map different surface forms into a single term to avoid duplication. Additionally a hard limit (t) may be set, such that if the occurrence of a term is lower than t, then that term is not considered as part of the fingerprint. Fingerprints can be sorted by the number of mentions relative to the total number of mentions for all terms in the fingerprint, for example. For larger or diverse documents a limit of terms per fingerprint may also be applied to increase the specificity of the fingerprint. A user may also be able to generate a custom query (i.e. fingerprint) by selecting from a list of prominent terms that were observed in the past and adjusting the importance of each term so that more important terms have a higher or weighted ‘occurrence’ value.
(24) Due to the high dimensionality of the term space (patterns may be a subset of this term space), an intermediary dimensionality reduction engine 102 is use to improve the performance of the system 200. The dimensionality reduction engine 102 is comprised of an artificial neural network that encodes the high dimensionality space (x) into a lower dimensionality space (y), by implementing a similarity engine score described below.
(25) The neural network can be used to implement a squashing function where the output, y, is generated through non-linear combinations of the input, x. The original space is {circumflex over (x)} and the model may be trained such that the distance between input x and {circumflex over (x)} is minimised:
x−{circumflex over (x)}.fwdarw.0 (1)
(26) Once converged, an encoding function, g, may be used to reduce the dimensionality of all new inputs:
g: x.fwdarw.y (2)
(27) An estimate of the original space {circumflex over (x)} is determined by inverting the function according to:
g.sup.−1: y.fwdarw.{circumflex over (x)} (3)
(28) Another way to describe this is that when both g and g.sup.−1 are applied, the result should be similar to the original input, x, as shown by:
{circumflex over (x)}: g.sup.−1(g(x))≈x (4)
(29)
(30) A scoring function is applied to fingerprints in a variable multidimensional structure preferably, stored as a sparse distribution representation in a distributed database. However, other storage formats may be used. Fingerprints can be visualized as a dense vector, wherein the x-axis represents each term and the y-axis represents the occurrence of each term, as illustrated in
(31) The scoring function may be defined as:
(32)
where sim(.) is the similarity function, f1 and f2 are the fingerprints to compare, i and j are terms inside f1 and f2 respectively, |.| denotes the cardinality, and is the set intersection operator.
(33) The expectation of a score between two terms of fingerprints f1 and f2 can be evaluated as follows:
(34)
(35) Therefore, the scoring method can be used when comparing fingerprints having different lengths, whilst limiting computational time and resources to matching terms.
(36) A term can also be a fingerprint, consisting of terms, forming a multidimensional space of interconnected fingerprints. In order to improve efficiency and reduce processing requirements, lower dimensionality distance space can be built to enable faster, in memory searching for fingerprints. The placement of fingerprints in this space may be based on precomputed distances between fingerprints, as dictated by the above similarity scoring function (equation 3) described above. The use of this function further reduces score calculation times.
(37) The following provides an example implementation of normalisation that may be applied to any or all of the fingerprints before processing further.
(38) Consider fingerprint f1 which is comprised of three terms t1, t2, t3:
(39) TABLE-US-00001 f1 = { t1, occurrence = 3 t2, occurrence = 1 t3, occurrence = 1 }
(40) Normalising the occurrences involves dividing each occurrence with the sum of occurrences (3+1+1=5) for all terms in f1. Thus:
(41) TABLE-US-00002 f1 = { t1, normalised occurrence = 3/5 = 0.6 t2, normalised occurrence = 1/5 = 0.2 t3, normalised occurrence = 1/5 = 0.2 }
(42) In this example, plotting the terms on a graph having occurrence number as the y-axis and term as the x-axis (as shown in other example fingerprint representations in
(43) The variable length between fingerprints may be solved by the use of the constraint i≡j. The following provides an example implementation for fingerprint f1 and f2 of different lengths, where fingerprint f2 is defined as:
(44) TABLE-US-00003 f2 = { t1, occurrence = 5 t5, occurrence = 1 }
(45) In this example, f2 has a length of 2 (i.e. the same as the number of terms), whereas f1 has a length of 3. Additionally, f2 does not contain t2 and t3 which are present in f1. When the similarity engine 104 applies the similarity function to the two fingerprints, only t1 is used for their comparison, since it is the only term that exists in both f1 and f2. In this way we have compared f1 and f2 which have lengths 3 and 2 respectively.
(46) A list of possible candidate matches is collected and sorted by their similarity score. A precomputed and optimised threshold may be used to discard candidates, whose similarity score is below this threshold. This similarity threshold can vary depending the context, type of data, preferred granularity and the number of candidates. The threshold may be set using other mechanisms. All candidates (i.e. matches) whose similarity score exceed the threshold are returned to the user as recommendations or as another output format. For example, the actual digital works may be returned to the user or a link to such works.
(47) As will be appreciated by the skilled person, details of the above embodiment may be varied without departing from the scope of the present invention, as defined by the appended claims.
(48) For example, whilst the examples provides show a single server 210, the load may be managed over several servers. Alternatively, the system may be implemented in a virtual server or other distributed computing environment or network. Many different users may submit queries or receive recommendations. Queries may be submitted from different terminal types, such as desktop computers, mobile devices, smart phones or tablet computers, for example. In some implementations, the queries may be submitted and responses received using a mobile application. Users of such a mobile application may be presented with high level topics and may select more specific areas of interest. In this way, a query may be devises that corresponds with a search criteria fingerprint. This search may be submitted as the query and compared with other fingerprints within a repository. Matches that meet the criteria (e.g. threshold for similarity) are returned to the user. Users may be presented with the output on demand or unprompted. For example, they may select, download or read a particular article, which is fingerprinted by the system so that recommendations can be sent to the user automatically.
(49) Many combinations, modifications, or alterations to the features of the above embodiments will be readily apparent to the skilled person and are intended to form part of the invention. Any of the features described specifically relating to one embodiment or example may be used in any other embodiment by making the appropriate changes.