METHOD FOR CITATION IDENTIFICATION
20240045897 · 2024-02-08
Assignee
Inventors
Cpc classification
G06F16/25
PHYSICS
International classification
G06F16/38
PHYSICS
Abstract
A computer-implemented method for identifying a product citation in a document, the method comprising searching, in the document, for an entity identifier corresponding to an entity and, if an instance of the entity identifier is detected in the document, determining a portion of the document around the instance of the entity identifier as a target text, wherein the entity is associated with a product catalogue, the product catalogue comprising a plurality of product identifiers; applying a first regular expression to the target text, wherein the first regular expression is configured to match one or more of the plurality of product identifiers; and if a product identifier from the plurality of product identifiers is determined to be cited in the target text, adding an entry to a citation database linking the document and the product identifier.
Claims
1. A computer-implemented method for identifying a product citation in a document, the method comprising: searching, in the document, for an entity identifier corresponding to an entity and, if an instance of the entity identifier is detected in the document, determining a portion of the document around the instance of the entity identifier as a target text, wherein the entity is associated with a product catalogue, the product catalogue comprising a plurality of product identifiers; automatically applying a first regular expression to the target text, wherein the first regular expression is configured to match one or more of the plurality of product identifiers; and if a product identifier from the plurality of product identifiers is determined to be cited in the target text, adding an entry to a citation database linking the document and the product identifier.
2. A method according to claim 1, wherein the searching for the entity identifier in the document comprises applying a second regular expression to the document.
3. A method according to claim 2, wherein there are multiple entity identifiers corresponding to the entity, and the second regular expression is configured to match all of the multiple entity identifiers.
4. A method according to claim 1, further comprising generating the first regular expression.
5. A method according to claim 4, wherein generating the first regular expression comprises obtaining, from a memory, a regular expression code associated with the product catalogue, and basing a first component of the first regular expression on the obtained regular expression code.
6. A method according to claim 4 wherein generating the first regular expression comprises generating, using at least one of the plurality of product identifiers, a second component of the first regular expression, the second component being configured to match the at least one product identifiers.
7. A method according to claims 5 wherein the at least one product identifiers were determined to not match the regular expression code.
8. A method according to claim 6 wherein generating the second component of the first regular expression comprises parsing the at least one product identifiers into a tree, and traversing the tree to generate the second component.
9. A method according to one of claims 4, further comprising storing the first regular expression in a memory.
10. A method according to claim 1, further comprising: in response to applying the first regular expression to the target text, identifying one or more tokens from the target text that match the first regular expression; and determining if any of the one or more tokens corresponds to a product identifier from the product catalogue and, if so, which product identifier.
11. A method according to claim 10, wherein determining if any of the one or more tokens corresponds to a product identifier comprises an iterative process, each iteration of the iterative process including: determining a set of prefixes for the plurality of product identifiers, each prefix having a predetermined number of characters; generating a set of prefix regular expressions corresponding to the set of prefixes; applying the prefix regular expressions to the one or more tokens, and keeping only the prefix regular expressions that return a match; wherein for each subsequent iteration, the predetermined number of characters is increased by one, and the set of prefixes is determined to include only prefixes that match a prefix regular expression kept in the previous iteration; and wherein, a token is determined to correspond to a product identifier if a prefix regular expression matches the token, and if the predetermined number of characters corresponds to a number of characters in the product identifier.
12. A method according to claim 10 further comprising, for each of the one or more tokens determined to correspond to a product identifier: generating a third regular expression configured to match the corresponding product identifier; applying the third regular expression to the target text; and if a match with the third regular expression is found in the target text, determining that the corresponding product identifier is cited in the target text.
13. A method according to claim 12, wherein the third regular expression comprises an identifier component configured to match the product identifier, and a context component that is configured to match a predetermined context.
14. A method according to claim 12 further comprising, for each of the one or more tokens determined to correspond to a product identifier: determining a risk factor for the corresponding product identifier, wherein the risk factor is associated with a risk of wrongly identifying the product identifier in the target text; and if a risk factor is identified, generating the third regular expression such that it includes an identifier component configured to match the product identifier, and a context component that is configured to match a predetermined context.
15. A method according to claim 10, wherein the first regular expression is configured to match all of the plurality of product identifiers.
16. A computer-implemented system comprising a processor and a memory storing instructions which, when executed by the processor, cause the processor to carry out the method of claim 1.
17. A computer-readable storage medium comprising instructions stored therein which, when executed by a processor, cause the processor to carry out the method of claim 1.
18. A computer-implemented method for generating a product citation database from a corpus of documents, the method comprising: searching, for each electronically accessible document within the corpus of documents, for an entity identifier corresponding to an entity and, if an instance of the entity identifier is detected in the document, determining a portion of the document around the instance of the entity identifier as a target text, wherein the entity is associated with a product catalogue, the product catalogue comprising a plurality of product identifiers; automatically applying a first regular expression to the target text, wherein the first regular expression is configured to match one or more of the plurality of product identifiers; and if a product identifier from the plurality of product identifiers is determined to be cited in the target text: generating a citation database file where the citation database does not exist; and adding an entry to the citation database linking the document and the product identifier.
19. A computer-implemented method for searching content within a database, the method comprising: receiving a search query for a product, performing a search on the citation database wherein the citation database was generated by the method of claim 18, and returning one or more product identifiers from the citation database based on the search query, wherein the one or more product identifiers are ranked based on a number of documents linked with each product identifier in the citation database.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0069] Embodiments of the invention will now be described by way of example with reference to the accompanying drawings, in which:
[0070]
[0071]
[0072]
[0073]
[0074]
[0075]
DETAILED DESCRIPTION
Further Options and Preferences
[0076]
[0077] It will be appreciated and understood that the document described herein can be an electronic or digital file that contains text and, optionally, images, tables, graphs, embedded links, video portions or other content. The electronic or digital document can have any known or understood file type, such as .doc, .txt, .rtf, .pdf, .html, .xml, and .json. The processors or computers described herein are configured to access such file types and evaluate the contents thereof.
[0078] In a first step 102, the method 100 involves searching, using a computer or processor suitably configured by code executing therein, the document for the entity identifier(s) associated with the entity. This may be done by performing a full-text search on the document for the entity identifiers associated with the entity. In some embodiments, this may be implemented by applying a regular expression to the document, where the regular expression is configured to match any (e.g. all) of the entity identifiers associated with the entity. Herein, the regular expression which is used to search for the entity identifiers is referred to as a second regular expression.
[0079] In some cases, there may be multiple entities (e.g. suppliers or manufacturers) of interest, each entity being associated with a respective product catalogue. In such a case, the step 102 may involve searching the document for entity identifiers associated with each of the multiple entities. Thus, a respective regular expression corresponding to each of the multiple entities may be applied to the document in step 102.
[0080] In a second step 104, if an instance of an entity identifier associated with the entity is detected in the document (e.g. if a string of characters in the text matches the second regular expression), then a portion of the document including the instance of the entity identifier is determined as target text for further analysis. The target text includes a predetermined amount of text (e.g. a predetermined number of characters) on either side of the detected instance of the entity identifier. As an example, the target text may include 1000 characters (or fewer) on either side of the detected instance of the entity identifier.
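Steps 102 and 104 can be illustrated with a minimal Python sketch. The entity identifiers, the function names and the case-insensitive matching are assumptions made for illustration only; the source does not prescribe them.

```python
import re

# Hypothetical entity identifiers (name variants) for a single supplier.
ENTITY_IDENTIFIERS = ["Acme Bio", "Acme Biosciences", "AcmeBio Inc."]

def build_entity_regex(identifiers):
    """Build a 'second regular expression' matching any of the identifiers."""
    # Longest first, so a longer variant wins over a shorter one it starts with.
    ordered = sorted(identifiers, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    # The lookarounds stop a match from starting or ending inside a longer word.
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)

def extract_target_texts(document, entity_regex, window=1000):
    """Return one target text per detected instance of an entity identifier."""
    targets = []
    for m in entity_regex.finditer(document):
        start = max(0, m.start() - window)        # clamp at document start
        end = min(len(document), m.end() + window)  # clamp at document end
        targets.append(document[start:end])
    return targets
```

Clamping the window at the document boundaries is one way to realise the "1000 characters (or fewer)" behaviour described above.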
[0081] In a third step 106, the method 100 involves applying a first regular expression to the target text. The first regular expression is configured to match one or more of the plurality of product identifiers in the product catalogue. Preferably, the first regular expression is configured to match all of the product identifiers in the product catalogue, so that only a single regular expression need be applied to the target text. In some cases, the first regular expression may be pre-computed for the product catalogue, in which case the first regular expression can simply be retrieved from a memory and applied to the target text. Alternatively, the first regular expression may be generated and/or updated as part of the method. An example process for generating the first regular expression is discussed in more detail below in relation to the method 200.
[0082] In the cases mentioned above where there are multiple entities of interest, the first regular expression is configured to match one or more of the plurality of product identifiers in the product catalogue associated with the entity whose entity identifier was detected in step 102. Thus, step 106 only searches for product identifiers corresponding to the entity whose entity identifier is in the target text.
[0083] In step 106, the first regular expression may be applied to the target text by a search engine, which searches the target text for strings of characters that match the first regular expression. If a string of characters in the target text is found to match the first regular expression, then a token corresponding to that string of characters is output. Thus, step 106 may result in one or more tokens being output, each token corresponding to a respective string of characters in the target text that matches the first regular expression.
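As a rough illustration of step 106, the sketch below applies a first regular expression to a target text and collects the matching strings as tokens. The pattern and the identifier format ("ab-1234", "AB1234") are hypothetical, and Python's built-in `re` module stands in for the search engine.

```python
import re

# Hypothetical first regular expression for a catalogue whose identifiers
# consist of two letters, an optional hyphen or space, and four digits.
FIRST_REGEX = re.compile(r"\b[A-Za-z]{2}[- ]?\d{4}\b")

def find_tokens(target_text, first_regex):
    """Return every string in the target text that matches the first regex."""
    return [m.group(0) for m in first_regex.finditer(target_text)]
```

Because the pattern is deliberately broad, a token is only a candidate at this point; step 108 still has to decide which product identifier, if any, it corresponds to.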
[0084] In step 108, the tokens output from step 106 (if any) are analysed to determine if any of them corresponds to one of the product identifiers in the product catalogue and, if so, to which product identifier. Indeed, as the first regular expression may match multiple product identifiers, it is necessary to determine which of the product identifiers corresponds to an identified token from the target text. An example implementation of step 108 is described in more detail below in relation to the method 300.
[0085] If a token output from step 106 is determined to correspond to one of the product identifiers, then in step 110 of the method 100, an entry is added to a citation database. The entry links the document with the cited product identifier, and may also further link the product identifier with the target text in which it was cited. Accordingly, by analysing multiple documents in accordance with the method 100, the citation database may be built up to include citations of product identifiers across the multiple documents. The citation database may associate product identifiers from the product catalogue with documents in which they were cited. This may, for example, enable products to be ranked in accordance with the number of times they have been cited, as well as facilitate rapid access to documents where they are cited. The citation database may store citation data for multiple entities, such that products from multiple entities can be ranked based on number of citations.
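The linking and ranking role of the citation database can be sketched with a small in-memory structure. The class name, method names and tie-breaking rule are illustrative assumptions, not part of the source.

```python
from collections import defaultdict

class CitationDatabase:
    """Toy in-memory citation store: product identifier -> citing documents."""

    def __init__(self):
        # product_id -> {document_id: target_text}
        self._entries = defaultdict(dict)

    def add_entry(self, document_id, product_id, target_text=None):
        """Link a document (and optionally the target text) to a product."""
        self._entries[product_id][document_id] = target_text

    def documents_citing(self, product_id):
        """Document identifiers in which the product was cited."""
        return sorted(self._entries.get(product_id, {}))

    def ranked_products(self):
        """Product identifiers ordered by number of citing documents."""
        # Ties are broken alphabetically (an arbitrary choice for this sketch).
        return sorted(self._entries, key=lambda p: (-len(self._entries[p]), p))
```

Keying the store by product identifier keeps both uses mentioned above cheap: counting citing documents for ranking, and looking up the documents in which a product is cited.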
[0086]
[0087] The method 200 involves, at step 202, determining whether a regular expression code is available for the relevant product catalogue. For example, there may be a regular expression database which is configured to store regular expression codes associated with various product catalogues. The regular expression codes may correspond to regular expressions that were pre-computed for the product catalogues, or computed as part of a previous iteration of the method 100. The regular expression codes stored in the database may additionally or alternatively include human-generated regular expressions. Thus, step 202 may involve searching the regular expression database for a regular expression code associated with the relevant product catalogue, i.e. corresponding to the entity whose entity identifier was found in step 102. For instance, after an instance of an entity identifier is found in step 102, step 202 may determine if a regular expression code corresponding to the entity whose entity identifier was found in step 102 is available.
[0088] If a regular expression code corresponding to the relevant product catalogue is found in step 202, then the regular expression code is retrieved and the method 200 moves on to step 204. Step 204 checks if any product identifiers in the product catalogue do not match the retrieved regular expression code. This may be achieved, for example, by comparing each of the product identifiers in the product catalogue with the regular expression code, to determine which (if any) of the product identifiers do not match the regular expression code.
[0089] If all product identifiers in the product catalogue match the regular expression code, then the method 200 moves on to step 206, in which the first regular expression is generated based on the retrieved regular expression code. For example, the regular expression code may be defined as the first regular expression, or the regular expression code may be included as part of the first regular expression. On the other hand, if the retrieved regular expression code does not match all of the product identifiers in the product catalogue, the method 200 moves on to step 208. In step 208, the first regular expression is generated to include a first component and a second component. The first component is based on the regular expression code, in a similar way to step 206. The second component is configured to match the remaining product identifiers, i.e. the product identifiers which are not matched by the regular expression code. In this manner, the first regular expression obtained from step 208 can match all of the product identifiers in the product catalogue. The second component of the first regular expression is generated using a regular expression generator. The regular expression generator is configured to take the remaining product identifiers (i.e. those which do not match the regular expression code) as an input, and to output a regular expression that matches all of the remaining product identifiers.
[0090] Returning to step 202, if no regular expression code is found for the product catalogue in the regular expression database, then the method 200 moves on to step 210. In step 210, the first regular expression is generated, by a suitably configured processor, based on the product catalogue, the first regular expression being configured to match all of the product identifiers in the product catalogue. This can be performed using a regular expression generator, which is configured to take the plurality of product identifiers in the product catalogue as an input, and to output the first regular expression. The regular expression generator used in step 210 may be the same (or have a similar configuration) as the one used in step 208.
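The branching between steps 202 to 210 can be sketched as follows. This is a minimal illustration: `generate_fallback_regex` is a hypothetical stand-in for the regular expression generator (here a plain alternation), and the stored regular expression code is passed in as an optional argument instead of being looked up in a database.

```python
import re

def generate_fallback_regex(identifiers):
    """Stand-in generator: a plain alternation of escaped identifiers."""
    ordered = sorted(identifiers, key=len, reverse=True)
    return "|".join(re.escape(i) for i in ordered)

def build_first_regex(catalogue, regex_code=None):
    """Return a pattern string matching every identifier in the catalogue."""
    if regex_code is None:
        # Step 210: no stored code, so generate from the whole catalogue.
        return generate_fallback_regex(catalogue)
    code = re.compile(regex_code)
    # Step 204: find identifiers that the stored code does not match.
    unmatched = [p for p in catalogue if not code.fullmatch(p)]
    if not unmatched:
        # Step 206: the stored code already covers the catalogue.
        return regex_code
    # Step 208: first component (stored code) plus a generated second component.
    return f"(?:{regex_code})|(?:{generate_fallback_regex(unmatched)})"
```

For example, with a stored code `ab-\d{4}` and a catalogue that also contains `XK9`, the result combines the stored code with a generated component covering only `XK9`.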
[0091] An example implementation of a regular expression generator that can be used in steps 208 and 210 will now be described. The regular expression generator is implemented as one or more processors suitably configured to receive as an input the product identifiers for which a regular expression is to be generated. In the case of step 208, this includes all of the product identifiers of the product catalogue that do not match the regular expression code retrieved in step 202. In the case of step 210, this includes all the product identifiers in the product catalogue. The regular expression generator is then configured to parse the product identifiers into a tree (e.g. prefix tree), and traverse the tree to generate a regular expression that matches all of the product identifiers. The tree is constructed by iterating over the product identifiers and, for each product identifier, splitting it into single-character tokens. The process then iterates over the tokens and builds a tree structure including a root node and a plurality of character nodes, each corresponding to a character token. The root node and character nodes are connected by edges such that each path through the tree from the root node to a leaf node corresponds to a respective product identifier. Accordingly, by performing a recursive walk down the tree, it is possible to construct an efficient pattern for matching all of the product identifiers. For example,
[0092] It should be noted that a tree may include terminal nodes, i.e. one or more nodes of the tree may be indicated as terminal nodes. For example, the tree 212 of
[0093] Accordingly, the method 200 outputs a first regular expression. The first regular expression obtained from the method 200 may then be applied to the target text in step 106 of the method 100, as discussed above. Regardless of whether the first regular expression is obtained from step 206, step 208 or step 210, the first regular expression is configured to match all of the product identifiers in the product catalogue. Following generation of the first regular expression by one of steps 206, 208 or 210, the first regular expression may be stored in a memory, e.g. in the regular expression database mentioned above. This may reduce an amount of processing required for subsequent analysis of documents. In particular, the first regular expression can be used again when analysing another document, without having to perform again the computationally expensive task of generating the first regular expression. In some cases, the regular expression code used in step 204 may correspond to a first regular expression for the product catalogue that was generated in a previous iteration of the method.
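The prefix-tree approach described for the regular expression generator could be sketched as below. The nested-dictionary representation, the empty-string terminal marker and the function names are assumptions chosen for brevity; the source does not specify a data structure.

```python
import re

END = ""  # marker key flagging a terminal node (end of a product identifier)

def build_trie(identifiers):
    """Parse identifiers into a prefix tree of single-character nodes."""
    root = {}
    for identifier in identifiers:
        node = root
        for ch in identifier:              # one character token per node
            node = node.setdefault(ch, {})
        node[END] = {}                     # an identifier ends at this node
    return root

def trie_to_regex(node):
    """Recursively walk the tree and emit a pattern for this subtree."""
    branches = [re.escape(ch) + trie_to_regex(child)
                for ch, child in sorted(node.items()) if ch != END]
    if not branches:
        return ""                          # leaf: nothing more to match
    if len(branches) == 1:
        body = branches[0]
    else:
        body = "(?:" + "|".join(branches) + ")"
    if END in node:
        # An identifier may stop here, so the remainder is optional.
        return "(?:" + body + ")?"
    return body
```

Shared prefixes are emitted once, so identifiers such as `ab1`, `ab12` and `ab13` collapse into a single branch with an optional tail instead of three separate alternatives.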
[0094]
[0095] The method 300 takes as an input the tokens which were found to match the first regular expression, i.e. strings of characters in the target text that were found to match the first regular expression. As shown in
[0096] At step 304, the method 300 then generates a respective prefix regular expression for each prefix in the set of prefixes determined in step 302. Each prefix regular expression is configured to match its corresponding n-character prefix. The prefix regular expression may be configured to require a word boundary before the prefix. This may serve to ensure that only prefixes starting at a word boundary are identified.
[0097] At step 306, the method 300 applies each of the prefix regular expressions to the tokens that were found to match the first regular expression as a result of step 106. Where there are multiple tokens, this may be done, for example, by combining the multiple tokens into a single string of characters, and applying in turn each prefix regular expression to the string of characters. The string of characters may include spaces between the tokens, i.e. so that there is a word boundary between each token. The prefix regular expression may be applied to the string of characters using a suitable search engine. When a prefix regular expression is applied to the tokens, the prefix regular expression will return a match if the tokens include the corresponding prefix.
[0099] At step 308, the method keeps only those prefix regular expressions which returned a match when applied to the tokens in step 306. The remaining, non-matching prefix regular expressions may be discarded.
[0100] The method 300 then returns to step 302, incrementing the number n by 1, and runs through steps 302-308 again. Thus, where n=2 in the first iteration of the method 300, n will be incremented to 3 in the next iteration, and so on. However, in iterations of step 302 after the first iteration, the set of prefixes is determined to only include prefixes that match the prefix regular expressions kept in step 308 in the previous iteration. Thus, as the iterative process of method 300 proceeds, the set of prefixes becomes progressively smaller, with the prefixes themselves increasing incrementally in length.
[0101] The method 300 terminates when n reaches the length of the product identifiers in the product catalogue, i.e. when n can no longer be increased. In particular, if a prefix in the set of prefixes represents an entire product identifier (i.e. the prefix has the same number of characters as the product identifier), and the corresponding prefix regular expression matches one of the tokens in step 306, then that token may be determined to correspond to the product identifier. Accordingly, the method 300 may output product identifiers which are determined to correspond to tokens identified in the target text. The product identifiers output from the method 300 may be used for adding an entry in the citation database, as shown in step 110 of method 100. Alternatively, a further verification process may be performed to confirm that the product identifiers determined in method 300 are actually cited in the target text. This verification step serves to improve the accuracy of the method.
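The iterative prefix narrowing of method 300 can be sketched as follows. The function name, the starting prefix length of 2 and the handling of identifiers of different lengths are assumptions; the sketch also assumes identifiers begin with a letter or digit, so that the word-boundary anchor behaves as intended.

```python
import re

def match_tokens_to_identifiers(tokens, catalogue, start_n=2):
    """Return catalogue identifiers that correspond to one of the tokens."""
    haystack = " ".join(tokens)       # spaces give a word boundary per token
    candidates = list(catalogue)      # identifiers still in play
    matched = set()
    n = start_n
    max_len = max((len(p) for p in catalogue), default=0)
    while n <= max_len:
        # Identifiers shorter than n were already resolved in earlier rounds.
        candidates = [p for p in candidates if len(p) >= n]
        if not candidates:
            break
        # Step 302: n-character prefixes of the surviving identifiers.
        prefixes = {p[:n] for p in candidates}
        # Steps 304-308: keep only prefixes whose regex matches the tokens.
        kept = {pre for pre in prefixes
                if re.search(r"\b" + re.escape(pre), haystack)}
        candidates = [p for p in candidates if p[:n] in kept]
        # A kept prefix spanning a whole identifier marks a correspondence.
        matched.update(p for p in candidates if len(p) == n)
        n += 1                        # next iteration uses longer prefixes
    return sorted(matched)
```

Each round discards every identifier whose prefix is absent from the tokens, so the number of regular expressions applied shrinks quickly even for a large catalogue. As in the description, a full-length prefix match only makes the identifier a candidate, which is why a further verification step may follow.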
[0102] The method 400 is performed for each product identifier which is determined to correspond to a token from the target text. For example, the method 400 may be performed for each product identifier determined to correspond to a token from the target text following the method 300. In the following, each product identifier determined to correspond to a token from the target text (e.g. by method 300) may be referred to as a candidate product identifier. In general terms, the method 400 involves, for each candidate product identifier, generating, using a suitably configured processor, a third regular expression for that product identifier, and applying the third regular expression to the target text to determine if the product identifier is indeed cited in the target text.
[0103] Starting at step 402, the method 400 determines a risk factor for one of the candidate product identifiers, the risk factor being associated with a risk of wrongly identifying the product identifier in the target text. The inventors have found that product identifiers having certain properties may be more prone to incorrect identification, e.g. because they may be confused with character strings in the target text which do not actually represent product identifiers. For example, where a product identifier is a numeric code, there may be a risk that it could be confused with a date or some other numeric value or code. As another example, the product identifier may be confused with a Uniform Resource Locator (URL), Digital Object Identifier (DOI), address, unit of measurement or other string of characters. Accordingly, in step 402, the method 400 checks the product identifier against a set of rules, to determine whether there is a risk of the product identifier being wrongly identified in the target text. The set of rules may be determined beforehand, e.g. by a human user.
[0104] If in step 402 a risk factor for the product identifier is identified based on the set of rules, then the method 400 moves on to step 404. In step 404, the third regular expression is generated such that it has two components: an identifier component and a context component. The identifier component corresponds to regular expression code that is configured to match the product identifier. The identifier component may be configured to match varying writing styles, without overly broadening the match. For instance, where the product identifier includes punctuation, the identifier component may be configured to match versions of the product identifier having alternate punctuation. As an example, if the product identifier is ab-1234, the identifier component may be configured to match each of ab1234, ab 1234 and ab-1234. The context component is configured to match one or more predetermined character strings, adjacent to (e.g. preceding and/or following) a character string that is matched by the identifier component. The predetermined character strings matched by the context component may correspond to any text that can be used for confirming that a string matching the identifier component is indeed a product identifier. For example, the predetermined character strings may include catalogue number, cat #, cat no., product number or similar. The predetermined character strings may also include text indicative of a pack size, as this may typically be located after mention of a product identifier. Accordingly, the third regular expression generated in step 404 is configured to match a string of characters in which a portion matching the identifier component is adjacent to a portion matching the context component. Thus, a match is only returned if a product identifier is cited in combination with a predetermined context. This may ensure that a string in the target text is not wrongly identified as a product identifier.
[0105] On the other hand, if in step 402 no risk factor is identified for the product identifier, then the method 400 moves on to step 406. As no risk factor was identified for the product identifier, there may be no need to verify that the product identifier is cited in combination with a predetermined context. Thus, in step 406, the third regular expression is generated such that it is configured to match the relevant product identifier. In contrast to the third regular expression generated in step 404, the third regular expression generated in step 406 need not have a context component, e.g. it may only have the identifier component mentioned above. The third regular expression (generated either by step 404 or step 406) may be stored in a memory of the system for subsequent use, e.g. to avoid having to re-compute the third regular expression when other documents are analysed.
[0106] Once the third regular expression has been generated (either by step 404 or step 406), the method 400 moves on to step 408, where the third regular expression is applied to the target text. The third regular expression may be applied to the target text by a search engine, which is configured to search the target text for character strings matching the third regular expression. If a match in the target text is found, then it is determined that the candidate product identifier for which the third regular expression was generated is indeed cited in the target text. Accordingly, an entry for the product identifier may then be added automatically and without further human intervention, to the citation database, as in step 110, linking the product identifier with the document and target text. Alternatively, if no match is found in step 408, then no entry is made in the citation database for the product identifier.
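The verification flow of method 400 can be illustrated with the sketch below. The risk rule (purely numeric identifiers), the list of context phrases and all function names are hypothetical examples. The sketch only checks for context preceding the identifier, whereas the description also allows context that follows it, and it assumes identifiers begin and end with a letter or digit.

```python
import re

# Hypothetical context phrases that may precede a product identifier.
CONTEXT = r"(?:catalogue number|cat\.? ?(?:#|no\.?)|product number)"

def has_risk_factor(product_id):
    """Example rule for step 402: numeric codes may be confused with dates."""
    return product_id.isdigit()

def identifier_component(product_id):
    """Match the identifier, allowing '-', ' ' or nothing where it has '-'."""
    parts = [re.escape(part) for part in product_id.split("-")]
    return r"\b" + "[- ]?".join(parts) + r"\b"

def build_third_regex(product_id):
    """Generate the third regular expression for one candidate identifier."""
    ident = identifier_component(product_id)
    if has_risk_factor(product_id):
        # Step 404: require a context phrase directly before the identifier.
        return re.compile(CONTEXT + r"[:\s]*" + ident, re.IGNORECASE)
    # Step 406: the identifier component alone is sufficient.
    return re.compile(ident, re.IGNORECASE)

def is_cited(product_id, target_text):
    """Step 408: apply the third regular expression to the target text."""
    return build_third_regex(product_id).search(target_text) is not None
```

With these assumed rules, the identifier `2021` is not reported for "published in 2021" but is reported for "cat no. 2021", while `ab-1234` is accepted in any of its punctuation variants without needing context.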
[0107] It should be noted that not all embodiments involve the steps shown in
[0108]
[0109] The processor 504 is further communicatively coupled to various databases, so that it can retrieve data from, and store data in, the databases. In particular, the processor 504 is coupled to a document database 506, a product catalogue database 508 and a citation database 510. The document database 506 is configured to store documents (e.g. journal articles). The processor 504 may access a document stored in the document database 506 to analyse the document according to the invention, i.e. to search the document for product identifiers. The product catalogue database 508 is configured to store product catalogues associated with one or more entities (e.g. manufacturers and/or suppliers), each product catalogue including a plurality of product identifiers. The citation database 510 is configured to store citations determined according to a method of the invention. In particular, the citation database may store entries linking product identifiers from the product catalogues stored in database 508, with documents stored in database 506. The processor 504 may further be coupled with a regular expression database (not shown) which may be configured to store regular expressions generated when analysing a document. For example, the regular expression database may store the first, second and third regular expressions generated as part of the analysis of a document, so that they can be accessed and re-used when analysing another document. The regular expression database may also store the regular expression codes discussed above.
[0110] In practice, the system 500 may be implemented by any suitable combination of computer systems and network of computer systems. For example, the computer system 500 and all databases may be implemented by a single computer system. Alternatively, the different tasks and functions of the system 500 may be distributed across one or more computer systems (e.g. servers).
[0111] In yet a further implementation of the methods, systems and apparatus described herein, a citation database is generated automatically. Using a suitably configured computing system comprising a processor and a memory for storing instructions, a corpus of documents is accessed and evaluated. Here, the suitably configured computing system searches for an entity identifier corresponding to an entity of interest within a document. For example, the entity of interest here is a product supplier. In one arrangement, the entity of interest is provided by one or more entity lists, files or databases that are accessible to the computer system. Here, the computer system is configured to iterate through the entity list. However, in one or more implementations, the entity of interest is received from a user input. For example, a custom, user-supplied entity of interest and its associated product codes can be provided.
[0112] Where an instance of the entity of interest is detected in a given document, the computer system is further configured to select a portion of the document around the instance of the entity identifier as a target text. In one or more configurations, the system is pre-configured to select a number n of characters before and after the detected entity identifier. As described herein, the entity is associated with a product catalogue. This product catalogue can be a data object, XML document, JSON file, linked list, database, array or other data structure accessible to the computer system described. In one arrangement, the product catalogue contains one or more product identifiers.
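By way of non-limiting illustration, the selection of a window of n characters around each detected entity identifier may be sketched in Python as follows. The function name `target_texts`, its parameters and the default value of n are assumptions made for this sketch only and are not part of the described system.

```python
import re

def target_texts(document: str, entity_pattern: str, n: int = 100) -> list[str]:
    """Return a window of up to n characters either side of each entity match."""
    windows = []
    for m in re.finditer(entity_pattern, document):
        start = max(0, m.start() - n)            # clamp at the start of the document
        end = min(len(document), m.end() + n)    # clamp at the end of the document
        windows.append(document[start:end])
    return windows

doc = "Intro text. The antibody MF20 was obtained from DSHB. Closing text."
print(target_texts(doc, r"DSHB", n=20))
```

A paragraph-based selection, as used in some of the worked examples, would be an equally valid choice; the fixed-width window is simply the easiest variant to show.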
[0113] Here, the computer system is configured to automatically apply a first regular expression to the target text, wherein the first regular expression is configured to match one or more of the plurality of product identifiers. Where the product identifier is determined to be cited in the target text, the described computer system is configured to add the citation to an existing database file.
[0114] However, where there is no existing database file, the computer system is configured to generate a citation database file. For example, one or more submodules are used to configure the computing system to generate a database in a pre-determined file format. Once generated, the entry is added to the citation database linking the document and the product identifier. This process is then repeated for each entity identified within the given document and each product identifier. Furthermore, given that the computer system is executing over a corpus of documents, the described approach is then conducted on each document within the corpus of documents. Upon completion, a single database is constructed that includes a citation for each identified entity and product represented within the corpus of documents.
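As a minimal sketch of the create-if-missing behaviour described above, the following Python example uses SQLite as the pre-determined file format. The choice of SQLite, the table name `citations`, its columns and the function names are all assumptions made for illustration; the specification does not prescribe a particular schema.

```python
import sqlite3

def open_citation_db(path: str) -> sqlite3.Connection:
    """Open the citation database, creating the file and table if they do not exist."""
    conn = sqlite3.connect(path)  # sqlite3 creates the file when it is missing
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citations ("
        " document_id TEXT NOT NULL,"
        " entity TEXT NOT NULL,"
        " product_id TEXT NOT NULL,"
        " UNIQUE (document_id, entity, product_id))"
    )
    return conn

def add_citation(conn: sqlite3.Connection, document_id: str, entity: str, product_id: str) -> None:
    """Add an entry linking a document and a product identifier (duplicates are ignored)."""
    conn.execute(
        "INSERT OR IGNORE INTO citations VALUES (?, ?, ?)",
        (document_id, entity, product_id),
    )
    conn.commit()

conn = open_citation_db(":memory:")  # in-memory database, used here to avoid touching disk
add_citation(conn, "doc-1", "Abcam", "ab123")
add_citation(conn, "doc-1", "Abcam", "ab134")
```

The uniqueness constraint is a design choice of the sketch: it keeps the database consistent if the same document is analysed more than once.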
[0115] In yet a further implementation, a search engine is provided utilizing one or more of the citation databases described herein. For example, a search engine is provided that is configured to receive a search query for a given product. However, in alternative configurations, the search engine can receive an entity identifier, or a document title. The search engine is then configured to perform a search on a citation database.
[0116] In one or more implementations, the citation database over which the search is executed was generated according to the one or more of the approaches described herein.
[0117] When the search query is executed over the citation database, the search engine is configured to return one or more product identifiers from the citation database based on the search query, such as a search query that uses identifiers of an entity of interest. In one arrangement, the one or more product identifiers are then ranked based on a number of documents linked with each product identifier in the citation database.
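The ranking described above may, for example, be sketched as follows. The representation of citation entries as (document, product identifier) pairs and the function name `rank_products` are assumptions of this sketch; ties are broken alphabetically purely to make the ordering deterministic.

```python
from collections import defaultdict

def rank_products(entries):
    """Rank product identifiers by the number of distinct documents citing them."""
    docs_by_product = defaultdict(set)
    for document_id, product_id in entries:
        docs_by_product[product_id].add(document_id)
    # Most-cited first; alphabetical order breaks ties.
    return sorted(docs_by_product, key=lambda p: (-len(docs_by_product[p]), p))

entries = [("doc-1", "ab123"), ("doc-2", "ab123"), ("doc-1", "ab134")]
print(rank_products(entries))
```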
[0118] A series of worked examples is provided below, to illustrate how a method according to an embodiment of the invention may be implemented. In the below examples, we consider four example entities (companies): Abcam, Cell Signaling Technology, MilliporeSigma and Developmental Studies Hybridoma Bank. Table 1 below shows example product catalogues associated with each entity. In particular, Table 1 shows, for each entity, product identifiers listed in their product catalogue.
TABLE 1 Example entities and product catalogues

Entity                                Product identifiers
Abcam                                 ab123, ab134, ab245
Cell Signaling Technology             2040, 2050, 4060
MilliporeSigma                        P-1000, P-1050, 12345, 123ab
Developmental Studies Hybridoma Bank  C594.9B, 132-250-1, MF20
[0119] Table 2 below shows information that may be stored for each entity. For example Table 2 shows, for each entity, other names which may be used to identify that entity. Table 2 also shows any regular expression codes that have been previously determined and stored for each entity. The regular expression codes may be determined by a human, and/or automatically generated using a regular expression generator taking the product catalogue as an input. The final column in Table 2 shows a regular expression code for a pack size suffix which may feature at the end of a product identifier from the corresponding entity, e.g. to indicate a pack size of the product.
TABLE 2 Entity information

Entity                                Also Known As     Regular Expression code  Pack Size Suffix
Abcam                                 (none)            ab[0-9]+                 (none)
Cell Signaling Technology             CST               [0-9]{4,5}               [SL]
MilliporeSigma                        Millipore, Sigma  [a-z]-[0-9]{4}           (none)
Developmental Studies Hybridoma Bank  DSHB              (none)                   (none)
[0120] For each entity, the entity name in the first column of Table 2, together with the additional names in the second column of Table 2, may be taken as the entity identifiers for that entity. In order to search a document for each entity's entity identifiers, the regular expressions shown in Table 3 below may be generated. Each regular expression in Table 3 is configured to match all of the entity identifiers for the corresponding entity, surrounded by something that marks the start or end of a string of text. Additionally, in the regular expressions of Table 3, spaces are replaced with a permissive pattern that allows authors to use multiple spaces, tabs, hyphens, unicode dashes or other unexpected punctuation. In this manner, the regular expressions of Table 3 may match the entity identifiers, even when authors use different writing styles with regard to spacing and punctuation. The regular expressions of Table 3 correspond to the second regular expressions discussed above. The regular expressions of Table 3 may be generated by a human, or automatically using a suitably configured regular expression generator.
TABLE 3 Regular expressions for entity identifiers

Entity                                Regular Expression
Abcam                                 (?<match>Abcam)
Cell Signaling Technology             (?<match>Cell\W*Signaling\W*Technology|CST)
MilliporeSigma                        (?<match>MilliporeSigma|Millipore|Sigma)
Developmental Studies Hybridoma Bank  (?<match>Developmental\W*Studies\W*Hybridoma\W*Bank|DSHB)
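As one possible illustration of automatic generation, the following Python sketch builds a permissive entity regular expression from a list of entity names. It is not the generator of the specification: the function name `entity_regex` is hypothetical, Python's `(?P<match>...)` syntax replaces the `(?<match>...)` form shown in Table 3, the delimiters are expressed as negative lookarounds, and case-insensitive matching is an assumption.

```python
import re

def entity_regex(names):
    """Build one pattern matching any of an entity's names, with permissive spacing."""
    alternatives = [
        # Spaces between words become \W* so tabs, hyphens, dashes etc. are tolerated.
        r"\W*".join(re.escape(word) for word in name.split())
        for name in names
    ]
    # (?<!\w) and (?!\w) require the name to be delimited by non-word characters
    # or by the start/end of the text.
    pattern = r"(?<!\w)(?P<match>" + "|".join(alternatives) + r")(?!\w)"
    return re.compile(pattern, re.IGNORECASE)

cst = entity_regex(["Cell Signaling Technology", "CST"])
print(cst.search("purchased from Cell-Signaling  Technology.").group("match"))
```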
EXAMPLE 1
[0121] In a first example, we consider a first document including the following example text:
[0122] Thank you to the lovely staff at Abcam for helping.
[0123] An earlier paragraph not about antibodies.
[0124] The antibodies were purchased from Abcam (ab123, ab134). We used mode ab99 on the microscope.
[0125] A later paragraph not about antibodies.
[0126] Applying method 100 to the first document, in step 102 the regular expressions in Table 3 for each entity are applied to the document, to search the document for entity identifiers corresponding to the four entities. With the example text shown above, the regular expression for Abcam will return a match.
[0127] In step 104, portions of the document including instances of the Abcam entity identifier are defined as target text. For example, paragraphs of text including instances of the Abcam entity identifier may be defined as target text.
[0128] As an example, step 104 may output data summarised in Table 4 below:
TABLE 4 Example target text from first document

Target text no.  Entity  Target text
1                Abcam   Thank you to the lovely staff at Abcam for helping.
2                Abcam   The antibodies were purchased from Abcam (ab123, ab134). We used mode ab99 on the microscope.
[0129] The method 100 then analyses each target text separately. Starting with target text 1, a first regular expression needs to be determined for the Abcam product catalogue. The method 200 is followed for this purpose. The method 200 checks if a regular expression code is available for the Abcam product catalogue (step 202). As shown in Table 2, there is a regular expression code available. The method 200 moves on to step 204 where it is checked if the regular expression code matches all of the product identifiers in Abcam's product catalogue. In this case, all of the product identifiers match the regular expression code, so the method 200 moves on to step 206 where the first regular expression is generated based on the regular expression code. As an example, the following first regular expression which includes the regular expression code for Abcam may be generated:
(?<=\W|^)(ab[0-9]+)(?=\W|$)   (1)
[0130] This first regular expression is stored (e.g. cached and/or stored in a regular expression database), so that it can be re-used later on.
[0131] Then, in step 106, the first regular expression (1) is applied to the target text 1 shown in Table 4. No matches are found, so the method moves on to analysing target text 2.
[0132] Accordingly, step 106 is performed again, this time applying the first regular expression (1) (which was previously stored) to the second target text. This returns three tokens from the target text: ab123, ab134 and ab99. These tokens are combined into a single string with spaces between the tokens: ab123 ab134 ab99.
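The token extraction of step 106 can be reproduced with the short Python sketch below. Because Python's `re` module only accepts fixed-width lookbehinds, the delimiters of the first regular expression (1) are written here as the equivalent negative lookarounds `(?<!\w)` and `(?!\w)`; this substitution, and the variable names, are assumptions of the sketch.

```python
import re

# Python rendering of first regular expression (1) for the Abcam catalogue.
FIRST_REGEX = re.compile(r"(?<!\w)(ab[0-9]+)(?!\w)")

target_text_1 = "Thank you to the lovely staff at Abcam for helping."
target_text_2 = (
    "The antibodies were purchased from Abcam (ab123, ab134). "
    "We used mode ab99 on the microscope."
)

print(FIRST_REGEX.findall(target_text_1))  # no tokens in target text 1

tokens = FIRST_REGEX.findall(target_text_2)
token_string = " ".join(tokens)            # tokens joined with single spaces
print(token_string)
```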
[0133] Next, method 300 is used to determine if any of the tokens correspond to product identifiers from Abcam's product catalogue. Starting with n=2, in step 302 the method determines all of the 2-character prefixes for the product identifiers in Abcam's product catalogue: this is just ab.
[0135] In step 304, a prefix regular expression is generated for the prefix ab. For example, the prefix regular expression \Wab may be generated, which is configured to search for strings starting at a word boundary and having the prefix ab. In step 306, the prefix regular expression \Wab is applied to the string of tokens ab123 ab134 ab99, and returns a match. Therefore, in step 308 the prefix regular expression \Wab is kept.
[0137] The method 300 then returns to step 302, incrementing n by 1, such that at step 302 all of the 3-character prefixes matching \Wab are determined: ab1, ab2. In step 304, corresponding prefix regular expressions \Wab1 and \Wab2 are generated, and in step 306 they are applied to the string of tokens. \Wab1 returns a match, whilst \Wab2 does not return any match, such that at step 308 only \Wab1 is kept (\Wab2 can be discarded).
[0138] The method 300 then returns again to step 302, incrementing n by 1, such that at step 302 all of the 4-character prefixes matching \Wab1 are determined: ab12, ab13. In step 304, corresponding prefix regular expressions \Wab12 and \Wab13 are generated, and in step 306 they are applied to the string of tokens. \Wab12 and \Wab13 both return a match, such that at step 308 both \Wab12 and \Wab13 are kept.
[0139] The method 300 then returns again to step 302, incrementing n by 1, such that at step 302 all of the 5-character prefixes matching either of \Wab12 and \Wab13 are determined: ab123, ab134. In step 304, corresponding prefix regular expressions \Wab123 and \Wab134 are generated, and in step 306 they are applied to the string of tokens. \Wab123 and \Wab134 both return a match, such that at step 308 both \Wab123 and \Wab134 are kept.
[0140] When the method 300 returns again to step 302 and attempts to find 6-character prefixes, it is found that the maximum length of the product identifiers has already been reached, i.e. that the matched 5-character prefixes correspond to whole product identifiers. Accordingly, the prefix regular expressions \Wab123 and \Wab134 which provided a match in the latest iteration of step 308 correspond to matched product identifiers in the string of tokens. Therefore, the tokens ab123 and ab134 can be determined as product identifiers cited in target text 2.
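The iterative prefix narrowing of steps 302 to 308 may be sketched in Python as follows. This is an illustrative reading of method 300 rather than a definitive implementation: the function name `match_by_prefix` is hypothetical, a leading space is prepended to the token string so that `\W` can also match in front of the first token, and identifiers of differing lengths are handled by recording a match as soon as a kept prefix equals a whole identifier.

```python
import re

def match_by_prefix(catalogue, token_string):
    """Lengthen catalogue prefixes one character at a time, keeping those found in the tokens."""
    text = " " + token_string   # assumption: lets \W match before the first token
    found = set()
    kept = None                 # prefixes kept in the previous iteration (None = first pass)
    n = 2
    while True:
        # Step 302: n-character prefixes that extend a previously kept prefix.
        candidates = {
            p[:n] for p in catalogue
            if len(p) >= n and (kept is None or p[:n - 1] in kept)
        }
        if not candidates:
            break
        # Steps 304-308: build \W<prefix> and keep the prefixes that match.
        kept = {c for c in candidates if re.search(r"\W" + re.escape(c), text)}
        # A kept prefix that is a whole identifier is a matched product identifier.
        found |= kept & set(catalogue)
        n += 1
    return sorted(found)

print(match_by_prefix(["ab123", "ab134", "ab245"], "ab123 ab134 ab99"))
```

Note that a whole-identifier prefix can still match the start of a longer token, which is one reason a separate verification step is applied afterwards.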
[0141] To confirm that product identifiers ab123 and ab134 have been correctly identified in target text 2, method 400 is applied as a verification process. In step 402, no risk factor is determined for the product identifiers ab123 and ab134, e.g. because they do not have a format that could be confused with a date. In step 406, a third regular expression is generated for each of the product identifiers ab123 and ab134, examples of which are shown in Table 5 below.
TABLE 5 Example third regular expressions

Product identifier  Third regular expression
ab123               (?<=([^\w\-]|\A))(?<match>ab123)(?=([^\w\-]|\z))
ab134               (?<=([^\w\-]|\A))(?<match>ab134)(?=([^\w\-]|\z))
[0142] The third regular expressions generated for these product identifiers may be stored in a memory so that they can be re-used at a later stage.
[0143] In step 408 each third regular expression shown in Table 5 is applied to target text 2. Both regular expressions return a match, meaning that product identifiers ab123 and ab134 are determined to be cited in target text 2.
[0144] Finally, in step 110, entries are automatically added to the citation database for product identifiers ab123 and ab134, linking them to the first document and target text 2.
EXAMPLE 2
[0145] In a second example, we consider a second document including the following example text:
[0146] We purchased Cat# 2050S from CST. It has a use by date of the year 2040.
[0147] Applying method 100 to the second document, in step 102 the regular expressions in Table 3 for each entity are applied to the document. With the example text shown above, the regular expression for CST will return a match, and in step 104 the following target text is determined:
TABLE 6 Example target text from second document

Target text no.  Entity  Target text
1                CST     We purchased Cat# 2050S from CST. It has a use by date of the year 2040.
[0148] A first regular expression is generated for the CST product catalogue, in accordance with the method 200. In step 202 the method finds the regular expression code associated with the CST product catalogue, and at step 204 determines that all of the product identifiers in the CST product catalogue match the regular expression code. Accordingly, at step 206 a first regular expression based on the stored regular expression code is generated. The first regular expression for the CST product catalogue may, for example, be:
(?<=\W|^)([0-9]{4,5})([SL])?(?=\W|$)   (2)
[0149] Note that, as there is a pack size suffix for the CST product catalogue (see Table 2), this is included in the first regular expression (2).
[0150] In step 106, the first regular expression (2) is applied to the target text 1 shown in Table 6. The following two tokens from the target text match the first regular expression (2): 2050S, 2040. These tokens are combined into a single string with spaces between the tokens: 2050S 2040.
[0151] Next, method 300 is used to determine if any of the tokens correspond to product identifiers from CST's product catalogue. Starting with n=2, in step 302 the method determines the set of the 2-character prefixes for the product identifiers in CST's product catalogue: [20, 40]. In step 304, a prefix regular expression is generated for each prefix: \W20 and \W40. In step 306 the prefix regular expressions are applied to the target text 1 of Table 6, and \W20 returns a match, whilst \W40 does not return a match. Therefore, in step 308 the prefix regular expression \W20 is kept (\W40 can be discarded).
[0152] The method 300 then returns to step 302, incrementing n by 1, and at step 302 finds the set of 3-character prefixes matching \W20: [204, 205]. In step 304, a prefix regular expression is generated for each prefix: \W204 and \W205. In step 306 the prefix regular expressions are applied to the target text, and both \W204 and \W205 return a match. Therefore, both \W204 and \W205 are kept in step 308.
[0153] The method 300 then returns again to step 302, incrementing n by 1, and at step 302 finds the set of 4-character prefixes matching either of \W204 and \W205: [2040, 2050]. In step 304, a prefix regular expression is generated for each prefix: \W2040 and \W2050. In step 306 the prefix regular expressions are applied to the target text, and both \W2040 and \W2050 return a match. Therefore, both \W2040 and \W2050 are kept in step 308.
[0154] When the method 300 returns again to step 302 and attempts to find 5-character prefixes, it is found that the maximum length of the product identifiers has already been reached, i.e. that the matched 4-character prefixes correspond to whole product identifiers. Therefore, the tokens 2040 and 2050 can be determined as product identifiers cited in the target text.
[0155] To confirm that product identifiers 2040 and 2050 have been correctly identified in the target text, method 400 is applied as a verification process. In step 402, it is determined that there is a risk factor for the product identifiers 2040 and 2050, as these could be confused with dates mentioned in the text. Therefore, the method 400 moves on to step 406 where the third regular expression is generated as having an identifier component and a context component. Examples of such third regular expressions are provided in Table 7 below.
TABLE 7 Example third regular expressions

Product identifier  Third regular expression
2040                (?<=([^\w\-]|\A))(?<match>(#|(num(ber)?|cat(alog)?))\W*2040(?<suffix>[SL])?)(?=([^\w\-]|\z))
2050                (?<=([^\w\-]|\A))(?<match>(#|(num(ber)?|cat(alog)?))\W*2050(?<suffix>[SL])?)(?=([^\w\-]|\z))
[0156] The third regular expressions in Table 7 include a context component which requires corroborating text to confirm that a string 2040 or 2050 in the target text is in fact a product identifier. In particular, in the above third regular expressions the context component is configured to match strings of characters including #, num, number, cat, catalog which precede the string 2040 or 2050. Additionally, the context component includes a component corresponding to the pack size suffix that may be included in CST product identifiers. Of course, the context component may be configured to search for additional or alternative strings that can be used to confirm that a product identifier is being cited.
[0157] In step 408 each third regular expression shown in Table 7 is applied to the target text. In this case, only the third regular expression for product identifier 2050 returns a match. Finally, in step 110, an entry is automatically added to the citation database for product identifier 2050, linking it to the second document and the target text.
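The verification of step 408 for this example may be sketched in Python as follows. The function name `third_regex` and its parameters are hypothetical. The pattern is a Python rendering of the form used for third regular expressions with a context component: `(?P<name>...)` replaces `(?<name>...)`, the delimiters are written as the negative lookarounds `(?<![\w\-])` and `(?![\w\-])` because Python requires fixed-width lookbehinds, and case-insensitive matching is assumed so that "Cat" matches the context word "cat".

```python
import re

def third_regex(product_id, suffix=None, require_context=False):
    """Build a verification pattern with an identifier component and a context component."""
    context = r"(?:#|num(?:ber)?|cat(?:alog)?)\W*"
    if not require_context:
        context = "(?:" + context + ")?"          # context becomes optional
    suffix_part = f"(?P<suffix>{suffix})?" if suffix else ""
    return re.compile(
        r"(?<![\w\-])(?P<match>" + context + re.escape(product_id) + suffix_part + r")(?![\w\-])",
        re.IGNORECASE,
    )

text = "We purchased Cat# 2050S from CST. It has a use by date of the year 2040."
for product_id in ("2040", "2050"):
    m = third_regex(product_id, suffix="[SL]", require_context=True).search(text)
    print(product_id, m.group("match") if m else None)
```

With the context required, the year 2040 at the end of the sentence is not accepted, whereas the identifier 2050 preceded by "Cat#" is.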
EXAMPLE 3
[0158] In a third example, we consider a third document including the following example text:
[0159] The mathematical value sigma was mentioned up here.
[0160] We purchased anti-ab 12345 from Millipore
[0161] Applying method 100 to the third document, in step 102 the regular expressions in Table 3 for each entity are applied to the document. With the example text shown above, the regular expression for MilliporeSigma will return a match, and in step 104 the following target texts are determined:
TABLE 8 Example target text from third document

Target text no.  Entity          Target text
1                MilliporeSigma  The mathematical value sigma was mentioned up here.
2                MilliporeSigma  We purchased anti-ab 12345 from Millipore
[0162] A first regular expression is generated for the MilliporeSigma product catalogue, in accordance with the method 200. In step 202 the method finds the regular expression code associated with the MilliporeSigma product catalogue, and at step 204 determines that product identifiers 12345 and 123ab do not match the regular expression code. The method 200 then moves on to step 208, to generate the first regular expression having a first component and a second component. The first component is based on the retrieved regular expression code, whilst the second component is generated based on the non-matching product identifiers 12345 and 123ab. Various techniques may be used for generating the second component. For example, the second component could be generated simply by joining the non-matching product identifiers together, e.g. yielding /12345|123ab/. However, this approach can result in a very large string if there is a large number of non-matching product identifiers. Instead, a preferred technique is to use a regular expression generator which is configured to optimise the second component. This may be done, for example, by parsing the non-matching product identifiers into a tree, and then traversing the tree to obtain an optimised regular expression. As an example, this may yield the regular expression /123(45|ab)/. Combining the first component and the second component, the first regular expression for the MilliporeSigma product catalogue may be generated as:
(([a-z]\W*[0-9]{4})|(123(45|ab)))   (3)
[0163] Note that in the first regular expression (3), the hyphen - is swapped for a more permissive pattern, to account for different writing styles.
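One way of optimising the second component by parsing the non-matching product identifiers into a tree is sketched below in Python, using a character trie. The function names `build_trie` and `trie_to_regex` are hypothetical, and the specification does not state which tree structure or traversal is actually used; this is simply one arrangement that reproduces the example result.

```python
import re

def build_trie(words):
    """Parse the identifiers into a character trie; the key "" marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie

def trie_to_regex(node):
    """Traverse the trie, factoring out shared prefixes into a compact pattern."""
    if set(node) == {""}:          # leaf: nothing follows the end-of-word marker
        return ""
    branches = []
    optional = False               # True when a word ends here but others continue
    for ch, child in sorted(node.items()):
        if ch == "":
            optional = True
        else:
            branches.append(re.escape(ch) + trie_to_regex(child))
    if len(branches) == 1 and not optional:
        return branches[0]         # single continuation: no grouping needed
    body = "(" + "|".join(branches) + ")"
    return body + "?" if optional else body

second_component = trie_to_regex(build_trie(["12345", "123ab"]))
print(second_component)
```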
[0164] In step 106, the first regular expression (3) is applied to the target text 1 shown in Table 8, and no match is returned. The method 100 then moves on to analysing target text 2, and applies the first regular expression to target text 2, which returns the following matching token: 12345.
[0165] Next, method 300 is used to determine if the token corresponds to a product identifier from MilliporeSigma's product catalogue. Starting with n=2, in step 302 the method determines the set of the 2-character prefixes for the product identifiers in MilliporeSigma's product catalogue: [P-, 12]. In step 304, a prefix regular expression is generated for each prefix: \WP\W* and \W12. In step 306 the prefix regular expressions are applied to the target text 2 of Table 8, and only \W12 returns a match. Therefore, in step 308 the prefix regular expression \W12 is kept (the other one can be discarded).
[0166] The method 300 then returns to step 302, incrementing n by 1, and at step 302 finds the set of 3-character prefixes matching \W12: [123]. In step 304, a prefix regular expression is generated for the prefix: \W123. In step 306 the prefix regular expression is applied to the target text, and \W123 returns a match. Therefore, \W123 is kept in step 308.
[0167] The method 300 then returns again to step 302, incrementing n by 1, and at step 302 finds the set of 4-character prefixes matching \W123: [1234, 123a]. In step 304, a prefix regular expression is generated for each prefix: \W1234 and \W123a. In step 306 the prefix regular expressions are applied to the target text, and only \W1234 returns a match. Therefore, only \W1234 is kept in step 308.
[0168] The method 300 then returns again to step 302, incrementing n by 1, and at step 302 finds the set of 5-character prefixes matching \W1234: [12345]. In step 304, a prefix regular expression is generated for the prefix: \W12345. In step 306 the prefix regular expression is applied to the target text, and \W12345 returns a match. Therefore, \W12345 is kept in step 308.
[0169] When the method 300 returns again to step 302 and attempts to find 6-character prefixes, it is found that the maximum length of the product identifiers has already been reached, i.e. that the matched 5-character prefix corresponds to a whole product identifier. Therefore, the token 12345 can be determined as a product identifier cited in the target text.
[0170] To confirm that product identifier 12345 has been correctly identified in the target text, method 400 is applied as a verification process. In step 402 it may be determined that the product identifier 12345 does not have any particular risk factor. For example, although it is a numeric code, it does not look like a date and so is unlikely to be confused with a date. Also, the product identifier 12345 is not divisible by 100, and so is unlikely to be confused with a dilution value or a standard measurement. Accordingly, the method 400 moves on to step 406 to generate the third regular expression.
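The kind of risk-factor check performed in step 402 might, for instance, look like the following Python sketch. The two heuristics shown (a four-digit number beginning with 19 or 20 is treated as year-like, and a purely numeric identifier divisible by 100 is treated as a possible dilution value or standard measurement) are assumptions chosen to reflect the examples in the text; the function name `risk_factors` and the exact rules are not taken from the specification.

```python
import re

def risk_factors(product_id: str) -> list[str]:
    """Return the reasons (if any) why an identifier could be confused with ordinary text."""
    risks = []
    # Assumed heuristic: four digits starting with 19 or 20 look like a year.
    if re.fullmatch(r"(19|20)[0-9]{2}", product_id):
        risks.append("looks like a year")
    # Assumed heuristic: round numbers may be dilutions or standard measurements.
    if re.fullmatch(r"[0-9]+", product_id) and int(product_id) % 100 == 0:
        risks.append("divisible by 100")
    return risks

for pid in ("2040", "12345", "ab123"):
    print(pid, risk_factors(pid))
```

An identifier with one or more risk factors would then be given a third regular expression in which the context component is mandatory, as in Example 2.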
[0171] As an example, the third regular expression may be generated in step 406 as shown in Table 9 below, where a context component is included as an optional requirement of the search pattern. In other words, a string of characters does not necessarily need to match the context component in order to match the third regular expression of Table 9.
TABLE 9 Example third regular expressions

Product identifier  Third regular expression
12345               (?<=([^\w\-]|\A))(?<match>((#|(num(ber)?|cat(alog)?))\W*)?12345)(?=([^\w\-]|\z))
[0172] In step 408 the third regular expression shown in Table 9 is applied to the target text 2 of Table 8, which returns a match. Finally, in step 110, an entry is added to the citation database for product identifier 12345, linking it to the third document and the target text 2.
EXAMPLE 4
[0173] In a fourth example, we consider a fourth document including the following example text: [0174] This is a really long bit of text before we start talking about the antibody. The mouse monoclonal antibody MF20 supernatant was obtained from the Developmental Studies Hybridoma Bank. This is another really long bit of text after we talk about the antibody.
[0175] Applying method 100 to the fourth document, in step 102 the regular expressions in Table 3 for each entity are applied to the document. With the example text shown above, the regular expression for DSHB will return a match, and in step 104 the following target text is determined (e.g. by determining a portion of text including a predetermined number of characters before and after the instance of the entity name):
TABLE 10 Example target text from fourth document

Target text no.  Entity  Target text
1                DSHB    really long bit of text before we start talking about the antibody. The mouse monoclonal antibody MF20 supernatant was obtained from the Developmental Studies Hybridoma Bank. This is another really
[0177] In the case of the DSHB product catalogue, there is no stored regular expression code. Furthermore, the product identifiers from the DSHB product catalogue do not follow any particular character patterns, with the product identifiers in the DSHB product catalogue appearing unrelated to one another. Therefore, it may not be practical to generate a single first regular expression that covers the whole DSHB product catalogue, and instead at step 210 a respective first regular expression is generated for each product identifier in the product catalogue. Table 11 below shows examples of first regular expressions that may be generated for each of the product identifiers in the DSHB product catalogue.
TABLE 11 Example first regular expressions

Product identifier  First regular expression
C594.9B             (?<=([^\w\-]|\A))(?<match>((#|(num(ber)?|cat(alog)?))\W*)?c594\.9b)(?=([^\w\-]|\z))
132-250-1           (?<=([^\w\-]|\A))(?<match>((#|(num(ber)?|cat(alog)?))\W*)?132\W*250\W*1)(?=([^\w\-]|\z))
MF20                (?<=([^\w\-]|\A))(?<match>((#|(num(ber)?|cat(alog)?))\W*)?mf20)(?=([^\w\-]|\z))
[0178] The first regular expressions in Table 11 are generated by looking up the product identifiers in the DSHB product catalogue, and providing them as an input to an automated regular expression generator. The regular expression generator is configured to generate a permissive first regular expression for each product identifier, to account for different writing styles as discussed above. Additionally, the first regular expressions are generated to include an optional context component, similar to that discussed above in relation to Example 3. The first regular expressions are then stored in a memory (e.g. in the regular expression database and/or a cache), so that they can be re-used at a later stage.
[0179] Then, in step 106, each first regular expression computed for the DSHB product identifiers is applied in turn to the target text in Table 10. In step 108, if a first regular expression returns a match, then the matched token is determined to correspond to a product identifier cited in the target text. In this case, only the first regular expression for the product identifier MF20 produces a match. Accordingly, in step 110 an entry is added to the citation database, linking the product identifier MF20 to the fourth document and the target text.
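The per-identifier approach of this example may be sketched in Python as follows. The function names `permissive_regex` and `cited_identifiers` are hypothetical; the rule that hyphens and spaces inside an identifier are replaced by `\W*`, the Python lookaround syntax, and the use of case-insensitive matching are assumptions of the sketch.

```python
import re

def permissive_regex(product_id: str) -> re.Pattern:
    """Build a permissive pattern for one identifier, with an optional context component."""
    # Hyphens and spaces inside the identifier are replaced by \W*.
    parts = re.split(r"[\s\-]+", product_id)
    body = r"\W*".join(re.escape(p) for p in parts)
    context = r"(?:(?:#|num(?:ber)?|cat(?:alog)?)\W*)?"
    return re.compile(
        r"(?<![\w\-])(?P<match>" + context + body + r")(?![\w\-])",
        re.IGNORECASE,
    )

def cited_identifiers(catalogue, target_text):
    """Apply each identifier's pattern in turn and return the identifiers that match."""
    return [p for p in catalogue if permissive_regex(p).search(target_text)]

target_text = (
    "really long bit of text before we start talking about the antibody. "
    "The mouse monoclonal antibody MF20 supernatant was obtained from the "
    "Developmental Studies Hybridoma Bank. This is another really"
)
print(cited_identifiers(["C594.9B", "132-250-1", "MF20"], target_text))
```

In practice the compiled patterns would be cached or stored, as described above, rather than rebuilt for every target text.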
[0180] Note that the regular expressions disclosed in the examples above are for illustrative purposes, and various modifications to the regular expressions can be made. For example, if desired, the regular expressions can be modified to make them case insensitive (e.g. by adding //i around a regular expression) or to otherwise make them more tolerant to different writing styles.
[0181] Although a few preferred embodiments have been shown and described, it will be appreciated by those skilled in the art that various changes and modifications might be made without departing from the scope of the invention, as defined in the appended claims.
[0182] All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except where at least some of such features and/or steps are mutually exclusive. In particular, various combinations of the methods 100, 200, 300 and 400 discussed above may be used.
[0183] Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purposes, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.
[0184] The invention is not restricted to the details of the foregoing embodiment(s). The invention extends to any novel one, or any novel combination, of the features disclosed in this specification (including any accompanying claims, abstract and drawings), or to any novel one, or any novel combination, of the steps of any method or process so disclosed.