METHOD FOR DYNAMIC CATEGORIZATION THROUGH NATURAL LANGUAGE PROCESSING
20230020779 · 2023-01-19
Abstract
Dynamic categorization of documents from a semi-static classification taxonomy through the use of key terms, concepts, and entities. Dynamic categorization is a method for retrieving documents that are relevant to a specific category, which can be defined at the time the documents are needed. This is in contrast to sorting and tagging (identifying) documents a priori according to the categories to which they belong. The categories can be defined not just as a set of keywords but may also include phrases, entities and/or relationships found in the document(s), complex field queries, weighted queries against words, as well as exclusion conditions.
Claims
1. A method, comprising: receiving a document at a natural language processing engine, wherein the natural language processing engine extracts text from the document; indexing the extracted text at a data store; populating a query based on the indexed text and a category descriptor; and categorizing the document based on the query.
2. The method according to claim 1, wherein the natural language processing engine identifies the language in the document and provides a base language meaning.
3. The method according to claim 1 or 2, wherein the natural language processing engine identifies entities in the document, concepts in the document, and relationships between the entities.
4. The method according to claim 3, wherein the natural language processing engine measures salience of each entity or concept.
5. The method according to any of claims 1-4, wherein documents are categorized by a cluster tool.
6. The method according to any of claims 1-5 further comprising: constructing a category.
7. The method according to any of claims 1-6, wherein the extracted text from the natural language processing engine is used as consideration for the document's inclusion into or exclusion from a category.
8. The method according to claim 7, wherein the inclusion into or exclusion from a category of the document is not dependent on the indexed text.
9. The method according to any of claims 1-8, wherein the category descriptor comprises a list of terms that include at least one of a must term, a should term, and a should not term.
10. An apparatus, comprising: at least one processor; and at least one memory comprising computer program code; the at least one memory and computer program code configured, with the at least one processor, to cause the apparatus at least to perform receiving a document at a natural language processing engine, wherein the natural language processing engine extracts text from the document; indexing the extracted text at a data store; populating a query based on the indexed text and a category descriptor; and categorizing the document based on the query.
11. The apparatus according to claim 10, wherein the natural language processing engine identifies the language in the document and provides a base language meaning.
12. The apparatus according to claim 10 or 11, wherein the natural language processing engine identifies entities in the document, concepts in the document, and relationships between the entities.
13. The apparatus according to claim 12, wherein the natural language processing engine measures salience of each entity or concept.
14. The apparatus according to any of claims 10-13, wherein documents are categorized by a cluster tool.
15. The apparatus according to any of claims 10-14, wherein the at least one memory and computer program code are further configured to perform: constructing a category.
16. The apparatus according to any of claims 10-15, wherein the extracted text from the natural language processing engine is used as consideration for the document's inclusion into or exclusion from a category.
17. The apparatus according to claim 16, wherein the inclusion into or exclusion from a category of the document is not dependent on the indexed text.
18. The apparatus according to any of claims 10-17, wherein the category descriptor comprises a list of terms that include at least one of a must term, a should term, and a should not term.
19. An apparatus, comprising: circuitry configured to perform receiving a document at a natural language processing engine, wherein the natural language processing engine extracts text from the document; indexing the extracted text at a data store; populating a query based on the indexed text and a category descriptor; and categorizing the document based on the query.
20. The apparatus according to claim 19, wherein the natural language processing engine identifies the language in the document and provides a base language meaning.
21. The apparatus according to claim 19 or 20, wherein the natural language processing engine identifies entities in the document, concepts in the document, and relationships between the entities.
22. The apparatus according to claim 21, wherein the natural language processing engine measures salience of each entity or concept.
23. The apparatus according to any of claims 19-22, wherein documents are categorized by a cluster tool.
24. The apparatus according to any of claims 19-23, wherein the circuitry is further configured to perform: constructing a category.
25. The apparatus according to any of claims 19-24, wherein the extracted text from the natural language processing engine is used as consideration for the document's inclusion into or exclusion from a category.
26. The apparatus according to claim 25, wherein the inclusion into or exclusion from a category of the document is not dependent on the indexed text.
27. The apparatus according to any of claims 19-26, wherein the category descriptor comprises a list of terms that include at least one of a must term, a should term, and a should not term.
28. An apparatus, comprising: means for receiving a document at a natural language processing engine, wherein the natural language processing engine extracts text from the document; means for indexing the extracted text at a data store; means for populating a query based on the indexed text and a category descriptor; and means for categorizing the document based on the query.
29. The apparatus according to claim 28, wherein the natural language processing engine identifies the language in the document and provides a base language meaning.
30. The apparatus according to claim 28 or 29, wherein the natural language processing engine identifies entities in the document, concepts in the document, and relationships between the entities.
31. The apparatus according to claim 30, wherein the natural language processing engine measures salience of each entity or concept.
32. The apparatus according to any of claims 28-31, wherein documents are categorized by a cluster tool.
33. The apparatus according to any of claims 28-32 further comprising: means for constructing a category.
34. The apparatus according to any of claims 28-33, wherein the extracted text from the natural language processing engine is used as consideration for the document's inclusion into or exclusion from a category.
35. The apparatus according to claim 34, wherein the inclusion into or exclusion from a category of the document is not dependent on the indexed text.
36. The apparatus according to any of claims 28-35, wherein the category descriptor comprises a list of terms that include at least one of a must term, a should term, and a should not term.
37. A non-transitory computer readable medium comprising program instructions stored thereon that when executed in hardware, perform a method comprising: receiving a document at a natural language processing engine, wherein the natural language processing engine extracts text from the document; indexing the extracted text at a data store; populating a query based on the indexed text and a category descriptor; and categorizing the document based on the query.
38. The non-transitory computer readable medium according to claim 37, wherein the natural language processing engine identifies the language in the document and provides a base language meaning.
39. The non-transitory computer readable medium according to claim 37 or 38, wherein the natural language processing engine identifies entities in the document, concepts in the document, and relationships between the entities.
40. The non-transitory computer readable medium according to claim 39, wherein the natural language processing engine measures salience of each entity or concept.
41. The non-transitory computer readable medium according to any of claims 37-40, wherein documents are categorized by a cluster tool.
42. The non-transitory computer readable medium according to any of claims 37-41, wherein the method further comprises performing: constructing a category.
43. The non-transitory computer readable medium according to any of claims 37-42, wherein the extracted text from the natural language processing engine is used as consideration for the document's inclusion into or exclusion from a category.
44. The non-transitory computer readable medium according to claim 43, wherein the inclusion into or exclusion from a category of the document is not dependent on the indexed text.
45. The non-transitory computer readable medium according to any of claims 37-44, wherein the category descriptor comprises a list of terms that include at least one of a must term, a should term, and a should not term.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] For proper understanding of example embodiments, reference should be made to the accompanying drawings, wherein:
[0012]
[0013]
[0014]
[0015]
[0016]
[0017]
[0018]
DETAILED DESCRIPTION
[0019] Dynamic Categorization is a method and set of processes for dynamically identifying documents that belong to, or have membership in, a user-defined category or set of topics. A user can add topics to a semi-static classification taxonomy and provide subcategories, or extend existing categories, by adding specific terms, concepts, and entities that represent those categories.
[0020]
[0021] As illustrated in
[0022] The dynamic categorization may also include a Data Store 120 capable of indexing extracted NLP data fields and of providing a text index of the text and “gloss,” together with a service that populates a query template in a category query builder 130 with a query of the category descriptor 140 (the category taxonomy). The dynamic categorization may use a user interface to populate the category taxonomy and an interface to request 150 and display the documents that belong to members of the category. A document can be a member of multiple categories. The dynamic categorization may also include a cluster tool 150. As will be discussed below, the cluster tool 150 may perform a cluster analysis to populate the category description, and a category may be constructed in the category constructor 160.
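By way of non-limiting illustration, the flow described above (extract, index, populate a query from a category descriptor, categorize) can be sketched in Python. Everything named here is an assumption made for the sketch: `extract_terms` is a toy tokenizer standing in for the NLP engine, `DataStore` is a hypothetical in-memory inverted index rather than the Data Store 120 itself, and the `any_of` query template is invented for the example.

```python
import re
from collections import defaultdict


def extract_terms(text):
    """Toy stand-in for the NLP engine: lowercase word tokens only."""
    return set(re.findall(r"[a-z]+", text.lower()))


class DataStore:
    """Hypothetical in-memory inverted index (term -> document ids)."""

    def __init__(self):
        self.postings = defaultdict(set)

    def index(self, doc_id, terms):
        for term in terms:
            self.postings[term].add(doc_id)

    def search(self, query):
        # Union of the postings for every term named in the query.
        hits = set()
        for term in query["any_of"]:
            hits |= self.postings.get(term, set())
        return hits


def build_query(descriptor):
    """Populate a simple (assumed) query template from a category descriptor."""
    return {"any_of": [t.lower() for t in descriptor["terms"]]}


store = DataStore()
docs = {
    "d1": "The dog chased the ball.",
    "d2": "Quarterly earnings rose sharply.",
}
for doc_id, text in docs.items():
    store.index(doc_id, extract_terms(text))

pets = {"name": "pets", "terms": ["dog", "cat"]}
members = store.search(build_query(pets))  # documents categorized as "pets"
```

In this sketch the category descriptor is consulted only when `search` runs, so a document such as `d1` may be returned for any number of categories.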
[0023]
[0024]
[0025] To construct the category descriptor, a user interface is used to populate a list of “must,” “should,” and/or “should not” terms for each category item in a taxonomic hierarchy. The lists can be created by typing in the conditions directly, by populating them from a set of documents, by pointing and clicking on document presentations returned by a query, or by performing co-occurrence analysis. Examples of certain category descriptor forming techniques are shown in
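One possible representation of such a category descriptor is sketched below in Python. The scoring rule is an assumption, since the embodiments do not fix the semantics of the three lists: here every “must” term is required, each matched “should” term raises the score by one, and each matched “should not” term lowers it by one. The class name, the field names, and the example terms are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CategoryDescriptor:
    name: str
    must: set = field(default_factory=set)
    should: set = field(default_factory=set)
    should_not: set = field(default_factory=set)


def score(doc_terms, d):
    """Return a membership score, or None when a "must" term is missing.

    Assumed semantics (not taken from the source): one point per "must"
    term, plus one per matched "should" term, minus one per matched
    "should not" term.
    """
    if not d.must <= doc_terms:
        return None
    return len(d.must) + len(d.should & doc_terms) - len(d.should_not & doc_terms)


def is_member(doc_terms, d):
    s = score(doc_terms, d)
    return s is not None and s > 0


big_cats = CategoryDescriptor(
    name="big cats",
    must={"jaguar"},
    should={"habitat", "prey"},
    should_not={"engine", "dealership"},
)

wildlife_doc = {"jaguar", "habitat", "prey", "river"}
car_doc = {"jaguar", "engine", "dealership"}
```

With these assumed weights, `wildlife_doc` scores 3 and is a member, while `car_doc` scores -1 and is not, even though both satisfy the “must” list.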
[0026]
[0027] Queries may be based on the type of request to retrieve the set of documents that are members of a category. The queries may be built from a hierarchical pattern and may be populated with the taxonomy of items to make categories, subcategories, sub-subcategories, and so on. A parent category is the Boolean “should” of all of its child categories. Selecting a subcategory simply prunes all of the sibling subcategories at the same hierarchy level (and their subordinates). The conditions may be any field from the NLP extraction, or the text or gloss, as shown in
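The hierarchical pattern can be illustrated with a short Python sketch. It assumes a taxonomy stored as nested dictionaries with hypothetical `terms` and `children` keys, and it reads the Boolean “should” as a logical OR over the child queries; neither detail is specified by the embodiments.

```python
# Hypothetical taxonomy: each node may carry its own terms and child nodes.
TAXONOMY = {
    "children": {
        "animals": {
            "children": {
                "dogs": {"terms": ["dog", "puppy"]},
                "cats": {"terms": ["cat", "kitten"]},
            }
        }
    }
}


def build_query(node):
    """A parent query is the Boolean "should" of its terms and child queries."""
    clauses = [{"term": t} for t in node.get("terms", [])]
    clauses += [build_query(c) for c in node.get("children", {}).values()]
    return {"should": clauses}


def select(root, path):
    """Walk down to one subcategory; siblings off the path are pruned."""
    node = root
    for name in path:
        node = node["children"][name]
    return node


def matches(query, doc_terms):
    """Evaluate a nested "should" query against a set of document terms."""
    if "term" in query:
        return query["term"] in doc_terms
    return any(matches(c, doc_terms) for c in query["should"])


animals_q = build_query(select(TAXONOMY, ["animals"]))
dogs_q = build_query(select(TAXONOMY, ["animals", "dogs"]))
```

Here `animals_q` matches a document containing “kitten” because the parent spans both children, whereas `dogs_q` does not, because selecting “dogs” prunes the sibling “cats” branch.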
[0028] Dynamic Categorization indexes terms and concepts at a higher level than a plain string, so the actual language of the document does not matter. If a curator were to add, for example, “dog,” the dynamic categorization would know to index any French documents containing “chien” (the French word for dog), because the dynamic categorization is run against the native language of the document as well as the gloss.
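A minimal Python sketch of matching against both the native text and the gloss follows. It assumes that the gloss is a base-language rendering of each token and that a lookup table supplies it; the `GLOSS` table and the function names are hypothetical stand-ins for what an NLP engine would provide.

```python
# Hypothetical gloss table; a real NLP engine would supply these meanings.
GLOSS = {"fr": {"chien": "dog", "chat": "cat"}}


def index_document(text, lang):
    """Index the native tokens and their (assumed) base-language gloss."""
    native = set(text.lower().split())
    table = GLOSS.get(lang, {})
    gloss = {table.get(tok, tok) for tok in native}
    return {"native": native, "gloss": gloss}


def term_matches(term, indexed):
    """A category term matches if it appears in the native text or the gloss."""
    return term in indexed["native"] or term in indexed["gloss"]


fr_doc = index_document("le chien dort", "fr")
en_doc = index_document("the dog sleeps", "en")
```

Under these assumptions the curator's term “dog” matches both `fr_doc` (through the gloss) and `en_doc` (through the native text), while “chien” matches only `fr_doc`.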
[0029] Document membership in a category is determined at the time of retrieval, rather than at the time of storage or indexing, allowing category definitions to be added, changed, or deleted without requiring the document to be reprocessed or re-indexed. The category index of Dynamic Categorization is therefore truly dynamic: there is no downtime between adding terms, concepts, and entities and being able to query on those categories. Conversely, requiring those terms would force the topic category to only be about toys or model aircraft carriers.
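The retrieval-time behavior can be sketched in Python as follows. The `Store` class, its `index_calls` counter, and the set-overlap membership test are assumptions made for illustration; the point shown is only that editing a category definition changes the result of the next query without any document being indexed again.

```python
class Store:
    """Hypothetical document store that counts how often indexing runs."""

    def __init__(self):
        self.docs = {}
        self.index_calls = 0

    def index(self, doc_id, text):
        self.docs[doc_id] = set(text.lower().split())
        self.index_calls += 1


def members(store, category_terms):
    """Membership is computed at query time from the current definition."""
    return {d for d, terms in store.docs.items() if terms & category_terms}


store = Store()
store.index("d1", "my dog barks")
store.index("d2", "my cat purrs")

categories = {"pets": {"dog"}}
before = members(store, categories["pets"])

categories["pets"].add("cat")  # edit the category; no document is re-indexed
after = members(store, categories["pets"])
calls_after_edit = store.index_calls
```

In this sketch `before` contains only `d1`, `after` contains both documents, and `calls_after_edit` is still 2, the number of documents originally indexed.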
[0030] Dynamic Categorization can index categories based on entities. For example, if a PERSON entity such as “Paris Hilton” were added, the category would index only documents that include “Paris Hilton” where that Paris Hilton is a person, not the place or FACILITY located in France. The entities can be geospatial in nature, as the “Paris Hilton” FACILITY has coordinates that can be used for membership in or exclusion from the category. Similarly, the entities can be temporal in nature, allowing category membership to be based on time windows. This feature may use an NLP process to identify the relationships.
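As a non-limiting sketch, entity-typed and geospatial conditions might be expressed in Python as shown below. The `Entity` structure, the bounding-box test, and the coordinates are assumptions chosen for the example; the coordinates are illustrative values only and are not taken from the embodiments.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Entity:
    text: str
    type: str  # e.g. "PERSON" or "FACILITY"
    coords: Optional[Tuple[float, float]] = None  # (lat, lon) if geospatial


def has_entity(entities, text, type_):
    """True when the document mentions `text` as an entity of type `type_`."""
    return any(e.text == text and e.type == type_ for e in entities)


def within_box(entities, south, west, north, east):
    """True when any located entity falls inside the bounding box."""
    return any(
        e.coords is not None
        and south <= e.coords[0] <= north
        and west <= e.coords[1] <= east
        for e in entities
    )


# Hypothetical NLP output for two documents.
gossip_doc = [Entity("Paris Hilton", "PERSON")]
travel_doc = [Entity("Paris Hilton", "FACILITY", (48.85, 2.35))]
```

A PERSON category built on `has_entity(..., "Paris Hilton", "PERSON")` admits `gossip_doc` and rejects `travel_doc`, while a geospatial category using `within_box(..., 48.0, 2.0, 49.0, 3.0)` does the opposite. A temporal window could be tested the same way against an assumed timestamp field.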
[0031] The category membership/exclusions may be relational between entities (e.g., PEOPLE associated with the FACILITY “Paris Hilton”). Alternatively, category membership/exclusions may be limited further by a relational subtype. For example, a category may be limited to PEOPLE ‘employed at’ the FACILITY “Paris Hilton.”
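Relational conditions of this kind can be sketched in Python by treating each extracted relationship as a typed record. The `Relation` fields, the `related` helper, and the person names are hypothetical; the embodiments do not prescribe how relationships are stored.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    source: str        # entity text, e.g. a person's name
    source_type: str   # e.g. "PERSON"
    subtype: str       # relational subtype, e.g. "employed at"
    target: str        # related entity text
    target_type: str   # e.g. "FACILITY"


def related(relations, source_type, target, target_type, subtype=None):
    """True when some relation links a `source_type` entity to the target.

    When `subtype` is given, only relations of that subtype count.
    """
    return any(
        r.source_type == source_type
        and r.target == target
        and r.target_type == target_type
        and (subtype is None or r.subtype == subtype)
        for r in relations
    )


# Hypothetical NLP output for two documents.
staff_doc = [Relation("Jane Doe", "PERSON", "employed at", "Paris Hilton", "FACILITY")]
guest_doc = [Relation("John Roe", "PERSON", "stayed at", "Paris Hilton", "FACILITY")]
```

Without a subtype, both documents satisfy the PEOPLE-to-FACILITY condition; passing `subtype="employed at"` narrows the category to `staff_doc` alone.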
[0032]
[0033]
[0034] Certain embodiments are directed to an apparatus including at least one processor and at least one memory. The memory may include computer program code. The at least one memory and computer program code may be configured, with the at least one processor, to cause the apparatus at least to perform a method.
[0035] One having ordinary skill in the art will readily understand that the example embodiments as discussed above may be practiced with procedures in a different order, and/or with hardware elements in configurations which are different than those which are disclosed. Therefore, although some embodiments have been described based upon these example embodiments, it would be apparent to those of skill in the art that certain modifications, variations, and alternative constructions are possible while remaining within the spirit and scope of example embodiments.