METHODS AND SYSTEMS FOR SEARCHING AND RETRIEVING INFORMATION
20230142351 · 2023-05-11
Abstract
Methods and systems for searching and retrieving information. In one aspect, there is a method of retrieving information using a knowledge base. The method comprises receiving a search query entered by a user and using a first model to identify a category corresponding to the received search query. The method further comprises based on the received search query, a loss function of the first model, and an objective function of a second model, identifying T topics corresponding to the received search query, and performing a search for the received search query only on a part of the knowledge base that is associated with the identified category and/or the identified topics. The method further comprises retrieving one or more files associated with the identified category and/or the identified topics.
Claims
1. A method of retrieving information using a knowledge base, the method comprising: receiving a search query entered by a user; based on the received search query, using a first model to identify a category corresponding to the received search query, wherein one or more files are assigned to the identified category and further wherein the first model is a categorization model that functions to map an input to one of M different categories, where M is greater than 1; based on (i) the received search query, (ii) a loss function of the first model, and (iii) an objective function of a second model, identifying T topics corresponding to the received search query, where T is greater than or equal to 1; using the identified category and the identified topics, performing a search for the received search query only on a part of the knowledge base that is associated with the identified category and/or the identified topics; and based on the performed search, retrieving one or more files associated with the identified category and/or the identified topics.
2. The method of claim 1, further comprising constructing the knowledge base, wherein constructing the knowledge base comprises: obtaining a set of N files, wherein each file included in the set of files is assigned to one of the M different categories, where N is greater than 1; based on (i) content of the N files, (ii) the loss function of the first model, and (iii) the objective function of the second model, identifying a set of topics, where each topic is a group of one or more keywords; generating the knowledge base using the identified topics; and for each one of the N files, based on a particular category to which the file is assigned and keywords included in the file, adding the file to the knowledge base.
3. A method for constructing a knowledge base, the method comprising: obtaining a set of N files, wherein each file included in the set of files is assigned to one of M different categories, where N and M are greater than 1; based on (i) content of the N files, (ii) a loss function of a first model, and (iii) an objective function of a second model, identifying a set of T topics, where T is greater than 1 and each topic is a group of one or more keywords; generating the knowledge base using the identified topics; and for each one of the N files, based on a particular category to which the file is assigned and keywords included in the file, adding the file to the knowledge base, wherein the first model is a categorization model that functions to map an input sentence to one of the M categories.
4. The method of claim 2, wherein the categorization model is a machine learning (ML) model, and the method further comprises training the ML model using the categorized files as training data.
5. The method of claim 2, wherein identifying the set of T topics comprises identifying said group of one or more keywords of each topic using a sum of the loss function of the first model and the objective function of the second model.
6. The method of claim 2, wherein the loss function of the first model depends at least on a probability distribution of each topic of the set of T topics and a stochastic parameter influencing a distribution of words in each topic of the set of T topics.
7. The method of claim 2, wherein the objective function of the second model depends at least on a predetermined category of a file and a predicted output of the first model.
8. The method of claim 2, wherein the second model is Latent Dirichlet Allocation (LDA) model.
9. The method of claim 2, further comprising: performing a Part-Of-Speech (POS) tagging on keywords associated with the identified set of T topics.
10. An apparatus for retrieving information using a knowledge base, the apparatus being adapted to: receive a search query entered by a user; based on the received search query, use a first model to identify a category corresponding to the received search query, wherein one or more files are assigned to the identified category and further wherein the first model is a categorization model that functions to map an input to one of M different categories, where M is greater than 1; based on (i) the received search query, (ii) a loss function of the first model, and (iii) an objective function of a second model, identify T topics corresponding to the received search query, where T is greater than or equal to 1; using the identified category and the identified topics, perform a search for the received search query only on a part of the knowledge base that is associated with the identified category and/or the identified topics; and based on the performed search, retrieve one or more files associated with the identified category and/or the identified topics.
11. The apparatus of claim 10, the apparatus further being adapted to construct the knowledge base, wherein constructing the knowledge base comprises: obtaining a set of N files, wherein each file included in the set of files is assigned to one of the M different categories, where N is greater than 1; based on (i) content of the N files, (ii) the loss function of the first model, and (iii) the objective function of the second model, identifying a set of topics, where each topic is a group of one or more keywords; generating the knowledge base using the identified topics; and for each one of the N files, based on a particular category to which the file is assigned and keywords included in the file, adding the file to the knowledge base.
12. An apparatus for constructing a knowledge base, the apparatus being adapted to: obtain a set of N files, wherein each file included in the set of files is assigned to one of M different categories, where N and M are greater than 1; based on (i) content of the N files, (ii) a loss function of a first model, and (iii) an objective function of a second model, identify a set of T topics, where T is greater than 1 and each topic is a group of one or more keywords; generate the knowledge base using the identified topics; and for each one of the N files, based on a particular category to which the file is assigned and keywords included in the file, add the file to the knowledge base, wherein the first model is a categorization model that functions to map an input sentence to one of the M categories.
13. The apparatus of claim 12, wherein the categorization model is a machine learning (ML) model, and the method further comprises training the ML model using the categorized files as training data.
14. The apparatus of claim 12, wherein identifying the set of T topics comprises identifying said group of one or more keywords of each topic using a sum of the loss function of the first model and the objective function of the second model.
15. The apparatus of claim 12, wherein the loss function of the first model depends at least on a probability distribution of each topic of the set of T topics and a stochastic parameter influencing a distribution of words in each topic of the set of T topics.
16. The apparatus of claim 12, wherein the objective function of the second model depends at least on a predetermined category of a file and a predicted output of the first model.
17. The apparatus of claim 12, wherein the second model is Latent Dirichlet Allocation (LDA) model.
18. The apparatus of claim 12, further comprising: performing a Part-Of-Speech (POS) tagging on keywords associated with the identified set of T topics.
19. A computer program comprising instructions which when executed by processing circuitry causes the processing circuitry to perform the method of claim 1.
20. A carrier containing the computer program of claim 19, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.
21-24. (canceled)
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] The accompanying drawings, which are incorporated herein and form part of the specification, illustrate various embodiments.
[0013]
[0014]
[0015]
[0016]
[0017]
[0018]
[0019]
[0020]
DETAILED DESCRIPTION
[0021]
[0022] The knowledge graph 120 includes a top node 122 in the first layer of the knowledge graph 120 and middle nodes 124, 126, and 128 in the second layer of the knowledge graph 120. To perform a search on the knowledge graph 120, the entire knowledge graph 120 must be searched. Searching the entire knowledge graph 120, however, requires a longer time period.
[0023] Accordingly, in some embodiments, both categorization and topic modelling are used such that a search only needs to be performed on a part of the knowledge graph rather than on the entire knowledge graph.
[0024] For categorization, domain knowledge (e.g., a hierarchy structure) or an Artificial Intelligence (AI) based model may be used. As an example, a convolutional neural network (CNN) model may be used to categorize files based on an inputted search query. As used herein, a “file” is a collection of data that is treated as a unit.
[0025] For topic modelling, a Latent Dirichlet Allocation (LDA) model may be used to identify dominant topics in files.
[0026]
[0027] When an LDA model is used to identify topics in files, the loss function of the LDA model is used to find a distribution of words associated with each of the topics such that the word distributions are uniform. The problem with using the loss function of the LDA model alone is that it is unsupervised and thus may generate poor results. Also, because the text is noisy, employing a categorizer (i.e., a classifier) alone may give poor results. Thus, in some embodiments, the loss function (i.e., the objective function) of the LDA model is modified by adding the loss function of the categorizer (i.e., the classifier) to it.
[0028] An exemplary loss function of the LDA model is $L = \sum_{d}^{N} \sum_{n \in N_{\ldots}} \ldots$
[0029] Thus, according to some embodiments, the loss function of the LDA model is modified such that the modified loss function is based on the loss function of the categorizer as well as the loss function of the LDA model. For example, the modified loss function of the LDA model is $L_{\mathrm{mod}} = L_{\mathrm{unmod}} + \lVert y_d - \hat{y}_d \rVert_2^2$, where $L_{\mathrm{unmod}} = \sum_{d}^{N} \sum_{n \in N_{\ldots}} \ldots$
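The combination of the two terms may be illustrated with a minimal Python sketch. It assumes, for illustration only, that y_d is a one-hot encoding of the category assigned to file d and that ŷ_d is the output vector of the categorizer for that file. The function names and the numeric values are hypothetical, and the unmodified LDA term is passed in as a placeholder number rather than computed.

```python
from typing import Sequence


def squared_l2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Squared Euclidean distance ||y - y_hat||_2^2 between two vectors."""
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction vectors must have equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


def modified_loss(l_unmod: float,
                  y_d: Sequence[float],
                  y_hat_d: Sequence[float]) -> float:
    """L_mod = L_unmod + ||y_d - y_hat_d||_2^2 for a single file d."""
    return l_unmod + squared_l2(y_d, y_hat_d)


# Hypothetical values: file d is assigned to category 0 of M = 3 categories.
y_d = [1.0, 0.0, 0.0]      # assumed one-hot encoding of the assigned category
y_hat_d = [0.5, 0.5, 0.0]  # assumed categorizer output for file d
l_unmod = 2.0              # placeholder for the unmodified LDA term
loss = modified_loss(l_unmod, y_d, y_hat_d)
```

In this sketch the penalty term grows as the categorizer output moves away from the assigned category, so topic assignments that disagree with the known category of a file increase the combined loss.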
[0030]
[0031] In step s302, all files in the database that is to be searched are obtained.
[0032] After the files are obtained, in step s304, each of the obtained files is categorized and labelled with one or more categories. For example, a document used by service engineers for managing wireless network equipment may be labelled with the categories “installation” and “troubleshooting.” Because sentences included in a document are likely to relate to the category or categories of the document, each sentence included in the document may also be categorized according to the category or categories of the document.
[0033] After the files are categorized and labelled, in step s306, keywords and/or key phrases are extracted from the files using a character recognition engine (e.g., the Tesseract optical character recognition (OCR) engine), and each of the files is divided into the sentences it contains. Each of the extracted key phrases may be treated as a single word by connecting the words of the key phrase with a hyphen, a dash, or an underscore (e.g., solving_no_connection_problem).
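The joining of key phrases and the division into sentences may be sketched as follows. The sketch assumes the text has already been extracted from the file (the OCR step is not shown) and that the key phrases are already known. The helper names `join_key_phrases` and `split_sentences`, the sample text, and the naive punctuation-based sentence splitter are illustrative assumptions.

```python
import re
from typing import List


def join_key_phrases(text: str, key_phrases: List[str]) -> str:
    """Replace each key phrase in text with an underscore-joined single token."""
    # Longest phrases first so a shorter phrase cannot split a longer one.
    for phrase in sorted(key_phrases, key=len, reverse=True):
        token = "_".join(phrase.split())
        # A function replacement avoids any escape processing of the token.
        text = re.sub(re.escape(phrase), lambda _m, t=token: t, text,
                      flags=re.IGNORECASE)
    return text


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


# Hypothetical extracted text and key phrases.
text = ("Solving no connection problem requires a reset. "
        "Check the antenna cable first.")
phrases = ["solving no connection problem", "antenna cable"]
joined = join_key_phrases(text, phrases)
sentences = split_sentences(joined)
```

After this step each key phrase behaves as one word for any downstream model that tokenizes on whitespace.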
[0034] In step s308, a categorization model is built. The categorization model may be configured to receive one or more sentences as input and to output one or more categories associated with the inputted sentence(s). The input of the categorization model is set to be a sentence (rather than a word or a paragraph) because a search query is generally in the form of a sentence. In some embodiments, a CNN model may be used as the categorization model.
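The sentence-in, category-out interface of the categorization model may be sketched as follows. This is not a CNN: a simple word-count classifier is swapped in as a stand-in so that the example stays self-contained. The training sentences, the stop-word list, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical stop-word list, used only to keep the toy example stable.
STOP_WORDS = {"a", "is", "on", "the", "to", "when", "why"}


def tokenize(sentence: str) -> List[str]:
    """Lower-case the words, strip punctuation, and drop stop words."""
    words = (w.strip(".,;:!?").lower() for w in sentence.split())
    return [w for w in words if w and w not in STOP_WORDS]


def train(labelled: List[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count word occurrences per category from (sentence, category) pairs."""
    counts: Dict[str, Counter] = {}
    for sentence, category in labelled:
        counts.setdefault(category, Counter()).update(tokenize(sentence))
    return counts


def categorize(sentence: str, counts: Dict[str, Counter]) -> str:
    """Return the category whose training words overlap most with the input."""
    words = tokenize(sentence)
    return max(counts, key=lambda c: sum(counts[c][w] for w in words))


# Hypothetical labelled sentences taken from categorized files.
training = [
    ("Mount the antenna on the mast.", "installation"),
    ("Connect the power cable to the unit.", "installation"),
    ("The unit reports a poor signal.", "troubleshooting"),
    ("Restart the unit when power is low.", "troubleshooting"),
]
model = train(training)
category = categorize("Why is the signal poor?", model)
```

Because the sentences of a file inherit the categories of that file, the labelled sentences themselves can serve as training data for whichever model is used in place of this stand-in.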
[0035] In step s310, topic modelling is performed on files that are in the same category, and dominant keywords that form the topic(s) in the files are identified. In some embodiments, an LDA model may be used to perform the topic modelling.
[0036] After identifying (i) categories of the files and (ii) topics associated with each of the categories of the files, a knowledge base is constructed in step s312. In the knowledge base, each of the categories identified in step s304 may be assigned to a node in a top level (hereinafter “top node”) of the knowledge base, and the topics associated with each of the categories of the files may be assigned to nodes in a middle level (hereinafter “middle nodes”), which are branched from the top node.
[0037] As shown in
[0038] After constructing the knowledge base in step s312, in step s314, nodes corresponding to names of the files are added to a lower level of the knowledge base. The nodes in the lower level (hereinafter “lower nodes”) are associated with one or more of the topics in the middle level of the knowledge base and are branched from the associated topics. For example, in the knowledge graph 400, the node 414 corresponds to the file name “File 1” and is branched from the nodes 406 and 410, which correspond to the topics “Low Power” and “Poor Signal” associated with “File 1.”
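The three-level structure of top nodes, middle nodes, and lower nodes may be sketched as a nested mapping. The sketch assumes each file arrives with its category and its topics already identified. The placement of “File 1” under “troubleshooting”, the entries “File 2” and “File 3”, and the topic “Antenna Mounting” are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

# category (top node) -> topic (middle node) -> file names (lower nodes)
KnowledgeBase = Dict[str, Dict[str, Set[str]]]


def build_knowledge_base(
    files: Iterable[Tuple[str, str, List[str]]]
) -> KnowledgeBase:
    """Build the nested structure from (file_name, category, topics) records."""
    kb: KnowledgeBase = {}
    for file_name, category, topics in files:
        middle = kb.setdefault(category, {})
        for topic in topics:
            # A file is branched from every topic it is associated with.
            middle.setdefault(topic, set()).add(file_name)
    return kb


records = [
    ("File 1", "troubleshooting", ["Low Power", "Poor Signal"]),
    ("File 2", "troubleshooting", ["Poor Signal"]),      # hypothetical
    ("File 3", "installation", ["Antenna Mounting"]),    # hypothetical
]
kb = build_knowledge_base(records)
```

A file associated with several topics appears under each of them, which mirrors a lower node that is branched from more than one middle node.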
[0039] In some embodiments, after performing the topic modelling in step s310, two additional steps may be performed prior to constructing a knowledge base in step s312. Specifically, as shown in
[0040] After performing the POS tagging, in step s504, named entity recognition (NER) construction may be performed. In the NER construction step, one or more words included in the obtained files are labelled with what the words represent. For example, the word “London” may be labelled as a “capital” while the word “France” may be labelled as a “country.”
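The labelling of words with what they represent may be sketched with a simple lookup table. This is a deliberately simplified stand-in: a practical system would likely use a trained NER model rather than a fixed dictionary. The table `GAZETTEER`, the function name, and the sample sentence are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup table standing in for a trained NER model.
GAZETTEER: Dict[str, str] = {"london": "capital", "france": "country"}


def label_entities(sentence: str,
                   gazetteer: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (word, label) pairs for every word found in the gazetteer."""
    labelled: List[Tuple[str, str]] = []
    for raw in sentence.split():
        word = raw.strip(".,;:!?\"'()")
        label = gazetteer.get(word.lower())
        if label is not None:
            labelled.append((word, label))
    return labelled


entities = label_entities("The engineer flew from London to France.", GAZETTEER)
```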
[0041] After performing the NER construction in step s504, a knowledge base may be constructed in step s312.
[0042]
[0043] In step s602, a search query is received at a user interface. The user interface may be any device capable of receiving a user input. For example, the user interface may be a mouse, a keyboard, a touch panel, or a touch screen.
[0044] After receiving the search query, in step s604, one or more sentences corresponding to the search query are provided as input to a categorization model such that the categorization model identifies one or more categories associated with the search query. The categorization model used in this step may correspond to the categorization model built in step s308.
[0045] After identifying one or more categories associated with the search query, in step s606, a topic model identifies one or more topics associated with the search query based on one or more keywords of the search query. The topic model used in this step may correspond to the entity that performs the topic modelling in step s310.
[0046] Based on the identified categories and topics associated with the search query, in step s608, a search is performed only on the part of the knowledge base that involves the identified categories and the identified topics rather than on the whole knowledge base. By performing a search only on the part of the knowledge base that is most likely related to the user's search query, files related to the search query may be retrieved faster.
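The restriction of the search to a part of the knowledge base may be sketched as follows. The sketch assumes the knowledge base is held as a nested mapping from category to topic to file names, and that the categories and topics of the query have already been identified by the categorization model and the topic model. The function name and the sample data are hypothetical.

```python
from typing import Dict, Iterable, List, Set

# category (top node) -> topic (middle node) -> file names (lower nodes)
KnowledgeBase = Dict[str, Dict[str, Set[str]]]


def restricted_search(kb: KnowledgeBase,
                      categories: Iterable[str],
                      topics: Iterable[str]) -> List[str]:
    """Visit only the sub-graph under the identified categories and topics."""
    topic_list = list(topics)
    hits: Set[str] = set()
    for category in categories:
        middle = kb.get(category, {})  # unrelated categories are never visited
        for topic in topic_list:
            hits |= middle.get(topic, set())
    return sorted(hits)


# Hypothetical knowledge base.
kb: KnowledgeBase = {
    "troubleshooting": {
        "Low Power": {"File 1"},
        "Poor Signal": {"File 1", "File 2"},
    },
    "installation": {"Antenna Mounting": {"File 3"}},
}
results = restricted_search(kb, ["troubleshooting"], ["Poor Signal"])
```

Only the branch under the identified category is traversed, so the cost of the lookup depends on the size of that branch rather than on the size of the whole knowledge base.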
[0047]
[0048] Step s702 comprises receiving a search query entered by a user.
[0049] Step s704 comprises, based on the received search query, using a first model to identify a category corresponding to the received search query. One or more files may be assigned to the identified category, and the first model may be a categorization model that functions to map an input to one of M different categories, where M is greater than 1.
[0050] Step s706 comprises, based on (i) the received search query, (ii) a loss function of the first model, and (iii) an objective function of a second model, identifying T topics corresponding to the received search query, where T is greater than or equal to 1.
[0051] Step s708 comprises, using the identified category and the identified topics, performing a search for the received search query only on a part of the knowledge base that is associated with the identified category and/or the identified topics.
[0052] Step s710 comprises, based on the performed search, retrieving one or more files associated with the identified category and/or the identified topics.
[0053] In some embodiments, the process 700 may further comprise constructing the knowledge base. Constructing the knowledge base may comprise obtaining a set of N files, each of which is assigned to one of the M different categories, where N is greater than 1. Constructing the knowledge base may also comprise, based on (i) content of the N files, (ii) the loss function of the first model, and (iii) the objective function of the second model, identifying a set of topics, where each topic is a group of one or more keywords. Constructing the knowledge base may further comprise generating the knowledge base using the identified topics and, for each one of the N files, based on a particular category to which the file is assigned and keywords included in the file, adding the file to the knowledge base.
[0054]
[0055] Step s802 comprises obtaining a set of N files, each of which is assigned to one of M different categories, where N and M are greater than 1.
[0056] Step s804 comprises, based on (i) content of the N files, (ii) a loss function of a first model, and (iii) an objective function of a second model, identifying a set of T topics, where T is greater than 1 and each topic is a group of one or more keywords.
[0057] Step s806 comprises generating the knowledge base using the identified topics.
[0058] Step s808 comprises, for each one of the N files, based on a particular category to which the file is assigned and keywords included in the file, adding the file to the knowledge base.
[0059] The first model may be a categorization model that functions to map an input sentence to one of the M categories.
[0060] In some embodiments, the categorization model is a machine learning (ML) model. The process 800 may further comprise training the ML model using the categorized files as training data.
[0061] In some embodiments, identifying the set of T topics comprises identifying said group of one or more keywords of each topic using a sum of the loss function of the first model and the objective function of the second model.
[0062] In some embodiments, the loss function of the first model depends at least on a probability distribution of each topic of the set of T topics and a stochastic parameter influencing a distribution of words in each topic of the set of T topics.
[0063] In some embodiments, the objective function of the second model depends at least on a predetermined category of a file and a predicted output of the first model.
[0064] In some embodiments, the second model is a Latent Dirichlet Allocation (LDA) model.
[0065] In some embodiments, the process 800 comprises performing POS tagging on keywords associated with the identified set of T topics.
[0066]
[0067] While various embodiments are described herein, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of this disclosure should not be limited by any of the above-described exemplary embodiments. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.
[0068] Additionally, while the processes described above and illustrated in the drawings are shown as a sequence of steps, this was done solely for the sake of illustration. Accordingly, it is contemplated that some steps may be added, some steps may be omitted, the order of the steps may be re-arranged, and some steps may be performed in parallel.