APPARATUS AND METHOD FOR FINDING MEANINGFUL PATTERNS IN LARGE DATASETS USING MACHINE LEARNING
20230126266 · 2023-04-27
Inventors
CPC classification
G06F18/2113
PHYSICS
G06F18/40
PHYSICS
G06V10/758
PHYSICS
International classification
G06F18/40
PHYSICS
G06F18/2113
PHYSICS
Abstract
A method, and corresponding system and computer program product, is provided for identifying meaningful information in connection with an investigation. The method comprises processing a dataset using a machine learning process to derive an initial model conveying first statistical significance information corresponding to features in the dataset. The method also comprises deriving an alternate model at least in part by processing the dataset using the machine learning process while nullifying a contribution of certain features in the dataset selected as candidates for nullification. The alternate model conveys second statistical significance information corresponding to features in the dataset. A user interface is rendered on a display device and presents information for assisting the user in identifying the information in the dataset meaningful to the investigation, the information presented being derived at least in part by processing information conveyed by the initial model and the alternate model.
Claims
1. A method for assisting a user in identifying in a dataset information meaningful to an investigation, said method being implemented by a computer system including one or more processors in communication with a memory module storing the dataset and with a display device, said method comprising: a. using the one or more processors, processing the dataset using a machine learning process to derive an initial model conveying first statistical significance information corresponding to features in the dataset; b. using the one or more processors, deriving an alternate model at least in part by processing the dataset using the machine learning process, wherein deriving the alternate model includes nullifying a contribution of a set of features in the dataset selected as candidates for nullification when applying the machine learning process, the set of features selected as candidates for nullification including a subset of the features in the dataset, wherein the alternate model conveys second statistical significance information corresponding to features in the dataset, wherein the first statistical significance information is different from the second statistical significance information; c. rendering on the display device a user interface presenting information for assisting the user in identifying the information in the dataset meaningful to the investigation, the information presented being derived at least in part by processing information conveyed by the initial model and the alternate model.
2. A method as defined in claim 1, comprising selecting the candidates for nullification from the features in the dataset.
3. A method as defined in claim 1, wherein selecting the candidates for nullification from the features in the dataset is performed at least in part based on the first statistical significance information.
4. A method as defined in claim 2, comprising: a. classifying some features of the dataset as statistically significant at least in part by processing the first statistical significance information; b. selecting the candidates for nullification from the features of the dataset classified as statistically significant.
5. A method as defined in claim 4, wherein selecting the candidates for nullification from the features in the dataset is performed at least in part based on inputs provided by the user to the computer system.
6. A method as defined in claim 5, comprising: a. selecting the candidates for nullification from the features in the dataset at least in part by: i. presenting on the user interface at least some features in the dataset as suggested user-selectable options for nullification; and ii. in response to receipt of a user selection of one or more features from the suggested user-selectable options, including the user selection as part of the selected candidates for nullification; b. deriving the alternate model at least in part by processing the dataset using the machine learning process, wherein deriving the alternate model includes nullifying the contribution of the selected candidates for nullification.
7. A method as defined in claim 6, comprising processing the first statistical significance information using an automated process to derive the at least some features in the dataset to be presented on the user interface as part of the suggested user-selectable options for nullification.
8. A method as defined in claim 7, wherein the automated process is configured to process the first statistical significance information to select at least one feature from the dataset to be presented on the user interface as part of the suggested user-selectable options for nullification.
9. A method as defined in claim 8, wherein the automated process is configured to process the first statistical significance information to select at least two features from the dataset to be presented on the user interface as part of the suggested user-selectable options for nullification.
10. A method as defined in claim 8, wherein the automated process is configured to apply an optimization scheme to select the at least one feature from the dataset.
11. A method as defined in claim 10, wherein the optimization scheme includes a hill climbing (trial and error) process.
12. A method as defined in claim 8, wherein the automated process is configured to apply a set of heuristic rules to select the at least one feature from the dataset.
13. A method as defined in claim 2, comprising selecting the candidates for nullification from the dataset at least in part by processing the first statistical significance information using an automated process to select features to form part of the selected candidates for nullification.
14. A method as defined in claim 13, wherein the automated process is configured to select at least one feature from the dataset as part of the set of features identified as candidates for nullification.
15. A method as defined in claim 13, wherein the automated process is configured to select at least two features from the dataset as part of the set of features identified as candidates for nullification.
16. A method as defined in claim 14, wherein the automated process is configured to apply an optimization scheme to select the at least one feature from the dataset as part of the set of features identified as candidates for nullification.
17. A method as defined in claim 16, wherein the optimization scheme includes a hill climbing (trial and error) process.
18. A method as defined in claim 14, wherein the automated process is configured to apply a set of heuristic rules to select the at least one feature from the dataset as part of the set of features identified as candidates for nullification.
19. A method as defined in claim 1, wherein the information presented for assisting the user in identifying the information in the dataset meaningful to the investigation conveys: i. a first set of features in the dataset classified as statistically significant based at least in part on the first statistical significance information; and ii. a second set of features in the dataset classified as statistically significant based at least in part on the second statistical significance information.
20. A method as defined in claim 1, wherein the information presented for assisting the user in identifying the information in the dataset meaningful to the investigation conveys information derived by performing a comparison between the initial model and the alternate model.
21. A method as defined in claim 1, wherein the information presented for assisting the user in identifying the information in the dataset meaningful to the investigation identifies a specific subset of features in the dataset presenting a greater change in statistical significance between the initial model and the alternate model relative to other features in the dataset.
22. A method as defined in claim 21, comprising comparing the initial model and the alternate model to rank features in the dataset at least in part based on changes in statistical significance of the features between the initial model and the alternate model.
23. A method as defined in claim 1, wherein the machine learning process includes a generalized linear modelling (GLM) process.
24. A method as defined in claim 1, wherein the machine learning process includes a topic modelling process.
25. A method as defined in claim 24, wherein the topic modelling process is a Latent Dirichlet Allocation (LDA) process.
26. A method as defined in claim 24, wherein the dataset includes a corpus and wherein features in the dataset include terms from the corpus.
27. A method as defined in claim 24, wherein the machine learning process used to derive the initial model includes: a. applying the topic modelling process to the dataset to derive information conveying: i. a topic identified in the dataset; and ii. the first statistical significance information for features in the dataset, the first statistical significance information conveying a relevance of respective features of the dataset to the topic identified in the dataset.
28. A method as defined in claim 27, wherein the information presented on the user interface for assisting the user in identifying the information in the dataset meaningful to the investigation conveys the topic identified in the dataset in association with at least a subset of features in the dataset, the subset of features in the dataset being derived at least in part by processing the first statistical significance information.
29. A method as defined in claim 24, wherein the machine learning process used to derive the initial model includes: a. applying the topic modelling process to the dataset to derive information conveying: i. a set of topics identified in the dataset; and ii. the first statistical significance information for features in the dataset, the first statistical significance information conveying a relevance of respective features in the dataset to each topic in the set of topics identified in the dataset.
30. A method as defined in claim 29, wherein the set of topics identified in the dataset includes at least two topics.
31. A method as defined in claim 29, comprising: a. selecting a number of topics to be included in the set of topics to be identified in the dataset; b. applying the topic modelling process to the dataset to derive the information conveying the set of topics identified in the dataset.
32. A method as defined in claim 31, wherein the number of topics selected is configured to be between 5 and 9.
33. A method as defined in claim 31, wherein the number of topics is selected at least in part based on a user input.
34. A method as defined in claim 33, wherein applying a topic modelling process to the dataset includes: i. presenting on the user interface one or more suggested user-selectable options for numbers of topics to be derived by the topic modelling process; and ii. in response to receipt of a user selection identifying a specific number of topics amongst the suggested user-selectable options, applying the topic modelling process to the dataset on the basis of the specific number of topics.
35. A method as defined in claim 24, wherein processing the dataset using the machine learning process to derive the initial model includes applying at least one of a data-cleaning process and a feature engineering process to the dataset to remove a contribution associated with features considered insignificant to the investigation.
36. A method as defined in claim 35, wherein the machine learning process includes a topic modelling process, wherein the dataset includes a corpus, wherein features in the dataset include terms from the corpus, and wherein the set of insignificant features includes a set of common stop terms and a set of investigation specific stop terms.
37. A method as defined in claim 36, wherein processing the dataset using the machine learning process to derive the initial model includes applying a process using a term frequency-inverse document frequency (TF-IDF) statistic to the dataset to identify at least some terms in at least one of the set of common stop terms and the set of investigation specific stop terms.
38. A method as defined in claim 1, wherein deriving the alternate model includes using the machine learning process at least in part by applying an optimization process to the initial model nullifying the contribution of the set of features in the dataset selected as candidates for nullification.
39. A method as defined in claim 1, wherein the dataset includes a plurality of police reports and the investigation is a police investigation.
40. A method as defined in claim 1, wherein the dataset includes a plurality of medical reports and the investigation is a medical investigation.
41. A method as defined in claim 1, wherein the dataset includes a plurality of financial reports and the investigation is a financial trends investigation.
42. A method for assisting a user in identifying in a dataset information meaningful to an investigation, said method being implemented by a computer system including one or more processors in communication with a memory module storing the dataset and with a display device, said method comprising: a. using the one or more processors, processing the dataset using a machine learning process to derive an initial model; b. rendering a user interface on the display device to present a set of suggested user-selectable features for nullification, the suggested user-selectable features corresponding to statistically important features conveyed by the initial model; c. in response to receipt of a user selection of one or more features from the suggested user-selectable options, deriving an alternate model at least in part by processing the dataset using the machine learning process nullifying a contribution of the one or more features specified by the user selection; d. adapting the user interface to present information for assisting the user in identifying the information in the dataset meaningful to the investigation, the information presented being derived at least in part by processing the initial model and the alternate model.
43. A method as defined in claim 42, wherein the information presented for assisting the user in identifying the information in the dataset meaningful to the investigation conveys information derived by performing a comparison between the initial model and the alternate model.
44. A method as defined in claim 42, wherein the information presented for assisting the user in identifying the information in the dataset meaningful to the investigation identifies a set of features in the dataset presenting a greater change in statistical significance between the initial model and the alternate model relative to other features in the dataset.
45. A method as defined in claim 42, comprising comparing the initial model and the alternate model to rank features in the dataset at least in part based on changes in statistical significance of the features between the initial model and the alternate model.
46. A method as defined in claim 42, wherein deriving the alternate model includes using the machine learning process at least in part by applying an optimization process to the initial model.
47. A method as defined in claim 42, wherein the machine learning process includes a topic modelling process.
48. A method as defined in claim 47, wherein the topic modelling process is a Latent Dirichlet Allocation (LDA) process.
49. A method as defined in claim 47, wherein the dataset includes a corpus and wherein features in the dataset include terms from the corpus.
50. A method as defined in claim 42, wherein the machine learning process includes a generalized linear modelling (GLM) process.
51. A system for assisting a user in identifying in a dataset information meaningful to an investigation, said system being in communication with a display device and including one or more processors in communication with a memory module storing the dataset, said one or more processors being programmed for implementing the method defined in claim 1.
52. A computer program product for assisting a user in identifying in a dataset information meaningful to an investigation, said computer program product including computer readable instructions stored on a non-transitory computer readable medium, said computer readable instructions when executed by a system including one or more processors being configured for implementing the method defined in claim 1.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0049] A detailed description of specific exemplary embodiments is provided herein below with reference to the accompanying drawings.
[0063] In the drawings, exemplary embodiments are illustrated by way of example. It is to be expressly understood that the description and drawings are only for the purpose of illustrating certain embodiments and are an aid for understanding. They are not intended to be a definition of the limits of the invention.
DETAILED DESCRIPTION OF EMBODIMENTS
[0064] A detailed description of one or more specific embodiments of the invention is provided below along with accompanying Figures that illustrate principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any specific embodiment. The scope of the invention is limited only by the claims. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of describing non-limiting examples and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in great detail so that the invention is not unnecessarily obscured. In addition, it is to be appreciated that tools embodying aspects of the invention may be integrated as part of a system for performing investigations that may include other tools providing other functionality including, for example, clustering, dimension reduction, hotspot visualization and other visualization and analysis tools. For the purpose of simplicity, these tools have not been described in the present disclosure.
[0065] The approach described in the present document may be applied to providing insights and assisting users in identifying potentially meaningful patterns and/or to making predictions in a wide variety of investigations. In the description below, certain very specific practical implementations of the present invention will be presented in the context of forensics applications in order to illustrate how the methods described can be used in a practical investigative context. It is to be appreciated that the concepts presented in the present document may be used in other practical applications in which it is desirable to identify meaningful patterns and trends in order to make better predictions and/or to solve a problem, including but not limited to police investigations to identify crime/forensics patterns, economic investigations to identify/forecast criteria that may affect certain social/economic outcomes, medical investigations and many others that may become apparent to the person skilled in the art in view of the present description.
[0066] Process for Assisting a User
[0067] With reference to
[0068] The method 100 depicted in
[0069] At step 102, a dataset stored on the non-transitory memory module is provided. The dataset constitutes a body of information based on which the investigation is to be performed. The dataset may include a wide range of data in different formats, including structured data (e.g. with pre-defined data fields, features, terms (e.g. words, word patterns, categories, etc.)) as well as unstructured data (e.g. free-form reports). In some practical implementations, the dataset may be comprised of a plurality of documents forming a corpus of documents, wherein the documents are formed of terms (e.g. words, word patterns, categories, etc.). The content of the dataset will vary according to the application for which it is intended and any suitable manner known in the art for constituting the dataset may be used in specific practical implementations. For example, in specific practical implementations: [0070] a. for a police investigation aiming to identify crime/forensics patterns, the dataset (corpus of documents) may include a set of police reports pertaining to related crimes. For example, if the investigation pertains to a homicide in a specific city, the corpus of documents may comprise: (i) police reports for all homicides that occurred in that city (or group of cities) over a certain time period; (ii) police reports for all violent crimes that occurred in that city (or group of cities) over a certain time period; and/or (iii) police reports for all breaking and entering crimes that occurred in that city (or group of cities) over a certain time period. [0071] b. for an economic investigation aiming to identify/forecast criteria that may affect certain social/economic outcomes, the dataset (corpus of documents) may include a set of reports summarizing social/economic outcomes in different urban and rural regions. [0072] c. for a medical investigation aiming to identify risk factors for a specific medical condition, the dataset (corpus of documents) may include a set of medical reports pertaining to patients having been diagnosed with the specific medical condition.
[0073] At step 104, the dataset is processed using a machine learning process to derive an initial model conveying first statistical significance information corresponding to features in the dataset. Different types of machine learning processes/algorithms may be used in different implementations in order to derive this initial model.
[0074] In a first practical example, the machine learning process may include a generalized linear modelling (GLM) process to derive the initial model conveying first statistical significance information corresponding to features of the dataset.
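By way of a non-limiting illustrative sketch only (the feature names and data below are hypothetical, and coefficient t-statistics stand in for the "first statistical significance information"), the GLM variant of this step might look as follows in the Gaussian special case, where the GLM reduces to ordinary least squares:

```python
import numpy as np

# Hypothetical toy dataset: three candidate features, only f0 truly matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(size=200)

# OLS fit (the Gaussian case of a GLM): beta = argmin ||X beta - y||^2.
Xd = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)

# |t|-statistics of the coefficients serve as per-feature significance scores.
resid = y - Xd @ beta
sigma2 = resid @ resid / (len(y) - Xd.shape[1])
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xd.T @ Xd)))
t_stats = np.abs(beta / se)
significance = dict(zip(["const", "f0", "f1", "f2"], t_stats))
```

Here the informative feature f0 receives a far larger significance score than the noise features, which is the kind of per-feature ordering the initial model is expected to convey.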
[0075] In a second practical example in which the dataset is comprised of a plurality of documents forming a corpus of documents and where the features correspond to terms (e.g. words) in the corpus, the machine learning process may include a topic modelling process to derive the initial model conveying first statistical significance information corresponding to terms (e.g. words) of the documents. In this second practical example, any suitable type of topic modelling process may be used including, without being limited to Latent Dirichlet Allocation (LDA), LDA2Vec, Latent semantic analysis (LSA), hierarchical Latent Dirichlet Allocation (hLDA) and Non-negative matrix factorization (NMF or NNMF). A more detailed explanation of a practical implementation in which the machine learning process is a topic modelling process based on Latent Dirichlet Allocation (LDA) will be described later on in the present disclosure.
[0076] It is to be appreciated that the application of the machine learning process at step 104 may also include applying suitable data-cleaning and/or feature engineering processes to the dataset, according to known methods, prior to applying a machine learning algorithm, in order to remove a contribution associated with features considered insignificant to the investigation. For example, where the dataset includes a corpus of documents, where features in the dataset include terms (e.g. words) from the corpus and where the machine learning process includes a topic modelling process, the set of insignificant features may include, for example, a set of common stop terms and/or a set of investigation specific stop terms. Common stop terms and/or investigation specific stop terms may be identified in a number of different suitable manners known in the art including, for example but not limited to: [0077] a. using a reference static dictionary including common stop terms and/or a set of investigation specific stop terms (e.g. articles, pronouns, certain specific nouns and verbs, etc.); and/or [0078] b. applying a process using a term frequency-inverse document frequency (TF-IDF) statistic to the dataset to identify common terms and/or a set of investigation specific stop terms.
[0079] Processes for performing data-cleaning and feature engineering are known in the art of machine learning and will therefore not be described in greater detail here.
[0080] Once the initial model conveying first statistical significance information corresponding to features in the dataset has been derived, the process proceeds to step 106.
[0081] At step 106, the dataset is processed using the same machine learning process as at step 104 in order to derive an alternate model conveying second statistical significance information corresponding to features in the dataset. While the same machine learning process is used, the second statistical significance information is different from the first statistical significance information that was derived in connection with the initial model as a result of the nullification of the contribution of some features in the dataset.
[0082] In particular, step 106 for deriving the alternate model conveying the second statistical significance information includes, at step 108, nullifying a contribution of a set of features in the dataset selected from the dataset as candidates for nullification and then, at step 110, applying the machine learning process with the nullified set of features to derive the alternate model. The way the contribution of the specific set of features is nullified at step 108 depends on the nature of the process/algorithm applied by the machine learning process used and any suitable approach may be taken, including, for example, eliminating the features in the set of features from the dataset, setting weight factors associated with the features in the set of features to "nil", or any other suitable approach.
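The first of these nullification approaches (eliminating the selected features from the dataset before re-applying the learning process) can be sketched as follows for a term-based corpus; the helper name and toy reports are hypothetical:

```python
# Hypothetical sketch of step 108: drop the nullified terms from every
# document so they cannot contribute when the alternate model is derived.
def nullify(corpus, features_to_nullify):
    """Return a copy of the corpus with the selected terms removed."""
    dropped = set(features_to_nullify)
    return [
        " ".join(w for w in doc.split() if w not in dropped)
        for doc in corpus
    ]

reports = ["knife wound victim", "knife stolen vehicle"]
alternate_corpus = nullify(reports, {"knife"})
# → ["wound victim", "stolen vehicle"]
```

The machine learning process of step 110 would then be applied to `alternate_corpus` instead of the original corpus.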
[0083] In some implementations, the set of features of the dataset whose contribution is nullified is selected as part of step 108 from the features of the dataset assigned a higher level of statistical significance in the initial model relative to other features in the dataset (and thus potentially considered statistically more important features in the initial model).
[0084] The selection of the features as candidates for nullification may be performed automatically (based on one or more specific criteria) and/or based on one or more user inputs. An example of a process part of step 108 for selecting features as candidates for nullification is shown in
[0085] As depicted, two options for selecting the features as candidates for nullification are provided, namely path "A", which is the selection based on user input and comprises steps 2022, 2024 and 2026, and path "B", which is an automated selection and comprises step 2028. In specific practical implementations, either path "A" or path "B" may be used or, alternatively, both may be used in combination to select features that will be included as part of the candidates for nullification.
[0086] Looking first at the selection of the features as candidates for nullification based on one or more user inputs (path "A"), at step 2022, the first statistical significance information conveyed by the initial model is processed using an automated process to classify the features in the dataset based on their respective statistical significance. Different types of classification of features may be contemplated including: binary (Boolean) classification, where features may be classified as either being statistically significant or being not statistically significant; and multi-level classification, wherein features may be classified according to different levels of significance (e.g. a three-level classification may look like: High significance, Medium significance, Low significance). It is to be appreciated that any suitable number of significance levels may be used in alternate implementations. The candidates for nullification may thus be selected from the features of the dataset on the basis of their classification.
[0087] In some implementations, this may include, for example without being limited to: [0088] a. assigning a Boolean classification to each feature in the dataset (whereby each feature is classified as "being" or "not being" statistically significant/important). The classification may be performed based on different criteria, including comparing the first statistical significance information against a threshold value and/or classifying only a specific limited number of features in the dataset as "being" statistically significant/important based on the first statistical significance information. The specific limited number of features classified as being statistically significant may vary between implementations and may include 1, 2, 3 . . . , 8, 9, 10 or any other suitable number of features. In a non-limiting practical implementation, the specific limited number of features to be classified as statistically significant/important was set to 10; and [0089] b. ranking the features in the dataset based on their respective statistical significance as conveyed by the first statistical significance information. This approach is a generalization of the previously described Boolean classification in which, rather than having only two categories, a plurality of rankings is defined based on the first statistical significance information, each ranking having a threshold value and/or being assigned a corresponding specific limited number of features.
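The Boolean classification of item a, combining both a threshold and a top-N cutoff, can be sketched as follows (function name, scores and parameters are hypothetical; in the non-limiting implementation described above N would be 10):

```python
# Hypothetical sketch of step 2022: flag each feature as statistically
# significant only if it is among the top-N by score AND above a threshold.
def classify_significant(scores, top_n=10, threshold=0.0):
    """scores: dict mapping feature -> significance score (higher = stronger)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {
        f: (i < top_n and scores[f] > threshold)
        for i, f in enumerate(ranked)
    }

scores = {"knife": 0.92, "vehicle": 0.41, "the": 0.02}
flags = classify_significant(scores, top_n=2, threshold=0.05)
# → {"knife": True, "vehicle": True, "the": False}
```

The ranking approach of item b generalizes this by binning `ranked` into several levels instead of two.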
[0090] At step 2024, suggested user-selectable options for nullification are presented to a user on a user interface displayed on a display device. The suggested user-selectable options include features from the features of the dataset classified as statistically significant at step 2022 and may be presented in the form of a list of individually selectable features. In some implementations, the suggested user-selectable options may include a single feature as an option for selection by the user. In other implementations, the suggested user-selectable options may include two or more features presented as options for selection.
[0091] The selection of which features to include as part of the suggested user-selectable options may be performed using an automated process configured to process the first statistical significance information to select one, two or more features from the dataset to be presented on the user interface as part of the suggested user-selectable options for nullification. Various processes may be contemplated in practical implementations and may include, without being limited to, processes configured for applying an optimization scheme and/or a hill climbing (trial and error) process to select the one or more features from the dataset, as well as processes configured to apply a set of heuristic rules to select one or more features from the dataset.
[0092] Specific examples of hill climbing (trial and error) processes that may be used include the use of a modified version of greedy backward elimination of features, where the optimization implemented by the automated process uses a metric pertaining to the emergence of small factors. For example, in the case of LDA, the process may include spotting a feature whose statistical significance has materially changed in an alternate model derived following the nullification of one or more specific features (i.e. this feature is considered to be a "big mover" in terms of statistical significance).
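A non-limiting sketch of this modified greedy backward elimination follows; the function names, the toy refit stub and the scores are all hypothetical. Each trial nullifies one feature, refits, and is scored by the largest change in the significance of any remaining feature, so the trial producing the biggest "big mover" wins:

```python
# Hypothetical sketch: greedy backward elimination scored by a
# "big mover" metric (largest significance change among remaining features).
def best_nullification(features, refit):
    """refit(nullified_set) -> dict feature -> significance under that model."""
    baseline = refit(frozenset())
    best, best_move = None, -1.0
    for f in features:
        trial = refit(frozenset({f}))
        move = max(abs(trial[g] - baseline[g]) for g in features if g != f)
        if move > best_move:
            best, best_move = f, move
    return best, best_move

# Toy refit stub: nullifying "knife" lets "wound" surge in significance,
# i.e. a small factor emerges from under a dominant one.
def toy_refit(nulled):
    scores = {"knife": 0.9, "wound": 0.2, "vehicle": 0.3}
    if "knife" in nulled:
        scores["wound"] = 0.8
    return scores

f, move = best_nullification(["knife", "wound", "vehicle"], toy_refit)
# → f == "knife"
```

In a real implementation `refit` would re-run the LDA (or other) process on the nullified dataset rather than returning canned scores.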
[0093] Specific examples of heuristic rules for selecting one or more features from the dataset that may be used include the use of a custom dictionary of frequently important terms built using a rules-based approach (e.g. gang names) and/or by clustering the terms in the dataset based on semantic similarity or logic (e.g. to list words related to weapons, locations, etc.). On the basis of such a custom dictionary, heuristic rules may be devised for selecting features as candidates for nullification, for example by systematically selecting as candidates for nullification features pertaining to certain specific semantic and/or logical groupings defined by the custom dictionary, and/or by automatically selecting as candidates for nullification features from the custom dictionary that, in the initial model derived, can be classified as being statistically significant based on the statistical significance information conveyed by the initial model.
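The dictionary-based heuristic may be sketched as follows; the semantic groupings and terms are illustrative assumptions, not part of the disclosure:

```python
# illustrative custom dictionary of frequently important terms, grouped by
# semantic similarity (the groupings and terms are hypothetical)
CUSTOM_DICTIONARY = {
    "weapons": {"gun", "knife", "firearm"},
    "locations": {"park", "alley", "station"},
}

def candidates_from_dictionary(significant_features, groups=("weapons",)):
    """Select, as candidates for nullification, the statistically significant
    features that belong to the chosen semantic/logical groupings."""
    pool = set().union(*(CUSTOM_DICTIONARY[g] for g in groups))
    return sorted(f for f in significant_features if f in pool)
```

The input set would typically be the features classified as statistically significant in the initial model.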
[0094] Other suitable processes for deriving suggested user-selectable options of statistically important features as options for selection may be used in alternate implementations.
[0095] The user interface permits a user to select a subset of the features in the suggested user-selectable options, which may include 1 or more features, by providing corresponding user inputs. The user inputs may be provided using any suitable device including, without being limited to, a keyboard, mouse, touch-sensitive screen and audio/voice input.
[0096] Once the user has completed the selection of one or more features from the suggested user-selectable options at step 2024, the process proceeds to step 2026. At step 2026, in response to receipt of the user selections, the corresponding features are included as part of the selected candidates for nullification for use in step 108 (shown in
[0097] Looking now at the selection of the features as candidates for nullification based on an automated selection (path “B”), at step 2028, the first statistical significance information conveyed by the initial model is processed using an automated process to select features to form part of the selected candidates for nullification. The automated process used in specific implementations may be configured to process the first statistical significance information to select one, two or more features from the dataset to be included as part of the candidates for nullification. The number of candidates selected may be fixed or may vary and may be automatically determined based on one or more criteria. Alternatively, the number of candidates selected may be a configurable parameter selectable by the user. Various processes may be contemplated in practical implementations and may include, without being limited to, processes configured for applying an optimization scheme and/or a hill climbing (trial and error) process and/or a set of heuristic rules to select one or more statistically important features from the dataset to be included as part of the candidates for nullification.
[0098] Specific examples of hill climbing (trial and error) processes that may be used include a modified version of greedy backward elimination of features, where the optimization implemented by the automated process uses a metric pertaining to the emergence of small factors. For example, in the case of LDA, the process may include spotting a feature whose statistical significance has materially changed in an alternate model derived following the nullification of one or more specific features (i.e. this feature is considered to be a “big mover” in terms of statistical significance).
[0099] Specific examples of heuristic rules for selecting one or more features from the dataset that may be used include the use of a custom dictionary of frequently important terms built using a rules-based approach (e.g. gang names) and/or by clustering the terms in the dataset based on semantic similarity or logic (e.g. to list words related to weapons, locations, etc.). On the basis of such a custom dictionary, heuristic rules may be devised for selecting features as candidates for nullification, for example by systematically selecting as candidates for nullification features pertaining to certain specific semantic and/or logical groupings defined by the custom dictionary, and/or by automatically selecting as candidates for nullification features from the custom dictionary that, in the initial model derived, can be classified as being statistically significant based on the statistical significance information conveyed by the initial model.
[0100] Following steps 2028 and/or 2026, the process continues with step 108 shown in
[0101] At step 110, the machine learning process is applied with the nullified set of features to derive the alternate model. In some specific implementations, rather than applying the machine learning process “de novo” to the dataset with the set of nullified features to generate the alternate model, a machine learning re-optimization process may be applied to the initial model derived at step 104 while nullifying the contribution of the set of features selected as candidates for nullification in the dataset. Applying machine learning re-optimization may provide some advantages, including lower computational requirements for generating the alternate model compared to the initial model, and may permit the alternate model to converge closer to the maxima of the initial model, reducing the likelihood that the initial model and the alternate model will reside in two different local optima.
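A hedged sketch of this re-optimization, using non-negative matrix factorization (one of the processes contemplated in paragraph [0125] below) as the machine learning process, is shown in plain NumPy; the update rule, matrix sizes and iteration counts are illustrative choices:

```python
import numpy as np

def nmf_updates(X, W, H, n_iter=100, eps=1e-9):
    """Standard multiplicative updates minimizing the Frobenius error ||X - WH||."""
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(0)
X = rng.random((20, 8))                                # documents x features
W0, H0 = rng.random((20, 2)), rng.random((2, 8))       # random starting factors
W_init, H_init = nmf_updates(X, W0.copy(), H0.copy())  # "initial model"

X_null = X.copy()
X_null[:, [0, 1]] = 0                                  # nullify two feature columns
# re-optimize from the initial optimum rather than refitting "de novo"
W_alt, H_alt = nmf_updates(X_null, W_init.copy(), H_init.copy(), n_iter=50)
```

Here the rows of `H` convey per-topic feature significances; the nullified columns converge to zero, while the remaining factors are adjusted starting from the initial solution rather than from a random restart.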
[0102] This counter-intuitive approach at step 106 of nullifying a contribution of statistically more important features and re-optimizing the machine learning process based on the initial model to derive the alternate model provides an unexpected benefit of allowing the alternate model to bring to light previously less statistically significant features that may have been overshadowed by more important features in the initial model.
[0103] Following completion of step 106 and once the initial and alternate models are obtained, at step 112, a user interface is rendered on a display screen presenting information for assisting the user in identifying the information in the dataset meaningful to the investigation. More specifically, at step 114, the first statistical significance information and the second statistical significance information conveyed by the initial and alternate models are processed to derive the information to be presented on the user interface. The inventors have noted that by considering differences/variations between the first and second statistical significance information resulting from the manipulations of some features in the dataset (in this example by the nullification of the contribution of some features in the dataset), some insights into patterns and/or trends in the dataset can be obtained.
[0104] The information presented to the user may take different suitable forms in different implementations.
[0105] In a first example, the information presented for assisting the user in identifying the information in the dataset meaningful to the investigation may convey a first set of features in the dataset classified as statistically significant based at least in part on the first statistical significance information; and a second set of features in the dataset classified as statistically significant based at least in part on the second statistical significance information. By presenting both sets of features concurrently on the interface, the user may view and ascertain differences/variations, which may help in identifying meaningful information, including in some cases trends and/or patterns.
[0106] In a second example, the information presented for assisting the user in identifying the information in the dataset meaningful to the investigation may convey information derived by performing a comparison between the initial model and the alternate model. Such a comparison may take on various forms in practical implementations.
[0107] A specific example of a process for deriving information to be presented for assisting the user in identifying the information in the dataset meaningful to the investigation by performing a comparison between the initial model and the alternate model is shown in
[0108] As depicted at step 302, the initial model derived at step 104 (shown in
[0109] Following this, at step 306, a comparison between the initial model and the alternate model is performed to identify information in the dataset meaningful to the investigation. Different suitable approaches may be used for comparing the initial and alternate models in practical implementations. In the example depicted, the comparison performed includes step 306a, which aims to identify a specific subset of features in the dataset presenting a greater change in statistical significance between the initial model and the alternate model relative to other features in the dataset. In this embodiment, this comparison aims to identify the features whose statistical significance was most affected by the nullification of the set of features used to derive the alternate model, in order to draw the user's attention to these features.
[0110] In a specific practical implementation, step 306a, to identify a specific subset of features in the dataset presenting a greater change in statistical significance than other terms, includes step 308 of comparing the initial model and the alternate model and assigning a ranking to features in the dataset at least in part based on changes in statistical significance of the features between the initial model and the alternate model. Following this, at step 310, the rankings are used to identify the specific subset of features in the dataset presenting a greater change than other features. The identification of the features may be performed based on different criteria, including comparing the variation between the first and second statistical significance information against a threshold value and/or identifying a specific number of features in the dataset as being the ones that presented the greatest change. In practical implementations, the specific number of features identified as presenting the greatest change may be a fixed number of features or may be an operational parameter selectable by the user by providing an input through a user interface.
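Steps 308 and 310 may be sketched as follows, assuming per-feature significances have been extracted from each model into mappings (the values shown are hypothetical):

```python
def biggest_changes(initial, alternate, threshold=None, top_n=None):
    """Rank features by the absolute change in statistical significance
    between the initial and alternate models (step 308), then keep either
    the features whose change exceeds a threshold or a fixed number of the
    largest movers (step 310)."""
    deltas = {f: abs(alternate.get(f, 0.0) - s) for f, s in initial.items()}
    ranked = sorted(deltas, key=deltas.get, reverse=True)
    if threshold is not None:
        return [f for f in ranked if deltas[f] >= threshold]
    return ranked[:top_n]

# hypothetical per-feature significances from the two models
initial = {"gun": 0.9, "park": 0.2, "night": 0.1}
alternate = {"gun": 0.0, "park": 0.7, "night": 0.15}
```

Either criterion mirrors the alternatives described above: a threshold on the variation, or a fixed (possibly user-selectable) number of features.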
[0111] It is to be appreciated that the information derived by performing a comparison between the initial model and the alternate model may be presented to the user alone or together with other information including, but not limited to, information conveying: [0112] i. the first set of features in the dataset classified as statistically significant based at least in part on the first statistical significance information; and [0113] ii. the second set of features in the dataset classified as statistically significant based at least in part on the second statistical significance information.
[0114] The user interface may present the information pertaining to the initial model, the alternate model and the information derived by comparing both models in a variety of different suitable manners. For example, graphical representations may be presented to show similarities, and dynamic graphical displays may be contemplated to visually convey changes between the initial and the alternate models. Specific details pertaining to such graphical display approaches are beyond the scope of the present application and as such will not be described in greater specificity in the present document.
[0115] System for Assisting a User
[0116] With reference to
[0117] As depicted, the system 400 comprises a memory module 402 storing a dataset based on which the investigation is being performed. In a very specific example, the system 400 may be used for performing police investigations and the dataset may be comprised of a corpus of documents including police reports. It is to be appreciated that while the memory module 402 has been depicted as a single entity storing all the dataset, in practical application such memory module 402 may be comprised of one or more memory storage devices which may be co-located or, alternatively, distributed over a network in communication with other components of the system 400.
[0118] The system 400 also comprises a display device 420 on which a user interface is rendered for presenting information for assisting the user in identifying the information in the dataset meaningful to the investigation. Optionally the user interface may also be used for presenting user-selectable options to the user, including for example suggested features to be included as part of the candidates for nullification in the process described above with reference to
[0119] The system 400 also comprises one or more user input devices 422 for allowing a user of the system to provide user commands and user selections and (together with the display device 420) to otherwise interact with the machine learning system 404 when conducting an investigation. The one or more input devices 422 may include various types of suitable input devices including, for example but without being limited to, a keyboard, mouse, touch-sensitive screen and audio/voice input.
[0120] The system 400 also comprises a machine learning system 404 in communication with the memory module 402, the user input device(s) 422 and the display device 420. The machine learning system 404 is configured to process the dataset stored in memory module 402 according to steps 104, 106 and 112 (described above with reference to
[0121] In the embodiment depicted, the machine learning system 404 includes various functional modules for implementing various aspects of the method depicted in
[0122] The data cleaning module 406 and the feature engineering module 408 are configured to receive and process features in the dataset 402 before processing by the machine learning engine 410 to eliminate information unlikely to be relevant to the investigation.
[0123] For example, a data cleaning process implemented by the data cleaning module 406 may be applied to the dataset to remove punctuation, to remove articles, to fix incomplete data and the like. Any suitable method known in the art for cleaning a dataset may be used in practical implementations of the cleaning module 406.
[0124] With respect to the feature engineering process implemented by the feature engineering module 408, this process may be applied to the dataset 402 to modify original features and/or remove original features considered insignificant given the nature of the dataset and/or the nature of the investigation. For example, in specific implementations in which the dataset 402 includes a corpus comprised of police reports and the features in the dataset include terms and/or groups of terms from the corpus, the feature engineering process implemented by the feature engineering module 408 may comprise removing features pertaining to standard police reports, for example terms shared by most of the reports in the corpus while providing little or no practical insight into a specific investigation. Terms such as “police report”, “forensics”, “investigation”, “victim”, “detective” and the like may be removed as these terms appear in the great majority of the corpus. Any suitable method known in the art for performing feature engineering, including identifying and removing such terms (or groups of terms) from the corpus, may be used in practical implementations of the feature engineering module 408. In a specific practical implementation, the feature engineering process may include applying a process using a term frequency-inverse document frequency (TF-IDF) statistic to the dataset to identify at least some terms of the set of investigation specific stop terms to be removed from the corpus.
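The TF-IDF heuristic for flagging corpus-wide stop-term candidates may be sketched as follows; the scoring cutoff, the tokenization and the toy corpus are illustrative assumptions:

```python
import math

def stop_term_candidates(docs, max_score=0.05):
    """Flag terms whose best TF-IDF score across the corpus is very low:
    a term present in (nearly) every document has a near-zero inverse
    document frequency, marking it as a stop-term candidate."""
    tokenized = [d.split() for d in docs]
    vocab = {t for doc in tokenized for t in doc}
    n = len(tokenized)
    candidates = []
    for term in vocab:
        df = sum(term in doc for doc in tokenized)      # document frequency
        idf = math.log(n / df)                          # inverse document frequency
        best = max(doc.count(term) / len(doc) * idf for doc in tokenized)
        if best <= max_score:
            candidates.append(term)
    return sorted(candidates)

# toy corpus of police-report-like snippets (illustrative)
docs = ["police report gun", "police report park", "police report night"]
```

Terms shared by every report score zero and are flagged, while investigation-specific terms survive.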
[0125] The machine learning engine 410 implements a suitable machine learning process that may be applied to a dataset for generating a model conveying statistical significance information corresponding to features in the dataset. The specific process applied by the machine learning engine 410 may vary between practical implementations. Specific examples of processes that may be contemplated include, without being limited to, various topic modelling processes (e.g. Latent Dirichlet Allocation (LDA), LDA2Vec, latent semantic analysis (LSA), hierarchical Latent Dirichlet Allocation (hLDA) and non-negative matrix factorization (NMF or NNMF)) and generalized linear modelling (GLM) processes.
[0126] With reference to the machine learning system 404 shown in
[0129] The “candidates for nullification” identification module 414 is configured to select a subset of features as candidates for nullification from the features of the dataset 402 and provide this subset to the machine learning engine 410. In a specific implementation, the subset of features selected corresponds to features assigned a higher level of statistical significance in the initial model (and thus potentially considered statistically more important features in the initial model). The module 414 receives the initial model conveying first statistical significance information from the machine learning engine 410 and processes this initial model to derive candidates for the subset. Optionally, for practical implementations allowing the user of the system 400 to influence the selection of the candidates for nullification, the module 414 may be in communication: (i) with the display device 420 to present suggested user-selectable options; and (ii) with the one or more user input devices 422 for receiving user selections of one or more features for nullification. In this regard, in some specific examples of implementation, the module 414 may implement a process of the type described above with reference to
[0130] The meaningful information identification module 418 is configured to process the initial model 412 and the alternate model 416 received from the machine learning engine 410 in order to render a user interface on the display device 420 presenting information for assisting the user in identifying the information in the dataset 402 meaningful to the investigation. In some specific practical implementations, the meaningful information identification module 418 implements the process shown in
[0131] Topic Modelling and Latent Dirichlet Allocation (LDA)
[0132] While the machine learning process implemented by the machine learning engine 410 to derive the initial and alternate models is not specifically limited to the use of topic modelling, such as for example Latent Dirichlet Allocation (LDA), the use of this type of process may present some interesting advantages in some practical implementations. For this reason, a specific implementation of the process described in
[0133] Generally speaking, topic modeling is a type of statistical modeling for discovering abstract “topics” that occur in a collection of documents. In a very specific implementation, the topic modelling process is Latent Dirichlet Allocation (LDA), which is an example of a topic model used to classify terms (e.g. words) in a document to a particular topic. The LDA model can be considered to construct a topics-per-document matrix and a terms-per-topic matrix modeled as Dirichlet distributions. Essentially, each document in the dataset can be considered to be reduced to a “bag of terms”, and LDA then classifies each of these terms, within a document, to a particular topic. The general philosophy behind LDA is that if some terms appear frequently together in the corpus, it is likely because they are expressions of the same topic. The specific model generated by LDA is derived relying on certain specific assumptions, namely: [0134] i. the LDA model supposes the existence of N topics in the corpus; and [0135] ii. the presence of a term (e.g. word) in a document is a manifestation of a topic.
[0136] The model generated includes: [0137] i. The definition of each topic, including information conveying statistical significance information pertaining to the terms (e.g. words) in each topic; and [0138] ii. The topic composition of each document
[0139] It is noted that each topic is composed of a mixture of terms, specifically, that each topic is a convex combination of all the terms present in the corpus. In addition, each document is composed of a mixture of topics. LDA processes, and the mathematical models used in connection with such processes, are well known in the art of machine learning and as such will not be described in further detail here.
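The two matrices described above can be illustrated with toy numbers (illustrative values, not fitted by any model), showing that each topic is a convex combination of all the terms in the corpus and each document is a mixture of topics:

```python
import numpy as np

terms = ["gun", "knife", "park", "night"]               # hypothetical vocabulary
terms_per_topic = np.array([[0.60, 0.30, 0.05, 0.05],   # topic 0: weapons-heavy
                            [0.10, 0.10, 0.50, 0.30]])  # topic 1: locations/time
topics_per_doc = np.array([[0.80, 0.20],                # document 0: mostly topic 0
                           [0.25, 0.75]])               # document 1: mostly topic 1

# each topic is a convex combination of terms; each document a mixture of topics
assert np.allclose(terms_per_topic.sum(axis=1), 1.0)
assert np.allclose(topics_per_doc.sum(axis=1), 1.0)

# a document's expected term distribution is the mixture of its topics' term
# distributions
doc0_terms = topics_per_doc[0] @ terms_per_topic
```

The per-topic rows of `terms_per_topic` play the role of the statistical significance information conveyed for terms in each topic.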
[0140] With reference to
[0141] At step 502, which is analogous to step 102 in
[0142] At step 503, which is analogous to step 104 in
[0143] As depicted, step 503 may include sub-step 504, sub-step 506 and sub-step 508.
[0144] Sub-step 504 is for applying a data-cleaning process and/or feature engineering process to the dataset, of the type described previously, to remove a contribution associated with features considered insignificant to the investigation. The set of insignificant features may include, for example, a set of common stop terms and/or a set of investigation specific stop terms. As mentioned above, various approaches may be used at sub-step 504, including using a dictionary of common stop terms and/or investigation specific stop terms. In some specific implementations, sub-step 504 includes applying a process using a term frequency-inverse document frequency (TF-IDF) statistic to the dataset to identify at least some common stop terms and/or some of the investigation specific stop terms.
[0145] Sub-step 506 is for obtaining information conveying a number of topics to be derived from the dataset using the machine learning process. The number of topics to be derived may vary based on several factors including, for example, the nature of the information in the dataset, user preferences as well as other factors. The number of topics may range between 1 and 20; preferably between 1 and 10; more preferably between 5 and 9. In a specific practical implementation, 7 topics have been used. The number of topics to be derived may be fixed or, alternatively, may be a programmable parameter of the system whereby a user (and/or an administrator) may provide this information during a configuration of the system. Once the number of topics to be derived is obtained, the process proceeds to step 508.
[0146] Alternatively, the number of topics at step 506 to be identified may be selected at least in part based on a user input. In such implementations, the user may be prompted through the user interface to provide a user input specifying the desired number of topics to be derived. The prompt may be in the form of a window including a set of user selectable options for the number of topics. In practical implementations of the method, the number of topics selected may be configured to lie within a certain specific range and the user may be presented with user selectable options in that range. In a specific implementation, the certain specific range is configured to be 9 or fewer topics, preferably between 5 and 9 and most preferably 7 topics. In response to the user selection of the number of topics to be derived, the process proceeds to step 508.
[0147] Alternatively, the number of topics at step 506 may be derived using an automated process, which may be trial-and-error based, aiming to satisfy certain criteria, configured for processing the dataset to derive a desirable number of topics. Any suitable method known in the art for selecting a number of topics to be derived in connection with a topic modelling process, such as LDA, may be used here. Once the number of topics to be derived is obtained using the automated process, the process proceeds to step 508.
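One possible trial-and-error selection is sketched below, using a crude NMF-style fit and a penalized reconstruction-error score as a stand-in for perplexity or coherence measures (both the fitting process and the scoring function are assumptions); the candidate range mirrors the "9 or fewer topics" configuration mentioned above:

```python
import numpy as np

def pick_topic_count(X, candidates=range(1, 10), penalty=0.5, seed=0):
    """Fit a model for each candidate topic count and keep the count with the
    best score (reconstruction error plus a per-topic penalty)."""
    rng = np.random.default_rng(seed)
    best_k, best_score = None, np.inf
    for k in candidates:
        W = rng.random((X.shape[0], k))
        H = rng.random((k, X.shape[1]))
        for _ in range(300):                             # crude NMF-style fit
            H *= (W.T @ X) / (W.T @ W @ H + 1e-9)
            W *= (X @ H.T) / (W @ H @ H.T + 1e-9)
        score = np.linalg.norm(X - W @ H) + penalty * k  # penalized error
        if score < best_score:
            best_k, best_score = k, score
    return best_k

# toy corpus with two clearly separated topics (block structure, illustrative)
X = np.zeros((30, 10))
X[:15, :5] = 1.0
X[15:, 5:] = 1.0
chosen = pick_topic_count(X)
```

On such clearly separated data the penalized score bottoms out near the true number of underlying topics.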
[0148] At step 508, a topic modelling process, such as LDA, is applied to the dataset to derive the initial model which conveys: [0149] i. a set of topics identified in the dataset (the number of topics in the set corresponding to the number of topics obtained at step 506); and [0150] ii. first statistical significance information for features in the dataset, the first statistical significance information conveying a relevance of respective features (terms) in the dataset to each topic in the set of topics identified in the dataset.
[0151] The process then proceeds to step 509.
[0152] At step 509, which is analogous to step 106 described in
[0153] In particular, step 509 for deriving the alternate model conveying the second statistical significance information includes step 510 of selecting a set of features in the dataset as candidates for nullification and then, at step 512, applying the machine learning process with the nullified set of features to derive the alternate model. The selection of the features as candidates for nullification at step 510 may be performed automatically (based on one or more specific criteria) and/or based on one or more user inputs and may be performed in a manner similar to that described previously with reference to
[0154] Following step 509 and once the initial and alternate models are obtained, the process proceeds to a step (not shown in the Figures) analogous to step 112 in which a user interface is rendered on a display screen presenting information for assisting the user in identifying the information in the dataset meaningful to the investigation.
[0155]
[0156]
[0157] As shown, the process includes providing a dataset including a plurality of documents constituting a corpus 702 based on which the investigation is to be conducted. Each document is composed of terms. Element 704 is presented to illustrate a portion of the content of one of the documents in the corpus 702. At step 706, which is analogous to step 104 in
[0158] At step 760, the initial model 708A and 708B and the alternate model 758A and 758B are compared to consider differences/variations between the first and second statistical significance information resulting from the nullification of the terms selected at step 752 in order to derive information that may be meaningful to the investigation but may have been obstructed by some of the more statistically significant terms in the dataset. For example, the information derived at step 760 may identify a set of terms in the dataset presenting a greater change in statistical significance between the initial model and the alternate model relative to other terms in the dataset. Step 760 may be implemented in a manner similar to what was discussed earlier with reference to step 306 (
[0159]
[0160] It is to be appreciated that the screen shots shown in
[0161] Process for Assisting a User—Other Embodiment
[0162] With reference to
[0163] The method 1000 depicted in
[0164] At step 1002, which is analogous to step 102 of
[0165] At step 1004, which is analogous to step 104 of
[0166] At step 1006, a user interface is rendered on a display device to present a set of suggested user-selectable features for nullification, the suggested user-selectable features corresponding to statistically important features conveyed by the initial model derived at step 1004. The set of suggested user-selectable features may be derived in different manners, such as for example in a manner similar to what is described with reference to steps 2022 and 2024 (in path “A”) in
[0167]
[0168] The process then proceeds to step 1008 in which, in response to receipt of a user selection of one or more features from the suggested user-selectable options, an alternate model is derived at least in part by processing the dataset using the machine learning process while nullifying a contribution of the one or more features specified by the user selection. Step 1008 may be implemented in a manner similar to step 110 described with reference to
[0169] Following completion of step 1008 and once the initial and alternate models are obtained, at step 1010, the user interface displayed on the display screen is adapted to present information for assisting the user in identifying the information in the dataset meaningful to the investigation. More specifically, at step 1010, the first statistical significance information and the second statistical significance information conveyed by the initial and alternate models are processed to derive the information to be presented on the user interface. The inventors have noted that by considering differences/variations between the first and second statistical significance information resulting from the manipulations of some features in the dataset (in this example by the nullification of the contribution of some features in the dataset), some insights into patterns and/or trends in the dataset can be obtained.
[0170] The information presented to the user may take different suitable forms in different implementations. Examples of different suitable forms in which the information may be presented were described in connection with step 112 (
[0171]
[0172] It is to be appreciated that the examples described, and the configuration of the user interface in
[0173] Practical Examples of Implementation
[0174] Those skilled in the art should appreciate that in some non-limiting embodiments, all or part of the functionality previously described herein with respect to the processing system 404 depicted in
[0175] Those skilled in the art should further appreciate that the program instructions may be written in a number of suitable programming languages for use with many computer architectures or operating systems.
[0176] In a non-limiting example, some or all the methods and processes described in the present disclosure may be implemented on a suitable computing system 1300, of the type depicted in
[0177] The computing system 1300 may also include additional interfaces, such as a network I/O interface (not shown in the figures) for exchanging data over a private (or public) computer network to enable the computing system 1300 to communicate with remote devices. Amongst others, this network I/O interface may enable the computing system 1300 to access remote devices including, without being limited to, external storage devices storing additional datasets that may be useful in conducting an investigation and/or memory devices for storing results of the processing described in the present disclosure, such as for example the different models derived by applying the machine learning processes described in the present disclosure.
[0178]
[0179] Note that titles or subtitles may be used throughout the present disclosure for convenience of a reader, but in no way should these limit the scope of the invention. Moreover, certain theories may be proposed and disclosed herein; however, in no way should they, whether right or wrong, limit the scope of the invention so long as the invention is practiced according to the present disclosure without regard for any particular theory or scheme of action.
[0180] All references cited throughout the specification are hereby incorporated by reference in their entirety for all purposes.
[0181] It will be understood by those of skill in the art that throughout the present specification, the term “a” used before a term encompasses embodiments containing one or more of what the term refers to. It will also be understood by those of skill in the art that throughout the present specification, the term “comprising”, which is synonymous with “including,” “containing,” or “characterized by,” is inclusive or open-ended and does not exclude additional, un-recited elements or method steps.
[0182] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. In the case of conflict, the present document, including definitions, will control.
[0183] Although various embodiments of the disclosure have been described and illustrated, it will be apparent to those skilled in the art in light of the present description that numerous modifications and variations can be made. The scope of the invention is defined more particularly in the appended claims.