System and methods for network sensitivity analysis
11568020 · 2023-01-31
Assignee
Inventors
- G. Edward Powell (Nashville, TN, US)
- Mark T. Lane (Franklin, TN)
- Stephen C. Bedard (Charlotte, NC, US)
- N. Edward White (Austin, TX)
Cpc classification
G06F17/16
PHYSICS
G06N3/126
PHYSICS
International classification
G06F15/16
PHYSICS
G06F17/16
PHYSICS
Abstract
A computer-implemented method to establish a relative importance of an input parameter p.sub.j in a plurality of input parameters p.sub.i in a data set input to a machine learning model, the data set represented by a j row by k column matrix I.sub.m, an intersection of each row with each column defining an element, the method includes for each of the plurality of parameters p.sub.i in the input data set, a computer sorts columns k.sub.i of the matrix I.sub.m. to produce a re-ordered matrix I.sub.m,j; the computer determines a hyper-parameter N* of sub-matrices into which may be sorted the values in a j.sup.th row of the re-ordered matrix I.sub.m,j; the computer generates a plurality of group sub-matrices G.sub.i, each of the group sub-matrices comprising a subset of columns and the jth row; the computer inputs the re-ordered matrix I.sub.m,j into a fully-trained machine learning model to produce machine learning model outputs; and the computer produces normalized mean values of the outputs.
Claims
1. A computer-implemented method to establish a relative importance of an input parameter p.sub.j in a plurality of input parameters P in a data set m input to a machine learning model, the data set m represented by a j row by k column matrix Im, an intersection of each row j with each column k defining an element j,k, the method, comprising: for the input parameter p.sub.j in the plurality of input parameters P in the input data set m, a computer sorts columns k.sub.i of the matrix I.sub.m to produce a re-ordered matrix I.sub.m,j by reordering the columns k.sub.i wherein elements j,k in a j.sup.th row are arranged in order of parameter values of the input parameters P; the computer determines a hyper-parameter N* of sub-matrices G.sub.i into which are sorted re-ordered columns k.sub.i according to the elements j,k in the j.sup.th row of the re-ordered matrix I.sub.m,j; the computer generates a plurality of group sub-matrices G.sub.i, each of the plurality of group sub-matrices G.sub.i comprising a subset of the re-ordered columns k.sub.i and the j.sup.th row of the re-ordered matrix I.sub.m,j; the computer inputs the re-ordered matrix I.sub.m,j into a fully-trained machine learning model to produce machine learning model outputs O.sub.i by sequentially imputing data input vectors, defined as the reordered columns k.sub.i, according to positions of re-ordered columns k.sub.i in each of the plurality of group sub-matrices G.sub.i; and the computer produces normalized mean values of the machine learning model outputs O.sub.i.
2. The method of claim 1, wherein to produce normalized mean values of the machine learning model outputs O.sub.i, the computer: computes an average of the machine learning model outputs O.sub.i for each group sub-matrix G.sub.i; computes output means of the machine learning model outputs O.sub.i for each data input vector in an i.sup.th group sub-matrix G.sub.i; computes a mean value of elements in the j.sup.th row of the group sub-matrix G.sub.i; normalizes each of the output means by dividing the output means of the outputs O.sub.i by a mean of output È[O.sub.i] where È is an Expected Value of elements in the j.sup.th row of the i.sup.th group sub-matrix G.sub.i; and the computer makes the normalized mean values available for display as a network sensitivity analysis curve.
3. The method of claim 1, wherein the computer sorts the columns k.sub.i in an ascending order of parameter values of the plurality of input parameters P.
4. The method of claim 1, wherein the computer determines the hyper-parameter N* as a default number of columns k.sub.i.
5. The method of claim 1, wherein the computer determines the hyper-parameter N* as a function of a number of discrete elements present in the j.sup.th row of the input matrix I.sub.m.
6. The method of claim 1, wherein a matrix G.sub.j,k is represented as a single row j and multiple columns k.
7. The method of claim 1, wherein the computer determines a relative strength of the input parameter p.sub.j and makes the relative strength available for display.
8. The method of claim 7, further comprising: computing a first Network Sensitivity Analysis (NSA) curve with all m input data sets; sequentially computing m additional NSA curves with, sequentially, each of the m input data sets removed; and using a difference between the first NSA curve and each of the m additional NSA curves as an indication of the relative strength of a removed input data set, wherein the difference may be expressed as a difference in areas under the NSA curves for each of the m input data sets.
9. The method of claim 8, wherein the computer successively varies a value of the hyper-parameter N* to determine a contribution of the input parameter p.sub.j to the NSA curve.
10. A non-transitory computer-readable storage medium having encoded thereon machine instructions for producing data to enable display of a network sensitivity analysis curve, the machine instructions when executed by a processor, causing the processor to: for each parameter p.sub.j of a plurality of parameters P in an input data set m, sort columns k.sub.i of a j row by k column matrix I.sub.m, an intersection of a j.sup.th row and a k.sup.th column defining an element j,k, to produce a reordered matrix I.sub.m,j by reordering the columns k.sub.i wherein elements j,k in the j.sup.th row are arranged in order of parameter values of the plurality of parameters P; determine a hyper-parameter N* of sub-matrices into which are sorted the reordered columns k.sub.i according to the elements j,k in the j.sup.th row of the reordered matrix I.sub.m,j; generate a plurality of group sub-matrices G.sub.i, each of the group sub-matrices G.sub.i comprising a subset of the columns k.sub.i and the j.sup.th row of the reordered matrix I.sub.m,j; input re-ordered matrix I.sub.m,j into a fully-trained machine learning model to produce machine learning model outputs O.sub.i, comprising sequentially imputing data input vectors, defined as the re-ordered columns k.sub.i, according to positions of the re-ordered columns in each of the plurality of group sub-matrices G.sub.i; and produce normalized mean values of the machine learning model outputs O.sub.i.
11. The computer-readable storage medium of claim 10, wherein to produce the normalized mean values of the machine learning model outputs Oi, the processor: computes an average of the machine learning model outputs O.sub.i for each group sub-matrix G.sub.i; computes output means of the machine learning model outputs O.sub.i; computes a mean value of elements in the j.sup.th row of each group sub-matrix G.sub.i; and normalizes each of the output means comprising dividing the output means of the outputs Oi by a mean of output È[Oi] where È is an Expected Value of the elements in the j.sup.th row of an i.sup.th group sub-matrix Gi.
12. The computer-readable storage medium of claim 10, wherein the computer sorts the columns k.sub.i in an ascending order of parameter values of the plurality of parameters P.
13. The computer-readable storage medium of claim 10, wherein the computer determines the hyper-parameter N* as a default number of columns k.sub.i.
14. The computer-readable storage medium of claim 10, wherein the computer determines the hyper-parameter N* as a function of a number of discrete elements present in the j.sup.th row in the matrix I.sub.m.
15. The computer-readable storage medium of claim 10, wherein a group sub-matrix G.sub.j,k represents a single row j and multiple columns k.sub.i.
16. The computer-readable storage medium of claim 10, wherein the computer determines a relative strength of a parameter p.sub.j and makes the relative strength available for display.
17. A computer-implemented method for determining a relative contribution of a parameter p.sub.j in an input data set m to an output of a machine learning model, comprising: from the input data set m, extract, using a computer, one or more parameters P.sub.i and two or more entities E.sub.i to generate a matrix I.sub.m of j rows of parameters P and k columns of the entities E.sub.i, an intersection of a j.sup.th row and a k.sup.th column defining an element j,k of the matrix I.sub.m; sort the entities E.sub.i of the matrix I.sub.m to produce a re-ordered matrix I.sub.m,j by re-ordering the k columns such that elements are arranged in order based on a parameter value of each parameter P.sub.i; generate a plurality of sub-matrices G.sub.i, each of the plurality of sub-matrices G.sub.i comprising a row of the parameters P and a plurality k* of the k columns, where k*<k; arrange each of the re-ordered columns k in order in one of the plurality of sub-matrices G.sub.i; apply each of the plurality of sub-matrices G.sub.i sequentially to the machine learning model to generate outputs O.sub.i; and display the outputs O.sub.i.
18. The method of claim 17, comprising: computing a normalized mean of the outputs O.sub.i, comprising: computing an average of the generated outputs O.sub.i for each group sub-matrix G.sub.i; computing a mean of the outputs O.sub.i; computing a mean value of the j.sup.th row of the group sub-matrix G.sub.i; and normalizing each of the output means by dividing the output means by a mean of output È[O.sub.i] where È is an Expected Value of elements in the j.sup.th row of an i.sup.th group sub-matrix G.sub.i.
19. The method of claim 17, further comprising determining a hyper-parameter N* of sub-matrices G.sub.i into which are sorted re-ordered columns k.sub.i according to the elements j,k in a j.sup.th row of the re-ordered matrix I.sub.m,j.
20. The method of claim 19, further comprising successively varying a value of the hyper-parameter N* to determine a contribution of parameter p.sub.j to an NSA curve.
Description
DESCRIPTION OF THE DRAWINGS
(1) The detailed description refers to the following figures in which like numerals refer to like items, and in which:
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
DETAILED DESCRIPTION
(12) Machine learning may be used to help humans to understand the structure of data and fit that data into models that also may be understood and used by humans. Machine learning algorithms differ from traditional computer algorithms in that machine learning algorithms allow computers to train on data inputs and use statistical analysis to output values that fall within a specific range. Machine learning allows computers to build models from sample data in order to automate decision-making processes based on data inputs. Machine learning methods generally consist of supervised learning and/or unsupervised learning. Supervised learning trains models using algorithms based on example input and output data that is labeled by humans, and unsupervised learning provides the algorithm with no labeled data in order to allow it to find structure within its input data. Common machine learning algorithmic approaches include genetic algorithms, logistic regression, gradient descent algorithms, the k-nearest neighbor algorithm, decision tree learning, and deep learning. As one skilled in the art will appreciate regarding the instant specification, one or all of the above-listed algorithms, and other algorithms, may be used with the herein disclosed inventive concepts. In supervised learning, the computer is provided with example inputs that are labeled with their desired outputs. The purpose of this method is for an algorithm to be able to “learn” by comparing its actual output with the “taught” outputs to find errors, and modify the model accordingly. Supervised learning therefore uses patterns to predict label values on additional unlabeled data. In unsupervised learning, data is unlabeled, so the learning algorithm is left to find commonalities among its input data. Because unlabeled data are more abundant than labeled data, machine learning methods that facilitate unsupervised learning are particularly valuable. 
The goal of unsupervised learning may be as straightforward as discovering hidden patterns within a data set, but it may also have a goal of feature learning, which allows the computational machine to automatically discover the representations that are needed to classify raw data. Unsupervised learning is commonly used for transactional data. Without being told a “correct” answer, unsupervised learning methods may look at complex data that is more expansive and seemingly unrelated in order to organize it in potentially meaningful ways. Unsupervised learning may be used for anomaly detection including for fraudulent credit card purchases, and recommender systems that recommend what products to buy next. The k-nearest neighbor algorithm is a pattern recognition model that may be used for classification as well as regression. Often abbreviated as k-NN, the k in k-nearest neighbor is a positive integer, which is typically small. In either classification or regression, the input will consist of the k closest training examples within a space. In this method, the output is class membership. This will assign a new object to the class most common among its k nearest neighbors. In the case of k=1, the object is assigned to the class of the single nearest neighbor. Among the most basic of machine learning algorithms, k-nearest neighbor is considered to be a type of “lazy learning” as generalization beyond the training data does not occur until a query is made to the system. For general use, decision trees are employed to visually represent decisions and show or inform decision making. When working with machine learning and data mining, decision trees are used as a predictive model. These models map observations about data to conclusions about the data's target value. The goal of decision tree learning is to create a model that will predict the value of a target based on input variables. 
In the predictive model, the data's attributes that are determined through observation are represented by the branches, while the conclusions about the data's target value are represented in the leaves. When “learning” a tree, the source data is divided into subsets based on an attribute value test, which is repeated on each of the derived subsets recursively. Once every item in the subset at a node has the same target value, the recursion is complete. Deep learning attempts to imitate how the human brain may process light and sound stimuli into vision and hearing. A deep learning architecture is inspired by biological neural networks and consists of multiple layers in an artificial neural network, typically executed on hardware such as GPUs. Deep learning uses a cascade of nonlinear processing unit layers in order to extract or transform features (or representations) of the data. The output of one layer serves as the input of the successive layer. In deep learning, algorithms may be either supervised and serve to classify data, or unsupervised and perform pattern analysis. Among the machine learning algorithms that are currently being used and developed, deep learning absorbs the most data and has been able to beat humans in some cognitive tasks.
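By way of illustration only, the k-nearest neighbor classification described above may be sketched in Python as follows. The function name knn_classify, the sample data, and the use of Euclidean distance are assumptions made for this sketch and are not taken from the specification.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Assign `query` to the class most common among its k nearest neighbors.

    `train` is a list of (feature_vector, label) pairs. Euclidean distance
    is an assumption of this sketch; other metrics may be substituted.
    """
    # No model is fit in advance ("lazy learning"): all work happens at query time.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled examples in a two-dimensional feature space.
train = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"),
    ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.2, 0.8), "b"),
]

label_k1 = knn_classify(train, (0.0, 0.1), k=1)  # class of the single nearest neighbor
label_k3 = knn_classify(train, (1.0, 0.9), k=3)  # majority vote among three neighbors
```

With k=1 the query simply inherits the label of its closest training example, as stated above.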
(13)
(14) The network 50 may be any communications network that allows the transmission of signals, media, messages, voice, and data among the entities shown in
(15) In an aspect, the human users 31 and data sources 20 all may be independent of each other. In another aspect, the data sources 20, for example, may belong to an organization, such as a business or government agency, and the human user 31 may work for, or otherwise be associated with the organization. In addition, end users 30 themselves may be data sources.
(16) The human users 31 may desire to gain insights into data received at and processed by the system 100. In an aspect, one or more of the human users 21 may desire to gain recognition for the data 25 provided by their respective data sources 20. Thus, the human users 21 and 31 may cooperate in a process in which data 25 are supplied, insights are gleaned from the data 25, and an individual human user 21 (or the associated data source 20) providing the data 25 receives a measure of recognition based on the importance of the insights gleaned from the data 25 provided by the human user 21's data source.
(17) The system 100 may be implemented on specially-programmed hardware platform 102. Such a platform is shown in
(18)
(19)
(20) Thus, the engine 110 may combine both batch and streaming data processing. Data 25 may first be processed by streaming data component 111 to extract real-time insights, and then persisted into a data store 103 (see
(21) The engine 110 also includes data conditioning module 114, which may execute to organize and/or configure data for use in a specific model, and to clean up faulty data. Many machine learning algorithms show poorer performance when instances in a data set are missing features or values, as compared to the same algorithm operating with a complete data set. In an aspect, the data conditioning module 114 may pre-process the input data sets 25 to replace a missing feature value with, for example, the median or the mean of all feature values that are present in the instance. This median (or mean) value may be used during training and testing of the model. The same value may be used when applying the model to a new data set in which instances are missing feature values. Of course, this process is relatively straightforward when the feature values are expressed as numbers but not so straightforward when the data set includes missing text entries. In this latter situation, the module 114 may be configured to assign numerical values (e.g., 0 or 1) to a missing text value. Other data transforms such as scaling and normalizing feature values may improve the performance of selected algorithms. Finally, a model update component may include data and instructions to make changes to the machine learning model.
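The replacement of missing feature values described above may be sketched, by way of example only, as follows. The sketch assumes that a missing value is marked by None and that the median is taken per feature over the values that are present; the function names fit_medians and apply_medians are hypothetical.

```python
from statistics import median

def fit_medians(rows):
    """Compute, for each feature (column), the median of the values present."""
    n_features = len(rows[0])
    medians = []
    for j in range(n_features):
        present = [row[j] for row in rows if row[j] is not None]
        medians.append(median(present))
    return medians

def apply_medians(rows, medians):
    """Return a copy of `rows` with each missing value replaced by its feature median."""
    return [
        [medians[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]

# Hypothetical training instances; None marks a missing feature value.
train = [[1.0, 10.0], [None, 20.0], [3.0, None], [5.0, 40.0]]
medians = fit_medians(train)
train_filled = apply_medians(train, medians)

# The same medians are reused for a new data set with missing values.
new_filled = apply_medians([[None, 7.0]], medians)
```

Computing the medians once and reusing them reflects the statement above that the same value may be used when the model is applied to a new data set.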
(22) The engine 110 outputs training data set 113, test data set 115, neural network configuration file 117, and neural network configuration file 119. The outputs 113, 115, 117, and 119 from engine 110 are input to the machine learning engine 130. In an aspect, the engine 130 is implemented as a neural network model (neural network model 133—see
(23) Both the training data set 113 and the test data set 115 have parameters that are inputs to the model 133 with known outputs. The training data set 113 is used to cause the machine learning engine 130 to fit the input to the known outputs. The test data set 115 may be used to determine how well the engine 130 is able to generalize to input data that are not used to fit the engine parameters. A common problem with machine learning is “over-fitting,” whereby the machine learning engine is only able to match the training data set 113 with a desired level of accuracy. The test data set 115 may allow the human user 31 to understand how well the fit model will work with inputs that were not used during the training. Once the human user 31 is satisfied with the results of the test data process, the engine 130 may be trusted to produce acceptable predictions from future data set inputs.
(24) To address limitations of linear model predictions, the neural network model 133 may be a non-linear feed-forward supervised learning neural network model. One such model methodology is known as the resilient propagation algorithm, which is a modification of back propagation. The feed-forward neural network model 133 employs the fitting of weights, but in addition applies a non-linear sigmoidal activation function to each weight to give the model 133 the ability to recognize patterns in a non-linear fashion (as also is the result from the use of the hidden layers). Note that other activation functions may be used in lieu of the sigmoidal activation function.
(25) Using vector and matrix notation (bold-faced lowercase letters are vectors and bold-faced uppercase letters are matrices), the mathematical encoding of the neural network model 133 with one hidden layer is described. In this model 133, i denotes the vector of input-layer neurons, h denotes the vector of hidden-layer neurons, and o denotes the vector of output-layer neurons for any instantiation of data that comprises one cycle through the neural network model 133. Furthermore, d is the dimension of i, q is the dimension of h, and n is the dimension of o. W.sub.1 is a q by d matrix of weights to convert i into h. W.sub.2 is an n by q matrix of weights to convert h into o. Finally, f(x), where x is a vector, denotes the application of a logistic (or activation) function f to every element in x. Then the neural network model 133 is formulated by the following system of mathematical equations:
h=f(W.sub.1i) and o=f(W.sub.2h).
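The two equations above may be illustrated with the following minimal Python sketch of one forward cycle. The standard sigmoid is assumed as the activation function f, and the weight values, the helper matvec, and the function name forward are hypothetical.

```python
import math

def sigmoid(x):
    """Assumed activation function f; other activation functions may be used."""
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def forward(W1, W2, i, f=sigmoid):
    """Compute h = f(W1 i) and o = f(W2 h), applying f element by element."""
    h = [f(z) for z in matvec(W1, i)]
    o = [f(z) for z in matvec(W2, h)]
    return h, o

# Hypothetical weights with d = 2 inputs, q = 3 hidden neurons, n = 1 output.
W1 = [[0.5, -0.5], [1.0, 1.0], [0.0, 0.0]]   # q by d
W2 = [[1.0, -1.0, 0.0]]                      # n by q
h, o = forward(W1, W2, [1.0, 1.0])
```

The shapes follow the definitions given above: W1 has q rows and d columns, and W2 has n rows and q columns.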
The training data 113 with target output t is employed to fit the matrices W.sub.1 and W.sub.2 so as to minimize the sum of the squares of the errors |t−o| using the common L.sub.2 vector norm. Each cycle of data is passed through the model 133, and the error is used to back-propagate through the system of equations to update the weight matrices. This process is repeated by cycling through all of the training data 113 until convergence is reached. Once the weight matrices are calculated in this fashion, the model 133 may predict output quantities o for inputs outside the training data. One such logistic function is:
(26)
f(x)=tan.sup.−1(x) (b)
(27)
(28)
(29) The limit of this logistic function as x tends to negative infinity is 0 and is 1 as x tends to positive infinity. The logistic function's steepest slope is in the half-open interval of (0, 1). However, the logistic function may be of limited use when outputs from the neural network are to be negative. Other examples of possible logistic (or activation) functions include functions (b), (c), and (d). The limit of function (b) as x tends to negative infinity is −π/2 and is π/2 as x tends to positive infinity. Its steepest slope is in the interval of (−1, 1). The limit of function (c) as x tends to negative infinity is −1 and is 1 as x tends to positive infinity. Its steepest slope is in the interval of (−1, 1). The limit of function (d) as x tends to negative infinity is −1 and is 1 as x tends to positive infinity. Its steepest slope is in the interval of (−1, 1).
Each of the functions (b), (c), and (d) has the ability to support negative values. This obviates the need for additive adjustments to the data in order to force the values to be positive.
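As an illustration of the limits stated above, the following Python sketch evaluates function (b), the arctangent, at large positive and negative arguments. The hyperbolic tangent is included only as an assumed example of an activation function bounded by −1 and 1; it is not asserted to be function (c) or function (d), and the names arctan_act and tanh_act are hypothetical.

```python
import math

def arctan_act(x):
    """Function (b): f(x) = arctan(x), bounded by -pi/2 and pi/2."""
    return math.atan(x)

def tanh_act(x):
    """Assumed example of an activation bounded by -1 and 1 (not taken from the text)."""
    return math.tanh(x)

# Both functions return negative outputs for negative inputs, so no additive
# adjustment of the data is needed to force values to be positive.
low_b, high_b = arctan_act(-1e9), arctan_act(1e9)
low_t, high_t = tanh_act(-50.0), tanh_act(50.0)
negative_output = arctan_act(-1.0)
```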
(30)
(31) Back propagation networks such as the model 133 use a supervised learning technique where truth data are known in a training interval, the model 133 is trained using the error function E over this training interval, and the trained network models data over the test interval where the truth data are not known. The error function E may be written as:
E=½Σ.sub.π(t.sub.π−o.sub.π).sup.2
where π is an index that runs over the training interval. Updates to the weights W during back propagation are governed by the equation:
(32)
where μ is the learning rate. If μ is small enough, the above equation approaches the gradient descent method. Since the error E is a sum, the partial derivative also is a sum. Batch-mode (or epoch) based learning refers to a technique whereby the partial derivative is evaluated over the entire sum over the training interval in a cycle before a single correction to the weight matrices is made. By contrast, on-line learning refers to the case where the weight matrices are updated after each pattern p in the training interval, without waiting for the calculation of the entire cycle. There are advantages and disadvantages to both techniques. Batch-mode learning is more in tune with gradient descent, but on-line learning may converge better because the weights are updated continuously throughout the cycle.
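The difference between batch-mode and on-line learning may be sketched as follows. The sketch assumes that the weight update takes the standard gradient-descent form, in which each weight changes by −μ times the partial derivative of E with respect to that weight; it uses a hypothetical one-weight linear model o = w·x so that the derivative can be written out by hand. The function names and data are illustrative only.

```python
def batch_epoch(w, data, mu):
    """Batch mode: sum the gradient over the whole cycle, then correct once."""
    # For E = 1/2 * sum((t - o)^2) with o = w * x, dE/dw = sum(-(t - w*x) * x).
    grad = sum(-(t - w * x) * x for x, t in data)
    return w - mu * grad

def online_epoch(w, data, mu):
    """On-line mode: correct the weight after each pattern in the cycle."""
    for x, t in data:
        grad = -(t - w * x) * x
        w = w - mu * grad
    return w

# Hypothetical training interval generated by the relationship t = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_batch = w_online = 0.0
for _ in range(200):
    w_batch = batch_epoch(w_batch, data, mu=0.05)
    w_online = online_epoch(w_online, data, mu=0.05)
```

In this small example both modes converge to the same weight; they differ only in when the correction is applied within each cycle.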
(33) The system 100 may employ randomizing of the data in the cycle for feeding the training data set 113 to the neural network model 133. This prevents the time order of the training data from influencing the model 133 in the same way every cycle, and such data randomization may prevent the model 133 from being trapped into local minima or “ravines”. Another benefit of randomizing the presentation of data from the training data set 113 is the possibility of reducing large biases that could result from the training data always being presented to the model 133 in the same order.
(34) The system 100 may enhance the performance of the neural network model 133 through use of bias nodes and momentum. Bias nodes are artificial constructs in a neural network model that help to define a certain measure of balance to a classification scheme. Specifically, one node is added to all layers in the neural network model except the output layer, and the input to each of these additional nodes is set equal to 1. As the neural network model trains and learns the patterns of the training data set 113, bias nodes may help to separate the data into regions that are more easily classified. Bias nodes may be effective in many applications of a neural network. If {x} is the set of input data, with each x a vector of size n, then when bias nodes are used, the size of each x is increased to n+1, with x.sub.n+1=1. Then the size of each hidden layer, h of size q.sub.j, also is increased by one, with h.sub.qj+1=1.
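The effect of a bias node on a layer's input may be sketched as follows: the input vector is extended by one element fixed at 1, and the weight matrix gains one matching column. The function names with_bias and weighted_sums and the weight values are hypothetical.

```python
def with_bias(v):
    """Extend a vector of size n to size n + 1, with the added element set to 1."""
    return list(v) + [1.0]

def weighted_sums(W, v):
    """Weighted sums for one layer; the last column of W multiplies the bias node."""
    v_b = with_bias(v)
    return [sum(w * x for w, x in zip(row, v_b)) for row in W]

# Hypothetical layer with n = 2 inputs and one neuron; the third weight is the bias weight.
W = [[2.0, -1.0, 0.5]]
x_ext = with_bias([1.0, 3.0])
z = weighted_sums(W, [1.0, 3.0])  # 2.0*1.0 - 1.0*3.0 + 0.5*1.0
```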
(35) The momentum parameter may increase the convergence rate for the neural network model 133 as long as the momentum parameter is used in conjunction with a small learning rate μ. The idea is to weigh the previous correction to the weight matrices, so that learning for each change in the weight matrices does not follow a different path. Using the momentum parameter, α, the equation for correction of the weight matrices now becomes:
(36)
where α is the momentum parameter. There is a relationship between the momentum parameter α and the learning rate parameter μ: generally, when momentum α is large, the learning rate μ should be small.
(37) Momentum α and learning rate μ are hyper-parameters that may be input to the neural network model 133, but experience has shown that a large momentum is helpful in conjunction with a small learning rate. The momentum hyper-parameter amplifies the effective learning rate to μ′=μ/(1−α), so that large momentum values call for smaller learning rates, in general. Experience has shown that α=0.8 and μ<0.2 is best. Adaptive learning rate algorithms, disclosed herein, may lead to even smaller learning rates to keep a complex neural network model converging properly.
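The amplification μ′=μ/(1−α) may be checked numerically with the following sketch. It assumes that the momentum correction has the common form in which each weight change equals −μ times the gradient plus α times the previous weight change, and it holds the gradient constant so that the steady-state step size can be observed. The function name momentum_steps and the numeric values are hypothetical.

```python
def momentum_steps(grad, mu, alpha, n_steps):
    """Return successive weight changes under a constant gradient.

    Assumed update form: delta_t = -mu * grad + alpha * delta_(t-1).
    """
    delta = 0.0
    deltas = []
    for _ in range(n_steps):
        delta = -mu * grad + alpha * delta
        deltas.append(delta)
    return deltas

mu, alpha = 0.1, 0.8
deltas = momentum_steps(grad=1.0, mu=mu, alpha=alpha, n_steps=200)

# The first step uses the plain learning rate; later steps approach mu / (1 - alpha).
effective_rate = mu / (1.0 - alpha)
```

Under this assumption the step size grows from μ toward μ/(1−α), which is why a large momentum calls for a smaller learning rate.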
(38) It is clear that the learning rate and momentum hyper-parameter settings have a direct impact on the ability of a neural network model to learn, but it is not always clear how to pick good settings at the start of training. A solution to this problem may be to have the neural network model 133 adapt an adequate learning rate parameter as the model 133 is being trained.
(39) The algorithm 190 relies on a single learning rate parameter for all weights and utilizes logical rules to determine when to hold steady, increase, or decrease the learning rate μ. As shown in
(40) The adaptive learning rate algorithm 190 has the following effect: If the rms error r for the current cycle is at least 1% better than the previous cycle (and the learning rate parameter μ is less than 4), μ may be increased by 3%. If the error is not at least 1% better, but is still better, the algorithm 190 checks to see if r has been consistently decreasing for 10 cycles. If r has been consistently decreasing (and μ is not too large), then it is safe to increase μ by 3%. If the error for the current cycle has increased more than 3% from the previous cycle's error, cut μ by 30%. If r has not increased by more than 3%, but still has increased, the algorithm 190 increments the count for how many cycles r has been increasing. If r has increased 5 cycles in a row, then cut μ by 10%. If none of the above conditions are met, it is safe to make no change to the learning rate μ and go on to the next cycle. When the learning rate is cut by 30%, it makes sense to also set the momentum parameter α to 0 so that μ may have some cycles to settle, although increasing momentum may help to reduce oscillation. As soon as the error starts to decrease again, as desired, the momentum parameter α may be reset back to its original value.
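The rules of the adaptive learning rate algorithm 190 may be sketched, by way of example only, as the following Python function. The sketch assumes that the consecutive-cycle counters track the rms error r, that the counters are reset after a cut, and that "not too large" means the same limit of 4 for μ; the handling of the momentum parameter α is omitted. The function name adapt_learning_rate and the state dictionary are hypothetical.

```python
def adapt_learning_rate(mu, r_prev, r_curr, state, mu_max=4.0):
    """Return the learning rate for the next cycle.

    `state` holds two counters: "down" (consecutive cycles of decreasing error)
    and "up" (consecutive cycles of slightly increasing error).
    """
    if r_curr < r_prev:                      # error improved
        state["up"] = 0
        state["down"] += 1
        if r_curr <= 0.99 * r_prev:          # at least 1% better
            if mu < mu_max:
                mu *= 1.03
        elif state["down"] >= 10 and mu < mu_max:
            mu *= 1.03                       # slow but consistent improvement
    elif r_curr > r_prev:                    # error got worse
        state["down"] = 0
        if r_curr > 1.03 * r_prev:           # worse by more than 3%
            mu *= 0.70
            state["up"] = 0                  # assumption: restart the count after a cut
        else:
            state["up"] += 1
            if state["up"] >= 5:             # five small increases in a row
                mu *= 0.90
                state["up"] = 0
    return mu                                # unchanged if no rule applies

state = {"down": 0, "up": 0}
mu_after_gain = adapt_learning_rate(1.0, r_prev=1.0, r_curr=0.9, state=state)
mu_after_jump = adapt_learning_rate(1.0, r_prev=1.0, r_curr=1.1, state=state)
```

The percentages and cycle counts are those given in the paragraph above; they may be exposed as arguments if tuning is desired.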
(41) A model, such as those disclosed herein, produces an output O in response to a vector of input parameters x.sub.i in the m.sup.th input data set 25 (the number of data sets m may range from m=1 to m=N data sets). Thus, each of the m data sets 25 input to the model 133 may be represented by a collection of input vectors (an input matrix) I.sub.m=[x.sub.1, x.sub.2, x.sub.3 . . . x.sub.n] where each x.sub.k is the input vector related to the k.sup.th entity. Each input parameter p.sub.j is the j.sup.th parameter of each of the n input vectors, or equivalently stated p.sub.j may be viewed as the label for the j.sup.th row of the input matrix I.sub.m. Each of the elements in each row j of the matrix I.sub.m may take on a range of values. Ideally, a human operator or analyst, such as end user 31A (see
(42) To overcome limitations with current data analysis systems, the network sensitivity analysis (NSA) engine 150 executes one or more procedures, such as the procedures described below, based on data input to and output from a fully trained neural network model 133 (or other non-linear models) in order to determine the relationship between various inputs to and outputs from the fully trained neural network model 133 (or other non-linear models). (In this aspect, the fully trained neural network model 133 should be understood to be a model trained satisfactorily from the input data set 25, which may include data segregated to form the training data 113 and test data 115—see, e.g.,
(43) The operation for establishing the importance of each of the J parameters, where J is the total number of parameters p.sub.i, begins by, for each individual parameter, p.sub.j, in the set of vectors I.sub.m=[x.sub.1, x.sub.2, x.sub.3 . . . x.sub.n] for data input to the model 133, sorting the columns of the matrix I.sub.m from the m.sup.th data set 25 in ascending (or descending) order according to the values of the j.sup.th parameter of each input data vector I.sub.m(j,k) for k=1 to n. This may be thought of as re-ordering the columns k of the input matrix I.sub.m to create re-ordered matrix I.sub.m,j where the values in the j.sup.th row are ascending or descending. Next, the NSA engine 150 separates the columns of the re-ordered matrix I.sub.m,j from the m.sup.th data set 25 into a number of groups, N*, based on the values in the j.sup.th row of the re-ordered matrix I.sub.m,j. In an aspect, a human user may specify the hyper-parameter, N*, as an input to the NSA process, or N* may be optimally calculated for the parameters p.sub.j with real values, but the actual number of groups for each parameter p.sub.j might be less than a selected value for N*. For discrete parameters, or even non-numeric parameters, N* may be at most the number of distinct values of the parameter in the j.sup.th row of the input matrix I.sub.m for k=1 to n. However, if the selected N* value is too large, the NSA engine 150 may not be able to sufficiently sample the input-output relationship for each input value, which could result in a “noisy” NSA curve. If the selected N* is too small, the order of the NSA curve may be too low, and the NSA curve may not embody important structural characteristics of the input-output relationship. For continuous normalized input parameters, a selection of N*=10 may be sufficient to produce NSA curves that balance these sampling and structure trade-offs. 
As one skilled in the art will appreciate, the input data 25 may include only numeric data or a combination of numeric data and other data. In an aspect, data other than numeric data may be converted to numeric data. For example, Yes/No and Male/Female data may be represented by a 0 or a 1, respectively. Months of a year may be represented by 1 . . . 12, etc. Other schemes may be used to render non-numeric input data suitable for use in the model 133.
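By way of illustration only, the encoding of non-numeric data and the re-ordering of the columns of I.sub.m by the values in the j.sup.th row may be sketched as follows. The function names `encode_column` and `reorder_columns`, the list-of-rows matrix layout, and the sample values are assumptions made for this sketch and do not appear in the disclosure.

```python
def encode_column(values, mapping):
    """Convert non-numeric entries to numbers using a user-supplied mapping."""
    return [float(mapping[v]) for v in values]


def reorder_columns(matrix, j, descending=False):
    """Return a copy of matrix (a list of rows) with its columns sorted by row j.

    Rows are parameters and columns are input vectors, as in the text.
    sorted() is stable, so columns with equal row-j values keep their order.
    """
    n = len(matrix[j])
    order = sorted(range(n), key=lambda k: matrix[j][k], reverse=descending)
    return [[row[k] for k in order] for row in matrix]


# Hypothetical data: 3 parameters (rows) by 4 input vectors (columns).
YES_NO = {"Yes": 0, "No": 1}  # Yes/No represented by 0 or 1, respectively
I_m = [
    [0.9, 0.1, 0.5, 0.3],
    encode_column(["Yes", "No", "Yes", "No"], YES_NO),
    [12.0, 3.0, 7.0, 1.0],
]

# Re-ordered matrix for parameter j=0 (ascending values in row 0).
I_m0 = reorder_columns(I_m, j=0)
```

Each column moves as a unit, so the values of the other parameters stay attached to the input vector they came from.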
(44)
(45) In an aspect, the data set 25 of input vectors x.sub.k may be separated into groups G.sub.i, where each G.sub.i is a collection of columns (G.sub.i is, in fact, a matrix) of the re-ordered matrix I.sub.m,j, where i varies from 1 to N*, by simply taking approximately equal numbers of input vectors x.sub.k to form each group G.sub.i. In another aspect, to separate the columns of the re-ordered matrix I.sub.m,j ordered by their row j values into input vector groups G.sub.i where i=1 to N*, a tolerance parameter is defined as TOL=[I.sub.m,j(j,n)−I.sub.m,j(j,1)]/N*. Next, starting with the first vector in the sorted m.sup.th data set 25, and beginning with G.sub.1 for the j.sup.th parameter, the NSA engine 150 sets k.sub.0=1 and k=1 and then determines if the magnitude of [I.sub.m,j(j,k+1)−I.sub.m,j(j,k.sub.0)] is less than TOL; if so, then column k+1 of I.sub.m,j is placed in G.sub.i; otherwise column k+1 is placed in G.sub.i+1 and k.sub.0 is set to be equal to k+1. Then, k is incremented by one and the process is repeated until the input groups G.sub.1, G.sub.2, . . . , G.sub.N* are formed. In yet another aspect, to improve the comparability of NSA curves resulting from neural network models trained using different sets of input vectors, a user may specify that the groups G.sub.i are created to be of approximately equal size, with the requirement that the columns of I.sub.m,j with equal values in the j.sup.th position of any input vector x.sub.k are placed in the same group G.sub.i. The chosen process is repeated until the set of (sorted) input vectors represented in the re-ordered matrix I.sub.m,j is exhausted, resulting in groups {G.sub.1, G.sub.2, . . . , G.sub.N*}.
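By way of illustration only, two of the grouping procedures described above may be sketched as follows: the tolerance-based procedure and the approximately-equal-size procedure that keeps equal values together. The function names, the zero-based column indices, and the handling of a constant row are assumptions of this sketch. The tolerance test compares the magnitude of the difference so that the same code serves ascending and descending order.

```python
def tolerance_groups(sorted_row, n_star):
    """Group column indices of a sorted row using TOL = (last - first) / N*.

    A column joins the current group while its value lies within TOL of the
    value at k0, the first column of that group; otherwise it starts a new group.
    """
    tol = abs(sorted_row[-1] - sorted_row[0]) / n_star
    if tol == 0:  # constant row: a single group (assumed handling)
        return [list(range(len(sorted_row)))]
    groups = [[0]]
    k0 = 0
    for k in range(1, len(sorted_row)):
        if abs(sorted_row[k] - sorted_row[k0]) < tol:
            groups[-1].append(k)
        else:
            groups.append([k])
            k0 = k
    return groups


def tie_aware_groups(sorted_row, n_star):
    """Form groups of roughly len(row)/N* columns without splitting equal values."""
    target = len(sorted_row) / n_star
    groups, current = [], []
    for k, value in enumerate(sorted_row):
        full = len(current) >= target
        if current and full and value != sorted_row[current[-1]]:
            groups.append(current)
            current = []
        current.append(k)
    if current:
        groups.append(current)
    return groups


# Hypothetical sorted j-th rows.
g_tol = tolerance_groups([0.0, 0.1, 0.2, 0.6, 0.7, 0.9], n_star=2)
g_tie = tie_aware_groups([1, 1, 1, 2, 3, 3], n_star=3)
```

In the second example the tie rule yields two groups although N*=3 was requested, which is consistent with the statement that the actual number of groups might be less than the selected value for N*.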
(46) After the data are segregated, each vector in each group G.sub.i is input into the fully trained model 133 and the average of the resulting output for each group G.sub.i is computed. That is, for each group G.sub.i, the mean of the values in the j.sup.th row of the i.sup.th group matrix G.sub.i of input vectors is computed, and the mean of the outputs from the model 133 for each input vector (column) in the i.sup.th group matrix G.sub.i of input vectors is computed. In an embodiment, each of the means of the outputs is normalized by dividing by the mean of the output E[O.sub.i] where E is the Expected Value of elements in the j.sup.th row of the i.sup.th group matrix G.sub.i, so that key parameters from different populations may be compared on a similar scale. The resulting plot of normalized mean output versus the mean input is termed an NSA curve for the parameter in the j.sup.th position. Examples of NSA curves generated by execution of the system 100 are shown in
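By way of illustration only, the computation of the points of an NSA curve may be sketched as follows. The function name `nsa_curve`, the stand-in `model` callable, and the sample data are hypothetical. The sketch also assumes that each group mean output is normalized by the mean output over all input vectors; the disclosure may intend a different normalizing mean.

```python
def nsa_curve(matrix, j, groups, model):
    """Return one (mean input, normalized mean output) point per group.

    matrix: list of rows whose columns are already sorted by row j.
    groups: list of lists of column indices (the groups G_i).
    model:  callable mapping one input vector (a column) to a number.
    """
    columns = list(zip(*matrix))              # one tuple per input vector
    outputs = [model(col) for col in columns]
    overall_mean = sum(outputs) / len(outputs)  # assumed normalizer
    curve = []
    for group in groups:
        mean_in = sum(matrix[j][k] for k in group) / len(group)
        mean_out = sum(outputs[k] for k in group) / len(group)
        curve.append((mean_in, mean_out / overall_mean))
    return curve


# Hypothetical sorted matrix (2 parameters by 4 input vectors) and model.
I_sorted = [
    [1.0, 2.0, 3.0, 4.0],
    [0.0, 1.0, 0.0, 1.0],
]
toy_model = lambda x: 2.0 * x[0] + x[1]  # stand-in for the trained model 133

curve = nsa_curve(I_sorted, j=0, groups=[[0, 1], [2, 3]], model=toy_model)
```

Plotting the second element of each pair against the first gives the NSA curve for parameter j; a flat curve near 1.0 would suggest that the output is insensitive to that parameter.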
(47) The preceding discussion of NSA engine 150 operations referred to matrices of parameters p ordered in rows and entities k ordered in columns. One skilled in the art will appreciate that other matrix arrangements are possible and contemplated by the above disclosure.
(48) Armed now with the neural network model 133 and the NSA of individual input parameters, the NSA of entire data sets 25 is described in detail. There are three major factors that are considered for an entire swath of data contributed to the input training and test data:
(49) Quality of data contributed
(50) Quantity of data contributed
(51) Insights generated based on the data contributed.
(52) While ascertaining the quantity of data contributed may seem straightforward, there are some additional aspects of data that may need to be considered. Real-world situations may have one or more of the following characteristics:
(53) Periodicity
(54) Cycles
(55) Secular trends
(56) Oscillating curves about the secular trend
(57) Outliers
(58) Accuracy of the model
(59) Confidence Intervals of the results
(60) NSA curves of individual parameters in the input data
(61) White Noise
(62) Each of these aspects may warrant serious study, and the three major factors are therefore intertwined in determining the usefulness or strength of a contributed data set 25. For example, data might be contributed that might or might not give insight into cycles of the data, but the contributed data might additionally increase accuracy of the model. Such a model may be easily separated into yearly cycles.
(63) If a new contributed data set does not span outside of the existing data, then no new information may be gleaned from the new data regarding yearly cycles, but there may be a contribution to increased accuracy of the model.
(64) Each contributed data set N (i.e., the data set 25 of
(65)
The measure M.sub.i for each data set contribution also may be computed using the root mean squared difference between the samples that make up the NSA curves multiplied by the range of the i.sup.th input parameter as a substitute for the absolute area between the NSA curves. Other techniques such as Absolute Percentage Error (APE) also may be used. The rest of the procedure is the same as described above.
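By way of illustration only, the two ways of measuring the difference between a pair of NSA curves mentioned above may be sketched as follows. The sketch assumes that both curves are sampled at the same input values x, that the absolute area is estimated with the trapezoidal rule, and that the range of the input parameter is taken as the span of the sampled x values; the function names are hypothetical.

```python
import math


def area_between(x, y_full, y_without):
    """Trapezoidal estimate of the absolute area between two sampled curves."""
    diff = [abs(a - b) for a, b in zip(y_full, y_without)]
    return sum(
        (x[i + 1] - x[i]) * (diff[i] + diff[i + 1]) / 2.0
        for i in range(len(x) - 1)
    )


def rms_times_range(x, y_full, y_without):
    """Root mean squared difference of the samples times the input range."""
    squares = [(a - b) ** 2 for a, b in zip(y_full, y_without)]
    rms = math.sqrt(sum(squares) / len(squares))
    return rms * (max(x) - min(x))


# Hypothetical NSA curve samples with and without one contributed data set.
x = [0.0, 1.0, 2.0]
y_all = [1.0, 1.0, 1.0]
y_removed = [1.0, 1.5, 1.0]

m_area = area_between(x, y_all, y_removed)
m_rms = rms_times_range(x, y_all, y_removed)
```

The two measures generally differ in value, so one of them should be used consistently when contributions are compared.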
(66) Thus, a method for evaluating the relative contribution of an individual data set N.sub.j in a plurality of data sets N.sub.(i . . . j . . . n) to a problem solution O, the data sets N.sub.(i . . . j . . . n) processed and applied to a machine learning model, begins with a processor executing a network sensitivity analysis (NSA). Executing the NSA includes generating an N NSA curve for each of a plurality of distinct input parameters in the data sets N.sub.(i . . . j . . . n) by computing a solution O.sub.N with all of the data sets N.sub.(i . . . j . . . n); and generating an N-j NSA curve (i.e., an NSA curve with the j.sup.th data set removed from the N data sets) for each of the plurality of distinct input parameters by removing the j.sup.th data set from the data sets N.sub.(i . . . j . . . n), and computing a solution O.sub.N-j with the j.sup.th data set removed. Finally, executing the NSA involves determining a measure M.sub.j of a contribution of a j.sup.th data set based on a difference between the N NSA curves and the N-j NSA curves, and computing a relative strength S.sub.j of each of the N.sub.(i . . . n) data sets as a function of the measure M.sub.j:
(67)
(68) The importance of each of the aspects listed above is problem specific. For example, in a case where the same data sets 25 are contributed by two different sources, theoretically there should be no enhancement of the model from the second contribution. The system 100 may either not recognize and credit the second contributor at all, because no new information is presented to the model, or else the system 100 may recognize and credit each contributor equally. It may be that in some problem cases the periodicity or cycles are of supreme importance, and in other problem instances the accuracy of the model is of supreme importance.
(69) Accuracy of the neural network model 133 may be characterized by how the model 133 performs on the test data 115 as opposed to the training data 113. As noted herein, a model may be over-fit to the training data, and the resulting model may not generalize very well to the test data. The accuracy of a neural network model against a data set may be measured by either the root mean-squared error (rms) or the Absolute Percentage Error (ape) of the prediction model against the known answers. Either technique is well known to practitioners of neural network models. The rms error is computed as the square root of the average of the squares of the errors, and the ape is computed as the average of the absolute value of the errors.
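By way of illustration only, the two error measures may be sketched as follows. The function names and sample values are hypothetical. The function `abs_error_mean` follows the description given above (the average of the absolute value of the errors); `abs_pct_error_mean` is included as an assumption, showing the conventional percentage form in which each absolute error is first divided by the known answer.

```python
import math


def rms_error(predicted, actual):
    """Square root of the average of the squared errors."""
    squares = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squares) / len(squares))


def abs_error_mean(predicted, actual):
    """Average of the absolute value of the errors, as described in the text."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def abs_pct_error_mean(predicted, actual):
    """Assumed percentage form: mean of |error| / |known answer|, times 100."""
    ratios = [abs(p - a) / abs(a) for p, a in zip(predicted, actual)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical predictions on test data versus known answers.
predicted = [2.0, 4.0, 9.0]
known = [1.0, 4.0, 10.0]

rms = rms_error(predicted, known)
mae = abs_error_mean(predicted, known)
mape = abs_pct_error_mean(predicted, known)
```

Computing the same measure on the training data 113 and on the test data 115 and comparing the two values is one way to check for the over-fitting noted above.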
(70) Experiment 1: This experiment applies the inventive features disclosed herein to Medicare provider utilization data to predict the risk of malpractice for Florida doctors. Annual medical liability costs are in the tens of billions, 7.4% of physicians have a malpractice claim per year, and 1.6% have a claim that leads to payment. The ability to predict which physicians have elevated risk for a malpractice claim may therefore be of interest. The herein disclosed system 100 predicts the risk of physicians being sued for malpractice and generates physician risk and work profiles. The system 100 uses provider utilization data and medical malpractice history for training and testing. The utilization data may be all claims processed by the provider or a subset of their claims, such as Medicare data. The medical malpractice data are needed for the years upon which the model will be trained and tested. The Medicare data are used to create yearly profiles for each physician, and these profiles are inputted into the neural network model 133 to predict malpractice risk for each physician. The physicians were sorted into deciles based on their predicted risk. The model 133 demonstrates the ability to discriminate between high and low risk physicians, with the physicians in the top 20% of estimated risk being 20.5 times more likely to be sued than the physicians in the bottom 20% of estimated risk.
(71) There were three main sources of data for this experiment: Medicare provider utilization and payment data, the NPPES NPI registry, and Florida malpractice claims. Medicare provider utilization and payment data contains over 150 million Medicare line items across five years. These data cover procedures and prescribed drugs that were charged to Medicare by all physicians in the United States. The NPPES NPI registry contains physician information for every registered physician in the United States. These data include physician specialty and practice information. Florida publishes all malpractice claims that resulted in a payout, either a successful court case or an out of court settlement. These data contain over 55,000 claims from 1985-2018, including nearly 10,000 in the model period of 2013-2016. Returning to
(72)
(73)
(74)
The relative strength S.sub.i of a data set indicates how significant its contribution was to the observation O. In completing the operation of block 840, the system 100 may simply integrate under the NSA curves to produce an absolute value of the differential areas. Alternately, the Measure M.sub.j for each data set contribution may be computed using the root mean squared difference between the samples that make up the NSA curves multiplied by the range of the i.sup.th input parameter as a substitute for the absolute area between the NSA curves.
(75)
(76) In block 920, the NSA engine 150 determines a number N* (i.e., a hyper-parameter) of columns k into which may be sorted the values in the j.sup.th row of the re-ordered matrix I.sub.m,j. As an aspect of block 920, the NSA engine 150 may determine hyper-parameter N* as a default number of columns k.sub.i or the NSA engine 150 may compute N* as a function of the number of discrete elements present in the input data set 25. Thus, elements that have a same value as other elements may be represented in a single column of the re-ordered matrix I.sub.m,j. In block 930, the NSA engine 150 generates a plurality of group sub-matrices G.sub.i, with each G.sub.i including a subset of columns k for the jth row. The result is a matrix G.sub.j,k represented as a single row j and multiple columns k.
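By way of illustration only, one possible way to select N* in block 920 is sketched below. It assumes a default of 10 groups, capped at the number of distinct values found in the j.sup.th row; the function name `choose_n_star` and the specific rule are assumptions of this sketch and not a statement of how the NSA engine 150 computes N*.

```python
def choose_n_star(row_values, default=10):
    """Pick N* as the default, but never more than the distinct values present."""
    distinct = len(set(row_values))
    return max(1, min(default, distinct))


# Hypothetical rows: a binary parameter and a continuous normalized parameter.
n_binary = choose_n_star([0, 1, 0, 1, 1])
n_continuous = choose_n_star([i / 100 for i in range(100)])
```

Under this rule a binary parameter yields two groups, while a continuous parameter with many distinct values falls back to the default.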
(77) In block 940, the NSA engine 150 inputs the re-ordered matrix I.sub.m,j into the fully-trained model 133, and computes an average of the resulting output for each group matrix G.sub.i. In block 950, the NSA engine 150 produces normalized mean values of the outputs.
(78)
(79)
(80)
Then a plot of the (x.sub.i, <y.sub.i>) values for the first parameter is generated as shown in
(81) Subsequent to the operations illustrated in
(82) The preceding disclosure refers to flowcharts and accompanying descriptions to illustrate the embodiments represented in
(83) Embodiments disclosed herein may be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the herein disclosed structures and their equivalents. Some embodiments may be implemented as one or more computer programs; i.e., one or more modules of computer program instructions, encoded on a computer storage medium for execution by one or more processors. A computer storage medium may be, or may be included in, a computer-readable storage device, a computer-readable storage substrate, or a random or serial access memory. The computer storage medium may also be, or may be included in, one or more separate physical components or media such as multiple CDs, disks, or other storage devices. The computer readable storage medium does not include a transitory signal.
(84) The herein disclosed methods may be implemented as operations performed by a processor on data stored on one or more computer-readable storage devices or received from other sources.
(85) A computer program (also known as a program, module, engine, software, software application, script, or code) may be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program may be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program may be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.