Dynamic outlier bias reduction system and method

Abstract

In at least one embodiment, the present description is directed to a computer system, having at least components of a server, including a processor and a non-transient storage subsystem, storing a computer program including instructions that, when executed by the processor, cause the processor to at least: electronically receive a model for one or more operating conditions, one or more threshold criteria, and facility operating data for each respective facility of a plurality of facilities; validate the one or more threshold criteria to be one or more acceptable bias criteria; iteratively perform one or more iterations of outlier bias reduction in the facility operating data based on the model; determine, based on non-biased facility operating data, a non-biased performance standard for the one or more operating conditions; and track, based on the non-biased performance standard and the facility operating data, operating performance of each respective facility of the plurality of facilities.

Claims

1. A method comprising the steps of: electronically receiving, by a processor, at least the following: i) a model for one or more operating conditions, ii) one or more threshold criteria, and iii) facility operating data of the one or more operating conditions for each respective facility of a plurality of facilities; wherein the model comprises one or more coefficients; validating, by the processor, the one or more threshold criteria to be one or more acceptable bias criteria; iteratively performing, by the processor, one or more iterations of outlier bias reduction in the facility operating data of the plurality of facilities based at least in part on the model; wherein the iteratively performing the one or more iterations of outlier bias reduction comprises the steps of: (i) determining a set of model predicted values; (ii) comparing the set of model predicted values to the facility operating data to produce a set of error values; (iii) removing bias facility operating data of one or more performance outlier facilities from the facility operating data of the plurality of facilities to form a non-biased facility operating data of one or more performance non-biased facilities that are selected from the plurality of facilities, wherein the one or more performance outlier facilities are determined based at least in part on the set of error values and the one or more acceptable bias criteria; (iv) constructing, based at least in part on the non-biased facility operating data, an updated model for the one or more operating conditions, wherein the updated model comprises one or more updated coefficients; (v) repeating steps (i) through (iv) when one or more termination criteria are not satisfied; determining, by the processor, based at least in part on the non-biased facility operating data for the one or more operating conditions of the one or more performance non-biased facilities, one or more non-biased performance standards for the one or more operating conditions; and tracking, by the processor, based at least in part on the one or more non-biased performance standards and the facility operating data, operating performance of each respective facility of the plurality of facilities.

2. The computer-implemented method of claim 1, wherein the iteratively performing the one or more iterations of the outlier bias reduction further comprises the steps of: determining a set of first improvement error values for the facility operating data; determining a set of second improvement error values for the non-biased facility operating data; and comparing the at least one set of first improvement error values with the at least one set of second improvement error values.

3. The computer-implemented method of claim 2, wherein the determination that the one or more termination criteria are not satisfied is based on the comparison of the at least one set of first improvement error values with the at least one set of second improvement error values.

4. The computer-implemented method of claim 3, wherein the determination that the one or more termination criteria are not satisfied is based on determining that the one or more termination criteria have at least one improvement value that does not exceed the difference of the at least one set of first improvement error values and the at least one set of second improvement error values.

5. The computer-implemented method of claim 2, wherein the first improvement error values are standard error values.

6. The computer-implemented method of claim 2, wherein the first improvement error values are coefficient of determination values.

7. The computer-implemented method of claim 1, wherein a particular criterion is a specified number of iterations.

8. The computer-implemented method of claim 1, wherein a particular criterion is a convergence criterion.

9. The computer-implemented method of claim 1, wherein the set of error values comprises a set of relative error values and a set of absolute error values.

10. The computer-implemented method of claim 9, wherein the one or more performance outlier facilities are determined as one or more facilities for which the relative error values and the absolute error values for respective facility operating data exceed the one or more error threshold criteria.

11. The computer-implemented method of claim 1, wherein the repeating steps (i) through (iv) further comprises: recombining the non-biased facility operating data of one or more performance non-biased facilities with the bias facility operating data of the one or more performance outlier facilities to produce the facility operating data.

12. A computer system, comprising: at least one server, comprising: at least one processor and a non-transient storage subsystem; wherein the non-transient storage subsystem stores a computer program comprising instructions that, when executed by the at least one processor, cause the at least one processor to at least: electronically receive at least the following: i) a model for one or more operating conditions, ii) one or more threshold criteria, and iii) facility operating data of the one or more operating conditions for each respective facility of a plurality of facilities; wherein the model comprises one or more coefficients; validate the one or more threshold criteria to be one or more acceptable bias criteria; iteratively perform one or more iterations of outlier bias reduction in the facility operating data of the plurality of facilities based at least in part on the model; wherein the iterative performance of the one or more iterations of outlier bias reduction comprises computer operations of: (i) determining a set of model predicted values; (ii) comparing the set of model predicted values to the facility operating data to produce a set of error values; (iii) removing bias facility operating data of one or more performance outlier facilities from the facility operating data of the plurality of facilities to form a non-biased facility operating data of one or more performance non-biased facilities that are selected from the plurality of facilities, wherein the one or more performance outlier facilities are determined based at least in part on the set of error values and the one or more acceptable bias criteria; (iv) constructing, based at least in part on the non-biased facility operating data, an updated model for the one or more operating conditions, wherein the updated model comprises one or more updated coefficients; (v) repeating steps (i) through (iv) when one or more termination criteria are not satisfied; determine, based at least in part on the non-biased facility operating data for the one or more operating conditions of the one or more performance non-biased facilities, one or more non-biased performance standards for the one or more operating conditions; and track, based at least in part on the one or more non-biased performance standards and the facility operating data, operating performance of each respective facility of the plurality of facilities.

13. The system of claim 12, wherein the operations of (i) through (iv) further comprise: recombining the non-biased facility operating data of one or more performance non-biased facilities with the bias facility operating data of the one or more performance outlier facilities to produce the facility operating data.

14. The system of claim 12, wherein the iterative performance of the one or more iterations of the outlier bias reduction further comprises the operations of: determining a set of first improvement error values for the facility operating data; determining a set of second improvement error values for the non-biased facility operating data; and comparing the at least one set of first improvement error values with the at least one set of second improvement error values.

15. The system of claim 14, wherein the determination that the one or more termination criteria are not satisfied is based on the comparison of the at least one set of first improvement error values with the at least one set of second improvement error values.

16. The system of claim 15, wherein the determination that the one or more termination criteria are not satisfied is based on determining that the one or more termination criteria have at least one improvement value that does not exceed the difference of the at least one set of first improvement error values and the at least one set of second improvement error values.

17. The system of claim 14, wherein the first improvement error values are standard error values.

18. The system of claim 14, wherein the first improvement error values are coefficient of determination values.

19. The system of claim 12, wherein a particular termination criterion is a specified number of iterations.

20. The system of claim 12, wherein a particular termination criterion is a convergence criterion.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a flowchart illustrating an embodiment of the data outlier identification and removal method.

(2) FIG. 2 is a flowchart illustrating an embodiment of the data outlier identification and removal method for data quality operations.

(3) FIG. 3 is a flowchart illustrating an embodiment of the data outlier identification and removal method for data validation.

(4) FIG. 4 is an illustrative node for implementing a method of the invention.

(5) FIG. 5 is an illustrative graph for quantitative assessment of a data set.

(6) FIGS. 6A and 6B are illustrative graphs for qualitative assessment of the data set of FIG. 5, illustrating the randomized and realistic data set, respectively, for the entire data set.

(7) FIGS. 7A and 7B are illustrative graphs for qualitative assessment of the data set of FIG. 5, illustrating the randomized and realistic data set, respectively, after removal of 30% of the data as outliers.

(8) FIGS. 8A and 8B are illustrative graphs for qualitative assessment of the data set of FIG. 5, illustrating the randomized and realistic data set, respectively, after removal of 50% of the data as outliers.

DETAILED DESCRIPTION OF THE INVENTION

(9) The following disclosure provides many different embodiments, or examples, for implementing different features of a system and method for accessing and managing structured content. Specific examples of components, processes, and implementations are described to help clarify the invention. These are merely examples and are not intended to limit the invention from that described in the claims. Well-known elements are presented without detailed description so as not to obscure the preferred embodiments of the present invention with unnecessary detail. For the most part, details unnecessary to obtain a complete understanding of the preferred embodiments of the present invention have been omitted inasmuch as such details are within the skills of persons of ordinary skill in the relevant art.

(10) A mathematical description of one embodiment of Dynamic Outlier Bias Reduction is shown as follows:

(11) Nomenclature:
$\hat{X}$: Set of all data records, $\hat{X} = \hat{X}_k + \hat{X}_{Ck}$, where:
$\hat{X}_k$: Set of accepted data records for the $k^{th}$ iteration
$\hat{X}_{Ck}$: Set of outlier (removed) data records for the $k^{th}$ iteration
$\hat{Q}_k$: Set of computed model predicted values for $\hat{X}_k$
$\hat{Q}_{Ck}$: Set of outlier model predicted values for data records $\hat{X}_{Ck}$
$\hat{A}$: Set of actual values (target values) on which the model is based
$\hat{\beta}_{k \to k+1}$: Set of model coefficients at the $(k+1)^{st}$ iteration computed as a result of the model computations using $\hat{X}_k$
$M(\hat{X}_k : \hat{\beta}_{k \to k+1})$: Model computation producing $\hat{Q}_{k+1}$ from $\hat{X}_k$, storing model derived and user-supplied coefficients $\hat{\beta}_{k \to k+1}$
$C$: User supplied error criteria (%)
$\Psi(\hat{Q}_k, \hat{A})$: Error threshold function
$F(\Psi, C)$: Error threshold value ($E$)
$\hat{\Omega}_k$: Iteration termination criteria, e.g., iteration count, $r^2$, standard error, etc.
Initial Computation, k=0
Initial Step 1: Using initial model coefficient estimates, $\hat{\beta}_{0 \to 1}$, compute initial model predicted values by applying the model to the complete data set:
$\hat{Q}_1 = M(\hat{X} : \hat{\beta}_{0 \to 1})$
Initial Step 2: Compute initial model performance results:
$\hat{\Omega}_1 = f(\hat{Q}_1, \hat{A}, k=0, r^2, \text{standard error}, \text{etc.})$
Initial Step 3: Compute model error threshold value(s):
$E_1 = F(\Psi(\hat{Q}_1, \hat{A}), C)$
Initial Step 4: Filter the data records to remove outliers:
$\hat{X}_1 = \{\forall x \in \hat{X} \mid \Psi(\hat{Q}_1, \hat{A}) < E_1\}$

(12) Iterative Computations, k>0

(13) Iteration Step 1: Compute predicted values by applying the model to the accepted data set:
$\hat{Q}_{k+1} = M(\hat{X}_k : \hat{\beta}_{k \to k+1})$
Iteration Step 2: Compute model performance results:
$\hat{\Omega}_{k+1} = f(\hat{Q}_{k+1}, \hat{A}, k, r^2, \text{standard error}, \text{etc.})$
If termination criteria are achieved, stop; otherwise proceed to Step 3.
Iteration Step 3: Compute results for removed data, $\hat{X}_{Ck} = \{\forall x \in \hat{X} \mid x \notin \hat{X}_k\}$, using current model:
$\hat{Q}_{Ck+1} = M(\hat{X}_{Ck} : \hat{\beta}_{k \to k+1})$
Iteration Step 4: Compute model error threshold values:
$E_{k+1} = F(\Psi(\hat{Q}_{k+1} + \hat{Q}_{Ck+1}, \hat{A}), C)$
Iteration Step 5: Filter the data records to remove outliers:
$\hat{X}_{k+1} = \{\forall x \in \hat{X} \mid \Psi(\hat{Q}_{k+1} + \hat{Q}_{Ck+1}, \hat{A}) < E_{k+1}\}$

(14) Another mathematical description of one embodiment of Dynamic Outlier Bias Reduction is shown as follows:

(15) Nomenclature:
$\hat{X}$: Set of all data records, $\hat{X} = \hat{X}_k + \hat{X}_{Ck}$, where:
$\hat{X}_k$: Set of accepted data records for the $k^{th}$ iteration
$\hat{X}_{Ck}$: Set of outlier (removed) data records for the $k^{th}$ iteration
$\hat{Q}_k$: Set of computed model predicted values for $\hat{X}_k$
$\hat{Q}_{Ck}$: Set of outlier model predicted values for $\hat{X}_{Ck}$
$\hat{A}$: Set of actual values (target values) on which the model is based
$\hat{\beta}_{k \to k+1}$: Set of model coefficients at the $(k+1)^{st}$ iteration computed as a result of the model computations using $\hat{X}_k$
$M(\hat{X}_k : \hat{\beta}_{k \to k+1})$: Model computation producing $\hat{Q}_{k+1}$ from $\hat{X}_k$, storing model derived and user-supplied coefficients $\hat{\beta}_{k \to k+1}$
$C_{RE}$: User supplied relative error criterion (%)
$C_{AE}$: User supplied absolute error criterion (%)
$RE(\hat{Q}_k + \hat{Q}_{Ck}, \hat{A})$: Relative error values for all data records
$AE(\hat{Q}_k + \hat{Q}_{Ck}, \hat{A})$: Absolute error values for all data records
$P_{RE_k}$: Relative error threshold value for the $k^{th}$ iteration, where $P_{RE_k} = \text{Percentile}(RE(\hat{Q}_k + \hat{Q}_{Ck}, \hat{A}), C_{RE})$
$P_{AE_k}$: Absolute error threshold value for the $k^{th}$ iteration, where $P_{AE_k} = \text{Percentile}(AE(\hat{Q}_k + \hat{Q}_{Ck}, \hat{A}), C_{AE})$
$\hat{\Omega}_k$: Iteration termination criteria, e.g., iteration count, $r^2$, standard error, etc.
Initial Computation, k=0
Initial Step 1: Using initial model coefficient estimates, $\hat{\beta}_{0 \to 1}$, compute initial model predicted value results by applying the model to the complete data set:
$\hat{Q}_1 = M(\hat{X} : \hat{\beta}_{0 \to 1})$
Initial Step 2: Compute initial model performance results:
$\hat{\Omega}_1 = f(\hat{Q}_1, \hat{A}, k=0, r^2, \text{standard error}, \text{etc.})$
Initial Step 3: Compute model error threshold values:
$P_{RE_1} = \text{Percentile}(RE(\hat{Q}_1, \hat{A}), C_{RE})$
$P_{AE_1} = \text{Percentile}(AE(\hat{Q}_1, \hat{A}), C_{AE})$
Initial Step 4: Filter the data records to remove outliers:

(16) $\hat{X}_1 = \left\{ \forall x \in \hat{X} \;\middle|\; \begin{pmatrix} RE(\hat{Q}_1, \hat{A}) \\ AE(\hat{Q}_1, \hat{A}) \end{pmatrix} < \begin{pmatrix} P_{RE} \\ P_{AE} \end{pmatrix}_1 \right\}$

(17) Iterative Computations, k>0
Iteration Step 1: Compute model predicted values by applying the model to the outlier removed data set:
$\hat{Q}_{k+1} = M(\hat{X}_k : \hat{\beta}_{k \to k+1})$
Iteration Step 2: Compute model performance results:
$\hat{\Omega}_{k+1} = f(\hat{Q}_{k+1}, \hat{A}, k, r^2, \text{standard error}, \text{etc.})$
If termination criteria are achieved, stop; otherwise proceed to Step 3.
Iteration Step 3: Compute results for the removed data, $\hat{X}_{Ck} = \{\forall x \in \hat{X} \mid x \notin \hat{X}_k\}$, using current model:
$\hat{Q}_{Ck+1} = M(\hat{X}_{Ck} : \hat{\beta}_{k \to k+1})$
Iteration Step 4: Compute model error threshold values:
$P_{RE_{k+1}} = \text{Percentile}(RE(\hat{Q}_{k+1} + \hat{Q}_{Ck+1}, \hat{A}), C_{RE})$
$P_{AE_{k+1}} = \text{Percentile}(AE(\hat{Q}_{k+1} + \hat{Q}_{Ck+1}, \hat{A}), C_{AE})$
Iteration Step 5: Filter the data records to remove outliers:

(18) $\hat{X}_{k+1} = \left\{ \forall x \in \hat{X} \;\middle|\; \begin{pmatrix} RE(\hat{Q}_{k+1} + \hat{Q}_{Ck+1}, \hat{A}) \\ AE(\hat{Q}_{k+1} + \hat{Q}_{Ck+1}, \hat{A}) \end{pmatrix} < \begin{pmatrix} P_{RE} \\ P_{AE} \end{pmatrix}_{k+1} \right\}$
Increment k and proceed to Iteration Step 1.

(19) After each iteration where new model coefficients are computed from the current censored dataset, the removed data from the previous iteration plus the current censored data are recombined. This combination encompasses all data values in the complete dataset. The current model coefficients are then applied to the complete dataset to compute a complete set of predicted values. The absolute and relative errors are computed for the complete set of predicted values and new bias criteria percentile threshold values are computed. A new censored dataset is created by removing all data values where the absolute or relative errors are greater than the threshold values and the nonlinear optimization model is then applied to the newly censored dataset computing new model coefficients. This process enables all data values to be reviewed every iteration for their possible inclusion in the model dataset. It is possible that some data values that were excluded in previous iterations will be included in subsequent iterations as the model coefficients converge on values that best fit the data.

(20) In one embodiment, variations in greenhouse gas (GHG) emissions can result in overestimation or underestimation of emission results, leading to bias in model predicted values. These non-industrial influences, such as environmental conditions and errors in calculation procedures, can cause the results for a particular facility to be radically different from those of similar facilities unless the bias in the model predicted values is removed. The bias in the model predicted values may also exist due to unique operating conditions.

(21) The bias can be removed manually by simply removing a facility's data from the calculation if analysts are confident that a facility's calculations are in error or possess unique, extenuating characteristics. Yet, when measuring facility performance across many different companies, regions, and countries, precise a priori knowledge of the data details is not realistic. Therefore, any analyst-based data removal procedure has the potential for adding undocumented, non-data-supported biases to the model results.

(22) In one embodiment, Dynamic Outlier Bias Reduction is applied to a procedure that uses the data and a prescribed overall error criterion to determine statistical outliers that are removed from the model coefficient calculations. This is a data-driven process that identifies outliers using a data-produced global error criterion computed with, for example, the percentile function. The use of Dynamic Outlier Bias Reduction is not limited to the reduction of bias in model predicted values, and its use in this embodiment is illustrative and exemplary only. Dynamic Outlier Bias Reduction may also be used, for example, to remove outliers from any statistical data set, including use in calculation of, but not limited to, arithmetic averages, linear regressions, and trend lines. The outlier facilities are still ranked from the calculation results, but the outliers are not used in the filtered data set applied to compute model coefficients or statistical results.

(23) A standard procedure, commonly used to remove outliers, is to compute the standard deviation (σ) of the data set and simply define all data outside a 2σ interval of the mean, for example, as outliers. This procedure has statistical assumptions that, in general, cannot be tested in practice. The Dynamic Outlier Bias Reduction method applied in an embodiment of this invention, outlined in FIG. 1, uses both a relative error and an absolute error. For example, for a facility $m$:
$\text{Relative Error}_m = \left( \frac{\text{Predicted Value}_m - \text{Actual Value}_m}{\text{Actual Value}_m} \right)^2 \quad (1)$
$\text{Absolute Error}_m = \left( \text{Predicted Value}_m - \text{Actual Value}_m \right)^2 \quad (2)$
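
As an illustration only, Eqns. (1) and (2) can be sketched in Python as follows. The function names and the three sample facilities are hypothetical and do not come from the disclosure; the sketch also assumes that no actual value is zero, since Eqn. (1) divides by it.

```python
def relative_error(predicted, actual):
    """Squared relative error of Eqn. (1); assumes actual != 0."""
    return ((predicted - actual) / actual) ** 2


def absolute_error(predicted, actual):
    """Squared absolute error of Eqn. (2)."""
    return (predicted - actual) ** 2


# Hypothetical (predicted value, actual value) pairs for three facilities.
facilities = {"A": (110.0, 100.0), "B": (95.0, 100.0), "C": (200.0, 100.0)}

# Map each facility to its (relative error, absolute error) pair.
errors = {
    name: (relative_error(p, a), absolute_error(p, a))
    for name, (p, a) in facilities.items()
}
```

Because both quantities are squared, over-prediction and under-prediction are treated symmetrically; facility C, whose prediction is twice its actual value, dominates both error measures.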

(24) In Step 110, the analyst specifies the error threshold criteria that will define outliers to be removed from the calculations. For example, using the percentile operation as the error function, a percentile value of 80 percent for relative and absolute errors could be set. This means that data values whose relative error is less than the 80th percentile value for relative error and whose absolute error is less than the 80th percentile value for absolute error will be included, and the remaining values are removed or considered as outliers. In this example, for a data value to avoid being removed, its errors must be less than both the relative and absolute error 80th percentile values. However, the percentile thresholds for relative and absolute error may be varied independently, and, in another embodiment, only one of the percentile thresholds may be used.
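
A minimal sketch of this percentile thresholding is given below, under stated assumptions. The helper names `percentile` and `accepted_indices` are hypothetical, the percentile is computed by linear interpolation between closest ranks (the disclosure does not prescribe a convention), and the sample error values are invented for illustration.

```python
import math


def percentile(values, pct):
    """Percentile by linear interpolation between closest ranks (assumed convention)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo = math.floor(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def accepted_indices(rel_errors, abs_errors, rel_pct=80.0, abs_pct=80.0):
    """Return indices of records whose errors are below every active threshold.

    Passing None for rel_pct or abs_pct disables that criterion, which models
    the embodiment in which only one percentile threshold is used.
    """
    rel_thr = percentile(rel_errors, rel_pct) if rel_pct is not None else math.inf
    abs_thr = percentile(abs_errors, abs_pct) if abs_pct is not None else math.inf
    return [
        i
        for i, (r, a) in enumerate(zip(rel_errors, abs_errors))
        if r < rel_thr and a < abs_thr
    ]


# Hypothetical squared errors for five facilities.
rel_errors = [0.01, 0.04, 0.09, 0.16, 1.00]
abs_errors = [1.0, 4.0, 400.0, 16.0, 100.0]

both = accepted_indices(rel_errors, abs_errors)                    # both criteria
rel_only = accepted_indices(rel_errors, abs_errors, abs_pct=None)  # relative error only
```

In this toy data the third facility passes the relative error test but fails the absolute error test, so it is accepted only when the absolute criterion is switched off.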

(25) In Step 120, the model standard error and coefficient of determination ($r^2$) percent change criteria are specified. While the values of these statistics will vary from model to model, the percent change relative to the preceding iteration can be preset, for example, at 5 percent. These values can be used to terminate the iteration procedure. Another termination criterion could be a simple iteration count.

(26) In Step 130, the optimization calculation is performed, which produces the model coefficients and predicted values for each facility.

(27) In Step 140, the relative and absolute errors for all facilities are computed using Eqns. (1) and (2).

(28) In Step 150, the error function with the threshold criteria specified in Step 110 is applied to the data computed in Step 140 to determine outlier threshold values.

(29) In Step 160, the data is filtered to include only facilities where the relative error, absolute error, or both errors, depending on the chosen configuration, are less than the error threshold values computed in Step 150.

(30) In Step 170, the optimization calculation is performed using only the outlier removed data set.

(31) In Step 180, the percent change of the standard error and $r^2$ are compared with the criteria specified in Step 120. If the percent change is greater than the criteria, the process is repeated by returning to Step 140. Otherwise, the iteration procedure is terminated in Step 190 and the resultant model computed from this Dynamic Outlier Bias Reduction procedure is complete. The model results are applied to all facilities, regardless of whether their data were removed or admitted in the current or any past iteration.
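
Steps 110 through 190 can be sketched end to end as follows. This is a hedged illustration rather than the disclosed implementation: an ordinary least-squares straight line stands in for the unspecified optimization calculation, the names `fit_line`, `fit_stats`, `pct_change`, and `dobr` are hypothetical, the percentile convention is assumed, and the data set is invented (a line y = 2x + 1 with small noise and two corrupted records).

```python
import math


def percentile(values, pct):
    """Percentile by linear interpolation between closest ranks (assumed convention)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo = math.floor(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; a stand-in for the optimization step."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b


def fit_stats(xs, ys, a, b):
    """Standard error and r^2 of the line on the given records (needs 3+ records)."""
    my = sum(ys) / len(ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - my) ** 2 for y in ys)
    return math.sqrt(sse / (len(xs) - 2)), 1.0 - sse / sst


def pct_change(new, old):
    """Percent change between iterations; infinite if the old value is zero."""
    return abs(new - old) / abs(old) * 100.0 if old else math.inf


def dobr(xs, ys, rel_pct=80.0, abs_pct=80.0, change_pct=5.0, max_iter=20):
    a, b = fit_line(xs, ys)                          # Step 130: fit on all data
    se, r2 = fit_stats(xs, ys, a, b)
    keep = list(range(len(xs)))
    for _ in range(max_iter):                        # iteration count as a backstop
        pred = [a + b * x for x in xs]               # Step 140: errors for ALL records
        rel = [((p - y) / y) ** 2 for p, y in zip(pred, ys)]
        sq = [(p - y) ** 2 for p, y in zip(pred, ys)]
        rel_thr = percentile(rel, rel_pct)           # Step 150: threshold values
        abs_thr = percentile(sq, abs_pct)
        keep = [i for i in range(len(xs))            # Step 160: filter
                if rel[i] < rel_thr and sq[i] < abs_thr]
        kx = [xs[i] for i in keep]
        ky = [ys[i] for i in keep]
        a, b = fit_line(kx, ky)                      # Step 170: refit on accepted data
        new_se, new_r2 = fit_stats(kx, ky, a, b)
        done = (pct_change(new_se, se) <= change_pct     # Step 180: compare changes
                and pct_change(new_r2, r2) <= change_pct)
        se, r2 = new_se, new_r2
        if done:
            break                                    # Step 190: terminate
    return (a, b), keep


# Hypothetical data: y = 2x + 1 plus small noise; records 4 and 8 are corrupted.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [3.1, 4.8, 7.15, 8.95, 30.0, 12.9, 15.05, 16.85, 5.0, 20.9]
(a, b), keep = dobr(xs, ys)
```

Because the errors in Step 140 are recomputed for every record on each pass, a record rejected under an early, outlier-distorted fit remains eligible for re-admission once the coefficients improve. In this toy data set the two corrupted records are excluded from the final fit, and the fitted slope lands close to 2.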

(32) In another embodiment, the process begins with the selection of certain iterative parameters, specifically:

(33) (1) an absolute error and relative error percentile value wherein one, the other or both may be used in the iterative process,

(34) (2) a coefficient of determination (also known as $r^2$) improvement value, and

(35) (3) a standard error improvement value.

(36) The process begins with an original data set, a set of actual data, and either at least one coefficient or a factor used to calculate predicted values based on the original data set. A coefficient or set of coefficients will be applied to the original data set to create a set of predicted values. The set of coefficients may include, but is not limited to, scalars, exponents, parameters, and periodic functions. The set of predicted data is then compared to the set of actual data. A standard error and a coefficient of determination are calculated based on the differences between the predicted and actual data. The absolute and relative error associated with each one of the data points is used to remove data outliers based on the user-selected absolute and relative error percentile values. Ranking the data is not necessary, as all data falling outside the range associated with the percentile values for absolute and/or relative error are removed from the original data set. The use of absolute and relative errors to filter data is illustrative and for exemplary purposes only, as the method may be performed with only absolute or relative error or with another function.

(37) The data associated with the absolute and relative error within a user-selected percentile range is the outlier removed data set, and each iteration of the process will have its own filtered data set. This first outlier removed data set is used to determine predicted values that will be compared with actual values. At least one coefficient is determined by optimizing the errors, and then the coefficient is used to generate predicted values based on the first outlier removed data set. The outlier bias reduced coefficients serve as the mechanism by which knowledge is passed from one iteration to the next.

(38) After the first outlier removed data set is created, the standard error and coefficient of determination are calculated and compared with the standard error and coefficient of determination of the original data set. If the difference in standard error and the difference in coefficient of determination are both below their respective improvement values, then the process stops. However, if at least one of the improvement criteria is not met, then the process continues with another iteration. The use of standard error and coefficient of determination as checks for the iterative process is illustrative and exemplary only, as the check can be performed using only the standard error or only the coefficient of determination, a different statistical check, or some other performance termination criteria (such as number of iterations).
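
The stopping test described here can be sketched as a small predicate. The function name `should_stop`, its parameters, and the sample statistics are hypothetical; the sketch takes "difference" to mean the absolute difference between successive iterations and adds an iteration budget as one example of another termination criterion.

```python
def should_stop(prev_se, new_se, prev_r2, new_r2,
                se_improvement, r2_improvement,
                iteration, max_iterations):
    """Return True when both statistics changed by less than their improvement
    values, or when the iteration budget is exhausted."""
    se_small = abs(prev_se - new_se) < se_improvement
    r2_small = abs(new_r2 - prev_r2) < r2_improvement
    return (se_small and r2_small) or iteration >= max_iterations


# Hypothetical statistics from two successive passes.
first = should_stop(8.0, 0.13, 0.21, 0.999, 0.01, 0.01,
                    iteration=1, max_iterations=10)
second = should_stop(0.130, 0.126, 0.9990, 0.9991, 0.01, 0.01,
                     iteration=2, max_iterations=10)
```

With these numbers the first pass changes both statistics by far more than the improvement values, so the process continues; the second pass changes them only marginally, so it stops.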

(39) Assuming that the first iteration fails to meet the improvement criteria, the second iteration begins by applying the first outlier bias reduced data coefficients to the original data to determine a new set of predicted values. The original data is then processed again, establishing absolute and relative error for the data points as well as the standard error and coefficient of determination values for the original data set while using the first outlier removed data set coefficients. The data is then filtered to form a second outlier removed data set and to determine coefficients based on the second outlier removed data set.

(40) The second outlier removed data set, however, is not necessarily a subset of the first outlier removed data set, and it is associated with a second set of outlier bias reduced model coefficients, a second standard error, and a second coefficient of determination. Once those values are determined, the second standard error will be compared with the first standard error and the second coefficient of determination will be compared against the first coefficient of determination.

(41) If the improvement value (for standard error and coefficient of determination) exceeds the difference in these parameters, then the process will end. If not, then another iteration will begin by processing the original data yet again, this time using the second outlier bias reduced coefficients to process the original data set and generate a new set of predicted values. Filtering based on the user-selected percentile value for absolute and relative error will create a third outlier removed data set that will be optimized to determine a set of third outlier bias reduced coefficients. The process will continue until the error improvement or other termination criteria are met (such as a convergence criterion or a specified number of iterations).

(42) The output of this process will be a set of coefficients or model parameters, wherein a coefficient or model parameter is a mathematical value (or set of values), such as, but not limited to, a model predicted value for comparing data, slope and intercept values of a linear equation, exponents, or the coefficients of a polynomial. The output of Dynamic Outlier Bias Reduction will not be an output value of its own right, but rather the coefficients that will modify data to determine an output value.

(43) In another embodiment, illustrated in FIG. 2, Dynamic Outlier Bias Reduction is applied as a data quality technique to evaluate the consistency and accuracy of data to verify that the data is appropriate for a specific use. For data quality operations, the method may not involve an iterative procedure. Other data quality techniques may be used alongside Dynamic Outlier Bias Reduction during this process. The method is applied to the arithmetic average calculation of a given data set. The data quality criterion for this example is that the successive data values are contained within some range. Thus, any values that are spaced too far apart in value would constitute poor quality data. Error terms are then constructed from successive values of a function, and Dynamic Outlier Bias Reduction is applied to these error values.

(44) In Step 210 the initial data is listed in any order.

(45) Step 220 constitutes the function or operation that is performed on the dataset. In this example embodiment, the function or operation is the ascending ranking of the data followed by successive arithmetic average calculations, where each line corresponds to the average of all data at and above the line.

(46) Step 230 computes the relative and absolute errors from the data using successive values from the results of Step 220.

(47) Step 240 allows the analyst to enter the desired outlier removal error criteria (%). The Quality Criteria Value is the resultant value from the error calculations in Step 230 based on the data in Step 220.

(48) Step 250 shows the data quality outlier filtered dataset. Specific values are removed if the relative and absolute errors exceed the specified error criteria given in Step 240.

(49) Step 260 shows the arithmetic average calculation comparison between the complete and outlier removed datasets. As in all applied mathematical or statistical calculations, the analyst is the final step, judging whether the removed data elements identified as outliers are actually of poor quality. The Dynamic Outlier Bias Reduction system and method relieves the analyst of directly removing data, but best practice guidelines suggest that the analyst review and check the results for practical relevance.
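
Steps 210 through 260 can be sketched as follows. This is an illustrative sketch only: the function name data_quality_filter and its parameter are hypothetical, the errors are assumed to be the absolute and relative differences between successive running averages, the data values are assumed to be positive, and a value is assumed to be removed only when both errors exceed their Quality Criteria Values.

```python
import numpy as np

def data_quality_filter(values, criteria_pct=90.0):
    """Hypothetical helper following Steps 210-260 for an arithmetic average."""
    data = np.sort(np.asarray(values, dtype=float))          # Step 220: ascending rank
    running_avg = np.cumsum(data) / np.arange(1, len(data) + 1)
    # Step 230: errors between successive running averages (assumed form);
    # the first row has no predecessor, so its errors are set to zero.
    abs_err = np.zeros(len(data))
    rel_err = np.zeros(len(data))
    step = np.abs(np.diff(running_avg))
    abs_err[1:] = step
    rel_err[1:] = step / np.abs(running_avg[:-1])             # assumes non-zero averages
    # Step 240: Quality Criteria Values at the requested percentile.
    abs_q = np.percentile(abs_err, criteria_pct)
    rel_q = np.percentile(rel_err, criteria_pct)
    # Step 250: remove values whose errors both exceed the criteria.
    outlier = (abs_err > abs_q) & (rel_err > rel_q)
    kept, removed = data[~outlier], data[outlier]
    # Step 260: compare the averages of the complete and filtered data.
    return kept, removed, float(data.mean()), float(kept.mean())

kept, removed, full_avg, filtered_avg = data_quality_filter(
    [10, 11, 9, 10.5, 9.5, 10.2, 9.8, 10.1, 9.9, 50])
```

In this small example the single value that is far from the rest is the one flagged for removal; whether it is truly poor quality data remains the analyst's judgment, as noted above.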

(50) In another embodiment illustrated in FIG. 3, Dynamic Outlier Bias Reduction is applied as a data validation technique that tests the reasonable accuracy of a data set to determine if the data are appropriate for a specific use. For data validation operations, the method may not involve an iterative procedure. In this example, Dynamic Outlier Bias Reduction is applied to the calculation of the Pearson Correlation Coefficient between two data sets. The Pearson Correlation Coefficient can be sensitive to values in the data set that are relatively different than the other data points. Validating the data set with respect to this statistic is important to ensure that the result represents what the majority of data suggests rather than influence of extreme values. The data validation process for this example is that successive data values are contained within a specified range. Thus, any values that are spaced too far apart in value (e.g. outside the specified range) would signify poor quality data. This is accomplished by constructing the error terms of successive values of the function. Dynamic Outlier Bias Reduction is applied to these error values, and the outlier removed data set is validated data.

(51) In Step 310, the paired data is listed in any order.

(52) Step 320 computes the relative and absolute errors for each ordered pair in the dataset.

(53) Step 330 allows the analyst to enter the desired data validation criteria. In the example, both 90% relative and absolute error thresholds are selected. The Quality Criteria Value entries in Step 330 are the resultant absolute and relative error percentile values for the data shown in Step 320.

(54) Step 340 shows the outlier removal process where data that may be invalid is removed from the dataset using the criteria that the relative and absolute error values both exceed the values corresponding to the user selected percentile values entered in Step 330. In practice, other error criteria may be used, and when multiple criteria are applied as shown in this example, any combination of error values may be applied to determine the outlier removal rules.

(55) Step 350 computes the statistical results, in this case the Pearson Correlation Coefficient, for both the validated and the original data values. These results are then reviewed for practical relevance by the analyst.
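
A minimal sketch of Steps 310 through 350 is given below. The text above does not spell out how the per-pair errors are formed, so the sketch assumes, purely for illustration, an absolute error of |y − x| and a relative error of |y − x|/|x| for each ordered pair (x, y); the function name validate_pairs is likewise hypothetical. The removal rule follows Step 340: a pair is removed only when both errors exceed their percentile values.

```python
import numpy as np

def validate_pairs(x, y, percentile=90.0):
    """Hypothetical helper: filter paired data, then compare Pearson r."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    abs_err = np.abs(y - x)                 # ASSUMED error definition
    rel_err = abs_err / np.abs(x)           # ASSUMED; requires non-zero x
    abs_q = np.percentile(abs_err, percentile)   # Step 330 criteria values
    rel_q = np.percentile(rel_err, percentile)
    outlier = (abs_err > abs_q) & (rel_err > rel_q)   # Step 340: both exceed
    # Step 350: Pearson Correlation Coefficient before and after filtering.
    r_original = float(np.corrcoef(x, y)[0, 1])
    r_validated = float(np.corrcoef(x[~outlier], y[~outlier])[0, 1])
    return outlier, r_original, r_validated

x = np.arange(1.0, 21.0)
y = x + 0.5 * (-1.0) ** np.arange(20)   # pairs that track each other closely
y[9] = 60.0                             # one pair that is far out of line
outlier, r_original, r_validated = validate_pairs(x, y)
```

With this synthetic input, the one out-of-line pair dominates the original correlation, which is the sensitivity to extreme values that the embodiment is intended to expose.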

(56) In another embodiment, Dynamic Outlier Bias Reduction is used to perform a validation of an entire data set. Standard error improvement value, coefficient of determination improvement value, and absolute and relative error thresholds are selected, and then the data set is filtered according to the error criteria. Even if the original data set is of high quality, there will still be some data that will have error values that fall outside the absolute and relative error thresholds. Therefore, it is important to determine if any removal of data is necessary. If the outlier removed data set passes the standard error improvement and coefficient of determination improvement criteria after the first iteration, then the original data set has been validated, since the filtered data set produced changes in standard error and coefficient of determination that are too small to be considered significant (e.g. below the selected improvement values).
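
The single-pass validation check described above might be sketched as follows for a straight-line model. The names line_stats and dataset_is_validated, the default thresholds, the error definitions (absolute and relative differences between predicted and actual values), and the removal rule (either error above its percentile value) are all assumptions made for this illustration.

```python
import numpy as np

def line_stats(x, y):
    """Standard error and r^2 of the least-squares line through (x, y)."""
    m, b = np.polyfit(x, y, 1)
    resid = y - (m * x + b)
    ss_res = float(np.sum(resid ** 2))
    se = float(np.sqrt(ss_res / (len(x) - 2)))
    r2 = 1.0 - ss_res / float(np.sum((y - y.mean()) ** 2))
    return m, b, se, r2

def dataset_is_validated(x, y, percentile=90.0,
                         se_improvement=0.25, r2_improvement=0.01):
    """Hypothetical check: True if one filtering pass changes little."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m, b, se0, r2_0 = line_stats(x, y)
    abs_err = np.abs(y - (m * x + b))
    rel_err = abs_err / np.abs(y)            # assumes no zero actual values
    keep = ((abs_err <= np.percentile(abs_err, percentile)) &
            (rel_err <= np.percentile(rel_err, percentile)))
    _, _, se1, r2_1 = line_stats(x[keep], y[keep])
    # Validated when both changes fall below the selected improvement values.
    return bool(abs(se0 - se1) < se_improvement and
                abs(r2_0 - r2_1) < r2_improvement)
```

A data set for which this check returns True would not need any records removed under the selected criteria; a False result indicates that the first filtering pass changed the fit materially and that further iterations are warranted.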

(57) In another embodiment, Dynamic Outlier Bias Reduction is used to provide insight into how the iterations of data outlier removal are influencing the calculation. Graphs or data tables are provided to allow the user to observe the progression in the data outlier removal calculations as each iteration is performed. This stepwise approach enables analysts to observe unique properties of the calculation that can add value and knowledge to the result. For example, the speed and nature of convergence can indicate the influence of Dynamic Outlier Bias Reduction on computing representative factors for a multi-dimensional data set.

(58) As an illustration, consider a linear regression calculation over a poor quality data set of 87 records. The form of the equation being regressed is y=mx+b. Table 1 shows the results of the iterative process for 5 iterations. Notice that using relative and absolute error criteria of 95%, convergence is achieved in 3 iterations. Changes in the regression coefficients can be observed, and the Dynamic Outlier Bias Reduction method reduced the calculation data set to 79 records. The relatively low coefficient of determination (r.sup.2=39%) suggests that a lower (<95%) criterion should be tested to study the additional outlier removal effects on the r.sup.2 statistic and on the computed regression coefficients.

(59) TABLE 1. Dynamic Outlier Bias Reduction Example: Linear Regression at 95%

Iteration   N    Error   r.sup.2   m        b
0           87   3.903   25%       −0.428   41.743
1           78   3.048   38%       −0.452   43.386
2           83   3.040   39%       −0.463   44.181
3           79   3.030   39%       −0.455   43.630
4           83   3.040   39%       −0.463   44.181
5           79   3.030   39%       −0.455   43.630

(60) In Table 2 the results of applying Dynamic Outlier Bias Reduction are shown using the relative and absolute error criteria of 80%. Notice that a 15 percentage point (95% to 80%) change in outlier error criteria produced a 35 percentage point (39% to 74%) increase in r.sup.2 with an additional 35% decrease in admitted data (79 to 51 records included). The analyst can use a graphical view of the changes in the regression lines with the outlier removed data, together with the numerical results of Tables 1 and 2, to communicate the outlier removed results to a wider audience and to provide more insight into the effects of data variability on the analysis results.

(61) TABLE 2. Dynamic Outlier Bias Reduction Example: Linear Regression at 80%

Iteration   N    Error   r.sup.2   m        b
0           87   3.903   25%       −0.428   41.743
1           49   1.607   73%       −0.540   51.081
2           64   1.776   68%       −0.561   52.361
3           51   1.588   74%       −0.558   52.514
4           63   1.789   68%       −0.559   52.208
5           51   1.588   74%       −0.558   52.514
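
The comparison drawn between Tables 1 and 2 can be checked with a few lines of arithmetic. The sketch below copies the table rows as (iteration, N, standard error, r.sup.2 in percent, m, b) and uses a hypothetical helper, first_repeat, to find the first iteration whose results exactly repeat those of an earlier iteration; it makes no claim about why the rows repeat.

```python
# Rows copied from Tables 1 and 2: (iteration, N, error, r^2 %, m, b)
table_95 = [
    (0, 87, 3.903, 25, -0.428, 41.743),
    (1, 78, 3.048, 38, -0.452, 43.386),
    (2, 83, 3.040, 39, -0.463, 44.181),
    (3, 79, 3.030, 39, -0.455, 43.630),
    (4, 83, 3.040, 39, -0.463, 44.181),
    (5, 79, 3.030, 39, -0.455, 43.630),
]
table_80 = [
    (0, 87, 3.903, 25, -0.428, 41.743),
    (1, 49, 1.607, 73, -0.540, 51.081),
    (2, 64, 1.776, 68, -0.561, 52.361),
    (3, 51, 1.588, 74, -0.558, 52.514),
    (4, 63, 1.789, 68, -0.559, 52.208),
    (5, 51, 1.588, 74, -0.558, 52.514),
]

def first_repeat(rows):
    """Return (i, j): iteration j is the first to repeat an earlier iteration i."""
    seen = {}
    for iteration, *values in rows:
        key = tuple(values)
        if key in seen:
            return seen[key], iteration
        seen[key] = iteration
    return None

# Iteration 3 rows are the ones quoted in the text (79 and 51 records).
_, n_95, _, r2_95, _, _ = table_95[3]
_, n_80, _, r2_80, _, _ = table_80[3]
r2_gain_points = r2_80 - r2_95                       # percentage points
record_drop_pct = 100.0 * (n_95 - n_80) / n_95       # relative to 79 records
repeat_95 = first_repeat(table_95)
repeat_80 = first_repeat(table_80)
```

The r.sup.2 gain works out to 35 percentage points and the reduction from 79 to 51 records to roughly 35%, matching the figures quoted above. In Table 1 the results of iteration 4 repeat those of iteration 2, and in Table 2 the results of iteration 5 repeat those of iteration 3.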

(62) As illustrated in FIG. 4, one embodiment of a system used to perform the method includes a computing system. The hardware consists of a processor 410 that contains adequate system memory 420 to perform the required numerical computations. The processor 410 executes a computer program residing in system memory 420 to perform the method. Video and storage controllers 430 may be used to enable the operation of display 440. The system includes various data storage devices for data input such as floppy disk units 450, internal/external disk drives 460, internal CD/DVDs 470, tape units 480, and other types of electronic storage media 490. The aforementioned data storage devices are illustrative and exemplary only. These storage media are used to enter the data set and outlier removal criteria into the system, store the outlier removed data set, store calculated factors, and store the system-produced trend lines and trend line iteration graphs. The calculations can apply statistical software packages or can be performed from the data entered in spreadsheet formats using Microsoft Excel, for example. The calculations are performed using either customized software programs designed for company-specific system implementations or commercially available software that is compatible with Excel or other database and spreadsheet programs. The system can also interface with proprietary or public external storage media 300 to link with other databases to provide data to be used with the Dynamic Outlier Bias Reduction system and method calculations. The output devices can be a telecommunication device 510 to transmit the calculation worksheets and other system produced graphs and reports via an intranet or the Internet to management or other personnel, printers 520, electronic storage media similar to those mentioned as input devices 450, 460, 470, 480, 490, and proprietary storage databases 530. The output devices described herein are illustrative and exemplary only.

(63) As illustrated in FIGS. 5, 6A, 6B, 7A, 7B, 8A, and 8B, in one embodiment, Dynamic Outlier Bias Reduction can be used to quantitatively and qualitatively assess the quality of the data set based on the error and correlation of the data set's data values, as compared to the error and correlation of a benchmark dataset comprised of random data values developed from within an appropriate range. In one embodiment, the error can be designated to be the data set's standard error, and the correlation can be designated to be the data set's coefficient of determination (r.sup.2). In another embodiment, correlation can be designated to be the Kendall rank correlation coefficient, commonly referred to as Kendall's tau (τ) coefficient. In yet another embodiment, correlation can be designated to be the Spearman's rank correlation coefficient, or Spearman's ρ (rho) coefficient. As explained above, Dynamic Outlier Bias Reduction is used to systematically remove data values that are identified as outliers, that is, values not representative of the underlying model or process being described. Normally, outliers are associated with a relatively small number of data values. In practice, however, a dataset could be unknowingly contaminated with spurious values or random noise. The graphical illustrations of FIGS. 5, 6A, 6B, 7A, 7B, 8A, and 8B show how the Dynamic Outlier Bias Reduction system and method can be applied to identify situations where the underlying model is not supported by the data. The outlier reduction is performed by removing data values for which the relative and/or absolute errors, computed between the model predicted and actual data values, are greater than a percentile-based bias criterion, e.g. 80%. This means that the data values are removed if either the relative or absolute error values are greater than the threshold values associated with the 80th percentile (80% of the data values have an error less than this value).
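
The percentile-based removal rule can be illustrated with a short sketch. The function name outlier_mask and its rule argument are hypothetical, and the error definitions (|predicted − actual| and |predicted − actual|/|actual|) are assumptions. The "either" setting corresponds to the rule described in this paragraph; the "both" setting corresponds to the rule used in the FIG. 3 example.

```python
import numpy as np

def outlier_mask(actual, predicted, percentile=80.0, rule="either"):
    """Hypothetical helper: True marks a data value to be removed."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    abs_err = np.abs(predicted - actual)
    rel_err = abs_err / np.abs(actual)            # assumes non-zero actual values
    over_abs = abs_err > np.percentile(abs_err, percentile)
    over_rel = rel_err > np.percentile(rel_err, percentile)
    if rule == "either":
        return over_abs | over_rel
    if rule == "both":
        return over_abs & over_rel
    raise ValueError("rule must be 'either' or 'both'")

actual = [1, 20, 30, 40, 100]
predicted = [1.5, 20.5, 30.5, 40.5, 110]
either = outlier_mask(actual, predicted, rule="either")
both = outlier_mask(actual, predicted, rule="both")
```

In this example the first record has the largest relative error and the last record has the largest absolute error, so the "either" rule removes both of them while the "both" rule removes neither. The choice of rule therefore changes how much data is admitted at a given percentile.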

(64) As illustrated in FIG. 5, both a realistic model development dataset and a dataset of random values developed within the range of the actual dataset are compared. Because in practice the analysts typically do not have prior knowledge of any dataset contamination, such realization must come from observing the iterative results from several model calculations using the dynamic outlier bias reduction system and method. FIG. 5 illustrates exemplary model development calculation results for both datasets. The standard error, a measure of the amount of model unexplained error, is plotted versus the coefficient of determination (%) or r.sup.2, representing how much data variation is explained by the model. The percentile values next to each point represent the bias criteria. For example, 90% signifies that data values for relative or absolute error values greater than the 90th percentile are removed from the model as outliers. This corresponds to removing 10% of the data values with the highest errors each iteration.

(65) As FIG. 5 illustrates, for both the random and realistic dataset models, error is reduced by tightening the bias criteria (i.e., lowering the percentile threshold so that more data values are removed); the standard error and the coefficient of determination are improved for both datasets. However, the standard error for the random dataset is two to three times larger than that of the realistic model dataset. The analyst may use a coefficient of determination requirement of 80%, for example, as an acceptable level of precision for determining model parameters. In FIG. 5, an r.sup.2 of 80% is achieved at a 70% bias criteria for the random dataset, and at an approximately 85% bias criteria for the realistic data. However, the corresponding standard error for the random dataset is over twice as large as that of the realistic dataset. Thus, by systematically running the model dataset analysis with different bias criteria, repeating the calculations with a representative spurious dataset, and plotting the result as shown in FIG. 5, analysts can assess acceptable bias criteria (i.e., the acceptable percentage of data values removed) for a data set, and accordingly, the overall dataset quality. Moreover, such systematic model dataset analysis may be used to automatically render advice regarding the viability of a data set as used in developing a model based on a configurable set of parameters. For example, in one embodiment wherein a model is developed using Dynamic Outlier Bias Reduction for a dataset, the error and correlation coefficient values for the model dataset and for a representative spurious dataset, calculated under different bias criteria, may be used to automatically render advice regarding the viability of the data set in supporting the developed model, and inherently, the viability of the developed model in supporting the dataset.
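
The benchmark comparison described above can be sketched by running the same outlier bias reduction on a model dataset and on a random dataset drawn from the same range, at several bias criteria. Everything in this sketch is illustrative: the function name dobr_stats, the straight-line model, the synthetic data, the error definitions, the "either" removal rule, and the stopping rule (stop when the set of kept records no longer changes, or after max_iter passes) are assumptions, and the numbers it produces are not those of FIG. 5.

```python
import numpy as np

def dobr_stats(x, y, bias_pct, max_iter=10):
    """Hypothetical helper: (standard error, r^2, records kept) at one bias criteria."""
    keep = np.ones(len(x), dtype=bool)
    for _ in range(max_iter):
        m, b = np.polyfit(x[keep], y[keep], 1)
        err = np.abs(y - (m * x + b))             # errors on the ORIGINAL data
        rel = err / np.abs(y)                     # assumes non-zero actual values
        new_keep = ((err <= np.percentile(err, bias_pct)) &
                    (rel <= np.percentile(rel, bias_pct)))
        if np.array_equal(new_keep, keep):        # kept set is stable
            break
        keep = new_keep
    m, b = np.polyfit(x[keep], y[keep], 1)
    resid = y[keep] - (m * x[keep] + b)
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((y[keep] - y[keep].mean()) ** 2))
    std_err = float(np.sqrt(ss_res / (keep.sum() - 2)))
    return std_err, 1.0 - ss_res / ss_tot, int(keep.sum())

# Synthetic stand-ins for the "realistic" and "random" datasets.
rng = np.random.default_rng(0)
x = np.linspace(1.0, 50.0, 60)
y_real = 2.0 * x + 20.0 + rng.normal(0.0, 3.0, x.size)
y_rand = rng.uniform(y_real.min(), y_real.max(), x.size)

results = {pct: (dobr_stats(x, y_real, pct), dobr_stats(x, y_rand, pct))
           for pct in (100, 90, 80, 70)}
```

Plotting the standard error against r.sup.2 for each entry of results would give a chart of the same kind as FIG. 5, with one curve per dataset and one point per bias criteria. A rule for automatically rendering advice could then compare the two curves, for example by the ratio of standard errors at the bias criteria where each dataset first reaches the required r.sup.2.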

(66) As illustrated in FIG. 5, observing the behavior of these model performance values for several cases provides a quantitative foundation for determining whether the data values are representative of the processes being modeled. For example, referring to FIG. 5, the standard error for the realistic data set at a 100% bias criteria (i.e., no bias reduction) corresponds to the standard error for the random data set at approximately 65% bias criteria (i.e., 35% of the data values with the highest errors removed). Such a finding supports the conclusion that the data is not contaminated.

(67) In addition to the above-described quantitative analysis facilitated by the illustrative graph of FIG. 5, Dynamic Outlier Bias Reduction can be utilized in an equally, if not more, powerful subjective procedure to help assess a dataset's quality. This is done by plotting the model predicted values against the actual target values given in the data, for both the outlier and included results.

(68) FIGS. 6A and 6B illustrate these plots for the 100% points of both the realistic and random curves in FIG. 5. The large scatter in FIG. 6A is consistent with the arbitrary target values and the resultant inability of the model to fit this intentional randomness. FIG. 6B is consistent with typical practical data collection in that the model predicted and actual values are grouped more closely around the line whereon model predicted values equal actual target values (hereinafter Actual=Predicted line).

(69) FIGS. 7A and 7B illustrate the results from the 70% points in FIG. 5 (i.e., 30% of data removed as outliers). In FIGS. 7A and 7B the outlier bias reduction is shown to remove the points most distant from the Actual=Predicted line, but the large variation in model accuracy between FIGS. 7A and 7B suggests that this dataset is representative of the processes being modeled.

(70) FIGS. 8A and 8B show the results from the 50% points in FIG. 5 (i.e., 50% of data removed as outliers). In this case about half of the data is identified as outliers, and even with this much variation removed from the dataset, the model, in FIG. 8A, still does not closely describe the random dataset. The general variation around the Actual=Predicted line is about the same as in FIGS. 6A and 7A, taking into account the removed data in each case. FIG. 8B shows that with 50% of the variability removed, the model was able to produce predicted results that closely match the actual data. These types of visual plots, together with the analysis of performance criteria shown in FIG. 5, can be used by analysts to assess the quality of actual datasets in practice for model development. While FIGS. 5, 6A, 6B, 7A, 7B, 8A, and 8B illustrate visual plots wherein the analysis is based on performance criteria trends corresponding to various bias criteria values, in other embodiments, the analysis can be based on other variables that correspond to bias criteria values, such as model coefficient trends corresponding to various bias criteria selected by the analyst.

(71) The foregoing disclosure and description of the preferred embodiments of the invention are illustrative and explanatory thereof and it will be understood by those skilled in the art that various changes in the details of the illustrated system and method may be made without departing from the scope of the invention.

CPC classification (section: PHYSICS): G06N7/01, G06F2201/81, G06F11/3447, G06N20/00, G06N5/04, G06F18/2433, G06F18/245, G06F17/18, G06F11/3452, G06F18/2193, G06F18/10

International classification (section: PHYSICS): G06N5/04, G06F11/34, G06F17/18, G06N7/00
