METHODS FOR SCREENING A SUBJECT FOR THE RISK OF CHRONIC KIDNEY DISEASE AND COMPUTER-IMPLEMENTED METHOD

Abstract

The disclosure relates to a method for screening a subject for the risk of chronic kidney disease (CKD), comprising: receiving marker data indicative for a plurality of marker parameters for a subject, such plurality of marker parameters indicating, for the subject for a measurement period, an age value, a sample level of creatinine, and a sample level of albumin; and determining a risk factor indicative of the risk of suffering CKD for the subject from the plurality of marker parameters, wherein the determining comprises: weighting the age value higher than the sample level of albumin, and weighting the sample level of creatinine higher than the sample level of albumin. Further, a computer-implemented method for screening a subject and a method for screening a subject for the risk of chronic kidney disease (CKD) are provided.

Claims

1. A method for screening a subject for the risk of chronic kidney disease (CKD), comprising receiving marker data indicative for a plurality of marker parameters for a subject, such plurality of marker parameters indicating, for the subject for a measurement period, an age value, a sample level of creatinine, and a sample level of albumin; and determining a risk factor indicative of the risk of suffering CKD for the subject from the plurality of marker parameters, wherein the determining comprises weighting the age value higher than the sample level of albumin, and weighting the sample level of creatinine higher than the sample level of albumin.

2. The method according to claim 1, further comprising the plurality of marker parameters indicating, for the subject, a blood sample level of creatinine.

3. The method according to claim 1, further comprising the plurality of marker parameters indicating, for the subject, a blood sample level of albumin.

4. The method according to claim 1, wherein the subject is a diabetes patient.

5. The method according to claim 1, wherein the measurement period is limited to two years.

6. The method according to claim 1, wherein the subject has not been diagnosed with diabetes by the end of the measurement period.

7. The method according to claim 4, wherein the measurement period lies after a diabetes diagnosis for the subject, at least in part.

8. The method according to claim 1, wherein the risk factor is indicative of the risk of suffering CKD for the subject within a prediction time period of three years from the end of the measurement period.

9. The method according to claim 1, wherein the determining further comprises weighting the age higher than the sample level of creatinine.

10. The method according to claim 1, wherein the receiving comprises receiving marker data indicative for a plurality of marker parameters for a subject having a sample level of HbA1c of less than 6.5%.

11. The method according to claim 1, further comprising the plurality of marker parameters indicating, for the subject, a sample level of a glomerular filtration rate; and in the determining, weighting each of the age value, the sample level of albumin, and the sample level of creatinine higher than the sample level of a glomerular filtration rate.

12. The method according to claim 1, wherein the risk factor is determined according to the equation $P_{CKD} = \frac{e^{P_{CKD\_Pred}}}{1 + e^{P_{CKD\_Pred}}},$ and wherein $P_{CKD}$ is the risk factor; $P_{CKD\_Pred} = c_{CKD1} \cdot age + c_{CKD2} \cdot creatinine + c_{CKD3} \cdot albumin + c_{CKD4}$; age is the age of the subject; creatinine is a sample level of creatinine for the subject; albumin is a sample level of albumin for the subject; and $c_{CKD1}$, $c_{CKD2}$, $c_{CKD3}$, and $c_{CKD4}$ are constants.

13. The method according to claim 1, wherein the risk factor is determined according to the equation $P'_{CKD} = \frac{e^{P'_{CKD\_Pred}}}{1 + e^{P'_{CKD\_Pred}}},$ and wherein $P'_{CKD}$ is the risk factor; $P'_{CKD\_Pred} = c'_{CKD1} \cdot age + c'_{CKD2} \cdot creatinine + c'_{CKD3} \cdot albumin + c'_{CKD4} + c'_{CKD5} \cdot eGFR$; age is the age of the subject; creatinine is a sample level of creatinine for the subject; albumin is a sample level of albumin for the subject; eGFR is a sample level of estimated glomerular filtration rate for the subject; and $c'_{CKD1}$, $c'_{CKD2}$, $c'_{CKD3}$, $c'_{CKD4}$, and $c'_{CKD5}$ are constants.

14. A computer-implemented method for screening a subject for the risk of chronic kidney disease (CKD) in a data processing system having a processor and a non-transitory memory storing a program causing the processor to execute: receiving marker data indicative for a plurality of marker parameters for a subject, such plurality of marker parameters indicating, for the subject for a measurement period, an age value, a sample level of albumin, and a sample level of creatinine; and determining a risk factor indicative of the risk of suffering CKD for the subject from the plurality of marker parameters, wherein the determining comprises weighting the age value higher than the sample level of albumin, and weighting the sample level of creatinine higher than the sample level of albumin.

15. A method for screening a subject for the risk of chronic kidney disease (CKD), comprising receiving marker data indicative for a plurality of marker parameters, such plurality of marker parameters indicating an age value for the subject, a sample level of creatinine for a measurement period, and a sample level of albumin for a measurement period; and determining a risk factor indicative of the risk of suffering CKD for the subject from the plurality of marker parameters, wherein the determining comprises weighting the age value higher than the sample level of albumin, and weighting the sample level of creatinine higher than the sample level of albumin, wherein at least one of the sample level of creatinine and the sample level of albumin is indicative of a generalized value of sample levels for a reference group of subjects not comprising the subject, for a respective measurement period of each subject of the reference group of subjects.

Description

DESCRIPTION OF FURTHER EMBODIMENTS

[0133] In the following, further embodiments are described by way of example. The figures show:

[0134] FIG. 1 the distribution of age in an example teaching training set, validation set and further validation set;

[0135] FIG. 2 the distribution of HbA1C in an example teaching training set, validation set and further validation set;

[0136] FIG. 3 a comparison of algorithms for predicting CKD;

[0137] FIG. 4 a comparison of algorithms for predicting CKD using subcohorts;

[0138] FIG. 5 another comparison of algorithms for predicting CKD; and

[0139] FIG. 6 a further comparison of algorithms for predicting CKD.

[0140] In general, in any of the embodiments of the method for screening a subject for the risk of CKD, creatinine.sub.max may be a maximum sample level of creatinine from a plurality of sample levels of creatinine for the subject, albumin.sub.min may be a minimum sample level of albumin from a plurality of sample levels of albumin for the subject, eGFR.sub.min may be a minimum sample level of estimated glomerular filtration rate from a plurality of sample levels of estimated glomerular filtration rate for the subject, BMI.sub.min may be a minimum value of the Body Mass Index (BMI) from a plurality of values of the BMI for the subject, Glucose.sub.min may be a minimum sample level of glucose from a plurality of sample levels of glucose for the subject and HbA.sub.mean may be a mean sample level of C-fraction of glycated haemoglobin A1 from a plurality of sample levels of C-fraction of glycated haemoglobin A1 for the subject. Such values and/or sample levels may be determined from values and/or sample levels already on file for the subject. Alternatively or in addition, values and/or sample levels may be determined for the subject specifically for use with the method for screening a subject for the risk of CKD. Values and/or sample levels may be real world data, i.e., unlike clinical data, they may not be restricted regarding, for example, completeness or veracity of the data.

[0141] In the method for screening a subject for the risk of CKD, creatinine.sub.max may be expressed in units of mg/dl, albumin.sub.min may be expressed in units of g/dl, eGFR.sub.min may be expressed in units of ml/min/1.73 m.sup.2, BMI.sub.min may be expressed in units of kg/m.sup.2, Glucose.sub.min may be expressed in units of mg/dl and HbA.sub.mean may be expressed in units of %. Glomerular filtration rates may be estimated using an MDRD formula, known in the art as such. Alternatively, glomerular filtration rates may be estimated using the CKD-EPI formula, known in the art as such.
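As a minimal sketch of how an estimated glomerular filtration rate could be derived from a creatinine sample level, the following Python function applies the four-variable MDRD study equation in its IDMS-traceable form (leading coefficient 175). The disclosure does not state which MDRD variant is used, so the choice of variant, the function name `egfr_mdrd` and its parameters are assumptions made for illustration only.

```python
def egfr_mdrd(creatinine_mg_dl: float, age_years: float,
              female: bool = False, black: bool = False) -> float:
    """Estimate GFR in ml/min/1.73 m^2 with the 4-variable MDRD equation.

    Assumption: IDMS-traceable form with leading coefficient 175; the
    disclosure only refers to "an MDRD formula" without naming a variant.
    """
    if creatinine_mg_dl <= 0 or age_years <= 0:
        raise ValueError("creatinine and age must be positive")
    egfr = 175.0 * creatinine_mg_dl ** -1.154 * age_years ** -0.203
    if female:
        egfr *= 0.742  # sex correction factor of the MDRD equation
    if black:
        egfr *= 1.212  # race correction factor of the MDRD equation
    return egfr
```

The result is already normalized to a body surface area of 1.73 m.sup.2, which matches the unit stated for eGFR.sub.min.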

[0142] Marker data may be received for a subject suffering from diabetes. Alternatively, the subject does not suffer from diabetes but may be at risk of suffering from diabetes in the future. The marker data is indicative for marker parameters age, creatinine.sub.max and albumin.sub.min for the subject. The parameter “age” indicates the age of the subject in years. The parameter “creatinine.sub.max” is indicative of a maximum sample level of creatinine from a plurality of sample levels of creatinine on file for the subject and collected over the prior 2 years from blood samples. The parameter “albumin.sub.min” is indicative of a minimum sample level of albumin from a plurality of sample levels of albumin on file for the subject and collected over the prior 2 years from blood samples.
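The derivation of these marker parameters from sample levels on file can be sketched as follows. This is an illustrative example, not the implementation of the disclosure: the record layout (lists of `(date, value)` pairs), the function names and the approximation of the 2-year window as 730 days are all assumptions.

```python
from datetime import date, timedelta

def window_values(samples, end, years=2):
    """Return the values of (date, value) samples taken in the `years` before `end`.

    Assumption: the window is approximated as 365 days per year.
    """
    start = end - timedelta(days=365 * years)
    return [value for day, value in samples if start <= day <= end]

def marker_parameters(age, creatinine_samples, albumin_samples, end):
    """Build the marker parameters age, creatinine_max and albumin_min."""
    creatinine = window_values(creatinine_samples, end)
    albumin = window_values(albumin_samples, end)
    if not creatinine or not albumin:
        raise ValueError("no sample levels on file within the measurement period")
    return {
        "age": age,                          # years
        "creatinine_max": max(creatinine),   # mg/dl, maximum over the period
        "albumin_min": min(albumin),         # g/dl, minimum over the period
    }
```

Samples older than the measurement period are ignored, so a single historical outlier does not influence the maximum or minimum.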

[0143] According to this embodiment, marker data is indicative for the marker parameters age, creatinine.sub.max and albumin.sub.min for the subject, thereby providing a simplified method for calculating a risk factor indicative of the risk of suffering CKD for the subject. In further embodiments, as will be set forth in more detail below, further marker data indicative for at least one of the marker parameters eGFR.sub.min, BMI.sub.min, Glucose.sub.min and HbA.sub.mean for the subject may be included in the calculation to provide a more accurate calculation for the risk factor.

[0144] In an example, a risk factor indicative of the risk of suffering CKD for the subject is determined from the plurality of marker parameters according to the following equations:

[00007]
$P_{CKD} = \frac{e^{P_{CKD\_Pred}}}{1 + e^{P_{CKD\_Pred}} + e^{P_{Death\_Pred}}}$
$P_{CKD\_Pred} = 0.02739 \cdot age / \mathrm{year} + 1.387 \cdot {creatinine}_{\max} \cdot \mathrm{dl} / \mathrm{mg} - 0.3356 \cdot {albumin}_{\min} \cdot \mathrm{dl} / \mathrm{g} - 3.1925$
$P_{Death\_Pred} = 0.06103 \cdot age / \mathrm{year} + 0.8194 \cdot {creatinine}_{\max} \cdot \mathrm{dl} / \mathrm{mg} - 0.9336 \cdot {albumin}_{\min} \cdot \mathrm{dl} / \mathrm{g} - 3.3325$

[0145] Thereby, the age value is weighted higher than the sample level of albumin and the sample level of creatinine is weighted higher than the sample level of albumin.
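For illustration, the three-marker equations above can be evaluated with a few lines of Python. The coefficients are copied from the equations of this example; the function and variable names are hypothetical, and the sketch assumes that age is given in years, creatinine.sub.max in mg/dl and albumin.sub.min in g/dl.

```python
import math

# Coefficients copied from the example equations (three-marker model).
CKD_COEF = {"age": 0.02739, "creatinine_max": 1.387, "albumin_min": -0.3356}
CKD_INTERCEPT = -3.1925
DEATH_COEF = {"age": 0.06103, "creatinine_max": 0.8194, "albumin_min": -0.9336}
DEATH_INTERCEPT = -3.3325

def linear_predictor(coef, intercept, markers):
    """Weighted sum of the marker parameters plus the constant term."""
    return intercept + sum(c * markers[name] for name, c in coef.items())

def ckd_risk(markers):
    """Risk factor P_CKD, with death entering the denominator as a second outcome."""
    e_ckd = math.exp(linear_predictor(CKD_COEF, CKD_INTERCEPT, markers))
    e_death = math.exp(linear_predictor(DEATH_COEF, DEATH_INTERCEPT, markers))
    return e_ckd / (1.0 + e_ckd + e_death)

# Hypothetical subject: 60 years, creatinine_max 1.0 mg/dl, albumin_min 4.0 g/dl.
risk = ckd_risk({"age": 60, "creatinine_max": 1.0, "albumin_min": 4.0})
```

Because the denominator also contains the death term, the returned value always lies strictly between 0 and 1 and is lower than the value a plain logistic function of P.sub.CKD_Pred would give.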

[0146] Marker data may be received for a subject suffering from diabetes. Alternatively, the subject does not suffer from diabetes but may be at risk of suffering from diabetes in the future. The marker data is indicative for marker parameters age, creatinine.sub.max, albumin.sub.min and eGFR.sub.min for the subject. The parameter “age” indicates the age of the subject in years. The parameter “creatinine.sub.max” is indicative of a maximum sample level of creatinine from a plurality of sample levels of creatinine on file for the subject and collected over the prior 2 years from blood samples. The parameter “albumin.sub.min” is indicative of a minimum sample level of albumin from a plurality of sample levels of albumin on file for the subject and collected over the prior 2 years from blood samples. The parameter “eGFR.sub.min” is indicative of a minimum sample level of estimated glomerular filtration rate from a plurality of sample levels of estimated glomerular filtration rate on file for the subject and collected over the prior 2 years.

[0147] In an example, a risk factor indicative of the risk of suffering CKD for the subject is determined from the plurality of marker parameters according to the following equations:

[00008]
$P_{CKD} = \frac{e^{P_{CKD\_Pred}}}{1 + e^{P_{CKD\_Pred}} + e^{P_{Death\_Pred}}}$
$P_{CKD\_Pred} = 0.02739 \cdot age / \mathrm{year} + 1.387 \cdot {creatinine}_{\max} \cdot \mathrm{dl} / \mathrm{mg} - 0.3356 \cdot {albumin}_{\min} \cdot \mathrm{dl} / \mathrm{g} - 0.02843 \cdot {eGFR}_{\min} \cdot \mathrm{min} \cdot 1.73\,\mathrm{m}^{2} / \mathrm{ml} - 1.3013$
$P_{Death\_Pred} = 0.06103 \cdot age / \mathrm{year} + 0.8194 \cdot {creatinine}_{\max} \cdot \mathrm{dl} / \mathrm{mg} - 0.9336 \cdot {albumin}_{\min} \cdot \mathrm{dl} / \mathrm{g} + 0.01654 \cdot {eGFR}_{\min} \cdot \mathrm{min} \cdot 1.73\,\mathrm{m}^{2} / \mathrm{ml} - 4.4328$

[0148] Thereby, the age value is weighted higher than the sample level of albumin, the sample level of creatinine is weighted higher than the sample level of albumin, and each of the age value, the sample level of albumin, and the sample level of creatinine is weighted higher than the sample level of glomerular filtration rate.

[0149] Marker data may be received for a subject suffering from diabetes. Alternatively, the subject does not suffer from diabetes but may be at risk of suffering from diabetes in the future. The marker data is indicative for marker parameters age, creatinine.sub.max, albumin.sub.min, eGFR.sub.min, BMI.sub.min, Glucose.sub.min and HbA.sub.mean for the subject. The parameter “age” indicates the age of the subject in years. The parameter “creatinine.sub.max” is indicative of a maximum sample level of creatinine from a plurality of sample levels of creatinine on file for the subject and collected over the prior 2 years from blood samples. The parameter “albumin.sub.min” is indicative of a minimum sample level of albumin from a plurality of sample levels of albumin on file for the subject and collected over the prior 2 years from blood samples. The parameter “eGFR.sub.min” is indicative of a minimum sample level of estimated glomerular filtration rate from a plurality of sample levels of estimated glomerular filtration rate on file for the subject and collected over the prior 2 years. The parameter “BMI.sub.min” is indicative of a minimum value for the Body Mass Index from a plurality of values for the Body Mass Index on file for the subject and collected over the prior 2 years. The parameter “Glucose.sub.min” is indicative of a minimum sample level of blood glucose from a plurality of sample levels of blood glucose on file for the subject and collected over the prior 2 years. The parameter “HbA.sub.mean” is indicative of a mean sample level of C-fraction of glycated haemoglobin A1 from a plurality of sample levels of C-fraction of glycated haemoglobin A1 on file for the subject and collected over the prior 2 years.

[0150] A risk factor indicative of the risk of suffering CKD for the subject is determined from the plurality of marker parameters according to the following equations:

[00009]
$P_{CKD} = \frac{e^{P_{CKD\_Pred}}}{1 + e^{P_{CKD\_Pred}} + e^{P_{Death\_Pred}}}$
$P_{CKD\_Pred} = 0.02739 \cdot age / \mathrm{year} + 1.387 \cdot {creatinine}_{\max} \cdot \mathrm{dl} / \mathrm{mg} - 0.3356 \cdot {albumin}_{\min} \cdot \mathrm{dl} / \mathrm{g} - 0.02843 \cdot {eGFR}_{\min} \cdot \mathrm{min} \cdot 1.73\,\mathrm{m}^{2} / \mathrm{ml} + 0.01128 \cdot {BMI}_{\min} + 0.0004946 \cdot {Glucose}_{\min} \cdot \mathrm{dl} / \mathrm{mg} + 0.0893 \cdot {HbA}_{mean} / \% - 2.409$
$P_{Death\_Pred} = 0.06103 \cdot age / \mathrm{year} + 0.8194 \cdot {creatinine}_{\max} \cdot \mathrm{dl} / \mathrm{mg} - 0.9336 \cdot {albumin}_{\min} \cdot \mathrm{dl} / \mathrm{g} + 0.01654 \cdot {eGFR}_{\min} \cdot \mathrm{min} \cdot 1.73\,\mathrm{m}^{2} / \mathrm{ml} - 0.0101 \cdot {BMI}_{\min} + 0.0009107 \cdot {Glucose}_{\min} \cdot \mathrm{dl} / \mathrm{mg} + 0.04368 \cdot {HbA}_{mean} / \% - 4.557$

[0151] Thereby, the age value is weighted higher than the sample level of albumin, the age is weighted higher than the sample level of creatinine, the sample level of creatinine is weighted higher than the sample level of albumin, and each of the age value, the sample level of albumin, and the sample level of creatinine is weighted higher than the sample level of glomerular filtration rate. Further, each of the age value, the sample level of albumin, the sample level of creatinine and the sample level of glomerular filtration rate is weighted higher than each of the value of the Body Mass Index, the sample level of blood glucose and the sample level of C-fraction of glycated haemoglobin A1.
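A possible evaluation of the seven-marker equations, including the substitution of missing values by mean values, is sketched below. The coefficients are copied from the equations of this example. The names `ckd_risk_full` and `fallback_means` are hypothetical, and the disclosure does not list the mean values themselves, so they must be supplied by the caller.

```python
import math

# Per marker: (coefficient in P_CKD_Pred, coefficient in P_Death_Pred).
COEF = {
    "age": (0.02739, 0.06103),             # years
    "creatinine_max": (1.387, 0.8194),     # mg/dl
    "albumin_min": (-0.3356, -0.9336),     # g/dl
    "egfr_min": (-0.02843, 0.01654),       # ml/min/1.73 m^2
    "bmi_min": (0.01128, -0.0101),         # kg/m^2
    "glucose_min": (0.0004946, 0.0009107), # mg/dl
    "hba_mean": (0.0893, 0.04368),         # %
}
INTERCEPT = (-2.409, -4.557)

def ckd_risk_full(markers, fallback_means=None):
    """Risk factor P_CKD from up to seven markers.

    Markers that are absent or None are replaced by `fallback_means`
    (assumption: caller-supplied cohort or population means).
    """
    fallback_means = fallback_means or {}
    p_ckd, p_death = INTERCEPT
    for name, (c_ckd, c_death) in COEF.items():
        value = markers.get(name)
        if value is None:
            if name not in fallback_means:
                raise KeyError(f"no value and no fallback mean for {name!r}")
            value = fallback_means[name]
        p_ckd += c_ckd * value
        p_death += c_death * value
    e_ckd, e_death = math.exp(p_ckd), math.exp(p_death)
    return e_ckd / (1.0 + e_ckd + e_death)
```

Calling the function with only age, creatinine.sub.max and albumin.sub.min and with fallback means for the remaining four markers corresponds to using the full equations with mean values in place of missing data.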

[0152] In the method for screening a subject for the risk of CKD, all or any of the values to be multiplied with the values and/or sample levels for the subject in determining P.sub.CKD_Pred and/or P.sub.Death_Pred may be determined as follows.

[0153] An algorithm is taught using electronic health record (EHR) data, for example from 417,912 people with diabetes (types 1 and 2) among more than 55 million people represented in a database. The data is retrieved for the time window starting 2 years before the initial diagnosis of diabetes and lasting until up to 3 years following this diagnosis. The data can be considered as real-world data (RWD) and no general restrictions on, for example, completeness or veracity of the data are applied. Missing data is imputed with the cohort's mean value before feature selection and teaching the algorithm. Logistic regression is chosen for teaching rather than a black box approach such as deep learning. This may allow for the medical interpretation of the data-driven analysis. After teaching, an independent sample set of data, for example originating from 104,504 further individuals in the same database, is used for independent validation. In addition, the algorithm is applied to data, for example from 82,912 persons with type-2 diabetes included in a further database.

[0154] ICD codes may be used as target variables for training as well as the CKD reference diagnosis in the analysis of the validation results. The definition of the target feature “CKD” may be solely based on the occurrence of the respective ICD codes in the databases. In order to maintain the RWD character of the data set, no additions or changes may be made to the databases. Such ICD codes may comprise ICD-9 codes and ICD-10 codes, for example the following ICD codes: 250.40, 250.41, 250.42, 250.43, 585.1, 585.2, 585.3, 585.4, 585.5, 585.6, 585.9, 403.00, 403.01, 403.11, 403.90, 403.91, 404.0, 404.00, 404.01, 404.02, 404.03, 404.1, 404.10, 404.11, 404.12, 404.13, 404.9, 404.90, 404.91, 404.92, 404.93, 581.81, 581.9, 583.89, 588.9, E10.2, E10.21, E10.22, E10.29, E11.2, E11.21, E11.22, E11.29, N17.0, N17.1, N17.2, N17.8, N17.9, N18.1, N18.2, N18.3, N18.4, N18.5, N18.6, N18.9, N19, I12.0, I12.9, I13, I13.0, I13.1, I13.10, I13.11, I13.2, N04.9, N05.8, N08 and/or N25.9.

[0155] The ICD-9 codes 250.40, 403.90, 585.3 and 585.9 may be the most abundant diagnoses in the respective time windows of the data set, and they occur in >5% of the cases within each of the data sets.

[0156] In a further method for screening a subject for the risk of CKD, all or any of the values to be multiplied with the values and/or sample levels for the subject in determining P.sub.CKD_Pred and/or P.sub.Death_Pred may be determined as follows.

[0157] In order to allow an early risk assessment for CKD, EHR data is extracted from a database, which includes longitudinal data originating from more than 55 million patients with thousands of person-specific features. The data extracted from the database for the investigation originates from 522,416 people newly diagnosed with diabetes. The data is retrieved for the time window starting 2 years before the initial diagnosis of diabetes and lasting until up to 3 years following this diagnosis. People with prior renal dysfunctions are excluded in order to perform an unbiased risk assessment for the later development of CKD. Following the guidelines for the diagnosis of diabetes, it is requested that the concentration of the β-N-1-deoxyfructosyl component of hemoglobin (HbA1C), an important clinical laboratory parameter in diabetes diagnosis and treatment, was determined at least once prior to (or within 7 days after) the initial diagnosis of diabetes. The data selected from the database can be considered as RWD because no further restrictions on the completeness or veracity of the data are applied. In order to cope with the challenges arising from the use of RWD, the following approach may be implemented:

[0158] 1. The data selected from the database is randomly split into a teaching set (417,912 people) and a validation set (104,504 people).

[0159] 2. Features are selected on the basis of a data-driven correlation analysis within the teaching set and cross-checked for conceptual (especially medical) relevance.

[0160] 3. Missing values are imputed with the dataset's mean value. Optionally, a screening for outlier values is performed prior to teaching. If an outlier is determined, the value is substituted by an appropriate value: if the feature value is higher than the upper limit of the specific allowable range for that feature, the value can be replaced by the upper limit of that range before using it in the prediction formula; if the feature value is lower than the lower limit of the specific allowable range for that feature, the value can be replaced by the lower limit before using it in the prediction formula.

[0161] 4. The risk predictor is taught exclusively in this RWD's teaching set.

[0162] 5. After the teaching is completed, the validation set is subjected to the algorithm in order to assess the quality of the algorithm. No further readjustment of the algorithm is performed.

[0163] 6. In addition, RWD from 82,912 people represented in a further database is used as a further, independent validation set.
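Steps 1 and 3 of the approach above can be sketched in plain Python as follows. The function names, the fixed random seed and the order of operations (outlier replacement first, mean imputation second) are assumptions made for this illustration; the description leaves the order open.

```python
import random

def split_cohort(person_ids, n_teach, seed=0):
    """Randomly split a cohort into a teaching set and a validation set."""
    shuffled = list(person_ids)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_teach], shuffled[n_teach:]

def clip(value, lower, upper):
    """Replace an out-of-range value by the nearest limit of the allowable range."""
    return min(max(value, lower), upper)

def impute_and_clip(values, lower, upper):
    """Clip present values to [lower, upper], then fill None with their mean.

    Assumption: the mean is taken over the clipped, non-missing values.
    """
    clipped = [None if v is None else clip(v, lower, upper) for v in values]
    present = [v for v in clipped if v is not None]
    if not present:
        raise ValueError("cannot impute a feature without any observed value")
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in clipped]
```

For example, `impute_and_clip([1.0, None, 3.0, 50.0], 0.5, 5.0)` first limits 50.0 to the upper bound 5.0 and then fills the missing entry with the mean of 1.0, 3.0 and 5.0.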

[0164] Analysis of an example teaching training set (from the IBM Explorys database; see Kaelber, D. C. et al., Patient characteristics associated with venous thromboembolic events: a cohort study using pooled electronic health record data, J Am Med Inform Assoc 19, 965-972, 2012), validation set (from the IBM Explorys database) and further validation set (from the Indiana Network for Patient Care (INPC); see McDonald, C. J. et al., The Indiana Network for Patient Care: a working local health information infrastructure, Health Affairs 24, 1214-1220, 2005) has been conducted. Logistic regression has been applied in the teaching.

[0165] In the teaching set, the validation set and the further validation set, 50.7%, 50.9% and 51.7% of the persons, respectively, are female. The median age of the respective populations is 60 years, 60 years, and 59 years. The median concentrations of HbA1C are 6.8%, 6.8%, and 6.6%, respectively. The distributions of age and HbA1C are shown in FIGS. 1 and 2, respectively.

[0166] In certain embodiments, for feature selection, almost 300 features are initially chosen based on medical as well as data-driven criteria. This feature set is then culled in multiple steps. Observational features that are defined for less than half of the patients in the cohort are removed, as are outliers of continuous features. Categorical features with 99% of occurrences in a single category and continuous features with a standard deviation of 0.001% are not considered. Finally, only those features which already showed correlation with the diagnosis of CKD in a univariate analysis as quantified by Pearson's chi-squared coefficient χ.sup.2>0.95 are retained. For predictive analysis, a logistic regression model based on forward selection (see Bursac, Z. et al., Purposeful selection of variables in logistic regression, Source code for biology and medicine 3, 17, 2008; and Hosmer Jr., D. W. et al., Applied logistic regression, Vol. 398, John Wiley & Sons, 2013) is trained on the teaching set and delivers the person's age, body mass index, glomerular filtration rate and the concentrations of glucose, albumin, and creatinine as the most prominent parameters. An assessment of the medical relevance of these features may be performed to ensure clinical applicability, in contrast to a “black box” approach based on, for example, deep learning. HbA.sub.1C may be added to the top-7 feature list in order to reflect current state-of-the-art methods. The teaching of algorithms may be based on correlation, but may not infer any causality. After teaching, the algorithm is applied to the two independent datasets, namely the validation sets.

[0167] ICD codes may be used as target variables for training as well as the CKD reference diagnosis in the analysis of the validation results. The definition of the target feature “CKD” may be solely based on the occurrence of the respective ICD codes in the databases. In order to maintain the RWD character of the data set, no additions or changes may be made to the databases. Such ICD codes may comprise ICD-9 codes and ICD-10 codes, for example the following ICD codes: 250.40, 250.41, 250.42, 250.43, 585.1, 585.2, 585.3, 585.4, 585.5, 585.6, 585.9, 403.00, 403.01, 403.11, 403.90, 403.91, 404.0, 404.00, 404.01, 404.02, 404.03, 404.1, 404.10, 404.11, 404.12, 404.13, 404.9, 404.90, 404.91, 404.92, 404.93, 581.81, 581.9, 583.89, 588.9, E10.2, E10.21, E10.22, E10.29, E11.2, E11.21, E11.22, E11.29, N17.0, N17.1, N17.2, N17.8, N17.9, N18.1, N18.2, N18.3, N18.4, N18.5, N18.6, N18.9, N19, I12.0, I12.9, I13, I13.0, I13.1, I13.10, I13.11, I13.2, N04.9, N05.8, N08 and/or N25.9.

[0168] In an embodiment, the ICD-9 codes 250.40, 403.90, 585.3 and 585.9 are the most abundant diagnoses in the respective time windows of the data set, and they occur in >5% of the cases within each of the data sets.

[0169] In the following, experimental data are discussed.

[0170] The area under the receiver operating characteristic (compare Swets, J. A., Measuring the accuracy of diagnostic systems, Science 240, 1285-1293, 1988) curve (AUC) is frequently used to measure the quality of clinical markers as well as machine learning algorithms (see Bradley, A. P., The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition 30, 1145-1159, 1997). A perfect marker would achieve AUC=1.0, whereas flipping a coin would result in AUC=0.5. After teaching the model (based on Explorys) according to the present disclosure using the seven most promising features, the AUC of the prediction algorithm amounted to 0.7937 (0.790 . . . 0.797) when applied to the overall independent validation data (Explorys: 0.761, INPC: 0.831).
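As an illustration of the AUC measure referred to above, the following sketch computes it as the probability that a randomly chosen subject with the diagnosis receives a higher risk factor than a randomly chosen subject without it, counting ties as one half. This pairwise formulation is one standard way to obtain the area under the ROC curve; the function name `auc` and the label encoding (1 for CKD, 0 otherwise) are assumptions, and the quadratic loop is meant for small examples rather than for cohorts of the size discussed here.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))
```

A marker that ranks every subject with CKD above every subject without CKD yields 1.0, while a marker that assigns the same score to everyone yields 0.5, in line with the coin-flip reference value.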

[0171] The AUC increased to 0.7939 and 0.7967 if the top-10 and top-12 features were used for evaluation, respectively. In turn, a simple HbA1C model (see The Diabetes Control and Complications Trial Research Group. The effect of intensive treatment of diabetes on the development and progression of long-term complications in insulin-dependent diabetes mellitus, N Engl J Med 329, 977-986, 1993) yielded 0.483 (0.477 . . . 0.489) for the same datasets. The algorithm according to the present disclosure therefore outperforms risk predictors using HbA1C alone for people newly diagnosed with diabetes.

[0172] In further analysis, the algorithm according to the present disclosure was compared to published algorithms derived from data sourced from major clinical studies such as the ONTARGET, ORIGIN, RENAAL and ADVANCE studies (cf. Dunkler, D. et al., Risk Prediction for Early CKD in Type 2 Diabetes, Clin J Am Soc Nephrol 10, 1371-1379, 2015; Vergouwe, Y. et al., Progression to microalbuminuria in type 1 diabetes: development and validation of a prediction rule, Diabetologia 53, 254-262, 2010; Keane, W. F. et al., Risk Scores for Predicting Outcomes in Patients with Type 2 Diabetes and Nephropathy: The RENAAL Study, Clin J Am Soc Nephrol 1, 761-767, 2006; and Jardine, M. J. et al., Prediction of Kidney-Related Outcomes in Patients With Type 2 Diabetes, Am J Kidney Dis. 60, 770-778, 2012). As shown in FIG. 3, the algorithm according to the present disclosure outperformed each of these algorithms for all RWD cohorts. While this finding is important in terms of applicability and relevance in everyday settings, it may be argued that the validity of the published algorithms is limited to the inclusion and exclusion criteria of the corresponding clinical studies. Therefore, subcohorts of the IBM Explorys and INPC databases were formed according to the selection criteria of these studies, and the algorithm according to the present disclosure (without any retraining) was benchmarked against the literature algorithms solely for these subcohorts. Although the AUCs of the published algorithms increased for all specific subcohorts as expected, the superiority of the RWD-trained model according to the present disclosure prevailed (FIG. 4). However, the inclusion and exclusion criteria for the subcohorts could not be met precisely in all cases for the present RWD set because the clinical studies demanded some information which is not available in the database (e.g. waist-to-hip ratio). In addition, there were differences in the choice of the complication incidence time window. 
Nevertheless, the features that were prioritized for classification with the algorithm according to the present disclosure are similar to those reported in the literature, thus further bolstering the algorithm's validity.

[0173] The use of RWD and in particular the inclusion of incomplete or erroneous data in the training set for the algorithm according to the present disclosure constitutes a major difference compared to clinical study-based algorithms. The imputation of missing data provides a typical example of predictive analytics in RWD cohorts, whereas imputation would be inconceivable in a clinical study setting. To further elucidate the role of imputation, the algorithm according to the present disclosure was applied to RWD solely representing individuals providing a complete set of information (i.e. no imputation was necessary). In this case, the AUCs remained comparable to the previous values for the overall RWD set, that is 0.792 (0.787 . . . 0.797), 0.791 (0.780 . . . 0.801), and 0.809 (0.769 . . . 0.846) for the Explorys teaching training set, the Explorys validation set, and the INPC validation set, respectively. Further analysis revealed the rapid loss of classification accuracy with an increasing fraction of imputed data when the earlier algorithms were tested, whereas the algorithm according to the present disclosure achieved much higher stability, even for higher proportions of imputed data (FIG. 5). It is concluded that—at least in the present example—the teaching training of predictive analytics algorithms using RWD could achieve equivalent or even enhanced accuracy compared to clinical trial data, but further testing on additional datasets will be necessary before these conclusions can be generalised.

[0174] In summary, it is demonstrated that a predictive algorithm for CKD performed significantly better in individuals newly diagnosed with diabetes if trained on RWD rather than clinical study data. This statement held true when the algorithm according to the present disclosure was applied to the overall RWD cohort as well as specific subcohorts as defined by the corresponding clinical studies. The results support the path towards high-quality predictive models that can be applied in a clinical setting, enabling the shift towards personalized and outcome-based healthcare.

[0175] The performance of a method for screening a subject for the risk of CKD or for identifying those people at high risk of developing CKD may be judged according to sensitivity (fraction of correctly predicted high-risk patients) and specificity (fraction of correctly assigned low-risk patients). However, either of these numbers can be improved at the expense of the other simply by changing the threshold between high and low risk. Hence, data pairs of sensitivity and specificity may be illustrated in forms of the so-called receiver operating characteristic (ROC) curve (see Swets, J. A., Measuring the accuracy of diagnostic systems, Science 240, 1285-1293, 1988) in which the sensitivity is plotted as a function of 1-specificity (which corresponds to the fraction of falsely assigned high-risk persons). The ROC curve of the risk model according to the present disclosure is shown for the Explorys training set, the Explorys validation set and the INPC validation set in FIG. 6 together with the corresponding ROC curves for a model based solely on HbA1C.
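The trade-off between sensitivity and specificity can be made concrete with a short sketch that classifies subjects as high risk when their risk factor is at or above a threshold and then collects one ROC point per candidate threshold. The names `confusion` and `roc_points` and the convention "score >= threshold means high risk" are assumptions for this example.

```python
def confusion(scores, labels, threshold):
    """Count true/false positives and negatives for a given risk threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        high_risk = score >= threshold
        if high_risk and label == 1:
            tp += 1
        elif high_risk and label == 0:
            fp += 1
        elif label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def roc_points(scores, labels):
    """Return (threshold, 1 - specificity, sensitivity) for each distinct score."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion(scores, labels, threshold)
        sensitivity = tp / (tp + fn)
        one_minus_specificity = fp / (fp + tn)
        points.append((threshold, one_minus_specificity, sensitivity))
    return points
```

Lowering the threshold moves along the curve towards higher sensitivity and lower specificity, which is why a single pair of numbers does not characterize the method.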

[0176] For a perfect classifier, the ROC curve reaches the upper-left corner. Accordingly, the threshold corresponding to the data pair closest to this corner is dubbed the “optimal threshold”. When aiming for high sensitivity, an alternative threshold may be chosen to guarantee a sensitivity of, for example, 90%. The corresponding results are summarized in Table 2 together with the positive predictive value (PPV) and negative predictive value (NPV). Similar measures from the field of bioinformatics, namely accuracy and F-score (Van Rijsbergen, C. J., Information Retrieval, Butterworth-Heinemann Newton, Mass., USA, 1979), supplement the list of examples in Table 2.
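The two threshold choices and the derived measures can be sketched as follows. The function names are hypothetical, the F-score is taken as the harmonic mean of PPV and sensitivity, and the sketch assumes that both subjects with and without the diagnosis are present in the data.

```python
import math

def metrics(scores, labels, threshold):
    """Sensitivity, specificity, PPV, NPV, accuracy and F-score at a threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    tn = sum(1 for s, y in pairs if s < threshold and y == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "npv": tn / (tn + fn) if tn + fn else float("nan"),
        "accuracy": (tp + tn) / len(pairs),
        "f_score": 2 * tp / (2 * tp + fp + fn) if tp else 0.0,
    }

def optimal_threshold(scores, labels):
    """Threshold whose ROC point lies closest to the upper-left corner (0, 1)."""
    def distance(threshold):
        m = metrics(scores, labels, threshold)
        return math.hypot(1.0 - m["specificity"], 1.0 - m["sensitivity"])
    return min(sorted(set(scores)), key=distance)

def threshold_for_sensitivity(scores, labels, target=0.90):
    """Highest threshold that still reaches at least the target sensitivity."""
    for threshold in sorted(set(scores), reverse=True):
        if metrics(scores, labels, threshold)["sensitivity"] >= target:
            return threshold
    raise ValueError("target sensitivity cannot be reached")
```

Choosing the highest threshold that satisfies the sensitivity target keeps specificity as large as possible under that constraint.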

TABLE 2

Model              Cohort             sensitivity  specificity  PPV    NPV    acc.   F-measure
a) HbA1c           Explorys (teach)   53.5         55.1         11.7   91.4   55.0   19.2
                   Explorys (val)     54.4         55.2         11.9   91.6   55.1   19.5
                   INPC (val)         37.5         61.5         11.3   88.2   58.7   17.4
b) present model   Explorys (teach)   68.2         72.6         21.7   95.4   72.1   32.9
                   Explorys (val)     68.3         72.4         21.6   95.3   72.0   32.8
                   INPC (val)         79.3         71.2         26.6   96.3   72.2   39.8
c) present model*  Explorys (teach)   (90.0)       35.0         13.3   96.9   40.5   23.2
                   Explorys (val)     90.0         34.9         13.3   96.9   40.4   23.2
                   INPC (val)         95.3         27.6         14.7   97.8   35.5   25.5

[0177] A comparative evaluation of the full algorithm according to the present disclosure (seven values/sample levels for the subject, missing values/sample levels imputed) to a reduced algorithm according to the present disclosure (age, creatinine.sub.max and albumin.sub.min for the subject, population mean values for the remaining values/sample levels), respectively applied to INPC data, has resulted in an AUC of 0.831 (confidence interval 0.827 to 0.836) for the full algorithm and an AUC of 0.823 (confidence interval 0.818 to 0.827) for the reduced algorithm. Therefore, even with the reduced algorithm, useful predictions may be achieved.


CPC classification

G01N33/6893 (PHYSICS)
G01N2800/54 (PHYSICS)
Y02A90/10 (GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS)
G16B25/00 (PHYSICS)
G06N3/08 (PHYSICS)
G01N2800/347 (PHYSICS)
G16H50/30 (PHYSICS)

International classification

G16H50/30 (PHYSICS)
G06N3/08 (PHYSICS)
G16B25/00 (PHYSICS)
