METHOD AND DEVICES FOR AGE DETERMINATION

Abstract

The present invention relates to the determination of ages. Specifically, the present invention relates to a method for determining an age indicator, and a method for determining the age of an individual. Said methods are based on data comprising the DNA methylation levels of a set of genomic DNA sequences. Preferably, said age indicator is determined by applying on the data a regression method comprising a Least Absolute Shrinkage and Selection Operator (LASSO), preferably in combination with subsequent stepwise regression. Furthermore, the invention relates to an ensemble of genomic DNA sequences and a gene set, and their uses for diagnosing the health state and/or the fitness state of an individual and identifying a molecule which affects ageing. In further aspects, the invention relates to a chip or a kit, in particular which can be used for detecting the DNA methylation levels of said ensemble of genomic DNA sequences.

Claims

1. A method for determining an age indicator comprising the steps of (a) providing a training data set of a plurality of individuals comprising for each individual (i) the DNA methylation levels of a set of genomic DNA sequences and (ii) the chronological age, and (b) applying on the training data set a regression method comprising a Least Absolute Shrinkage and Selection Operator (LASSO), thereby determining the age indicator and a reduced training data set, wherein the independent variables are the methylation levels of the genomic DNA sequences and wherein the dependent variable is the age, wherein the age indicator comprises (i) a subset of the set of genomic DNA sequences as ensemble and (ii) at least one coefficient per genomic DNA sequence contained in the ensemble, and wherein the reduced training data set comprises all data of the training data set except the DNA methylation levels of the genomic DNA sequences which are eliminated by the LASSO.

2. A method for determining the age of an individual comprising the steps of (a) providing a training data set of a plurality of individuals comprising for each individual (i) the DNA methylation levels of a set of genomic DNA sequences and (ii) the chronological age, and (b) applying on the training data set a regression method comprising a Least Absolute Shrinkage and Selection Operator (LASSO), thereby determining the age indicator and a reduced training data set, wherein the independent variables are the methylation levels of the genomic DNA sequences and wherein the dependent variable is the age, wherein the age indicator comprises (i) a subset of the set of genomic DNA sequences as ensemble and (ii) at least one coefficient per genomic DNA sequence contained in the ensemble, and wherein the reduced training data set comprises all data of the training data set except the DNA methylation levels of the genomic DNA sequences which are eliminated by the LASSO, and (c) providing the DNA methylation levels of the individual for whom the age is to be determined of at least 80% or 100% of the genomic DNA sequences comprised in the age indicator, and (d) determining the age of the individual based on its DNA methylation levels and the age indicator, wherein the determined age can be different from the chronological age of the individual.

3. The method of claim 1, wherein the regression method further comprises applying a stepwise regression subsequently to the LASSO.

4. The method of claim 3, wherein the stepwise regression is applied on the reduced training data set.

5. The method of claim 1, wherein the ensemble comprised in the age indicator is smaller than the set of genomic DNA sequences.

6. The method of claim 1, wherein the ensemble comprised in the age indicator is smaller than the set of genomic DNA sequences comprised in the reduced training data set.

7. The method of claim 3, wherein the stepwise regression is a bidirectional elimination, wherein statistically insignificant independent variables are removed, wherein the significance level is 0.05.

8. The method of claim 1, wherein the LASSO is performed with the biglasso R package or by applying the command “cv.biglasso” or wherein the “nfold” is 20.

9. The method of claim 1, wherein the regression method does not comprise a Ridge regression (L2 regularization) or the L2 regularization parameter/lambda parameter is 0.

10. The method of claim 1, wherein the LASSO L1 regularization parameter/alpha parameter is 1.

11. The method of claim 1, wherein the age indicator is iteratively updated comprising adding the data of at least one further individual to the training data in each iteration, thereby iteratively expanding the training data set.

12. The method of claim 11, wherein in one updating round the added data of each further individual comprise the individual's DNA methylation levels of (i) at least 5% or 50% or 100% of the set of genomic DNA sequences comprised in the initial or any of the expanded training data sets, and/or (ii) the genomic DNA sequences contained in the reduced training data set.

13. The method of claim 11, wherein all genomic DNA sequences (independent variables) which are not present for all individuals who contribute data to the expanded training data set are removed from the expanded training data set.

14. The method of claim 11, wherein in one updating round the set of genomic DNA sequences whereof the methylation levels are added is identical for each of the further individual(s).

15. The method of claim 11, wherein one updating round comprises applying the LASSO on the expanded training data set, thereby determining an updated age indicator and/or an updated reduced training data set.

16. The method of claim 11, wherein the training data set to which the data of the at least one further individual are added is the reduced training data set, which can be the initial or any of the updated reduced training data sets.

17. The method of claim 16, wherein the reduced training data set is the previous reduced training data set in the iteration.

18. The method of claim 11, wherein one updating round comprises applying the stepwise regression on the reduced training data set thereby determining an updated age indicator.

19. The method of claim 1, wherein in one updating round, the data of at least one individual is removed from the training data set and/or the reduced training data set.

20. The method of claim 11, wherein the addition and/or removal of the data of an individual depends on at least one characteristic of the individual, wherein the characteristic is the ethnos, the sex, the chronological age, the domicile, the birth place, at least one disease and/or at least one life style factor, wherein the life style factor is selected from drug consumption, exposure to an environmental pollutant, shift work or stress.

21. The method of claim 1, wherein the quality of the age indicator is determined, wherein the determination of said quality comprises the steps of (a) providing a test data set of a plurality of individuals who have not contributed data to the training data set comprising for each said individual (i) the DNA methylation levels of the set of genomic DNA sequences comprised in the age indicator and (ii) the chronological age; and (b) determining the quality of the age indicator by statistical evaluation and/or evaluation of the domain boundaries, wherein the statistical evaluation comprises (i) determining the age of the individuals comprised in the test data set, (ii) correlating the determined age and the chronological age of said individual(s) and determining at least one statistical parameter describing this correlation, and (iii) judging if the statistical parameter(s) indicate(s) an acceptable quality of the age indicator or not or wherein the statistical parameter is selected from a coefficient of determination (R.sup.2) and a mean absolute error (MAE), wherein a R.sup.2 of greater than 0.50 or greater than 0.70 or greater than 0.90 or greater than 0.98 and/or a MAE of less than 6 years or less than 4 years or at most 1 year, indicates an acceptable quality, and wherein evaluation of the domain boundaries comprises (iv) determining the domain boundaries of the age indicator, wherein the domain boundaries are the minimum and maximum DNA methylation levels of each genomic DNA sequence comprised in the age indicator and wherein said minimum and maximum DNA methylation levels are found in the training data set which has been used for determining the age indicator, and (v) determining if the test data set exceeds the domain boundaries, wherein not exceeding the domain boundaries indicates an acceptable quality.

22. The method of claim 1, wherein the training data set and/or the test data set comprises at least 10 or at least 30 individuals or at least 200 individuals or wherein the training data set comprises at least 200 individuals and the test data set at least 30 individuals.

23. The method of claim 21, wherein the age indicator is updated when its quality is not acceptable.

24. The method of claim 11, wherein the age of the individual is determined based on its DNA methylation levels and the updated age indicator.

25. The method of claim 2, wherein the age of the individual is only determined with the age indicator when he/she has not contributed data to the training data set which is used for generating said age indicator.

26. The method of claim 1, wherein the age indicator is not further updated when the number of individuals comprised in the data has reached a predetermined value and/or a predetermined time has elapsed since a previous update.

27. The method of claim 1, wherein the set of genomic DNA sequences comprised in the training data set is preselected from genomic DNA sequences whereof the methylation level is associable with chronological age.

28. The method of claim 27, wherein, the preselected set comprises at least 400000 or at least 800000 genomic DNA sequences.

29. The method of claim 1, wherein the genomic DNA sequences comprised in the training data set are not overlapping with each other and/or only occur once per allele.

30. The method of claim 1, wherein the reduced training data set comprises at least 90 or at least 100 or at least 140 genomic DNA sequences.

31. The method of claim 1, wherein the reduced training data set comprises less than 5000 or less than 2000 or less than 500 or less than 350 or less than 300 genomic DNA sequences.

32. The method of claim 1, wherein the age indicator comprises at least 30 or at least 50 or at least 60 or at least 80 genomic DNA sequences.

33. The method of claim 1, wherein the age indicator comprises less than 300 or less than 150 or less than 110 or less than 100 or less than 90 genomic DNA sequences.

34. The method of claim 1, wherein the DNA methylation levels of the genomic DNA sequences of an individual are measured in a sample of biological material of said individual comprising said genomic DNA sequences.

35. The method of claim 34, wherein the sample comprises buccal cells.

36. The method of claim 34, further comprising a step of obtaining the sample, wherein the sample is obtained non-invasively.

37. The method of claim 34, wherein the DNA methylation levels are measured by methylation sequencing, bisulfite sequencing, a PCR method, high resolution melting analysis (HRM), methylation-sensitive single-nucleotide primer extension (Ms-SNuPE), methylation-sensitive single-strand conformation analysis, methyl-sensitive cut counting (MSCC), base-specific cleavage/MALDI-TOF, combined bisulfite restriction analysis (COBRA), methylated DNA immunoprecipitation (MeDIP), micro array-based methods, bead array-based methods, pyrosequencing and/or direct sequencing without bisulfite treatment (nanopore technology).

38. The method of claim 34, wherein the DNA methylation levels of genomic DNA sequences of an individual are measured by base-specific cleavage/MALDI-TOF and/or a PCR method or wherein base-specific cleavage/MALDI-TOF is the Agena technology and the PCR method is methylation specific PCR.

39. The method of claim 34, wherein the DNA methylation levels of the genomic DNA sequences comprised in the age indicator are determined in a sample of biological material comprising said genomic DNA sequences of the individual for whom the age is to be determined.

40-72. (canceled)

73. A data carrier comprising the age indicator obtained by the method of claim 2.

74. (canceled)

75. The method of claim 1, wherein the training data set, reduced training data set and/or added data further comprise at least one factor relating to a life-style or risk pattern associable with the individual(s).

76. The method of claim 75, wherein the factor is selected from drug consumption, environmental pollutants, shift work and stress.

77. The method of claim 75, wherein the training data set and/or the reduced training data set is restricted to sequences whereof the DNA methylation level and/or the activity/level of an encoded protein is associated with at least one of the life-style factors.

78. The method of claim 75, further comprising a step of determining at least one life-style factor which is associated with the difference between the determined and the chronological age of said individual.

79. A method of determination of an age indicator for an individual in a series of individuals, the determination being based on levels of methylation of genomic DNA sequences found in the individual, wherein based on methylation levels of an ensemble of genomic DNA sequences selected from a set of genomic DNA sequences having levels of methylation associable with an age of the individuals an age indicator for the individual is provided in a manner relying on a statistical evaluation of levels of methylation for genomic DNA sequences of the plurality of individuals, wherein the age indicator for the individual is provided in a manner relying on a statistical evaluation of levels of methylation for genomic DNA sequences of a plurality of individuals which is different from the plurality of individuals that was referred to for a preceding statistical evaluation used for the determination of the same age indicator of an individual preceding in the series, the difference of the pluralities of individuals being caused in that a plurality of individuals used for the first statistical evaluation is amended at least by inclusion of at least one additional preceding individual from the series, and wherein the age indicator for the individual is provided in a manner where the at least two different statistical evaluations of the two different pluralities of individuals result in a change of at least one coefficient used when calculating the age indicator from the methylation levels of an ensemble and/or result in levels of methylation of different genomic DNA sequences or CpG loci found being considered.

80. The method of age determination of an individual according to claim 79, based on the levels of methylation of genomic DNA sequences found in the individual, comprising providing a set of genomic DNA sequences from genomic DNA sequences having levels of methylation associable with an age of the individual; determining for a plurality of individuals levels of methylation for the genomic DNA sequences of the set; selecting from the set an ensemble of genomic DNA sequences such that the number of genomic DNA sequences in the ensemble is smaller than or equal to the number of genomic DNA sequences in the set, and ages of the individuals can be calculated based on the levels of methylation of the sequences of the ensemble; determining in a sample of biological material from the individual the levels of the methylation of at least the sequences of the ensemble; calculating an age of the individual based on levels of the methylation of the sequences of the ensemble; judging whether or not a re-selection of genomic DNA sequences of the ensemble is necessary and/or the way an age of the individual based on levels of the methylation is calculated is to be altered, or in view of a statistical assessment, depending on the judgment, amending the group of individuals to include the individual; and at least one of re-selecting an ensemble of genomic DNA sequences from the set based on determinations of the levels of the methylation of individuals of the amended group and/or changing of at least one coefficient used when calculating the age indicator from the methylation levels of an ensemble.

81. The method of age determination of an individual according to claim 80, comprising the steps of preselecting from genomic DNA sequences having levels of methylation associable with an age of the individual the set of genomic DNA sequences; determining for a plurality of individuals levels of methylation for the preselected genomic DNA sequences; selecting from the preselected set an ensemble of genomic DNA sequences such that the number of genomic DNA sequences in the ensemble is smaller than the number of genomic DNA sequences in the preselected set, ages of the individuals can be calculated based on the levels of methylation of the sequences of the ensemble, and a statistical evaluation of the ages calculated indicates an acceptable quality of the calculated ages; determining in a sample of biological material from the individual levels of the methylation of the sequences of the ensemble; calculating an age of the individual based on levels of the methylation of the sequences of the ensemble; calculating a statistical measure of the quality of the age calculated; judging whether or not the quality according to the statistical measure is acceptable or not; outputting the age of the individual calculated if the quality is judged to be acceptable; determining that a re-selection of genomic DNA sequences is necessary if the quality is judged to be not acceptable, amending the group of individuals to include the individual; re-selecting an ensemble of genomic DNA sequences from the preselected subset based on determinations of the levels of the methylation of individuals of the amended group.

82-91. (canceled)

92. A chip comprising a number of spots or less than 500 or less than 385 or less than 193 or less than 160 spots, adapted for use in determining methylation levels, the spots comprising at least one spot or several spots specifically adapted to be used in the determination of methylation levels of at least one of cg11330075, cg25845463, cg22519947, cg21807065, cg09001642, cg18815943, cg06335143, cg01636910, cg10501210, cg03324695, cg19432688, cg22540792, cg11176990, cg00097800, cg09805798, cg03526652, cg09460489, cg18737844, cg07802350, cg10522765, cg12548216, cg00876345, cg15761531, cg05990274, cg05972734, cg03680898, cg16593468, cg19301963, cg12732998, cg02536625, cg24088134, cg24319133, cg03388189, cg05106770, cg08686931, cg25606723, cg07782620, cg16781885, cg14231565, cg18339380, cg25642673, cg10240079, cg19851481, cg17665505, cg13333913, cg07291317, cg12238343, cg08478427, cg07625177, cg03230469, cg13154327, cg16456442, cg26430984, cg16867657, cg24724428, cg08194377, cg10543136, cg12650870, cg00087368, cg17760405, cg21628619, cg01820962, cg16999154, cg22444338, cg00831672, cg08044253, cg08960065, cg07529089, cg11607603, cg08097417, cg07955995, cg03473532, cg06186727, cg04733826, cg20425444, cg07513002, cg14305139, cg13759931, cg14756158, cg08662753, cg13206721, cg04287203, cg18768299, cg05812299, cg04028695, cg07120630, cg17343879, cg07766948, cg08856941, cg16950671, cg01520297, cg27540719, cg24954665, cg05211227, cg06831571, cg19112204, cg12804730, cg08224787, cg13973351, cg21165089, cg05087008, cg05396610, cg23677767, cg21962791, cg04320377, cg16245716, cg21460868, cg09275691, cg19215678, cg08118942, cg16322747, cg12333719, cg23128025, cg27173374, cg02032962, cg18506897, cg05292016, cg16673857, cg04875128, cg22101188, cg07381960, cg06279276, cg22077936, cg08457029, cg20576243, cg09965557, cg03741619, cg04525002, cg15008041, cg16465695, cg16677512, cg12658720, cg27394136, cg14681176, cg07494888, cg14911690, cg06161948, cg15609017, cg10321869, cg15743533, 
cg19702785, cg16267121, cg13460409, cg19810954, cg06945504, cg06153788, and cg20088545.

93. A chip according to claim 92, wherein the spots comprise at least 10 spots for CpG loci or 20 spots for CpG loci or at least 50 spots for CpG loci or spots for all of the CpG loci listed in claim 92.

Description

BRIEF DESCRIPTION OF THE FIGURES

[0529] FIG. 1. Performance of LASSO. A set of 148 cg sites was determined as optimal. Shown are four plots referring to the LASSO regression and its performance. In all four plots, a vertical dotted line represents the automatic threshold chosen for the number of variables selected. All plots report mean values plus range intervals produced by 20 cross-validation runs. The different axes show different model metrics according to the biglasso package. The two upper plots report the sums of cross-validated errors and the coefficient of determination (R.sup.2), while the two bottom plots report two particular parameters from the R implementation of LASSO regression: signal-to-noise ratio and <bs>. Details are in https://cran.rstudio.com/web/packages/biglasso/biglasso.pdf

[0530] FIG. 2. Performance of the age indicator obtained by LASSO and subsequent stepwise regression. Shown are the chronological age (actual age) and the determined age (predicted age) of 259 individuals of the training data set and 30 individuals of the test data set. No relevant or significant differences between training and test data set were observed. The shown coefficient of determination R.sup.2 is based on the training and test data merged.

[0531] FIG. 3. Correlations of representative CpG sites with the chronological age. Individuals of the merged training and test data were grouped based on their chronological age (>48 years, 25-48 years, and <25 years; “old”, “mid” and “young”, respectively). The distributions of DNA methylation levels (“value”) are shown for 8 representative CpG sites per age group. The genes comprising the CpG sites are annotated.

[0532] FIG. 4. Overlap of CG sites with the set of CG sites as described by Horvath in Genome Biology 2013, 14:R115. The Venn diagram reports the amount of overlap between the set of 148 genomic DNA sequences (CpGs) determined herein by applying LASSO (IME-Cerascreen) and the list of 353 CpGs reported by Horvath in Genome Biology 2013, 14:R115. See also FIG. 5.

[0533] FIG. 5. Overlap of CG sites determined herein by applying LASSO (IME-Cerascreen) and subsequent stepwise regression (IME_Cerascreen_8). Also shown is the overlap with the set of CG sites as described by Horvath in Genome Biology 2013, 14:R115. See also FIG. 4.

EXAMPLES

Example 1: Measuring CpG Methylation Levels of DNA from Biological Samples

[0534] For a very large number of approximately 850,000 CpGs, the respective methylation levels were measured in the following way:

[0535] Buccal cells were collected from a number of test persons with buccal swabs and genomic DNA was purified from the buccal cells using a QIAamp 96 DNA Swab BioRobot Kit (Qiagen, Hilden, Germany). The purified genomic DNA was treated with sodium bisulfite using the Zymo EZ DNA Methylation Kit (Zymo, Irvine, Calif., USA). This treatment converts unmethylated cytosines to uracil, while methylated cytosines remain unchanged.

[0536] All further steps were performed with components from the Infinium MethylationEPIC Kit (Illumina™, San Diego, Calif., USA) according to the manufacturer's instructions. In short, bisulfite-treated samples were denatured and neutralized to prepare them for amplification. The denatured DNA was then isothermally amplified in an overnight step and enzymatically fragmented. Fragmented DNA was precipitated with isopropanol, collected by centrifugation at 4° C. and resuspended in hybridization buffer. The fragmented, resuspended DNA samples were then dispensed onto an Infinium MethylationEPIC BeadChip (Illumina™) and the BeadChip was incubated overnight in the Illumina™ Hybridization Oven to hybridize the samples onto the BeadChip by annealing the fragments to locus-specific 50mers that are covalently linked to the beads.

[0537] Unhybridized and nonspecifically hybridized DNA was washed away and the BeadChip was prepared for staining and extension in a capillary flow-through chamber. Single-base extension of the oligos on the BeadChip, using the captured DNA as a template, incorporates fluorescent labels on the BeadChip and thereby determines the methylation level of the query CpG sites. The BeadChip was scanned with the iScan System, using a laser to excite the fluorophore of the single-base extension product on the beads and recording high resolution images of the light emitted from the fluorophores. The data was analyzed using the GenomeStudio Methylation Module (Illumina™), which allows the calculation of beta-values for each analyzed CpG.

[0538] With this procedure, the methylation levels of more than 850,000 different Illumina™ defined CpGs were measured per sample and person, and a numerical value was provided for the methylation level of each of these CpGs. This was done for a large number of samples, each sample from a different individual. The numerical values have been normalized such that 0 corresponds to the minimum methylation possible for a CpG and 1 corresponds to the maximum methylation for the CpG. Of note, 1 also corresponds to 100% or full methylation.
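As an illustration of how a value on this 0-to-1 scale is commonly derived from the two probe intensities of a CpG, the following Python sketch uses the widely used beta-value definition M / (M + U + offset) with an offset of 100. The function name and the example intensities are hypothetical, and the exact computation performed by the GenomeStudio software is not stated in this document.

```python
def beta_value(methylated: float, unmethylated: float, offset: float = 100.0) -> float:
    """Return a methylation level in [0, 1) from two probe intensities.

    methylated / unmethylated are the signal intensities of the probes for
    the methylated and unmethylated allele; offset stabilises the ratio when
    both intensities are low (100 is a commonly used default, assumed here).
    """
    if methylated < 0 or unmethylated < 0:
        raise ValueError("intensities must be non-negative")
    return methylated / (methylated + unmethylated + offset)


# Hypothetical intensities for one CpG of one sample.
print(beta_value(9000.0, 900.0))
```

A fully unmethylated CpG gives a value of 0, while a fully methylated CpG approaches but does not exactly reach 1 because of the offset.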

Example 2: Measuring CpG Methylation Levels by Base-Specific Cleavage/MALDI-TOF (Agena)

[0539] To determine methylation levels of a pre-selected set of several hundred different CpGs, the EpiTYPER DNA Methylation Analysis Kit from Agena Bioscience (San Diego, Calif., USA) was used. In the example, 384 methylation levels of 384 different CpGs have been determined.

[0540] Again, buccal cells were collected from a number of persons with buccal swabs and genomic DNA was purified from the buccal cells using a QIAamp 96 DNA Swab BioRobot Kit (Qiagen, Hilden, Germany). The purified genomic DNA was treated with sodium bisulfite using the Zymo EZ DNA Methylation Kit (Zymo, Irvine, Calif., USA). This treatment converts unmethylated cytosines to uracil, while methylated cytosines remain unchanged.

[0541] Subsequently, the target regions containing the CpGs of interest were amplified by PCR using one specific primer pair per target region, each pair containing a T7-promoter-tagged reverse primer.

[0542] The PCR products were then treated with shrimp alkaline phosphatase to remove the unreacted nucleotides from the sample and in vitro transcribed using T7 RNA polymerase. The resulting RNA transcripts were specifically cleaved at uracil residues and dispensed onto a SpectroCHIP Array. This chip was placed into a MALDI-TOF mass spectrometer for data acquisition and the resulting data was analyzed with EpiTYPER software.

[0543] From the results, a numerical value for the methylation level of each of the 384 different CpGs was provided. The numerical value was again normalized such that 0 corresponds to the minimum methylation possible for a CpG and 1 (100%) corresponds to the maximum methylation for the CpG.

[0544] While the method of Example 2 determines the methylation levels of only 384 different genomic DNA sequences, compared to the approximately 850,000 different genomic DNA sequences of Example 1, it is noted that the cost of an analysis according to Example 2 is significantly lower, amounting to less than 1/5 of the costs at the time of application.

Example 3: Measuring CpG Methylation Levels by Methylation Specific PCR (msPCR)

[0545] To determine methylation levels of a pre-selected set of 192 different CpGs, real-time quantitative methylation specific PCR (msPCR) was performed in the following manner:

[0546] For each of the 192 CpG-containing target regions to be analyzed, a specific set of three oligonucleotides was designed, containing one forward primer and two reverse primers. The two reverse primers were designed such that the first has a G at the 3′ end that is complementary to the methylated, unchanged C, while the second reverse primer has an A at the 3′ end that is complementary to the converted uracil.

[0547] Then, buccal cells were collected from a number of persons with buccal swabs and genomic DNA was purified from the buccal cells using a QIAamp 96 DNA Swab BioRobot Kit (Qiagen, Hilden, Germany). The purified genomic DNA was treated with sodium bisulfite using the Zymo EZ DNA Methylation Kit (Zymo, Irvine, Calif., USA). This treatment converts unmethylated cytosines to uracil, while methylated cytosines remain unchanged.

[0548] To determine methylation levels of CpGs contained in the sample, for each set of three oligonucleotides two PCR reactions were initiated, the first PCR reaction using the forward and the first of the two reverse primers, the second PCR reaction using the forward and the second of the two reverse primers. The methylation level of each CpG was determined, using real-time quantitative msPCR with TaqMan probes specific for each amplified target region.

[0549] From the results, a numerical value for the methylation level of each of the 192 different CpGs was provided. The numerical value was again normalized such that 0 corresponds to the minimum methylation possible for a CpG and 1 (100%) corresponds to the maximum methylation for the CpG.

[0550] While the number of different genomic DNA sequences is lower than in the method of Example 2, the method is extremely competitive with respect to costs.

Example 4: Generation of an Age Predictor Using LASSO

[0551] DNA methylation levels of 289 individuals (259 for the training data set and 30 for the test data set) have been determined as described in Example 1 unless noted differently. In brief, the DNA methylation levels of 850000 different genomic DNA sequences have been determined from buccal swab samples using the Infinium MethylationEPIC BeadChip (Illumina™). The methylation levels were normalized as beta values using program R v3.4.2, and thus could have a value between 0 and 1. The data set, i.e. the training data set, was a data matrix with a structure as in Table 1.

TABLE 1

ID               Chronological age   CG1     CG2     . . .   CG850000
Individual 1     28                  0.2     1.0     . . .   0.1
Individual 2     8                   . . .   . . .   . . .   . . .
. . .            . . .               . . .   . . .   . . .   . . .
Individual 259   65                  . . .   . . .   . . .   . . .

[0552] Using the statistical software R v3.4.1 and the biglasso package, a LASSO regression was performed using the command

[0553] cvfit <- cv.biglasso(Vars800bm, Age, seed=2401, nfolds=20),

wherein Vars800bm is the training data set, which corresponds to an exemplary matrix as shown in Table 1, wherein the cg sites are the independent variables and the age is the dependent variable to be modeled; seed is a number used by the random number generator; and nfolds is the number of cross-validation folds with which the model is built. The value 20 was used for cross-validation. The biglasso package is described in: “The biglasso Package: A Memory- and Computation-Efficient Solver for LASSO Model Fitting with Big Data in R” by Yaohui Zeng and Patrick Breheny, arXiv:1701.05936v2 [stat.CO], 11 Mar. 2018.
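To illustrate why a LASSO fit both estimates coefficients and eliminates variables, the following Python sketch implements a plain coordinate-descent LASSO with soft thresholding on a tiny, hypothetical data set. It is a teaching sketch only: the function names, the toy data and the fixed penalty value are assumptions, the features are centered but not standardized, and it does not reproduce the internals of the biglasso package or the cross-validated choice of the penalty performed by cv.biglasso.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n)*sum((y - b0 - X@b)^2) + lam*sum(|b|) by coordinate descent.

    X is a list of rows (individuals) of methylation levels, y the ages.
    Returns (intercept, coefficients).
    """
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Center columns and response so the intercept can be recovered at the end.
    cols = [[X[i][j] - col_means[j] for i in range(n)] for j in range(p)]
    resid = [yi - y_mean for yi in y]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            col = cols[j]
            scale = sum(c * c for c in col) / n
            if scale == 0.0:
                continue  # constant column carries no information
            # Correlation of column j with the residual that ignores column j.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            new_beta = soft_threshold(rho, lam) / scale
            delta = new_beta - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for c, r in zip(col, resid)]
                beta[j] = new_beta
    intercept = y_mean - sum(m * b for m, b in zip(col_means, beta))
    return intercept, beta


# Hypothetical toy data: age depends on the first site only (age = 20 + 50 * site1).
X = [[0.1, 0.5], [0.2, 0.4], [0.3, 0.6], [0.4, 0.5], [0.5, 0.4], [0.6, 0.6]]
y = [25.0, 30.0, 35.0, 40.0, 45.0, 50.0]
intercept, beta = lasso_coordinate_descent(X, y, lam=0.5)
print(round(intercept, 3), [round(b, 3) for b in beta])
```

With this toy input the coefficient of the second, uninformative site is set to exactly zero, which corresponds to a genomic DNA sequence being eliminated from the reduced training data set, while the coefficient of the first site is shrunk below its unpenalized value of 50.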

[0554] The formula of the obtained model (age indicator) upon LASSO regression was:

Age=+53.9126*cg27320127+43.1588*cg16267121+31.5464*cg00831672+30.4384*cg27173374+26.5197*cg16867657+20.9302*cg14681176+19.0975*cg25606723+16.8674*cg11607603+16.6092*cg08097417+15.0595*cg11330075+14.5786*cg12333719+14.1955*cg10543136+13.6743*cg21807065+12.4988*cg19851481+12.1954*cg08224787+11.7822*cg19702785+11.7706*cg13759931+11.6845*cg19112204+11.4521*cg07955995+10.869*cg18815943+10.829*cg24724428+10.7537*cg22101188+10.4571*cg19215678+9.551*cg22519947+9.5225*cg06161948+9.3932*cg16677512+9.2647*cg05396610+8.9059*cg21628619+8.7864*cg15609017+8.6846*cg24954665+8.5015*cg25642673+8.284*cg07802350+7.9408*cg05087008+7.8335*cg12548216+7.7144*cg09965557+7.6203*cg16999154+7.6057*cg12238343+7.5126*cg08044253+7.0673*cg16465695+6.939*cg13206721+6.6733*cg09001642+6.1215*cg11176990+6.0675*cg07625177+6.0657*cg05292016+5.9961*cg16593468+5.9511*cg07291317+5.5409*cg18506897+5.4739*cg07120630+5.2279*cg08662753+5.1938*cg24088134+5.1655*cg00097800+4.8623*cg16950671+4.6431*cg16245716+4.6364*cg06279276+4.6224*cg08686931+4.1089*cg27540719+4.0082*cg07529089+3.9294*cg06945504+3.8147*cg23677767+3.7304*cg07766948+3.7296*cg00876345+3.541*cg05972734+3.5305*cg22540792+3.4169*cg08118942+3.1845*cg02032962+3.1329*cg09460489+3.0723*cg22444338+3.0498*cg08856941+2.8317*cg03741619+2.7707*cg03230469+2.6979*cg06153788+2.6678*cg10522765+2.6533*cg14911690+2.5934*cg06186727+2.5488*cg03526652+2.5152*cg01520297+2.4409*cg09805798+2.3836*cg07513002+2.3539*cg08960065+2.3285*cg06335143+2.3044*cg16673857+2.2379*cg05990274+2.0254*cg04525002+1.9303*cg13154327+1.8016*cg07494888+1.7889*cg03388189+1.7543*cg08478427+1.7476*cg18768299+1.6312*cg21165089+1.6196*cg17665505+1.613*cg13460409+1.5347*cg14305139+1.4346*cg12804730+1.2032*cg04875128+1.2025*cg05211227+1.1767*cg18737844+1.1712*cg21460868+1.15*cg26430984+1.135*cg10321869+1.0067*cg14756158+1.0021*cg16322747+0.9948*cg17343879+0.9605*cg22077936+0.7994*cg18339380+0.5436*cg00087368+0.3003*cg05812299+0.281*cg12732998+0.0507*cg16456442+0.0277*cg17760405+0.0165*cg12658720−0.2038*cg08457029−0.4098*cg21962791−0.4232*cg15761531−0.4506*cg19810954−0.4626*cg20425444−0.5866*cg23128025−0.6731*cg25845463−0.6945*cg03324695−1.0445*cg01636910−1.4555*cg12650870−1.8012*cg01820962−2.2813*cg07782620−2.4468*cg04320377−2.6024*cg09275691−2.6286*cg15008041−2.7124*cg20576243−3.4046*cg13973351−3.5199*cg08194377−3.5713*cg07381960−4.0608*cg10240079−4.2758*cg14231565−4.8117*cg24319133−4.8449*cg03680898−5.694*cg19301963−6.83*cg03473532−7.515*cg13333913−8.0702*cg05106770−8.3397*cg04287203−9.4713*cg27394136−9.4931*cg10501210−10.8424*cg19432688−12.9786*cg02536625−13.2229*cg04028695−14.2271*cg16781885−14.728*cg15743533−14.9252*cg04733826−15.7917*cg20088545−16.5954*cg06831571−367.4866.

[0555] This age indicator comprised 148 terms such as +16.6092*cg08097417, wherein a positive sign indicated that the methylation level positively correlated with age, and a negative sign indicated that the methylation level negatively correlated with age. A numbered cg refers to a genomic DNA sequence according to the Infinium MethylationEPIC BeadChip; and the absolute value of the coefficient with which the cg is multiplied indicates the importance of this cg.
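The way such an age indicator is applied to one individual can be sketched in a few lines of R. This is a minimal illustration, not the inventors' code: the function name predict_age and the methylation levels are hypothetical, and only three of the 148 terms of the formula above are used, so the printed number is not a meaningful age.

```r
# Evaluate a linear age indicator:
#   age = intercept + sum(coefficient_i * methylation_level_i)
# `coefs`: named numeric vector, names are cg identifiers
# `meth`:  named numeric vector of methylation levels (0 to 1) of one individual
predict_age <- function(coefs, intercept, meth) {
  absent <- setdiff(names(coefs), names(meth))
  if (length(absent) > 0) {
    stop("missing methylation levels for: ", paste(absent, collapse = ", "))
  }
  # index `meth` by name so the order of the two vectors does not matter
  intercept + sum(coefs * meth[names(coefs)])
}

# Three terms copied from the formula above (illustration only).
coefs <- c(cg27320127 = 53.9126, cg16267121 = 43.1588, cg06831571 = -16.5954)

# Hypothetical methylation levels, deliberately given in a different order.
meth <- c(cg16267121 = 0.40, cg06831571 = 0.30, cg27320127 = 0.50)

# Partial sum of three terms with the intercept set to 0 (assumption).
partial_sum <- predict_age(coefs, intercept = 0, meth = meth)
print(partial_sum)
```

With the full coefficient vector and the intercept of −367.4866, the same call would return the determined age.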

[0556] Various model performance checks confirmed that the selection of 148 cg sites was optimal (FIG. 1).

[0557] This age indicator had the following performance: R.sup.2=0.72, variables selected=148 (nonzero coefficients), wherein R.sup.2 is the coefficient of determination. The statistics were determined with an independent test data set consisting of the data of 30 individuals (about 10%), who were different from the 259 (289−30) individuals used for the training data set but were drawn from the same population as said 289 individuals.
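For orientation, the coefficient of determination on a held-out test set can be computed as follows. This sketch assumes the definition 1 − SS_res/SS_tot (the text does not state which definition was used), and the ages are hypothetical values, not data from the study.

```r
# Coefficient of determination, assuming R^2 = 1 - SS_res / SS_tot.
r_squared <- function(observed, predicted) {
  ss_res <- sum((observed - predicted)^2)        # residual sum of squares
  ss_tot <- sum((observed - mean(observed))^2)   # total sum of squares
  1 - ss_res / ss_tot
}

# Hypothetical chronological and predicted ages of four test individuals.
chron_age <- c(25, 40, 55, 70)
pred_age  <- c(27, 38, 57, 69)

r2 <- r_squared(chron_age, pred_age)
print(r2)
```

Note that some R functions report the squared correlation between observed and predicted values instead; the two quantities agree for a least-squares fit evaluated on its own training data but can differ on a test set.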

[0558] Furthermore, LASSO has been applied on the data of 64 or 150 individuals from the 289 individuals (Table 2).

TABLE 2
Size of training data set | Number of selected variables in age indicator | Performance of age indicator with test data set (R.sup.2)
64 | 30 | 0.39
150 | 105 | 0.6
259 | 148 | 0.72

[0559] This suggested that the performance of the LASSO increased when data of further individuals were iteratively added to the data set and the age indicator was iteratively updated.

Example 5: Generation of an Age Predictor Using LASSO and Subsequent Stepwise Regression

[0560] Stepwise regression was applied on a reduced training data set obtained after performing LASSO (Example 4) to distill the best significant set of cg sites/CpGs and thereby optimize the model. The reduced training data set (IME_blasso[,−1]) was the same as the training data set used in Example 4 except that it retained only the 148 columns relating to the 148 cg sites selected by LASSO.

[0561] Stepwise regression was performed using the statistical software R v3.4.1 and the following command:

model_blasso<-step(lm(Age~., data=IME_blasso[,-1]), direction="both"), wherein the direction of the stepwise search was "both", meaning that both adding and removing variables was allowed.
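The behaviour of step() with direction "both" can be tried on a small simulated data set. The following is a sketch under stated assumptions: the column names cg_toy1 to cg_toy6 and all values are invented, and only two of the six simulated sites actually influence the simulated age.

```r
set.seed(1)
n <- 100

# Six simulated "methylation levels" between 0 and 1 (hypothetical data).
d <- data.frame(matrix(runif(n * 6), nrow = n))
colnames(d) <- paste0("cg_toy", 1:6)

# Simulated age depends on the first two sites only, plus noise.
d$Age <- 20 + 40 * d$cg_toy1 - 30 * d$cg_toy2 + rnorm(n, sd = 2)

# Start from the full linear model and let step() add or remove terms by AIC.
full    <- lm(Age ~ ., data = d)
reduced <- step(full, direction = "both", trace = 0)

# Names of the variables retained by the stepwise search.
kept <- setdiff(names(coef(reduced)), "(Intercept)")
print(kept)
```

Because step() selects by AIC, an uninformative variable may occasionally survive; the informative ones are retained because removing them would worsen the fit considerably.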

[0562] The formula of the obtained model (age indicator) upon LASSO regression and subsequent stepwise regression was:

Age=+66.2822*cg11330075+65.203*cg00831672+55.7265*cg27320127+44.4116*cg27173374+38.3902*cg14681176+37.8069*cg06161948+36.6564*cg08224787+31.9397*cg05396610+30.1919*cg15609017+28.089*cg09805798+27.9392*cg19215678+27.8502*cg12333719+27.226*cg03741619+27.0323*cg16677512+25.9599*cg03230469+25.3932*cg19851481+24.5374*cg10543136+22.5525*cg07291317+21.8666*cg26430984+20.3621*cg16950671+20.3269*cg16867657+19.7973*cg22077936+18.7137*cg08044253+18.2047*cg12548216+18.1936*cg05211227+18.0812*cg13759931+17.6857*cg08686931+17.5303*cg07955995+16.1143*cg07529089+14.8703*cg01520297+14.6684*cg00087368+14.4397*cg05087008+14.4361*cg24724428+14.3055*cg19112204+14.2968*cg04525002+14.2302*cg08856941+13.3831*cg16465695+11.8127*cg08097417+11.7798*cg21628619+11.3523*cg09460489+11.2461*cg13460409+10.6268*cg25642673+10.4347*cg19702785+9.7844*cg18506897+9.5931*cg21165089+9.093*cg27540719+8.9361*cg21807065+8.8577*cg18815943+8.6138*cg23677767+7.1699*cg07802350+7.0528*cg11176990+6.5416*cg10321869+6.5049*cg17343879+5.8296*cg08662753+5.696*cg14911690+3.2983*cg12804730+3.1388*cg16322747−4.8653*cg14231565−5.5608*cg10501210−6.047*cg09275691−6.35*cg15008041−9.1942*cg05812299−9.3144*cg24319133−9.4566*cg12658720−9.8704*cg20576243−10.4082*cg03473532−10.6429*cg07381960−11.1592*cg05106770−12.0021*cg04320377−12.3296*cg19432688−12.9858*cg22519947−13.7116*cg06831571−13.8029*cg08194377−13.8668*cg01636910−14.6975*cg14305139−15.0408*cg04028695−16.3295*cg15743533−16.3314*cg03680898−18.6196*cg20088545−19.0952*cg13333913−19.3068*cg19301963−21.5752*cg13973351−23.0892*cg16781885−26.0415*cg04287203−32.3606*cg27394136−48.0918*cg10240079−50.0227*cg02536625−63.4434*cg23128025−519.3495.

[0563] The meaning of the terms and statistics is as explained in Example 4. Further details on the cg sequences and the coefficients can be found in Table 6.

[0564] Thus, the number of variables selected was further reduced upon applying the stepwise regression. In fact, the age indicator contained only 88 genomic DNA sequences (cg sites/CpGs).

[0565] Moreover, the performance of the age indicator obtained by LASSO and subsequent stepwise regression was: [0566] R.sup.2=0.9884 with the training data set; and R.sup.2=0.9929 with the test data set (containing the data of 30 test individuals as explained in Example 4). Thus, the performance was enhanced over the age indicator obtained by LASSO without stepwise regression.

[0567] The performance on the test data was as good as on the training data set which suggests that the age indicator has an outstanding performance (FIG. 2). Moreover, such a high coefficient of determination value indicates a significant improvement over prior art age indicators.

[0568] By grouping individuals (training and test data sets merged) based on their chronological age, it could be confirmed that the methylation level of representative cg sites selected by the regression analysis correlated well with the age groups (FIG. 3).

[0569] The age indicator and its determination were then compared to the age indicator of Horvath, Genome Biology 2013, 14:R115 in Table 3:

TABLE 3
Characteristics | Horvath, Genome Biology 2013, 14:R115 | Present invention
Sample | Various cell types | Buccal swabs
Starting number of cg sites/Illumina™ chip | 450000 | 850000
Algorithm | Elastic net | LASSO + stepwise regression
No. of cg sites used in model | 353 | 88
No. of cross-validation runs | Unknown | 20
Coefficient of determination (R.sup.2) | 0.83 (buccal epithelium); 0.83 (saliva) | 0.996
Median absolute deviation (years) | 0.8 (buccal epithelium); 2.7 (saliva) | 1.0
p-value of coefficients | Unknown | p < 0.05

[0570] This confirmed that the age indicator obtained by LASSO+stepwise regression performed at least as well as a relevant prior art age indicator, or even better, despite having only about 25% of the number of genomic DNA sequences (independent variables), i.e. 88 instead of 353.

[0571] The small set of genomic DNA sequences comprised in the age indicator allows the use of alternative, i.e. simpler, methods (see Examples 2 and 3) to determine the DNA methylation levels of individuals for whom the age is to be determined.

[0572] Moreover, the set of cg sites determined by LASSO alone or by LASSO+subsequent stepwise regression had very little overlap with the cg sites determined in Horvath, Genome Biology 2013, 14:R115 (FIGS. 4 and 5).

Example 6: Determination of Gene Sets from the Sets of Cg Sites/CpGs

[0573] The list of cg sites determined by applying LASSO (Example 4) or LASSO+stepwise regression (Example 5) was filtered for those cg sites which were fully contained within a gene. In a first list (Table 4), 106 (partially redundant) coding sequences and non-coding sequences such as miRNAs or long non-coding RNAs were selected based on the 148 CpGs determined by LASSO:

TABLE 4
Illumina ID | UCSC_RefGene_Accession | Name of first accession No.
cg00087368 | NM_005068 | SIM bHLH transcription factor 1 (SIM1)
cg12548216 | NM_030885; NM_001134364; NM_002375 | microtubule associated protein 4 (MAP4)
cg25845463 | NM_001033582; NM_002744; NM_001033581 | protein kinase C zeta (PRKCZ)
cg05087008 | NM_001077243; NM_001112812; NM_001077244; NM_000829 | glutamate ionotropic receptor AMPA type subunit 4 (GRIA4)
cg05396610 | NR_046356; NM_001077243; NM_000829 | glutamate ionotropic receptor AMPA type subunit 4 (GRIA4)
cg01636910 | NM_003921 | BCL10, immune signaling adaptor (BCL10)
cg01820962 | NM_152729 | 5′-nucleotidase domain containing 1 (NT5DC1)
cg07529089 | NM_018412; NM_021908 | suppression of tumorigenicity 7 (ST7)
cg02032962 | NM_006255 | protein kinase C eta (PRKCH)
cg03230469 | NM_000514 | glial cell derived neurotrophic factor (GDNF)
cg03473532 | NM_001145354 | muskelin 1 (MKLN1)
cg03526652 | NM_015189 | exocyst complex component 6B (EXOC6B)
cg03680898 | NM_000313 | protein S (PROS1)
cg05990274 | NM_000720; NM_001128840; NM_001128839 | calcium voltage-gated channel subunit alpha1 D (CACNA1D)
cg04320377 | NM_020782 | kelch like family member 42 (KLHL42)
cg04875128 | NM_130901 | OTU deubiquitinase 7A (OTUD7A)
cg17665505 | NM_004394 | death associated protein (DAP)
cg05211227 | NM_001195637 | coiled-coil domain containing 179 (CCDC179)
cg05292016 | NM_000793; NR_038355 | iodothyronine deiodinase 2 (DIO2)
cg03741619 | NM_145068 | transient receptor potential cation channel subfamily V member 3 (TRPV3)
cg05812299 | NM_001190478 | MT-RNR2 like 5 (MTRNR2L5)
cg05972734 | NM_001164319; NM_001164318; NM_001457; NM_001164317 | filamin B (FLNB)
cg07381960 | NM_002569; NM_001289823 | furin, paired basic amino acid cleaving enzyme (FURIN)
cg06153788 | NR_104238; NR_104235; NR_104237; NR_104236; NM_006358; NM_001282727; NM_001282726 | solute carrier family 25 member 17 (SLC25A17)
cg06161948 | NM_018025 | G-patch domain containing 1 (GPATCH1)
cg06279276 | NM_033309 | UDP-GlcNAc:betaGal beta-1,3-N-acetylglucosaminyltransferase 9 (B3GNT9)
cg06335143 | NM_001004339 | zyg-11 family member A, cell cycle regulator (ZYG11A)
cg06945504 | NM_001184776; NM_001184775; NM_001184774; NM_001184777; NM_001184773; NM_021115 | seizure related 6 homolog like (SEZ6L)
cg07291317 | NM_012334 | myosin X (MYO10)
cg16677512 | NM_198838; NM_198837; NM_198839; NM_198836; NM_198834 | acetyl-CoA carboxylase alpha (ACACA)
cg08044253 | NM_002069; NM_001256414 | G protein subunit alpha i1 (GNAI1)
cg07766948 | NM_024040 | CUE domain containing 2 (CUEDC2)
cg07802350 | NM_000523 | homeobox D13 (HOXD13)
cg07955995 | NM_138693 | Kruppel like factor 14 (KLF14)
cg19112204 | NM_004171 | solute carrier family 1 member 2 (SLC1A2)
cg08097417 | NM_138693 | Kruppel like factor 14 (KLF14)
cg08118942 | NM_023928 | acetoacetyl-CoA synthetase (AACS)
cg08194377 | NM_015245 | ankyrin repeat and sterile alpha motif domain containing 1A (ANKS1A)
cg08478427 | NR_106988; NM_001145522; NM_001145521; NM_001145520; NM_015577; NM_001145523; NM_001145525 | microRNA 7641-2 (MIR7641-2)
cg08662753 | NM_001278074; NM_000093 | collagen type V alpha 1 chain (COL5A1)
cg08856941 | NM_020682 | arsenite methyltransferase (AS3MT)
cg08960065 | NM_206885; NM_206884; NM_206883; NR_120443; NR_120442; NR_120441; NM_001167962; NM_198999 | solute carrier family 26 member 5 (SLC26A5)
cg09275691 | NM_020401; NR_038930 | nucleoporin 107 (NUP107)
cg09805798 | NR_110265; NR_110264 | long intergenic non-protein coding RNA 1797 (LINC01797)
cg09965557 | NM_001080779; NM_03375; NM_001080950 | myosin IC (MYO1C)
cg10240079 | NM_181726 | ankyrin repeat domain 37 (ANKRD37)
cg17861230 | NM_000923 | phosphodiesterase 4C (PDE4C)
cg10543136 | NM_018100 | EF-hand domain containing 1 (EFHC1)
cg11176990 | NR_028386; NM_001145451 | uncharacterized LOC375196 (LOC375196)
cg16867657 | NM_017770 | ELOVL fatty acid elongase 2 (ELOVL2)
cg12333719 | NM_006646; NM_001291965 | WAS protein family member 3 (WASF3)
cg24724428 | NM_017770 | ELOVL fatty acid elongase 2 (ELOVL2)
cg12658720 | NM_203425 | chromosome 17 open reading frame 82 (C17orf82)
cg13206721 | NM_020752; NR_027333 | G protein-coupled receptor 158 (GPR158)
cg13333913 | NM_012304; NM_001278317 | F-box and leucine rich repeat protein 7 (FBXL7)
cg13460409 | NM_018962 | ripply transcriptional repressor 3 (RIPPLY3)
cg13973351 | NM_017966 | VPS37C subunit of ESCRT-I (VPS37C)
cg14231565 | NM_001034845 | polypeptide N-acetylgalactosaminyltransferase like 6 (GALNTL6)
cg14305139 | NM_014957 | DENN domain containing 3 (DENND3)
cg19215678 | NM_006312; NM_001077261 | nuclear receptor corepressor 2 (NCOR2)
cg00097800 | NM_001430 | endothelial PAS domain protein 1 (EPAS1)
cg14911690 | NM_025245 | PBX homeobox 4 (PBX4)
cg15609017 | NR_040046 | long intergenic non-protein coding RNA 1531 (LINC01531)
cg15743533 | NM_207121; NM_001042353 | family with sequence similarity 110 member A (FAM110A)
cg15761531 | NM_001010983; NM_152932; NM_018446; NM_152932 | glycosyltransferase 8 domain containing 1 (GLT8D1)
cg27173374 | NM_053064 | G protein subunit gamma 2 (GNG2), transcript variant 1
cg16267121 | NM_001190472; NM_001015885; NM_003610 | MT-RNR2 like 3 (MTRNR2L3)
cg16322747 | NM_003440; NM_001300777; NM_001300778; NM_001300776 | zinc finger protein 140 (ZNF140)
cg16465695 | NM_014238 | kinase suppressor of ras 1 (KSR1)
cg16593468 | NR_028444; NM_006810 | protein disulfide isomerase family A member 5 (PDIA5)
cg16673857 | NM_018418; NM_001040428 | spermatogenesis associated 7 (SPATA7)
cg17343879 | NM_148977; NM_138316; NM_148978; NR_029524 | pantothenate kinase 1 (PANK1)
cg16781885 | NM_001034845 | polypeptide N-acetylgalactosaminyltransferase like 6 (GALNTL6)
cg00876345 | NM_003363; NM_199443 | ubiquitin specific peptidase 4 (USP4)
cg14756158 | NM_002072 | G protein subunit alpha q (GNAQ)
cg19702785 | NM_002251 | potassium voltage-gated channel modifier subfamily S member 1 (KCNS1)
cg27394136 | NM_007215 | DNA polymerase gamma 2, accessory subunit (POLG2)
cg18339380 | NM_020225 | storkhead box 2 (STOX2)
cg18506897 | NR_073547; NM_004796 | neurexin 3 (NRXN3)
cg18768299 | NM_014753 | BMS1, ribosome biogenesis factor (BMS1)
cg18815943 | NM_012186 | forkhead box E3 (FOXE3)
cg10522765 | NM_004544 | NADH: ubiquinone oxidoreductase subunit A10 (NDUFA10)
cg12238343 | NM_016568 | relaxin family peptide receptor 3 (RXFP3)
cg19301963 | NM_032638; NR_125398 | GATA binding protein 2 (GATA2)
cg00831672 | NM_001101426; NM_001101417 | isoprenoid synthase domain containing (ISPD)
cg19810954 | NM_015833; NM_001160230; NM_001112; NR_027674; NR_027672; NM_015834; NR_027673 | adenosine deaminase, RNA specific B1 (ADARB1)
cg20088545 | NM_058238 | Wnt family member 7B (WNT7B)
cg20425444 | NM_015310 | pleckstrin and Sec7 domain containing 3 (PSD3)
cg21165089 | NM_001300803; NM_001037225 | membrane anchored junction protein (MAJIN)
cg21962791 | NM_024854 | pyridine nucleotide-disulphide oxidoreductase domain 1 (PYROXD1)
cg22101188 | NM_001252335; NM_032866 | cingulin like 1 (CGNL1), transcript variant 1
cg22444338 | NM_001134395; NM_032350; NM_001134396 | chromosome 7 open reading frame 50 (C7orf50)
cg22519947 | NM_024848 | MORN repeat containing 1 (MORN1)
cg22540792 | NR_024191; NM_001308076; NM_001135673; NM_022374 | atlastin GTPase 2 (ATL2), transcript variant 3
cg23128025 | NM_052950 | WD repeat and FYVE domain containing 2 (WDFY2)
cg23677767 | NM_001198675; NM_001198674; NM_001198673; NM_001198672; NM_001198671; NM_001198670; NM_174926 | transmembrane protein 136 (TMEM136)
cg01520297 | NM_005539 | inositol polyphosphate-5-phosphatase A (INPP5A)
cg25606723 | NM_015130 | TBC1 domain family member 9 (TBC1D9)
cg25642673 | NM_002199 | interferon regulatory factor 2 (IRF2)
cg14681176 | NM_016538 | sirtuin 7 (SIRT7)
cg26430984 | NM_173465 | collagen type XXIII alpha 1 chain (COL23A1)
cg02536625 | NM_003875; NM_003875 | guanine monophosphate synthase (GMPS)
cg27320127 | NM_022055 | potassium two pore domain channel subfamily K member 12 (KCNK12)
cg16245716 | NM_021238; NM_001135811; NM_001135812 | SIN3-HDAC complex associated factor (SINHCAF)
cg27540719 | NM_005330 | hemoglobin subunit epsilon 1 (HBE1)
cg16950671 | NM_198795 | tudor domain containing 1 (TDRD1)

[0574] In a reduced gene set (Table 5), druggable gene targets have been selected from Table 4. In particular, the genes have been selected if an in vitro assay for determining the activity or function of the encoded protein was known in the art.

TABLE 5
UCSC_RefGene_Accession | Name of first accession No.
NM_030885; NM_001134364; NM_002375 | microtubule associated protein 4 (MAP4)
NM_001033582; NM_002744; NM_001033581 | protein kinase C zeta (PRKCZ)
NM_001077243; NM_001112812; NM_001077244; NM_000829 | glutamate ionotropic receptor AMPA type subunit 4 (GRIA4)
NR_046356; NM_001077243; NM_000829 | glutamate ionotropic receptor AMPA type subunit 4 (GRIA4)
NM_018412; NM_021908 | suppression of tumorigenicity 7 (ST7)
NM_006255 | protein kinase C eta (PRKCH)
NM_000720; NM_001128840; NM_001128839 | calcium voltage-gated channel subunit alpha1 D (CACNA1D)
NM_004394 | death associated protein (DAP)
NM_145068 | transient receptor potential cation channel subfamily V member 3 (TRPV3)
NM_002569; NM_001289823 | furin, paired basic amino acid cleaving enzyme (FURIN)
NM_198838; NM_198837; NM_198839; NM_198836; NM_198834 | acetyl-CoA carboxylase alpha (ACACA)
NM_002069; NM_001256414 | G protein subunit alpha i1 (GNAI1)
NM_004171 | solute carrier family 1 member 2 (SLC1A2)
NM_000923 | phosphodiesterase 4C (PDE4C)
NM_017770 | ELOVL fatty acid elongase 2 (ELOVL2)
NM_017770 | ELOVL fatty acid elongase 2 (ELOVL2)
NM_006312; NM_001077261 | nuclear receptor corepressor 2 (NCOR2)
NM_001430 | endothelial PAS domain protein 1 (EPAS1)
NM_053064 | G protein subunit gamma 2 (GNG2)
NM_148977; NM_138316; NM_148978; NR_029524 | pantothenate kinase 1 (PANK1)
NM_003363; NM_199443 | ubiquitin specific peptidase 4 (USP4)
NM_002072 | G protein subunit alpha q (GNAQ)
NM_002251 | potassium voltage-gated channel modifier subfamily S member 1 (KCNS1)
NM_007215 | DNA polymerase gamma 2, accessory subunit (POLG2)
NM_004544 | NADH: ubiquinone oxidoreductase subunit A10 (NDUFA10)
NM_016568 | relaxin family peptide receptor 3 (RXFP3)
NM_001101426; NM_001101417 | isoprenoid synthase domain containing (ISPD)
NM_005539 | inositol polyphosphate-5-phosphatase A (INPP5A)
NM_016538 | sirtuin 7 (SIRT7)
NM_003875; NM_003875 | guanine monophosphate synthase (GMPS)
NM_021238; NM_001135811; NM_001135812 | SIN3-HDAC complex associated factor (SINHCAF)
NM_198795 | tudor domain containing 1 (TDRD1)

[0575] Finally, a list with 68 (partially redundant) coding sequences and non-coding sequences such as miRNAs or long non-coding RNAs was selected from the 88 CpGs determined by LASSO+stepwise regression (Table 6). The table further shows the coefficients of the respective age indicator and their standard errors (see Example 5).

TABLE 6
Coefficient | +/−Std. Error | ID | UCSC_Ref_Gene
66.2822 | 9.8319 | cg11330075 |
65.203 | 12.7828 | cg00831672 | ISPD
55.7265 | 7.5377 | cg27320127 | KCNK12
44.4116 | 8.4185 | cg27173374 | GNG2
38.3902 | 11.4848 | cg14681176 | SIRT7
37.8069 | 7.8695 | cg06161948 | GPATCH1
36.6564 | 9.964 | cg08224787 |
31.9397 | 8.4487 | cg05396610 | GRIA4
30.1919 | 9.7667 | cg15609017 | LINC01531
28.089 | 8.4046 | cg09805798 | LOC101927577
27.9392 | 6.4631 | cg19215678 | NCOR2
27.8502 | 6.5183 | cg12333719 | WASF3
27.226 | 11.4717 | cg03741619 | TRPV3
27.0323 | 8.3075 | cg16677512 | ACACA
25.9599 | 6.5411 | cg03230469 | GDNF
25.3932 | 7.5404 | cg19851481 |
24.5374 | 9.2886 | cg10543136 | EFHC1
22.5525 | 110.8777 | cg07291317 | MYO10
21.8666 | 13.0388 | cg26430984 | COL23A1
20.3621 | 4.083 | cg16950671 | TDRD1
20.3269 | 4.3239 | cg16867657 | ELOVL2
19.7973 | 11.6224 | cg22077936 |
18.7137 | 3.9634 | cg08044253 | GNAI1
18.2047 | 6.1215 | cg12548216 | MAP4
18.1936 | 4.9361 | cg05211227 | CCDC179
18.0812 | 6.0906 | cg13759931 |
17.6857 | 5.0036 | cg08686931 |
17.5303 | 4.5192 | cg07955995 | KLF14
16.1143 | 6.2049 | cg07529089 | ST7
14.8703 | 8.1841 | cg01520297 | INPP5A
14.6684 | 4.3239 | cg00087368 | SIM1
14.4397 | 9.0743 | cg05087008 | GRIA4
14.4361 | 3.4811 | cg24724428 | ELOVL2
14.3055 | 5.5169 | cg19112204 | SLC1A2
14.2968 | 4.1059 | cg04525002 |
14.2302 | 9.571 | cg08856941 | AS3MT
13.3831 | 8.8481 | cg16465695 | KSR1
11.8127 | 8.6353 | cg08097417 | KLF14
11.7798 | 7.2263 | cg21628619 |
11.3523 | 5.5046 | cg09460489 |
11.2461 | 3.2763 | cg13460409 | DSCR6
10.6268 | 4.8908 | cg25642673 | IRF2
10.4347 | 7.2693 | cg19702785 | KCNS1
9.7844 | 7.4354 | cg18506897 | NRXN3
9.5931 | 5.0988 | cg21165089 | C11orf85
9.093 | 3.9039 | cg27540719 | HBE1
8.9361 | 6.2141 | cg21807065 |
8.8577 | 3.708 | cg18815943 | FOXE3
8.6138 | 2.8016 | cg23677767 | TMEM136
7.1699 | 3.726 | cg07802350 | HOXD13
7.0528 | 4.2489 | cg11176990 | LOC375196
6.5416 | 1.9413 | cg10321869 |
6.5049 | 3.478 | cg17343879 | PANK1
5.8296 | 2.8652 | cg08662753 | COL5A1
5.696 | 3.7948 | cg14911690 | PBX4
3.2983 | 1.8057 | cg12804730 |
3.1388 | 2.007 | cg16322747 | ZNF140
−4.8653 | 3.4742 | cg14231565 | GALNTL6
−5.5608 | 2.5813 | cg10501210 |
−6.047 | 2.4969 | cg09275691 | NUP107
−6.35 | 3.4617 | cg15008041 |
−9.1942 | 6.2636 | cg05812299 | MTRNR2L5
−9.3144 | 3.8416 | cg24319133 |
−9.4566 | 4.137 | cg12658720 | C17orf82
−9.8704 | 3.0654 | cg20576243 |
−10.4082 | 3.2632 | cg03473532 | MKLN1
−10.6429 | 7.4387 | cg07381960 | FURIN
−11.1592 | 3.2236 | cg05106770 |
−12.0021 | 4.6698 | cg04320377 | KLHL42
−12.3296 | 2.7158 | cg19432688 |
−12.9858 | 10.2914 | cg22519947 | MORN1
−13.7116 | 2.9505 | cg06831571 |
−13.8029 | 3.2707 | cg08194377 | ANKS1A
−13.8668 | 4.4903 | cg01636910 | BCL10
−14.6975 | 11.6384 | cg14305139 | DENND3
−15.0408 | 2.9644 | cg04028695 |
−16.3295 | 7.5252 | cg15743533 | FAM110A
−16.3314 | 5.0278 | cg03680898 | PROS1
−18.6196 | 4.4565 | cg20088545 | WNT7B
−19.0952 | 3.3737 | cg13333913 | FBXL7
−19.3068 | 7.0512 | cg19301963 | GATA2
−21.5752 | 6.8028 | cg13973351 | VPS37C
−23.0892 | 4.2648 | cg16781885 | GALNTL6
−26.0415 | 6.6199 | cg04287203 | NRP1
−32.3606 | 8.9103 | cg27394136 | POLG2
−48.0918 | 10.9191 | cg10240079 | ANKRD37
−50.0227 | 10.3763 | cg02536625 | GMPS
−63.4434 | 21.7615 | cg23128025 | WDFY2

Example 7: Iterative Updating of the Age Indicator

[0576] The age indicator was automatically updated with cases (probands; individuals) depending on whether the domain boundaries of the test data lay outside the domain boundaries of the training data set of the age indicator. The domain boundaries were the minimum and maximum DNA methylation levels of each genomic DNA sequence comprised in the age indicator. The minimum and maximum DNA methylation levels were initially found in the original training data set which had been used for determining the age indicator. These values change whenever the values of further individuals come in and replace the original min and max values for any of the CpGs. Min values will consequently diminish (if min is not yet 0) and max values will increase (if max is not yet 1) per CpG. In doing so, the domain boundaries of the age indicator will expand to optimal values, and it will become increasingly improbable that the age indicator is updated further.

[0577] The updating was done with the following R code:

##%######################################################%##
#### Predictions with a test data set ####
##%######################################################%##
library(dplyr)  # %>% and select()
library(caret)  # postResample()

prdct <- data.frame(SampleID = newsamplesdf$SampleID,
                    pred_age = predict(model_blasso, newsamplesdf),
                    stringsAsFactors = FALSE)
plot(newsamplesdf$Age, prdct$pred_age, pch = 16, col = "red",
     xlab = "Real Age", ylab = "Predicted Age")
abline(0, 1, col = "red")

##%######################################################%##
#### If the predictions this way are ####
#### not satisfactory need to run this ####
##%######################################################%##
IME_blasso <- IME_blasso %>% dplyr::select(Age, everything())
# CpG columns (assumed here to be named by their cg identifiers)
cpgs <- grep("^cg", colnames(IME_blasso), value = TRUE)
# domain (min/max methylation level per CpG) of the training set
domain <- data.frame(min = apply(as.matrix(IME_blasso[, cpgs]), 2, min),
                     max = apply(as.matrix(IME_blasso[, cpgs]), 2, max))
# calculate domain for new samples
domain_curr <- data.frame(min = apply(as.matrix(newsamplesdf[, cpgs]), 2, min),
                          max = apply(as.matrix(newsamplesdf[, cpgs]), 2, max))

##%######################################################%##
#### operative check for prediction ####
##%######################################################%##
# update the model if any new value lies outside the training domain
if (any(domain_curr$min < domain$min | domain_curr$max > domain$max)) {
  nnew <- NROW(newsamplesdf)
  nn <- NROW(IME_blasso)
  # add new probands to the training set (concatenate the two sets)
  newIME_blasso <- rbind(IME_blasso[, c("Age", cpgs)],
                         newsamplesdf[, c("Age", cpgs)])
  # rerun the model
  model_blasso_new <- step(lm(Age ~ ., data = newIME_blasso), direction = "both")
  sstep <- summary(model_blasso_new)
  print(sstep)
  ## check
  par(mfrow = c(1, 1))
  plot(newIME_blasso$Age, model_blasso_new$fitted.values,
       xlab = "Real Age [red points = new points]", ylab = "Predicted Age",
       main = paste("Stepwise Regression with IME_newModel CpGs R^2 = ",
                    round(sstep$r.squared, 3), sep = ""),
       pch = 1)
  abline(0, 1, col = "red")
  errs <- newIME_blasso$Age - model_blasso_new$fitted.values
  print(mean(abs(errs)))  # mean absolute error
  print(postResample(as.vector(model_blasso_new$fitted.values), newIME_blasso$Age))
  new_idx <- (nn + 1):(nn + nnew)  # rows of the newly added probands
  points(newIME_blasso$Age[new_idx],
         as.vector(model_blasso_new$fitted.values[new_idx]),
         col = "red", pch = 16)
  predictions <- data.frame(Age = newIME_blasso$Age[new_idx],
                            PredAge = model_blasso_new$fitted.values[new_idx])
  write.csv(predictions, "predictions.csv")
  save(model_blasso_new, file = "model_blasso_new.lm")
  # rm(newIME_blasso)
} else {
  # new samples lie within the training domain: predict with the existing model
  predicted <- predict.lm(model_blasso, newsamplesdf)
  plot(newsamplesdf$Age, predicted, pch = 12, main = "Predictions with IME_model")
  abline(0, 1, col = "red")
  external_pred <- data.frame(PredAge = predicted, RAge = newsamplesdf$Age)
  print(postResample(predicted, newsamplesdf$Age))
}
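The domain logic of paragraph [0576] can also be looked at in isolation. The following sketch uses only base R; the function names domain_of, outside_domain and expand_domain, the site names cg_a and cg_b, and all values are hypothetical and serve only to show how the boundaries are compared and widened.

```r
# Per-CpG domain boundaries (min and max) of a matrix of methylation levels.
domain_of <- function(m) {
  data.frame(min = apply(m, 2, min), max = apply(m, 2, max))
}

# TRUE if any new value lies below a stored minimum or above a stored maximum.
outside_domain <- function(domain, new_m) {
  new_dom <- domain_of(new_m)
  any(new_dom$min < domain$min | new_dom$max > domain$max)
}

# Widen the stored boundaries so that they also cover the new samples.
expand_domain <- function(domain, new_m) {
  new_dom <- domain_of(new_m)
  data.frame(min = pmin(domain$min, new_dom$min),
             max = pmax(domain$max, new_dom$max),
             row.names = rownames(domain))
}

# Hypothetical training data: three individuals, two CpG sites.
train <- matrix(c(0.2, 0.4, 0.6,    # cg_a
                  0.5, 0.6, 0.7),   # cg_b
                ncol = 2, dimnames = list(NULL, c("cg_a", "cg_b")))
dom <- domain_of(train)

new_inside  <- matrix(c(0.30, 0.65), ncol = 2, dimnames = list(NULL, c("cg_a", "cg_b")))
new_outside <- matrix(c(0.10, 0.65), ncol = 2, dimnames = list(NULL, c("cg_a", "cg_b")))

print(outside_domain(dom, new_inside))   # no update needed
print(outside_domain(dom, new_outside))  # cg_a falls below the stored minimum
expanded <- expand_domain(dom, new_outside)
print(expanded)
```

Since the boundaries can only widen, each update makes it less likely that a later sample triggers another one, which matches the behaviour described in paragraph [0576].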

Example 8: Further Statistical Analyses of Data and Prediction of Age

[0578] DNA has been sampled from approximately 200 individuals. These samples have all been obtained in northern Germany, but in order to have a broad database, care was taken not to exclude any individual in view of factors such as chronological age, general health state, obesity, level of physical fitness, or drug consumption, including drugs such as nicotine and alcohol. Therefore, the group is considered to be representative of the general population.

[0579] CpG methylation levels of the DNA from biological samples of approximately 100 individuals have been determined using the method of Example 1, resulting in methylation levels of approximately 850,000 (850000) CpGs for each individual.

[0580] In view of the amount of data and the computational expense of its analysis, the data was split into smaller arbitrary groups, and then, the data of these smaller groups was analyzed.

[0581] Using the data of a first group of 16 individuals, a principal component analysis was performed. It was found that about 10 principal components account for almost all of the variance observed in the methylation levels of the CpGs in the group's samples, with the first two components already covering 98% of the variation. This clearly indicates that, despite the extremely large number of different CpG methylation levels considered, a reduction of the number is advisable. Based on the principal component analysis and using regression techniques, a predictor model was established for each group; however, the models constructed still suffered from insignificance of some of the coefficients.
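How the share of variance covered by the leading principal components is obtained can be sketched with base R's prcomp(). The data below are simulated from two hidden factors purely for illustration; they are not methylation measurements, and the dimensions (16 individuals, 200 sites) are assumptions chosen to keep the example small.

```r
set.seed(42)
n_ind <- 16    # individuals, as in the first group
n_cpg <- 200   # far fewer sites than on the chip, to keep the example small

# Simulated data driven by two hidden factors plus a little noise.
latent   <- matrix(rnorm(n_ind * 2), nrow = n_ind)
loadings <- matrix(rnorm(2 * n_cpg), nrow = 2)
meth     <- latent %*% loadings +
            matrix(rnorm(n_ind * n_cpg, sd = 0.05), nrow = n_ind)

# Principal component analysis on the centred data.
pca <- prcomp(meth, center = TRUE, scale. = FALSE)

# Proportion of variance per component and its running total.
explained     <- pca$sdev^2 / sum(pca$sdev^2)
cum_explained <- cumsum(explained)
print(round(cum_explained[1:5], 3))
```

When a handful of components carries nearly all of the variance, as in this simulation, many columns are close to redundant, which is the motivation given for reducing the number of CpGs.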

[0582] It was thus determined that, even so, a number of the coefficients had no statistical significance.

[0583] Given this, data from a first larger group of 98 individuals was analysed with the intention of establishing a model having a clearly reduced number of CpGs to be considered while maintaining a high statistical significance of all parameters. To this end, first a LASSO regression was executed; note that LASSO regression is a technique well known in the art and that software packages implementing LASSO regression are readily available. Note that it is possible to distinguish whether or not the methylation levels of a given CpG are of particular statistical relevance; this allows only CpGs having some relevance to be considered. In this respect, reference is made to "The biglasso Package: A Memory- and Computation-Efficient Solver for LASSO Model Fitting with Big Data in R" by Yaohui Zeng and Patrick Breheny, arXiv:1701.05936v2 [stat.CO], 11 Mar. 2018. Using a selection of only 50 different CpGs determined to constitute an optimal set by the LASSO regression, an attempt was made to further optimize the model derived. This was done using the XgBoost algorithm. Note that XgBoost is a well-known open-source software library which provides a gradient boosting framework for a number of languages, and that it serves here to amend the coefficients used in a statistical model. For further details with respect to the XgBoost algorithm and its implementation, reference is made to "XGBoost: A Scalable Tree Boosting System" by T. Chen and C. Guestrin, arXiv:1603.02754v3, 10 June 2016. The contents of the cited documents are included herein in their entirety for purposes of disclosure.
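The variable-selecting effect of LASSO can be illustrated without any additional package. The sketch below is a plain coordinate-descent solver written for this illustration; it is not the biglasso implementation, and the function names, the penalty value and the simulated data are all assumptions.

```r
# Soft-thresholding operator: shrinks z towards zero by g, exactly zero if |z| <= g.
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Minimal LASSO by cyclic coordinate descent, minimising
#   (1 / (2n)) * sum((y - X %*% beta)^2) + lambda * sum(abs(beta))
# on standardised predictors and a centred response.
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  X <- scale(X)            # each column: mean 0, standard deviation 1
  y <- y - mean(y)
  n <- nrow(X)
  p <- ncol(X)
  beta <- numeric(p)
  r <- y                   # residual for beta = 0
  for (it in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      r <- r + X[, j] * beta[j]               # residual without predictor j
      z <- sum(X[, j] * r) / n
      beta[j] <- soft_threshold(z, lambda) / (sum(X[, j]^2) / n)
      r <- r - X[, j] * beta[j]               # residual with updated predictor j
    }
  }
  beta
}

# Simulated data: 60 individuals, 30 sites, only the first two related to age.
set.seed(7)
n <- 60
p <- 30
X <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("cg_toy", 1:p)))
age <- 45 + 3 * X[, 1] - 2 * X[, 2] + rnorm(n, sd = 0.5)

beta <- lasso_cd(X, age, lambda = 0.5)
selected <- colnames(X)[beta != 0]   # sites with nonzero coefficients
print(selected)
```

Sites whose coefficient is shrunk to exactly zero drop out of the model, which is how a large starting set is reduced to a small ensemble; in practice the penalty would be tuned, for example by cross-validation.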

[0584] It was found that a performant model could be obtained yielding good regression coefficients.

[0585] However, rather than contenting oneself with having achieved a high regression coefficient for the group considered, and maintaining the performant model as is, data from another 98 individuals were analyzed in the same manner as before. It was found that for the second group, about 78 CpGs should be considered in a model, with 8 of the 78 CpGs overlapping with the 50 CpGs selected for the first arbitrary group of 98 individuals.

[0586] Then, another run was made and it was determined that in a merged group, 70 CpGs would constitute a useful selection from the initially considered approximately 850000 different CpGs. Of these 70 CpGs, 10 overlapped only with those of the first group, 12 overlapped only with those of the second group, and 8 overlapped with both groups.
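The overlap counts can be reproduced with simple set operations. The identifiers below are placeholders constructed so that the set sizes match the figures in the text (50, 78 and 70 CpGs with overlaps of 8, 10, 12 and 8); they are not the CpGs actually selected.

```r
# Placeholder identifiers chosen only to reproduce the set sizes in the text.
group1 <- paste0("cg_toy", 1:50)     # 50 CpGs selected for the first group
group2 <- paste0("cg_toy", 43:120)   # 78 CpGs, 8 of them shared with group1
merged <- paste0("cg_toy", c(1:10, 43:50, 51:62, 121:160))  # 70 CpGs

in1 <- merged %in% group1
in2 <- merged %in% group2

counts <- c(only_first  = sum(in1 & !in2),
            only_second = sum(!in1 & in2),
            both        = sum(in1 & in2),
            neither     = sum(!in1 & !in2))
print(counts)
```

Under these figures, 30 of the 70 CpGs of the merged group had already been selected for at least one of the two groups, and the remaining 40 had not been selected for either.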

[0587] The regression performed with XgBoost made it possible to maintain the same high performance after 20 rounds of cross-validation.

[0588] This shows that by statistical means, in particular a LASSO regression, PCA or other means of distinguishing whether or not a specific CpG of a large number of CpG has statistical relevance, the number of CpGs can be significantly reduced from an overall extremely large set to a rather small set, allowing cheap detection using methods as referred to in Examples 2 and 3 above.

[0589] Then, relating only to the small set of CpGs, a useful model can be established that, despite the small number of CpGs considered, allows a determination of an age with high precision and a small confidence interval, in particular by re-iterating parameters of a statistical model established.

[0590] In this manner, despite an overall small number of CpGs considered, determination of an age will be quite precise initially and will have a reliability increasing with time.

CPC classification: C12Q2600/154; C12Q1/6883; G16B20/00; C12Q2600/124
International classification: C12Q1/6883; G16B20/00