EPIGENETIC AGE PREDICTOR
20230154566 · 2023-05-18
Assignee
Inventors
- Sandra Ann R. STEYAERT (San Diego, CA, US)
- Geert TROOSKENS (Meise, BE)
- Wim Maria R. Van Criekinge (Waarloos, BE)
- Adriaan VERHELLE (San Diego, CA, US)
- Johan Irma H. VANDERSMISSEN (Hoeselt, BE)
Cpc classification
A61B5/14532
HUMAN NECESSITIES
G01N33/92
PHYSICS
G16H50/30
PHYSICS
G16B20/00
PHYSICS
International classification
G16B30/00
PHYSICS
A61B5/00
HUMAN NECESSITIES
A61B5/145
HUMAN NECESSITIES
Abstract
We propose an epigenetic age predictor and a method of training the same. The epigenetic age predictor is configured to receive a plurality of inputs corresponding to methylation values at CpG sites. The epigenetic age predictor is configured to receive a plurality of inputs corresponding to phenotypic values of an individual. The epigenetic age predictor predicts an epigenetic age of the individual based on the sequence of inputs.
Claims
1. A computer-implemented method of generating an epigenetic age prediction, the method comprising: receiving a first sequence of inputs corresponding to methylation values at a plurality of CpG sites of an individual; receiving a second sequence of inputs corresponding to phenotypic values relative to the individual; and applying a model on the first sequence of inputs and the second sequence of inputs to predict the epigenetic age, wherein the model includes at least one of a trained neural network, fitted linear regression or trained random forest.
2. The method of claim 1, wherein the second sequence of inputs corresponds to phenotypic values of a phenotypic input type.
3. The method of claim 2, wherein the phenotypic values of the phenotypic input type include image data obtained from an image of the individual.
4. The method of claim 2, wherein the phenotypic values of the phenotypic input type include biometric data of the individual.
5. The method of claim 4, wherein the biometric data includes a plurality of values corresponding to one or more of the following input subtypes: oxygen-consumption rate, basal metabolic rate, waist circumference, body mass index, fat percentage, muscle percentage, lean muscle mass, visceral fat mass, and bone mass.
6. The method of claim 2, wherein the phenotypic values of the phenotypic input type include cardiovascular data of the individual.
7. The method of claim 6, wherein the cardiovascular data includes a plurality of values corresponding to one or more of the following input subtypes: heart rate variability, maximum heart rate, minimum heart rate, heart rate range, and resting heart rate.
8. The method of claim 2, wherein the phenotypic values of the phenotypic input type include blood data obtained from a blood sample.
9. The method of claim 8, wherein the blood data includes a plurality of values corresponding to one or more of the following input subtypes: blood pressure, blood oxygen level, lipid profile, sugar profile, and hormone levels.
10. The method of claim 9, wherein the input subtype corresponding to a lipid profile includes values of one or more of the following phenotypic measurements: total cholesterol, low-density lipoproteins, high-density lipoproteins, and triglycerides.
11. The method of claim 9, wherein the input subtype corresponding to a sugar profile includes values of one or more of the following phenotypic measurements: glucose and hemoglobin A1C.
12. The method of claim 9, wherein the input subtype corresponding to hormone levels includes values of one or more of the following phenotypic measurements: cortisol, testosterone, and estrogen.
13. The method of claim 1, wherein the second sequence of inputs corresponds to phenotypic values of a plurality of phenotypic input types.
14. The method of claim 13, wherein the phenotypic values include values from two or more of the following phenotypic input types: image data, cardiovascular data, biometric data, and blood data.
15. A computer-implemented method of generating an epigenetic clock predictor, the method comprising: receiving a plurality of methylation profiles from a plurality of individuals, the plurality of methylation profiles comprising methylation values for m CpG sites; receiving a plurality of phenotypic profiles, the plurality of phenotypic profiles comprising phenotypic values for one or more phenotypic input types; and training a model based on the plurality of methylation profiles and the plurality of phenotypic profiles, the model being configured to predict an epigenetic age based on methylation values for n CpG sites and the phenotypic values for the one or more phenotypic input types, wherein the model includes at least one of a trained neural network, fitted linear regression or trained random forest.
16. The method of claim 15, wherein the plurality of phenotypic profiles includes phenotypic values corresponding to one or more of the following input types: image data, cardiovascular data, biometric data, and blood data.
17. The method of claim 15, wherein the plurality of phenotypic profiles is received from a plurality of different individuals than the plurality of methylation profiles.
18. A computer-implemented epigenetic age predictor comprising: an input component configured to receive a first sequence of inputs corresponding to methylation values at CpG sites of an individual and a second sequence of inputs corresponding to phenotypic values of the individual; wherein the second sequence of inputs includes phenotypic values corresponding to image data obtained from an image of the individual, the image data being indicative of one or more phenotypic characteristics of the individual; and wherein the epigenetic age predictor applies at least one of a trained neural network, fitted linear regression or trained random forest to predict an epigenetic age of the individual based on the first sequence of inputs and the second sequence of inputs.
19. The system of claim 18, wherein the one or more phenotypic characteristics include characteristics relating to the skin of the individual.
20. The system of claim 18, wherein the image is a facial image of the individual.
21. The method of claim 1, further including in the first sequence of inputs at least 42 and less than 200 CpG sites of the individual.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0021] In the drawings, like reference characters generally refer to like parts throughout the different views. Also, the drawings are not necessarily to scale, with an emphasis instead generally being placed upon illustrating the principles of the technology disclosed. In the following description, various implementations of the technology disclosed are described with reference to the following drawings, in which:
[0022]
[0023]
[0024]
[0025]
[0026]
[0027]
[0028]
[0029]
[0030]
[0031]
[0032]
[0033]
DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS
[0034] The following discussion is presented to enable any person skilled in the art to make and use the technology disclosed, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
[0035] Many studies have shown that a human's environment or behavior can cause their body to biologically age at an accelerated rate. It has been difficult in the past to estimate an individual's biological age with accuracy. Genetics have been used with limited effectiveness as aging does not predictively correlate with alterations of the human DNA sequence. Epigenetics, however, have been shown to correlate with biological aging.
[0036] Epigenetics is the study of changes in gene expression that are not the result of changes in the DNA sequence itself. Some examples of processes that are in the field of epigenetics include acetylation, phosphorylation, ubiquitylation, and methylation. Methylation specifically, as we have found, can correlate with a number of different diseases and conditions, with the health of an individual, and with the biological age of the individual. Methylation can occur at several million different places in the human genome. Methylation can also differ from tissue to tissue and even cell to cell in the human body.
[0037]
[0038]
[0039]
[0040]
[0041]
[0042] At location 420 is the Mono-ADP Ribosylhydrolase 1 (MACROD1) gene. The protein encoded by this gene has a mono-ADP-ribose hydrolase enzymatic activity, i.e., removing ADP-ribose in proteins bearing a single ADP-ribose moiety, and is thus involved in the fine-tuning of ADP-ribosylation systems. ADP-ribosylation is a (reversible) post-translational protein modification controlling major cellular and biological processes, including DNA damage repair, cell proliferation and differentiation, metabolism, stress, and immune responses. MACROD1 appears to primarily be a mitochondrial protein and is highly expressed in skeletal muscle (a tissue with high mitochondrial content). It has been shown to play a role in estrogen and androgen signaling and sirtuin activity. Dysregulation of MACROD1 has been associated with familial hypercholesterolemia and pathogenesis of several forms of cancer, and in particular with progression of hormone-dependent cancers. The CpG site, cg15769472, is a probe mapping to the protein coding MACROD1 gene.
[0043] At location 508 is the Diacylglycerol Kinase Zeta (DGKZ) gene. The DGKZ gene codes for diacylglycerol kinase zeta. This kinase converts diacylglycerol (DAG) into phosphatidic acid (PA), which in turn activates mTORC1. The overall effect of mTORC1 activation is upregulation of anabolic pathways. Downregulation of mTORC1 has been shown to drastically increase lifespan. The CpG site, cg00530720, is a probe mapping to the promoter of the DGKZ gene.
[0044] At position 516 is the CD248 Molecule (CD248) gene. The CpG site, cg06419846, is a probe mapping to the CD248 gene.
[0045]
[0046] At location 424 is the PML gene. The phosphoprotein coded by this gene localizes to nuclear bodies where it functions as a transcription factor and tumor suppressor. Expression is cell-cycle related and it regulates the p53 response to oncogenic signals. The gene is often involved in the PML-RARA translocation between chromosomes 15 and 17, a key event in acute promyelocytic leukemia (APL). The exact role of PML-nuclear body (PML-NB) interaction is still under investigation. The current consensus is that PML-NBs are structures involved in processing cell damage and repairing DNA double-strand breaks. Interestingly, PML-NBs have been shown to decrease with age, and their stress response also declines with age. The latter can occur in a p53-dependent or p53-independent way. PML has also been implicated in cellular senescence, particularly its induction, and acts as a modulator of Werner syndrome, a type of progeria. The CpG site, cg05697231, is a probe mapping to the south shore of a CpG island in the PML gene.
[0047] At location 456 is the ADAM Metallopeptidase with Thrombospondin Type 1 Motif 17 (ADAMTS17) gene. The CpG site, cg07394446, is a probe mapping to ADAMTS17.
[0048] At location 460 is the SMAD Family Member 6 (SMAD6) gene. The CpG site, cg07124372, is a probe mapping to SMAD6.
[0049] At location 488 is the Carbonic Anhydrase 12 (CA12) gene. The CpG site, cg10091775, is a probe mapping to CA12.
[0050]
[0051] At location 428 is the Protein Tyrosine Phosphatase Receptor Type N (PTPRN) gene. This gene codes for a protein receptor involved in a multitude of processes including cell growth, differentiation, mitotic cycle and oncogenic transformation. More specifically, PTPRN plays a significant role in the signal transduction of multiple hormone pathways (neurotransmitters, insulin and pituitary hormones). PTPRN expression levels are also used as a prognostic tool for hepatocellular carcinoma (negative outcome correlation). The CpG site, cg03545227, is a probe mapping to the PTPRN gene.
[0052]
[0053] At location 512 is the Midnolin (MIDN) gene. The CpG site, cg07843568, is a probe mapping to MIDN.
[0054]
[0055]
[0056] At location 468 is the Stromal Antigen 3 (STAG3/GPC2) gene. The CpG site, cg18691434, is a probe mapping to the STAG3/GPC2 gene. At location 500 is the Huntingtin Interacting Protein 1 (HIP1) gene. The CpG site, cg13702357, is a probe mapping to the HIP1 gene. At location 524 is the DPY19L2 Pseudogene 4 (DPY19L2P4) gene. The CpG site, cg22370005, is a probe mapping to the DPY19L2P4 gene.
[0057]
[0058] At location 452 is the TMEM181 gene. The CpG site, cg02447229, is a probe mapping to the TMEM181 gene. At location 492 is the Zinc Finger and BTB Domain Containing 12 (ZBTB12) gene. The CpG site, cg06540876, is a probe mapping to the ZBTB12 gene.
[0059]
[0060]
[0061]
[0062] At location 472 is the Transglutaminase 4 (TGM4) gene. The CpG site, cg12112234, is a probe mapping to the TGM4 gene.
[0063] At location 480 is the Scm Like With Four Mbt Domains 1 (SFMBT1) gene. The CpG site, cg03607117, is a probe mapping to the SFMBT1 gene.
[0064] At location 504 is the Nudix Hydrolase 16 (NUDT16P) gene. The CpG site, cg22575379, is a probe mapping to the NUDT16P gene.
[0065]
[0066]
[0067]
[0068]
[0069]
[0070]
[0071] On the left side of this figure, the locus of interest is unmethylated. It matches perfectly with unmethylated bead probe 608-1, enabling single-base extension and detection. The unmethylated locus has a single-base mismatch to the methylated bead probe 608-2, which inhibits extension and results in a low signal on the array. If the CpG locus of interest is methylated, the reverse occurs: the methylated bead type 608-4 will display a signal, and the unmethylated bead type 608-3 will show a low signal on the array. If the locus has an intermediate methylation state, both probes will match the target site and will be extended. Methylation status of the CpG site is determined by a β-value calculation, which is the ratio of the fluorescent signal from the methylated beads to the total locus intensity. The array chip containing the beads 608 can be read by an array scanning device, such as the iScan™ System provided by Illumina or the NextSeq™ 550 System provided by Illumina.
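By way of a non-limiting illustration, the β-value calculation described above can be sketched as follows. The function name, the argument names, and the optional `offset` term are assumptions introduced for this example only; the text above specifies only the ratio of the methylated signal to the total locus intensity.

```python
def beta_value(methylated: float, unmethylated: float, offset: float = 0.0) -> float:
    """Return the methylated signal divided by the total locus intensity.

    `offset` is a hypothetical stabilizing constant (an assumption, not taken
    from the description); the default of 0 gives the plain ratio.
    """
    if methylated < 0 or unmethylated < 0 or offset < 0:
        raise ValueError("intensities and offset must be non-negative")
    total = methylated + unmethylated + offset
    if total == 0:
        raise ValueError("total locus intensity is zero")
    return methylated / total


# Illustrative intensities (made-up numbers, not measured data).
mostly_methylated = beta_value(900.0, 100.0)
unmethylated_locus = beta_value(0.0, 1000.0)
```

A value near one then indicates a methylated locus, a value near zero an unmethylated locus, and values in between an intermediate methylation state.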
[0072]
[0073]
[0074] As indicated by block 712, the plurality of individuals' ages are known. This known age can be used to generate the model. For example, the plurality of methylation values in each profile can be used as an input vector and the known age is the scalar output value.
[0075] As indicated by block 714, the number of CpG sites in each of the plurality of methylation profiles is m. The quantity m can be a variety of different numbers. For example, m can correspond to a resolution of the methylation analysis method used on the plurality of individuals. For instance, Illumina provides methylation microarrays that detect approximately 850,000 CpG sites (Infinium MethylationEPIC array), and Illumina also provides methylation sequencing for 3.3 million or 36 million CpG sites.
[0076] Operation 700 proceeds at block 720 where the plurality of methylation profiles received in block 710 are normalized or otherwise pre-processed. As indicated by block 722, the methylation profiles can be normalized based on age. For instance, a specific age range may be overrepresented in the plurality of profiles, in which case the profiles from this age range may need to be weighted less or made less likely to be sampled. Alternatively, a specific age range may be underrepresented in the plurality of profiles, in which case the profiles from this age range may need to be weighted higher or made more likely to be sampled.
[0077] As indicated by block 724, the plurality of methylation profiles may be curated based on a quality metric. For example, methylation profiles that have a quality metric lower than a threshold are discarded or weighted less than methylation profiles of a higher quality metric. In some examples, the quality metric is indicative of the accuracy of the methylation process. In some examples, the quality metric is indicative of the quality of the sample used to generate the particular methylation profile. In some examples, the quality metric is indicative of the quality of the test used to generate the particular methylation profile. In other examples, the quality metric may be indicative of other factors relating to the particular methylation profile. As indicated by block 728, the plurality of methylation profiles can be normalized or pre-processed in other ways as well.
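As one possible sketch of the age-based normalization of block 722, each profile can be given a weight inversely proportional to the number of profiles in its age range. The ten-year bin width and the function names are assumptions chosen for illustration and are not prescribed by the description.

```python
from collections import Counter


def age_bin(age: float, width: int = 10) -> int:
    """Map an age to the index of its age range (hypothetical 10-year bins)."""
    return int(age // width)


def inverse_frequency_weights(ages, width: int = 10):
    """Weight each profile so that every occupied age range carries equal total weight.

    The weights sum to the number of profiles, so overrepresented ranges are
    weighted below 1 and underrepresented ranges above 1.
    """
    bins = [age_bin(age, width) for age in ages]
    counts = Counter(bins)
    total, occupied = len(ages), len(counts)
    return [total / (occupied * counts[b]) for b in bins]


# Three profiles in their twenties and one in their sixties (made-up ages).
weights = inverse_frequency_weights([25, 27, 29, 61])
```

The same weights could instead be passed to a weighted sampler, for example `random.choices(profiles, weights=weights, k=...)`, to make profiles from underrepresented age ranges more likely to be drawn.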
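The curation of block 724 might be sketched as below, assuming each profile is a dictionary carrying a numeric `quality` field. That layout, the threshold value, and the `low_weight` penalty are hypothetical choices made for this example.

```python
def curate_by_quality(profiles, threshold, mode="discard", low_weight=0.5):
    """Return (profile, weight) pairs after applying a quality threshold.

    mode="discard" drops profiles below the threshold; mode="downweight"
    keeps them with the reduced weight `low_weight`.
    """
    if mode not in ("discard", "downweight"):
        raise ValueError("mode must be 'discard' or 'downweight'")
    curated = []
    for profile in profiles:
        if profile["quality"] >= threshold:
            curated.append((profile, 1.0))
        elif mode == "downweight":
            curated.append((profile, low_weight))
    return curated


# Made-up profiles; the identifiers and quality scores are illustrative only.
example_profiles = [{"id": "a", "quality": 0.9}, {"id": "b", "quality": 0.4}]
kept = curate_by_quality(example_profiles, threshold=0.5)
```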
[0078] Operation 700 proceeds at block 730 where a feature selection operation is applied on the plurality of methylation profiles. In some examples, the feature selection is applied on the plurality of methylation profiles after they are pre-processed. In some examples, the feature selection operation is applied on the plurality of methylation profiles received in block 710.
[0079] As indicated by block 732, the feature selection operation applied on the plurality of methylation profiles includes elastic net regression. Elastic net regression combines the L1 penalty from lasso regression and the L2 penalty from ridge regression to reduce the number of CpG sites.
[0080] As indicated by block 734, the feature selection applied reduces the number of applicable CpG sites from m to n sites. This reduction can balance dimensionality when there is a small number of methylation profiles, but each profile has a large number of CpG site methylation values. In some examples, m is greater than 100,000. In some examples, m is greater than 400,000. In some examples, m is greater than 800,000. In some examples, n is less than 200. In some examples, n is less than 100. In some examples, n is less than 50.
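A minimal sketch of elastic net feature selection is given below, written with the standard library only so that it is self-contained. It minimizes (1/(2N))·||y − Xw||² + α·ρ·||w||₁ + (α/2)·(1 − ρ)·||w||² by coordinate descent on centered data and keeps the CpG sites whose coefficient is non-zero. The penalty values, iteration count, and toy data are assumptions; a practical implementation on m greater than 100,000 sites would more likely rely on an optimized library.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold` (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_select(X, y, alpha=0.1, l1_ratio=0.5, n_iter=200):
    """Return (indices of selected columns, coefficients) via coordinate descent."""
    n, m = len(X), len(X[0])
    # Center columns and target so that no explicit intercept is needed.
    col_means = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(m)] for row in X]
    residual = [value - y_mean for value in y]
    coef = [0.0] * m
    col_sq = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(m)]
    for _ in range(n_iter):
        for j in range(m):
            if col_sq[j] == 0.0:
                continue  # a constant column carries no information
            # Covariance of column j with the residual that excludes column j.
            rho = sum(Xc[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * coef[j]
            new = soft_threshold(rho, alpha * l1_ratio) / (col_sq[j] + alpha * (1 - l1_ratio))
            delta = new - coef[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * Xc[i][j]
                coef[j] = new
    selected = [j for j in range(m) if coef[j] != 0.0]
    return selected, coef


# Toy data (made up): column 0 tracks age, column 1 is constant,
# column 2 is uncorrelated with age.
toy_ages = [20, 30, 40, 50, 60, 70]
toy_X = [[0.2, 0.5, 0.9], [0.3, 0.5, 0.1], [0.4, 0.5, 0.5],
         [0.5, 0.5, 0.5], [0.6, 0.5, 0.1], [0.7, 0.5, 0.9]]
selected, coefficients = elastic_net_select(toy_X, toy_ages)
```

In this toy case only the age-tracking column survives, which mirrors the reduction from m candidate sites to n retained sites.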
[0081] Operation 700 proceeds at block 740 where a model is fit on the plurality of methylation profiles. In some implementations, the model is fit on the plurality of methylation profiles, but only considers the CpG sites in the subset of n CpG sites. In some implementations, the model is fit on all m CpG sites. As indicated by block 742, the model can include a linear regression model. As indicated by block 744, the model can include a random forest model. As indicated by block 748, the model can include a different type of model as well.
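For the linear regression option of block 742, the sketch below restricts each profile to a selected subset of CpG sites and fits ordinary least squares through the normal equations. The helper names and the toy data are hypothetical, and the solver is a plain Gauss-Jordan elimination intended only for small illustrative problems.

```python
def select_columns(X, selected):
    """Keep only the columns (CpG sites) listed in `selected`."""
    return [[row[j] for j in selected] for row in X]


def fit_linear_regression(X, y):
    """Ordinary least squares; returns (intercept, coefficients)."""
    n, k = len(X), len(X[0])
    rows = [[1.0] + list(row) for row in X]  # leading 1.0 models the intercept
    size = k + 1
    # Augmented normal equations [A^T A | A^T y].
    aug = [[sum(rows[i][a] * rows[i][b] for i in range(n)) for b in range(size)]
           + [sum(rows[i][a] * y[i] for i in range(n))] for a in range(size)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix; reduce or regularize the features")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(size):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, size + 1):
                    aug[r][c] -= factor * aug[col][c]
    solution = [aug[i][size] / aug[i][i] for i in range(size)]
    return solution[0], solution[1:]


# Toy profiles with three sites, of which sites 0 and 2 were selected (made up).
full_X = [[0.1, 0.7, 0.5], [0.2, 0.7, 0.3], [0.4, 0.7, 0.9], [0.8, 0.7, 0.2]]
reduced_X = select_columns(full_X, [0, 2])
known_ages = [10 + 50 * a + 20 * b for a, b in reduced_X]
intercept, site_coefficients = fit_linear_regression(reduced_X, known_ages)
```

Fitting on all m sites with this approach would generally be ill-posed when there are fewer profiles than sites, which is one reason the subset of n sites is of interest.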
[0082] Operation 700 proceeds at block 750 where the model is generated. In some examples, the model is generated as one or more files that can be imported by other systems which can further train the model or use the model for predictions.
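One way the generated model could be written to a file for import by another system is sketched below for the linear case. The JSON layout, the field names, and the coefficient values are assumptions for illustration; the description does not specify a file format.

```python
import json


def model_to_json(intercept, coefficients):
    """Serialize a linear model; `coefficients` maps CpG identifiers to weights."""
    payload = {
        "model_type": "linear_regression",
        "intercept": intercept,
        "coefficients": coefficients,
    }
    return json.dumps(payload, indent=2, sort_keys=True)


def model_from_json(text):
    """Parse a serialized linear model back into (intercept, coefficients)."""
    payload = json.loads(text)
    if payload.get("model_type") != "linear_regression":
        raise ValueError("unsupported model type")
    return payload["intercept"], payload["coefficients"]


def save_model(path, intercept, coefficients):
    """Write the serialized model to `path`."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(model_to_json(intercept, coefficients))


def load_model(path):
    """Read a model previously written by `save_model`."""
    with open(path, encoding="utf-8") as handle:
        return model_from_json(handle.read())


# The CpG identifiers appear in this description; the numbers are placeholders.
serialized = model_to_json(30.0, {"cg15769472": 20.0, "cg00530720": -10.0})
```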
[0083]
[0084] At block 812, the inputs are input into the epigenetic age prediction model. As noted above this model could include a linear regression model (e.g., where each input forms part of a linear expression), random forest model (e.g., where each input forms part of one or more decision trees), or some other type of model. At block 814, the model outputs an epigenetic age prediction. In some implementations, the model also outputs a confidence score.
[0085] Operation 900 begins at block 910 where the input values are received. As shown, there are 186 input values corresponding to the methylation values at 186 different CpG sites. These CpG sites are identified by their CpG cluster identifier number. In some implementations, only a subset of the shown CpG sites is used as inputs. In some implementations, the shown CpG sites are in order of their feature importance to the model (e.g., in the case of a random forest model) or in order of the absolute value of their coefficient (e.g., in a linear model). In one implementation, the input values are derived from a methylation analysis on a sample of human blood.
[0086] At block 912, the inputs are input into the epigenetic age prediction model. As noted above this model could include a linear regression model (e.g., where each input forms part of a linear expression), random forest model (e.g., where each input forms part of one or more decision trees), or some other type of model. At block 914, the model outputs an epigenetic age prediction. In some implementations, the model also outputs a confidence score.
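For the linear case, in which each input forms part of a linear expression, the prediction and the ordering of sites by absolute coefficient can be sketched as follows. The intercept, coefficients, and methylation values are placeholders chosen for illustration and are not fitted values from this disclosure.

```python
def predict_age(intercept, coefficients, methylation):
    """Evaluate intercept + sum(coefficient * methylation value) over the model's sites."""
    missing = [site for site in coefficients if site not in methylation]
    if missing:
        raise KeyError(f"missing methylation values for: {missing}")
    return intercept + sum(coefficients[site] * methylation[site] for site in coefficients)


def rank_sites(coefficients):
    """Order CpG sites by the absolute value of their coefficient, largest first."""
    return sorted(coefficients, key=lambda site: abs(coefficients[site]), reverse=True)


# Placeholder model and sample; only the CpG identifiers come from the description.
example_coefficients = {"cg15769472": 20.0, "cg00530720": -10.0}
example_sample = {"cg15769472": 0.5, "cg00530720": 0.25}
predicted_age = predict_age(30.0, example_coefficients, example_sample)
ranking = rank_sites(example_coefficients)
```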
[0087]
[0088] Operation 1100 begins at block 1110 where phenotypic data from one or more sources is received. Receiving phenotypic data can include, for example, receiving image data 1112. In one example, image data 1112 can be contained within an image of an individual (e.g., a facial image). The image received can be produced in a number of different ways. For example, the image can be an image taken by a camera. In another example, the image can be taken by a camera on a computer, such as a webcam. In another example, the received image can be produced by a camera on a mobile device (e.g., a smartphone). Additionally, it is expressly contemplated that the image can be produced in other manners as well. The received image can be in a number of different formats. For example, the image can be a JPEG image, PNG image, PDF, or any other suitable image format. Image data 1112 includes data relative to the individual's epigenetic age. For example, image data 1112 can include data relating to the individual's skin (e.g., wrinkles, blemishes, growths, marks, color, etc.). Additionally, image data 1112 can include data relating to other phenotypic characteristics of the individual as well, such as data relating to facial features (e.g., eyes, nose, mouth, etc.). Image data 1112 can be utilized to generate a phenotypic profile of the individual, which can provide further information pertaining to epigenetic age.
[0089] Receiving phenotypic data can also include, for example, receiving cardiovascular data 1114. Cardiovascular data 1114 can include, for example, heart rate data, such as heart rate variability (HRV), maximum heart rate (Max. HR), minimum heart rate (Min. HR), heart rate range, resting heart rate, etc. In one example, cardiovascular data 1114 includes a plurality of cardiovascular data types. However, in another example, cardiovascular data 1114 can include only one type of cardiovascular data.
[0090] Receiving phenotypic data 1110 can also include receiving biometric data 1116. Biometric data includes body measurements and/or calculations relative to the physical characteristics of the individual. For example, biometric data 1116 can include the individual's basal metabolic rate (BMR). In other examples, biometric data 1116 can include the individual's waist circumference, body mass index (BMI), fat percentage, muscle percentage, muscle-to-fat ratio, lean muscle mass, visceral fat mass, bone mass, oxygen consumption rate (VO₂ max), etc. In one example, biometric data 1116 includes one type of biometric data (e.g., BMR). However, in other examples, biometric data 1116 can include multiple biometric data types, as indicated above.
[0091] In another example, receiving phenotypic data can also include receiving blood data 1118. Blood data 1118 can be obtained by measuring an individual's blood characteristics using conventional methods. For example, blood pressure of the individual can be obtained using a blood pressure monitor. Additionally, blood data 1118 can be derived from a blood sample of the individual. The blood sample can be obtained in a number of ways. For instance, the blood sample can be obtained by a blood extraction device (e.g., a fingerstick). In another example, the blood sample can be obtained by use of a needle and syringe. Additionally, it is expressly contemplated that the blood sample can be obtained in other manners as well.
[0092] The blood sample can be processed to obtain blood data 1118. Processing the blood sample can include, for example, a complete blood count (CBC) or blood panel. Additionally, processing the blood can include other conventional means of blood analysis as well, such as machine analysis, centrifugation, etc. The blood data 1118 obtained from the blood sample may include a number of different data types. For example, blood data 1118 can include blood oxygen levels, lipid profiles (e.g., total cholesterol, low-density lipoproteins, high-density lipoproteins, triglycerides, etc.), sugar profiles (e.g., glucose, hemoglobin A1C, etc.), hormone levels (e.g., cortisol, testosterone, estrogen, etc.), and other parameters. In one example, blood data 1118 includes one type of blood data such as, for example, blood oxygen level. However, in other examples, blood data 1118 can include multiple blood data types, as indicated above.
[0093] As indicated by block 1120, receiving phenotypic data 1110 can include receiving multiple data types from multiple sources. For instance, image data 1112 can be received in addition to, for example, cardiovascular data 1114. In another example, cardiovascular data 1114 can be received in addition to biometric data 1116. In another example, data is received from all the above-indicated data sources. However, in another example, only one data type can be received as well. For example, the phenotypic data received may only include image data 1112. Additionally, it is expressly contemplated that other data types can be received as well, as indicated by block 1122.
[0094] Operation 1100 proceeds at block 1130, where the phenotypic data is converted to one or more phenotypic inputs. Converting phenotypic data to phenotypic inputs includes converting the phenotypic data to a measurable value that can be inputted into an epigenetic age prediction model to provide an individual's modified epigenetic age. For example, converting the phenotypic data can include obtaining phenotypic inputs from image data indicative of one or more phenotypic characteristics of the individual. The phenotypic data can be, for example, converted to an input in a format of a number value, as indicated by block 1132. For instance, in the case where the phenotypic data includes image data, the image data can be converted to an input of a decimal between zero and one (β-value), where zero indicates that there are no notable phenotypic markings (e.g., facial wrinkles) and one indicates that there is a high level of phenotypic markings. Additionally, image data can be converted to a different measurable phenotypic input as well, such as a different number value.
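As a simple illustration of block 1132, a raw phenotypic score can be rescaled to a decimal between zero and one. How the raw score is obtained from an image is not addressed here; the raw score, its bounds, and the function name are hypothetical.

```python
def to_unit_interval(raw_score, lowest, highest):
    """Rescale `raw_score` from [lowest, highest] to [0, 1], clamping out-of-range values."""
    if highest <= lowest:
        raise ValueError("highest must be greater than lowest")
    scaled = (raw_score - lowest) / (highest - lowest)
    return min(1.0, max(0.0, scaled))


# Hypothetical raw marking score on a 0-10 scale.
image_input = to_unit_interval(5, 0, 10)
```

Under this convention, zero corresponds to no notable phenotypic markings and one to a high level of phenotypic markings.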
[0095] As indicated by block 1134, converting the phenotypic data to one or more phenotypic inputs can include averaging the phenotypic data. Averaging can occur for any of the phenotypic data examples set forth above with respect to blocks 1112-1118. For example, if multiple data sets of cardiovascular data are obtained relative to an individual (e.g., a plurality of heart rate values), the data can be averaged to generate an average value of the cardiovascular data. In another example, multiple data sets of biometric data can be obtained relative to an individual (e.g., a plurality of VO₂ max measurements), where the data is averaged to generate an average value of the biometric data.
[0096] As indicated by block 1136, converting the phenotypic data to one or more phenotypic inputs can also include clustering values of relevant data types together. For example, data relating to an individual's total cholesterol, blood sugar profiles, and blood pressure can be organized as a set of blood inputs. In another example, data relating to an individual's BMR, BMI, and bone mass can be organized as a set of biometric inputs. In another example, physical characteristics present in an image, such as facial wrinkles, blemishes, and growths, can be organized as a set of image inputs. Additionally, it is expressly contemplated that other ways of converting phenotypic data to phenotypic inputs can be utilized as well, as indicated by block 1138.
[0097] Operation 1100 proceeds at block 1140 where the phenotypic profile of the individual is generated. A phenotypic profile contains one or more phenotypic inputs relative to the individual. For example, the phenotypic profile can include one or more sets of observable phenotypes of an individual. In addition to observable characteristics, the phenotypic profile can also include biochemical and/or physiological conditions relative to the individual, as indicated above. In this way, the phenotypic profile can provide one or more additional inputs indicative of the individual's epigenetic age. As indicated in block 1142, the phenotypic profile can comprise a single phenotypic input. For example, the phenotypic profile can include phenotypic inputs indicative of the obtained image data. In another example, the phenotypic profile can include phenotypic inputs indicative of the obtained biometric data. Additionally, as indicated by block 1144, the phenotypic profile can also include multiple phenotypic inputs. For example, the phenotypic profile can include one or more inputs obtained from image data 1112 in combination with one or more inputs obtained from cardiovascular data 1114. In another example, the phenotypic profile can include a combination of all the data types set forth above with respect to blocks 1112-1118.
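The averaging of block 1134, the grouping of block 1136, and the profile generation of block 1140 can be combined in a sketch such as the following. The measurement names and the grouping table are assumptions made for this example; the description leaves the exact set of inputs open.

```python
from statistics import mean

# Hypothetical assignment of measurement names to phenotypic input types.
GROUPS = {
    "cardiovascular": ("resting_heart_rate", "heart_rate_variability"),
    "biometric": ("bmr", "bmi", "vo2_max"),
    "blood": ("total_cholesterol", "glucose"),
    "image": ("wrinkle_score",),
}


def build_phenotypic_profile(measurements):
    """Average repeated readings and organize them by phenotypic input type.

    `measurements` maps a measurement name to a list of readings; input types
    with no readings are left out, so a profile may hold one type or several.
    """
    profile = {}
    for group, names in GROUPS.items():
        inputs = {name: mean(measurements[name]) for name in names if measurements.get(name)}
        if inputs:
            profile[group] = inputs
    return profile


# Made-up readings for one individual.
example_profile = build_phenotypic_profile({
    "resting_heart_rate": [60, 62, 64],
    "bmi": [24.0],
    "wrinkle_score": [0.25, 0.75],
})
```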
[0098]
[0099] As indicated by block 1212, the plurality of individuals' ages are known. This known age can be used to generate the model. For example, the plurality of methylation values in each profile can be used as an input vector and the known age is the scalar output value. Additionally, as indicated by block 1214, the number of CpG sites in each of the plurality of methylation profiles is m. The quantity m can be a variety of different numbers. Additionally, the methylation profiles of each individual of the plurality of individuals can include other information as well, as indicated by block 1216.
[0100] Operation 1200 proceeds at block 1220 where a plurality of phenotypic profiles from a plurality of individuals are received. The received phenotypic profiles can be, for example, derived from the same individuals from whom the plurality of methylation profiles were derived above with respect to block 1210. However, in another example, the plurality of phenotypic profiles can be derived from a different plurality of individuals. A phenotypic profile contains one or more phenotypic inputs relative to an individual. Phenotypic inputs can be in a number of different formats, as described above. These phenotypic profiles for the individuals can be derived from the method described above with respect to
[0101] Operation 1200 proceeds at block 1230 where the plurality of methylation profiles received in block 1210 and the plurality of phenotypic profiles received in block 1220 are normalized or otherwise pre-processed. As indicated by block 1232, the methylation profiles and phenotypic profiles can be normalized based on age. For instance, a specific age range may be overrepresented in the plurality of profiles, in which case the profiles from this age range may need to be weighted less or made less likely to be sampled. Alternatively, a specific age range may be underrepresented in the plurality of profiles, in which case the profiles from this age range may need to be weighted higher or made more likely to be sampled.
[0102] As indicated by block 1234, the plurality of methylation profiles and phenotypic profiles may be curated based on a quality metric. For example, phenotypic profiles that have a quality metric lower than a threshold are discarded or weighted less than the profiles of a higher quality metric. In some examples, the quality metric is indicative of the quality of the data source and obtained phenotypic inputs. For instance, if a received image containing image data is of a low resolution and/or quality, phenotypic characteristics can be difficult to determine in the image, and consequently the resulting image data can be discarded or weighted less. In another example, the quality metric is indicative of the quality of the sample used to generate the phenotypic inputs. For instance, in the case of a blood sample used to produce blood data, the blood sample can be of low quality due to error in acquiring the sample. In such a case, the resulting phenotypic data received from the blood sample may be discarded or weighted differently. Additionally, in other examples, the quality metric may be indicative of other factors relating to the particular phenotypic profile. As indicated by block 1236, the plurality of methylation profiles and/or phenotypic profiles can be normalized or pre-processed in other ways as well.
[0103] Operation 1200 proceeds at block 1240 where a feature selection operation is applied on the plurality of methylation profiles. In some examples, the feature selection is applied on the plurality of methylation profiles after they are pre-processed. In some examples, the feature selection operation is applied on the plurality of methylation profiles received in block 1210.
[0104] As indicated by block 1242, the feature selection operation applied on the plurality of methylation profiles includes elastic net regression. Elastic net regression combines the L1 penalty of lasso regression and the L2 penalty of ridge regression to reduce the number of CpG sites. Additionally, as indicated by block 1244, the feature selection applied reduces the number of applicable CpG sites from m to n sites. This reduction can balance dimensionality when there is a small number of methylation profiles, but each profile has a large number of CpG site methylation values. As indicated by block 1246, other feature selection operations can be applied as well.
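The reduction from m to n CpG sites by elastic net regression could be sketched as follows. This is an illustrative coordinate-descent implementation, not necessarily the one used in practice; the penalty parametrization (alpha, l1_ratio), the toy data, and the site identifiers are assumptions.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def elastic_net(X, y, alpha=0.01, l1_ratio=0.5, n_iter=100):
    """Coordinate descent for
    (1/2N)*||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1-l1_ratio)*||w||^2,
    with columns and target centered so that no intercept is needed."""
    n, m = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(m)] for row in X]
    residual = [v - y_mean for v in y]  # equals yc - Xc @ w while w is all zeros
    w = [0.0] * m
    for _ in range(n_iter):
        for j in range(m):
            z = sum(Xc[i][j] ** 2 for i in range(n)) / n
            if z == 0.0:
                continue  # a constant column carries no information
            rho = sum(Xc[i][j] * (residual[i] + Xc[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= Xc[i][j] * delta
                w[j] = new_w
    return w

def select_sites(site_ids, weights):
    """Keep only the sites whose coefficient was not shrunk to zero."""
    return [s for s, wt in zip(site_ids, weights) if wt != 0.0]

# Toy data: m = 3 hypothetical sites, six individuals.
site_ids = ["site_a", "site_b", "site_c"]
ages = [20, 30, 40, 50, 60, 70]
X = [
    [0.30, 0.50, 1.0],
    [0.35, 0.52, 1.0],
    [0.40, 0.48, 1.0],
    [0.45, 0.48, 1.0],
    [0.50, 0.52, 1.0],
    [0.55, 0.50, 1.0],
]
weights = elastic_net(X, ages)
selected = select_sites(site_ids, weights)  # n = 1 site remains
```

In this toy data only the first site varies with age, so the L1 part of the penalty drives the coefficients of the other two sites to exactly zero and they are dropped.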
[0105] Operation 1200 proceeds at block 1250 where a feature selection operation can be applied to the plurality of phenotypic profiles. In some examples, the feature selection is applied on the plurality of phenotypic profiles after they are pre-processed. However, in other examples, the feature selection operation is applied on the plurality of phenotypic profiles received in block 1220. The feature selection operation applied on the plurality of phenotypic profiles can be the same as or similar to the operation applied above with respect to block 1240. For example, elastic net regression can be applied to reduce the number of phenotypic inputs used to generate the model. In another example, a different feature selection operation can be applied.
[0106] Operation 1200 proceeds at block 1260 where a model is fit on the plurality of methylation profiles. In some implementations, the model is fit on the plurality of profiles, but only considers the CpG sites in the subset of n CpG sites. In some implementations, the model is fit on all m CpG sites. Additionally, in some implementations, the model is fit on the plurality of phenotypic profiles, but only considers a specific phenotypic input (e.g., image data inputs). In other implementations, the model is fit on all phenotypic inputs. As indicated by block 1262, the model can include a linear regression model. As indicated by block 1264, the model can include a random forest model. As indicated by block 1266, the model can include a different type of model as well.
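As one hedged illustration of block 1262, a linear regression model could be fit on combined methylation and phenotypic inputs by ordinary least squares. The sketch below assumes one selected CpG site and one phenotypic value (resting heart rate) per individual; the data values and function names are invented for illustration only.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def fit_linear(rows, targets):
    """Ordinary least squares with an intercept, via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    Xty = [sum(x[i] * t for x, t in zip(X, targets)) for i in range(p)]
    return solve(XtX, Xty)

def predict(coefs, row):
    """Evaluate the linear expression: intercept plus weighted inputs."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))

# Each row: (methylation value at one CpG site, resting heart rate).
rows = [(0.20, 60), (0.35, 70), (0.50, 55), (0.30, 80), (0.60, 65)]
# Toy ages generated as 10 + 100 * methylation + 0.2 * heart rate.
ages = [42, 59, 71, 56, 83]

coefs = fit_linear(rows, ages)
predicted_age = predict(coefs, (0.40, 60))
```

A random forest model, as in block 1264, would replace the single linear expression with an ensemble of decision trees fit on the same rows.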
[0107] Operation 1200 proceeds at block 1270 where the model is generated. In some examples, the model is generated as one or more files that can be imported by other systems which can further train the model or use the model for predictions.
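For illustration only, a fitted linear model could be written to and read back from a file as sketched below. The JSON layout, the field names, and the file name are assumptions; this disclosure does not specify a file format.

```python
import json
import os
import tempfile

def save_model(path, model):
    """Write the model description to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(model, handle, indent=2)

def load_model(path):
    """Read a model description back from a JSON file."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)

# Hypothetical fitted model: an intercept plus one coefficient per input.
model = {
    "model_type": "linear_regression",
    "intercept": 10.0,
    "coefficients": {"site_a": 100.0, "resting_heart_rate": 0.2},
}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "epigenetic_age_model.json")
    save_model(path, model)
    restored = load_model(path)
```

Another system could import such a file and either continue training from the stored coefficients or use them directly for predictions.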
[0108]
[0109] As shown at block 1320, a plurality of phenotypic input values are also received. As shown, there are four input types with corresponding subtypes for each type. The four input types correspond to image data values, cardiovascular values, biometric values, and blood data values. The four input types shown can be determined by the method described above with respect to
[0110] At block 1330, the inputs are input into the epigenetic age prediction model. As noted above, this model could include a linear regression model (e.g., where each input forms part of a linear expression), random forest model (e.g., where each input forms part of one or more decision trees), or some other type of model. The model used may be, for example, the model described above with respect to
[0111] Operation 1400 begins at block 1410 where the methylation input values are received. As shown, there are 42 input values corresponding to the methylation values at 42 different CpG sites. These CpG sites are identified by their CpG cluster identifier number. In some implementations, only a subset of the shown CpG sites is used as inputs. In some implementations, the shown CpG sites are in order of their feature importance to the model (e.g., in the case of a random forest model) or in order of the absolute value of their coefficient (e.g., in a linear model).
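The ordering by absolute coefficient value, and the use of only a subset of sites, could be sketched as follows. The site identifiers and coefficient values are hypothetical and are not the CpG sites shown in the figures.

```python
def rank_by_abs_coefficient(coefficients):
    """Return site identifiers sorted from largest to smallest |coefficient|."""
    return sorted(coefficients, key=lambda site: abs(coefficients[site]), reverse=True)

# Hypothetical linear-model coefficients keyed by site identifier.
coefficients = {"site_a": 0.8, "site_b": -2.5, "site_c": 0.1, "site_d": 1.2}

ranked = rank_by_abs_coefficient(coefficients)
top_two = ranked[:2]  # use only a subset of the sites as inputs
```

For a random forest model, the same ranking could be applied to feature importances in place of coefficients.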
[0112] As shown at block 1420, a plurality of phenotypic input values are also received. As shown, there are four input types corresponding to image data values, cardiovascular values, biometric values, and blood data values. The four input types shown may be determined by the method described above with respect to
[0113] At block 1430, the inputs are input into the epigenetic age prediction model. As noted above, this model could include a linear regression model (e.g., where each input forms part of a linear expression), random forest model (e.g., where each input forms part of one or more decision trees), or some other type of model. The model used may be, for example, the model described above with respect to
[0114] Operation 1500 begins at block 1510 where the methylation input values are received. In one example, the methylation input values received at block 1510 are 42 input values corresponding to the methylation values at 42 different CpG sites. In another example, the methylation input values received are 186 input values corresponding to the methylation values at 186 different CpG sites.
[0115] As shown at block 1520, only one phenotypic input value type is received. Specifically, as shown there is one phenotypic input type corresponding to image data. The input can be generated by the method described above with respect to
[0116] At block 1530, the inputs are input into the epigenetic age prediction model. The model used may be, for example, the model described above with respect to
[0117]
[0118] In one implementation, the epigenetic age predictor 157 is communicably linked to the storage subsystem 1010 and the user interface input devices 1038. Epigenetic age predictor 157 can include one or more models that receive a plurality of inputs and output an epigenetic age. In some examples, epigenetic age predictor 157 also outputs a confidence score. In one implementation, input encoder 186 pre-processes and/or normalizes inputs before they are fed into a model of epigenetic age predictor 157.
[0119] User interface input devices 1038 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computer system 1000.
[0120] User interface output devices 1076 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem can also provide a non-visual display such as audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computer system 1000 to the user or to another machine or computer system.
[0121] Storage subsystem 1010 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by deep learning processors 1078.
[0122] Deep learning processors 1078 can include graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and/or coarse-grained reconfigurable architectures (CGRAs). Deep learning processors 1078 can be hosted by a deep learning cloud platform such as Google Cloud Platform™, Xilinx™, and Cirrascale™. Examples of deep learning processors 1078 include Google's Tensor Processing Unit (TPU)™, rackmount solutions like GX4 Rackmount Series™, GX36 Rackmount Series™, NVIDIA DGX-1™, Microsoft's Stratix V FPGA™, Graphcore's Intelligent Processor Unit (IPU)™, Qualcomm's Zeroth Platform™ with Snapdragon processors™, NVIDIA's Volta™, NVIDIA's DRIVE PX™, NVIDIA's JETSON TX1/TX2 MODULE™, Intel's Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM's DynamicIQ™, IBM TrueNorth™, and others.
[0123] Memory subsystem 1022 used in the storage subsystem 1010 can include a number of memories including a main random-access memory (RAM) 1032 for storage of instructions and data during program execution and a read-only memory (ROM) 1036 in which fixed instructions are stored. A file storage subsystem 1036 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations can be stored by file storage subsystem 1036 in the storage subsystem 1010, or in other machines accessible by the processor.
[0124] Bus subsystem 1055 provides a mechanism for letting the various components and subsystems of computer system 1000 communicate with each other as intended. Although bus subsystem 1055 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.
[0125] Computer system 1000 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 1000 depicted in
[0126] We describe various implementations of epigenetic predictors and the training thereof. One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections—these recitations are hereby incorporated forward by reference into each of the following implementations.
[0127] In one implementation, we disclose a method of generating an epigenetic age prediction. The method includes receiving a first sequence of inputs corresponding to methylation values at a plurality of CpG sites of an individual. The method includes receiving a second sequence of inputs corresponding to phenotypic values relative to the individual. The method includes applying a model on the first sequence of inputs and the second sequence of inputs to predict the epigenetic age.
[0128] In one implementation, the second sequence of inputs in the method corresponds to phenotypic values of a phenotypic input type.
[0129] In one implementation, the phenotypic values of the phenotypic input type include image data obtained from an image of the individual.
[0130] In one implementation, the phenotypic values of the phenotypic input type include biometric data of the individual.
[0131] In one implementation, the biometric data in the method includes a plurality of values corresponding to one or more of the following input subtypes: oxygen-consumption rate, basal metabolic rate, waist circumference, basal metabolic index, fat percentage, muscle percentage, lean muscle mass, visceral fat mass, and bone mass.
[0132] In one implementation, the phenotypic values of the phenotypic input type include cardiovascular data of the individual.
[0133] In one implementation, the cardiovascular data includes a plurality of values corresponding to one or more of the following input subtypes: heart rate variability, maximum heart rate, minimum heart rate, heart rate range, and resting heart rate.
[0134] In one implementation, the phenotypic values of the phenotypic input type in the method include blood data obtained from a blood sample.
[0135] In one implementation, the blood data includes a plurality of values corresponding to one or more of the following input subtypes: blood pressure, blood oxygen levels, lipid profile, sugar profile, and hormone levels.
[0136] In one implementation, the input subtype corresponding to a lipid profile includes values of one or more of the following phenotypic measurements: total cholesterol, low-density lipoproteins, high-density lipoproteins, and triglycerides.
[0137] In one implementation, the input subtype corresponding to a sugar profile includes values of one or more of the following phenotypic measurements: glucose and hemoglobin A1C.
[0138] In one implementation, the input subtype corresponding to hormone levels includes values of one or more of the following phenotypic measurements: cortisol, testosterone, and estrogen.
[0139] In one implementation, the second sequence of inputs corresponds to phenotypic values of a plurality of phenotypic input types.
[0140] In one implementation, the phenotypic values include values from two or more of the following phenotypic input types: image data, cardiovascular data, biometric data, and blood data.
[0141] In another implementation, we disclose a method of generating an epigenetic clock predictor. The method includes receiving a plurality of methylation profiles from a plurality of individuals, the plurality of methylation profiles comprising methylation values for m CpG sites. The method includes receiving a plurality of phenotypic profiles, the plurality of phenotypic profiles comprising phenotypic values for one or more phenotypic input types. The method includes training a model based on the plurality of methylation profiles and the plurality of phenotypic profiles, the model being configured to predict an epigenetic age based on methylation values for n CpG sites and the phenotypic values for the one or more phenotypic input types.
[0142] In one implementation, the plurality of phenotypic profiles in the method includes phenotypic values corresponding to one or more of the following input types: image data, cardiovascular data, biometric data, and blood data.
[0143] In one implementation, the plurality of phenotypic profiles in the method is received from a plurality of different individuals than the plurality of methylation profiles.
[0144] In another implementation, we disclose an epigenetic age predictor. The epigenetic age predictor includes an input component configured to receive a first sequence of inputs corresponding to methylation values at CpG sites of an individual and a second sequence of inputs corresponding to phenotypic values of the individual. The second sequence of inputs in the epigenetic age predictor includes phenotypic values corresponding to image data obtained from an image of the individual, the image data being indicative of one or more phenotypic characteristics of the individual. The epigenetic age predictor predicts an epigenetic age of the individual based on the first sequence of inputs and the second sequence of inputs.
[0145] In one implementation, the one or more phenotypic characteristics include characteristics relating to the skin of the individual.
[0146] In one implementation, the image is a facial image of the individual.
[0147] Although the present invention has been described with reference to preferred embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.