INTEGRATED METHOD AND SYSTEM FOR IDENTIFYING FUNCTIONAL PATIENT-SPECIFIC SOMATIC ABERATIONS USING MULTI-OMIC CANCER PROFILES
20180247010 ยท 2018-08-30
Inventors
- Abolfazl Razi (Flagstaff, AZ, US)
- Vinay Varadan (Westlake, OH, US)
- Nevenka Dimitrova (Pelham Manor, NY, US)
- NILANJANA BANERJEE (ARMONK, NY, US)
Cpc classification
G16B20/20
PHYSICS
G06F17/18
PHYSICS
G16B5/00
PHYSICS
International classification
Abstract
A system and method for determining the functional impact of somatic mutations and genomic aberrations on downstream cellular processes by integrating multi-omics measurements in cancer samples with community-curated biological pathways are disclosed. The method comprises the steps of extracting biological pathway information from well-curated biological pathway sources, using the pathway information to generate an upstream regulatory parent sub-network tree for each gene of interest, integrating measurement-based omic data for both cancer and normal samples to determine a nonlinear function for each gene expression level based on the gene's epigenetic information and regulatory network status, using the nonlinear function to predict gene expression levels and compare activation and consistency scores with inputted patient-specific gene expression data, and using the patient-specific gene expression predictions to identify significant deviations and inconsistencies in gene expression levels from expected levels in individual patient samples to identify potential biomarkers in providing predictive information in relation to cancer and cancer treatment.
Claims
1. A method for identifying patient-specific somatic aberrations driving dysregulated genes, comprising the steps of: determining a primary dataset of upstream regulatory parent gene information for each target gene by obtaining biological network pathway information; determining a regulatory sub-network from said primary dataset for each of said target genes, the regulatory sub-network comprising a relationship between said target gene's expression level with said target gene's genomic and epigenetic status, and said gene's upstream transcriptional regulators; determining a second dataset of measurement-based omics data comprising at least one of RNAseq expression data, copy number variation data, and DNA methylation data; integrating said primary dataset and said second dataset; generating a non-linear function for each of said target genes, said non-linear function relating said gene's expression level to measurements associated with the regulatory sub-network, from said integrated primary and second dataset; calculating expected expression levels for each of said target genes using said non-linear function for said target gene; determining a third dataset of patient-specific information relating to observed gene expression levels for said target genes and comprising a sequence of one or more parent genes in the determined regulatory sub-network of one or more of the target genes; calculating patient-specific inconsistency scores between said expected gene expression levels and said observed patient-specific expression levels for each of said target genes; calculating patient-specific activation scores for each of said target genes; evaluating the activation and inconsistency scores for all patient samples to identify the patient-specific target genes whose expression levels are significantly inconsistent with said expected expression levels; identifying statistically significant associations between inconsistencies in the target gene expression level with the somatic mutations in the upstream regulatory network of that particular target gene, comprising the step of calculating for each parent gene in the upstream regulatory network of the particular target gene for which a mutation has been identified, a functional impact score of a somatic mutation based at least in part on the calculated patient-specific inconsistency score, the genes in the upstream regulatory network of the particular target gene, and a set of genes comprising one or more mutations; determining based on the calculated functional impact scores for two or more parent gene in the upstream regulator, network of the particular target gene, a most influential parent gene, wherein the most influential parent gene comprises a mutated parent gene most likely to have impacted the expression of the target gene compared to the other parent genes in the upstream regulatory network of the particular gene; and reporting those target genes that have said significant inconsistencies as aberrant or dysregulated genes, wherein said reporting comprises an identification of the most influential target gene for one or more of the target genes that have said significant inconsistencies.
2. (canceled)
3. (canceled)
4. The method of claim 1, wherein said non-linear function is determined based on the gene's epigenetic information obtained from said measurement-based omics data and the gene's regulatory sub-network status.
5. The method of claim 4, wherein said non-linear function is determined using a global depth penalization mechanism which captures the potentially stronger impact of regulatory genes in the sub-network.
6. The method of claim 1, wherein the patient-specific information includes cancer sample data such as RNA expression data, CNV data, methylation data and somatic mutation data.
7. An integrated, unified network for identifying significant deviations and inconsistencies in gene expression levels in individual patient samples, comprising; a primary dataset of upstream regulatory parent gene information for each target gene obtained from curated biological network pathway information, said primary dataset located on a processor configured to receive said pathway information, and comprising a relationship between said target gene's expression level with said gene's genomic and epigenetic status, and said target gene's upstream transcriptional regulators; a regulatory tree for each specific target gene that captures the relationship between the gene's expression level with said target gene's genomic and epigenetic status, and its upstream transcriptional regulators, said tree determined from said primary dataset; a second dataset of measurement-based omics data comprising at least one of RNAseq expression data, copy number variation data, and DNA methylation data, said second dataset located on a processor configured to receive such data, a non-linear function for each target gene; wherein the parameters of said non-linear function are determined using a modified Bayesian inference method; a third dataset of patient-specific information relating to observed expression levels for the target genes and comprising a sequence of one or more parent genes in the determined regulatory sub-network of one or more of the target genes, said the patient-specific information including new cancer sample data; wherein, expression levels of said target genes are determined utilizing said non-linear function, and relative patient-specific inconsistency scores are determined between said predicted and said observed expression levels for the target genes in a given sample; and wherein activation and inconsistency scores are determined for all test samples whereby statistically significant associations between inconsistencies in said target gene expression level with the somatic mutations in the upstream regulatory network of that particular gene are identified by a process comprising the following steps: (i) calculating for each parent gene in the upstream regulatory network of the particular target gene for which a mutation has been identified, a functional impact score of a somatic mutation based at least in part on the calculated patient-specific inconsistency score, the genes in the upstream regulatory network of the particular target gene, and a set of genes comprising one or more mutations, an d(ii) determining, based on the calculated functional impact scores for two or more parent gene in the upstream regulatory network of the particular target gene, a most influential parent gene, wherein the most influential parent gene comprises a mutated parent gene most likely to have impacted the expression of the target gene compared to the other parent genes in the upstream regulatory network of the particular target gene.
8. (canceled)
9. (canceled)
10. The system of claim 7, wherein said non-linear function is determined based on the gene's epigenetic information obtained from said measurement-based omics data and the gene's regulatory sub-network status.
11. The system of claim 10, wherein said non-linear function determined by said modified Bayesian method incorporates a global depth penalization mechanism which captures the potentially stronger impact of regulatory genes in the sub-network.
12. The system of claim 7, wherein the patient-specific information includes cancer sample data such as RNA expression data, CNV data, methylation data and somatic mutation data.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0038] The methods according to the invention will now be described in more detail with regard to the accompanying figures. The figures showing ways of implementing the present invention and are not to be construed as being limiting to other possible embodiments falling within the scope of the attached claims.
[0039]
[0040]
[0041]
[0042]
[0043]
[0044]
[0045]
[0046]
[0047]
[0048]
[0049]
[0050]
DETAILED DESCRIPTION OF THE INVENTION
[0051] The present invention provides a system and method for integrating multi-omic biological information and various molecular measurement data sources into a unified network-based computational method for providing patient-specific gene expression predictions and identifying significant deviations and inconsistencies in gene expression levels from expected levels. The present invention is described in further detail below with reference made to
[0052] According to an embodiment of the present invention, a flowchart presenting the overall block-diagram of the method for providing patient-specific gene expression predictions, identifying significant deviations and inconsistencies in gene expression levels from expected levels and reporting patient-specific biomarkers, is set forth by the steps, or modules, outlined in
[0053] In Module 2, the second step, we determine a non-linear function for each gene in order to relate that particular gene expression level to measurements associated with the regulatory tree leaves. Thus, each tree subnetwork is used to learn a non-linear function to predict the corresponding gene expression level from its own epigenetic information (e.g., DNA Methylation and Copy Number) and its regulatory ancestor gene expression levels. The parameters of the non-linear function are estimated using a Bayesian inference method incorporating a novel depth penalization mechanism to capture the potentially stronger regulatory impact of nodes closer to the root node of the tree. This provides a bank of functions each corresponding to a specific gene in the context of specific tissue type. This function database is learned once and can be used for patient-specific analysis in the two subsequent steps performed by Modules 3 and 4.
[0054] In step 3, Module 3 calculates relative patient-specific inconsistency scores between the predicted and observed expression levels for the desired target genes in a given sample. That is, Module 3 receives information for a given patient and performs prediction of gene expression levels for all genes within the regulatory network using the function bank. This module further calculates the consistency scores for each gene by comparing the actual measurement of gene expression, or observed value, with the predicted value. In the fourth step, Module 4, evaluates the activation and inconsistency scores obtained for all test samples to discover statistically significant associations between the inconsistencies in the target gene expression level with the somatic mutations in the upstream regulatory network of that particular gene. Thus, Module 4 identifies the genes whose expression levels are significantly inconsistent with the prediction values obtained from the regulatory network. These genes are likely malfunctioning due to copy number aberrations in the gene or somatic mutations in its ancestors. Module 4 further provides statistics to evaluate the significance of ancestor gene mutations that potentially are associated with the inconsistencies in the child gene expression level.
[0055] Module 1: Incorporating Pathway NetworksRegulatory Tree Construction
[0056] Gene transcription is a complex biological process, which is regulated at different levels through multiple interacting proteins and complexes as well as the extent of DNA methylation and the harboring DNA segment copy number, as annotated in biological pathway databases. Pathway networks are widely used to present the intra-cellular interactions and gene regulatory networks in a network format. The network builds a directed graph of nodes and edges. The nodes may consist of a diverse range of entities such as genes, proteins, RNAs, miRNAs, protein complexes, signal receptors, and even abstract processes such as apoptosis, meiosis, mitosis and cell proliferation. The network edges determine the pairs of interacting nodes and specify the type of each interaction. Several publicly available pathway networks are developed to model intra cellular activities between various species and tissue types.
[0057] In this invention, we use a comprehensive network that brings together pathways from various well-curated pathway sources including NCI-PID, Biocarta and Reactome. This super pathway network consists of six node types including; proteins or the corresponding genes, RNAs, protein complexes, gene families, miRNAs, and abstracts. These nodes interact with one another in six different ways of; i) positive transcription, ii) negative transcription, iii) positive activation, iv) negative activation, v) gene family membership, and vi) being a component of a protein complex. Usually, transcription is terminated only to genes represented by the corresponding proteins, while activation is applicable to all node types.
[0058] In order to learn a function relating a gene's mRNA expression level to its epigenetic parameters (DNA methylation and copy number variation), as well as its regulatory network, we extract the regulatory network for each gene from the super-pathway network databases and represent it as a tree (
[0059] In developing a regulatory tree for each gene, we start with a specific target gene and traverse the upstream network in the opposite direction of links to collect all upstream nodes and capture the regulatory genes along with their depth, defined as the number of links to the root node, as depicted in
[0060] We first terminate traversing a branch once we reach a predefined maximum depth level, where the depth is defined as the number of links from the visiting node to the root node. We then eliminate all the branches that do not terminate to a gene node; therefore, the leaves of tree are always genes. We pass through all nodes, except abstract nodes that represent the conceptual abstract processes, in order to avoid unnecessary network complications and inclusion of irrelevant interactions. While reaching a gene node, we only pass through the links that are not of type transcription, because the part of the upstream regulatory network which terminates to a gene node via a transcription link, is already accounted for by considering the expression level of this particular gene. The only exception for this rule is the root node, where we do the exact reverse as follows:
[0061] Passing from the root node to the direct neighbors in the first ring of root neighborhood is only allowed if the connecting edge is of transcription type to limit the parents to those who impact the expression level of the gene residing in the tree root. We also keep track of the distances from the leaves to the root node that is further used in the function learning process. Finally, if we meet a node via two disjoint paths, the shortest path is considered. The pseudo-code for the Module 1 selection process is summarized below, and a sample upstream tree extracted for gene PPP3CA from the network is depicted in
TABLE-US-00001 Module 1: Building Regulatory Network for Each Gene using Modified Depth First Traverse Algorithm Inputs: Pathway network, gene id: (g), maxDepth Output: Regulatory Ancestor Gene Set, Depth Information 1 Set root node to input gene (r = g) 2 Set visit node to input gene (v = g), Set depth = 1; 3 Initialize visit_list, depth_list, connType_list and ancestor_list to empty sets. 4 Visit node (v, depth) a. If node v is in the visit_list and depth < depth_list(v) i. Update depth_list(v)=depth ii. return success b. If node v is in the visit_list and depth >= depth_list(v) i. Avoid this node ii. return fail (already visited via a shorter path) c. If node v is not in the visit_list i. Add v to visit_list ii. set depth_list(v) = depth iii. return success 5 If Visit node return fails [node is already visited], exit 6 If node v is a gene a. Add v to the ancestor list (g, depth, ancestor list) i. If node v is in the ancestor_list and depth < depth_list(v) 1. Update depth_list(v)=depth ii. If node v is in the ancestor_list and depth >= depth_list(v) 1. Avoid this node (already visited via shorter path) iii. If node v is not in the ancestor_list 1. Add v to ancestor_list 2. set depth_list(v) = depth 7 If depth < maxDepth [pass through this node to the next level] a. depth = depth+1; b. If node v is the root node (v==r) i. Remove the direct parents not connected via edges of type transcription c. If node v is of type gene and is not the root node (v<>r) i. Remove the direct parents connected via edges of type transcription d. for all nodes u in the parent list of the node v i. Goto step 4, call Visit node(u) e. depth = depth 1; [returns to the previous level] 8 End
[0062]
[0063] As an example, in
[0064] Module 2: Learning a Nonlinear Function for Each Gene
[0065] A second step of the inventive method is to learn a function relating the expression level of the gene residing at root node to its regulatory network and its own epigenetic information (e.g., DNA methylation and CNV). Learning a function means quantifying the influence of a regulatory gene's expression level on the target gene's expression. Also, the within method trains a model for a target gene that assigns different coefficients for parent genes based on their pairwise influence as observed in training data (as described in the Bayesian model estimation below, specifically the methods to estimate .sub.g). Because multiple DNA methylation probes may overlap with a gene's coding or regulatory regions, this invention leverages methylation measurements by including several representative statistics such as minimum, maximum, and weighted mean value, where in calculating the weighted mean we exclude the regions with less than 10 probes for more accuracy. Thus, if gene g, overlaps with n.sub.g.sup.m regions, each having probe number p.sub.1, p.sub.2, . . . , p.sub.n.sub.
where I(.) is the identity function.
[0066] To include copy number variations, this invention uses the segment mean value provided for the region that harbors the particular gene. Most genes fall into a single CNV segment. Otherwise, if a gene falls in the border of two segments, we simply take the mean value of both segment measurements.
[0067] In order to learn a function for each gene, Module 2 uses mRNA expression of its ancestors, somatic copy-number alteration and DNA methylation measurements for n.sub.g samples to form the following classical regression model:
y.sub.g=.sub.g1.sub.n.sub.(0,.sub.g.sup.2I.sub.n.sub.
where y.sub.g is a n1 vector of expression levels for gene g across all n.sub.g samples. X.sub.g=[X.sub.g.sup.(Self),X.sub.g.sup.(Parent)] is a np data matrix composed of two parts including X.sub.g.sup.(Self) (self-methylation and CNV data) and X.sub.g.sup.(Parent) (the expression levels of the ancestor genes), where;
The term 1.sub.n.sub.
[0068] The objective here is to find the optimal model parameters .sub.i, i=1, 2, . . . , p that provides the best prediction power via minimizing the Mean Squared Error (MSE). One may use normal samples in learning phase to avoid model crash due to severely perturbed interactions in the highly contaminated/disordered cancer cells. However, this may lead to a poor predictive power when the number of predictors is large or comparable with respect to the number of samples (n<O(p)). In most studies, the number of cancer samples profiled tend to be significantly higher than the number of normal samples. For instance, in the case of TCGA data for breast cancer, the number of cancer samples exceed the normal samples by a factor of 10. Consequently, excluding all cancer samples is highly inefficient. On the other hand, including cancer samples in the training set, may deteriorate the model performance for specific genes that significantly deviate from the true underlying biological function in some samples due to genomic events as stated above. Therefore, we include all the normal samples and part of the cancer samples that have not impacted by somatic mutations in this particular gene and its ancestors in order to learn the predictive function. This approach leads to a different training set size for each gene, but provides a considerable improvement in model prediction power.
[0069] The Least Squared Error (LSE) solution minimizes the squared error for the training set, when no prior information is available about the model parameters .sub.i.
The LSE solution is not optimal when there is prior information about the model parameters. Here, there is prior knowledge about the model that can be used to enhance the model accuracy. First, it is likely that not all of the ancestor genes may have a substantial impact of a given gene's expression levels. Therefore, a significant number of the model parameters .sub.i could be shrunk towards zero. Therefore, imposing sparsity enhances the model generalization property by avoiding noise over-fitting. Although part of sparsity is already accounted for by using the pathway network and including only ancestor genes instead of using all genes as the input data; but a still higher level of sparsity is expected, when the number of ancestor genes grow higher (in order of tens and hundreds).
[0070] One of the common optimization-based solutions to impose sparsity is regularizing the norm of model parameters. The penalization can be applied to the L.sub.p, p0 norm of coefficient vector =[.sub.1, .sub.2, . . . , .sub.p].sup.T, which is called bridge regression. Important special cases of this approach are Lasso, Ridge, and subset selection for L, L.sub.2, L.sub.0 norm penalization, respectively. In elastic net, the penalty term is the linear combination of L.sub.1 and L.sub.2 penalty;
where .sub.1 and .sub.2 are shrinkage parameters to impose sparsity and generalizability. Efficient algorithms based on convex optimizations, Basis pursuit, LARS, coordinate descent, Dantzig Selector, Orthogonal matching pursuit, and approximate message passing may be used to solve this problem. However, the most limiting drawback of these methods is that it only provides point estimates for the regression coefficients.
[0071] In contrast, this invention employs a Bayesian framework that provides more detailed information about the model parameters through posterior distribution to be used in the subsequent consistency check analysis. It also allows the incorporation of other prior knowledge, in addition to sparsity, as explained below.
[0072] Historically, in analyzing gene expression studies potentially non-linear relations among the biological measurements have been ignored. In order to capture such nonlinear relationships, Module 2 of this invention uses a centered sigmoid function
to capture the sensitivity around the mean value and the soft thresholding function .sub.2(x; c)=sign(x)({square root over (x.sup.2+c.sup.2)}c), to account for the cases in which only extremely high or low values contribute to the model. .sub.2 (x; c) can be considered as a softer version of the commonly used peace-wise linear soft-thresholding function, (x; c)=sign(x)(|x|c).sub.+. These functions are depicted in
[0073] Another important biological consideration in developing the ancestor-set for each gene through upwards traversing the pathway network is the variation in the distance of leaf nodes to the root node. One may expect the closer ancestors contribute more to the descendant downstream gene expression level than farther nodes that are connected via a long chain of intermediate nodes. Hence, the closer nodes tend to pose higher coefficient in the regression model. Module 2 leverages this fact into the method through depth penalization mechanism in the Bayesian framework, as captured by Id in the Bayesian model described below.
[0074] Here, this invention uses the Bayesian framework to predict the gene expression level via a nonlinear transformation/projection of its self-epigenetic data as well as the expression levels of the its regulatory ancestor genes. The Bayesian framework provides the desired statistics (e.g., median, mean, moments and . . . ) via full posterior distributions of the model parameters. Moreover, we incorporate a prior knowledge about the model parameters using hierarchical Bayesian models. The resulting posterior distributions provide significant insights into the functional effects of aberrations in the pathway.
[0075] The invention uses the idea of global and local shrinkages with penalization based on the distance of the ancestor gene (i.e., the number of links from leaves to root in the regulatory network) from the gene whose expression is being predicted. The following model is constructed, where the subscript g is omitted for notation convenience:
The above formulation extends the normal gamma prior construction in order to incorporate the link depth information to the gamma prior construction. This information is leveraged via coefficients k included in the variance of the model parameters. Thus, the variance of .sub.i is chosen to be inversely proportional to the link depth of the corresponding ancestor via setting var(.sub.i)=.sup.2.sub.i.sup.2/k.sub.i.sup.2, where .sup.2 controls the global shrinkage, .sub.i.sup.2 accounts for the local shrinkage and Id enforces the link depth impact. To provide more flexibility, we Use of a Gamma prior distribution for k.sub.i.sup.2 provides more flexibility. Using gamma prior has the advantage of yielding closed-form posterior distribution for k.sub.i and hence facilitates employing computationally efficient Gibbs sampler. Therefore, we use k.sub.i.sup.2Gamma(a.sub.k, b.sub.k) such that the mean of variance is inversely proportional to depth parameter, i.e.,
The constant c is a normalizing term to ensure
which is obtained by setting
Therefore, we only have one free hyperparameter a.sub.k.sub.
[0076] The above hierarchical model yields the following full joint distribution:
which immediately provides the following posterior distributions using the fact that the full conditional posterior distribution for each parameter is simply the product of the terms including that variable with other terms serving as a normalization constant to ensure the resulting probability integrates to one. This method is called completion of terms:
The Woodbury Matrix inversion formula is used to calculate A.sup.1 when n<p to obtain more stable results and save in computations by converting a pp square matrix inversion to a nn one. We apply a Gibbs sampler with burn-in iterations 1000 and computation iterations 5000 to obtain the approximate posterior distributions for the model parameters .sub.i, . This process is repeated for all genes gG using all samples sS, where G and S are the set of gene ids and sample ids, respectively.
Module 3: Predict Gene Level Expression for a New Sample and Report Activation and Consistency Level for all Genes
[0077] In order to evaluate the disruption of a target gene g for any given sample, we obtain activation score A.sub.g.sup.(new) and inconsistency score C.sub.g.sup.(new), where the first shows the level of gene expression, which may be consistent with its regulatory network, and the second shows the deviation from the expected value pointing to deregulation of the gene (potentially associated with somatic mutations).
[0078] Performing Module 2 using training samples from both normal and cancer cohorts provides results in the form of a function bank, where each function corresponds to a specific gene. This function bank is then used in Module 3 to analyze test samples to identify potential inconsistencies. Thus, this module performs gene expression level prediction for all genes. For each gene, we extract the expression levels of the ancestor genes as well as the self-epigenetic information for all samples. Then, we predict expression level of this specific gene for all samples using the corresponding function learned for this gene. The prediction process provides the conditional posterior distribution for the expression level of this gene. We use the maximum a-posteriori (MAP) method to obtain the expected gene expression levels.
[0079] In order to calculate consistency scores for un-isolated target genes for which a function is learned, we note that the predictive distribution for the RNA expression of any gene for each new test sample y.sup.new is obtained by marginalizing out the model parameters from the conditional posterior distribution for given input x.sup.new (self-epigenetic info and ancestors expression levels):
(y.sup.new|x.sup.new)=(y.sup.new|x.sup.new,,.sup.2)(,.sup.2|y,X)dd.sup.2
[0080] While the first term, which is the conditional distribution, is available in-closed form, the second term which is the posterior distribution of model parameters is not. This distribution can be approximated with the following expression, where the realizations of model parameters (.sup.(i), .sup.2.sup.
The above distribution is a Gaussian mixture model (GMM) with large number of equi-probable components of mean ((x.sup.new).sup.T.sup.(i)) and variance (.sup.2.sup.
(y.sup.new|x.sup.new)=N(y.sup.new;.sub.y.sub.
.sub.y.sub.
where ..sub.2 is the matrix induced norm. Based on this distribution, we calculate the z-score or the equivalent likelihood for the observed value as follows:
[0081] Moreover, due to the complexity of the underlying biological process for each gene and different level of inherit randomness, natural regularity and impact of unknown factors, the predictive power of the learned functions maybe significantly different for each gene. Accordingly, we consider the average empirical predictability of each gene for normal samples as a ground level for the consistency check. Hence, only cancer samples having consistency levels far below the average inconsistency of normal samples are reported as inconsistent samples. The following normalized likelihood is used:
where n.sub.0 and n.sub.1 are the number of normal and cancer samples and a is a tuning parameter between 0 and 1 in order to push different emphasis on normal and cancer cohorts. Lower values for are chosen in order to emphasis more on the normal cancers and compensate for the lower number of normal samples. In this invention, we arbitrarily set = 1/10, which is almost equal to the ratio of normal samples to cancer samples in the training set for TCGA Breast Cancer dataset. The inequality becomes equality if the variances of the predictive distribution are equal for all samples. The above process is repeated for all genes in parallel.
[0082] In addition to consistency score, the activation score of each gene is obtained using the gene expression level distribution modeled as a normal distribution;
where and is the mean and standard deviation of the normal distribution learned for each gene expression level after iteratively excluding the outliers. The postscript g is omitted for notation convenience. A similar normalization is used for activation scores.
[0083] As discussed above, the application of this module is to use the trained model on top of the regulatory network to predict a desired target gene expression level for a given sample based on the target gene epigenetics as well as expression levels of the genes playing transcription regulations roles in the utilized regulatory tree. In
[0084] To further illustrate the association between the gene expression level inconsistencies with somatic mutations as established in this module,
[0085] In
Module 4: Association Between Somatic Mutations and Inconsistencies
[0086] Gene expression levels may deviate from predicted values by the presence of somatic mutations in the regulatory network, resulting in loss/gain of regulatory functions. That is, a mutation in any of regulator genes may impact its proper role in regulating gene expression and impose a deviation on the target gene expression. Module 4 of the within method provides a methodology which assesses the impact of somatic mutations in regulatory genes on the inconsistency scores for downstream target genes. Accordingly, this module takes the activation and consistency scores provided by Module 3 and, for each new test sample, identifies the genes that are significantly inconsistent and examines if they are potentially driven by CNV aberrations or somatic mutations in the current gene or in its regulatory subnetwork.
[0087] To begin, the inconsistencies driven by CNV aberration events are identified. If the inconsistency is due to overexpression of the gene and the gene experiences copy number amplifications (CNV>0.5), then CNV amplification is reported as the main cause of the inconsistency. Likewise, if copy number deletion (CNV<0.5) is associated with the down expression of the gene, CNV deletion is considered to be the inconsistency driver.
[0088] For the genes which do not experience relevant copy number aberrations, the inconsistencies are likely to arise from mutations in the upstream regulatory network of the gene that impacts the transcription of the downstream gene. The closer the regulatory gene is to the downstream target gene, the more influence is expected on the downstream gene expression level inconsistency. Therefore, Module 4 assigns a global depth penalization parameter 0<1, such that the impact of mutated gene i with d.sub.19 hops to the root node g is scaled with value ().sup.d.sup.
[0089] To quantify the impact of mutations in the regulatory tree, we count all non-silent mutations affecting either the target gene or its regulators for each of the cancer samples scaled by their absolute inconsistency levels and the depth penalization factor. In general, the functional impact of gene h mutations on the expression of gene g, denoted by .sub.g(h) is calculated as:
where P.sub.g is the set of regulatory ancestor genes of gene g (i.e., the leaves of the corresponding regulatory tree), M(i) is the set of genes that are mutated in sample j, Z.sub.c(g).sup.(j) is the inconsistency score of gene g at sample j and 1(.) is the indicator function. The role of the denominator is to normalize .sub.hP.sub.
[0090] The flowchart in
[0091]
EXAMPLES
[0092] In to illustrate the prediction power of the method of this invention, its performance is compared to several near-optimal state-of-the art point-estimators including LASSO, RIDGE and Elastic-Net Regressions.
[0093] In order to demonstrate the accuracy of the inventive method, we first learn a Gaussian distribution for each gene expression level via maximum likelihood method after iteratively excluding significant outliers. We begin by learning a Gaussian distribution for the samples at each iteration and then we remove the samples which are not in the second standard deviation neighborhood of the mean value. In the subsequent iteration we repeat the process for the remaining samples until the algorithm converges and no more outlier exists. The empirical distribution for a sample gene PTEN and the learned Normal distribution is presented in
[0094] Next, we divide gene expression levels into three states (neutral, over-expressed and under-expressed) based on predefined thresholds. Thresholds are arbitrarily set such that the probability of down-expression, neutral and over-expression states become 10%, 80% and 10%, respectively. Module 3 provides patient-specific gene expression predictions for all 839 un-isolated genes. The state change rate is calculated via averaging state change events over all genes and patients. The results are calculated for each cohort separately. If the observed and predicted expression state for sample i and gene g are s.sub.g.sup.(i) and .sub.g.sup.(i), respectively, the state change rate is calculated as:
[0095] In Table 1, the prediction error is calculated for some important genes that are highly associated with cancer and have a valid set of upstream regulatory genes in the global pathway network. It is seen that the within method outperforms the state of the art sparsity-imposing regression models with the additional advantage of providing full posterior distribution for the gene expression level.
TABLE-US-00002 TABLE 1 Gene state prediction error rate for the within method in comparison with the benchmark optimization-based sparse regression models. The model training and test is the same for all methods. The prediction accuracy for normal and cancer samples are presented separately. Number Test Results on Normal Samples Test Results on Cancer Samples of Elastic Elastic Gene Parents Lasso Ridge Net InFloMut Lasso Ridge Net InFloMut CBFB 57 0.1532 0.1622 0.1441 0.1622 0.1854 0.1909 0.1937 0.1881 CCND1 836 0.2703 0.2883 0.2703 0.2432 0.2919 0.2947 0.2873 0.2614 CDH1 159 0.1802 0.1261 0.1712 0.1622 0.2484 0.2391 0.2456 0.2428 CDKN1B 456 0.2162 0.1892 0.1982 0.1982 0.2994 0.2799 0.2780 0.2530 CTCF 417 0.1261 0.0901 0.1261 0.1171 0.1409 0.1353 0.1474 0.1325 ERBB2 99 0.2252 0.2973 0.0811 0.2523 0.3883 0.4013 0.3818 0.3272 FOXA1 206 0.1712 0.1441 0.2072 0.1261 0.1196 0.1242 0.1242 0.1177 GATA3 223 0.3423 0.2703 0.3333 0.3423 0.1613 0.1529 0.1603 0.1585 MYB 219 0.0901 0.0811 0.0901 0.0901 0.1891 0.1752 0.1900 0.1891 PTEN 226 0.0541 0.0631 0.0721 0.0631 0.1631 0.1585 0.1696 0.1603 RB1 342 0.0721 0.0811 0.0811 0.0901 0.1835 0.1854 0.1724 0.1696 RUNX1 57 0.2162 0.2613 0.2432 0.1982 0.1965 0.1946 0.1993 0.1891 TP53 325 0.2162 0.1441 0.2162 0.1171 0.3309 0.3216 0.3309 0.2956 0.1795 0.1691 0.1719 0.1663 0.2229 0.2195 0.2216 0.2065
[0096] Another important observation is that despite the fact of higher contribution of cancer samples to the model training due to the larger number of cancer samples with respect to normal samples, the normal cohort presents a better predictability. This observation holds for all models and reveals that the functional states of the gene expressions in normal tissues are more consistent with the upstream regulatory network.
[0097] The fact of higher consistency between the predicted and observed values of target gene expression levels in normal samples compared to cancer samples is also observed in
[0098] This figure demonstrates that the nonlinear CNV terms with coefficients obtained from the learning process well define the RNA expression level for ERBB2 with some minor variability due to other terms such as DNA methylation and ancestor gene expression levels. In fact, the coefficients of DNA methylation and majority of ancestors are explicitly removed from the predictor list by LASSO and Elastic Net Methods and also notable is that the within invention assigns negligible coefficients for DNA methylation.
TABLE-US-00003 TABLE 2 Model parameters for two gene: ERBB2. ELAS- Model Parameters for TIC InFlo-Mut [Std (ERBB2) LASSO RIDGE NET Deviation] Methylation.sub.min 0.0000 0.0057 0.0000 0.0015: [0.0120] Methylation.sub.max 0.0000 0.0185 0.0000 0.0027: [0.0134] Methylation.sub.mean 0.0000 0.0215 0.0000 0.0152: [0.0355] f.sub.1 (Methylation.sub.mean) 0.0000 0.0530 0.0054 .sup.0.0220: [0.0247] f.sub.2 (Methylation.sub.mean) 0.0504 0.0734 0.0626 0.0588: [0.0319] CNV.sub.mean 0.0000 0.3685 0.0181 .sup.0.0453: [0.0638] f.sub.1 (CNV.sub.mean) 0.0986 0.1579 0.1268 0.1457: [0.0376] f.sub.2 (CNV.sub.mean) 0.9958 0.6272 0.9951 .sup.0.9858: [0.0524] {square root over (|.sub.i|.sup.2)} for ancestors 0.1232 0.2149 0.1528 0.1663
[0099] On the other hand, the RNA expression level for GATA3 is more influenced by DNA methylations as well as upstream regulatory network. The expected negative sign for DNA methylation coefficients are suggestive of an inverse relationship between the gene expression level and DNA methylation for both genes. Finally, for GATA3, the upstream regulatory network plays a crucial role in regulating the expression of this gene, suggesting that most of the variation of this gene's expression in breast cancer arises primarily due to the activity of transcription factors. The regression coefficients estimated by the method for two genes ERBB2 and GATA3 provided in Table 2 and 3 reveal that the regression coefficients can be significantly different for genes due to high heterogeneity of the gene regulation functionalities.
TABLE-US-00004 TABLE 3 Regression coefficients for gene GATA3. Model Parameters ELASTIC for (GATA3) LASSO RIDGE NET InFlo-Mut Methylation.sub.min 0.0000 0.0120 0.0000 0.0201 [0.0183] Methylation.sub.max 0.0842 0.0510 0.0838 0.0864 [0.0201].sup. Methylation.sub.mean 0.0000 0.0450 0.0142 0.0327 [0.0422].sup. f.sub.1 (Methylation.sub.mean) 0.1888 0.0544 0.1667 0.1221 [0.0426].sup. f.sub.2 (Methylation.sub.mean) 0.0000 0.0342 0.0000 0.0099 [0.0273].sup. CNV.sub.mean 0.0000 0.0009 0.0000 0.0068 [0.0270] f.sub.1 (CNV.sub.mean) 0.0000 0.0015 0.0000 0.0199 [0.0247] f.sub.2 (CNV.sub.mean) 0.0000 0.0033 0.0000 0.0003 [0.0232] {square root over (|.sub.i|.sup.2)} of ancestors 0.3157 0.3329 0.3085 0.4903
[0100] An important source of inconsistency is due to mutations in the target gene's upstream regulatory network. Noting the fact that the impact of expression level of regulatory genes is already captured by the method if the predicted value of target gene expression level is not consistent with the observed value then we infer that the regulatory network does not play its regulatory role properly. This malfunction of the regulatory network most likely arises from somatic mutations in the regulatory network that prevents its genes or product proteins from performing their functions (complex formation, gene transcription, protein activation and . . . ) properly which in turn impacts the downstream target gene expression level.
As an illustrative example, the functional impact of somatic mutations on the dysregulation of gene PTEN is depicted in