INTEGRATED METHOD AND SYSTEM FOR IDENTIFYING FUNCTIONAL PATIENT-SPECIFIC SOMATIC ABERATIONS USING MULTI-OMIC CANCER PROFILES

Abstract

A system and method for determining the functional impact of somatic mutations and genomic aberrations on downstream cellular processes by integrating multi-omics measurements in cancer samples with community-curated biological pathways are disclosed. The method comprises the steps of extracting biological pathway information from well-curated biological pathway sources, using the pathway information to generate an upstream regulatory parent sub-network tree for each gene of interest, integrating measurement-based omic data for both cancer and normal samples to determine a nonlinear function for each gene expression level based on the gene's epigenetic information and regulatory network status, using the nonlinear function to predict gene expression levels and compare activation and consistency scores with inputted patient-specific gene expression data, and using the patient-specific gene expression predictions to identify significant deviations and inconsistencies in gene expression levels from expected levels in individual patient samples to identify potential biomarkers in providing predictive information in relation to cancer and cancer treatment.

Claims

1. A method for identifying patient-specific somatic aberrations driving dysregulated genes, comprising the steps of: determining a primary dataset of upstream regulatory parent gene information for each target gene by obtaining biological network pathway information; determining a regulatory sub-network from said primary dataset for each of said target genes, the regulatory sub-network comprising a relationship between said target gene's expression level with said target gene's genomic and epigenetic status, and said gene's upstream transcriptional regulators; determining a second dataset of measurement-based omics data comprising at least one of RNAseq expression data, copy number variation data, and DNA methylation data; integrating said primary dataset and said second dataset; generating a non-linear function for each of said target genes, said non-linear function relating said gene's expression level to measurements associated with the regulatory sub-network, from said integrated primary and second dataset; calculating expected expression levels for each of said target genes using said non-linear function for said target gene; determining a third dataset of patient-specific information relating to observed gene expression levels for said target genes and comprising a sequence of one or more parent genes in the determined regulatory sub-network of one or more of the target genes; calculating patient-specific inconsistency scores between said expected gene expression levels and said observed patient-specific expression levels for each of said target genes; calculating patient-specific activation scores for each of said target genes; evaluating the activation and inconsistency scores for all patient samples to identify the patient-specific target genes whose expression levels are significantly inconsistent with said expected expression levels; identifying statistically significant associations between inconsistencies in the target gene expression level with the somatic mutations in the upstream regulatory network of that particular target gene, comprising the step of calculating for each parent gene in the upstream regulatory network of the particular target gene for which a mutation has been identified, a functional impact score of a somatic mutation based at least in part on the calculated patient-specific inconsistency score, the genes in the upstream regulatory network of the particular target gene, and a set of genes comprising one or more mutations; determining based on the calculated functional impact scores for two or more parent gene in the upstream regulator, network of the particular target gene, a most influential parent gene, wherein the most influential parent gene comprises a mutated parent gene most likely to have impacted the expression of the target gene compared to the other parent genes in the upstream regulatory network of the particular gene; and reporting those target genes that have said significant inconsistencies as aberrant or dysregulated genes, wherein said reporting comprises an identification of the most influential target gene for one or more of the target genes that have said significant inconsistencies.

2. (canceled)

3. (canceled)

4. The method of claim 1, wherein said non-linear function is determined based on the gene's epigenetic information obtained from said measurement-based omics data and the gene's regulatory sub-network status.

5. The method of claim 4, wherein said non-linear function is determined using a global depth penalization mechanism which captures the potentially stronger impact of regulatory genes in the sub-network.

6. The method of claim 1, wherein the patient-specific information includes cancer sample data such as RNA expression data, CNV data, methylation data and somatic mutation data.

7. An integrated, unified network for identifying significant deviations and inconsistencies in gene expression levels in individual patient samples, comprising; a primary dataset of upstream regulatory parent gene information for each target gene obtained from curated biological network pathway information, said primary dataset located on a processor configured to receive said pathway information, and comprising a relationship between said target gene's expression level with said gene's genomic and epigenetic status, and said target gene's upstream transcriptional regulators; a regulatory tree for each specific target gene that captures the relationship between the gene's expression level with said target gene's genomic and epigenetic status, and its upstream transcriptional regulators, said tree determined from said primary dataset; a second dataset of measurement-based omics data comprising at least one of RNAseq expression data, copy number variation data, and DNA methylation data, said second dataset located on a processor configured to receive such data, a non-linear function for each target gene; wherein the parameters of said non-linear function are determined using a modified Bayesian inference method; a third dataset of patient-specific information relating to observed expression levels for the target genes and comprising a sequence of one or more parent genes in the determined regulatory sub-network of one or more of the target genes, said the patient-specific information including new cancer sample data; wherein, expression levels of said target genes are determined utilizing said non-linear function, and relative patient-specific inconsistency scores are determined between said predicted and said observed expression levels for the target genes in a given sample; and wherein activation and inconsistency scores are determined for all test samples whereby statistically significant associations between inconsistencies in said target gene expression level with the somatic mutations in the upstream regulatory network of that particular gene are identified by a process comprising the following steps: (i) calculating for each parent gene in the upstream regulatory network of the particular target gene for which a mutation has been identified, a functional impact score of a somatic mutation based at least in part on the calculated patient-specific inconsistency score, the genes in the upstream regulatory network of the particular target gene, and a set of genes comprising one or more mutations, an d(ii) determining, based on the calculated functional impact scores for two or more parent gene in the upstream regulatory network of the particular target gene, a most influential parent gene, wherein the most influential parent gene comprises a mutated parent gene most likely to have impacted the expression of the target gene compared to the other parent genes in the upstream regulatory network of the particular target gene.

8. (canceled)

9. (canceled)

10. The system of claim 7, wherein said non-linear function is determined based on the gene's epigenetic information obtained from said measurement-based omics data and the gene's regulatory sub-network status.

11. The system of claim 10, wherein said non-linear function determined by said modified Bayesian method incorporates a global depth penalization mechanism which captures the potentially stronger impact of regulatory genes in the sub-network.

12. The system of claim 7, wherein the patient-specific information includes cancer sample data such as RNA expression data, CNV data, methylation data and somatic mutation data.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0038] The methods according to the invention will now be described in more detail with regard to the accompanying figures. The figures showing ways of implementing the present invention and are not to be construed as being limiting to other possible embodiments falling within the scope of the attached claims.

[0039] FIG. 1 is an overview of the within method illustrating a pathway of steps that integrates gene regulatory and/or signaling pathway networks with measurement-based omics data to provide patient-specific gene expression predictions. The steps of this aspect of the invention are: i) extracting regulatory trees for each un-isolated target gene, ii) learning a non-linear function for each target gene using a training dataset, iii) predicting gene expression values for target genes of interest and calculating activation and consistency scores and iv) functional mutation impact analysis;

[0040] FIG. 2 illustrates a regulatory tree generated using the regulatory interactions derived from pathway databases for a sample gene PPP3CA;

[0041] FIG. 3 is a histogram of ancestors counts for genes, showing the distribution of the number of ancestors up to level 2 for all genes in the pathway networks and illustrating that most genes have somewhere between 10 and 50 upstream regulators;

[0042] FIG. 4 is a graph of a nonlinear function including centered sigmoid and soft thresholding to capture two potential nonlinear effects: i) near mean-sensitivity and ii) near-mean ignorance; the x-axis denotes measured copy-number or DNA methylation levels; the y-axis denotes the extent of influence of the measurement on gene expression. In the case of near mean-sensitivity, small changes in measured DNA Methylation near the mean result in large deviations in gene expression. However, in near-mean ignorance, small changes in copy-number near the mean do not result in major changes in gene expression;

[0043] FIG. 5. illustrates JUN gene expression level prediction versus observation for CRC normal and tumor samples. Cancer samples (*) show widespread inconsistency as compared to normal samples (*). The method prediction is provided in terms of posterior mean (o) and confidence interval up to 3 standard deviations presented by error bars T;

[0044] FIG. 6 illustrates inconsistency scores for all genes for BRC and CRC tumor samples;

[0045] FIG. 7 is a flowchart summarizing a method of this invention for identifying patient-specific malfunctioning genes based on significant inconsistencies between network-based predictions and patient-specific measurement;

[0046] FIG. 8 is a graphical representation of the results of a method of the invention illustrating the impact of somatic mutations on target gene expression in colon cancer samples;

[0047] FIG. 9 is a histogram of the RNA expression for Gene PTEN;

[0048] FIG. 10 illustrates predictions versus observations for sample genes MYB, GATA3, PTEN and ERBB2;

[0049] FIG. 11 illustrates RNA expression level versus copy number variation CNV for gene ERBB2; and

[0050] FIG. 12 illustrates the impact of somatic mutations in the upstream regulatory subnetwork of PTEN on its gene expression inconsistency.

DETAILED DESCRIPTION OF THE INVENTION

[0051] The present invention provides a system and method for integrating multi-omic biological information and various molecular measurement data sources into a unified network-based computational method for providing patient-specific gene expression predictions and identifying significant deviations and inconsistencies in gene expression levels from expected levels. The present invention is described in further detail below with reference made to FIGS. 1-12.

[0052] According to an embodiment of the present invention, a flowchart presenting the overall block-diagram of the method for providing patient-specific gene expression predictions, identifying significant deviations and inconsistencies in gene expression levels from expected levels and reporting patient-specific biomarkers, is set forth by the steps, or modules, outlined in FIG. 1. As illustrated in FIG. 1, the method consists of four main sequential steps or modules to identify and report potential somatic aberrations driving dysregulated genes. In the first step, Module 1, a regulatory tree is extracted for each gene of interest from the pathway network that captures the relationship between the gene's expression level with its own genomic and epigenetic status, as well as its upstream transcriptional regulators. The gene of interest resides in the tree root node and the tree represents a network of upstream regulators of the gene's transcription. The leaves of the tree represent all of the genes that potentially regulate the gene's transcription, either directly or indirectly, through intermediate signaling partners. We use the terms ancestor genes or simply ancestors to refer to these genes.

[0053] In Module 2, the second step, we determine a non-linear function for each gene in order to relate that particular gene expression level to measurements associated with the regulatory tree leaves. Thus, each tree subnetwork is used to learn a non-linear function to predict the corresponding gene expression level from its own epigenetic information (e.g., DNA Methylation and Copy Number) and its regulatory ancestor gene expression levels. The parameters of the non-linear function are estimated using a Bayesian inference method incorporating a novel depth penalization mechanism to capture the potentially stronger regulatory impact of nodes closer to the root node of the tree. This provides a bank of functions each corresponding to a specific gene in the context of specific tissue type. This function database is learned once and can be used for patient-specific analysis in the two subsequent steps performed by Modules 3 and 4.

[0054] In step 3, Module 3 calculates relative patient-specific inconsistency scores between the predicted and observed expression levels for the desired target genes in a given sample. That is, Module 3 receives information for a given patient and performs prediction of gene expression levels for all genes within the regulatory network using the function bank. This module further calculates the consistency scores for each gene by comparing the actual measurement of gene expression, or observed value, with the predicted value. In the fourth step, Module 4, evaluates the activation and inconsistency scores obtained for all test samples to discover statistically significant associations between the inconsistencies in the target gene expression level with the somatic mutations in the upstream regulatory network of that particular gene. Thus, Module 4 identifies the genes whose expression levels are significantly inconsistent with the prediction values obtained from the regulatory network. These genes are likely malfunctioning due to copy number aberrations in the gene or somatic mutations in its ancestors. Module 4 further provides statistics to evaluate the significance of ancestor gene mutations that potentially are associated with the inconsistencies in the child gene expression level.

[0055] Module 1: Incorporating Pathway NetworksRegulatory Tree Construction

[0056] Gene transcription is a complex biological process, which is regulated at different levels through multiple interacting proteins and complexes as well as the extent of DNA methylation and the harboring DNA segment copy number, as annotated in biological pathway databases. Pathway networks are widely used to present the intra-cellular interactions and gene regulatory networks in a network format. The network builds a directed graph of nodes and edges. The nodes may consist of a diverse range of entities such as genes, proteins, RNAs, miRNAs, protein complexes, signal receptors, and even abstract processes such as apoptosis, meiosis, mitosis and cell proliferation. The network edges determine the pairs of interacting nodes and specify the type of each interaction. Several publicly available pathway networks are developed to model intra cellular activities between various species and tissue types.

[0057] In this invention, we use a comprehensive network that brings together pathways from various well-curated pathway sources including NCI-PID, Biocarta and Reactome. This super pathway network consists of six node types including; proteins or the corresponding genes, RNAs, protein complexes, gene families, miRNAs, and abstracts. These nodes interact with one another in six different ways of; i) positive transcription, ii) negative transcription, iii) positive activation, iv) negative activation, v) gene family membership, and vi) being a component of a protein complex. Usually, transcription is terminated only to genes represented by the corresponding proteins, while activation is applicable to all node types.

[0058] In order to learn a function relating a gene's mRNA expression level to its epigenetic parameters (DNA methylation and copy number variation), as well as its regulatory network, we extract the regulatory network for each gene from the super-pathway network databases and represent it as a tree (FIG. 2). Subsequently, we extract a list of regulatory ancestor genes, referred to as regulators or regulatory genes, which collectively capture the impact of all nodes forming the regulatory tree. Some of the regulators are direct parents of the target gene and hence regulate its transcription directly, while the other regulators impact the target gene expression indirectly through protein complexes and post-translational modifications of direct regulators.

[0059] In developing a regulatory tree for each gene, we start with a specific target gene and traverse the upstream network in the opposite direction of links to collect all upstream nodes and capture the regulatory genes along with their depth, defined as the number of links to the root node, as depicted in FIG. 2, using a depth-first traversal algorithm, such as the well-known depth-first search (see pseudocode below), with some modifications based on the biology of gene transcription regulation and the fact that we are interested in predicting target gene expression using the expression of other genes that participate in the regulatory network.

[0060] We first terminate traversing a branch once we reach a predefined maximum depth level, where the depth is defined as the number of links from the visiting node to the root node. We then eliminate all the branches that do not terminate to a gene node; therefore, the leaves of tree are always genes. We pass through all nodes, except abstract nodes that represent the conceptual abstract processes, in order to avoid unnecessary network complications and inclusion of irrelevant interactions. While reaching a gene node, we only pass through the links that are not of type transcription, because the part of the upstream regulatory network which terminates to a gene node via a transcription link, is already accounted for by considering the expression level of this particular gene. The only exception for this rule is the root node, where we do the exact reverse as follows:

[0061] Passing from the root node to the direct neighbors in the first ring of root neighborhood is only allowed if the connecting edge is of transcription type to limit the parents to those who impact the expression level of the gene residing in the tree root. We also keep track of the distances from the leaves to the root node that is further used in the function learning process. Finally, if we meet a node via two disjoint paths, the shortest path is considered. The pseudo-code for the Module 1 selection process is summarized below, and a sample upstream tree extracted for gene PPP3CA from the network is depicted in FIG. 2.

TABLE-US-00001 Module 1: Building Regulatory Network for Each Gene using Modified Depth First Traverse Algorithm Inputs: Pathway network, gene id: (g), maxDepth Output: Regulatory Ancestor Gene Set, Depth Information 1 Set root node to input gene (r = g) 2 Set visit node to input gene (v = g), Set depth = 1; 3 Initialize visit_list, depth_list, connType_list and ancestor_list to empty sets. 4 Visit node (v, depth) a. If node v is in the visit_list and depth < depth_list(v) i. Update depth_list(v)=depth ii. return success b. If node v is in the visit_list and depth >= depth_list(v) i. Avoid this node ii. return fail (already visited via a shorter path) c. If node v is not in the visit_list i. Add v to visit_list ii. set depth_list(v) = depth iii. return success 5 If Visit node return fails [node is already visited], exit 6 If node v is a gene a. Add v to the ancestor list (g, depth, ancestor list) i. If node v is in the ancestor_list and depth < depth_list(v) 1. Update depth_list(v)=depth ii. If node v is in the ancestor_list and depth >= depth_list(v) 1. Avoid this node (already visited via shorter path) iii. If node v is not in the ancestor_list 1. Add v to ancestor_list 2. set depth_list(v) = depth 7 If depth < maxDepth [pass through this node to the next level] a. depth = depth+1; b. If node v is the root node (v==r) i. Remove the direct parents not connected via edges of type transcription c. If node v is of type gene and is not the root node (v<>r) i. Remove the direct parents connected via edges of type transcription d. for all nodes u in the parent list of the node v i. Goto step 4, call Visit node(u) e. depth = depth 1; [returns to the previous level] 8 End

[0062] FIG. 2 is an example of a regulatory tree generated using the regulatory interactions derived from pathway databases for a sample gene PPP3CA. The subnetwork includes ancestor genes with depth 1 up to the 3rd level. Shapes define the node types with genes (ovals), protein complexes (rectangles), gene families (pentagon), abstract concepts (diamonds). The edges are colored according to their regulatory function with positive activation (yellow), negative activation (red), positive transcription (green), negative transcriptional (blue), component of protein complex (black) and gene family member (grey). The root node's epigenetic and sCNA measurements (rounded rectangles), considered as additional regulatory parents, are connected by green arrows. The regulators are collected up to level three (d.sub.max=3). The first level ancestors (direct parents) of the root node PPP3CA are shown to be connected via transcription edges that regulate the gene expression level. For instance, the complex CAM/Ca++ is connected to the root node via an activation link, and hence does not regulate gene expression level. Therefore, all the genes connecting via complex CAM/Ca++ in the left side of FIG. 2 are excluded from the final ancestor list. While passing through other genes, only non-transcriptional links are allowed. For instance, the upstream subnetwork of MYB is limited to the non-transcriptional nodes such as PIAS3 and MAP3K7 genes, whose impact is not already captured via the MYB expression level. The impact of genes GATA3 and E2F1 is implicitly accounted for by the expression level of gene MYB.

[0063] As an example, in FIG. 3 the empirical distribution of the number of ancestors when traversing up to 7 links upstream of the root node is presented in a logarithmic scale. A large number of genes are upstream isolated orphan genes. Only 839 genes have ancestors ranging from only one ancestor for 23 genes up to 1152 ancestors for gene CDKN1A. Genes with zero ancestors were not represented in the pathway network.

[0064] Module 2: Learning a Nonlinear Function for Each Gene

[0065] A second step of the inventive method is to learn a function relating the expression level of the gene residing at root node to its regulatory network and its own epigenetic information (e.g., DNA methylation and CNV). Learning a function means quantifying the influence of a regulatory gene's expression level on the target gene's expression. Also, the within method trains a model for a target gene that assigns different coefficients for parent genes based on their pairwise influence as observed in training data (as described in the Bayesian model estimation below, specifically the methods to estimate .sub.g). Because multiple DNA methylation probes may overlap with a gene's coding or regulatory regions, this invention leverages methylation measurements by including several representative statistics such as minimum, maximum, and weighted mean value, where in calculating the weighted mean we exclude the regions with less than 10 probes for more accuracy. Thus, if gene g, overlaps with n.sub.g.sup.m regions, each having probe number p.sub.1, p.sub.2, . . . , p.sub.n.sub.g.sub.m; and corresponding methylation measurement of m.sub.1, m.sub.2, . . . , m.sub.n.sub.g.sub.m, then the weighted mean is calculated as;

[00001] ${\tilde{m}}_{} = \frac{_{i = 1}^{n_{}^{m}} .Math. p_{i} .Math. m_{i} .Math. I (p_{i} 10)}{_{i = 1}^{n_{}^{m}} .Math. p_{i} .Math. I (p_{i} 10)},$

where I(.) is the identity function.

[0066] To include copy number variations, this invention uses the segment mean value provided for the region that harbors the particular gene. Most genes fall into a single CNV segment. Otherwise, if a gene falls in the border of two segments, we simply take the mean value of both segment measurements.

[0067] In order to learn a function for each gene, Module 2 uses mRNA expression of its ancestors, somatic copy-number alteration and DNA methylation measurements for n.sub.g samples to form the following classical regression model:

y.sub.g=.sub.g1.sub.n.sub.g+X.sub.g.sub.g+, custom-character (0,.sub.g.sup.2I.sub.n.sub.g)

where y.sub.g is a n1 vector of expression levels for gene g across all n.sub.g samples. X.sub.g=[X.sub.g.sup.(Self),X.sub.g.sup.(Parent)] is a np data matrix composed of two parts including X.sub.g.sup.(Self) (self-methylation and CNV data) and X.sub.g.sup.(Parent) (the expression levels of the ancestor genes), where;

[00002] $X_{} = {[X_{}^{(Self)} .Math. .Math. X_{}^{(Parent)}]}_{n p_{}}, X_{}^{(Self)} = {[\begin{matrix} m_{}^{1 .Math. (\min)} & m_{}^{1 .Math. (\max)} & m_{}^{1 .Math. (mean)} & c_{}^{1} \\ .Math. & .Math. & .Math. & .Math. \\ m_{}^{n (\min)} & m_{}^{n (\max)} & m_{}^{n (mean)} & c_{}^{n} \end{matrix}]}_{n 4}, X_{}^{(Parent)} = {[\begin{matrix} x_{1}^{1} & .Math. & x_{p_{}^{}}^{1} \\ .Math. & .Math. \\ x_{1}^{n} & .Math. & x_{p_{}^{}}^{n} \end{matrix}]}_{n p_{}}, y_{} = [\begin{matrix} y_{}^{1} \\ .Math. \\ y_{}^{n} \end{matrix}], p_{} = p_{}^{} + 4$

The term 1.sub.n.sub.g is all one column vector of length n.sub.g and E is the model noise with i.i.d zero-mean unit-variance Gaussian elements. .sub.g is the expected value of gene g expression level.

[0068] The objective here is to find the optimal model parameters .sub.i, i=1, 2, . . . , p that provides the best prediction power via minimizing the Mean Squared Error (MSE). One may use normal samples in learning phase to avoid model crash due to severely perturbed interactions in the highly contaminated/disordered cancer cells. However, this may lead to a poor predictive power when the number of predictors is large or comparable with respect to the number of samples (n<O(p)). In most studies, the number of cancer samples profiled tend to be significantly higher than the number of normal samples. For instance, in the case of TCGA data for breast cancer, the number of cancer samples exceed the normal samples by a factor of 10. Consequently, excluding all cancer samples is highly inefficient. On the other hand, including cancer samples in the training set, may deteriorate the model performance for specific genes that significantly deviate from the true underlying biological function in some samples due to genomic events as stated above. Therefore, we include all the normal samples and part of the cancer samples that have not impacted by somatic mutations in this particular gene and its ancestors in order to learn the predictive function. This approach leads to a different training set size for each gene, but provides a considerable improvement in model prediction power.

[0069] The Least Squared Error (LSE) solution minimizes the squared error for the training set, when no prior information is available about the model parameters .sub.i.

[00003] $= {\underset{}{argmin} (y - X .Math. .Math.)}^{T} .Math. (y - X .Math. .Math.) .Math._{LSE} = {(X^{T} .Math. X)}^{- 1} .Math. X^{T} .Math. y$

The LSE solution is not optimal when there is prior information about the model parameters. Here, there is prior knowledge about the model that can be used to enhance the model accuracy. First, it is likely that not all of the ancestor genes may have a substantial impact of a given gene's expression levels. Therefore, a significant number of the model parameters .sub.i could be shrunk towards zero. Therefore, imposing sparsity enhances the model generalization property by avoiding noise over-fitting. Although part of sparsity is already accounted for by using the pathway network and including only ancestor genes instead of using all genes as the input data; but a still higher level of sparsity is expected, when the number of ancestor genes grow higher (in order of tens and hundreds).

[0070] One of the common optimization-based solutions to impose sparsity is regularizing the norm of model parameters. The penalization can be applied to the L.sub.p, p0 norm of coefficient vector =[.sub.1, .sub.2, . . . , .sub.p].sup.T, which is called bridge regression. Important special cases of this approach are Lasso, Ridge, and subset selection for L, L.sub.2, L.sub.0 norm penalization, respectively. In elastic net, the penalty term is the linear combination of L.sub.1 and L.sub.2 penalty;

[00004] $_{L} = {\underset{}{argmin} (y - X .Math. .Math.)}^{T} .Math. (y - X .Math. .Math.) +_{1} .Math. {.Math. .Math.}_{1} +_{2} .Math. {.Math. .Math.}_{2},$

where .sub.1 and .sub.2 are shrinkage parameters to impose sparsity and generalizability. Efficient algorithms based on convex optimizations, Basis pursuit, LARS, coordinate descent, Dantzig Selector, Orthogonal matching pursuit, and approximate message passing may be used to solve this problem. However, the most limiting drawback of these methods is that it only provides point estimates for the regression coefficients.

[0071] In contrast, this invention employs a Bayesian framework that provides more detailed information about the model parameters through posterior distribution to be used in the subsequent consistency check analysis. It also allows the incorporation of other prior knowledge, in addition to sparsity, as explained below.

[0072] Historically, in analyzing gene expression studies potentially non-linear relations among the biological measurements have been ignored. In order to capture such nonlinear relationships, Module 2 of this invention uses a centered sigmoid function

[00005] $f_{1} (x; c) = \frac{1 - e^{- \frac{x}{c}}}{1 + e^{- \frac{x}{c}}}$

to capture the sensitivity around the mean value and the soft thresholding function .sub.2(x; c)=sign(x)({square root over (x.sup.2+c.sup.2)}c), to account for the cases in which only extremely high or low values contribute to the model. .sub.2 (x; c) can be considered as a softer version of the commonly used peace-wise linear soft-thresholding function, (x; c)=sign(x)(|x|c).sub.+. These functions are depicted in FIG. 4 compared to the linear function. We have applied the element-wise non-linear extension X.sub.g.fwdarw.(X.sub.g)=[X.sub.g.sup.s, .sub.1(x.sub.g.sub.s.sub.),.sub.2(x.sub.g.sub.s.sub.), X.sub.g.sup.p] only to the self data (e.g., Methylation and CNV data), hence the number of predictors increases slightly compared to the number of ancestors for each gene. It is notable, that if the actual underlying function is linear, the coefficients of the nonlinear terms tend to zero in the proposed model, hence no performance degradation is observed while learning a nonlinear function for true linear relations.

[0073] Another important biological consideration in developing the ancestor-set for each gene through upwards traversing the pathway network is the variation in the distance of leaf nodes to the root node. One may expect the closer ancestors contribute more to the descendant downstream gene expression level than farther nodes that are connected via a long chain of intermediate nodes. Hence, the closer nodes tend to pose higher coefficient in the regression model. Module 2 leverages this fact into the method through depth penalization mechanism in the Bayesian framework, as captured by Id in the Bayesian model described below.

[0074] Here, this invention uses the Bayesian framework to predict the gene expression level via a nonlinear transformation/projection of its self-epigenetic data as well as the expression levels of the its regulatory ancestor genes. The Bayesian framework provides the desired statistics (e.g., median, mean, moments and . . . ) via full posterior distributions of the model parameters. Moreover, we incorporate a prior knowledge about the model parameters using hierarchical Bayesian models. The resulting posterior distributions provide significant insights into the functional effects of aberrations in the pathway.

[0075] The invention uses the idea of global and local shrinkages with penalization based on the distance of the ancestor gene (i.e., the number of links from leaves to root in the regulatory network) from the gene whose expression is being predicted. The following model is constructed, where the subscript g is omitted for notation convenience:

[00006] $X -> (X)$ $y | X, K,,^{2} ~ N ((X) .Math.,^{2} .Math. I_{n}), K = diag ([k_{1}^{2}, .Math. .Math., k_{p}^{2}])$ $|^{2},_{1}^{2}, .Math. .Math.,_{p}^{2} ~ N_{p} (0_{p},^{2} .Math. D_{} .Math. K^{- 1}) .Math. .Math. D_{} = diag (_{1}^{2}, .Math. .Math.,_{p}^{2}) .Math. .Math. k_{1}^{2}, k_{2}^{2}, .Math. .Math., k_{p}^{2} ~ {.Math.}_{j = 1}^{p} .Math. \frac{{(a_{k} / d_{j}^{2})}^{a}}{(a_{k})} .Math. k_{j}^{2 .Math. (a_{k} - 1)} .Math. e^{- a_{k} .Math. k_{j}^{2} / d_{j}^{2}}$ $k_{j}^{2} ~ Gamma (a_{k}, a_{k} / d_{j}^{2}) .Math. .Math. j = 1, 2, 3, .Math.$ $_{1}^{2}, .Math. .Math.,_{p}^{2} ~ {.Math.}_{j = 1}^{p} .Math. \frac{^{2}}{2} .Math. e^{-^{2} .Math. r_{j}^{2} / 2} .Math. .Math. (_{j}^{2} ~ Expo (^{} =^{2} / 2,^{} = 2 /^{2}) .Math. .Math. (^{2}) = \frac{b^{a}}{(a)} .Math. {(^{2})}^{- a - 1} .Math. e^{- b /^{2}} = InvGamma (a_{^{2}} > 0, b_{a^{2}} > 0) .Math. d .Math. .Math.^{2}$

The above formulation extends the normal gamma prior construction in order to incorporate the link depth information to the gamma prior construction. This information is leveraged via coefficients k included in the variance of the model parameters. Thus, the variance of .sub.i is chosen to be inversely proportional to the link depth of the corresponding ancestor via setting var(.sub.i)=.sup.2.sub.i.sup.2/k.sub.i.sup.2, where .sup.2 controls the global shrinkage, .sub.i.sup.2 accounts for the local shrinkage and Id enforces the link depth impact. To provide more flexibility, we Use of a Gamma prior distribution for k.sub.i.sup.2 provides more flexibility. Using gamma prior has the advantage of yielding closed-form posterior distribution for k.sub.i and hence facilitates employing computationally efficient Gibbs sampler. Therefore, we use k.sub.i.sup.2Gamma(a.sub.k, b.sub.k) such that the mean of variance is inversely proportional to depth parameter, i.e.,

[00007] $E [k_{i}^{2}] = \frac{a_{k_{i}}}{b_{k_{i}}} = d_{i}^{2} .Math. c .$

The constant c is a normalizing term to ensure

[00008] $\frac{1}{p} .Math. {.Math. K .Math.}_{1} = 1,$

which is obtained by setting

[00009] $c = \frac{p}{{.Math.}_{i = 1}^{p} .Math. d_{i}^{2}} .$

Therefore, we only have one free hyperparameter a.sub.k.sub.i for k.sub.i prior distribution and the second parameter b.sub.k .sub.i is automatically obtained from b.sub.k.sub.i=a.sub.k.sub.i/cd.sub.i.sup.2. We note that var(k.sub.i)=a.sub.k.sub.i/b.sub.k.sub.i.sub.2=c.sup.2d.sub.i.sup.4/a.sub.k.sub.i. Setting a.sub.k.sub.i to small values provides higher variance for k.sub.i and hence is less formative, while large values of a.sub.k.sub.i provide low variance reflecting a high certainty about the network topology and the fact that node pairs with shorter paths are associated with higher influences to one another. In this case, the gamma distribution approaches a Gaussian distribution concentrated around d.sub.i. We chose the relatively large value of a=a.sub.k.sub.i=cd.sub.i.sup.2b.sub.k.sub.i=10 to highlight the significance of the underlying biological network.

[0076] The above hierarchical model yields the following full joint distribution:

[00010] $p (y, X, K,,^{2}) = p (y | X, K,,^{2}) .Math. (|^{2},_{1}^{2}, .Math. .Math.,_{p}^{2}) .Math. (k_{1}, .Math. .Math., k_{p}) .Math. (_{1}^{2}, .Math. .Math.,_{p}^{2}) .Math. (^{2}) = \frac{1}{{(2 .Math.^{2})}^{n / 2}} .Math. e^{- \frac{1}{2 .Math.^{2}} .Math. {(y - X .Math. .Math.)}^{T} .Math. (y - X .Math. .Math.)} .Math. {.Math.}_{j = 1}^{p} .Math. \frac{{(a_{k} / d_{j}^{2})}^{a}}{(a_{k})} .Math. k_{j}^{2 .Math. (a_{k} - 1)} .Math. e^{- a_{k} .Math. k_{j}^{2} / d_{j}^{2}}$ ${.Math.}_{j = 1}^{p} .Math. \frac{1}{\sqrt{2 .Math.^{2} .Math._{j}^{2} / k_{j}^{2}}} .Math. e^{- \frac{_{j}^{2} .Math. k_{j}^{2}}{2 .Math.^{2} .Math._{j}^{2}}} .Math. .Math. .Math. .Math. .Math. {.Math.}_{j = 1}^{p} .Math. \frac{^{2}}{2} .Math. e^{-^{2} .Math._{j}^{2} / 2} .Math. \frac{b^{a}}{(a)} .Math. {(^{2})}^{- a - 1} .Math. e^{- b /^{2}},$

which immediately provides the following posterior distributions using the fact that the full conditional posterior distribution for each parameter is simply the product of the terms including that variable with other terms serving as a normalization constant to ensure the resulting probability integrates to one. This method is called completion of terms:

[00011] $.Math. |, (X), y,^{2},_{1}^{2}, .Math. .Math.,_{p}^{2} ~ N_{p} (A^{- 1} .Math. {(X)}^{T} .Math. \tilde{y},^{2} .Math. A^{- 1}), .Math. .Math. A = {(X)}^{T} .Math. (X) + {KD}_{}^{- 1}$ $_{j} = 1 / r_{j}^{2} |, (X), y,,^{2} ~ inverse .Math. .Math. Gaussian .Math. .Math. (\frac{}{.Math._{j} .Math.},^{2}) .Math. I (_{j} > 0), .Math.^{2} |,, (X), y,_{1}^{2}, .Math. .Math.,_{p}^{2} ~ inverse .Math. .Math. Gamma (a + \frac{n + p}{2}, .Math. .Math. b + \frac{1}{2} .Math. {(\overline{y} - (X) .Math.)}^{T} .Math. (\hat{y} - (X) .Math.) + \frac{1}{2} .Math.^{T} .Math. D_{}^{- 1} .Math. K .Math. .Math.)$ $k_{j}^{2} |,, (X), y,^{2},_{1}^{2}, .Math. .Math.,_{p}^{2} ~ Gamma (a_{k} + \frac{1}{2}, \frac{a_{k}}{d_{j}^{2}} + \frac{_{j}^{2}}{2 .Math.^{2} .Math._{j}^{2}})$

The Woodbury Matrix inversion formula is used to calculate A.sup.1 when n<p to obtain more stable results and save in computations by converting a pp square matrix inversion to a nn one. We apply a Gibbs sampler with burn-in iterations 1000 and computation iterations 5000 to obtain the approximate posterior distributions for the model parameters .sub.i, . This process is repeated for all genes gG using all samples sS, where G and S are the set of gene ids and sample ids, respectively.

Module 3: Predict Gene Level Expression for a New Sample and Report Activation and Consistency Level for all Genes

[0077] In order to evaluate the disruption of a target gene g for any given sample, we obtain activation score A.sub.g.sup.(new) and inconsistency score C.sub.g.sup.(new), where the first shows the level of gene expression, which may be consistent with its regulatory network, and the second shows the deviation from the expected value pointing to deregulation of the gene (potentially associated with somatic mutations).

[0078] Performing Module 2 using training samples from both normal and cancer cohorts provides results in the form of a function bank, where each function corresponds to a specific gene. This function bank is then used in Module 3 to analyze test samples to identify potential inconsistencies. Thus, this module performs gene expression level prediction for all genes. For each gene, we extract the expression levels of the ancestor genes as well as the self-epigenetic information for all samples. Then, we predict expression level of this specific gene for all samples using the corresponding function learned for this gene. The prediction process provides the conditional posterior distribution for the expression level of this gene. We use the maximum a-posteriori (MAP) method to obtain the expected gene expression levels.

[0079] In order to calculate consistency scores for un-isolated target genes for which a function is learned, we note that the predictive distribution for the RNA expression of any gene for each new test sample y.sup.new is obtained by marginalizing out the model parameters from the conditional posterior distribution for given input x.sup.new (self-epigenetic info and ancestors expression levels):

(y.sup.new|x.sup.new)=(y.sup.new|x.sup.new,,.sup.2)(,.sup.2|y,X)dd.sup.2

[0080] While the first term, which is the conditional distribution, is available in-closed form, the second term which is the posterior distribution of model parameters is not. This distribution can be approximated with the following expression, where the realizations of model parameters (.sup.(i), .sup.2.sup.(i)) are obtained using Gibbs sampling method.

[00012] $f (y^{new} | x^{new}) \frac{1}{N} .Math. {.Math.}_{i = 1}^{N} .Math. f (y^{new} | x^{new},^{(i)},^{2 .Math. (i)})$

The above distribution is a Gaussian mixture model (GMM) with large number of equi-probable components of mean ((x.sup.new).sup.T.sup.(i)) and variance (.sup.2.sup.(i)). If the Gibbs sampler converges, .sup.(i) is concentrated around .sub.MAP with covariance matrix .sub.=diag([.sub..sub.1.sub.2,.sub..sub.p.sub.2]), where the entities .sub..sub.i.sub.2 are small compared to .sup.2.sup.(i). Therefore, (x.sup.new).sup.(i) approaches a normal distribution for large number of predictors regardless of .sub.i distribution according to central limit theorem. In order to save in computations and storage, we use the following Normal distribution as a surrogate for the predictive distribution:

(y.sup.new|x.sup.new)=N(y.sup.new;.sub.y.sub.new.sub.|x.sub.new,.sub.y.sub.new.sub.|x.sub.new)

.sub.y.sub.new.sub.|x.sub.new=(x.sup.new).sup.T.sub.MAP,.sub.y.sub.new.sub.|x.sub.new.sup.2=(x.sup.new).sup.T.sub..sub.2.sup.2+.sub.MAP.sup.2)

where ..sub.2 is the matrix induced norm. Based on this distribution, we calculate the z-score or the equivalent likelihood for the observed value as follows:

[00013] $z_{c}^{new} = \frac{y^{new} - {(x^{new})}^{T} .Math._{MAP}}{_{y^{new} | x^{new}}}, .Math. L_{c}^{new} = \log .Math. .Math. f (y^{new} | x^{new}) = const - \log (_{y^{new} | x^{new}}) - .5 .Math. {(z_{c}^{new})}^{2}$

[0081] Moreover, due to the complexity of the underlying biological process for each gene and different level of inherit randomness, natural regularity and impact of unknown factors, the predictive power of the learned functions maybe significantly different for each gene. Accordingly, we consider the average empirical predictability of each gene for normal samples as a ground level for the consistency check. Hence, only cancer samples having consistency levels far below the average inconsistency of normal samples are reported as inconsistent samples. The following normalized likelihood is used:

[00014] ${LN}_{c}^{new} = \log (\frac{f (y^{new} | x^{new})}{{(\sqrt[n_{0}]{{.Math.}_{i = 1}^{n_{0}} .Math. f (y_{i} | x_{i})})}^{1 -} .Math. {(\sqrt[n_{1}]{{.Math.}_{i = 1}^{n_{1}} .Math. f (y_{i} | x_{i})})}^{}}) cnst + L_{c}^{new} - \frac{1 -}{n_{0}} .Math. {.Math.}_{i = 1}^{n_{0}} .Math. L_{c}^{i} - \frac{}{n_{1}} .Math. {.Math.}_{i = 1}^{n_{1}} .Math. L_{c}^{i};$

where n.sub.0 and n.sub.1 are the number of normal and cancer samples and a is a tuning parameter between 0 and 1 in order to push different emphasis on normal and cancer cohorts. Lower values for are chosen in order to emphasis more on the normal cancers and compensate for the lower number of normal samples. In this invention, we arbitrarily set = 1/10, which is almost equal to the ratio of normal samples to cancer samples in the training set for TCGA Breast Cancer dataset. The inequality becomes equality if the variances of the predictive distribution are equal for all samples. The above process is repeated for all genes in parallel.

[0082] In addition to consistency score, the activation score of each gene is obtained using the gene expression level distribution modeled as a normal distribution;

[00015] $y ~ N (,^{2}) .Math. z_{A}^{new} = \frac{y^{new} -}{}, L_{A}^{new} = \log .Math. .Math. f (y^{new};,^{2}) = const - \log () - 0.5 .Math. {(z_{A}^{new})}^{2},$

where and is the mean and standard deviation of the normal distribution learned for each gene expression level after iteratively excluding the outliers. The postscript g is omitted for notation convenience. A similar normalization is used for activation scores.

[0083] As discussed above, the application of this module is to use the trained model on top of the regulatory network to predict a desired target gene expression level for a given sample based on the target gene epigenetics as well as expression levels of the genes playing transcription regulations roles in the utilized regulatory tree. In FIG. 5, an illustrative example is shown to predict the gene JUN expression level across test samples including 42 normal and 42 tumor samples derived from the TCGA colon cancer dataset. The model is trained using 338 normal and 368 cancer samples with 5-fold cross validation, using Module 1 and 2. The gene JUN has 51 upstream regulators up to level 2 in the employed pathway network, as derived using Module 1. In FIG. 5, the predicted values along with the standard deviation around the posterior mean are shown for both normal and tumor samples, as obtained by employed the model learned in Module 2 within Module 3. Presentation of confidence interval shown in this figure is an advantage of the inventive method in predicting the gene expression level compared to the point-estimate methods where only the predicted values are obtained and no statistics about the confidence of prediction is provided. The second observation is that the gene JUN is tightly regulated across normal samples since its predicted value based on the expression level of its regulators is more accurate for normal samples as compared to cancer samples. In fact, only 5 normal samples experience JUN expression levels deviating beyond 3 standard deviations from the predicted value compared to 14 tumor samples with similar levels of deviation.

[0084] To further illustrate the association between the gene expression level inconsistencies with somatic mutations as established in this module, FIG. 6 provides a global statistical analysis for both BRCA and CRC across all genes for which a regulatory network is available. In this regard, for each gene, the tumor samples are divided into two subsets: i) where the gene of interest or some of its first and second level regulators are mutated; and ii) all regulators are wild type. Then, we take the average of absolute inconsistency levels for both mutated and non-mutated subsets (FIGS. 6A, 6C). The histogram of inconsistency scores for the two subsets (FIGS. 6B and 6D) reveals that the inconsistency scores for the mutated subset in both cancers are significantly higher than those of the non-mutated subset.

[0085] In FIGS. 6A and 6C, each stem corresponds to a specific gene, where the red stems are the average absolute inconsistencies for samples with mutations in that target gene or its regulatory network (up to level two), while the green stems are the negative of the average absolute consistency score across all samples where the gene of interest and its close parents are wild-type. The green stems for samples with wild-type regulatory genes are flipped vertically for ease of presentation. The genes are sorted based on their average inconsistency levels in wild-type samples. FIGS. 6B and 6D are the histogram obtained for average inconsistency scores. The top and bottom rows are respectively for breast and colorectal cancers. The results show a higher level of average inconsistencies across samples that the target gene or its close parents in the regulatory network harbor somatic mutations.

Module 4: Association Between Somatic Mutations and Inconsistencies

[0086] Gene expression levels may deviate from predicted values by the presence of somatic mutations in the regulatory network, resulting in loss/gain of regulatory functions. That is, a mutation in any of regulator genes may impact its proper role in regulating gene expression and impose a deviation on the target gene expression. Module 4 of the within method provides a methodology which assesses the impact of somatic mutations in regulatory genes on the inconsistency scores for downstream target genes. Accordingly, this module takes the activation and consistency scores provided by Module 3 and, for each new test sample, identifies the genes that are significantly inconsistent and examines if they are potentially driven by CNV aberrations or somatic mutations in the current gene or in its regulatory subnetwork.

[0087] To begin, the inconsistencies driven by CNV aberration events are identified. If the inconsistency is due to overexpression of the gene and the gene experiences copy number amplifications (CNV>0.5), then CNV amplification is reported as the main cause of the inconsistency. Likewise, if copy number deletion (CNV<0.5) is associated with the down expression of the gene, CNV deletion is considered to be the inconsistency driver.

[0088] For the genes which do not experience relevant copy number aberrations, the inconsistencies are likely to arise from mutations in the upstream regulatory network of the gene that impacts the transcription of the downstream gene. The closer the regulatory gene is to the downstream target gene, the more influence is expected on the downstream gene expression level inconsistency. Therefore, Module 4 assigns a global depth penalization parameter 0<1, such that the impact of mutated gene i with d.sub.19 hops to the root node g is scaled with value ().sup.d.sup.ig. As .fwdarw.1, the impact of depth becomes less significant. We choose = is the results section.

[0089] To quantify the impact of mutations in the regulatory tree, we count all non-silent mutations affecting either the target gene or its regulators for each of the cancer samples scaled by their absolute inconsistency levels and the depth penalization factor. In general, the functional impact of gene h mutations on the expression of gene g, denoted by .sub.g(h) is calculated as:

[00016] $f_{g} (h) = \frac{{.Math.}_{j = 1}^{n} .Math. 1 .Math. (h M^{(j)} .Math. P_{g}) .Math. {()}^{d_{h, g}} .Math. .Math. z_{c (g)}^{(j)} .Math.)}{\underset{l P_{g}}{.Math.} .Math. {.Math.}_{j = 1}^{n} .Math. 1 .Math. (l M^{(j)} .Math. P_{g}) .Math. {()}^{d_{l, g}} .Math. .Math. z_{c (g)}^{(j)} .Math.)}$

where P.sub.g is the set of regulatory ancestor genes of gene g (i.e., the leaves of the corresponding regulatory tree), M(i) is the set of genes that are mutated in sample j, Z.sub.c(g).sup.(j) is the inconsistency score of gene g at sample j and 1(.) is the indicator function. The role of the denominator is to normalize .sub.hP.sub.g.sub.g(h)=1. Therefore, .sub.g(h) quantifies the relative impact of mutations in all genes belonging to the regulatory network hP.sub.g on the target gene g.

[0090] The flowchart in FIG. 7 summarizes the interpretation of per sample inconsistency in this method. Repeating this procedure for all samples, and sorting the genes based on their assigned somatic mutation impact profiles (.sub.g(h), gG, hP.sub.g) filters out the passenger events and determines the most influential parent genes whose mutations functionally impact the downstream transcription factor gene. Thus, the invention allows for the identification of functional mutations that impact downstream gene expression. Give the functional impact of the majority of observed missense mutations across disease contexts are largely unknown, this inventive step allows clinicians and/or researchers to focus in on the most likely functional disease-associated mutations in a given context, thus enabling the identification of novel biomarkers as well as potential therapeutic targets.

[0091] FIG. 8 is an example of the results generated in Module 4 illustrated in graphical form. Specifically, FIG. 8A displays the relative impact of somatic mutations in APC on Wnt pathway target gene expression for genes identified with colon cancer. Plotted are the log 10(Pvalue) of the significance of association of target gene activation and inconsistency with the mutations affecting APC in colon cancer samples. Genes highlighted in green are significantly affected (FDR15%). In FIG. 8B, the impact of somatic mutations in the upstream regulatory subnetwork of PTEN on its gene expression inconsistency is displayed. Depth penalization parameter is set to =. The modulation effect of combinations of somatic mutations in parents of PTEN on its regulations are shown, where the mutations in gene set {PTEN, DYRK2, E4F1 and ATF2} show significant association with down-expression of PTEN. Therefore these genes modulate the impact of somatic mutations in PTEN. Thus, mutations in DYRK2, E4F1 and ATF2 jointly affect the expression of PTEN, so the combination of these mutations provides a more accurate functional status of PTEN in tumors. Given that the disruption of PTEN results in the oncogenic activation of the AKT pathway, mutations in these genes are prognostic and/or biomarkers for selection of therapies.

EXAMPLES

[0092] In to illustrate the prediction power of the method of this invention, its performance is compared to several near-optimal state-of-the art point-estimators including LASSO, RIDGE and Elastic-Net Regressions.

[0093] In order to demonstrate the accuracy of the inventive method, we first learn a Gaussian distribution for each gene expression level via maximum likelihood method after iteratively excluding significant outliers. We begin by learning a Gaussian distribution for the samples at each iteration and then we remove the samples which are not in the second standard deviation neighborhood of the mean value. In the subsequent iteration we repeat the process for the remaining samples until the algorithm converges and no more outlier exists. The empirical distribution for a sample gene PTEN and the learned Normal distribution is presented in FIG. 9. We also learn a Student t distribution for comparison purpose. Student t distribution has the advantage of robustness to outliers and is very close to the normal distribution after outlier exclusion as shown in FIG. 9.

[0094] Next, we divide gene expression levels into three states (neutral, over-expressed and under-expressed) based on predefined thresholds. Thresholds are arbitrarily set such that the probability of down-expression, neutral and over-expression states become 10%, 80% and 10%, respectively. Module 3 provides patient-specific gene expression predictions for all 839 un-isolated genes. The state change rate is calculated via averaging state change events over all genes and patients. The results are calculated for each cohort separately. If the observed and predicted expression state for sample i and gene g are s.sub.g.sup.(i) and .sub.g.sup.(i), respectively, the state change rate is calculated as:

[00017] $SER = \frac{1}{.Math. G .Math. .Math. .Math. S .Math.} .Math. {.Math.}_{g = 1}^{.Math. G .Math.} .Math. {.Math.}_{i = 1}^{.Math. S .Math.} .Math. I (s_{g}^{(i)} {\hat{s}}_{g}^{(i)})$

[0095] In Table 1, the prediction error is calculated for some important genes that are highly associated with cancer and have a valid set of upstream regulatory genes in the global pathway network. It is seen that the within method outperforms the state of the art sparsity-imposing regression models with the additional advantage of providing full posterior distribution for the gene expression level.

TABLE-US-00002 TABLE 1 Gene state prediction error rate for the within method in comparison with the benchmark optimization-based sparse regression models. The model training and test is the same for all methods. The prediction accuracy for normal and cancer samples are presented separately. Number Test Results on Normal Samples Test Results on Cancer Samples of Elastic Elastic Gene Parents Lasso Ridge Net InFloMut Lasso Ridge Net InFloMut CBFB 57 0.1532 0.1622 0.1441 0.1622 0.1854 0.1909 0.1937 0.1881 CCND1 836 0.2703 0.2883 0.2703 0.2432 0.2919 0.2947 0.2873 0.2614 CDH1 159 0.1802 0.1261 0.1712 0.1622 0.2484 0.2391 0.2456 0.2428 CDKN1B 456 0.2162 0.1892 0.1982 0.1982 0.2994 0.2799 0.2780 0.2530 CTCF 417 0.1261 0.0901 0.1261 0.1171 0.1409 0.1353 0.1474 0.1325 ERBB2 99 0.2252 0.2973 0.0811 0.2523 0.3883 0.4013 0.3818 0.3272 FOXA1 206 0.1712 0.1441 0.2072 0.1261 0.1196 0.1242 0.1242 0.1177 GATA3 223 0.3423 0.2703 0.3333 0.3423 0.1613 0.1529 0.1603 0.1585 MYB 219 0.0901 0.0811 0.0901 0.0901 0.1891 0.1752 0.1900 0.1891 PTEN 226 0.0541 0.0631 0.0721 0.0631 0.1631 0.1585 0.1696 0.1603 RB1 342 0.0721 0.0811 0.0811 0.0901 0.1835 0.1854 0.1724 0.1696 RUNX1 57 0.2162 0.2613 0.2432 0.1982 0.1965 0.1946 0.1993 0.1891 TP53 325 0.2162 0.1441 0.2162 0.1171 0.3309 0.3216 0.3309 0.2956 0.1795 0.1691 0.1719 0.1663 0.2229 0.2195 0.2216 0.2065

[0096] Another important observation is that despite the fact of higher contribution of cancer samples to the model training due to the larger number of cancer samples with respect to normal samples, the normal cohort presents a better predictability. This observation holds for all models and reveals that the functional states of the gene expressions in normal tissues are more consistent with the upstream regulatory network.

[0097] The fact of higher consistency between the predicted and observed values of target gene expression levels in normal samples compared to cancer samples is also observed in FIG. 10, where the observation and prediction values for sample genes MYB, GATA3, PTEN and ERBB2 are presented. Here, the gene expression levels in normal samples are more consistent with the predictions obtained from the gene self-epigenetic data and its upstream transcription regulation network. This figure shows the importance of inconsistency analysis for cancer samples which may arise from different sources and reveals additional information about the pathway perturbations and gene dysregulations with respect to the methods that only analyze the expression levels of genes. The inconsistency may arise due to various sources such as copy number amplification and deletion in the target gene as well as the mutations in the regulatory network that disrupts the normal behavior of the regulatory network role and consequently impacts the expression level of the target gene resides in the root of the regulatory network. In order to gain more insight about the model coefficients, the model parameters obtained for two genes ERBB2 and GATA3 are presented in Table 2 and Table 3. Each row presents the corresponding coefficient value obtained by different learning methods and for the within non-linear Bayesian method. The standard deviation for the posterior distribution is also presented in brackets in the last column. It is shown that the expression level of ERBB2 is highly dependent on copy number aberration events affecting its locus as seen in the model parameter of the proposed nonlinear soft-thresholding function. This nonlinearity reflects the ignorance of the model to small turbulences around zero which is likely measurement noise. Therefore, the copy-number associated log Ratio values derived from SNP-arrays can be used directly in the model without the need to discretize the log Ratios into amplified/neutral/deleted states. The relevance of the nonlinear function is interestingly picked up by all learning methods. FIG. 11 verifies this relevance, where the relation between the observed RNA as well as the predicted RNA versus CNV is depicted for gene ERBB2. In FIG. 11, the blue and red points correspond to the observations and the predictions obtained from the model. The black curve is the nonlinear RNA CNV relation obtained by the model parameters in Table 2.

[0098] This figure demonstrates that the nonlinear CNV terms with coefficients obtained from the learning process well define the RNA expression level for ERBB2 with some minor variability due to other terms such as DNA methylation and ancestor gene expression levels. In fact, the coefficients of DNA methylation and majority of ancestors are explicitly removed from the predictor list by LASSO and Elastic Net Methods and also notable is that the within invention assigns negligible coefficients for DNA methylation.

TABLE-US-00003 TABLE 2 Model parameters for two gene: ERBB2. ELAS- Model Parameters for TIC InFlo-Mut [Std (ERBB2) LASSO RIDGE NET Deviation] Methylation.sub.min 0.0000 0.0057 0.0000 0.0015: [0.0120] Methylation.sub.max 0.0000 0.0185 0.0000 0.0027: [0.0134] Methylation.sub.mean 0.0000 0.0215 0.0000 0.0152: [0.0355] f.sub.1 (Methylation.sub.mean) 0.0000 0.0530 0.0054 .sup.0.0220: [0.0247] f.sub.2 (Methylation.sub.mean) 0.0504 0.0734 0.0626 0.0588: [0.0319] CNV.sub.mean 0.0000 0.3685 0.0181 .sup.0.0453: [0.0638] f.sub.1 (CNV.sub.mean) 0.0986 0.1579 0.1268 0.1457: [0.0376] f.sub.2 (CNV.sub.mean) 0.9958 0.6272 0.9951 .sup.0.9858: [0.0524] {square root over (|.sub.i|.sup.2)} for ancestors 0.1232 0.2149 0.1528 0.1663

[0099] On the other hand, the RNA expression level for GATA3 is more influenced by DNA methylations as well as upstream regulatory network. The expected negative sign for DNA methylation coefficients are suggestive of an inverse relationship between the gene expression level and DNA methylation for both genes. Finally, for GATA3, the upstream regulatory network plays a crucial role in regulating the expression of this gene, suggesting that most of the variation of this gene's expression in breast cancer arises primarily due to the activity of transcription factors. The regression coefficients estimated by the method for two genes ERBB2 and GATA3 provided in Table 2 and 3 reveal that the regression coefficients can be significantly different for genes due to high heterogeneity of the gene regulation functionalities.

TABLE-US-00004 TABLE 3 Regression coefficients for gene GATA3. Model Parameters ELASTIC for (GATA3) LASSO RIDGE NET InFlo-Mut Methylation.sub.min 0.0000 0.0120 0.0000 0.0201 [0.0183] Methylation.sub.max 0.0842 0.0510 0.0838 0.0864 [0.0201].sup. Methylation.sub.mean 0.0000 0.0450 0.0142 0.0327 [0.0422].sup. f.sub.1 (Methylation.sub.mean) 0.1888 0.0544 0.1667 0.1221 [0.0426].sup. f.sub.2 (Methylation.sub.mean) 0.0000 0.0342 0.0000 0.0099 [0.0273].sup. CNV.sub.mean 0.0000 0.0009 0.0000 0.0068 [0.0270] f.sub.1 (CNV.sub.mean) 0.0000 0.0015 0.0000 0.0199 [0.0247] f.sub.2 (CNV.sub.mean) 0.0000 0.0033 0.0000 0.0003 [0.0232] {square root over (|.sub.i|.sup.2)} of ancestors 0.3157 0.3329 0.3085 0.4903

[0100] An important source of inconsistency is due to mutations in the target gene's upstream regulatory network. Noting the fact that the impact of expression level of regulatory genes is already captured by the method if the predicted value of target gene expression level is not consistent with the observed value then we infer that the regulatory network does not play its regulatory role properly. This malfunction of the regulatory network most likely arises from somatic mutations in the regulatory network that prevents its genes or product proteins from performing their functions (complex formation, gene transcription, protein activation and . . . ) properly which in turn impacts the downstream target gene expression level.

As an illustrative example, the functional impact of somatic mutations on the dysregulation of gene PTEN is depicted in FIG. 12, revealing that the inconsistencies in PTEN expression are highly associated with mutations in TP53, PTEN, PIK3CA, MAP3K1 and MAP2K4. The higher impact of TP53 mutations versus PIK3CA is particularly interesting given that PIK3CA is mutated more often than TP53 (387 samples versus 333 samples respectively). We observe that MAP3K1 and MAP2K4 mutations, previously shown to be associated with luminal breast cancers, impact PTEN inactivation, thus providing an intriguing nexus between these genes in driving a key subtype of breast cancers. We also calculate the relative impact of protein-truncating and other non-synonymous mutations on the inconsistency score for PTEN. The model determines that the two kinds of mutations have similar impacts when they affect any of the regulatory genes of PTEN while the protein-truncating mutations in PTEN have a higher impact on its deregulation, consistent with nonsense-mediated decay of PTEN mRNA. Depth penalization parameter is set to =.

INTEGRATED METHOD AND SYSTEM FOR IDENTIFYING FUNCTIONAL PATIENT-SPECIFIC SOMATIC ABERATIONS USING MULTI-OMIC CANCER PROFILES

Inventors

Cpc classification

Classification Explorer

G16B20/20

PHYSICS

Classification Explorer

G06F17/18

PHYSICS

Classification Explorer

G16B5/00

PHYSICS

Classification Explorer

G16B20/00

PHYSICS

International classification

Classification Explorer

G06F19/12

PHYSICS

Classification Explorer

G06F19/18

PHYSICS

Abstract

Claims

Description