ASSESSMENT OF CELLULAR SIGNALING PATHWAY ACTIVITY USING PROBABILISTIC MODELING OF TARGET GENE EXPRESSION
20230105263 · 2023-04-06
Assignee
Inventors
- WILHELMUS FRANCISCUS JOHANNES VERHAEGH (Heusden Gem Asten, NL)
- Anja VAN DE STOLPE (Vught, NL)
- HENDRIK JAN VAN OOIJEN (Wuk En Aalburg, NL)
- KALYANA CHAKRAVARTHI DULLA (Eindhoven, NL)
- MARCIA ALVES DE INDA (Rosmalen, NL)
- RALF HOFFMAN (Brueggen, DE)
Cpc classification
G16B40/00
PHYSICS
G16B25/00
PHYSICS
C12Q1/6809
CHEMISTRY; METALLURGY
C12Q2537/165
CHEMISTRY; METALLURGY
G16B20/20
PHYSICS
C12Q2537/165
CHEMISTRY; METALLURGY
C12Q2600/106
CHEMISTRY; METALLURGY
C12Q1/6809
CHEMISTRY; METALLURGY
G16B5/00
PHYSICS
G16B20/00
PHYSICS
International classification
C12Q1/6809
CHEMISTRY; METALLURGY
G16B20/00
PHYSICS
G16B20/20
PHYSICS
G16B40/00
PHYSICS
Abstract
The present application mainly relates to specific methods for inferring activity of one or more cellular signaling pathway(s) in tissue of a medical subject based at least on the expression level(s) of one or more target gene(s) of the cellular signaling pathway(s) measured in an extracted sample of the tissue of the medical subject, an apparatus comprising a digital processor configured to perform such methods and a non-transitory storage medium storing instructions that are executable by a digital processing device to perform such methods.
Claims
1-7. (canceled)
8. A kit for determining abnormal operation of an AR cellular signaling pathway in a sample isolated from a subject suffering from a disease associated with an activated AR cellular signaling pathway, comprising: a set of primers directed to a plurality of AR cellular signaling pathway genes; and a set of probes directed to the plurality of AR cellular signaling pathway target genes; wherein the kit is configured to enable measurement of expression levels of a plurality of mRNA direct target genes of the AR cellular signaling pathway, the plurality of mRNA direct target genes of the AR cellular signaling pathway comprising at least nine target genes selected from the group consisting of KLK2, PMEPA1, TMPRSS2, NKX3-1, ABCC4, KLK3, FKBP5, ELL2, UGT2B15, DHCR24, PPAP2A, NDRG1, LRIG1, CREB3L4, LCP1, GUCY1A3, AR, and EAF2.
9. The kit of claim 8, wherein the plurality of mRNA direct target genes of the AR cellular signaling pathway comprises KLK2, PMEPA1, TMPRSS2, NKX3-1, ABCC4, KLK3, FKBP5, ELL2, UGT2B15, DHCR24, PPAP2A, NDRG1, LRIG1, CREB3L4, LCP1, GUCY1A3, AR, and EAF2.
10. The kit of claim 8, wherein the plurality of mRNA direct target genes of the AR cellular signaling pathway consists of KLK2, PMEPA1, TMPRSS2, NKX3-1, ABCC4, KLK3, FKBP5, ELL2, UGT2B15, DHCR24, PPAP2A, NDRG1, LRIG1, CREB3L4, LCP1, GUCY1A3, AR, and EAF2.
11. The kit of claim 8, wherein the disease is a cancer.
12. The kit of claim 8, further comprising a probabilistic model of the AR cellular signaling pathway, the probabilistic model representing the AR cellular signaling pathway for a set of inputs including at least the expression levels of the plurality of mRNA direct target genes of the AR cellular signaling pathway.
13. The kit of claim 8, further comprising a processor configured to: calculate activity of the AR cellular signaling pathway in the sample by evaluating at least a portion of a probabilistic model of the AR cellular signaling pathway, the probabilistic model representing the AR cellular signaling pathway for a set of inputs including at least the expression levels of the plurality of mRNA direct target genes of the AR cellular signaling pathway; determine a level in the sample of at least one transcription factor (TF) element, the at least one TF element controlling transcription of the plurality of mRNA direct target genes of the AR cellular signaling pathway, and the determining being based at least in part on conditional probabilities relating the at least one TF element and the expression levels of the plurality of mRNA direct target genes of the AR cellular signaling pathway; determine the activity of the AR cellular signaling pathway based on the determined level in the sample of the at least one TF element; and determine that the AR cellular signaling pathway is operating abnormally in the subject based on the determined activity of the AR cellular signaling pathway.
14. The kit of claim 13, wherein the determined abnormal operation of the AR cellular signaling pathway is active or overactive.
15. The kit of claim 13, wherein the determined abnormal operation of the AR cellular signaling pathway is active or overactive, and wherein the processor is further configured to: select, based on the determined abnormal operation of the AR cellular signaling pathway, a specific treatment configured to remedy the determined abnormal operation of the AR pathway.
16. The kit of claim 15, wherein the selected specific treatment is an antagonist of the AR cellular signaling pathway.
17. The kit of claim 13, wherein the determined abnormal operation of the AR cellular signaling pathway is inactive or reduced active, and wherein the processor is further configured to: select, based on the determined abnormal operation of the AR cellular signaling pathway, a specific treatment configured to remedy the determined abnormal operation of the AR pathway.
18. The kit of claim 17, wherein the selected specific treatment is an agonist of the AR cellular signaling pathway.
19. The kit of claim 13, wherein determining comprises: estimating the level in the sample of the subject of the at least one TF element represented by a TF node of the probabilistic model, the TF element controlling transcription of the plurality of mRNA direct target genes of the AR cellular signaling pathway, the estimating being based at least in part on conditional probabilities of the probabilistic model relating the TF node and nodes in the probabilistic model representing the plurality of mRNA direct target genes of the AR cellular signaling pathway; wherein the determining by evaluating at least a portion of a probabilistic model is performed by using a Bayesian network comprising nodes representing information about the AR cellular signaling pathway and conditional probability relationships between connected nodes of the Bayesian network.
20. A method for detecting abnormal operation of an AR cellular signaling pathway in a sample isolated from a subject suffering from a disease associated with an activated AR cellular signaling pathway, comprising: receiving a biological sample from the subject; analyzing, using the kit of claim 8, expression levels of the plurality of mRNA direct target genes of the AR cellular signaling pathway by contacting the biological sample with the set of primers directed to the plurality of AR cellular signaling pathway genes and the set of probes directed to the plurality of AR cellular signaling pathway target genes.
21. The method of claim 20, wherein the disease is a cancer.
22. The method of claim 20, wherein detecting abnormal operation of an AR cellular signaling pathway further comprises: calculating activity of the AR cellular signaling pathway in the sample by evaluating at least a portion of a probabilistic model of the AR cellular signaling pathway, the probabilistic model representing the AR cellular signaling pathway for a set of inputs including at least the expression levels of the plurality of mRNA direct target genes of the AR cellular signaling pathway; determining a level in the sample of at least one transcription factor (TF) element, the at least one TF element controlling transcription of the plurality of mRNA direct target genes of the AR cellular signaling pathway, and the determining being based at least in part on conditional probabilities relating the at least one TF element and the expression levels of the plurality of mRNA direct target genes of the AR cellular signaling pathway; determining the activity of the AR cellular signaling pathway based on the determined level in the sample of the at least one TF element; and determining that the AR cellular signaling pathway is operating abnormally in the subject based on the determined activity of the AR cellular signaling pathway.
23. The method of claim 20, wherein the determined abnormal operation of the AR cellular signaling pathway is active or overactive.
24. The method of claim 20, wherein the determined abnormal operation of the AR cellular signaling pathway is active or overactive, and wherein detecting abnormal operation of an AR cellular signaling pathway further comprises selecting, based on the determined abnormal operation of the AR cellular signaling pathway, a specific treatment configured to remedy the determined abnormal operation of the AR pathway.
25. The method of claim 24, wherein the selected specific treatment is an antagonist of the AR cellular signaling pathway.
26. The method of claim 20, wherein the determined abnormal operation of the AR cellular signaling pathway is inactive or less active.
27. The method of claim 20, wherein the determined abnormal operation of the AR cellular signaling pathway is inactive or less active, and wherein detecting abnormal operation of an AR cellular signaling pathway further comprises selecting, based on the determined abnormal operation of the AR cellular signaling pathway, a specific treatment configured to remedy the determined abnormal operation of the AR pathway.
28. The method of claim 27, wherein the selected specific treatment is an agonist of the AR cellular signaling pathway.
29. The method of claim 20, wherein determining comprises: estimating the level in the sample of the subject of the at least one TF element represented by a TF node of the probabilistic model, the TF element controlling transcription of the plurality of mRNA direct target genes of the AR cellular signaling pathway, the estimating being based at least in part on conditional probabilities of the probabilistic model relating the TF node and nodes in the probabilistic model representing the plurality of mRNA direct target genes of the AR cellular signaling pathway; wherein the determining by evaluating at least a portion of a probabilistic model is performed by using a Bayesian network comprising nodes representing information about the AR cellular signaling pathway and conditional probability relationships between connected nodes of the Bayesian network.
30. The method of claim 20, further comprising one or more of: diagnosing based on the determined activity of the AR cellular signaling pathway; preparing a prognosis based on the determined activity of the AR cellular signaling pathway; drug prescribing based on the determined activity of the AR cellular signaling pathway; predicting drug efficacy based on the determined activity of AR cellular signaling pathway; predicting adverse effects based on the determined activity of the AR cellular signaling pathway; monitoring of drug efficacy; developing one or more drugs; developing one or more assays; researching one or more cellular pathways; cancer staging; enrolling the subject in a clinical trial based on the determined activity of the AR cellular signaling pathway; selecting a subsequent test to be performed, and selecting one or more companion diagnostics tests.
Description
[0083] The following examples merely illustrate particularly preferred methods and selected aspects in connection therewith. The teaching provided therein may be used for constructing several tests and/or kits, e.g. to detect, predict and/or diagnose the abnormal activity of one or more cellular signaling pathways. Furthermore, upon using methods as described herein drug prescription can advantageously be guided, drug prediction and monitoring of drug efficacy (and/or adverse effects) can be made, drug resistance can be predicted and monitored, e.g. to select subsequent test(s) to be performed (like a companion diagnostic test). The following examples are not to be construed as limiting the scope of the present invention.
Example 1: Bayesian Network Construction
[0084] As disclosed herein, by constructing a probabilistic model (e.g., the illustrative Bayesian model shown in
[0085] One of the simplest Bayesian network models for representing a cellular signaling pathway would be a two level model including the transcription factor element and the associated target genes (see
[0086] The levels of the TF element and target genes may be variously represented. One option is to use a binary discretization, into states “absent” and “present” for the TF element, and “down” and “up” for a target gene’s mRNA level (see
[0087] The foregoing illustration of a simple Bayesian network is just an illustrative embodiment of the Bayesian network model (
[0088] Additional “upstream” levels representing regulatory proteins (in active or inactive state) of the pathway are typically added if knowledge of the level of such a protein could be probative for determining the clinical decision support recommendation. For example, the inclusion of the proteins elementary to the transcription factor or essential proteins upstream of the transcription factor in the Bayesian network (see
[0089] Additional information nodes further downstream of the target genes may be included in the Bayesian network as well. An illustrative example of this is the translation of target gene’s mRNA into proteins (
[0090] The expression level of a target gene may be computed based on the measured intensity of corresponding probesets of a microarray, for example by averaging or by other means of other techniques (e.g. RNA sequencing). In some embodiments this computation is integrated into the Bayesian network, by extending the Bayesian network with a node for each probeset that is used and including an edge running to each of these “measurement” nodes from the corresponding target gene node, as described herein with reference to
[0091] The probabilistic model may optionally also incorporate additional genomic information, such as information on mutations, copy number variations, gene expression, methylation, translocation information, or so forth, which change genomic sequences which are related to the signaling cascade of the pathway to infer the pathway activity and to locate the defect in the Wnt pathway which causes the aberrant functioning (either activation or inactivity), as described by illustrative reference to
[0092] Moreover, it is to be understood that while the examples described later herein, which pertain to the Wnt, ER, AR and Hedgehog pathways, are provided as illustrative examples, the approaches for cellular signaling pathway analysis disclosed herein are readily applied to other cellular signaling pathways besides these pathways, such as to intracellular signaling pathways with receptors in the cell membrane (e.g., the Notch, HER2/PI3K, TGFbeta, EGF, VEGF, and TNF-NFkappaB cellular signaling pathways) and intracellular signaling pathways with receptors inside the cell (e.g., progesterone, retinoic acid, and vitamin D cellular signaling pathways).
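As an illustration of the simplest two-level model discussed in this example (one TF element node with binary states "absent"/"present" and target gene nodes with binary states "down"/"up"), the following Python sketch computes the posterior probability that the TF element is present given observed target gene states. It is a minimal sketch under stated assumptions and not the implementation used for the models reported herein: the function name, the gene names GENE_A to GENE_C, the prior, and all conditional probability values are hypothetical, and the target genes are assumed to be conditionally independent given the TF state.

```python
from math import prod


def posterior_tf_present(prior_present, cpts, observed):
    """Return P(TF = present | observed target gene states).

    cpts maps gene -> (P(up | TF absent), P(up | TF present)).
    observed maps gene -> "up" or "down"; unmeasured genes are omitted.
    Target genes are assumed conditionally independent given the TF state.
    """
    def likelihood(tf_index):
        terms = []
        for gene, state in observed.items():
            p_up = cpts[gene][tf_index]
            terms.append(p_up if state == "up" else 1.0 - p_up)
        return prod(terms)  # prod([]) == 1, so no evidence leaves the prior

    joint_present = prior_present * likelihood(1)
    joint_absent = (1.0 - prior_present) * likelihood(0)
    return joint_present / (joint_present + joint_absent)


# Hypothetical conditional probability tables (illustrative values only).
EXAMPLE_CPTS = {
    "GENE_A": (0.1, 0.9),
    "GENE_B": (0.2, 0.8),
    "GENE_C": (0.3, 0.7),
}

p_present = posterior_tf_present(
    0.5, EXAMPLE_CPTS, {"GENE_A": "up", "GENE_B": "up", "GENE_C": "down"}
)
```

With no observed genes the function returns the prior unchanged, which mirrors how a Bayesian network can be evaluated on partial evidence.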
Example 2: Comparison of Machine Learning Methods
[0093] Here the performance of two types of machine learning techniques is compared, with the Wnt pathway taken as an example case: the prediction of Wnt activity by means of a nearest centroid method is compared to the method of choice according to the present invention, which e.g. uses a Bayesian network.
[0094] As discussed above, the Bayesian network approach was selected because the probabilistic approach is able to incorporate the available information in either “soft” form, e.g. percentages of study subjects exhibiting probative characteristics, or “hard” form, using conditional probabilistic relationships. In addition, the probabilistic model also enables information to be incorporated based on partial (rather than comprehensive) knowledge of the underlying cellular signaling pathway, again through the use of conditional probability tables.
[0095] Here it is demonstrated that the inventors added value in the way they included known biological properties and the availability of soft evidence using a Bayesian network compared to other machine learning methods, e.g. nearest centroid classification, a well-known method. Nearest centroid classification is a machine learning method where for each class of training samples an average profile (= centroid) is computed, and next, for a sample to be classified, the label is predicted based on the centroid that is closest (the closest centroid’s label is then the prediction result). The two centroids are calculated on the same list of probesets used in the Bayesian network, and for the ‘Wnt on’ and ‘Wnt off’ centroid they are based on the adenoma samples and the normal colon samples, respectively, of the same fRMA processed data of GSE8671. The log2-ratio of the two Euclidean distances between a sample and the two centroids was subsequently used to classify samples from various data sets to infer the classification of the samples. This means that a log2-ratio of 0 corresponds to an equal distance of the sample to the two centroids, a value > 0 corresponds to a sample classified as active Wnt signaling whereas a value < 0 corresponds to a sample identified as having an inactive Wnt signaling pathway.
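The nearest centroid comparison described above can be sketched in Python as follows. This is a minimal sketch with toy two-feature profiles, not the fRMA processed GSE8671 data. The orientation of the ratio (distance to the 'Wnt off' centroid divided by distance to the 'Wnt on' centroid) is an assumption chosen so that a value > 0 corresponds to active Wnt signaling, as stated in the text; the function names and the handling of zero distances are likewise assumptions.

```python
from math import dist, inf, log2


def centroid(profiles):
    """Average profile of a class: the per-feature mean over its samples."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def log2_distance_ratio(sample, on_centroid, off_centroid):
    """log2 of (distance to 'off' centroid / distance to 'on' centroid)."""
    d_on = dist(sample, on_centroid)
    d_off = dist(sample, off_centroid)
    if d_on == 0.0 and d_off == 0.0:
        return 0.0   # identical centroids: no information
    if d_on == 0.0:
        return inf   # sample coincides with the 'on' centroid
    if d_off == 0.0:
        return -inf  # sample coincides with the 'off' centroid
    return log2(d_off / d_on)


def classify(sample, on_centroid, off_centroid, threshold=0.0):
    """Label a sample; the threshold of 0 marks equal distance to both."""
    score = log2_distance_ratio(sample, on_centroid, off_centroid)
    return "Wnt on" if score > threshold else "Wnt off"


# Toy training profiles (hypothetical values).
ON_CENTROID = centroid([[8.0, 9.0], [10.0, 11.0]])
OFF_CENTROID = centroid([[2.0, 3.0], [4.0, 5.0]])
```

Because the score depends on a single fixed threshold, shifting that threshold per tissue type would amount to retraining, which is the limitation discussed for this method.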
[0096] The Bayesian network was constructed similar to
[0097] The trained Bayesian network and nearest centroid model were then tested on various fRMA processed microarray data sets to infer the probability that the Wnt pathway is “on”, measured by P(Wnt On) and log2-ratio of the distances. Summaries of the results of the Bayesian network and the nearest centroid model are shown in
[0098] The vast majority of the colon (cancer) samples (GSE20916, GSE4183) are classified equally between the active and inactive Wnt pathway, except for GSE15960, which had a high fraction of wrongly classified negative samples in the nearest centroid method (false negatives). This pattern of a higher fraction of false negatives persists in the other cancer types as well. This is especially true for breast cancer samples (GSE12777, GSE21653) and liver cancer (GSE9843); with a few exceptions, all samples are predicted to have an inactive Wnt pathway, which is known to be incorrect in the case of basal-type breast cancer and the CTNNB1 liver cancer samples. In some cases, evident in, for example, GSE15960, the classification could be corrected by lowering or increasing the threshold of the nearest centroid classification. The idea behind this would be that the threshold of Wnt activity might differ between tissue types. However, this would involve additional training of the nearest centroid method for it to be applicable to other tissue types. One of the strengths of the Bayesian network model is that this tissue-specific training is not required, as the model is established to be nonspecific regarding tissue type.
Example 3: Selection of Target Genes
[0099] A transcription factor (TF) is a protein complex (that is, a combination of proteins bound together in a specific structure) or a protein that is able to regulate transcription from target genes by binding to specific DNA sequences, thereby controlling the transcription of genetic information from DNA to mRNA. The mRNA directly produced due to this action of the transcription complex is herein referred to as a “direct target gene”. Pathway activation may also result in more secondary gene transcription, referred to as “indirect target genes”. In the following, Bayesian network models (as exemplary probabilistic models) comprising or consisting of direct target genes, as direct links between pathway activity and mRNA level, are preferred; however, the distinction between direct and indirect target genes is not always evident. Here a method to select direct target genes using a scoring function based on available literature data is presented. Nonetheless, accidental selection of indirect target genes cannot be ruled out due to limited information and biological variations and uncertainties.
[0100] Specific pathway mRNA target genes were selected from the scientific literature, by using a ranking system in which scientific evidence for a specific target gene was given a rating, depending on the type of scientific experiments in which the evidence was accumulated. While some experimental evidence is merely suggestive of a gene being a target gene, like for example an mRNA increasing on a microarray of an embryo in which it is known that the Hedgehog pathway is active, other evidence can be very strong, like the combination of an identified pathway transcription factor binding site and retrieval of this site in a chromatin immunoprecipitation (ChIP) assay after stimulation of the specific pathway in the cell and increase in mRNA after specific stimulation of the pathway in a cell line.
[0101] Several types of experiments to find specific pathway target genes can be identified in the scientific literature:
[0102] 1. ChIP experiments in which direct binding of a pathway-transcription factor to its binding site on the genome is shown. Example: By using chromatin immunoprecipitation (ChIP) technology, putative functional TCF4 transcription factor binding sites in the DNA of colon cell lines with and without an active Wnt pathway were subsequently identified, as a subset of the binding sites recognized purely based on nucleotide sequence. Putative functionality was identified as ChIP-derived evidence that the transcription factor was found to bind to the DNA binding site.
[0103] 2. Electrophoretic Mobility Shift (EMSA) assays which show in vitro binding of a transcription factor to a fragment of DNA containing the binding sequence. Compared to ChIP-based evidence, EMSA-based evidence is less strong, since it cannot be translated to the in vivo situation.
[0104] 3. Stimulation of the pathway and measurement of mRNA profiles on a microarray or by RNA sequencing, using pathway-inducible cell lines, with mRNA profiles measured at several time points after induction in the presence of cycloheximide, which inhibits translation to protein, so that the induced mRNAs are assumed to be direct target genes.
[0105] 4. Similar to 3, but using quantitative PCR to measure the amounts of mRNAs.
[0106] 5. Identification of transcription factor binding sites in the genome using a bioinformatics approach. Example for the Wnt pathway: Using the known TCF4-beta catenin transcription factor DNA binding sequence, a software program was run on the human genome sequence, and potential binding sites were identified, both in gene promoter regions and in other genomic regions.
[0107] 6. Similar to 3, only in the absence of cycloheximide.
[0108] 7. Similar to 4, only in the absence of cycloheximide.
[0109] 8. mRNA expression profiling of specific tissue or cell samples of which it is known that the pathway is active, however in absence of the proper negative control condition.
[0110] In the simplest form one can give every potential target mRNA 1 point for each of these experimental approaches in which the target mRNA was identified. Alternatively, points can be given incrementally, meaning one technology 1 point, second technology adds a second point, and so on. Using this relative ranking strategy, one can make a list of most reliable target genes.
[0111] Alternatively, ranking in another way can be used to identify the target genes that are most likely to be direct target genes, by giving a higher number of points to the technology that provides most evidence for an in vivo direct target gene, in the list above this would mean 8 points for experimental approach 1), 7 to 2), and going down to one point for experimental approach 8. Such a list may be called “general target gene list”.
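The two scoring schemes described above (one point per experimental approach, or 8 points for approach 1 going down to 1 point for approach 8) can be sketched in Python as follows. This is a minimal sketch: the short labels for the eight experiment types, the function names, and the genes GENE_X and GENE_Y with their evidence sets are hypothetical and serve only to show that the two schemes can rank the same genes differently.

```python
# Short labels (hypothetical) for the eight experiment types, in list order.
EVIDENCE_TYPES = [
    "chip",                             # 1. ChIP
    "emsa",                             # 2. EMSA
    "stimulation_array_chx",            # 3. microarray/RNA-seq, cycloheximide
    "stimulation_qpcr_chx",             # 4. qPCR, cycloheximide
    "binding_site_bioinformatics",      # 5. in silico binding sites
    "stimulation_array_no_chx",         # 6. as 3, without cycloheximide
    "stimulation_qpcr_no_chx",          # 7. as 4, without cycloheximide
    "expression_profiling_no_control",  # 8. profiling, no negative control
]

# Approach 1 gets 8 points, approach 2 gets 7, ..., approach 8 gets 1.
WEIGHTS = {name: len(EVIDENCE_TYPES) - i for i, name in enumerate(EVIDENCE_TYPES)}


def flat_score(evidence):
    """One point for each experimental approach that identified the gene."""
    return len(set(evidence) & set(EVIDENCE_TYPES))


def weighted_score(evidence):
    """More points for approaches giving stronger in vivo direct evidence."""
    return sum(WEIGHTS[name] for name in set(evidence) if name in WEIGHTS)


def rank_genes(evidence_by_gene, score):
    """Return (gene, score) pairs, highest score first, ties by gene name."""
    scored = [(gene, score(ev)) for gene, ev in evidence_by_gene.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical literature evidence for two candidate genes.
EXAMPLE_EVIDENCE = {
    "GENE_X": {"chip", "emsa"},
    "GENE_Y": {
        "stimulation_array_no_chx",
        "stimulation_qpcr_no_chx",
        "expression_profiling_no_control",
    },
}

flat_ranking = rank_genes(EXAMPLE_EVIDENCE, flat_score)
weighted_ranking = rank_genes(EXAMPLE_EVIDENCE, weighted_score)
```

In this toy case GENE_Y leads under flat scoring because it has more supporting experiments, while GENE_X leads under weighted scoring because its evidence comes from the approaches ranked strongest.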
[0112] Despite the biological variations and uncertainties, the inventors assumed that the direct target genes are the most likely to be induced in a tissue-independent manner. A list of these target genes may be called “evidence curated target gene list”. These curated target lists have been used to construct computational models that can be applied to samples coming from different tissue sources.
[0113] The “general target gene list” probably contains genes that are more tissue specific, and can potentially be used to optimize and increase sensitivity and specificity of the model for application to samples from a specific tissue, like breast cancer samples.
[0114] The following illustrates, by way of example, how an evidence curated target gene list was constructed specifically for the ER pathway.
[0115] For the purpose of selecting ER target genes used as input for the “model”, the following three criteria were used:
[0116] 1. Gene promoter/enhancer region contains an estrogen response element (ERE) motif:
a. The ERE motif should be proven to respond to estrogen, e.g., by means of a transient transfection assay in which the specific ERE motif is linked to a reporter gene, and
b. The presence of the ERE motif should be confirmed by, e.g., an enriched motif analysis of the gene promoter/enhancer region.
[0117] 2. ER (differentially) binds in vivo to the promoter/enhancer region of the gene in question, demonstrated by, e.g., a ChIP/CHIP experiment or a chromatin immunoprecipitation assay:
a. ER is proven to bind to the promoter/enhancer region of the gene when the ER pathway is active, and
b. (preferably) does not bind (or weakly binds) to the gene promoter/enhancer region of the gene if the ER pathway is not active.
[0118] 3. The gene is differentially transcribed when the ER pathway is active, demonstrated by, e.g.,
a. fold enrichment of the mRNA of the gene in question through real time PCR, or microarray experiment, or
b. the demonstration that RNA Pol II binds to the promoter region of the gene through an immunoprecipitation assay.
[0119] The selection was done by defining as ER target genes the genes for which sufficient and well-documented experimental evidence was gathered proving that all three criteria mentioned above were met. A suitable experiment for collecting evidence of ER differential binding is to compare the results of, e.g., a ChIP/CHIP experiment in a cancer cell line that responds to estrogen (e.g., the MCF-7 cell line), when exposed or not exposed to estrogen. The same holds for collecting evidence of mRNA transcription.
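The rule that a gene is retained only when all three criteria are met can be expressed as a simple filter. The Python sketch below is illustrative only: the record type, its field names, and the candidates GENE_P, GENE_Q and GENE_R are hypothetical, and the judgment of whether the literature evidence for a criterion is sufficient and well documented is assumed to have been made beforehand and recorded as a boolean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErEvidence:
    """Curated literature findings for one candidate gene (hypothetical)."""
    ere_motif: bool                   # criterion 1: responsive ERE motif
    differential_binding: bool        # criterion 2: ER binds in vivo
    differential_transcription: bool  # criterion 3: transcription changes


def select_er_target_genes(candidates):
    """Return the sorted names of candidates meeting all three criteria."""
    return sorted(
        gene
        for gene, ev in candidates.items()
        if ev.ere_motif and ev.differential_binding and ev.differential_transcription
    )


# Hypothetical candidates; not taken from the tables herein.
EXAMPLE_CANDIDATES = {
    "GENE_P": ErEvidence(True, True, True),
    "GENE_Q": ErEvidence(True, False, True),
    "GENE_R": ErEvidence(True, True, True),
}

selected = select_er_target_genes(EXAMPLE_CANDIDATES)
```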
[0120] The foregoing discusses the generic approach and a more specific example of the target gene selection procedure that has been employed to select a number of target genes based upon the evidence found using above mentioned approach. The lists of target genes used in the Bayesian network models for exemplary pathways, namely the Wnt, ER, Hedgehog and AR pathways are shown in Table 1, Table 2, Table 3 and Table 4, respectively.
[0121] The target genes of the ER pathway used for the Bayesian network model of the ER pathway described herein (shown in Table 2) contain a selection of target genes based on their literature evidence score; only the target genes with the highest evidence scores (preferred target genes according to the invention) were added to this short list. The full list of ER target genes, including also those genes with a lower evidence score, is shown in Table 5.
[0122] A further subselection or ranking of the target genes of the Wnt, ER, Hedgehog and AR pathways shown in Table 1, Table 2, Table 3 and Table 4 was performed based on a combination of the literature evidence score and the odds ratios calculated using the trained conditional probability tables linking the probeset nodes to the corresponding target gene nodes. The odds ratio is an assessment of the importance of the target gene in inferring activity of the pathways. In general, it is expected that the expression level of a target gene with a higher odds ratio is likely to be more informative as to the overall activity of the pathway as compared with target genes with lower odds ratios. However, because of the complexity of cellular signaling pathways it is to be understood that more complex interrelationships may exist between the target genes and the pathway activity — for example, considering expression levels of various combinations of target genes with low odds ratios may be more probative than considering target genes with higher odds ratios in isolation. In Wnt, ER, Hedgehog and AR modeling reported herein, it has been found that the target genes shown in Table 6, Table 7, Table 8 and Table 9 are of a higher probative nature for predicting the Wnt, ER, Hedgehog and AR pathway activities as compared with the lower-ranked target genes (thus, the target genes shown in Tables 6 to 9 are particularly preferred according to the present invention). Nonetheless, given the relative ease with which acquisition technology such as microarrays can acquire expression levels for large sets of genes, it is contemplated to utilize some or all of the target genes of Table 6, Table 7, Table 8 and Table 9, and to optionally additionally use one, two, some, or all of the additional target genes of ranks shown in Table 1, Table 2, Table 3 and Table 4, in the Bayesian model as depicted in
TABLE-US-00001 Evidence curated list of target genes of the Wnt pathway used in the Bayesian network and associated probesets used to measure the mRNA expression level of the target genes (# = sequence number in accompanying sequence listing) Target gene Probeset # Target gene Probeset # ADRA2C 206128_at 4 HNF1A 210515_at 102 ASCL2 207607_at 10 216930_at 229215_at IL8 202859_x_at 110 AXIN2 222695_s_at 13 211506_s_at 222696_at KIAA1199 1554685_a_at 119 224176_s_at 212942_s_at 224498_x_at KLF6 1555832_s_at 121 BMP7 209590_at 17 208960_s_at 209591_s_at 208961_s_at 211259_s_at 211610_at 211260_at 224606_at CCND1 208711_s_at 27 LECT2 207409_at 129 208712_at LEF1 210948_s_at 130 214019_at 221557_s_at CD44 1557905_s_at 30 221558_s_at 1565868_at LGR5 210393_at 131 204489_s_at 213880_at 204490_s_at MYC 202431_s_at 142 209835_x_at 244089_at 210916_s_at NKD1 1553115_at 150 212014_x_at 229481_at 212063_at 232203_at 216056_at OAT 201599_at 157 217523_at PPARG 208510_s_at 173 229221_at REG1B 205886_at 184 234411_x_at RNF43 218704_at 189 234418_x_at SLC1A2 1558009_at 200 COL18A1 209081_s_at 40 1558010_s_at 209082_s_at 208389_s_at DEFA6 207814_at 52 225491_at DKK1 204602_at 54 SOX9 202935_s_at 209 EPHB2 209588_at 67 202936_s_at 209589_s_at SP5 235845_at 210 210651_s_at TBX3 219682_s_at 215 211165_x_at 222917_s_at EPHB3 1438_at 68 225544_at 204600_at 229576_s_at FAT1 201579_at 72 TCF7L2 212759_s_at 219 FZD7 203705_s_at 90 212761_at 203706_s_at 212762_s_at GLUL 200648_s_at 95 216035_x_at 215001_s_at 216037_x_at 217202_s_at 216511_s_at 217203_at 236094_at 242281_at TDGF1 206286_s_at 220 ZNRF3 226360_at 248
TABLE-US-00002 Evidence curated list of target genes of the ER pathway used in the Bayesian network and associated probesets used to measure the mRNA expression level of the target genes (# = sequence number in accompany sequence listing). Target gene Probeset # Target gene Probeset # AP1B1 205423_at 5 RARA 1565358_at 183 ATP5J 202325_s_at 12 203749_s_at COL18A1 209081_s_at 40 203750_s_at 209082_s_at 211605_s_at COX7A2L 201256_at 41 216300_x_at CTSD 200766_at 46 SOD1 200642_at 205 DSCAM 211484_s_at 59 TFF1 205009_at 221 237268_at TRIM25 206911_at 230 240218_at 224806_at EBAG9 204274_at 61 XBP1 200670_at 244 204278_s_at 242021_at ESR1 205225_at 70 GREB1 205862_at 97 211233_x_at 210562_at 211234_x_at 210855_at 211235_s_at IGFBP4 201508_at 106 211627_x_at MYC 202431_s_at 142 215551_at 244089_at 215552_s_at SGK3 227627_at 196 217163_at 220038_at 217190_x_at WISP2 205792_at 241 207672_at ERBB2 210930_s_at 69 HSPB1 201841_s_at 103 216836_s_at KRT19 201650_at 124 234354_x_at 228491_at CA12 203963_at 22 NDUFV3 226209_at 148 204508_s_at 226616_s_at 204509_at NRIP1 202599_s_at 154 210735_s_at 202600_s_at 214164_x_at PGR 208305_at 162 215867_x_at 228554_at 241230_at PISD 202392_s_at 164 CDH26 232306_at 32 PRDM15 230553_at 174 233391_at 230777_s_at 233662_at 231931_at 233663_s_at 234524_at CELSR2 204029_at 36 236061_at 36499_at PTMA 200772_x_at 179 200773_x_at 208549_x_at 211921_x_at
TABLE-US-00003 Evidence curated list of target genes of the Hedgehog pathway used in the Bayesian network and associated probesets used to measure the mRNA expression level of the target genes (# = sequence number in accompany sequence listing). Target gene Probeset # Target gene Probeset # GLI1 206646_at 93 CTSL1 202087_s_at 47 PTCH1 1555520_at 177 TCEA2 203919_at 216 208522_s_at 238173_at 209815_at 241428_x_at 209816_at MYLK 1563466_at 145 238754_at 1568770_at PTCH2 221292_at 178 1569956_at HHIP 1556037_s_at 101 202555_s_at 223775_at 224823_at 230135_at FYN 1559101_at 88 237466_s_at 210105_s_at SPP1 1568574_x_at 212 212486_s_at 209875_s_at 216033_s_at TSC22D1 215111_s_at 232 PITRM1 205273_s_at 165 235315_at 239378_at 243133_at CFLAR 208485_x_at 37 239123_at 209508_x_at CCND2 200951_s_at 28 209939_x_at 200952_s_at 210563_x_at 200953_s_at 210564_x_at 231259_s_at 211316_x_at H19 224646_x_at 253 211317_s_at 224997_x_at 211862_x_at IGFBP6 203851_at 107 214486_x_at TOM1 202807_s_at 229 214618_at JUP 201015_s_at 117 217654_at FOXA2 210103_s_at 82 235427_at 214312_at 237367_x_at 40284_at 239629_at MYCN 209756_s_at 144 224261_at 209757_s_at IL1R2 205403_at 108 211377_x_at 211372_s_at 234376_at S100A7 205916_at 254 242026_at S100A9 203535_at 255 NKX2_2 206915_at 249 CCND1 208711_s_at 27 NKX2_8 207451_at 250 208712_at RAB34 1555630_a_at 182 214019_at 224710_at JAG2 209784_s_at 115 MIF 217871_s_at 134 32137_at GLI3 1569342_at 94 FOXM1 202580_x_at 85 205201_at FOXF1 205935_at 83 227376_at FOXL1 216572_at 84 FST 204948_s_at 87 243409_at 207345_at 226847_at BCL2 203684_s_at 14 203685_at 207004_at 207005_s_at
TABLE-US-00004 Evidence curated list of target genes of the AR pathway used in the Bayesian network and associated probesets used to measure the mRNA expression level of the target genes (# = sequence number in accompany sequence listing). Target gene Probeset # Target gene Probeset # ABCC4 1554918_a_at 2 LCP1 208885_at 128 1555039_a_at LRIG1 211596_s_at 132 203196_at 238339_x_at APP 200602_at 7 NDRG1 200632_s_at 147 211277_x_at NKX3_1 209706_at 251 214953_s_at 211497_x_at AR 211110_s_at 8 211498_s_at 211621_at NTS 206291_at 155 226192_at PLAU 205479_s_at 167 226197_at 211668_s_at CDKN1A 1555186_at 34 PMEPA1 217875_s_at 169 202284_s_at 222449_at CREB3L4 226455_at 42 222450_at DHCR24 200862_at 53 PPAP2A 209147_s_at 171 DRG1 202810_at 58 210946_at EAF2 1568672_at 60 PRKACB 202741_at 175 1568673_s_at 202742_s_at 219551_at 235780_at ELL2 214446_at 65 KLK3 204582_s_at 123 226099_at 204583_x_at 226982_at PTPN1 202716_at 180 FGF8 208449_s_at 75 217686_at FKBP5 204560_at 77 SGK1 201739_at 195 224840_at TACC2 1570025_at 214 224856_at 1570546_a_at GUCY1A3 221942_s_at 99 202289_s_at 227235_at 211382_s_at 229530_at TMPRSS2 1570433_at 225 239580_at 205102_at IGF1 209540_at 105 211689_s_at 209541_at 226553_at 209542_x_at UGT2B15 207392_x_at 236 211577_s_at 216687_x_at KLK2 1555545_at 122 209854_s_at 209855_s_at 210339_s_at
TABLE-US-00005 Gene symbols of the ER target genes found to have significant literature evidence (= ER target genes longlist) (# = sequence number in accompanying sequence listing).
Gene symbol  #     Gene symbol  #     Gene symbol  #     Gene symbol  #
AP1B1        5     SOD1         205   MYC          142   ENSA         66
COX7A2L      41    TFF1         221   ABCA3        1     KIAA0182     118
CTSD         46    TRIM25       230   ZNF600       247   BRF1         19
DSCAM        59    XBP1         245   PDZK1        160   CASP8AP2     25
EBAG9        61    GREB1        97    LCN2         127   CCNH         29
ESR1         70    IGFBP4       106   TGFA         222   CSDE1        43
HSPB1        103   SGK3         196   CHEK1        38    SRSF1        213
KRT19        124   WISP2        241   BRCA1        18    CYP1B1       48
NDUFV3       148   ERBB2        69    PKIB         166   FOXA1        81
NRIP1        154   CA12         22    RET          188   TUBA1A       235
PGR          162   CELSR2       36    CALCR        23    GAPDH        91
PISD         164   CDH26        32    CARD10       24    SFI1         194
PRDM15       174   ATP5J        12    LRIG1        132   ESR2         258
PTMA         179   COL18A1      40    MYB          140   MYBL2        141
RARA         183   CCND1        27    RERG         187
TABLE-US-00006 Shortlist of Wnt target genes based on literature evidence score and odds ratio (# = sequence number in accompanying sequence listing).
Target gene  #
KIAA1199     119
AXIN2        13
CD44         30
RNF43        189
MYC          142
TBX3         215
TDGF1        220
SOX9         209
ASCL2        10
IL8          110
SP5          210
ZNRF3        248
EPHB2        67
LGR5         131
EPHB3        68
KLF6         121
CCND1        27
DEFA6        52
FZD7         90
TABLE-US-00007 Shortlist of ER target genes based on literature evidence score and odds ratio (# = sequence number in accompanying sequence listing).
Target gene  #
CDH26        32
SGK3         196
PGR          162
GREB1        97
CA12         22
XBP1         244
CELSR2       36
WISP2        241
DSCAM        59
ERBB2        69
CTSD         46
TFF1         221
NRIP1        154
TABLE-US-00008 Shortlist of HH target genes based on literature evidence score and odds ratio (# = sequence number in accompanying sequence listing).
Target gene  #
GLI1         93
PTCH1        177
PTCH2        178
IGFBP6       107
SPP1         212
CCND2        28
FST          87
FOXL1        84
CFLAR        37
TSC22D1      232
RAB34        182
S100A9       255
S100A7       254
MYCN         144
FOXM1        85
GLI3         94
TCEA2        216
FYN          88
CTSL1        47
TABLE-US-00009 Shortlist of AR target genes based on literature evidence score and odds ratio (# = sequence number in accompanying sequence listing).
Target gene  #
KLK2         122
PMEPA1       169
TMPRSS2      225
NKX3_1       251
ABCC4        2
KLK3         123
FKBP5        77
ELL2         65
UGT2B15      236
DHCR24       53
PPAP2A       171
NDRG1        147
LRIG1        132
CREB3L4      42
LCP1         128
GUCY1A3      99
AR           8
EAF2         60
Example 4: Comparison of Evidence Curated List and Broad Literature List
[0123] The list of Wnt target genes constructed on the basis of literature evidence following the procedure described herein (Table 1) is compared to another list of target genes that was not compiled following the above-mentioned procedure. The alternative list is a compilation of genes that a variety of data from various experimental approaches indicate to be Wnt target genes, published in three public sources by renowned labs known for their expertise in the area of molecular biology and the Wnt pathway. The alternative list is a combination of the genes mentioned in table S3 from Hatzis et al. (Hatzis P, 2008), the text and table S1A from de Sousa e Melo et al. (de Sousa E Melo F, 2011) and the list of target genes collected and maintained by Roel Nusse, a pioneer in the field of Wnt signaling (Nusse, 2012). The combination of these three sources resulted in a list of 124 genes (= broad literature list, see Table 10). The question discussed here is whether the algorithm derived from this alternative list predicts Wnt activity in clinical samples similarly to or better than the model constructed on the basis of the existing list of genes (= evidence curated list, Table 1).
TABLE-US-00010 Alternative list of Wnt target genes (= broad literature list) (# = sequence number in accompanying sequence listing). Target gene Reference # Target gene Reference # ADH6 de Sousa e Melo et al. 3 L1CAM Nusse 125 ADRA2C Hatzis et al. 4 LBH Nusse 126 APCDD1 de Sousa e Melo et al. 6 LEF1 Hatzis et al., de Sousa e Melo et al., Nusse 130 ASB4 de Sousa e Melo et al. 9 LGR5 de Sousa e Melo et al., Nusse 131 ASCL2 Hatzis et al., de Sousa e Melo et al. 10 LOC283859 de Sousa e Melo et al. 260 ATOH1 Nusse 11 MET Nusse 133 AXIN2 Hatzis et al., de Sousa e Melo et al., Nusse 13 MMP2 Nusse 135 BIRC5 Nusse 15 MMP26 Nusse 136 BMP4 Nusse 16 MMP7 Nusse 137 BMP7 Hatzis et al. 17 MMP9 Nusse 138 BTRC Nusse 20 MRPS6 Hatzis et al. 139 BZRAP1 de Sousa e Melo et al. 21 MYC Hatzis et al., Nusse 142 SBSPON de Sousa e Melo et al. 259 MYCBP Nusse 143 CCL24 de Sousa e Melo et al. 26 MYCN Nusse 144 CCND1 Nusse 27 NANOG Nusse 146 CD44 Nusse 30 NKD1 de Sousa e Melo et al. 150 CDH1 Nusse 31 NOS2 Nusse 151 CDK6 Hatzis et al. 33 NOTUM de Sousa e Melo et al. 152 CDKN2A Nusse 35 NRCAM Nusse 153 CLDN1 Nusse 39 NUAK2 Hatzis et al. 156 COL18A1 Hatzis et al. 40 PDGFB Hatzis et al. 159 CTLA4 Nusse 44 PFDN4 Hatzis et al. 161 CYP4X1 de Sousa e Melo et al. 49 PLAUR Nusse 168 CYR61 Nusse 50 POU5F1 Nusse 170 DEFA5 de Sousa e Melo et al. 51 PPARD Nusse 172 DEFA6 de Sousa e Melo et al. 52 PROX1 de Sousa e Melo et al. 176 DKK1 de Sousa e Melo et al., Nusse 54 PTPN1 Hatzis et al. 180 DKK4 de Sousa e Melo et al. 55 PTTG1 Nusse 181 DLL1 Nusse 56 REG3A de Sousa e Melo et al. 185 DPEP1 de Sousa e Melo et al. 57 REG4 de Sousa e Melo et al. 186 EDN1 Nusse 62 RPS27 Hatzis et al. 190 EGFR Nusse 64 RUNX2 Nusse 191 EPHB2 Hatzis et al., de Sousa e Melo et al., Nusse 67 SALL4 Nusse 192 EPHB3 Hatzis et al., Nusse 68 SLC1A1 de Sousa e Melo et al. 199 ETS2 Hatzis et al. 71 SLC7A5 Hatzis et al. 201 FAT1 Hatzis et al. 
72 SNAI1 Nusse 202 FGF18 Nusse 73 SNAI2 Nusse 203 FGF20 Nusse 74 SNAI3 Nusse 204 FGF9 Nusse 76 SIK1 Hatzis et al. 261 FLAD1 Hatzis et al. 78 SOX17 Nusse 206 AK122582 Hatzis et al. 262 SOX2 de Sousa e Melo et al. 207 FN1 Nusse 79 SOX4 Hatzis et al. 208 FOSL1 Nusse 80 SOX9 Nusse 209 FOXN1 Nusse 86 SP5 Hatzis et al., de Sousa e Melo et al. 210 FST Nusse 87 SP8 Hatzis et al. 211 FZD2 de Sousa e Melo et al. 89 TCF3 Nusse 217 FZD7 Nusse 90 TDGF1 Hatzis et al. 220 GAST Nusse 92 TIAM1 Nusse 224 GMDS Hatzis et al. 96 TNFRSF19 Nusse 227 GREM2 Nusse 98 TNFSF11 Nusse 228 HES6 Hatzis et al. 100 TRIM29 de Sousa e Melo et al. 231 HNF1A Nusse 102 TSPAN5 de Sousa e Melo et al. 233 ID2 Nusse 104 TTC9 de Sousa e Melo et al. 234 IL22 de Sousa e Melo et al. 109 VCAN Nusse 237 IL8 Nusse 110 VEGFA Nusse 238 IRX3 de Sousa e Melo et al. 111 VEGFB Nusse 239 IRX5 de Sousa e Melo et al. 112 VEGFC Nusse 240 ISL1 Nusse 113 WNT10A Hatzis et al. 242 JAG1 Nusse 114 WNT3A Nusse 243 JUN Nusse 116 ZBTB7C de Sousa e Melo et al. 246 KIAA1199 de Sousa e Melo et al. 119 PATZ1 Hatzis et al. 263 KLF4 Hatzis et al. 120 ZNRF3 Hatzis et al. 248
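The compilation of the broad literature list amounts to taking the union of the three source lists while keeping track of which source mentions each gene. The following Python sketch illustrates this on a small excerpt of Table 10 only (nine of the 124 genes); the function and variable names are illustrative assumptions and are not part of the disclosure.

```python
# Illustrative excerpt of Table 10 (not the full lists): genes per source.
SOURCES = {
    "Hatzis et al.": {"ADRA2C", "ASCL2", "AXIN2", "EPHB2", "LEF1", "MYC"},
    "de Sousa e Melo et al.": {"ADH6", "ASCL2", "AXIN2", "EPHB2", "LEF1", "LGR5"},
    "Nusse": {"ATOH1", "AXIN2", "EPHB2", "LEF1", "LGR5", "MYC"},
}


def merge_sources(sources):
    """Return {gene: set of source names mentioning it} for the union of all lists."""
    merged = {}
    for source_name, genes in sources.items():
        for gene in genes:
            merged.setdefault(gene, set()).add(source_name)
    return merged


broad_list = merge_sources(SOURCES)

# Print the combined list in alphabetical order with its references.
for gene in sorted(broad_list):
    print(gene, ", ".join(sorted(broad_list[gene])))
```

Applied to the complete source lists, the same union would yield the 124 genes of Table 10.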
[0124] The next step consisted of finding the probesets of the Affymetrix® GeneChip Human Genome U133 Plus 2.0 array that correspond to the genes. This process was performed using the Bioconductor plugin in R and manual curation of the probesets' relevance based on the UCSC genome browser, similar to the (pseudo-)linear models described herein, thereby removing e.g. probesets on opposite strands or outside gene exon regions. For two of the 124 genes, LOC283859 and WNT3A, no probesets are available on this microarray chip, and these genes therefore could not be included in the (pseudo-)linear model. In total, 287 probesets were found to correspond to the remaining 122 genes (Table 11).
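The curation rule mentioned above (discarding probesets on the opposite strand or outside gene exon regions) can be sketched in Python as a simple filter. This is only an illustration: the actual procedure used Bioconductor in R together with manual inspection in the UCSC genome browser, and the record layout, field names and example records below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbesetAnnotation:
    """Hypothetical annotation record for one probeset."""
    probeset_id: str
    gene: str
    gene_strand: str       # "+" or "-"
    probeset_strand: str   # "+" or "-"
    overlaps_exon: bool    # True if the probeset maps inside an exon of the gene


def keep_probeset(annotation):
    """Keep a probeset only if it lies on the gene's strand and within an exon."""
    return (annotation.probeset_strand == annotation.gene_strand
            and annotation.overlaps_exon)


def curate(annotations):
    """Return {gene: [probeset ids]} for the probesets that pass the filter."""
    kept = {}
    for annotation in annotations:
        if keep_probeset(annotation):
            kept.setdefault(annotation.gene, []).append(annotation.probeset_id)
    return kept


# Made-up records, for illustration only.
records = [
    ProbesetAnnotation("probe_A", "GENE1", "+", "+", True),
    ProbesetAnnotation("probe_B", "GENE1", "+", "-", True),   # opposite strand
    ProbesetAnnotation("probe_C", "GENE2", "-", "-", False),  # outside exons
    ProbesetAnnotation("probe_D", "GENE2", "-", "-", True),
]
curated = curate(records)
```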
TABLE-US-00011 Probesets associated with the Wnt target genes in the broad literature gene list (# = sequence number in accompanying sequence listing). Gene symbol Probeset # Gene symbol Probeset # Gene symbol Probeset # ADH6 207544_s_at 3 FAT1 201579_at 72 PFDN4 205360_at 161 214261_s_at FGF18 206987_x_at 73 205361_s_at ADRA2C 206128_at 4 211029_x_at 205362_s_at APCDD1 225016_at 6 211485_s_at PLAUR 210845_s_at 168 ASB4 208481_at 9 231382_at 211924_s_at 217228_s_at FGF20 220394_at 74 214866_at 217229_at FGF9 206404_at 76 POU5F1 208286_x_at 170 235619_at 239178_at PPARD 208044_s_at 172 237720_at FLAD1 205661_s_at 78 210636_at 237721_s_at 212541_at 37152_at ASCL2 207607_at 10 AK122582 235085_at 262 242218_at 229215_at FN1 1558199_at 79 PROX1 207401_at 176 ATOH1 221336_at 11 210495_x_at 228656_at AXIN2 222695_s_at 13 211719_x_at PTPN1 202716_at 180 222696_at 212464_s_at 217686_at 224176_s_at 214701_s_at 217689_at 224498_x_at 214702_at PTTG1 203554_x_at 181 BIRC5 202094_at 15 216442_x_at REG3A 205815_at 185 202095_s_at FOSL1 204420_at 80 234280_at 210334_x_at FOXN1 207683_at 86 REG4 1554436_a_at 186 BMP4 211518_s_at 16 FST 204948_s_at 87 223447_at BMP7 209590_at 17 207345_at RPS27 200741_s_at 190 209591_s_at 226847_at RUNX2 216994_s_at 191 211259_s_at FZD2 210220_at 89 221282_x_at 211260_at 238129_s_at 232231_at BTRC 1563620_at 20 FZD7 203705_s_at 90 236858_s_at 204901_at 203706_s_at 236859_at 216091_s_at GAST 208138_at 92 SALL4 229661_at 192 222374_at GMDS 204875_s_at 96 SLC1A1 206396_at 199 224471_s_at 214106_s_at 213664_at BZRAP1 205839_s_at 21 GREM2 220794_at 98 SLC7A5 201195_s_at 201 SBSPON 214725_at 259 235504_at SNAI1 219480_at 202 235209_at 240509_s_at SNAI2 213139_at 203 235210_s_at HES6 226446_at 100 SNAI3 1560228_at 204 CCL24 221463_at 26 228169_s_at SIK1 208078_s_at 261 CCND1 208711_s_at 27 HNF1A 210515_at 102 232470_at 208712_at 216930_at SOX17 219993_at 206 214019_at ID2 201565_s_at 104 230943_at CD44 1557905_s_at 30 201566_x_at SOX2 213721_at 207 
204489_s_at 213931_at 213722_at 204490_s_at IL22 221165_s_at 109 228038_at 209835_x_at 222974_at SOX4 201416_at 208 210916_s_at IL8 202859_x_at 110 201417_at 212014_x_at 211506_s_at 201418_s_at 212063_at IRX3 229638_at 111 213668_s_at 217523_at IRX5 210239_at 112 SOX9 202935_s_at 209 229221_at ISL1 206104_at 113 202936_s_at CDH1 201130_s_at 31 JAG1 209097_s_at 114 SP5 235845_at 210 201131_s_at 209098_s_at SP8 237449_at 211 208834_x_at 209099_x_at 239743_at CDK6 207143_at 33 216268_s_at TCF3 209151_x_at 217 214160_at JUN 201464_x_at 116 209152_s_at 224847_at 201465_s_at 209153_s_at 224848_at 201466_s_at 210776_x_at 224851_at KIAA1199 1554685_a_at 119 213730_x_at 231198_at 212942_s_at 213811_x_at 235287_at KLF4 220266_s_at 120 215260_s_at 243000_at 221841_s_at 216645_at CDKN2A 207039_at 35 L1CAM 204584_at 125 TDGF1 206286_s_at 220 209644_x_at 204585_s_at TIAM1 206409_at 224 211156_at LBH 221011_s_at 126 213135_at CLDN1 218182_s_at 39 LEF1 210948_s_at 130 TNFRSF19 223827_at 227 222549_at 221557_s_at 224090_s_at COL18A1 209081_s_at 40 221558_s_at TNFSF11 210643_at 228 209082_s_at LGR5 210393_at 131 211153_s_at CTLA4 221331_x_at 44 213880_at TRIM29 202504_at 231 231794_at MET 203510_at 133 211001_at 234362_s_at 211599_x_at 211002_s_at 236341_at 213807_x_at TSPAN5 209890_at 233 CYP4X1 227702_at 49 213816_s_at 213968_at CYR61 201289_at 50 MMP2 1566678_at 135 225387_at 210764_s_at 201069_at 225388_at DEFA5 207529_at 51 MMP26 220541_at 136 TTC9 213172_at 234 DEFA6 207814_at 52 MMP7 204259_at 137 213174_at DKK1 204602_at 54 MMP9 203936_s_at 138 VCAN 204619_s_at 237 DKK4 206619_at 55 MRPS6 224919_at 139 204620_s_at DLL1 224215_s_at 56 MYC 202431_s_at 142 211571_s_at 227938_s_at MYCBP 203359_s_at 143 215646_s_at DPEP1 205983_at 57 203360_s_at 221731_x_at EDN1 218995_s_at 62 203361_s_at VEGFA 210512_s_at 238 222802_at MYCN 209756_s_at 144 210513_s_at EGFR 1565483_at 64 209757_s_at 211527_x_at 1565484_x_at 211377_x_at 212171_x_at 201983_s_at 234376_at VEGFB 203683_s_at 239 
201984_s_at NANOG 220184_at 146 VEGFC 209946_at 240 210984_x_at NKD1 1553115_at 150 WNT10A 223709_s_at 242 211550_at 229481_at 229154_at 211551_at 232203_at ZBTB7C 217675_at 246 211607_x_at NOS2 210037_s_at 151 ZBTB7C 227782_at 246 EPHB2 209588_at 67 NOTUM 228649_at 152 PATZ1 209431_s_at 263 209589_s_at NRCAM 204105_s_at 153 211391_s_at 210651_s_at 216959_x_at 210581_x_at 211165_x_at NUAK2 220987_s_at 156 209494_s_at EPHB3 1438_at 68 PDGFB 204200_s_at 159 ZNRF3 226360_at 248 204600_at 216061_x_at ETS2 201328_at 71 217112_at 201329_s_at
[0125] Subsequently, the Bayesian network was constructed similar to
[0126] The trained Bayesian networks were then tested on various data sets to infer the probability P(Wnt On) that the Wnt pathway is “on”, i.e., active, which is taken equal to the inferred probability that the Wnt pathway transcription complex is “present”. Summarized results of the trained broad literature model and the evidence curated model are shown in
[0127] It can be deduced that the broad literature model generally predicts more extreme probabilities for Wnt signaling being on or off. In addition, the alternative model predicts similar results for the colon cancer data sets (GSE20916, GSE4183, GSE15960), but more samples than expected with predicted active Wnt signaling in the breast cancer (GSE12777), liver cancer (GSE9843) and medulloblastoma (GSE10327) data sets.
[0128] In conclusion, the broad literature target gene list results in approximately equally good predictions of Wnt activity in colon cancer, but in worse predictions (too many false positives) in other cancer types. This might be a result of the alternative list of target genes being biased too much towards colon cells specifically, and thus being too tissue specific; the main interest of both de Sousa E Melo et al. and Hatzis et al. was colorectal cancer, although non-colon-specific Wnt target genes may be included. In addition, non-Wnt-specific target genes possibly included in these lists may be a source of the worsened predictions of Wnt activity in other cancer types. The alternative list is likely to contain more indirectly regulated target genes, which probably makes it more tissue specific. The original list is tuned towards containing direct target genes, which are most likely to represent genes that are Wnt sensitive in all tissues, thus reducing tissue specificity.
Example 5: Training and Using the Bayesian Network
[0129] Before the Bayesian network can be used to infer pathway activity in a test sample, the parameters describing the probabilistic relationships between the network elements have to be determined. Furthermore, in case of discrete states of the input measurements, thresholds have to be set that describe how to do the discretization.
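As an illustration of the discretization step, the following Python sketch maps a continuous probeset intensity to a discrete "low"/"high" observation. The threshold rule shown (the midpoint between the mean intensities of the inactive and the active training samples) is an assumption chosen purely for illustration; the text above does not prescribe this rule, and all names and numbers are hypothetical.

```python
from statistics import mean


def midpoint_threshold(inactive_values, active_values):
    """Assumed rule: midpoint between the two training group means."""
    return (mean(inactive_values) + mean(active_values)) / 2.0


def discretize(value, threshold):
    """Map a continuous intensity to a discrete state."""
    return "high" if value >= threshold else "low"


# Made-up log2 intensities of one probeset in the two training groups.
inactive = [6.1, 6.4, 5.9, 6.2]
active = [8.8, 9.1, 9.4, 8.7]

threshold = midpoint_threshold(inactive, active)
state_of_test_sample = discretize(8.0, threshold)
```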
[0130] Typically, Bayesian networks are trained using a representative set of training samples, of which preferably all states of all network nodes are known. However, it is impractical to obtain training samples from many different kinds of cancers, of which it is known what the activation status is of the pathway to be modeled. As a result, available training sets consist of a limited number of samples, typically from one type of cancer only. To allow the Bayesian network to generalize well to other types of samples, one therefore has to pay special attention to the way the parameters are determined, which is preferably done as follows in the approach described herein.
[0131] For the TF node, the (unconditional) probability of being in state “absent” and “present” is given by the expected occurrence on a large set of samples. Alternatively, one can set them to 0.5, as is done in
[0132] For the target gene nodes, the conditional probabilities are set as in
[0133] For the Bayesian network model as given in
[0134] After the Bayesian network has been trained, it can be applied on a test sample as follows, considering the Bayesian network of
[0135] Next, this hard or soft evidence is supplied to a suitable inference engine for Bayesian networks, for instance based on a junction tree algorithm (see (Neapolitan, 2004)). Such an engine can then infer the updated probability of the TF element being “absent” or “present”, given the provided evidence. The inferred probability of the TF element being “present” is then interpreted as the estimated probability that the respective pathway is active.
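The inference step can be sketched in Python for a three-layer network (TF node, target gene nodes, probeset nodes) with hard evidence. Because such a network is a tree, the sketch sums out each gene node directly instead of using a junction tree engine; it is a minimal illustration, not the implementation used herein. The state names ("down"/"up", "low"/"high") and all probability values are assumptions, and soft evidence is not handled.

```python
def infer_tf_present(prior_present, genes, evidence):
    """Return P(TF = present | evidence) for a TF -> gene -> probeset tree.

    prior_present: P(TF = present).
    genes: {gene: {"p_up": (P(up | TF absent), P(up | TF present)),
                   "probesets": {id: (P(high | gene down), P(high | gene up))}}}
    evidence: {probeset id: "high" or "low"}; unobserved probesets are omitted.
    """
    unnormalized = []
    for tf_state, prior in ((0, 1.0 - prior_present), (1, prior_present)):
        likelihood = 1.0
        for gene in genes.values():
            p_up = gene["p_up"][tf_state]
            gene_term = 0.0
            # Sum over the hidden gene state (0 = down, 1 = up).
            for gene_state, p_gene in ((0, 1.0 - p_up), (1, p_up)):
                p_obs = 1.0
                for probeset, p_high in gene["probesets"].items():
                    if probeset in evidence:
                        p = p_high[gene_state]
                        p_obs *= p if evidence[probeset] == "high" else 1.0 - p
                gene_term += p_gene * p_obs
            likelihood *= gene_term
        unnormalized.append(prior * likelihood)
    return unnormalized[1] / (unnormalized[0] + unnormalized[1])


# Two Wnt target genes with probesets from Table 1; probabilities are made up.
GENES = {
    "AXIN2": {"p_up": (0.05, 0.95),
              "probesets": {"222695_s_at": (0.1, 0.9), "222696_at": (0.1, 0.9)}},
    "LGR5": {"p_up": (0.05, 0.95),
             "probesets": {"210393_at": (0.1, 0.9), "213880_at": (0.1, 0.9)}},
}

all_high = {p: "high" for g in GENES.values() for p in g["probesets"]}
all_low = {p: "low" for p in all_high}

p_on_high = infer_tf_present(0.5, GENES, all_high)
p_on_low = infer_tf_present(0.5, GENES, all_low)
p_on_none = infer_tf_present(0.5, GENES, {})
```

With no evidence the function returns the prior, and consistently high (or low) probeset observations push the inferred probability towards "present" (or "absent").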
[0136] Preferably, the training of the Bayesian network models of the Wnt, ER, Hedgehog and AR pathways is done using public data available on the Gene Expression Omnibus (accessible at ncbi.nlm.nih.gov/geo/, cf. above).
[0137] The Wnt Bayesian network was trained, by way of example, using 32 normal colon samples considered to have an inactive Wnt pathway and 32 confirmed adenoma samples known to have an active Wnt pathway (GSE8671 data set).
[0138] The Bayesian network model of the ER pathway was trained, by way of example, using 4 estrogen-deprived MCF7 samples, known to have an inactive ER pathway, and 4 estrogen-stimulated MCF7 samples, regarded to have an active ER pathway, from the GSE8597 data set, also accessible at the Gene Expression Omnibus.
[0139] The Bayesian network model of the Hedgehog pathway was trained, by way of example, using 15 basal cell carcinoma samples confirmed to have an active Hedgehog pathway and 4 normal skin cell samples representing samples with an inactive Hedgehog pathway, available in the GSE7553 data set.
[0140] The Bayesian network model of the AR pathway was trained, by way of example, using 3 samples with positive AR activity, namely LNCaP cell lines stimulated with dihydrotestosterone (DHT), a potent AR pathway activator, and 3 non-stimulated LNCaP cell lines representing the inactive AR pathway case.
[0141] With reference to
[0142]
[0143] In
[0144] Further details and examples for using trained Bayesian networks (e.g. of Wnt, ER, AR and Hedgehog pathway) to predict the respective pathway activities are explained in Example 6 below.
[0145] The above-mentioned training process can be applied to other Bayesian networks for clinical applications. Here it is shown and proven to work for the Bayesian network models constructed using the method disclosed herein and representing cellular signaling pathways, more specifically the Wnt, ER, AR and Hedgehog pathways.
Example 6: Diagnosis of (abnormal) Pathway Activity
[0146] The following will provide an exemplary illustration of how to use e.g. Bayesian network models to diagnose the activity of a cellular signaling pathway.
[0147] The Bayesian networks of the Wnt, ER, Hedgehog and AR pathway, constructed using a node for the transcription factor presence, a layer of nodes representing the target genes’ mRNA and a layer of nodes representing the probesets’ intensities corresponding to the target genes (Table 1, Table 2, Table 3 and Table 4), analogous to
[0148] With reference to
[0149]
[0150] The Bayesian network model used in the experiments reported herein was trained using the colon samples data set GSE8671. However, the Wnt pathway is present (albeit possibly inactive) in other cell types. It was therefore considered possible that the Bayesian network might be applicable to infer abnormally high Wnt pathway activity correlative with other types of cancers. The rationale for this is that, although the Bayesian network model was trained using colon samples, it is based on first principles of the operation of the Wnt pathway present (albeit possibly inactive) in other cell types.
[0151]
[0152] The test results using the Wnt Bayesian network model in a data set containing liver cancer samples (GSE9843) are shown in
[0153] About one in five of the samples labeled “Proliferation” have P(Wnt On)>0.5. Proliferation suggests a state of rapid cellular multiplication. Such a state may be associated with abnormally high Wnt pathway activity, but may also be associated with numerous other possible causes of cell proliferation. Accordingly, about one in five of these samples having abnormally high Wnt pathway activity is not an unreasonable result.
[0154] About one half of the samples of the “CTNNB1” group are inferred by the Bayesian network to have abnormally high Wnt pathway activity. The CTNNB1 gene encodes the beta-catenin protein, which is a regulatory protein of the Wnt pathway, and activating mutations in this gene cause abnormal Wnt activation. Thus, a correlation between the “CTNNB1” group and high Wnt pathway activity conforms to expectation.
[0155]
[0156]
[0157] The test results of the predictions of the ER Bayesian network trained on breast cancer cell lines for a set of cancer samples (GSE12276) are shown in
[0158] The ER Bayesian network model constructed and trained as described herein is used to predict the ER pathway activity in a large panel of cell lines of various cancers; the results are shown in
[0159] The Bayesian network model constructed and trained for the Hedgehog pathway as described herein is used to predict the activity of the Hedgehog pathway for cell lines of various cancer types in the GSE34211-data set. The Hedgehog activity predictions are shown in
[0160]
[0161] The Hedgehog activity predicted by the Hedgehog Bayesian network model in the GSE12276 breast cancer samples, which were used earlier to predict the ER activity with the ER Bayesian network model, is shown in
[0162] In summary, the test results for various cancerous tissue samples and cells presented in
[0163] Although the results of
[0164] The AR Bayesian network model constructed and trained as described herein was used, by way of example, to predict the AR activity in LNCaP prostate cancer cell lines treated with different treatment regimes (GSE7708) (see
[0165] The trained Bayesian network of the AR pathway as described herein was also used to predict the probability that the AR pathway is active in prostate cancer samples from the GSE17951 data set (results are shown in
[0166] The AR Bayesian network model was also applied to a cross-tissue test, viz. the breast cancer samples included in the GSE12276 data set. Results for this test are shown in
[0167] The above-mentioned AR Bayesian network model was also used to predict the AR pathway’s activity in two sets of cell line samples of various cancer types (GSE36133 and GSE34211) as depicted in
TABLE-US-00012 Known and predicted AR activity in prostate cancer cell lines in GSE36133 and GSE34211 data sets
Data set  Sample identifier  Prostate cell line    Known to be active?  P(AR on)
36133     GSM886837          22Rv1                 YES                  0.698127
36133     GSM886988          DU 145                NO                   0.001279
36133     GSM887271          LNCaP clone FGC       YES                  1
36133     GSM887302          MDA PCa 2b            YES                  1
36133     GSM887440          NCI-H660              NO                   1.25E-05
36133     GSM837506          PC-3                  NO                   0.009829
36133     GSM887731          VCaP                  YES                  1
34211     GSM843494          DU145T                NO                   0.005278
34211     GSM844559          HPET11                NO                   0.005602
34211     GSM844560          HPET13 replicate 1    NO                   0.003382
34211     GSM844561          HPET13 replicate 2    NO                   0.000501
34211     GSM844562          HPETS                 NO                   0.007673
34211     GSM844579          LNCAP                 YES                  1
34211     GSM844674          PC3 PFIZER            NO                   0.004066
34211     GSM844675          PC3 Good_NCl50_WYETH  NO                   0.006163
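The agreement between known and predicted AR activity in Table 12 can be checked with a few lines of Python. The sample identifiers, known states and probabilities below are copied from the table; the decision threshold of 0.5 and the function name are assumptions made for this sketch.

```python
# (sample identifier, known to be active?, P(AR on)) as listed in Table 12.
TABLE_12 = [
    ("GSM886837", True, 0.698127),
    ("GSM886988", False, 0.001279),
    ("GSM887271", True, 1.0),
    ("GSM887302", True, 1.0),
    ("GSM887440", False, 1.25e-05),
    ("GSM837506", False, 0.009829),
    ("GSM887731", True, 1.0),
    ("GSM843494", False, 0.005278),
    ("GSM844559", False, 0.005602),
    ("GSM844560", False, 0.003382),
    ("GSM844561", False, 0.000501),
    ("GSM844562", False, 0.007673),
    ("GSM844579", True, 1.0),
    ("GSM844674", False, 0.004066),
    ("GSM844675", False, 0.006163),
]


def concordance(rows, threshold=0.5):
    """Count samples whose thresholded prediction matches the known state."""
    agree = sum((p > threshold) == known for _, known, p in rows)
    return agree, len(rows)


agree, total = concordance(TABLE_12)
print(f"{agree} of {total} cell lines concordant")
```

Under this assumed threshold, every cell line in the table is classified in agreement with its known AR status.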
Example 7: Prognosis Based on Pathway Activity
[0168] Early developmental pathways, like Wnt and Hedgehog, are thought to play a role in metastasis caused by cancer cells which have reverted to a more stem cell like phenotype, called cancer stem cells. Indeed, sufficient evidence is available that early developmental pathways, such as the Wnt pathway, play a role in cancer metastasis, enabling metastatic cancer cells to start dividing in the seeding location in another organ or tissue. Metastasis is associated with a bad prognosis; thus, activity of early developmental pathways, such as the Wnt and Hedgehog pathways, in cancer cells is expected to be predictive of a bad prognosis. This is supported by the fact that breast cancer patients from the GSE12276 data set who were identified, using the Bayesian network models described herein, as having an active ER pathway but not an active Wnt or Hedgehog pathway had a better prognosis than patients identified as having an active Hedgehog pathway, an active Wnt pathway, or both, as illustrated by the Kaplan-Meier plot in
Example 8: Therapy Planning, Prediction of Drug Efficacy, Prediction of Adverse Effects and Monitoring of Drug Efficacy
[0169] The following example illustrates how to use the probabilistic models, in particular Bayesian network models, for therapy planning, prediction of drug efficacy, monitoring of drug efficacy and related activities.
[0170] The Bayesian network model of the ER pathway, constructed using a node for the transcription factor presence, a layer of nodes representing the target genes' mRNA levels (Table 2) and a layer of nodes representing the probesets' intensities corresponding to the target genes (Table 2), analogous to
[0171] Tamoxifen is a drug currently used for the treatment of ER+ (estrogen receptor positive) breast cancer. It acts as a partial antagonist of the estrogen receptor, inhibiting the uncontrolled cell proliferation which is thought to be induced by ER signaling. Unfortunately, not every breast cancer responds to treatment with Tamoxifen, despite the demonstration of the presence of ER protein in cancer cells by routine histopathology analysis of cancer tissue slides. Many studies have been conducted to investigate this so-called Tamoxifen resistance. The publicly available GSE21618 data set is the result of one such study and contains microarray data of Tamoxifen resistant and wildtype MCF7 cell lines under different treatment regimes. The ER Bayesian network model constructed and trained as described herein is used to analyze the Tamoxifen resistant and wildtype MCF7 cell lines under different treatment regimes; the results are depicted in
[0172] The control Tamoxifen resistant cell line, indicated by TamR.Ctrl, is predicted to have an inactive ER pathway for every time point after Tamoxifen addition (1, 2, 3, 6, 12, 24, and 48 h). It is not surprising that treatment of the Tamoxifen resistant cell line, that is insensitive to Tamoxifen treatment, with Tamoxifen, indicated by TamR.Tam, is ineffective, which is also illustrated by the predicted inactivity of the ER pathway for this group over the same time points. According to analysis of the Tamoxifen resistant cell line (TamR.Ctrl) the driving force of the uncontrolled cell proliferation is not due to active ER signaling; therefore treating it with an ER antagonist will not inhibit cell proliferation. This illustrates that treatment with Tamoxifen is not recommended in case of a negative predicted ER pathway activity.
[0173] On the other hand, the wild type MCF7 cell line, known to be Tamoxifen sensitive, treated with 17 beta-estradiol (wtl.E2) slowly reacts to the hormone treatment which is visible in the increasing ER positive activity predictions. Treating such a cell line with aromatase inhibitors that are known to inhibit estrogen production will inhibit the ER pathway which is illustrated by the decreasing ER pathway prediction in time. Supporting this are the ER pathway predictions made based on the microarray data from MCF7 samples treated with estrogen for increasing time in the GSE11324 data set, results shown in
[0174] The above illustrates the ability of the probabilistic models, in particular the Bayesian network models, to be used for therapy planning, drug efficacy prediction, and monitoring of drug efficacy. It is to be understood, however, that the same methodology would also apply to the prediction and monitoring of adverse effects.
Example 9: Drug Development
[0175] In a manner similar to therapy response monitoring, a pathway model can be used in drug development to assess the effectiveness of various putative compounds. For instance, when screening many compounds for a possible effect on a certain pathway in a cancer cell line, the respective pathway model can be used to determine whether or not the activity of the pathway goes up or down after application of the compound. Often, this check is done using only one or a few putative markers of the pathway’s activity, which increases the chance of ineffective monitoring of the treatment effect. Furthermore, in follow-up studies on animal or patient subjects, the pathway models can be used similarly to assess the effectiveness of candidate drugs, and to determine an optimal dose to maximally impact pathway activity.
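A compound screen of this kind can be sketched in Python as a comparison of the inferred pathway activity before and after treatment, followed by a simple dose selection. The compound names, probabilities, doses and function names are all hypothetical, and choosing the smallest dose that reaches the lowest inferred activity is only one possible reading of "optimal dose".

```python
def rank_compounds(screen):
    """Sort compounds by change in P(pathway active), largest decrease first.

    screen: {compound: (P(active) before treatment, P(active) after treatment)}
    """
    return sorted(screen, key=lambda c: screen[c][1] - screen[c][0])


def best_dose(dose_response):
    """Return the smallest dose that reaches the lowest inferred activity.

    dose_response: {dose: P(active) after treatment at that dose}
    """
    return min(dose_response.items(), key=lambda item: (item[1], item[0]))[0]


# Made-up screening results for a pathway inhibitor screen.
screen = {
    "compound_A": (0.95, 0.10),
    "compound_B": (0.95, 0.90),
    "compound_C": (0.95, 0.40),
}
ranking = rank_compounds(screen)

# Made-up dose series for the top-ranked compound.
dose = best_dose({1.0: 0.8, 10.0: 0.2, 100.0: 0.2})
```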
[0176] An example of ineffective monitoring of new drug compounds is illustrated by the predicted AR pathway activity in the GSE7708 samples as shown in
Example 10: Assay Development
[0177] Instead of applying the above-mentioned Bayesian networks to mRNA input data coming from microarrays or RNA sequencing, it may be beneficial in clinical applications to develop dedicated assays to perform the sample measurements, for instance on an integrated platform using qPCR to determine mRNA levels of target genes. The RNA/DNA sequences of the disclosed target genes can then be used to determine which primers and probes to select on such a platform.
[0178] Validation of such a dedicated assay can be done by using the microarray-based Bayesian networks as a reference model, and verifying whether the developed assay gives similar results on a set of validation samples. Besides validating a dedicated assay, the same approach can be used to build and calibrate similar Bayesian network models using mRNA-sequencing data as input measurements.
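Such a validation can be sketched in Python as a comparison of the pathway activity probabilities produced by the reference model and by the new assay on the same validation samples. The two summary measures used here (mean absolute difference and agreement of thresholded calls), the threshold of 0.5 and all sample values are assumptions for illustration.

```python
def compare_assays(reference, candidate, threshold=0.5):
    """Compare two {sample: P(pathway active)} mappings on their shared samples.

    Returns (mean absolute difference, fraction of samples with the same call).
    """
    shared = sorted(set(reference) & set(candidate))
    if not shared:
        raise ValueError("no shared validation samples")
    mean_abs_diff = sum(abs(reference[s] - candidate[s]) for s in shared) / len(shared)
    same_call = sum(
        (reference[s] > threshold) == (candidate[s] > threshold) for s in shared
    )
    return mean_abs_diff, same_call / len(shared)


# Made-up probabilities for four validation samples.
microarray_model = {"s1": 0.9, "s2": 0.1, "s3": 0.6, "s4": 0.2}
qpcr_assay = {"s1": 0.8, "s2": 0.2, "s3": 0.4, "s4": 0.1}

mean_abs_diff, call_agreement = compare_assays(microarray_model, qpcr_assay)
```

In this made-up example, sample s3 is called active by the reference model but inactive by the assay, which is the kind of discrepancy a validation study would need to examine.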
Example 11: Pathway Research and Cancer Pathophysiology Research
[0179] The following will illustrate how Bayesian network models can be employed in (clinical) pathway research, that is, research aimed at finding out which pathways are involved in certain diseases, which can be followed up by more detailed research, e.g. to link mutations in signaling proteins to changes in pathway activation (measured with the model). This is relevant for investigating the initiation, growth, evolution and metastasis of specific cancers (the pathophysiology).
[0180] The Bayesian network models of the Wnt, ER, Hedgehog and AR pathway, constructed using a node for the transcription factor presence, a layer of nodes representing the target genes' mRNA levels (Table 1, Table 2, Table 3 and Table 4) and a layer of nodes representing the probesets' intensities corresponding to the target genes (Table 1, Table 2, Table 3 and Table 4), analogous to
[0181] Suppose the researcher is interested in looking into the cellular signaling pathway or pathways and the specific deregulation(s) that drive(s) the uncontrolled cell proliferation. The researcher can analyze the microarray data using the above mentioned probabilistic models, in particular the Bayesian network models, to find which pathways are presumably the cause of uncontrolled cell proliferation. Shown in
[0182] With reference to
[0183] Another example is given in
[0184] In summary, the illustrations described herein indicate the ability of trained Bayesian network models (as described above) to support the process of finding the cause of uncontrolled cell proliferation in a more directed manner. By employing the Bayesian networks to screen the samples for pathway activities, the predicted pathway activities can pinpoint the pathways possibly responsible for the cell proliferation, which can be followed up for more detailed research, e.g. to link mutations in signaling proteins or other known deregulations to changes in activation (as measured with the model).
[0185] As described herein, the process to develop and train a Bayesian network of cellular signaling pathways can be used to construct a Bayesian network model for other pathways that could also be employed in connection with the present invention.
Example 12: Enrollment of Subject in a Clinical Trial Based on Predicted Activity
[0186] If a candidate drug is developed to, for instance, block the activity of a certain pathway that drives tumor growth, and this drug is going into clinical trial, then a proper selection of the subjects to enroll in such a trial is essential to prove potential effectiveness of the drug. In such a case, patients that do not have the respective pathway activated in their tumors should be excluded from the trial, as it is obvious that the drug cannot be effective if the pathway is not activated in the first place. Hence, a pathway model that can predict pathway activity can be used as a selection tool, to only select those patients that are predicted to have the respective pathway activated.
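The selection step can be sketched in Python as a simple filter on the model output. The function name `select_for_enrollment`, the 0.5 threshold, and the patient identifiers and probabilities are hypothetical and serve only to illustrate the idea.

```python
from typing import Dict, List


def select_for_enrollment(
    predicted_activity: Dict[str, float],
    threshold: float = 0.5,  # hypothetical cutoff for calling the pathway "active"
) -> List[str]:
    """Return the identifiers of patients whose tumor is predicted to have
    the targeted pathway active, sorted for reproducible output."""
    return sorted(
        patient_id
        for patient_id, probability in predicted_activity.items()
        if probability >= threshold
    )


# Made-up model outputs: probability that the pathway is active per patient.
candidates = {"P001": 0.92, "P002": 0.08, "P003": 0.55, "P004": 0.49}
enrolled = select_for_enrollment(candidates)
```

In practice the threshold would have to be chosen and validated for the specific pathway model and trial design.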
Example 13: Selection of Subsequent Test(s) to Be Performed
[0187] If a tumor is analyzed using different pathway models, and the models predict deregulation of a certain pathway, then this may guide the selection of subsequent tests to be performed. For instance, one may run a proximity ligation assay (PLA) to confirm the presence of the respective transcription complex (Soderberg O, 2006). Such a PLA can be designed to give a positive result if two key proteins in a TF complex have indeed bound together, for instance beta-catenin and TCF4 in the TF complex of the Wnt pathway.
[0188] Another example is that the pathway predicted to be deregulated is analyzed in more detail with respect to the signaling cascade. For instance, one may analyze key proteins in this pathway to determine whether there are mutations in the DNA regions encoding for their respective genes, or one may test for the abundance of these proteins to see whether they are higher or lower than normal. Such tests may indicate what the root cause is behind the deregulation of the pathway, and give insights on which available drugs could be used to reduce activity of the pathway.
[0189] These tests are selected to confirm the activity of the pathway as identified using the Bayesian model. However, selection of companion diagnostic tests is also possible. After identification of the pathway using the model, only those companion diagnostic tests that are applicable to the identified pathway (the selection) need to be performed for targeted therapy choice.
Example 14: Selection of Companion Diagnostics Tests
[0190] Similar to the previous example, if a tumor is analyzed and the pathway models predict deregulation of a certain pathway, and optionally a number of additional tests have been performed to investigate the cause of deregulation, then an oncologist may select a number of candidate drugs to treat the patient. However, treatment with such a drug may require a companion diagnostic test to be executed first, for instance to comply with clinical guidelines or to ensure reimbursement of the treatment costs, or because a regulatory agency (FDA) requires that the companion diagnostic test be performed prior to giving the drug. An example of such a companion diagnostic test is the Her2 test for treatment of breast cancer patients with the drug Herceptin (Trastuzumab). Hence, the outcome of the pathway models can be used to select the candidate drugs and the respective companion diagnostic tests to be performed.
Example 15: CDS Application
[0191] With reference to
[0192] The CDS system (10) receives as input information pertaining to a medical subject (e.g., a hospital patient, or an outpatient being treated by an oncologist, physician, or other medical personnel, or a person undergoing cancer screening or some other medical diagnosis who is known or suspected to have a certain type of cancer such as colon cancer, breast cancer, or liver cancer, or so forth). The CDS system (10) applies various data analysis algorithms to this input information in order to generate clinical decision support recommendations that are presented to medical personnel via the display device (14) (or via a voice synthesizer or other device providing human-perceptible output). In some embodiments, these algorithms may include applying a clinical guideline to the patient. A clinical guideline is a stored set of standard or “canonical” treatment recommendations, typically constructed based on recommendations of a panel of medical experts and optionally formatted in the form of a clinical “flowchart” to facilitate navigating through the clinical guideline. In various embodiments the data processing algorithms of the CDS (10) may additionally or alternatively include various diagnostic or clinical test algorithms that are performed on input information to extract clinical decision recommendations, such as machine learning methods disclosed herein.
[0193] In the illustrative CDS systems disclosed herein (e.g., CDS system (10)), the CDS data analysis algorithms include one or more diagnostic or clinical test algorithms that are performed on input genomic and/or proteomic information acquired by one or more medical laboratories (18). These laboratories may be variously located “on-site”, that is, at the hospital or other location where the medical subject is undergoing medical examination and/or treatment, or “off-site”, e.g. a specialized and centralized laboratory that receives (via mail or another delivery service) a sample of tissue of the medical subject that has been extracted from the medical subject (e.g., a sample obtained from a breast lesion, or from a colon of a medical subject known or suspected of having colon cancer, or from a liver of a medical subject known or suspected of having liver cancer, or so forth, via a biopsy procedure or other sample extraction procedure). The tissue of which a sample is extracted may also be metastatic tissue, e.g. (suspected) malignant tissue originating from the colon, breast, liver, or other organ that has spread outside of the colon, breast, liver, or other organ. In some cases, the tissue sample may be circulating tumor cells, that is, tumor cells that have entered the bloodstream and may be extracted as the extracted tissue sample using suitable isolation techniques. The extracted sample is processed by the laboratory to generate genomic or proteomic information. 
For example, the extracted sample may be processed using a microarray (also variously referred to in the art as a gene chip, DNA chip, biochip, or so forth) or by quantitative polymerase chain reaction (qPCR) processing to measure probative genomic or proteomic information such as expression levels of genes of interest, for example in the form of a level of messenger ribonucleic acid (mRNA) that is transcribed from the gene, or a level of a protein that is translated from the mRNA transcribed from the gene. As another example, the extracted sample may be processed by a gene sequencing laboratory to generate sequences for deoxyribonucleic acid (DNA), or to generate an RNA sequence, copy number variation, or so forth. Other contemplated measurement approaches include immunohistochemistry (IHC), cytology, fluorescence in situ hybridization (FISH), proximity ligation assay or so forth, performed on a pathology slide. Other information that can be generated by microarray processing, mass spectrometry, gene sequencing, or other laboratory techniques includes methylation information. Various combinations of such genomic and/or proteomic measurements may also be performed.
[0194] In some embodiments, the medical laboratories (18) perform a number of standardized data acquisitions on the extracted sample of the tissue of the medical subject, so as to generate a large quantity of genomic and/or proteomic data. For example, the standardized data acquisition techniques may generate an (optionally aligned) DNA sequence for one or more chromosomes or chromosome portions, or for the entire genome of the tissue. Applying a standard microarray can generate thousands or tens of thousands of data items such as expression levels for a large number of genes, various methylation data, and so forth. This plethora of genomic and/or proteomic data, or selected portions thereof, is input to the CDS system (10) to be processed so as to develop clinically useful information for formulating clinical decision support recommendations.
[0195] The disclosed CDS systems and related methods relate to processing of genomic and/or proteomic data to assess activity of various cellular signaling pathways. However, it is to be understood that the disclosed CDS systems (e.g., CDS system (10)) may optionally further include diverse additional capabilities, such as generating clinical decision support recommendations in accordance with stored clinical guidelines based on various patient data such as vital sign monitoring data, patient history data, patient demographic data (e.g., gender, age, or so forth), patient medical imaging data, or so forth. Alternatively, in some embodiments the capabilities of the CDS system (10) may be limited to only performing genomic and/or proteomic data analyses to assess cellular signaling pathways as disclosed herein.
[0196] With continuing reference to exemplary
[0197] Measurement of mRNA expression levels of genes that encode for regulatory proteins of the cellular signaling pathway, such as an intermediate protein that is part of a protein cascade forming the cellular signaling pathway, is an indirect measure of the regulatory protein expression level and may or may not correlate strongly with the actual regulatory protein expression level (much less with the overall activity of the cellular signaling pathway). The cellular signaling pathway directly regulates the transcription of the target genes, so the expression levels of mRNA transcribed from the target genes are a direct result of this regulatory activity. Hence, the CDS system (10) infers activity of the cellular signaling pathway (e.g., the Wnt, ER, AR and Hedgehog pathways) based at least on expression levels of target genes (mRNA or protein level as a surrogate measurement) of the cellular signaling pathway. This ensures that the CDS system (10) infers the activity of the pathway based on direct information provided by the measured expression levels of the target genes.
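The inference from target gene measurements to pathway activity can be illustrated with a minimal Python sketch of a three-layer Bayesian network (transcription factor node, target gene nodes, probeset nodes). Everything numeric here is an assumption for illustration: the conditional probabilities, the prior, and the binary discretization are not the calibrated parameters of the disclosed models. The gene names are taken from the AR target gene list; the function name `posterior_tf_active` is hypothetical.

```python
from typing import Dict, List, Tuple


def posterior_tf_active(
    observations: Dict[str, List[bool]],
    prior_active: float = 0.5,
    # P(target gene "up" | TF inactive), P(target gene "up" | TF active): assumed values
    p_gene_up_given_tf: Tuple[float, float] = (0.1, 0.9),
    # P(probeset "high" | gene "down"), P(probeset "high" | gene "up"): assumed values
    p_probe_high_given_gene: Tuple[float, float] = (0.2, 0.8),
) -> float:
    """Posterior probability that the transcription factor is active.

    `observations` maps each target gene to the discretized readings
    (True = high intensity) of its probesets. The hidden gene states are
    marginalized out; genes are conditionally independent given the TF state.
    """
    likelihood = {}
    for tf_state in (0, 1):
        total = 1.0
        p_up = p_gene_up_given_tf[tf_state]
        for probes in observations.values():
            gene_sum = 0.0
            for gene_state, p_gene in ((0, 1.0 - p_up), (1, p_up)):
                p_high = p_probe_high_given_gene[gene_state]
                p_obs = 1.0
                for high in probes:
                    p_obs *= p_high if high else 1.0 - p_high
                gene_sum += p_gene * p_obs
            total *= gene_sum
        likelihood[tf_state] = total
    numerator = prior_active * likelihood[1]
    return numerator / (numerator + (1.0 - prior_active) * likelihood[0])


# Made-up discretized readings for three AR target genes, one probeset each.
sample = {"KLK2": [True], "TMPRSS2": [True], "FKBP5": [True]}
p_active = posterior_tf_active(sample)
```

With these assumed parameters, consistent "high" readings across several target genes push the posterior toward an active pathway, whereas a single gene alone moves it only moderately.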
[0198] However, although, as disclosed herein, being effective for assessing activity of the overall pathways, the measured expression levels (20) of target genes of the pathways are not especially informative as to why the pathways are operating abnormally (if indeed that is the case). Said another way, the measured expression levels (20) of target genes of a pathway can indicate that the pathway is operating abnormally, but do not indicate what portion of the pathway is malfunctioning (e.g., lacks sufficient regulation) in order to cause the overall pathway to operate abnormally.
[0199] Accordingly, if the CDS system (10) detects abnormal activity of a particular pathway, the CDS system (10) then optionally makes use of other information provided by the medical laboratories (18) for the extracted sample, such as aligned genetic sequences (22) and/or measured expression level(s) for one or more regulatory genes of the pathway (24), or selects the diagnostic test to be performed next, in order to assess what portion of the pathway is malfunctioning. To maximize efficiency, in some embodiments this optional assessment of why the pathway is malfunctioning is performed only if the analysis of the measured expression levels (20) of target genes of the pathway indicates that the pathway is operating abnormally. In other embodiments, this assessment is integrated into the probabilistic analysis of the cellular signaling pathway described herein.
[0200] In embodiments in which the CDS system (10) assesses what portion of the pathway is malfunctioning, and is successful in doing so, the additional information enables the CDS system (10) to recommend prescribing a drug targeting the specific malfunction (recommendation (26) shown in
Example 16: A Kit and Analysis Tools to Measure Pathway Activity
[0201] The set of target genes which are found to best indicate specific pathway activity, based on microarray/RNA sequencing based investigation using the Bayesian model, can be translated into a multiplex quantitative PCR assay to be performed on a tissue or cell sample. To turn such an assay into an FDA-approved test for pathway activity, a standardized test kit must be developed and clinically validated in clinical trials to obtain regulatory approval.
[0202] In general, it is to be understood that while examples pertaining to the Wnt, the ER, the AR and/or the Hedgehog pathway(s) are provided as illustrative examples, the approaches for cellular signaling pathway analysis disclosed herein are readily applied to other cellular signaling pathways besides these pathways, such as to intracellular signaling pathways with receptors in the cell membrane (cf. above) and intracellular signaling pathways with receptors inside the cell (cf. above). In addition, this application describes several preferred embodiments. Modifications and alterations may occur to others upon reading and understanding the preceding detailed description. It is intended that the application be construed as including all such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Literature
[0203] de Sousa E Melo F, C. S. (2011). Methylation of cancer-stem-cell-associated Wnt target genes predicts poor prognosis in colorectal cancer patients. Cell Stem Cell., 476-485.
[0204] Hatzis P, v. d. (2008). Genome-wide pattern of TCF7L2/TCF4 chromatin occupancy in colorectal cancer cells. Mol Cell Biol., 2732-2744.
[0205] Neapolitan, R. (2004). Learning Bayesian networks. Pearson Prentice Hall.
[0206] Nusse, R. (2012, May 1). Wnt target genes. Retrieved from The Wnt homepage: stanford.edu/group/nusselab/cgi-bin-wnt-targetgenes
[0207] Soderberg O, G. M. (2006). Direct observation of individual endogenous protein complexes in situ by proximity ligation. Nat Methods., 995-1000.
[0208] van de Wetering M, S. E.-P.-F. (2002). The beta-catenin/TCF-4 complex imposes a crypt progenitor phenotype on colorectal cancer cells. Cell, 241-250.