Set of tumor-markers
10100364 · 2018-10-16
Assignee
Inventors
- Klemens Vierlinger (Vienna, AT)
- Martin Lauss (Altenfelden, AT)
- Albert Kriegner (Vienna, AT)
- Christa NOEHAMMER (Vienna, AT)
Cpc classification
G16B25/10
PHYSICS
G16B40/00
PHYSICS
G01N33/57484
PHYSICS
G01N2800/60
PHYSICS
G16B25/00
PHYSICS
G16B25/20
PHYSICS
International classification
Abstract
The present invention provides a set of moieties specific for tumor markers, in particular of follicular thyroid carcinoma (FTC) and papillary thyroid carcinoma (PTC) as well as a method for identifying markers of any genetic disease.
Claims
1. A set of moieties comprising moieties specific for at least 3 tumor markers, wherein the three tumor markers are selected from the group consisting of SCEL and CD36 and either MDK or DPP6, wherein the moieties are immobilized on a solid support, wherein the moieties are oligonucleotides specific for nucleic acids of tumor marker genes, and wherein the set of moieties comprises 1000 moieties or less.
2. The set of claim 1, wherein at least one of the three tumor markers is MDK.
3. The set of claim 1, wherein at least one of the three tumor markers is DPP6.
4. The set of claim 1, further comprising at least a fourth moiety specific for a tumor marker selected from the markers of tables 1 to 6.
5. The set of claim 1, further comprising at least 10 moieties specific for the tumor markers selected from the markers of tables 1 to 6.
6. The set of claim 1, wherein the solid support is a microarray.
7. The set of claim 1, wherein at least 10% of all moieties immobilized on the solid support are oligonucleotides specific for nucleic acids of one or more tumor marker genes selected from BBS9, C13orf1, CBFA2T3, CDT1, CRK, CTPS, DAPK2, EIF5, EREG, GK, GPATCH8, HDGF, IRF2BP1, KRT83, MYOD1, NME6, POLE3, PPP1R13B, PRPH2, RASSF7, ROCK2, RTN1, S100B, SLIT2, SNRPB2, SPAG7, STAU1, SUPT5H, TBX10, TLK1, TM4SF4, TXN, UFD1L, ADH1B, AGR2, AGTR1, ALDH1A1, ALDH1A3, AMIGO2, ATP2C2, BID, C7orf24, CA4, CCL21, CD55, CDH16, CDH3, CFI, CHI3L1, CHST2, CITED2, CLCNKB, COMP, CTSH, DIO2, DIRAS3, DUSP4, DUSP5, EDN3, ENTPD1, FHL1, GDF15, GPM6A, HBA1, IRS1, KCNJ2, KCNN4, KLK10, LAMB3, LCN2, LMOD1, MATN2, MPPED2, MVP, NELL2, NFE2L3, NPC2, NRCAM, NRIP1, PAPSS2, PDLIM4, PDZK1IP1, PIP3-E, PLAU, PRSS2, PRSS23, RAP1GAP, S100A11, SFTPB, SLPI, SOD3, SPINT1, SYNE1, TACSTD2, UPP1, WASF3, APOE, ATIC, BASP1, C9orf61, CCL13, CDH6, CFB, CFD, CLDN10, COL11A1, COL13A1, CORO2B, CRLF1, CXorf6, DDB2, DPP6, ECM1, EFEMP1, ESRRG, ETHE1, FAS, FMOD, GABBR2, GALE, GATM, GDF10, GHR, GPC3, ICAM1, ID3, IER2, IGFBP6, IQGAP2, ITGA2, ITGA3, ITM2A, KIAA0746, LRIG1, LRP2, LY6E, MAPK13, MDK, MLLT11, MMRN1, MTMR11, MXRA8, NAB2, NMU, OCA2, PDE5A, PLAG1, PLP2, PLXNC1, PRKCQ, PRUNE, RAB27A, RYR2, SELENBP1, SORBS2, STMN2, TBC1D4, TNC, TPD52L1, TSC22D1, TTC30A, VLDLR, WFS1, AATF, ACOX3, AHDC1, ALAS2, ALKBH1, ANGPTL2, AP2A2, APOBEC3G, APRIN, ARNT, AZGP1, BAT2D1, BATF, BPHL, C14orf1, C2orf3, CBFB, CBR3, CBX5, CCNE2, CD46, CHPF, CHST3, CLCN2, CLCN4, CLIC5, CNOT2, COPS6, CPZ, CSK, CTDP1, DDEF2, DKFZP586H2123, DLG2, DPAGT1, DSCR1, DUSP8, EI24, ENOSF1, ERCC1, ERCC3, ERH, F13A1, FAM20B, FBP1, FCGR2A, FGF13, FGFR1OP, FLNC, FMO5, FRY, GADD45G, GCH1, GFRA1, GLB1, GOLGA8A, HCLS1, HRC, ICMT, IFNA5, IGF2BP3, IL12A, ITIH2, ITPKC, JMJD2A, KCNJ15, KCTD12, KIAA0652, KIAA0913, KLKB1, KRT37, LPHN3, LSR, MANBA, MAP7, MAPKAPK5, MET, MMP14, MX1, MYL9, MYO9B, NCOR1, NDRG4, NDUFA5, NEUROD2, NFKB2, NPY1R, NUP50, PDGFRA, PDHX, PDLIM1, PEX1, PEX13, PIB5PA, 
PICK1, PLEC1, POLE2, PPIF, PPP2R5A, PSCD2, PSMA5, PTPN12, PTPN3, PTPRCAP, QKI, RASAL2, RBM10, RBM38, RER1, RGL2, RHOG, RNASE1, RTN4, SCC-112, SDS, SF3B2, SH3PXD2A, SIX6, SLC10A1, SLC6A8, SMG6, SOX11, SPI1, SRGAP3, STX12, SYK, TAF4, TCN2, TGOLN2, TIA1, TOMM40, TXN2, UGCG, USP11, VDR, VEGFC, YWHAQ, ZNF140, WAS, LRP4, TFF3, ST3GAL6, STK39, DPP4, FABP4, GPR4, STAM2, QPCT, CDK7, SFTPD, CYB5R1, VWF, and HOXA4.
8. The set of claim 1, wherein the set comprises 700 moieties or less.
9. The set of claim 8, wherein the set comprises 500 moieties or less.
10. The set of claim 8, wherein the set comprises 300 moieties or less.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
(5)
DETAILED DESCRIPTION
EXAMPLES
Example 1
Datasets
(6) Datasets were downloaded either from websites or from public repositories (GEO, ArrayExpress). Table 7 shows a summary of the datasets used in this study (He et al. PNAS USA 102(52): 19075-80 (2005); Huang et al. PNAS USA 98(26): 15044-49 (2001); Jarzab et al. Cancer Res 65(4): 1587-97 (2005); Lacroix et al. Am J Pathol 167(1): 223-231 (2005); Weber et al. J Clin Endocrinol Metab 90(5): 2512-21 (2005)). Three categories of non-cancer tissue are used: contralateral (c.lat) for healthy surrounding tissue paired with a tumor sample, other disease (o.d.) for thyroid tissue operated on for another disease, and SN (Struma nodosa) for benign thyroid nodules. For all subsequent analyses these were combined as healthy.
(7) TABLE 7 Microarray Data used for Meta Analysis
Published                       | FTA | FTC | PTC | SN | o.d. | c.lat | Platform
He PNAS 2005                    |   0 |   0 |   9 |  0 |    0 |     9 | Affy U133plus
Huang PNAS 2001                 |   0 |   0 |   8 |  8 |    0 |     0 | Affy U133A
Jarzab Cancer Res 2005          |   0 |   0 |  23 |  0 |   11 |    17 | Affy U133A
Lacroix Am J Path 2005          |   4 |   8 |   0 | 11 |    0 |     0 | Agilent Custom
Reyes not published?            |   0 |   0 |   7 |  0 |    0 |     7 | Affy U133A
Weber J Clin Endocr Metabol 2005 |  12 |  12 |   0 |  0 |    0 |     0 | Affy U95A
Example 2
Finding the Gene Overlap
(8) The first step in any meta-analysis of microarray data is to find the set of genes shared by all microarray platforms used in the analysis. Traditionally, overlap is assessed by finding common UniGene identifiers. This, however, disregards all possible splice variations in the genes under investigation. For example, if a gene had two splice variants, only one of which was differentially expressed in the experiment, and one platform contained an oligo specific only to the differentially expressed variant while the other platform contained only an oligo to the other variant, then matching based on UniGene would merge probes which measure different things.
(9) To overcome this problem, the approach adopted here merges only probes which annotate to the same set of RefSeq identifiers. To this end, all matching RefSeqs were downloaded for each probe(set), either via the Bioconductor annotation packages (hgu133a, hgu95a and hgu133plus2; available at the Bioconductor web page) or by a BLAST search of the sequences at the NCBI database. Then, for each probe, the RefSeqs were sorted and concatenated. This is the most accurate representation of the entity measured on the array. If one set of RefSeqs was represented by multiple probes on the array, the median value was used. 5707 different sets of RefSeqs were present on all arrays.
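The probe-merging rule described above can be sketched as follows. This is a minimal illustration only, not the original implementation: the function names, the ";" separator and the toy probe annotations are all hypothetical.

```python
from collections import defaultdict
from statistics import median

def refseq_key(refseq_ids):
    """Sort and concatenate the RefSeq IDs a probe annotates to."""
    return ";".join(sorted(set(refseq_ids)))

def collapse_platform(probe_to_refseqs, probe_to_value):
    """Median value per RefSeq set for one platform and one sample."""
    grouped = defaultdict(list)
    for probe, refseqs in probe_to_refseqs.items():
        if refseqs:  # probes without any RefSeq annotation are skipped
            grouped[refseq_key(refseqs)].append(probe_to_value[probe])
    return {key: median(values) for key, values in grouped.items()}

def shared_keys(*platforms):
    """RefSeq sets that are present on every platform."""
    return set.intersection(*(set(p) for p in platforms))

# Hypothetical toy annotation: two platforms with partly overlapping probes
platform_a = collapse_platform(
    {"a1": ["NM_2", "NM_1"], "a2": ["NM_1", "NM_2"], "a3": ["NM_3"]},
    {"a1": 4.0, "a2": 6.0, "a3": 1.0},
)
platform_b = collapse_platform(
    {"b1": ["NM_1", "NM_2"], "b2": ["NM_1"]},
    {"b1": 5.5, "b2": 2.0},
)
common = shared_keys(platform_a, platform_b)
```

In this toy case the probe annotated only to NM_1 is not merged with the probes annotated to both NM_1 and NM_2, which is the splice-variant distinction that matching on a gene-level identifier would lose.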
Example 3
Preprocessing and Data Integration
(10) First, each dataset was background-corrected and normalised separately, as recommended for each platform (lowess for dual-color and quantile normalisation for single-color experiments) (Bolstad et al. Bioinformatics 19(2): 185-193 (2003); Smyth et al. Methods 31(4): 265-273 (2003)); then they were merged and quantile normalised collectively. Despite all preprocessing, it has been shown that data generated on different microarray platforms or on different generations of the same platform may not be comparable due to platform-specific biases (Eszlinger et al. J Clin Endocrinol Metab 91(5): 1934-1942 (2006)). This is also evident from principal component analysis of the merged data as shown in
(11) Data Integration by DWD is illustrated in
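The quantile normalisation step mentioned in this example can be illustrated with a short sketch. It assumes a genes x samples matrix without tied values and is not the Bioconductor implementation; the function name and the toy matrix are hypothetical. DWD itself is not shown here.

```python
import numpy as np

def quantile_normalise(x):
    """Quantile-normalise a genes x samples matrix (ties are not handled)."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0)        # per-sample sort order
    ranks = np.argsort(order, axis=0)    # rank of each gene within its sample
    # reference distribution: mean of each quantile across samples
    reference = np.sort(x, axis=0).mean(axis=1)
    return reference[ranks]

# Hypothetical toy matrix: 3 genes (rows) measured in 2 samples (columns)
raw = [[1.0, 10.0],
       [2.0, 30.0],
       [3.0, 20.0]]
normalised = quantile_normalise(raw)
```

After this step every sample has the same empirical distribution, while the rank order of genes within each sample is preserved.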
Example 4
Classification
(12) For probe selection, classification and cross-validation, a nearest shrunken centroid method was chosen (Tibshirani et al. PNAS USA 99(10): 6567-6572 (2002)) (implemented in the Bioconductor package pamr). It was chosen for several reasons: it allows multiclass classification, and it runs feature selection, classification and cross-validation in one go. Briefly, it calculates several possible classifiers using different shrinkage thresholds (i.e. different numbers of genes) and finds the best threshold by cross-validation. If more than one threshold yielded the same cross-validation result, the classifier with the smallest number of genes (largest threshold) was picked.
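The idea behind nearest shrunken centroids, and the tie-breaking rule for the threshold, can be sketched as below. This is a simplified illustration written with NumPy, not the pamr code: the function names, the standardisation details and the toy data are assumptions made for the example.

```python
import numpy as np

def fit_shrunken_centroids(X, y, delta):
    """X: samples x genes, y: class labels, delta: shrinkage threshold."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n, k = X.shape[0], len(classes)
    overall = X.mean(axis=0)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    counts = np.array([(y == c).sum() for c in classes])
    # pooled within-class standard deviation per gene
    resid = X - centroids[np.searchsorted(classes, y)]
    s = np.sqrt((resid ** 2).sum(axis=0) / (n - k))
    scale = s + np.median(s)
    m = np.sqrt(1.0 / counts - 1.0 / n)[:, None]
    d = (centroids - overall) / (m * scale)            # standardised differences
    d_shrunk = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)  # soft threshold
    return {
        "classes": classes,
        "centroids": overall + m * scale * d_shrunk,
        "scale": scale,
        "log_prior": np.log(counts / n),
        "n_genes": int((d_shrunk != 0).any(axis=0).sum()),  # genes still in use
    }

def predict(model, X):
    """Assign each sample to the class with the nearest shrunken centroid."""
    X = np.asarray(X, dtype=float)
    diff = (X[:, None, :] - model["centroids"][None, :, :]) / model["scale"]
    score = (diff ** 2).sum(axis=2) - 2.0 * model["log_prior"]
    return model["classes"][score.argmin(axis=1)]

def pick_threshold(thresholds, cv_errors):
    """Among thresholds with the lowest CV error, return the largest one."""
    thresholds = np.asarray(thresholds, dtype=float)
    cv_errors = np.asarray(cv_errors, dtype=float)
    return float(thresholds[cv_errors == cv_errors.min()].max())

# Hypothetical toy data: gene 0 separates the classes, gene 1 does not
X = [[5.0, 1.0], [6.0, 0.0], [5.5, 1.2], [1.0, 0.9], [0.0, 0.1], [0.5, 1.1]]
y = ["PTC", "PTC", "PTC", "healthy", "healthy", "healthy"]
model = fit_shrunken_centroids(X, y, delta=1.0)
calls = predict(model, [[5.2, 0.5], [0.3, 0.5]])
best = pick_threshold([0.0, 0.5, 1.0, 2.0], [0.1, 0.0, 0.0, 0.2])
```

Raising the threshold shrinks more genes to the overall centroid, so they drop out of the classifier; this is why the largest threshold among equally good ones gives the smallest gene set.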
Example 5
Papillary Thyroid Carcinoma (PTC)
(13) First, and as a quality measure for each study, each dataset was taken separately (before DWD integration) and a pamr classification with leave-one-out cross-validation (loocv) was performed. The results of the cross-validation are near perfect, with only single samples misclassified. However, with the exception of the classifier from the He dataset, none of these classifiers can be applied to any of the other datasets: classification results are rarely higher than expected by chance. If, however, one uses the DWD-integrated data (below), the classifiers already fit much better (see Table 8).
(14) TABLE 8 Classification results when applying classifiers from one study on another study, before data integration (first block) and after DWD integration (second block). Rows give the training study, columns the test study.
Before data integration
train  | he   | huang | jarzab | reyes
he     | 1.00 | 1.00  | 0.98   | 1.00
huang  | 0.50 | 1.00  | 0.55   | 0.50
jarzab | 0.50 | 0.81  | 1.00   | 0.57
reyes  | 0.78 | 0.50  | 0.92   | 1.00
After DWD integration
train  | he   | huang | jarzab | reyes
he     | 1.00 | 1.00  | 0.96   | 1.00
huang  | 0.50 | 1.00  | 0.90   | 0.71
jarzab | 0.89 | 1.00  | 1.00   | 1.00
reyes  | 0.89 | 0.88  | 0.90   | 1.00
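A train-on-one-study, test-on-another table of this kind can be computed as sketched below. The sketch uses a plain nearest-centroid classifier as a stand-in for pamr, and per-study mean centering as a much cruder stand-in for DWD; both substitutions, all function names and the toy data are assumptions made for illustration only.

```python
import numpy as np

def fit_centroids(X, y):
    """Plain nearest-centroid model (no shrinkage), used only as a stand-in."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict_centroids(model, X):
    classes, centroids = model
    X = np.asarray(X, dtype=float)
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]

def cross_study_accuracy(studies):
    """Return {train_name: {test_name: accuracy}} over all study pairs."""
    table = {}
    for train_name, (X_train, y_train) in studies.items():
        model = fit_centroids(X_train, y_train)
        table[train_name] = {
            test_name: float(
                np.mean(predict_centroids(model, X_test) == np.asarray(y_test))
            )
            for test_name, (X_test, y_test) in studies.items()
        }
    return table

# Hypothetical one-gene studies; study_b carries a constant platform offset
labels = ["tumor", "tumor", "healthy", "healthy"]
base = np.array([[4.0], [5.0], [0.0], [1.0]])
studies = {"study_a": (base, labels), "study_b": (base + 10.0, labels)}
table = cross_study_accuracy(studies)

# Crude bias removal (NOT DWD): subtract each study's own mean per gene
centered = {name: (X - X.mean(axis=0), y) for name, (X, y) in studies.items()}
table_centered = cross_study_accuracy(centered)
```

In the toy data a constant platform offset alone is enough to push cross-study accuracy down to chance level (0.5), while within-study accuracy stays at 1.0; removing the offset restores cross-study accuracy.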
(15) Then a pamr-classifier was built for the complete DWD-integrated dataset and validated in a leave-one-out cross validation. This identified a one (!) gene classifier, which classifies 99% of samples correctly in loocv. The discriminative gene is SERPINA1.
(16) However, similar results are obtained doing the same analysis on the non-integrated data. Taking into account the results of PCA (
(17) TABLE 9 Genes in classifier2 (after leaving out SERPINA1)
Symbol  | Title | Cluster | Accession
WAS     | Wiskott-Aldrich syndrome (eczema-thrombocytopenia) | Hs.2157 | BC012738
LRP4    | Low density lipoprotein receptor-related protein 4 | Hs.4930 | BM802977
TFF3    | Trefoil factor 3 (intestinal) | Hs.82961 | BC017859
ST3GAL6 | ST3 beta-galactoside alpha-2,3-sialyltransferase 6 | Hs.148716 | BC023312
STK39   | Serine threonine kinase 39 (STE20/SPS1 homolog, yeast) | Hs.276271 | BM455533
DPP4    | Dipeptidyl-peptidase 4 (CD26, adenosine deaminase complexing protein 2) | Hs.368912 | BC065265
CHI3L1  | Chitinase 3-like 1 (cartilage glycoprotein-39) | Hs.382202 | BC038354
FABP4   | Fatty acid binding protein 4, adipocyte | Hs.391561 | BC003672
LAMB3   | Laminin, beta 3 | Hs.497636 | BC075838
Example 6
Follicular Carcinoma
(18) A similar analysis was also performed for the FTC data, but cross-validation was hampered by the very limited availability of data. Again, a classifier was built for each dataset (Lacroix and Weber). They achieved a loocv accuracy of 96% (Weber) and 100% (Lacroix) on 25 and 3997 genes, respectively. The number of genes in the Lacroix data already suggests overfitting, which was confirmed by cross-classification with the other dataset (25% and 35% accuracy, respectively). Also, the gene overlap between the two classifiers is low (between 0 and 10%, depending on the threshold). If, however, the two datasets are combined using DWD, a 147-gene classifier (table 4 above) could be built which correctly identified samples with 92% accuracy.
Example 7
Discussion
(19) The present invention represents the largest cohort of thyroid carcinoma microarray data analysed to date. It makes use of a novel combinatory method that uses the latest algorithms for microarray data integration and classification. Nevertheless, meta-analysis of microarray data still poses a challenge, mainly because single microarray investigations are aimed, at least in part, at different questions and hence use different experimental designs. Moreover, the number of thyroid tumor microarray datasets available to date is still comparatively low (compared with breast cancer, for example). Therefore, when doing meta-analysis, one is forced to use all data available, even if the patient cohorts represent a rather heterogeneous and potentially biased population. More specifically, it is difficult to obtain a homogeneous collection of control material (from healthy patients). Control samples are usually taken from patients who were operated on for other thyroid disease, which in turn is very likely to cause a change in gene expression as measured on microarrays. The generation of homogeneous patient cohorts is further hampered by the limited availability of patient data such as age, gender and genetic background.
(20) When doing meta-analysis of microarray data, many researchers have based their approach on comparing gene lists from published studies (Griffith et al, cited above). This is very useful, as one can include all studies in the analysis and is not limited to the studies where raw data is available. However, the studies generally follow very different analysis strategies, some more rigorous than others. It is not under the control of the meta-analyst how the authors arrived at the gene lists. Therefore these analyses may be biased.
(21) Regarding data integration, according to the original DWD paper, DWD performs best when at least 25-30 samples per dataset are present. In the present study, 4 out of 6 datasets contained fewer than 20 samples. Still, DWD performed comparably well in removing platform biases (see Table 8).
(22) DWD greatly improved the results of PCA (
(23) There is an apparent discrepancy when one looks at
(24) A recent meta-analysis and meta-review by Griffith et al. (cited above) has summarised genes with diagnostic potential in the context of thyroid disease. They published lists of genes which appeared in more than one high-throughput study (microarray, SAGE) analysing thyroid disease and applied a ranking system. In their analysis SERPINA1 scored the third highest, and TFF3, which is part of classifier2 (when leaving out SERPINA1), scored second. Four out of nine genes from classifier2 appeared in the list from Griffith et al. (LRP4, TFF3, DPP4 and FABP4).
(25) Most of these lists were generated from microarray analysis. However, even when comparing the genes in the classifiers to gene lists generated with independent technologies, like cDNA library generation, there is substantial overlap. SERPINA1 appears in their lists as well as four out of the nine genes from classifier2 (TFF3, DPP4, CHI3L1 and LAMB3).
(26) For the case of follicular thyroid disease, building a robust classifier is much more difficult. This is mainly due to the limited availability of data. Also, the two datasets were very different in terms of the platforms used: while all other datasets were generated on Affymetrix GeneChip arrays of different generations, the Lacroix data was generated on a custom Agilent platform. Nevertheless, the classifier (set) of table 4 was able to identify most samples correctly in loocv.
(27) The power of the meta-analysis approach adopted here is demonstrated by a 99% loocv accuracy (97.9% weighted average accuracy in the study cross-validation) for the distinction between papillary thyroid carcinoma and benign nodules. This has been achieved on the largest and most diverse dataset so far (99 samples from 4 different studies).
(28) One sample was classified wrongly, and although it is not possible to correctly map the samples from this analysis to the original analysis, the misclassified sample is from the same group (PTC, validation group) as the sample which was wrongly classified in the original analysis. According to Jarzab et al., the sample was an outlier because it contained only 20% tumor cells.