METHOD FOR DIAGNOSING COLORECTAL CANCER

20220244261 · 2022-08-04

Abstract

The present invention provides a non-invasive method of diagnosing colorectal cancer in a subject. The method comprises determining the blood concentrations of the proteins TRIM28, PLOD1 and CEACAM5 (and optionally P4HA1) in a subject. An analysis of the concentrations is performed to determine whether the subject has colorectal cancer.

Claims

1. A method of diagnosing colorectal cancer in a subject, comprising determining the concentrations of the proteins TRIM28, PLOD1 and CEACAM5 in a blood-derived sample from the subject, and based on the summed concentrations and/or relative concentrations of said proteins determining whether the subject is suffering from colorectal cancer.

2. The method of claim 1, further comprising determining the concentration of the protein P4HA1 in said sample, wherein the P4HA1 concentration is included in the determination of whether the subject is suffering from colorectal cancer.

3. The method of claim 1, wherein: i) if the concentration of TRIM28 is greater than 0.3 ng/ml, the subject does not have colorectal cancer; ii) if the concentration of TRIM28 is less than or equal to 0.3 ng/ml and the concentration of PLOD1 is greater than or equal to 1.7 ng/ml, the subject has colorectal cancer; iii) if the concentration of TRIM28 is less than or equal to 0.3 ng/ml, the concentration of PLOD1 is less than 1.7 ng/ml and the concentration of CEACAM5 is greater than or equal to 0.13 pg/ml, the subject has colorectal cancer; and iv) if the concentration of TRIM28 is less than or equal to 0.3 ng/ml, the concentration of PLOD1 is less than 1.7 ng/ml and the concentration of CEACAM5 is less than 0.13 pg/ml, the subject does not have colorectal cancer.

4. The method of claim 2, wherein: i) if (6.7×[TRIM28])−(0.7×[PLOD1])−(13.1×[CEACAM5])+(0.4×[P4HA1])−2.2 is less than zero, the subject has colorectal cancer; and ii) if (6.7×[TRIM28])−(0.7×[PLOD1])−(13.1×[CEACAM5])+(0.4×[P4HA1])−2.2 is greater than or equal to zero, the subject does not have colorectal cancer; wherein the concentrations of TRIM28, PLOD1 and P4HA1 are measured in ng/ml, and the concentration of CEACAM5 is measured in pg/ml.

5. The method of any one of claims 1 to 4, further comprising taking a blood sample from the subject.

6. The method of any one of claims 1 to 5, wherein the blood-derived sample is a plasma sample.

7. The method of any one of claims 1 to 6, wherein the protein concentrations are measured by an immunoassay.

8. The method of claim 7, wherein said immunoassay is quantitative ELISA.

9. The method of any one of claims 1 to 8, wherein the steps of measuring the protein concentrations and determining whether the subject is suffering from colorectal cancer are performed by a computer processor programmed to perform said steps.

10. A method of diagnosing and treating colorectal cancer in a subject, comprising: a) measuring the concentrations of the proteins TRIM28, PLOD1 and CEACAM5 in a blood-derived sample; b) based on said concentrations, determining whether the subject is suffering from colorectal cancer; wherein: i) if the concentration of TRIM28 is greater than 0.3 ng/ml, the subject does not have colorectal cancer; ii) if the concentration of TRIM28 is less than or equal to 0.3 ng/ml and the concentration of PLOD1 is greater than or equal to 1.7 ng/ml, the subject has colorectal cancer; iii) if the concentration of TRIM28 is less than or equal to 0.3 ng/ml, the concentration of PLOD1 is less than 1.7 ng/ml and the concentration of CEACAM5 is greater than or equal to 0.13 pg/ml, the subject has colorectal cancer; and iv) if the concentration of TRIM28 is less than or equal to 0.3 ng/ml, the concentration of PLOD1 is less than 1.7 ng/ml and the concentration of CEACAM5 is less than 0.13 pg/ml, the subject does not have colorectal cancer; and c) if the subject is diagnosed with colorectal cancer, administering treatment for colorectal cancer to the subject.

11. A method of diagnosing and treating colorectal cancer in a subject, comprising: a) measuring the concentrations of the proteins TRIM28, PLOD1, CEACAM5 and P4HA1 in a blood-derived sample, wherein the concentrations of TRIM28, PLOD1 and P4HA1 are measured in ng/ml, and the concentration of CEACAM5 is measured in pg/ml; and b) based on said concentrations, determining whether the subject is suffering from colorectal cancer; wherein: i) if (6.7×[TRIM28])−(0.7×[PLOD1])−(13.1×[CEACAM5])+(0.4×[P4HA1]) −2.2 is less than zero, the subject has colorectal cancer; and ii) if (6.7×[TRIM28])−(0.7×[PLOD1])−(13.1×[CEACAM5])+(0.4×[P4HA1]) −2.2 is greater than or equal to zero, the subject does not have colorectal cancer; and c) if the subject is diagnosed with colorectal cancer, administering treatment for colorectal cancer to the subject.

12. The method of claim 10 or 11, wherein said treatment comprises surgery, chemotherapy, radiotherapy and/or immunotherapy.

13. A kit comprising a set of reagents for determining the presence or concentration of TRIM28, PLOD1 and CEACAM5 in a sample.

14. The kit of claim 13, further comprising a reagent for determining the presence or concentration of P4HA1 in a sample.

15. The kit of claim 13, said kit comprising: i) a specific binding agent which binds TRIM28, or a fragment thereof; ii) a specific binding agent which binds PLOD1, or a fragment thereof; and iii) a specific binding agent which binds CEACAM5, or a fragment thereof.

16. The kit of claim 15, said kit further comprising a specific binding agent which binds P4HA1, or a fragment thereof.

17. The kit of claim 15 or 16, wherein each specific binding agent is an antibody.

18. The kit of any one of claims 13 to 17, wherein said kit is for performing an ELISA.

19. A computer programme product comprising instructions that, when executed, will cause a processor to perform a method as defined in any one of claims 1 to 4.

20. Use of a kit as defined in any one of claims 13 to 18 in the diagnosis of colorectal cancer, wherein said diagnosis is performed using a method as defined in any one of claims 1 to 9.
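The decision rules recited in claims 3 and 4 can be sketched in code as follows. This is a minimal illustration only: the function and parameter names are hypothetical, and the thresholds and coefficients are copied from the claims (TRIM28, PLOD1 and P4HA1 in ng/ml, CEACAM5 in pg/ml).

```python
def classify_decision_tree(trim28_ng_ml, plod1_ng_ml, ceacam5_pg_ml):
    """Return True if the stepwise rules of claim 3 indicate colorectal cancer.

    Function and argument names are illustrative; thresholds are from claim 3.
    """
    if trim28_ng_ml > 0.3:
        return False                  # rule i): high TRIM28, no CRC
    if plod1_ng_ml >= 1.7:
        return True                   # rule ii): low TRIM28, high PLOD1
    return ceacam5_pg_ml >= 0.13      # rules iii) and iv): decided by CEACAM5


def linear_score(trim28_ng_ml, plod1_ng_ml, ceacam5_pg_ml, p4ha1_ng_ml):
    """Weighted sum of claim 4; a value below zero indicates CRC."""
    return (6.7 * trim28_ng_ml
            - 0.7 * plod1_ng_ml
            - 13.1 * ceacam5_pg_ml
            + 0.4 * p4ha1_ng_ml
            - 2.2)


def classify_linear(trim28_ng_ml, plod1_ng_ml, ceacam5_pg_ml, p4ha1_ng_ml):
    """Return True if the claim 4 score is less than zero."""
    return linear_score(trim28_ng_ml, plod1_ng_ml,
                        ceacam5_pg_ml, p4ha1_ng_ml) < 0
```

Any concentrations passed to these functions in an example are made-up inputs, not measured data.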

Description

[0106] FIG. 1 shows the classification accuracy of 22 paired samples: colorectal cancer tumour samples (CRC) and adjacent tissue (AT). The boxplot presents the discriminatory function score for CRC and AT groups (sum of the parts per million [ppm] values reported in Hao et al. (Sci Rep 7:42436, 2017)) for the selected proteins: PLOD1, P4HA1, LCN2, GNS, C12orf10, P3H1, TRIM28, CEACAM5, MAD1L1. Significance p values were calculated using the double-sided Wilcoxon Signed Rank test. The bars in the boxes represent the median, 25th and 75th percentiles, while whiskers extend to ±2.7σ.

[0107] FIG. 2 shows the classification accuracy of 96 paired samples: colorectal cancer tumour samples (CRC) and adjacent tissue (AT). The boxplot presents discriminatory function scores for the CRC and AT groups (sum of the two to the power of the Unshared Log Ratio scores reported in the CPTAC study (see below) for the selected proteins: PLOD1, P4HA1, LCN2, GNS, C12orf10, P3H1, TRIM28, CEACAM5, MAD1L1). Significance p values were calculated using the double-sided Wilcoxon Signed Rank test for paired samples and the Wilcoxon Rank Sum test for unpaired samples. The bars in the boxes represent the median, 25th and 75th percentiles, while whiskers extend to ±2.7σ. Numbers below the boxplots denote the number of observations per category. (A) Discrimination between samples derived from tumours (CRC) and AT. (B) Discrimination between CRC and AT samples in males and females separately. (C) Discrimination between CRC and AT samples depending on histological subtype. (D) Discrimination between CRC and AT samples depending on tumour stage. (E) Discrimination between CRC and AT samples depending on race. (F) Discrimination between CRC and AT samples depending on prior colon polyp history.

[0108] FIG. 3 shows the classification accuracy of colorectal cancer tumour samples (CRC) and adjacent tissue (AT). The boxplot presents discriminatory function scores for the CRC and AT groups (sum of the 2 to the power of the Unshared Log Ratio scores reported in the CPTAC study for the selected proteins: PLOD1, P4HA1, LCN2, GNS, C12orf10, P3H1, TRIM28, CEACAM5, MAD1L1). The significance p value was calculated using the double-sided Wilcoxon Signed Rank test for paired samples and the Wilcoxon Rank Sum test for unpaired samples. The bars in the boxes represent the median, 25th and 75th percentiles, while whiskers extend to ±2.7σ. Numbers below the boxplots denote the number of observations per category. Discrimination between CRC and AT samples divided by (A) BRAF gene analysis results; (B) Ethnicity; (C) Presence of NRAS mutation (only one sample with no NRAS mutation is reported in the study); (D) History of other cancers; (E) MLH1 expression; (F) MSH2 expression; (G) MSH6 expression; (H) Presence of KRAS mutation (only one sample is reported in the study with no KRAS mutation).

[0109] FIG. 4 shows the classification accuracy of colorectal cancer tumour samples (CRC) and adjacent tissue (AT). The boxplot presents discriminatory function scores for CRC and AT groups (sum of the 2 to the power of the Unshared Log Ratio scores reported in the CPTAC study for the selected proteins: PLOD1, P4HA1, LCN2, GNS, C12orf10, P3H1, TRIM28, CEACAM5, MAD1L1). The significance p value was calculated using the double-sided Wilcoxon Signed Rank test for paired samples and the Wilcoxon Rank Sum test for unpaired samples. The bars in the boxes represent the median, 25th and 75th percentiles, while whiskers extend to ±2.7σ. Numbers below the boxplots denote the number of observations per category. Discrimination between CRC and AT samples divided by (A) Number of first degree relatives with history of colorectal cancer; (B) PMS2 expression.

[0110] FIG. 5 shows the classification accuracy of 16 samples: early colorectal cancer tumour samples (CRC), normal and inflamed tissue. (A) Boxplot presenting the discriminatory function score (based on the sum of the four proteins present in the dataset out of the nine of interest: P4HA1, LCN2, C12orf10 & TRIM28). (B) C12orf10 expression differences in normal, inflamed and early CRC tissues. (C) LCN2 expression differences in normal, inflamed and early CRC tissues. (D) The sum of LCN2 and C12orf10 protein concentrations separates normal and early CRC tissue samples. Significance p values were calculated using the double-sided Wilcoxon Rank Sum test. The bars in the boxes represent the median, 25th and 75th percentiles, while whiskers extend to ±2.7σ. Numbers below the boxplots denote the number of observations per category.

[0111] FIG. 6 shows the classification accuracy of colorectal tumour (CRC) samples and adjacent tissue (AT) in transcriptomic profiling datasets. (A) Transcriptome profiling of colorectal samples from 6 normal surface epithelium samples, 7 normal crypt epithelium samples, 17 CRC samples, 11 metastases and 17 adenomas (in total 19 subjects). (B) 54 normal colon tissue samples, 186 CRC samples and 49 polyps. (C) 74 normal samples, CRC samples from three different studies (n=4, 288 and 52, respectively), 30 adenomas, 4 familial hyperplastic polyposis samples, 47 ulcerative colitis samples and 37 Crohn's disease samples. For each plot significance p values were calculated using the double-sided Wilcoxon Rank Sum test. The bars in the boxes represent the median, 25th and 75th percentiles, while whiskers extend to ±2.7σ.

[0112] FIG. 7 shows based on Ingenuity Pathway Analysis that the nine selected proteins are highly interconnected and form a network module that regulates cell death and proliferation.

[0113] FIG. 8 shows the classification accuracy of 72 patients and 72 control samples in the training group. (A) Boxplot presenting the discriminatory function score based on protein concentrations in plasma samples measured with ELISA. The discriminatory function score was calculated as the sum of the nine proteins. The bars in the boxes represent the median, 25th and 75th percentiles, while whiskers extend to ±2.7σ. (B) Receiver operating characteristic curve (Kinsella et al., Database 2011:bar030) obtained for the classifier based on the sum of the concentrations of the nine proteins measured by ELISA in plasma samples of patients and controls. The optimal operating point (discriminatory function cut-off value) is marked with a circle, and it corresponds to the score value of −0.34. (C) Boxplot showing the discriminatory function score for patients and controls in the training set. The selected discriminatory function cut-off value is presented with a solid, horizontal line. Each dot represents an individual sample. The bars in the boxes represent the median, 25th and 75th percentiles, while whiskers extend to ±2.7σ.

[0114] FIG. 9 shows the classification accuracy of different combinations of two, three or four proteins in a test set of 8 patients and control samples. The boxplots present discriminatory function scores based on the concentrations of combinations of two, three or four proteins in plasma samples of patients and controls in the test group, measured with ELISA. The results of the following protein combinations are shown: (A) TRIM28 and PLOD1; (B) TRIM28, PLOD1 and CEACAM5; and (C) TRIM28, PLOD1, CEACAM5 and P4HA1. The discriminatory function score was calculated as the sum of the concentrations of the respective proteins. The discriminatory cut-off value is indicated with a solid horizontal line. The dots denote values obtained for patients and controls in the test group respectively. The bars in the boxes represent the median, 25th and 75th percentiles, while whiskers extend to ±2.7σ. For the combination of two proteins the best results were obtained when summing the concentrations of TRIM28 and PLOD1 (sensitivity of 88%, specificity of 75%). The best combination of three proteins yielded sensitivity of 100% and specificity of 88% (sum of TRIM28, PLOD1, and CEACAM5). However, the best sample separation was possible when using four proteins, namely TRIM28, PLOD1, CEACAM5, and P4HA1 (both sensitivity and specificity of 100%).

EXAMPLES

[0115] Protein biomarkers for CRC that can be measured in blood were identified by a process based on meta-analysis of published genome- and proteome-wide analyses of CRC tissues and adjacent tissues (AT). For the published analyses see Hao et al. (Sci Rep 7:42436, 2017), Shiromizu et al. (supra), Quesada-Calvo et al. (Clin Proteomics 14:9, 2017) and Torrente et al. (PLoS ONE 11(6): e0157484, 2016).

[0116] The inventors focused on differentially expressed genes whose protein products were potentially released extracellularly, according to the Human Protein Atlas (Petryszak et al., Nucleic Acids Res 44:D746-D752, 2016). In order to identify optimal combinations of a limited number of proteins (and diagnostic methods based on these), the inventors used their classification algorithm (Hellberg et al., Cell Reports 16:2928-2939, 2016). Finally, the inventors tested the identified biomarkers using plasma from patients with newly diagnosed CRC and healthy controls.

[0117] Materials & Methods

[0118] Protein Prioritisation Using Randomised Elastic Net

[0119] In order to rank proteins the inventors analysed proteome profiling data of 22 CRC patients, with paired samples taken from tumour and AT (Hao et al., supra). For each identified protein, fold-change was calculated as the average protein expression in colorectal tumour samples divided by the average protein expression in AT. Differential expression was obtained from Hao et al. (supra), analysed by paired t-test. For biomarker prioritisation using randomised elastic net the inventors pre-selected those proteins that were: a) differentially expressed (adjusted for multiple testing using the procedure described by Storey) with a p value <0.01; b) upregulated in CRC tumour samples (fold change more than 2); and c) predicted to be secreted according to the Human Protein Atlas (https://www.proteinatlas.org, data downloaded in July 2017).
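The three pre-selection criteria above can be sketched as a simple filter. This is an illustration under stated assumptions: the record layout (keys "name", "tumour", "adjacent", "adj_p", "secreted") is hypothetical, the adjusted p value is taken as an already computed input, and the secretion flag stands in for the Human Protein Atlas annotation.

```python
from statistics import mean


def fold_change(tumour_values, adjacent_values):
    """Average expression in tumour divided by average expression in AT."""
    return mean(tumour_values) / mean(adjacent_values)


def preselect(candidates, p_cutoff=0.01, fc_cutoff=2.0):
    """Keep proteins that meet criteria a), b) and c).

    `candidates` is a list of dicts with hypothetical keys:
    "name", "tumour", "adjacent", "adj_p" and "secreted".
    """
    selected = []
    for protein in candidates:
        fc = fold_change(protein["tumour"], protein["adjacent"])
        if (protein["adj_p"] < p_cutoff      # a) differentially expressed
                and fc > fc_cutoff           # b) upregulated in tumour
                and protein["secreted"]):    # c) predicted to be secreted
            selected.append(protein["name"])
    return selected
```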

[0120] 113 proteins fulfilling the above criteria were rank-ordered by their predictive value using randomised elastic net. Randomised elastic net was implemented as a modification of randomised lasso as described by Meinshausen et al. (Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72: 417-473, 2010). Here, lasso was replaced with elastic net. For selected λ in cross-validation and for α=0.5, the inventors permuted data by adding random penalty factors for each predictor (protein) from the interval [1/α, 1]. Next, model coefficients were estimated (elastic net). 100,000 permutations were performed. Predictors with non-zero coefficients in at least one of the 100,000 permutations were selected. For downstream analyses proteins selected in at least 45% of permutations were chosen (corresponding to 9 proteins).
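The repeated-fit and frequency-threshold steps described above can be sketched as follows. This is not the inventors' implementation: the elastic net solver is passed in as a hypothetical callable `fit` (penalty factors in, coefficients out), the penalty interval bounds are supplied by the caller, and only the bookkeeping of selection frequencies is shown.

```python
import random
from collections import Counter


def selection_frequencies(fit, predictors, n_permutations,
                          penalty_low, penalty_high, seed=0):
    """Fraction of randomised fits in which each predictor is non-zero.

    `fit` is a stand-in for an elastic net solver: it receives a dict of
    per-predictor penalty factors and returns a dict of coefficients.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_permutations):
        # Draw a random penalty factor for every predictor (protein).
        penalties = {name: rng.uniform(penalty_low, penalty_high)
                     for name in predictors}
        coefficients = fit(penalties)
        counts.update(name for name in predictors
                      if coefficients.get(name, 0.0) != 0.0)
    return {name: counts[name] / n_permutations for name in predictors}


def select_stable(frequencies, min_frequency=0.45):
    """Rank predictors by frequency and keep those at or above the cut-off."""
    ranked = sorted(frequencies, key=frequencies.get, reverse=True)
    return [name for name in ranked if frequencies[name] >= min_frequency]
```

Passing the solver in as an argument keeps the sketch independent of any particular elastic net library.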

[0121] Sample Classification

[0122] Based on the selected proteins the inventors built a classifier whose discriminative function is the sum of the expression of all nine proteins. All zeros in the proteomic data were replaced with NaN and treated as missing values (nansum( ) was used). To calculate the area under the ROC curve (AUC), the MATLAB function perfcurve( ) was used, with colorectal tumour (CRC) samples as the positive group and adjacent tissue (AT) samples as the negative group. In order to obtain significance values, the inventors performed the Wilcoxon Signed Rank test for paired samples and the Wilcoxon Rank Sum test for non-paired samples on the discriminative function scores calculated for the CRC and AT groups.
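The scoring and AUC steps above can be sketched in plain Python. The inventors used MATLAB's nansum( ) and perfcurve( ); the version below is an assumed equivalent, with hypothetical function names, that computes the AUC through its pairwise-comparison definition rather than by tracing the ROC curve.

```python
import math


def nan_sum(values):
    """Sum that skips NaN entries (same idea as MATLAB's nansum)."""
    return sum(v for v in values if not math.isnan(v))


def discriminative_score(expression):
    """Sum of one sample's protein expression values, zeros treated as missing."""
    cleaned = [math.nan if v == 0 else v for v in expression]
    return nan_sum(cleaned)


def auc(positive_scores, negative_scores):
    """Probability that a positive sample outscores a negative one.

    Ties count as half a win; this equals the area under the ROC curve.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))
```

Here CRC samples would be passed as `positive_scores` and AT samples as `negative_scores`, matching the group assignment described above.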

[0123] In order to test if selected proteins were better CRC biomarkers than would be obtained by chance, the inventors randomly selected nine upregulated proteins from the dataset and repeated the calculation of AUC scores 10,000 times. The permutation p value was calculated by comparing random AUC values with the original one.
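A minimal sketch of this comparison against random panels is given below. The exact formula for the permutation p value is not stated in the text, so the fraction of random panels whose AUC is at least as high as the observed one is an assumption, as are the function names and the data layout (one dict per sample, mapping protein name to expression).

```python
import random


def auc(pos, neg):
    """Pairwise AUC; ties count as half a win."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def panel_scores(samples, panel):
    """Sum the expression of the panel proteins for each sample."""
    return [sum(sample[name] for name in panel) for sample in samples]


def permutation_p_value(tumour, adjacent, panel, candidates,
                        n_draws=10_000, seed=0):
    """Compare the panel AUC with AUCs of random panels of the same size.

    `candidates` is the list of upregulated proteins to draw from.
    Returns (observed AUC, assumed permutation p value).
    """
    rng = random.Random(seed)
    observed = auc(panel_scores(tumour, panel), panel_scores(adjacent, panel))
    at_least_as_good = 0
    for _ in range(n_draws):
        random_panel = rng.sample(candidates, len(panel))
        random_auc = auc(panel_scores(tumour, random_panel),
                         panel_scores(adjacent, random_panel))
        if random_auc >= observed:
            at_least_as_good += 1
    return observed, at_least_as_good / n_draws
```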

[0124] Validation in Independent Datasets

[0125] To test the selected nine proteins, the inventors repeated the classification and the associated tests as described above in two independent, publicly available proteomics datasets:

[0126] 1) A set obtained from 101 individuals (Clinical Proteomic Tumor Analysis Consortium, CPTAC). The set contains samples taken from tumour sites (CRC) and adjacent tissue (AT). For classification the inventors used 2 to the power of the reported “Unshared Log Ratio” score. Unpaired samples were removed from the analyses;

[0127] 2) A set obtained from four normal mucosa samples, four inflamed mucosa samples and eight early cancer samples (ProteomeXchange Dataset PXD005735).

Additionally, the inventors analysed whether the transcriptome profiling of the nine genes encoding the selected proteins could discriminate between CRC and AT. The inventors analysed the following datasets:

[0128] 1) E-GEOD-77955: 6 normal surface epithelium, 7 normal crypt epithelium, 17 CRC, 11 metastases, 17 adenoma samples (in total 19 subjects);

[0129] 2) E-GEOD-41258: 54 normal colon tissues, 186 CRC, 49 polyp samples;

[0130] 3) E-MTAB-3732, which aggregated and normalised microarray datasets from different studies of healthy and diseased colorectal tissues. This included 74 healthy colon tissues, CRC from three studies with 4, 288 and 52 patients, 30 adenomas, 4 familial hyperplastic polyposis and 47 ulcerative colitis.

[0131] Discriminative Score and Clinical Data

[0132] In the publicly available dataset consisting of 100 individuals (CPTAC) the inventors also tested if the discriminative score was influenced by sex, histological subtype, history of prior polyps or race. Analysis was performed using the Wilcoxon signed rank test where samples were paired and the Wilcoxon rank sum test otherwise between specific subgroups of samples.

[0133] ELISA of Plasma Samples from Patients with CRC and Healthy Controls

Eighty CRC patients (40 females and 40 males, mean age of 71.8 years (range 34-89)) from south-eastern Sweden, who had undergone surgical resections for primary CRC at the Department of Surgery, Division of Surgical Care, Region Jönköping County, Jönköping, Sweden, were recruited. The CRC patients had tumours localized in the colon (n=37) or rectum (n=43) with TNM stages I-IV (I=13, II=34, III=29 and IV=4). The control group consisted of 80 healthy blood donors (40 females and 40 males, mean age of 55.9 years (range 33-67)) with no known history of CRC and from the same geographical region as the cancer patients.

[0134] Venous blood samples were collected and centrifuged within 1 hour of collection to separate plasma and blood cells. Plasma samples were stored at −80° C. in the Biobank of Laboratory Services, registration number 868, Region Jönköping County, Jönköping, Sweden until analysis. The study was reviewed and approved by the Regional Ethical Review Board in Linköping, Linköping, Sweden (98113 and 2013/271-31). All patients included in this study gave informed written consent for utilisation of their material in research. Plasma levels of the nine potential biomarkers were analysed using commercial Enzyme-Linked Immunosorbent Assays (ELISAs) according to the manufacturer's instructions: C12orf10 (MyBiosource, Inc., San Diego, Calif., United States), CEACAM5 (LifeSpan BioSciences, Inc., Seattle, Wash., United States), GNS (MyBiosource, Inc.), LCN2 (Aviva Systems Biology Corp., San Diego, Calif., United States), MAD1L1 (Abbexa Ltd., Cambridge, United Kingdom), P3H1 (Abbexa Ltd.), P4HA1 (Signalway Antibody LLC, Baltimore, Md., United States), PLOD1 (Aviva Systems Biology Corp.), and TRIM28 (MyBiosource, Inc.). Protein levels of the nine potential biomarkers were determined using the Sunrise Tecan Microplate reader (Tecan Austria GmbH, Salzburg, Austria) along with the Magellan 7.x 2010 software (Tecan Austria GmbH). Protein levels of C12orf10, GNS, LCN2, MAD1L1, P4HA1, PLOD1 and TRIM28 were expressed as nanograms per millilitre (ng/mL). Protein levels of CEACAM5 and P3H1 were expressed as picograms per millilitre (pg/mL). Where protein values fell outside the ELISA kit detection limits, the inventors assumed for the calculations either the ELISA kit maximum detection limit or a value of 0, as appropriate.

[0135] Network Analysis of the Identified Nine Proteins

[0136] The Ingenuity Pathway Analysis (IPA) software (Qiagen, Hilden, Germany) was used to test if the nine proteins were part of the same network module. Sub-network formation by IPA was performed as follows: first, all genes having direct or indirect interactions with the nine proteins served as “seeds” to generate networks. Such focus genes were combined into networks to maximise their specific interconnectivity (the connectivity between focus genes in comparison to the number of their interactions with other genes within the IPA global network). Additional genes from the IPA global network were added in order to connect smaller networks formed by the focus genes. Finally, the resulting networks were scored based on the number of focus genes they contained: the higher the score, the lower the probability of finding this number of focus genes within a given network by random chance.

[0137] Sample Separation into Training and Testing Sets

[0138] For each of the nine individual proteins, the inventors tested whether their expression differed significantly between patients and controls using the double-sided Wilcoxon Rank Sum test. Next, the inventors summed the expression of all nine proteins in order to obtain a discriminatory function score and calculated AUC (as before the MATLAB perfcurve( ) function was used for that purpose). Even though based on the tissue samples (proteome profiling) all nine proteins should be upregulated in patient samples, in the plasma samples some of the proteins were in fact downregulated in patients compared to controls. Therefore, concentrations of those proteins were subtracted from the discriminatory function score instead of being summed. The P value was calculated using the double-sided Wilcoxon Rank Sum test. Next, the inventors aimed to reduce the number of proteins necessary for the classifier to perform well. For these analyses, ten percent of patient samples (n=8) and ten percent of control samples (n=8) were randomly selected as a test group and excluded from the initial analyses, in which the inventors tried different combinations of the measured proteins. The optimal operating point (discriminatory cutoff value) was selected using MATLAB function perfcurve( ). This is based on moving a straight line from the point (0,1) with a slope defined as:


$$S = \frac{\mathrm{Cost}(FP) - \mathrm{Cost}(TN)}{\mathrm{Cost}(FN) - \mathrm{Cost}(TP)} \cdot \frac{TN + FP}{TP + FN}$$

that crosses the ROC curve. Here TP, FP, TN and FN are the numbers of true positives, false positives, true negatives and false negatives respectively, and classification cost is denoted as Cost( ). Since it is most important to reduce the probability of false negatives (classification of patients as healthy controls), the inventors assumed zero cost for classification of TN and TP, a cost of 0.8 for misclassification of the positive class, and a cost of 0.2 for misclassification of the negative class.
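The cut-off selection can be illustrated with a short sketch. It is an assumed re-implementation of the idea, not MATLAB's perfcurve( ) itself: it reads "misclassification of the positive class" as Cost(FN)=0.8 and "misclassification of the negative class" as Cost(FP)=0.2, treats higher scores as indicating the positive class, and uses the fact that a line of slope S lowered from the point (0,1) first touches the ROC point that maximises TPR − S×FPR. All function names are hypothetical.

```python
def roc_points(pos_scores, neg_scores):
    """List of (threshold, FPR, TPR), calling a sample positive if score >= threshold."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(float("inf"), 0.0, 0.0)]   # nothing called positive
    for t in thresholds:
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((t, fpr, tpr))
    return points


def slope(cost_fp, cost_fn, n_pos, n_neg, cost_tn=0.0, cost_tp=0.0):
    """Slope S: cost ratio times the ratio of negatives to positives."""
    return (cost_fp - cost_tn) / (cost_fn - cost_tp) * (n_neg / n_pos)


def optimal_operating_point(pos_scores, neg_scores, cost_fp=0.2, cost_fn=0.8):
    """Return the (threshold, FPR, TPR) first touched by the line of slope S."""
    s = slope(cost_fp, cost_fn, len(pos_scores), len(neg_scores))
    return max(roc_points(pos_scores, neg_scores),
               key=lambda point: point[2] - s * point[1])
```

With equal group sizes and the costs above, `slope` evaluates to 0.25, so a gain in true-positive rate is weighted four times as heavily as the same increase in false-positive rate.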

[0139] Classification of the Test Group

[0140] Finally, using the classifier constructed as described above, the inventors classified the excluded samples of patients and controls, calculated accuracy, sensitivity and specificity, and p value (p value was calculated based on the discriminatory function score obtained for the excluded samples between patients and controls using the double-sided Wilcoxon Rank Sum test).
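The accuracy, sensitivity and specificity mentioned above follow directly from the classifier calls on the held-out samples. The sketch below uses hypothetical names and takes one boolean per sample (True meaning "classified as CRC").

```python
def classification_metrics(patient_calls, control_calls):
    """Compute sensitivity, specificity and accuracy from boolean calls.

    patient_calls: calls for true CRC patients (True = classified as CRC).
    control_calls: calls for healthy controls (True = classified as CRC).
    """
    tp = sum(patient_calls)               # patients correctly called CRC
    fn = len(patient_calls) - tp          # patients missed
    fp = sum(control_calls)               # controls wrongly called CRC
    tn = len(control_calls) - fp          # controls correctly called healthy
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }
```

As an illustrative input only, 7 of 8 patients called positive and 6 of 8 controls called negative give a sensitivity of 0.875 and a specificity of 0.75.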

[0141] Results

[0142] Biomarker Selection

[0143] In order to rank proteins as putative biomarkers of CRC the inventors analysed publicly available proteome profiling of CRC samples and paired AT samples collected from 22 subjects (Hao et al., supra). First, the inventors preselected proteins that were:

[0144] a) differentially expressed (corrected for multiple testing, defined based on having paired t-test p value <0.01);

[0145] b) upregulated in CRC samples compared to AT samples (fold change >2); and

[0146] c) predicted by the Human Protein Atlas to be secreted.

[0147] The inventors identified 113 such proteins. Using randomised elastic net (Materials and Methods) the inventors ranked the proteins based on their predictive value to discriminate CRC from AT. For further analyses the inventors selected the top nine proteins: PLOD1 (Q02809), P4HA1 (P13674), LCN2 (P80188), GNS (P15586), C12orf10 (Q9HB07), P3H1 (Q32P28), TRIM28 (Q13263), CEACAM5 (P06731) and MAD1L1 (Q9Y6D9) (randomised elastic net frequency >0.45; Materials and Methods). The inventors found that the sum of the expression values of those nine proteins discriminated between CRC and AT with high accuracy (Area Under receiver operating characteristic Curve AUC=1, Wilcoxon Signed Rank test p=4.0×10⁻⁵; FIG. 1). This was significantly higher than when using nine random proteins (permutation test p<1.0×10⁻⁴, OR=1.66). Out of those nine proteins, three were reported before as possible CRC biomarkers, namely CEACAM5 (commonly known as CEA), LCN2 and TRIM28 (Shiromizu et al., supra).

[0148] Biomarker Tests in Independent Proteomic Datasets

[0149] In order to assess the reproducibility of these results the inventors analysed two independent publicly available proteome profiling datasets of CRC. First, the inventors analysed a dataset consisting of 101 individuals; the data were generated by the National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). The data consist of 96 paired samples obtained from tumour sites (CRC) and AT. Secondly, the inventors validated that the selected nine proteins could separate tumour samples from AT. The inventors obtained a nearly perfect classification accuracy (AUC=0.99, Wilcoxon Signed Rank test p=1.8×10⁻¹⁷, FIG. 2A), which was higher than obtained by chance for nine randomly selected proteins (permutation test p=1.0×10⁻⁴, OR=2.21).

[0150] In the CPTAC study, the authors report clinical data including sex, race, histological subtype and history of prior colon polyps. The inventors therefore tested whether any of those covariates had an impact on the sample classification (FIG. 2B-F, FIG. 3 and FIG. 4). The inventors found significant differences in the discriminatory function score only between mucinous and non-mucinous tumours (Wilcoxon Rank Sum test p=0.046; FIG. 2C); between pathological tumour stages I and III (double-sided Wilcoxon Rank Sum test p=0.049; FIG. 2D); between AT samples from patients with tumour stages II and IV (Wilcoxon Rank Sum test p=0.02; FIG. 2D); between races (black/African-American patients versus Asian or Caucasian patients, p=0.008 and 0.002 respectively; FIG. 2E); and between expression and no expression of MLH1 and PMS2 (Wilcoxon Rank Sum test p=0.02, FIG. 3E; and p=0.008, FIG. 4B respectively). Furthermore, the inventors tested the correlation between pathological tumour stage and discriminatory function score and found no significant correlation (Pearson PCC=0.12, p=0.12). However, as shown in FIG. 2, those covariates did not have a significant impact on overall classification accuracy.

[0151] The inventors then tested yet another proteomic dataset consisting of 76 tissue samples, in which four to five patient sample digests were pooled. In total, proteomics analyses were performed on eight pools from colorectal tissue samples obtained from early stages of CRC, eight pools of apparently normal tissue (at surgical margin) samples and four pools of inflamed mucosa samples (Quesada-Calvo et al., supra). This dataset contains only four out of the nine selected putative biomarkers: P4HA1 (P13674), LCN2 (P80188), C12orf10 (Q9HB07) and TRIM28 (Q13263). For this reason, a new classifier was created based on the sum of those four tentative CRC biomarkers, which yielded a high classification accuracy for discriminating early CRC from normal tissue (AUC=0.91, Wilcoxon Rank Sum test, unpaired samples, p=0.03, FIG. 5A). The inventors also found a significant difference between normal and inflamed tissue (Wilcoxon Rank Sum test p=0.03, FIG. 5A). However, the combination of the four proteins did not yield a higher AUC score than expected by chance for four random proteins (p=0.08, OR=1.68). Therefore, the inventors asked if individual proteins could discriminate normal from tumour tissues. The inventors found that LCN2 and C12orf10 differentiated between the two conditions (AUC=0.97, Wilcoxon Rank Sum test p=0.008 for both proteins), which was higher than expected by chance for a random single gene classifier (permutation test p=0.03, OR=1.86 for both proteins; FIG. 5B, C). A combination of those two proteins gave even higher classification accuracy (AUC=1, Wilcoxon Rank Sum test p=0.004; FIG. 5D), which is higher than expected for two random proteins (permutation test p=1.0×10⁻⁴, OR=1.9). This suggested that a subset of the nine proteins might be sufficient for a highly accurate classification of patients and controls.

[0152] Biomarker Tests in Independent Transcriptomic Datasets

[0153] The inventors also tested three transcriptome profiling studies of CRC. Where some of the selected nine genes were not expressed in the tested dataset, classifiers were built using the genes that were expressed. Firstly, the inventors analysed a dataset consisting of 6 normal surface epithelium samples, 7 normal crypt epithelium samples, 17 CRC samples, 11 metastases and 17 adenoma samples (in total 19 subjects; E-GEOD-77955). This dataset lacked details of expression of the C12orf10 gene. Therefore, the inventors created a classifier based on the remaining eight genes. The inventors obtained a high classification accuracy when comparing normal crypt epithelium samples to CRC, metastases and adenoma samples (AUC>0.95, Wilcoxon Rank Sum test p<6.1×10⁻⁴; FIG. 6A; Table 1). However, in comparison to normal surface samples good separation of the groups was not obtained (AUC<0.43, Wilcoxon Rank Sum test p>0.25; FIG. 6A; Table 1).

TABLE-US-00001 TABLE 1
                                    Metastasis         Adenoma            CRC
versus normal surface epithelium
  AUC                               0.32               0.42               0.41
  P                                 0.26               0.60               0.55
versus normal crypt epithelium
  AUC                               1.00               0.96               0.96
  P                                 6.3 × 10.sup.−5    6.0 × 10.sup.−4    6.0 × 10.sup.−4
Classification area under the receiver operating characteristic (Price et al., Nat Biotechnol 35: 747-756, 2017) of 6 normal surface epithelium samples and 7 normal crypt epithelium samples versus 17 CRC, 11 metastasis and 17 adenoma samples, respectively. Significance was calculated using the double-sided Wilcoxon Rank Sum test (p).

[0154] Similarly, in another transcriptome profiling study tested (E-GEOD-41258) the inventors obtained good separation between groups (AUC>0.77, Wilcoxon Rank Sum test p<3.4×10.sup.−10; FIG. 6B; Table 2). Here, the inventors compared 54 normal colon tissue samples with 186 primary CRC and 49 polyp samples. This dataset also lacked expression profiles of C12orf10, and therefore the classifier was made using the remaining eight genes.

TABLE-US-00002 TABLE 2
        Primary CRC         Polyp
AUC     0.78                0.96
P       3.3 × 10.sup.−10    1.2 × 10.sup.−10
Classification area under the receiver operating characteristic (Price et al., supra) of 54 normal colon tissues versus 186 primary CRC and 49 polyp samples, respectively. Significance was calculated using the double-sided Wilcoxon Rank Sum test (p).

[0155] Finally, the inventors analysed a dataset (E-MTAB-3732) that aggregated and normalised microarray data from colorectal samples from 74 healthy tissues, three studies of CRC (n=4, 288 and 52, respectively), 4 familial hyperplastic polyposis samples, 30 colorectal adenomas, 47 ulcerative colitis samples and 37 Crohn's disease samples. All nine tentative biomarkers were profiled. In this dataset, normal colorectal tissue samples were compared to each of the other groups. This yielded high separation for most comparisons (AUC>0.82; Wilcoxon Rank Sum test p<0.002; FIG. 6C; Table 3). The inventors also found a significant difference between normal tissue and CRC in the third CRC study, although the separation was moderate (AUC=0.78, Wilcoxon Rank Sum test p=1.3×10.sup.−7). However, the classifier did not differentiate between normal colorectal tissue and familial hyperplastic polyposis (AUC=0.42, Wilcoxon Rank Sum test p=0.59).

TABLE-US-00003 TABLE 3
Classification area under the receiver operating characteristic (Price et al., supra) of 74 normal samples versus 37 Crohn's disease samples, three CRC studies (n = 4 colon tumour, 288 colorectal adenocarcinoma and 52 colorectal carcinoma, respectively), 30 colorectal adenoma, 4 familial hyperplastic polyposis samples and 47 ulcerative colitis samples. Significance was calculated using the double-sided Wilcoxon Rank Sum test (p).
      Crohn's           CRC         CRC                Colorectal         CRC               Familial hyperplastic    Ulcerative
      disease           (study 1)   (study 2)          adenoma            (study 3)         polyposis                colitis
AUC   0.83              0.99        0.84               0.98               0.78              0.42                     0.90
p     9.9 × 10.sup.−9   0.001       2.1 × 10.sup.−19   2.5 × 10.sup.−14   1.3 × 10.sup.−7   0.59                     9.1 × 10.sup.−14

[0156] In summary, analyses of transcriptome profiling data indicated that the chosen nine tentative CRC biomarkers could separate colorectal tumour samples from apparently normal samples.

[0157] Network Analysis Supports Pathogenic Relevance of the Nine Proteins

[0158] In order to test the pathogenic relevance of the biomarkers the inventors performed Ingenuity Pathway Analysis, as previously described (Gustafsson et al., Science Translational Medicine 7(313):313ra178, 2015). Briefly, the background to the method is that proteins which are associated with the same disease tend to be functionally related and interact, forming network modules (Tan et al., Curr Colorect Canc R 12: 151-161, 2016). Thus, if the proteins did interact, this would support their pathogenic and biomarker relevance. Indeed, the inventors found that the nine proteins did form a network module, which had an overriding function, namely regulating cell death and proliferation (FIG. 7).

[0159] Analyses of Nine Potential Protein Biomarkers in Plasma Samples from CRC Patients and Healthy Controls

[0160] The inventors proceeded to test all nine selected biomarkers in plasma samples obtained from 80 patients with CRC and 80 controls. It was found that seven of the proteins differed significantly between patients and controls: PLOD1, median 10 (0-10) vs. 0.19 (0-7.9) ng/mL, p=1.19×10.sup.−21; LCN2, 2.0 (0-10) vs. 2.3 (1.3-4.2) ng/mL, p=4.84×10.sup.−2; MAD1L1, 0 (0-3.1) vs. 0 (0-10) ng/mL, p=2.73×10.sup.−2; CEACAM5, 0 (0-0.45) vs. 0 (0-3.8) pg/mL, p=1.52×10.sup.−3; P4HA1, 3.7 (0-17) vs. 5.9 (1.4-22) ng/mL, p=4.88×10.sup.−6; TRIM28, 0 (0-0.82) vs. 1.14 (0-20) ng/mL, p=1.00×10.sup.−27; and GNS, 7.1 (2.8-13) vs. 10 (5.3-14) ng/mL, p=1.05×10.sup.−14. By contrast, no significant differences were found for C12orf10, 1.6 (0.1-11) vs. 1.6 (0.58-20) pg/mL, p=4.07×10.sup.−1 and P3H1, 7.8 (0-86) vs. 5.9 (0-586) pg/mL, p=1.28×10.sup.−1 (double-sided Wilcoxon Rank Sum test). Although seven of the proteins differed significantly between patients and controls, they showed considerable variability. This indicated that on their own, none of the proteins would suffice as potential biomarkers for early diagnosis. A combination of all nine proteins was then tested to determine whether this combination separated patients and controls with high accuracy. For that purpose, the inventors calculated AUC and p values as before (double-sided Wilcoxon Rank Sum test p=1.62×10.sup.−19; FIG. 8). While the p value was highly significant, there was some overlap between patients and controls, resulting in a specificity of 92% and a sensitivity of 90%.

[0161] Combinations of smaller numbers of proteins were then tested. Firstly, all possible combinations of the nine proteins were tested in a training set consisting of 72 randomly selected patients and 72 randomly selected controls. As detailed above, optimal combinations of proteins, and diagnostic methods based on these, were determined using the inventors' classification algorithm. Based on the training set, the optimal combination of two proteins utilised TRIM28 and PLOD1; the optimal combination of three proteins utilised TRIM28, PLOD1 and CEACAM5; and the optimal combination of four proteins utilised TRIM28, PLOD1, CEACAM5 and P4HA1. The identified combinations were then tested in a test set consisting of the remaining 8 patients and 8 controls (FIG. 9). As shown, the combination of two proteins yielded a classification accuracy of 81% (sensitivity 88%, specificity 75%). The combination of three proteins was superior, with a classification accuracy of 94% (sensitivity 100%, specificity 88%), while the combination of four proteins displayed 100% accuracy. The combinations of three and four proteins, in particular, offer significant improvements to CRC screening accuracy relative to existing methods. Importantly, both display 100% sensitivity, indicating that false negatives can be minimised.
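The reported percentages can be related to counts in the 16-sample test set by a short worked calculation. The counts below are not stated in the text; they are inferred as the only whole-number splits of 8 patients and 8 controls that reproduce the reported figures after rounding, and should be read as an assumption. The function name is illustrative.

```python
def metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # patients correctly identified
    specificity = tn / (tn + fp)          # controls correctly identified
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Inferred (assumed) counts for the test set of 8 patients and 8 controls.
inferred_counts = {
    "two proteins": dict(tp=7, fn=1, tn=6, fp=2),
    "three proteins": dict(tp=8, fn=0, tn=7, fp=1),
    "four proteins": dict(tp=8, fn=0, tn=8, fp=0),
}

for name, counts in inferred_counts.items():
    sens, spec, acc = metrics(**counts)
    print(f"{name}: sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.4f}")
```

With a test set of this size, each misclassified sample shifts sensitivity or specificity by 12.5 percentage points, so the percentages are best read alongside the underlying counts.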

[0162] The diagnostic methodologies applied by the classification algorithms were determined. For the method based on three proteins, it was determined that the following diagnostic method was performed:

[0163] i) if the concentration of TRIM28 is greater than 0.27 ng/ml, the subject does not have colorectal cancer;

[0164] ii) if the concentration of TRIM28 is less than or equal to 0.27 ng/ml and the concentration of PLOD1 is greater than or equal to 1.69 ng/ml, the subject has colorectal cancer;

[0165] iii) if the concentration of TRIM28 is less than or equal to 0.27 ng/ml, the concentration of PLOD1 is less than 1.69 ng/ml and the concentration of CEACAM5 is greater than or equal to 0.125 pg/ml, the subject has colorectal cancer; and

[0166] iv) if the concentration of TRIM28 is less than or equal to 0.27 ng/ml, the concentration of PLOD1 is less than 1.69 ng/ml and the concentration of CEACAM5 is less than 0.125 pg/ml, the subject does not have colorectal cancer.
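Rules i) to iv) form a three-level decision tree, which can be written out as a minimal sketch. The thresholds are those given above; the function and constant names are illustrative and are not part of the described method. The example inputs are the group medians reported for the plasma samples, used here only to show the flow through the tree.

```python
# Thresholds taken from rules i) to iv).
TRIM28_CUTOFF_NG_ML = 0.27
PLOD1_CUTOFF_NG_ML = 1.69
CEACAM5_CUTOFF_PG_ML = 0.125


def classify_three_protein(trim28_ng_ml, plod1_ng_ml, ceacam5_pg_ml):
    """Return True if the rules indicate colorectal cancer, otherwise False."""
    # Rule i): TRIM28 above its cutoff indicates no colorectal cancer.
    if trim28_ng_ml > TRIM28_CUTOFF_NG_ML:
        return False
    # Rule ii): low TRIM28 together with PLOD1 at or above its cutoff.
    if plod1_ng_ml >= PLOD1_CUTOFF_NG_ML:
        return True
    # Rules iii) and iv): low TRIM28 and low PLOD1; CEACAM5 decides.
    return ceacam5_pg_ml >= CEACAM5_CUTOFF_PG_ML


# Reported patient medians: TRIM28 0 ng/mL, PLOD1 10 ng/mL, CEACAM5 0 pg/mL.
print(classify_three_protein(0.0, 10.0, 0.0))
# Reported control medians: TRIM28 1.14 ng/mL, PLOD1 0.19 ng/mL, CEACAM5 0 pg/mL.
print(classify_three_protein(1.14, 0.19, 0.0))
```

TRIM28 and PLOD1 are supplied in ng/ml and CEACAM5 in pg/ml, matching the units of the thresholds.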

[0167] For the method based on four proteins, it was determined that the following diagnostic method was performed:

[0168] i) if (6.7×[TRIM28])−(0.65×[PLOD1])−(13.13×[CEACAM5])+(0.43×[P4HA1])−2.21 is less than zero, the subject has colorectal cancer; and

[0169] ii) if (6.7×[TRIM28])−(0.65×[PLOD1])−(13.13×[CEACAM5])+(0.43×[P4HA1])−2.21 is greater than or equal to zero, the subject does not have colorectal cancer.

[0170] The protein concentrations are utilised in this method in the following units: TRIM28, PLOD1 and P4HA1 in ng/ml, and CEACAM5 in pg/ml.
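The four-protein method is a linear score compared against zero, which can be sketched as follows. The coefficients and units are those given above; the function names are illustrative. The example inputs are the group medians reported for the plasma samples and serve only as a worked calculation, not as additional results.

```python
def four_protein_score(trim28_ng_ml, plod1_ng_ml, ceacam5_pg_ml, p4ha1_ng_ml):
    """Linear score; TRIM28, PLOD1 and P4HA1 in ng/ml, CEACAM5 in pg/ml."""
    return (
        6.7 * trim28_ng_ml
        - 0.65 * plod1_ng_ml
        - 13.13 * ceacam5_pg_ml
        + 0.43 * p4ha1_ng_ml
        - 2.21
    )


def classify_four_protein(trim28_ng_ml, plod1_ng_ml, ceacam5_pg_ml, p4ha1_ng_ml):
    """Return True (colorectal cancer) when the score is below zero."""
    score = four_protein_score(trim28_ng_ml, plod1_ng_ml, ceacam5_pg_ml, p4ha1_ng_ml)
    return score < 0


# Reported patient medians: TRIM28 0, PLOD1 10, CEACAM5 0, P4HA1 3.7.
patient_score = four_protein_score(0.0, 10.0, 0.0, 3.7)
# Reported control medians: TRIM28 1.14, PLOD1 0.19, CEACAM5 0, P4HA1 5.9.
control_score = four_protein_score(1.14, 0.19, 0.0, 5.9)
print(patient_score, control_score)
```

For the patient medians the score is 0 − 6.5 − 0 + 1.591 − 2.21 = −7.119, which is below zero; for the control medians it is 7.638 − 0.1235 − 0 + 2.537 − 2.21 = 7.8415, which is above zero.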