CLASSIFIER MODELS TO PREDICT TISSUE OF ORIGIN FROM TARGETED TUMOR DNA SEQUENCING
20220392579 · 2022-12-08
Inventors
- Michael F. Berger (New York, NY, US)
- Barry S. Taylor (New York, NY, US)
- Alexander Penson (New York, NY, US)
- Niedzica Camacho (New York, NY, US)
CPC classification
G16B40/00
PHYSICS
C12Q2600/112
CHEMISTRY; METALLURGY
G16B20/00
PHYSICS
International classification
G16B40/00
PHYSICS
Abstract
Disclosed are systems and methods for using genomic features revealed by clinical targeted tumor sequencing to predict tissue of origin. Using machine learning techniques, an algorithmic classifier is constructed and trained on a large cohort of prospectively sequenced tumors to predict cancer type and origin from DNA sequence data obtained at the point of care. Genome-directed reassessment of classifications may prompt tumor type reclassification resulting in altered cancer therapy. The clinical implementation of artificial intelligence to guide tumor type classifications at the point of care can complement standard histopathology and imaging to enable improved classification accuracy.
Claims
1. A method for classifying tumor origin sites, the method comprising: sequencing genetic material in a tissue sample from a subject to generate a subject sample dataset comprising one or more subject genes and one or more subject gene alteration categories; applying a predictive model to the subject sample dataset to generate one or more cancer origin site classifications, the predictive model having been trained using a training dataset generated from sequence reads corresponding to genetic material from a cohort of study subjects with known cancers, the training dataset comprising one or more genes, one or more gene alteration categories corresponding to the one or more genes, and one or more labels characterizing tumor origin sites for the known cancers of the study subjects in the cohort; and storing, in one or more data structures, an association between the subject and the one or more cancer origin site classifications.
2. The method of claim 1, wherein the predictive model is a random forest classification model.
3. The method of claim 2, wherein a feature set for the predictive model comprises one or more categories selected from a group consisting of mutations, indels, focal amplifications and deletions, broad copy number gains and losses, structural rearrangements, mutation signatures, mutation rate, and sex.
4. The method of claim 3, wherein classifier scores for the predictive model were calibrated using multinomial logistic regression to match empirically observed classification probabilities.
5. The method of claim 1, further comprising training the predictive model using supervised or unsupervised learning.
6. The method of claim 1, further comprising generating the training dataset.
7. The method of claim 6, wherein generating the training dataset further comprises acquiring, from a sequencing device, the sequence reads corresponding to the genetic material from the cohort of study subjects, and using the sequence reads to generate the training dataset.
8. The method of claim 1, wherein the cohort excludes study subjects with rare cancers not in the top 30 most common cancer types.
9. The method of claim 1, wherein the training dataset comprises gene alteration categories comprising one or more selected from a group consisting of gene amplification (AMP), chromosome gain, homozygous deletion, hotspot, allele, chromosome loss, promoter, signature, structural variant (SV), truncation, and variant of unknown significance (VUS).
10. The method of claim 1, wherein the one or more labels indicate whether a set of genes in the training dataset is from a cancer subject in the cohort of study subjects.
11. The method of claim 1, wherein the predictive model is configured to accept data on genes and gene alterations as inputs and to provide one or more cancer origin site classifications as output.
12. The method of claim 11, wherein the one or more cancer origin site classifications identify at least one of an internal organ of the subject or a cancer type.
13. The method of claim 11, wherein the predictive model is further configured to generate a confidence score for each cancer origin site classification.
14. The method of claim 13, wherein each confidence score corresponds with a likelihood of a cancer origin site for a tumor.
15. A system for classifying tumor origin sites, the system comprising a computing device having one or more processors configured to: acquire, from a sequencing device, sequence reads corresponding to genetic material in a tissue sample from a subject; generate, using the sequence reads, a subject sample dataset comprising one or more subject genes and one or more subject gene alteration categories; apply a predictive model to the subject sample dataset to generate one or more cancer origin site classifications, the predictive model having been trained using a training dataset generated using sequence reads corresponding to genetic material from a cohort of study subjects with known cancers, the training dataset comprising one or more genes, one or more gene alteration categories corresponding to the one or more genes, and one or more labels characterizing tumor origin sites for the known cancers of the study subjects in the cohort; and store, in one or more data structures, an association between the subject and the one or more cancer origin site classifications.
16. The system of claim 15, wherein the predictive model is a random forest classification model.
17. The system of claim 15, wherein the one or more processors are further configured to train the predictive model such that it is configured to accept data on genes and gene alterations as inputs and to provide one or more cancer origin site classifications as output.
18. The system of claim 15, wherein the one or more processors are further configured to generate the training dataset using the sequence reads corresponding to the genetic material from the study subjects in the cohort.
19. The system of claim 15, wherein the predictive model is further configured to generate a confidence score for each cancer origin site classification, wherein each confidence score corresponds to a likelihood of a cancer origin site for a tumor.
20. A system for determining sites of origin for cancer based on sequencing of genes, the system comprising one or more processors configured to: obtain a training dataset comprising a plurality of sample-derived genetic sequences corresponding to a plurality of cancer subjects, each sample defining a set of genes and a category, the category of each sample defining at least one alteration to the set of genes and/or at least one genomic alteration in the sample; train, using the plurality of sample genetic sequences, a classification model configured to generate likelihoods for corresponding cancer origin sites; acquire, via a sequencer, a genetic sequence corresponding to a subject, the genetic sequence including a set of genes and a category, the category of the genetic sequence defining a nature of alteration to the set of genes in the genetic sequence; and apply the classification model to the genetic sequence to determine a set of likelihoods for a corresponding set of origin sites of cancers, each likelihood indicating a probability measure that the genetic sequence correlates with a presence of cancer at a corresponding origin site.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0016] The objects, aspects, features, and advantages of the disclosure will become more apparent and better understood by referring to the following description taken in conjunction with the accompanying drawing, in which:
[0017]
[0018]
[0019]
[0020]
[0021]
[0022]
[0023]
[0024]
[0025]
[0026]
[0027]
[0028]
DETAILED DESCRIPTION
[0029] For purposes of reading the description of the various embodiments below, the following descriptions of the sections of the specification and their respective contents may be helpful:
[0030] Section A describes systems and methods of predicting tissue of origin from targeted tumor DNA sequencing.
[0031] Section B describes a network environment and computing environment which may be useful for practicing embodiments described herein.
Definitions
[0032] The definitions of certain terms as used in this specification are provided below. Unless defined otherwise, all technical and scientific terms used herein generally have the same meaning as commonly understood by one of ordinary skill in the art to which the present technology belongs.
[0033] As used in this specification and the appended claims, the singular forms “a”, “an” and “the” include plural referents unless the content clearly dictates otherwise. For example, reference to “a cell” includes a combination of two or more cells, and the like. Generally, the nomenclature used herein and the laboratory procedures in cell culture, molecular genetics, organic chemistry, analytical chemistry and nucleic acid chemistry and hybridization described below are those well-known and commonly employed in the art.
[0034] As used herein, the term “about” in reference to a number is generally taken to include numbers that fall within a range of 1%, 5%, or 10% in either direction (greater than or less than) of the number unless otherwise stated or otherwise evident from the context (except where such number would be less than 0% or exceed 100% of a possible value). As used herein, an “allele” refers to one of several alternative forms of a gene occupying a given locus on a chromosome.
[0035] As used herein, the terms “cancer,” “neoplasm,” and “tumor,” are used interchangeably and refer to cells that have undergone a malignant transformation that makes them pathological to the host organism or subject. Primary cancer cells (that is, cells obtained from near the site of malignant transformation) can be readily distinguished from non-cancerous cells by well-established techniques, particularly histological examination. The definition of a cancer cell, as used herein, includes not only a primary cancer cell, but any cell derived from a cancer cell ancestor. This includes metastasized cancer cells, and in vitro cultures and cell lines derived from cancer cells. When referring to a type of cancer that normally manifests as a solid tumor, a “clinically detectable” tumor is one that is detectable on the basis of tumor mass; e.g., by procedures such as CAT scan, MR imaging, X-ray, ultrasound or palpation, and/or which is detectable because of the expression of one or more cancer-specific antigens in a sample obtainable from a patient.
[0036] As used herein, a “chromosome” refers to a discrete threadlike structure of nucleic acids and proteins that carries genetic information in the form of genes. Chromosomes are visible as morphological entities only during cell division. In humans, each chromosome has two arms, the p (short) arm and the q (long) arm. The short and long chromosome arms are separated from each other only by a centromere, which is the point at which the chromosome is attached to the mitotic spindle during cell division. A chromosome contains roughly equal parts of protein and DNA. The chromosomal DNA contains an average of 150 million nucleotides or bases. The 3 billion base pairs in the human genome are organized into 24 chromosomes. All genes are arranged linearly along the chromosomes. Generally the nucleus of a human cell contains two sets of chromosomes: a maternal set and a paternal set. Each set has 23 single chromosomes: 22 autosomes and an X or a Y sex chromosome.
[0037] As used herein, “chromosome gain” refers to the duplication of a chromosome or a chromosomal segment (e.g., p (short) arm or q (long) arm) leading to an unbalanced chromosome complement, or any chromosome number that is not an exact multiple of the haploid number (which is 23 in humans).
[0038] As used herein, “chromosome loss” refers to the loss of a chromosome or a chromosomal segment (e.g., p (short) arm or q (long) arm) leading to an unbalanced chromosome complement, or any chromosome number that is not an exact multiple of the haploid number (which is 23 in humans).
[0039] As used herein, a “deletion” refers to a mutation (or a genetic alteration) in which part of a DNA sequence at a chromosome location is absent or lost compared to that observed in a reference genome. A deletion may occur within a gene or may encompass one or more genes. A “homozygous deletion” refers to the loss of both alleles of a gene within a genome. A homozygous deletion may comprise a partial or complete loss of each copy (maternal and paternal) of the gene sequence.
[0040] As used herein, “expression” includes one or more of the following: transcription of the gene into precursor mRNA; splicing and other processing of the precursor mRNA to produce mature mRNA; mRNA stability; translation of the mature mRNA into protein (including codon usage and tRNA availability); and glycosylation and/or other modifications of the translation product, if required for proper expression and function.
[0041] As used herein, the term “gene” means a segment of DNA that contains all the information for the regulated biosynthesis of an RNA product, including promoters, exons, introns, and other untranslated regions that control expression.
[0042] As used herein, “gene amplification” refers to an increase in the number of partial or complete copies of a single gene sequence or several gene sequences at a specific chromosome locus without a proportional increase in other genes. In some embodiments, gene amplifications can result from duplication of a DNA segment that contains a gene through errors in DNA replication and repair machinery. Gene amplification is common in cancer cells, and may cause an increase in the corresponding RNA and protein encoded by the amplified gene(s).
[0043] As used herein, “haploid” describes a cell that contains a single set of chromosomes, e.g., a copy of each autosome and one sex chromosome. In humans, gametes are haploid cells that contain 23 chromosomes, each of which represents one of a chromosome pair that exists in diploid cells. The number of chromosomes in a single set is represented as n, which is also called the haploid number (in humans, n=23).
[0044] As used herein, a “hotspot” refers to a site at which mutations or recombination events occur with a significantly higher frequency relative to the mutation or recombination rates of other sites within the genome of a subject. A “hotspot allele” refers to an allele in a hotspot region that occurs at a significantly higher frequency relative to other alleles at the same region. Examples of hotspot alleles are described in Chang M T, et al., Cancer Discov. 2018; 8(2):174-183.
[0045] As used herein, a “promoter” means a nucleic acid sequence capable of inducing transcription of a gene in a cell. A promoter is implicated in the recognition and binding of RNA polymerase and other proteins involved in transcription. Promoters may be constitutive, inducible, tissue-specific, ubiquitous, heterologous or endogenous.
[0046] As used herein, “signatures” refer to combinations of mutation types that are generated by different mutational processes. Signatures may be derived based on the analysis of whole genome sequences of thousands of tumors (See e.g., Alexandrov L B et al., Nature. 2013; 500(7463):415-421). Different signatures are identified based on the observed substitution classes (e.g., C>A, C>G) and the immediate flanking nucleotides (e.g., ACA>AAA, ACC>AAC). For example, for each tumor profile with a sufficient number of mutations, the observed mutations are compared to the known signatures and the dominant signature responsible for the observed profile is determined. In some embodiments, a signature contributes to the large majority of somatic mutations in the tumor class. If multiple mutational processes are operative, a jumbled composite signature is generated. Examples of methods for extracting mutational signatures from catalogues of somatic mutations are described in Alexandrov L B et al., Nature. 2013; 500(7463):415-421.
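As a rough illustration of how substitution classes and their immediate flanking nucleotides could be tallied for one tumor, the following Python sketch reports each single-base substitution relative to a pyrimidine reference base and counts the resulting trinucleotide classes. The function names and the input format (pairs of reference trinucleotide and alternate base) are assumptions made for illustration and are not taken from the cited publication.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def normalize_substitution(context, alt):
    """Express a substitution relative to a pyrimidine (C or T) reference base.

    context: three-base reference sequence centered on the mutated base.
    alt: the observed alternate base at the center position.
    Returns a string such as "ACA>AAA".
    """
    if len(context) != 3:
        raise ValueError("context must be exactly three bases")
    ref = context[1]
    if ref == alt:
        raise ValueError("alternate base must differ from the reference base")
    if ref in "AG":  # purine reference: report the opposite strand instead
        context = reverse_complement(context)
        alt = COMPLEMENT[alt]
    mutated = context[0] + alt + context[2]
    return context + ">" + mutated


def count_contexts(substitutions):
    """Tally normalized trinucleotide substitution classes for one tumor."""
    return Counter(normalize_substitution(ctx, alt) for ctx, alt in substitutions)


# Hypothetical input: (reference trinucleotide, alternate base) pairs.
profile = count_contexts([("ACA", "A"), ("TGT", "T"), ("ACC", "A")])
```

In this sketch the second pair, a G>T change in a TGT context, is counted together with the first pair as ACA>AAA because it describes the same event on the opposite strand. Comparing such a profile against known signatures is a separate step that is not shown here.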
[0047] As used herein, “structural variants” or “SVs” include duplications, inversions, translocations or genomic imbalances (insertions and deletions). In some embodiments, SVs are about 500 bp to >1 kb in size. Commonly known structural variations include gene fusions as well as copy-number variants (whereby an abnormal number of copies of a specific genomic area are duplicated in a region of a chromosome).
[0048] As used herein, the terms “subject,” “individual,” or “patient” are used interchangeably and refer to an individual organism, a vertebrate, a mammal, or a human. In certain embodiments, the individual, patient or subject is a human.
[0049] As used herein, “truncation” refers to the premature termination of a polypeptide due to the presence of a termination codon in the sequence of its corresponding structural gene as a result of a nonsense mutation, a frameshift mutation, or a splice site mutation.
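For illustration only, the following Python sketch shows one way a premature termination codon could be detected by comparing a variant coding sequence with its reference. The function names are hypothetical, the sequences are assumed to be in-frame coding DNA starting at the first codon, and splice site mutations are not modeled.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop_codon_index(coding_sequence):
    """Return the zero-based codon index of the first in-frame stop codon, or None."""
    seq = coding_sequence.upper()
    # Ignore any trailing bases that do not form a complete codon.
    for i in range(0, len(seq) - len(seq) % 3, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


def is_truncating(reference_cds, variant_cds):
    """Report whether the variant terminates translation earlier than the reference."""
    ref_stop = first_stop_codon_index(reference_cds)
    var_stop = first_stop_codon_index(variant_cds)
    if var_stop is None:
        return False
    return ref_stop is None or var_stop < ref_stop


# Hypothetical sequences: the variant changes the second codon CAG to TAG.
reference = "ATGCAGTGGTAA"
nonsense_variant = "ATGTAGTGGTAA"
truncating = is_truncating(reference, nonsense_variant)
```

A frameshift can be checked the same way by passing the shifted sequence as `variant_cds`, since the shifted reading frame typically encounters a stop codon earlier than the reference does.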
[0050] As used herein, “variant of unknown significance” or “VUS” refers to an allele, or variant form of a gene, which has been identified through genetic testing, but whose significance to the function or health of an organism is not known.
A. Systems and Methods of Predicting Tissue of Origin from Targeted Tumor DNA Sequencing
Introduction
[0051] The clinical management of cancer is largely determined by its site of origin, histopathologic subtype, and stage. Even for patients with tumors harboring a therapeutically sensitizing mutation that can guide molecularly-targeted therapy, clinical responses are often influenced by tumor origin. For example, BRAF V600E mutations are observed in cancers arising from numerous tissue sites, and the likelihood of response to RAF inhibitors varies widely as a function of tumor type. While critical for guiding patient management, histology-based cancer identification remains challenging in many patients, especially in those initially presenting with metastatic poorly differentiated neoplasms where ambiguous or incorrect classification may adversely impact choice of therapy and outcome.
[0052] While cancer classification has benefited from thorough immunohistochemical evaluation coupled with high quality cross-sectional imaging, molecular alterations highly indicative of the tumor site of origin may further assist in classifications when such tools fail. Some genomic alterations and mutational signatures are strongly associated with specific individual tumor types such as APC loss-of-function mutations in colorectal cancers, TMPRSS2-ERG fusions in prostate cancers, and a UV-associated mutational signature of C>T substitutions in cutaneous melanomas. For other cancer types, combinations of genomic alterations may commonly co-occur, such as TP53 and CTNNB1 mutations in endometrial cancer. The absence of highly prevalent alterations in a given tumor type, such as KRAS mutations in pancreatic adenocarcinoma and recurrent gene fusions in certain sarcomas, can also provide evidence against that particular prediction or classification. Both common and rare genomic alterations across numerous different cancers may, therefore, guide the inference of tumor origin as an adjunct to existing classification approaches.
[0053] The feasibility of tumor type classification from genomic data including mutations, copy number alterations, gene expression, methylation, and nucleosome occupancy may be demonstrated. Moreover, such molecular re-assessment of classifications can lead to a change of therapy. Yet the systematic application of such approaches to prospectively generated clinical sequencing data from often sub-optimal FFPE biopsies and their accuracy when applied to the targeted cancer gene panels most commonly used in the clinic to facilitate treatment selection remain largely unexplored.
[0054] Here, a machine learning-based approach is established to infer the probabilities of each common solid tumor type classification based on a broad array of genomic alterations identified by targeted tumor sequencing. To ensure applicability for clinical care, the model may be trained on prospective genomic data from advanced cancer patients. Using a population-scale approach allowed us to account for the varying prevalence and co-occurrence of genomic features across all tumor types. The probabilistic genome-based tumor type prediction, when considered alongside traditional immunohistochemical and clinical evaluation, can enable improved predictive accuracy, with important therapeutic implications.
Methods
Subjects
[0055] The training dataset was derived from a clinical cohort. Patients with rare cancer types or low tumor content were excluded from analysis, resulting in a total training dataset of patients identified or known to have one of 22 cancer types (Table 1). In various embodiments, cancer types may be deemed rare if, for example, they are not among the 50, 40, 30, 25, 20, 15, or 10 most common cancer types. Additional patients subsequently tested by MSK-IMPACT comprised an independent test set. All patients undergoing MSK-IMPACT testing signed a clinical consent form or enrolled on an institutional IRB-approved research protocol (NCT01775072). Demographic characteristics of both cohorts are displayed in Table 2.
Genomic Analysis
[0056] Tumor and matched normal DNAs were sequenced in a CLIA-compliant laboratory using MSK-IMPACT, an FDA-authorized clinical sequencing assay targeting up to 468 key cancer-associated genes. Genomic alterations including mutations, indels, copy number alterations, structural rearrangements, and selected mutation signatures were reported to patients and physicians to guide clinical care and aggregated in a HIPAA-compliant manner in the cBioPortal for Cancer Genomics for further analysis and visualization.
Random Forest Classifier
[0057] As an example technique that may be used in various potential embodiments to predict tumor site of origin, a random forest classifier may be constructed using the training cohort of patients. Prediction accuracy was determined from five-fold cross validation of the training data as well as the independent test set. As many diverse alterations and mutation patterns are associated with different sites of origin, the feature set for classification was drawn from the following categories: mutations and indels (hotspots and gene-level), focal amplifications and deletions, broad copy number gains and losses, structural rearrangements, mutation signatures, mutation rate, and sex. Classifier scores were subsequently calibrated using multinomial logistic regression to match empirically observed classification probabilities.
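The calibration step described above can be sketched as follows. This is a minimal Python illustration, not the disclosed implementation: it assumes a multinomial logistic model whose coefficients (`weights`, `biases`) have already been fitted elsewhere on cross-validated classifier scores, and it shows only how such a model would map raw scores to probabilities and how a multiclass log loss could be computed. All names and numeric values are hypothetical.

```python
import math


def softmax(logits):
    """Convert real-valued logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def calibrate(raw_scores, weights, biases):
    """Apply a fitted multinomial logistic model to one sample's raw scores.

    weights[k][j] is the coefficient linking raw score j to class k;
    biases[k] is the intercept for class k. Both are assumed to have been
    fitted elsewhere on held-out cross-validation predictions.
    """
    logits = [
        sum(w * s for w, s in zip(row, raw_scores)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


def log_loss(probabilities, true_indices):
    """Mean negative log-probability assigned to the true class."""
    eps = 1e-15  # guard against log(0)
    total = 0.0
    for probs, true_index in zip(probabilities, true_indices):
        total -= math.log(max(probs[true_index], eps))
    return total / len(true_indices)


# Hypothetical three-class example with identity weights and zero intercepts.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
calibrated = calibrate([0.7, 0.2, 0.1], identity, [0.0, 0.0, 0.0])

# A uniform guess over 22 classes scores ln(22), a useful reference point.
uniform_loss = log_loss([[1.0 / 22] * 22], [0])
```

As a point of reference, a classifier that assigned equal probability to each of 22 cancer types would have a log loss of ln(22), approximately 3.09, so lower values indicate that the calibrated probabilities carry information about the true type.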
[0058] It is hypothesized that the information content from clinical targeted tumor genomic profiling would be sufficiently rich to predict the tumor site of origin with high accuracy. A machine learning-based classifier may be established to determine the ability of DNA genomic alterations (specifically, mutations and indels, focal and broad copy number alterations, structural rearrangements, and mutation signatures) to inform the classification of advanced cancer patients, as depicted in
[0059] Referring now to
[0060] Each of the components in the system 200 listed above may be implemented using hardware (e.g., one or more processors coupled with memory) or a combination of hardware and software as detailed herein in Section B. Each of the components in the system 200 may implement or execute the functionalities detailed herein in Section A, such as those described in conjunction with
[0061] The model trainer 208 executing on the classification system 202 may access the training dataset 214 to obtain, retrieve, or otherwise identify training sample datasets 216. The training dataset 214 may have been derived from DNA sequencing (e.g., DNA sequences 218 acquired via sequencer 204) and genetic analysis (e.g., using sequence analyzer 213) of tissue samples from a set of subjects with known cancers. Each DNA sequence sample 216 of the training dataset 214 may record, define, or otherwise include a set of genes, a category, and a label. In various embodiments, particular genes, categories, and labels may be identified and assigned by sequence analyzer 213 analyzing DNA sequences 218. As an example, the set of genes may reference at least some of the genes or alleles described in Table 5. The category may define a nature of alterations to the set of genes of the DNA sequence sample 216. The nature of alterations may include, for example: a gene amplification (AMP), chromosome gain, homozygous deletion, hotspot, allele, chromosome loss, promoter, signature, structural variant (SV), truncation, and variant of unknown significance (VUS), among others. The label may indicate whether the set of genes of the DNA sequence sample 216 is from a cancer subject. In some embodiments, the DNA sequence sample 216 may include one or more traits of the cancer subject, such as sex, age, race and geographic location, among others. The training dataset 214 may be any form of data structure maintainable on the classification system 202, such as an array, a matrix, a table, a linked list, a tree, a heap, and a hash table, among others.
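One possible in-memory representation of such a training dataset is sketched below in Python. The record layout, the field names, and the binary encoding are assumptions for illustration; the label here holds a tumor origin site, following the labels recited in claim 1, and the example genes and sites are placeholders rather than entries from Table 5.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingSample:
    """One record of a training dataset (field names are illustrative)."""
    sample_id: str
    alterations: list = field(default_factory=list)  # (gene, category) pairs
    label: str = ""  # known tumor origin site


def build_vocabulary(samples):
    """Map every observed (gene, category) pair to a column index."""
    pairs = sorted({pair for sample in samples for pair in sample.alterations})
    return {pair: index for index, pair in enumerate(pairs)}


def encode(sample, vocabulary):
    """Return a binary feature vector marking which alterations are present."""
    vector = [0] * len(vocabulary)
    for pair in sample.alterations:
        if pair in vocabulary:  # pairs never seen in training are ignored
            vector[vocabulary[pair]] = 1
    return vector


# Hypothetical records; genes and labels are examples only.
samples = [
    TrainingSample("S1", [("KRAS", "hotspot"), ("TP53", "truncation")], "pancreas"),
    TrainingSample("S2", [("APC", "truncation"), ("KRAS", "hotspot")], "colorectal"),
]
vocabulary = build_vocabulary(samples)
matrix = [encode(sample, vocabulary) for sample in samples]
```

Each row of `matrix` corresponds to one sample and each column to one gene and alteration category pair, which is a form a tree-based classifier can consume directly. Continuous features such as mutation rate would need additional columns that this sketch omits.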
[0062] Using the training dataset 214, the model trainer 208 may train, develop, or otherwise establish the classification model 212. In some embodiments, the model trainer 208 may create or instantiate the classification model 212 in response to identifying the training dataset 214. The classification model 212 may be generated, established, and trained in accordance with any number of classification algorithms, such as a linear discriminant analysis, a support vector machine, a regression model (linear or logistic), a Naïve Bayesian classifier, and k-nearest neighbor classifier, among others. In some embodiments, the classification model 212 may be a random forest classifier and the training of the classification model 212 may be in accordance with a random forest algorithm. The classification model 212 may include a set of decision trees (e.g., a classification and regression tree (CART)) to output a likelihood of a presence of cancer at a site of origin given an input DNA sequence. The site of origin may correspond to a type of cancer, and may correspond with an organ in a subject from which the cancer originated, such as a prostate, bladder, breast, and lymph nodes, among others. The random forest classifier, for example, may be selected for its ability to better accommodate large numbers of potentially informative features, arbitrary combinations of features, and the imbalanced class representation of the cohort. The number of decision trees in the random forest classifier may correspond to the number of sites of origin.
[0063] To train the classification model 212, the model trainer 208 may perform a bootstrap aggregation process (sometimes referred to as bagging) using the training dataset 214. In performing the process, the model trainer 208 may select random subsets of the DNA sequence samples 216. Each selected DNA sequence sample 216 may include the set of genes, the category, and the label. The number of random subsets may be proportional to the number of sites of origin over the total number of DNA sequence samples 216 in the training dataset 214. In some embodiments, the model trainer 208 may construct or train one of the decision trees in the classification model 212 upon selection of the subsets. The construction of the tree may be in accordance with decision tree learning techniques, such as a classification and regression tree (CART). For example, the model trainer 208 may determine or generate a feature space using the variables in the selected random subset of DNA sequence samples 216. The model trainer 208 may divide the feature space based on where the DNA sequence samples 216 fall, and may construct the tree based on the division of the feature space. Subsequent to the construction, the model trainer 208 may determine a performance metric (e.g., Cohen's kappa) to assess the accuracy and confidence of the tree in the classification model 212.
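Two pieces of the process above, drawing a bootstrap subset and scoring predictions with Cohen's kappa, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function names, the fixed random seed, and the toy labels are hypothetical, and tree construction itself is not shown.

```python
import random
from collections import Counter


def bootstrap_sample(samples, rng):
    """Draw len(samples) items with replacement, as in bagging."""
    return [rng.choice(samples) for _ in samples]


def cohens_kappa(true_labels, predicted_labels):
    """Agreement between predictions and truth, corrected for chance."""
    n = len(true_labels)
    observed = sum(t == p for t, p in zip(true_labels, predicted_labels)) / n
    true_counts = Counter(true_labels)
    pred_counts = Counter(predicted_labels)
    # Chance agreement: product of the marginal frequencies, summed over labels.
    expected = sum(
        true_counts[label] * pred_counts[label] for label in true_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: a single class on both sides
    return (observed - expected) / (1.0 - expected)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
draw = bootstrap_sample(["s1", "s2", "s3", "s4"], rng)

truth = ["lung", "lung", "breast", "breast"]
guess = ["lung", "breast", "breast", "breast"]
kappa = cohens_kappa(truth, guess)
```

In the toy example, three of four predictions match (observed agreement 0.75) while chance agreement is 0.5, giving a kappa of 0.5. Because kappa discounts agreement expected from class frequencies alone, it can be more informative than raw accuracy when classes are imbalanced.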
[0064] Once the classification model 212 has been trained or otherwise established, the model applier 210 executing on the classification system 202 can retrieve, receive, or identify at least one patient sample dataset 217 in application dataset 215. The patient sample dataset 217 may comprise or have been derived through genetic analysis (e.g., by sequence analyzer 213) of DNA sequence 218 from the sequencer 204. The sequencer 204 may scan a biopsy sample taken from a subject and perform DNA sequencing to generate the DNA sequence 218, which may be analyzed, for example, by sequence analyzer 213 to identify genes, genetic alterations, etc. (e.g., through comparison of genetic sequences from sequencer 204 with known genetic sequences in a database). The patient or other subject may or may not have cancer. The DNA sequence 218 may include a set of genes and a category. The set of genes may correspond to a particular subset of the DNA sequenced from the tissue sample. The category may define the nature of alteration within the set of genes, such as a gene amplification (AMP), chromosome gain, homozygous deletion, hotspot, allele, chromosome loss, promoter, signature, structural variant (SV), truncation, and variant of unknown significance (VUS), among others. In some embodiments, the DNA sequence 218 may be accompanied by one or more traits, characteristics, or health history of the subject from whom the tissue sample is taken (such as age, gender, smoking history, etc.).
[0065] Genetic sequences from the sequencer 204 may be analyzed to generate a patient sample dataset 217, and the model applier 210 may apply the classification model 212 to the patient sample dataset 217. For example, where a random forest classifier is used, the model applier 210 may feed or provide the patient sample dataset 217 as an input to decision trees of the classification model 212. In applying the classification model 212, the model applier 210 may traverse each tree and nodes along at least one path within each decision tree of the classification model 212. By feeding the DNA sequence 218 to each decision tree of the classification model 212, the model applier 210 may generate or otherwise determine a likelihood of a presence of cancer for each site of origin. With the determination, the model applier 210 may send, transmit, or otherwise provide output data 220, which in some embodiments may be provided to display 206 for presentation and/or may be transmitted or otherwise provided to other computing devices 230 or systems via a wired or wireless network communications interface or transceiver. In various embodiments, additionally or alternatively, one or more data structures 228 (which may be stored in classification system 202, in computing device(s) 230, and/or elsewhere) may be generated to comprise the output data 220, or if data structures 228 were previously generated, the output data 220 may be incorporated therein. Data structures 228 may comprise, for example, associations between patients and one or more cancer origin site classifications. The output data 220 may include the set of likelihoods outputted by the classification model 212.
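The application step described above, in which each tree is traversed and the results are combined into a likelihood per site of origin, can be sketched in Python as follows. This is a simplified illustration only: the nested-dictionary tree layout, the feature names, the hand-built trees, and the patient identifier are all hypothetical, and the likelihoods are plain vote fractions that have not been calibrated.

```python
from collections import Counter


def traverse(tree, features):
    """Follow one path from the root to a leaf and return the leaf's site label."""
    node = tree
    while isinstance(node, dict):
        branch = "present" if features.get(node["feature"], 0) else "absent"
        node = node[branch]
    return node


def site_likelihoods(forest, features):
    """Fraction of trees voting for each origin site."""
    votes = Counter(traverse(tree, features) for tree in forest)
    return {site: count / len(forest) for site, count in votes.items()}


# Hypothetical hand-built trees; a trained forest would hold many more.
forest = [
    {"feature": "APC:truncation", "present": "colorectal", "absent": "lung"},
    {
        "feature": "KRAS:hotspot",
        "present": {
            "feature": "APC:truncation",
            "present": "colorectal",
            "absent": "pancreas",
        },
        "absent": "lung",
    },
    {"feature": "TMPRSS2-ERG:SV", "present": "prostate", "absent": "pancreas"},
]

patient_features = {"APC:truncation": 1, "KRAS:hotspot": 1}
likelihoods = site_likelihoods(forest, patient_features)

# Store the association between a subject and the resulting classifications.
associations = {}
associations["patient-001"] = likelihoods
```

Here two of the three trees vote for a colorectal origin and one for pancreas, so the stored record holds vote fractions of two thirds and one third. A dictionary keyed by subject identifier is only one of the many data structures that could hold such associations.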
[0066] In various embodiments, the training sample datasets 216 may include various other data that may be used to train a predictive model for classifications. For example, in addition to genetic sequence data, the predictive model may be trained using histopathological assessments or other histological data. In various embodiments, the predictive model may be trained by also incorporating other relevant data from the electronic medical records of study subjects.
[0067]
[0068] Using the training dataset, a predictive model (e.g., classification model 212) may be trained at 266. The predictive model may be trained using one or more suitable machine learning techniques, including supervised, unsupervised, or semi-supervised learning techniques. In some embodiments, the predictive model may comprise one or more artificial neural networks. The predictive model may be trained such that it is configured to accept genetic sequencing data (e.g., genes and gene alterations) as input, and generate cancer origin site classifications as outputs. In certain embodiments, process 250 may end (290) after step 266.
[0069] In various embodiments, process 250 may begin (254) by proceeding to model application at 278. In certain embodiments, process 250 may proceed to step 278 following step 266. At 278, genetic material in a tissue sample from a patient may be sequenced (e.g., by sequencer 204 to obtain DNA sequence 218). Genetic sequence data may be analyzed (e.g., by sequence analyzer 213) to identify genes and/or gene alterations. At 282, a patient sample dataset may be generated based on analysis of the sequenced genetic material of the patient. At 286, a trained predictive model (e.g., following step 266) may be applied to the patient sample dataset to generate an output (see, e.g.,
[0070] The outputs (e.g., output data 220) may, in various embodiments, be displayed (e.g., via display 206) and/or transmitted to other computing devices 230 (e.g., devices of healthcare professionals who may be treating the patient) for further analysis and/or for use in planning treatment or therapeutic protocols. In various embodiments, the output data 220 may be further analyzed (by itself or in combination with other patient data available in, e.g., the patient's electronic medical records) by system 200 to automatically generate one or more treatment or therapeutic recommendations. In certain embodiments, output data 220 may comprise various treatment or therapeutic recommendations. An association between a subject and classifications (e.g., organs, cancer types, and/or confidence scores) may be stored in one or more data structures.
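As one hypothetical illustration of storing such an association, a record keyed by subject identifier may hold the likelihoods produced by the model and accept later output for the same subject. The class name, field names, and identifier format below are assumptions made for this sketch only.

```python
from dataclasses import dataclass, field

@dataclass
class OriginSiteRecord:
    """Association between one subject and predicted origin-site likelihoods."""
    subject_id: str
    likelihoods: dict = field(default_factory=dict)

    def top_classification(self):
        """Return the (site, likelihood) pair with the highest likelihood."""
        return max(self.likelihoods.items(), key=lambda item: item[1])

# Hypothetical in-memory store keyed by subject identifier.
records = {}

def store_output(subject_id, likelihoods):
    """Create the record if absent, otherwise merge new output into it."""
    record = records.setdefault(subject_id, OriginSiteRecord(subject_id))
    record.likelihoods.update(likelihoods)
    return record

record = store_output("subject-001", {"Colorectal.Cancer": 0.60, "Thyroid.Cancer": 0.25})
store_output("subject-001", {"Melanoma": 0.15})
```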
Performance of Embodiments of Tumor Type Predictive Model
[0071] In the training set of patients tested by MSK-IMPACT, in an illustrative embodiment, cancer type was accurately predicted in 73.8% of cases based on five-fold cross-validation (
[0072] Due to the importance of high-confidence predictions for clinical decision-making in individual patients, the probability associated with each individual tumor type prediction is estimated. Raw classifier scores were calibrated to match empirically observed classification probabilities from cross-validation (log loss 0.98,
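For illustration only, the two quantities discussed above may be computed as sketched below: the multiclass log loss of a set of predicted probabilities, and the empirically observed accuracy within bins of the raw top score. Equal-width binning is an assumption of this sketch; the calibration procedure actually used is not specified in this passage.

```python
import math

def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Mean negative log of the probability assigned to each true label."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # Clip to avoid log(0) when the true label received zero probability.
        p = min(max(probs.get(label, 0.0), eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(true_labels)

def binned_accuracy(top_scores, was_correct, n_bins=5):
    """Observed fraction correct within equal-width bins of the raw top score."""
    hits = [0] * n_bins
    counts = [0] * n_bins
    for score, correct in zip(top_scores, was_correct):
        index = min(int(score * n_bins), n_bins - 1)
        hits[index] += int(correct)
        counts[index] += 1
    return [h / c if c else None for h, c in zip(hits, counts)]

# Invented cross-validation results for two samples.
true_labels = ["Glioma", "Melanoma"]
predicted = [
    {"Glioma": 0.5, "Melanoma": 0.5},
    {"Glioma": 0.75, "Melanoma": 0.25},
]
loss = multiclass_log_loss(true_labels, predicted)

# Invented raw scores and correctness flags for five held-out predictions.
observed = binned_accuracy(
    [0.1, 0.15, 0.9, 0.95, 0.92], [False, True, True, True, False], n_bins=2
)
```

A raw score falling in a given bin could then be mapped to that bin's observed accuracy, so that a reported probability approximates the fraction of such predictions that were correct in cross-validation.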
Relative Predictive Value of Molecular Features
[0073] Given the diverse categories of genomic features incorporated into the classifier (Table 5), the relative importance of each molecular alteration type to the overall classification performance may be determined. Using the Cohen's kappa metric to represent overall accuracy, it was found that somatic substitutions and indels had the highest predictive value, followed by chromosome arm-level (broad) copy number alterations (CNAs) (
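As a brief illustration of the metric, Cohen's kappa compares the observed agreement between true and predicted labels with the agreement expected by chance, kappa = (p_o - p_e) / (1 - p_e). The labels below are invented. One way to estimate the value of a feature category, assumed here rather than taken from the source, would be to compare kappa for models trained with and without that category.

```python
from collections import Counter

def cohens_kappa(true_labels, predicted_labels):
    """Chance-corrected agreement between true and predicted class labels.

    Undefined (division by zero) when chance agreement equals 1, i.e. when
    both label lists contain a single identical class.
    """
    n = len(true_labels)
    observed = sum(t == p for t, p in zip(true_labels, predicted_labels)) / n
    true_counts = Counter(true_labels)
    pred_counts = Counter(predicted_labels)
    # Chance agreement from the product of the marginal label frequencies.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1.0 - expected)

kappa = cohens_kappa(
    ["Glioma", "Glioma", "Melanoma", "Melanoma"],
    ["Glioma", "Melanoma", "Melanoma", "Melanoma"],
)
```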
[0074] Likewise, there was great breadth and variability among the specific features utilized to predict different cancer types (
[0075] Next, it may be determined whether such feature diversity and feature interaction can discriminate among different tumor types that share a common molecular feature, which by itself is therefore not discriminatory. In BRAF V600E-mutant melanomas, colorectal cancers, and thyroid cancers, where response rates to RAF inhibitor therapies vary, the classifier correctly predicted the tissue of origin in 162/195 cases (83%). Despite the presence of BRAF V600E in all cases, high confidence predictions were driven by distinct co-occurring mutations and genomic features, such as TERT promoter mutations in melanoma and thyroid cancer, APC mutations and microsatellite instability in colorectal cancer, and UV-associated signatures in melanoma (
Application to Cell Free DNA
[0076] Various embodiments of the disclosed approach may employ training data from tissue biopsies of solid tumors. Using non-invasive molecular profiling of plasma circulating tumor DNA (ctDNA), a suggested classification of patients receiving cancer screening or with inaccessible disease may be inferred in various embodiments of the disclosure. The predictive power of an embodiment of the classifier may be tested in two independent cohorts: 19 patients with genitourinary cancers and MSK-IMPACT sequencing of ctDNA, and a set of 41 patients with metastatic breast or prostate cancer and whole exome sequencing (WES) of ctDNA. The tumor type was correctly predicted from MSK-IMPACT in 12/19 (63%) patients with prostate, bladder, and testicular cancer from among the 22 cancer types included in the classifier, including 8/8 predictions with probability >85%. Only 1 prediction (out of 10) with probability >75% was inaccurate; a prostate cancer with a single missense mutation in VHL was incorrectly predicted as renal cell carcinoma. The tumor type was also correctly predicted from WES in 23/27 (85%) patients with breast cancer and in 10/14 (71%) patients with prostate cancer, demonstrating the general applicability of the classifier to multiple sequencing platforms as well as its suitability for diverse specimen types such as ctDNA.
Application of Various Embodiments to Challenging Clinical Scenarios
[0077] Given the predictive power of embodiments of the disclosed classifier, it was sought to determine the impact of real-time molecularly-driven classifications in multiple challenging clinical scenarios. One unmet clinical need for such accurate classification is the inference of the tissue of origin for cancers of unknown primary site (CUP). Refining tumor classification in this population can facilitate selection of potentially effective routine and investigational therapies. Using an embodiment of a trained predictive model, a likely tissue of origin may be predicted with, for example, a probability>50% in 67% (95/141) of patients (
[0078] In various embodiments, the classifier of the predictive model could help resolve the uncertainty that arises in distinguishing between primary brain tumors and tumors metastatic to the central nervous system (CNS). Including both cohorts, 299 brain metastases of solid tumors originating outside the CNS may be sequenced, including 133 non-small cell lung cancers, 56 breast cancers, 43 melanomas, and 67 other tumors. The tumor type was correctly predicted in 83% (248/299) of cases. Importantly, out of 51 incorrect predictions, only 2 were predicted as glioma. These results illustrate the predictive value of the classifier for CNS tumors and its promise for non-invasive ctDNA profiling from cerebrospinal fluid.
[0079] Another common and complex challenge occurs when patients with a history of cancer present with a new tumor that may represent either a distant metastasis of their previously classified tumor or a second primary tumor. Therefore, various embodiments may employ molecularly driven classifications to clarify such complex distinctions between tumor types. In one representative case, a 67-year-old female with a history of breast cancer presented with a lymph node lesion three years after her initial classification. Histopathological assessment suggested metastatic poorly differentiated adenocarcinoma with micropapillary and apocrine cytology, and immunohistochemistry showed weak-to-moderate estrogen receptor staining, collectively leading to a classification of estrogen receptor-positive (ER+) breast cancer and a planned regimen of hormonal therapy (
[0080] Two cancers in a single patient may occasionally share mechanisms of pathogenesis that further complicate the distinction between metastatic progression and independent primary tumors. In a representative case, a 77-year-old female was referred to the center with lesions in the breast and bladder and a classification of metastatic breast lobular carcinoma (
[0081] In various embodiments, a systematic computational approach may be developed and deployed for molecularly-driven prediction of the site of origin of tumors based on targeted DNA sequencing. While tumor sequencing is rapidly being adopted as a routine test in clinical cancer care, its impact thus far has been largely limited to driving new enrollments onto clinical trials and for the identification of biomarkers of treatment response and resistance. In various embodiments, such sequencing informs cancer classification, potentially as an adjunct to histopathologic assessment. In this approach, multi-faceted molecular alteration types may be incorporated into a probabilistic prediction to accurately identify therapeutically significant cancer type differences under challenging classification circumstances.
[0082] Various embodiments may have a wide array of clinical applications. Genome-directed classification, as typified by the representative cases here, can alter patient eligibility for various clinical modalities. As liquid biopsy is increasingly used as a screening tool for cancer recurrence and new malignancies, the approach can inform the site of origin when ctDNA is detected. There are also many ways in which predictions may be utilized clinically, especially in light of the development of probability estimates on individual predictions. In cases in which traditional classification is ambiguous or challenging, computational predictions from genomic data can exclude possibilities even if the predictions are not definitive. In other cases, a high-confidence prediction that disagrees with the defined or suspected classification can prompt pathological and clinical re-evaluation, allowing additional testing that may help support an alternative classification. In contrast to using mRNA-based tissue classification to predict the site of origin for CUP, an advantage of embodiments of the disclosed approach is their ability to enumerate the discrete genomic features driving individual predictions, thereby providing pathologists and oncologists an opportunity to rationally interpret discordant results.
[0083] The high accuracy of the classifier, trained on MSK-IMPACT data, for predicting tumor type from ctDNA WES data suggests broad applicability to other panels with shared genomic targets. The disclosed approach may resolve challenging classification scenarios, alter established classifications (via prompting of additional pathological assessment), and affect therapeutic modalities.
[0084] Overall, as the understanding improves of how lineage influences response to the newest generation of therapies in cancer, embodiments of the disclosed systematic approach to molecularly-driven classification coupled to clinical histories, histopathologic assessment, and imaging will improve classifications and treatment decisions. The results exemplify the emerging and powerful role of artificial intelligence in medicine for clinical decision support.
Supplementary Content for Various Potential Embodiments
Detailed Methods
Training Set
[0085] The dataset was derived from the MSK-IMPACT (Memorial Sloan Kettering-Integrated Mutation Profiling of Actionable Cancer Targets) clinical series and includes samples from cancer patients among more than 60 cancer types. Patients predominantly exhibited advanced metastatic disease, and all patients consented to somatic mutation profiling in a CLIA-compliant laboratory. The cancer type and primary site classifications for each sample in this cohort were determined and recorded in real time as part of the clinical workup of each case. Molecular pathology fellows reviewed the surgical pathology report available at the time of MSK-IMPACT testing and selected the most appropriate OncoTree code representing the detailed tumor type. In total, 22 major cancer types with more than 40 independent tumors were selected for this analysis (Table 1). Samples that were not associated with a classification of one of these 22 selected cancer types were excluded from the training set.
TABLE-US-00001 TABLE 1 Distinct tumor types considered for classification
CANCER_TYPE: CANCER_TYPE_DETAILED
Bladder.Cancer: Bladder Urothelial Carcinoma | Upper Tract Urothelial Carcinoma
Breast.Cancer: Adenoid Cystic Breast Cancer | Breast Carcinoma | Breast Invasive Cancer, NOS | Breast Invasive Carcinoma, NOS | Breast Invasive Ductal Carcinoma | Breast Invasive Lobular Carcinoma | Breast Invasive Mixed Mucinous Carcinoma | Breast Mixed Ductal and Lobular Carcinoma | Metaplastic Breast Cancer
Cholangiocarcinoma: Cholangiocarcinoma | Extrahepatic Cholangiocarcinoma | Intrahepatic Cholangiocarcinoma | Perihilar Cholangiocarcinoma
Colorectal.Cancer: Colon Adenocarcinoma | Colorectal Adenocarcinoma | Medullary Carcinoma of the Colon | Mucinous Adenocarcinoma of the Colon and Rectum | Mucinous Colorectal Carcinoma | Rectal Adenocarcinoma
Endometrial.Cancer: Endometrial Carcinoma | Uterine Carcinosarcoma/Uterine Malignant Mixed Mullerian Tumor | Uterine Clear Cell Carcinoma | Uterine Dedifferentiated Carcinoma | Uterine Endometrioid Carcinoma | Uterine Mixed Endometrial Carcinoma | Uterine Neuroendocrine Carcinoma | Uterine Serous Carcinoma/Uterine Papillary Serous Carcinoma | Uterine Undifferentiated Carcinoma
Esophagogastric.Cancer: Adenocarcinoma of the Gastroesophageal Junction | Esophageal Adenocarcinoma | Esophageal Squamous Cell Carcinoma | Esophagogastric Adenocarcinoma | Intestinal Type Stomach Adenocarcinoma | Poorly Differentiated Carcinoma of the Stomach | Signet Ring Cell Carcinoma of the Stomach | Stomach Adenocarcinoma | Tubular Stomach Adenocarcinoma
Gastrointestinal.Stromal.Tumor: Gastrointestinal Stromal Tumor
Germ.Cell.Tumor: Embryonal Carcinoma | Immature Teratoma | Mature Teratoma | Mixed Germ Cell Tumor | Non-Seminomatous Germ Cell Tumor | Seminoma | Teratoma | Teratoma with Malignant Transformation | Yolk Sac Tumor
Glioma: Anaplastic Astrocytoma | Anaplastic Ganglioglioma | Anaplastic Oligoastrocytoma | Anaplastic Oligodendroglioma | Astrocytoma | Diffuse Intrinsic Pontine Glioma | Ganglioglioma | Glioblastoma Multiforme | Gliosarcoma | High-Grade Glioma, NOS | Low-Grade Glioma, NOS | Oligoastrocytoma | Oligodendroglioma | Pilocytic Astrocytoma | Pleomorphic Xanthoastrocytoma
Head.and.Neck.Cancer: Clear Cell Odontogenic Carcinoma | Epithelial-Myoepithelial Carcinoma | Head and Neck Carcinoma, Other | Head and Neck Neuroendocrine Carcinoma | Head and Neck Squamous Cell Carcinoma | Head and Neck Squamous Cell Carcinoma of Unknown Primary | Hypopharynx Squamous Cell Carcinoma | Larynx Squamous Cell Carcinoma | Nasopharyngeal Carcinoma | Odontogenic Carcinoma | Oral Cavity Squamous Cell Carcinoma | Oropharynx Squamous Cell Carcinoma | Sinonasal Adenocarcinoma | Sinonasal Squamous Cell Carcinoma | Sinonasal Undifferentiated Carcinoma
Melanoma: Acral Melanoma | Anorectal Mucosal Melanoma | Cutaneous Melanoma | Desmoplastic Melanoma | Genitourinary Mucosal Melanoma | Head and Neck Mucosal Melanoma | Melanoma of Unknown Primary | Mucosal Melanoma of the Esophagus | Mucosal Melanoma of the Urethra | Mucosal Melanoma of the Vulva/Vagina | Primary CNS Melanoma
Mesothelioma: Peritoneal Mesothelioma | Pleural Mesothelioma | Pleural Mesothelioma, Biphasic Type | Pleural Mesothelioma, Epithelioid Type | Pleural Mesothelioma, Sarcomatoid Type | Testicular Mesothelioma
Neuroblastoma: Neuroblastoma
Non.Small.Cell.Lung.Cancer: Atypical Lung Carcinoid | Basaloid Large Cell Carcinoma of the Lung | Ciliated Muconodular Papillary Tumor of the Lung | Large Cell Lung Carcinoma | Large Cell Neuroendocrine Carcinoma | Lung Adenocarcinoma | Lung Adenosquamous Carcinoma | Lung Carcinoid | Lung Squamous Cell Carcinoma | Lymphoepithelioma-like Carcinoma of the Lung | Non-Small Cell Lung Cancer | Pleomorphic Carcinoma of the Lung | Poorly Differentiated Non-Small Cell Lung Cancer | Sarcomatoid Carcinoma of the Lung | Spindle Cell Carcinoma of the Lung
Ovarian.Cancer: Clear Cell Ovarian Cancer | Endometrioid Ovarian Cancer | High-Grade Neuroendocrine Carcinoma of the Ovary | High-Grade Serous Ovarian Cancer | Low-Grade Serous Ovarian Cancer | Mixed Ovarian Carcinoma | Mucinous Ovarian Cancer | Ovarian Cancer, Other | Ovarian Carcinosarcoma/Malignant Mixed Mesodermal Tumor | Ovarian Epithelial Tumor | Ovarian Seromucinous Carcinoma | Serous Borderline Ovarian Tumor | Serous Borderline Ovarian Tumor, Micropapillary | Serous Ovarian Cancer | Small Cell Carcinoma of the Ovary
Pancreatic.Cancer: Acinar Cell Carcinoma of the Pancreas | Adenosquamous Carcinoma of the Pancreas | Intraductal Papillary Mucinous Neoplasm | Mucinous Cystic Neoplasm | Pancreatic Adenocarcinoma | Pancreatoblastoma | Serous Cystadenoma of the Pancreas | Solid Pseudopapillary Neoplasm of the Pancreas | Undifferentiated Carcinoma of the Pancreas
Pancreatic.Neuroendocrine.Tumor: Pancreatic Neuroendocrine Tumor
Prostate.Cancer: Prostate Adenocarcinoma | Prostate Neuroendocrine Carcinoma | Prostate Small Cell Carcinoma
Renal.Cell.Cancer: Chromophobe Renal Cell Carcinoma | Collecting Duct Renal Cell Carcinoma | Papillary Renal Cell Carcinoma | Renal Angiomyolipoma | Renal Cell Carcinoma | Renal Clear Cell Carcinoma | Renal Clear Cell Carcinoma with Sarcomatoid Features | Renal Medullary Carcinoma | Renal Mucinous Tubular Spindle Cell Carcinoma | Renal Oncocytoma | Translocation-Associated Renal Cell Carcinoma | Unclassified Renal Cell Carcinoma
Small.Cell.Lung.Cancer: Lung Neuroendocrine Tumor | Small Cell Lung Cancer
Thyroid.Cancer: Anaplastic Thyroid Cancer | Follicular Thyroid Cancer | Hurthle Cell Thyroid Cancer | Medullary Thyroid Cancer | Papillary Thyroid Cancer | Poorly Differentiated Thyroid Cancer
Uveal.Melanoma: Uveal Melanoma
Total
[0086] The MSK-IMPACT cohort includes many samples derived from biopsy specimens with often low tumor content. Such samples can have reduced sensitivity for detection of genomic alterations, especially changes in DNA copy number. In order to reduce associated bias in the frequency of the genomic alterations defining each cancer type, samples for which all mutations have a somatic mutant allele frequency less than 10% and with copy number alterations with an absolute log ratio less than 0.2 were excluded from the training set. Samples with no evident genomic alterations were also excluded from the training set and were not used for prediction. Only one sample per patient was included, with preference given to primary over metastatic samples. In total, the training set excluded samples from less frequent cancer types, samples from low purity specimens, and redundant samples from patients with more than one tumor specimen sequenced. The resulting training cohort included samples. Prediction accuracy may be determined for samples in the training set using five-fold cross-validation. An independent set of tumors subsequently profiled using MSK-IMPACT as part of the same prospective clinical sequencing initiative was used to test the accuracy of the classifier. Demographic characteristics of both cohorts are displayed in Table 2.
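The exclusion rules of this paragraph may be sketched, under stated assumptions, as a simple filter. The record layout (`patient_id`, `cancer_type`, `is_primary`, `vafs`, `cna_log_ratios`) is hypothetical, the cutoffs are passed as parameters, and the default allele frequency cutoff of 0.10 is an assumption of this sketch. Applying the purity filter before choosing one sample per patient is likewise an assumed ordering.

```python
def is_low_purity(sample, vaf_cutoff=0.10, log_ratio_cutoff=0.2):
    """True when every mutation VAF and every CNA log-ratio is below its cutoff."""
    all_vaf_low = all(vaf < vaf_cutoff for vaf in sample["vafs"])
    all_cna_low = all(abs(lr) < log_ratio_cutoff for lr in sample["cna_log_ratios"])
    return all_vaf_low and all_cna_low

def has_no_alterations(sample):
    """True when the sample carries no mutations and no copy number alterations."""
    return not sample["vafs"] and not sample["cna_log_ratios"]

def select_training_samples(samples, allowed_types):
    """Keep one qualifying sample per patient, preferring primary specimens."""
    kept = {}
    for s in samples:
        if s["cancer_type"] not in allowed_types:
            continue
        if has_no_alterations(s) or is_low_purity(s):
            continue
        current = kept.get(s["patient_id"])
        # Prefer a primary specimen over a metastatic one for the same patient.
        if current is None or (s["is_primary"] and not current["is_primary"]):
            kept[s["patient_id"]] = s
    return list(kept.values())

# Invented samples for illustration.
samples = [
    {"patient_id": "P1", "cancer_type": "Glioma", "is_primary": False,
     "vafs": [0.35], "cna_log_ratios": [0.5]},
    {"patient_id": "P1", "cancer_type": "Glioma", "is_primary": True,
     "vafs": [0.22], "cna_log_ratios": []},
    {"patient_id": "P2", "cancer_type": "Melanoma", "is_primary": True,
     "vafs": [0.04, 0.06], "cna_log_ratios": [0.1]},
    {"patient_id": "P3", "cancer_type": "Sarcoma", "is_primary": True,
     "vafs": [0.4], "cna_log_ratios": []},
    {"patient_id": "P4", "cancer_type": "Melanoma", "is_primary": True,
     "vafs": [], "cna_log_ratios": []},
]
training = select_training_samples(samples, {"Glioma", "Melanoma"})
```

Here P2 is dropped as low purity, P3 as an unselected cancer type, and P4 as having no evident alterations, while the primary specimen of P1 replaces its metastatic specimen.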
TABLE-US-00002 TABLE 2 Clinical and technical characteristics of the training and validation cohorts
Characteristic           Statistic  Training Cohort  Validation Cohort
Age at Sequencing        mean       60.3             62.1
Age at Sequencing        median     62               64
Age at Sequencing        SD         14.5             13.7
Tumor Purity             mean       45.5             39.1
Tumor Purity             median     40               40
Tumor Purity             SD         21.3             20.4
Sequence Coverage        mean       718              676
Sequence Coverage        SD         268              199
Mutations                mean       8                8.8
Mutations                median     5                4
Mutations                SD         18.1             22.4
Fraction Genome Altered  mean       0.21             0.19
Fraction Genome Altered  median     0.17             0.13
Fraction Genome Altered  SD         0.19             0.19
TABLE-US-00003 TABLE 3 Sensitivity and specificity of predictions for each tumor type
Cancer Type                      Total Predictions  Accurate Predictions  Sensitivity  Specificity
Non.Small.Cell.Lung.Cancer       1600               1099                  0.782        0.687
Breast.Cancer                    1360               1035                  0.876        0.761
Colorectal.Cancer                892                785                   0.847        0.880
Prostate.Cancer                  550                423                   0.812        0.769
Glioma                           500                440                   0.873        0.880
Bladder.Cancer                   342                274                   0.765        0.801
Pancreatic.Cancer                372                248                   0.719        0.667
Renal.Cell.Cancer                293                217                   0.707        0.741
Melanoma                         267                205                   0.707        0.768
Esophagogastric.Cancer           246                119                   0.431        0.484
Germ.Cell.Tumor                  243                191                   0.799        0.786
Thyroid.Cancer                   189                113                   0.523        0.598
Ovarian.Cancer                   160                73                    0.348        0.456
Endometrial.Cancer               146                99                    0.495        0.678
Cholangiocarcinoma               117                63                    0.364        0.538
Head.and.Neck.Cancer             91                 55                    0.320        0.604
Gastrointestinal.Stromal.Tumor   118                88                    0.727        0.746
Mesothelioma                     85                 51                    0.537        0.600
Small.Cell.Lung.Cancer           62                 48                    0.552        0.774
Pancreatic.Neuroendocrine.Tumor  64                 41                    0.621        0.641
Neuroblastoma                    50                 42                    0.737        0.840
Uveal.Melanoma                   44                 39                    0.951        0.886
TABLE-US-00004 TABLE 4 Prediction accuracy for detailed histological subtypes
Cancer Type  Cancer Type Detailed  Accurate Predictions  Sensitivity
Bladder.Cancer  Bladder Urothelial Carcinoma  223  0.78
Bladder.Cancer  Upper Tract Urothelial Carcinoma  51  0.70
Breast.Cancer  Breast Invasive Ductal Carcinoma  767  0.87
Breast.Cancer  Breast Invasive Lobular Carcinoma  167  0.95
Breast.Cancer  Breast Mixed Ductal and Lobular Carcinoma  46  0.88
Breast.Cancer  Breast Invasive Carcinoma, NOS  23  0.70
Breast.Cancer  Breast Invasive Cancer, NOS  17  0.94
Breast.Cancer  Other  15  0.83
Cholangiocarcinoma  Intrahepatic Cholangiocarcinoma  46  0.46
Cholangiocarcinoma  Cholangiocarcinoma, NOS  14  0.28
Cholangiocarcinoma  Extrahepatic Cholangiocarcinoma  3  0.14
Cholangiocarcinoma  Other  0  0.00
Colorectal.Cancer  Colon Adenocarcinoma  555  0.85
Colorectal.Cancer  Rectal Adenocarcinoma  192  0.89
Colorectal.Cancer  Mucinous Adenocarcinoma of the Colon and Rectum  24  0.69
Colorectal.Cancer  Colorectal Adenocarcinoma  12  0.75
Colorectal.Cancer  Other  2  0.67
Endometrial.Cancer  Uterine Endometrioid Carcinoma  58  0.67
Endometrial.Cancer  Uterine Serous Carcinoma/Uterine Papillary Serous Carcinoma  20  0.45
Endometrial.Cancer  Uterine Carcinosarcoma/Uterine Malignant Mixed Mullerian Tumor  9  0.26
Endometrial.Cancer  Uterine Mixed Endometrial Carcinoma  6  0.35
Endometrial.Cancer  Uterine Clear Cell Carcinoma  3  0.21
Endometrial.Cancer  Other  3  0.60
Esophagogastric.Cancer  Stomach Adenocarcinoma  42  0.34
Esophagogastric.Cancer  Esophageal Adenocarcinoma  55  0.54
Esophagogastric.Cancer  Adenocarcinoma of the Gastroesophageal Junction  20  0.54
Esophagogastric.Cancer  Esophageal Squamous Cell Carcinoma  1  0.11
Esophagogastric.Cancer  Other  1  0.17
Gastrointestinal.Stromal.Tumor  Gastrointestinal Stromal Tumor  88  0.73
Germ.Cell.Tumor  Mixed Germ Cell Tumor  95  0.87
Germ.Cell.Tumor  Seminoma  54  0.81
Germ.Cell.Tumor  Yolk Sac Tumor  8  0.38
Germ.Cell.Tumor  Non-Seminomatous Germ Cell Tumor  14  0.78
Germ.Cell.Tumor  Embryonal Carcinoma  15  0.94
Germ.Cell.Tumor  Other  5  0.63
Glioma  Glioblastoma Multiforme  237  0.89
Glioma  Anaplastic Astrocytoma  65  0.86
Glioma  Anaplastic Oligodendroglioma  39  0.98
Glioma  Oligodendroglioma  34  0.94
Glioma  Astrocytoma  27  0.84
Glioma  Anaplastic Oligoastrocytoma  13  0.93
Glioma  High-Grade Glioma, NOS  7  0.50
Glioma  Other  18  0.69
Head.and.Neck.Cancer  Head and Neck Squamous Cell Carcinoma  13  0.31
Head.and.Neck.Cancer  Oral Cavity Squamous Cell Carcinoma  21  0.55
Head.and.Neck.Cancer  Oropharynx Squamous Cell Carcinoma  12  0.32
Head.and.Neck.Cancer  Larynx Squamous Cell Carcinoma  1  0.08
Head.and.Neck.Cancer  Nasopharyngeal Carcinoma  3  0.25
Head.and.Neck.Cancer  Head and Neck Squamous Cell Carcinoma of Unknown Primary  5  0.17
Melanoma  Cutaneous Melanoma  139  0.79
Melanoma  Melanoma of Unknown Primary  36  0.90
Melanoma  Acral Melanoma  8  0.38
Melanoma  Anorectal Mucosal Melanoma  12  0.60
Melanoma  Mucosal Melanoma of the Vulva/Vagina  4  0.27
Melanoma  Head and Neck Mucosal Melanoma  4  0.36
Melanoma  Other  2  0.29
Mesothelioma  Pleural Mesothelioma, Epithelioid Type  20  0.53
Mesothelioma  Pleural Mesothelioma  22  0.67
Mesothelioma  Peritoneal Mesothelioma  6  0.35
Mesothelioma  Other  3  0.43
Neuroblastoma  Neuroblastoma  42  0.74
Non.Small.Cell.Lung.Cancer  Lung Adenocarcinoma  923  0.81
Non.Small.Cell.Lung.Cancer  Lung Squamous Cell Carcinoma  100  0.68
Non.Small.Cell.Lung.Cancer  Large Cell Neuroendocrine Carcinoma  25  0.71
Non.Small.Cell.Lung.Cancer  Poorly Differentiated Non-Small Cell Lung Cancer  15  0.68
Non.Small.Cell.Lung.Cancer  Non-Small Cell Lung Cancer  11  0.79
Non.Small.Cell.Lung.Cancer  Atypical Lung Carcinoid  3  0.23
Non.Small.Cell.Lung.Cancer  Sarcomatoid Carcinoma of the Lung  7  0.54
Non.Small.Cell.Lung.Cancer  Lung Adenosquamous Carcinoma  7  0.78
Non.Small.Cell.Lung.Cancer  Lung Carcinoid  1  0.13
Non.Small.Cell.Lung.Cancer  Other  7  1.00
Ovarian.Cancer  High-Grade Serous Ovarian Cancer  59  0.47
Ovarian.Cancer  Clear Cell Ovarian Cancer  2  0.09
Ovarian.Cancer  Low-Grade Serous Ovarian Cancer  2  0.10
Ovarian.Cancer  Ovarian Carcinosarcoma/Malignant Mixed Mesodermal Tumor  7  0.64
Ovarian.Cancer  Mucinous Ovarian Cancer  0  0.00
Ovarian.Cancer  Endometrioid Ovarian Cancer  0  0.00
Ovarian.Cancer  Other  3  0.20
Pancreatic.Cancer  Pancreatic Adenocarcinoma  238  0.77
Pancreatic.Cancer  Acinar Cell Carcinoma of the Pancreas  0  0.00
Pancreatic.Cancer  Intraductal Papillary Mucinous Neoplasm  3  0.38
Pancreatic.Cancer  Adenosquamous Carcinoma of the Pancreas  6  0.86
Pancreatic.Cancer  Other  1  0.11
Pancreatic.Neuroendocrine.Tumor  Pancreatic Neuroendocrine Tumor  41  0.62
Prostate.Cancer  Prostate Adenocarcinoma  415  0.82
Prostate.Cancer  Prostate Neuroendocrine Carcinoma  3  0.38
Prostate.Cancer  Other  5  1.00
Renal.Cell.Cancer  Renal Clear Cell Carcinoma  167  0.93
Renal.Cell.Cancer  Unclassified Renal Cell Carcinoma  21  0.46
Renal.Cell.Cancer  Papillary Renal Cell Carcinoma  13  0.46
Renal.Cell.Cancer  Chromophobe Renal Cell Carcinoma  13  0.54
Renal.Cell.Cancer  Translocation-Associated Renal Cell Carcinoma  1  0.11
Renal.Cell.Cancer  Other  2  0.10
Small.Cell.Lung.Cancer  Small Cell Lung Cancer  48  0.59
Small.Cell.Lung.Cancer  Lung Neuroendocrine Tumor  0  0.00
Thyroid.Cancer  Papillary Thyroid Cancer  59  0.74
Thyroid.Cancer  Poorly Differentiated Thyroid Cancer  28  0.48
Thyroid.Cancer  Anaplastic Thyroid Cancer  14  0.44
Thyroid.Cancer  Hurthle Cell Thyroid Cancer  7  0.30
Thyroid.Cancer  Medullary Thyroid Cancer  0  0.00
Thyroid.Cancer  Follicular Thyroid Cancer  5  1.00
Thyroid.Cancer  Other  0  0.00
Uveal.Melanoma  Uveal Melanoma  39  0.95
Derivation of Features
[0087] The molecular feature set was based on 341 oncogenes and tumor suppressor genes common to all MSK-IMPACT panel versions. This panel covers all exons of each gene, together with selected intronic regions to capture known structural variants, the TERT promoter, and additional “tiling” SNPs to improve copy number calling. The features were derived from the following genomic alteration classes.
[0088] Somatic mutations. Mutations were annotated with Ensembl VEP. For each gene in the panel, the training set contained a binary feature corresponding to the presence or absence of a non-synonymous missense mutation and a binary feature corresponding to the presence or absence of a truncating mutation in the gene. The mutation status of known hotspot mutations and the status of the 30 distinct mutational signatures were also included as binary features. Mutational signatures were derived for each sample with at least ten synonymous or nonsynonymous somatic mutations and those signatures representing more than 40% of mutations were considered as present. The total number of nonsynonymous mutations per sample was included as a numeric feature.
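A minimal sketch of how such mutation-derived features might be assembled is given below. The input record layout and the function name are hypothetical; feature names loosely follow the naming visible in Table 5 (e.g., `GENE_TRUNC`, `GENE_hotspot`, `Sig_UV`). Whether a hotspot missense mutation also sets the gene-level missense flag is not stated in the text and is assumed here.

```python
def mutation_features(mutations, signature_fractions, hotspot_alleles, panel_genes):
    """Build binary and numeric mutation features for one sample."""
    features = {}
    for gene in panel_genes:
        features[gene] = 0                 # missense mutation present
        features[f"{gene}_TRUNC"] = 0      # truncating mutation present
        features[f"{gene}_hotspot"] = 0    # any known hotspot in the gene
    for gene, change in hotspot_alleles:
        features[f"{gene}.{change}"] = 0   # specific hotspot allele
    for m in mutations:
        gene, kind = m["gene"], m["kind"]
        if gene not in panel_genes:
            continue
        if kind == "missense":
            features[gene] = 1
        elif kind == "truncating":
            features[f"{gene}_TRUNC"] = 1
        if (gene, m["protein_change"]) in hotspot_alleles:
            features[f"{gene}_hotspot"] = 1
            features[f"{gene}.{m['protein_change']}"] = 1
    # Signatures are called only with at least ten mutations, and are
    # considered present when they explain more than 40% of mutations.
    enough = len(mutations) >= 10
    for name, fraction in signature_fractions.items():
        features[f"Sig_{name}"] = int(enough and fraction > 0.40)
    features["nonsynonymous_count"] = sum(m["kind"] != "synonymous" for m in mutations)
    return features

# Invented inputs for illustration.
panel_genes = {"BRAF", "APC"}
hotspot_alleles = {("BRAF", "V600E")}
mutations = [
    {"gene": "BRAF", "protein_change": "V600E", "kind": "missense"},
    {"gene": "APC", "protein_change": "R1450*", "kind": "truncating"},
    {"gene": "APC", "protein_change": "L10=", "kind": "synonymous"},
]
example_features = mutation_features(mutations, {"UV": 0.7}, hotspot_alleles, panel_genes)
```

With only three mutations the sample falls below the ten-mutation minimum, so `Sig_UV` is 0 in this example even though the hypothetical UV fraction exceeds 40%.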
[0089] Copy number alterations. The presence or absence of genomic gains and losses of each chromosome arm was identified from MSK-IMPACT data. Chromosome arms, defined by their genomic coordinates in the GRCh37/hg19 human genome assembly, were considered gained or lost if a majority of the arm (>50%) was affected by segments with a log-ratio beyond ±0.2. The presence or absence of focal amplifications and deep deletions (presumed homozygous deletions) for each of the 341 genes in the panel were also included as features. In addition, a numeric feature may be included representing the overall DNA copy number alteration burden, defined as the percentage of the autosomal genome that was affected by copy number alterations (gains or losses) inferred from the segmented log-ratio data.
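The arm-level rule and the burden feature can be illustrated with the following sketch. Segments are assumed to be non-overlapping `(start, end, log_ratio)` tuples on a single chromosome, coordinates are toy values, and treating a log-ratio exactly at the cutoff as altered is an assumption of this example.

```python
def arm_status(segments, arm_start, arm_end, cutoff=0.2):
    """Return 'gain', 'loss', or 'neutral' for one chromosome arm."""
    arm_length = arm_end - arm_start
    gained = lost = 0
    for start, end, log_ratio in segments:
        # Length of the segment that falls inside the arm.
        overlap = max(0, min(end, arm_end) - max(start, arm_start))
        if log_ratio >= cutoff:
            gained += overlap
        elif log_ratio <= -cutoff:
            lost += overlap
    if gained > 0.5 * arm_length:
        return "gain"
    if lost > 0.5 * arm_length:
        return "loss"
    return "neutral"

def copy_number_burden(segments, genome_length, cutoff=0.2):
    """Percentage of the genome covered by gained or lost segments."""
    altered = sum(end - start for start, end, lr in segments if abs(lr) >= cutoff)
    return 100.0 * altered / genome_length

# Toy arm of length 100 with 60 units gained.
segments = [(0, 60, 0.4), (60, 100, 0.0)]
status = arm_status(segments, 0, 100)
burden = copy_number_burden(segments, 100)
```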
[0090] Structural variants. The MSK-IMPACT panel includes several intronic regions designed to detect structural variants in genes that are commonly rearranged in cancer. Features were included for the presence or absence of selected structural variants detected by MSK-IMPACT (Table 5).
TABLE-US-00005 TABLE 5 Individual molecular features selected by the classifier Feature Category Feature Category AKT2_Amp Amp Del_7q Loss ALK_Amp Amp Del_8p Loss AMER1_Amp Amp Del_8q Loss AR_Amp Amp Del_9p Loss ASXL1_Amp Amp Del_9q Loss AURKA_Amp Amp Del_Xp Loss AXIN2_Amp Amp Del_Xq Loss BBC3_Amp Amp CN_Burden Other BCL2L1_Amp Amp Gender_F Other BCL6_Amp Amp LogINDEL_Mb Other BRCA1_Amp Amp LogSNV_Mb Other BRIP1_Amp Amp TERTp Promoter CARD11_Amp Amp Sig_APOBEC Signature CCND1_Amp Amp Sig_MMR Signature CCND2_Amp Amp Sig_UV Signature CCND3_Amp Amp EGFR_SV SV CCNE1_Amp Amp TMPRSS2_ERG_fusion SV CD274_Amp Amp TMRPSS2_ETV1_fusion SV CD79B_Amp Amp APC_TRUNC Truncation CDK12_Amp Amp ALK_TRUNC Truncation CDK4_Amp Amp AMER1_TRUNC Truncation CDK6_Amp Amp AR_TRUNC Truncation CDK8_Amp Amp ARID1A_TRUNC Truncation CDKN1B_Amp Amp ARID1B_TRUNC Truncation CRKL_Amp Amp ARID2_TRUNC Truncation DAXX_Amp Amp ASXL1_TRUNC Truncation DCUN1D1_Amp Amp ASXL2_TRUNC Truncation DDR2_Amp Amp ATM_TRUNC Truncation DIS3_Amp Amp ATRX_TRUNC Truncation DNMT3B_Amp Amp AXL_TRUNC Truncation E2F3_Amp Amp BAP1_TRUNC Truncation EGFR_Amp Amp BBC3_TRUNC Truncation ERBB2_Amp Amp BCOR_TRUNC Truncation ERBB3_Amp Amp BRCA2_TRUNC Truncation ERCC5_Amp Amp CARD11_TRUNC Truncation ERG_Amp Amp CASP8_TRUNC Truncation ETV1_Amp Amp CDH1_TRUNC Truncation ETV6_Amp Amp CDK12_TRUNC Truncation FAM46C_Amp Amp CDKN1A_TRUNC Truncation FGF19_Amp Amp CDKN2A_TRUNC Truncation FGF3_Amp Amp CIC_TRUNC Truncation FGF4_Amp Amp CREBBP_TRUNC Truncation FGFR1_Amp Amp CTCF_TRUNC Truncation FH_Amp Amp DAXX_TRUNC Truncation FLT1_Amp Amp EIF1AX_TRUNC Truncation FLT3_Amp Amp EP300_TRUNC Truncation FOXA1_Amp Amp EPHA3_TRUNC Truncation GNAS_Amp Amp FAT1_TRUNC Truncation H3F3C_Amp Amp FBXW7_TRUNC Truncation HIST1H1C_Amp Amp FLT1_TRUNC Truncation HIST1H2BD_Amp Amp FOXA1_TRUNC Truncation HIST1H3B_Amp Amp FUBP1_TRUNC Truncation IKBKE_Amp Amp GATA3_TRUNC Truncation IL10_Amp Amp GRIN2A_TRUNC Truncation IL7R_Amp Amp JAK1_TRUNC Truncation IRF4_Amp Amp 
KDM5A_TRUNC Truncation IRS1_Amp Amp KDM5C_TRUNC Truncation IRS2_Amp Amp KDM6A_TRUNC Truncation JAK2_Amp Amp KEAP1_TRUNC Truncation KDM5A_Amp Amp KIT_TRUNC Truncation KDM6A_Amp Amp LATS1_TRUNC Truncation KDR_Amp Amp MAP2K4_TRUNC Truncation KIT_Amp Amp MAP3K1_TRUNC Truncation KRAS_Amp Amp MCL1_TRUNC Truncation MCL1_Amp Amp MED_12_TRUNC Truncation MDC1_Amp Amp MEN1_TRUNC Truncation MDM2_Amp Amp MET_TRUNC Truncation MDM4_Amp Amp NCOR1_TRUNC Truncation MET_Amp Amp NF1_TRUNC Truncation MITF_Amp Amp NF2_TRUNC Truncation MPL_Amp Amp NOTCH1_TRUNC Truncation MYC_Amp Amp NSD1_TRUNC Truncation MYCL_Amp Amp PBRM1_TRUNC Truncation MYCN_Amp Amp PIK3R1_TRUNC Truncation NBN_Amp Amp PTCH1_TRUNC Truncation NKX2.1_Amp Amp PTEN_TRUNC Truncation NOTCH2_Amp Amp PTPRT_TRUNC Truncation NTRK1_Amp Amp RASA1_TRUNC Truncation PAK1_Amp Amp RB1_TRUNC Truncation PDGFRA_Amp Amp RBM10_TRUNC Truncation PIK3C2G_Amp Amp RECQL4_TRUNC Truncation PIK3CA_Amp Amp RNF43_TRUNC Truncation PIK3R2_Amp Amp SETD2_TRUNC Truncation PMS2_Amp Amp SF3B1_TRUNC Truncation PRKAR1A_Amp Amp SMAD4_TRUNC Truncation PTPRD_Amp Amp SMARCA4_TRUNC Truncation RAC1_Amp Amp SMARCB1_TRUNC Truncation RAD51C_Amp Amp SOX9_TRUNC Truncation RAD52_Amp Amp SPEN_TRUNC Truncation RAFI_Amp Amp STAG2_TRUNC Truncation RARA_Amp Amp STK11_TRUNC Truncation RECQL4_Amp Amp TBX3_TRUNC Truncation RET_Amp Amp TET2_TRUNC Truncation RICTOR_Amp Amp TGFBR2_TRUNC Truncation RIT1_Amp Amp TP53_TRUNC Truncation RNF43_Amp Amp TSC1_TRUNC Truncation RPS6KB2_Amp Amp TSC2_TRUNC Truncation RPTOR_Amp Amp VHL_TRUNC Truncation RUNX1_Amp Amp AMER1 VUS SDHA_Amp Amp ABL1 VUS SDHC_Amp Amp AKT1 VUS SOX17_Amp Amp AKT3 VUS SOX2_Amp Amp ALK VUS SOX9_Amp Amp ALOX12B VUS SPOP_Amp Amp APC VUS SRC_Amp Amp AR VUS TBX3_Amp Amp ARAF VUS TERT_Amp Amp ARID1A VUS TET2_Amp Amp ARID1B VUS TMPRSS2_Amp Amp ARID2 VUS TP63_Amp Amp ARID5B VUS YAP1_Amp Amp ASXL1 VUS Amp_10p Gain ASXL2 VUS Amp_10q Gain ATM VUS Amp_11p Gain ATR VUS Amp_11q Gain ATRX VUS Amp_12p Gain AURKA VUS Amp_12q Gain AXIN1 
VUS Amp_13q Gain AXIN2 VUS Amp_14q Gain AXL VUS Amp_15q Gain BAP1 VUS Amp_16p Gain BARD1 VUS Amp_16q Gain BBC3 VUS Amp_17p Gain BCOR VUS Amp_17q Gain BLM VUS Amp_18p Gain BMPR1A VUS Amp_18q Gain BRAF VUS Amp_19p Gain BRCA1 VUS Amp_19q Gain BRCA2 VUS Amp_1p Gain BRD4 VUS Amp_1q Gain BTK VUS Amp_20p Gain CARD11 VUS Amp_20q Gain CASP8 VUS Amp_21q Gain CBFB VUS Amp_22q Gain CBL VUS Amp_2p Gain CCND1 VUS Amp_2q Gain CD79B VUS Amp_3p Gain CDH1 VUS Amp_3q Gain CDK12 VUS Amp_4p Gain CDK8 VUS Amp_4q Gain CDKN1A VUS Amp_5p Gain CDKN1B VUS Amp_5q Gain CDKN2A VUS Amp_6p Gain CHEK2 VUS Amp_6q Gain CIC VUS Amp_7p Gain CREBBP VUS Amp_7q Gain CSF1R VUS Amp_8p Gain CTCF VUS Amp_8q Gain CTNNB1 VUS Amp_9p Gain CUL3 VUS Amp_9q Gain DAXX VUS Amp_Xp Gain DDR2 VUS Amp_Xq Gain DICER1 VUS ARID1A_HomDel Homdel DIS3 VUS ARID5B_HomDel Homdel DNMT1 VUS B2M_HomDel Homdel DNMT3A VUS BAP1_HomDel Homdel DNMT3B VUS BCOR_HomDel Homdel DOT1L VUS BRCA2_HomDel Homdel EGFR VUS CARD11_HomDel Homdel EIF1AX VUS CDKN1B_HomDel Homdel EP300 VUS CDKN2A_HomDel Homdel EPHA3 VUS CDKN2B_HomDel Homdel EPHA5 VUS CRLF2_HomDel Homdel EPHB1 VUS FAT1_HomDel Homdel ERBB2 VUS FLT4_HomDel Homdel ERBB3 VUS FOXL2_HomDel Homdel ERBB4 VUS GATA3_HomDel Homdel ERCC2 VUS JUN_HomDel Homdel ERCC4 VUS NF1_HomDel Homdel ERCC5 VUS PAK1_HomDel Homdel ERG VUS PIK3CD_HomDel Homdel ESR1 VUS PTEN_HomDel Homdel ETV1 VUS PTPRD_HomDel Homdel ETV6 VUS RAD51_HomDel Homdel EZH2 VUS RASA1_HomDel Homdel FAM46C VUS RB1_HomDel Homdel FANCA VUS RET_HomDel Homdel FAT1 VUS SMAD4_HomDel Homdel FBXW7 VUS SUZ12_HomDel Homdel FGF4 VUS TGFBR2_HomDel Homdel FGFR1 VUS TNFRSF14_HomDel Homdel FGFR2 VUS AKT1_hotspot Hotspot FGFR3 VUS ALK_hotspot Hotspot FGFR4 VUS APC_hotspot Hotspot FH VUS AR_hotspot Hotspot FLCN VUS ARID1A_hotspot Hotspot FLT1 VUS BAP1_hotspot Hotspot FLT3 VUS BCOR_hotspot Hotspot FLT4 VUS BRAF_hotspot Hotspot FOXA1 VUS CARD11_hotspot Hotspot FOXL2 VUS CDKN2A_hotspot Hotspot FOXP1 VUS CIC_hotspot Hotspot FUBP1 VUS CTNNB1_hotspot Hotspot GATA1 
VUS EGFR_hotspot Hotspot GATA2 VUS EIF1AX_hotspot Hotspot GATA3 VUS EP300_hotspot Hotspot GNA11 VUS ERBB2_hotspot Hotspot GNAQ VUS ERBB3_hotspot Hotspot GNAS VUS ERCC2_hotspot Hotspot GRIN2A VUS ESR1_hotspot Hotspot GSK3B VUS FBXW7_hotspot Hotspot HGF VUS FGFR2_hotspot Hotspot HNF1A VUS FGFR3_hotspot Hotspot HRAS VUS FOXA1_hotspot Hotspot IDH1 VUS GNA11_hotspot Hotspot IDH2 VUS GNAQ_hotspot Hotspot IFNGR1 VUS GNAS_hotspot Hotspot IGF1R VUS HRAS_hotspot Hotspot IKBKE VUS IDH1_hotspot Hotspot IKZF1 VUS IDH2_hotspot Hotspot IL7R VUS KDM6A_hotspot Hotspot INPP4A VUS KIT_hotspot Hotspot INPP4B VUS KRAS_hotspot Hotspot INSR VUS MAP2K1_hotspot Hotspot IRF4 VUS MTOR_hotspot Hotspot IRS1 VUS NFE2L2_hotspot Hotspot IRS2 VUS NOTCH1_hotspot Hotspot JAK1 VUS NRAS_hotspot Hotspot JAK2 VUS PDGFRA_hotspot Hotspot JAK3 VUS PIK3CA_hotspot Hotspot KDM5A VUS PIK3R1_hotspot Hotspot KDM5C VUS PPP2R1A_hotspot Hotspot KDM6A VUS PTEN_hotspot Hotspot KDR VUS PTPN11_hotspot Hotspot KEAP1 VUS RAC1_hotspot Hotspot KIT VUS RB1_hotspot Hotspot KLF4 VUS RET_hotspot Hotspot KRAS VUS RHOA_hotspot Hotspot LATS1 VUS SF3B1_hotspot Hotspot LATS2 VUS SMAD4_hotspot Hotspot MAP2K1 VUS SMARCA4_hotspot Hotspot MAP2K4 VUS SPOP_hotspot Hotspot MAP3K1 VUS STK11_hotspot Hotspot MAP3K13 VUS TP53_hotspot Hotspot MAPK1 VUS TRAF7_hotspot Hotspot MAX VUS VHL_hotspot Hotspot MDC1 VUS AKT1.E17K Hotspot Allele MED12 VUS ALK.F1174L Hotspot Allele MEF2B VUS ALK.F1245V Hotspot Allele MEN1 VUS ALK.R1275Q Hotspot Allele MET VUS APC.R1450. Hotspot Allele MITF VUS APC.R216. Hotspot Allele MLH1 VUS APC.R876. Hotspot Allele MPL VUS BAP1.K25_D34delinsN Hotspot Allele MRE11A VUS BCOR.N1459S Hotspot Allele MSH2 VUS BRAF.V600E Hotspot Allele MSH6 VUS BRAF.V600K Hotspot Allele MTOR VUS CARD11.R337. Hotspot Allele MYCN VUS CDKN2A.H83Y Hotspot Allele NBN VUS CDKN2A.R80. 
Hotspot Allele NCOR1 VUS CTNNB1.D32Y Hotspot Allele NF1 VUS CTNNB1.S37F Hotspot Allele NF2 VUS CTNNB1.S45F Hotspot Allele NFE2L2 VUS EGFR.E746_A750del Hotspot Allele NOTCH1 VUS EGFR.L858R Hotspot Allele NOTCH2 VUS EGFR.T790M Hotspot Allele NOTCH3 VUS EIF1AX.X113_splice Hotspot Allele NOTCH4 VUS EIF1AX.X6_splice Hotspot Allele NRAS VUS EP300.H1451Q Hotspot Allele NSD1 VUS ERBB2.S310F Hotspot Allele NTRK1 VUS ESR1.D538G Hotspot Allele NTRK2 VUS FBXW7.R479Q Hotspot Allele NTRK3 VUS FGFR3.R248C Hotspot Allele PAK1 VUS FGFR3.S249C Hotspot Allele PAK7 VUS FGFR3 Y373C Hotspot Allele PALB2 VUS GNA11.Q209L Hotspot Allele PARK2 VUS GNAQ.Q209L Hotspot Allele PARP1 VUS GNAQ.Q209P Hotspot Allele PAX5 VUS GNAQ.R183Q Hotspot Allele PBRM1 VUS IDH1.R132C Hotspot Allele PDGFRA VUS IDH1.R132H Hotspot Allele PDGFRB VUS IDH1.R132L Hotspot Allele PHOX2B VUS KIT.A502_Y503dup Hotspot Allele PIK3C2G VUS KIT.L576P Hotspot Allele PIK3C3 VUS KIT.V559D Hotspot Allele PIK3CA VUS KIT.V654A Hotspot Allele PIK3CB VUS KIT.W557_K558del Hotspot Allele PIK3CD VUS KRAS.G12A Hotspot Allele PIK3CG VUS KRAS.G12C Hotspot Allele PIK3R1 VUS KRAS.G12D Hotspot Allele PIK3R2 VUS KRAS.G12R Hotspot Allele PLK2 VUS KRAS.G12V Hotspot Allele PMS1 VUS KRAS.G13D Hotspot Allele PMS2 VUS KRAS.Q61H Hotspot Allele POLE VUS MYCN.P44L Hotspot Allele PPP2R1A VUS NRAS.Q61K Hotspot Allele PRDM1 VUS NRAS.Q61R Hotspot Allele PTCH1 VUS PDGFRA.D842V Hotspot Allele PTEN VUS PIK3CA.E542K Hotspot Allele PTPN11 VUS PIK3CA.E545K Hotspot Allele PTPRD VUS PIK3CA.H1047R Hotspot Allele PTPRS VUS PIK3CA.M1043I Hotspot Allele PTPRT VUS PPP2R1A.P179R Hotspot Allele RAC1 VUS PPP2R1A.S256F Hotspot Allele RAD50 VUS PTEN.R130G Hotspot Allele RAD52 VUS SF3BER625C Hotspot Allele RAF1 VUS SF3BER625H Hotspot Allele RARA VUS SPOP.F133L Hotspot Allele RASA1 VUS TP53.G245S Hotspot Allele RB1 VUS TP53.H179Y Hotspot Allele RBM10 VUS TP53.R158L Hotspot Allele RECQL4 VUS TP53.R175H Hotspot Allele REL VUS TP53.R213. 
Hotspot Allele RET VUS TP53.R248Q Hotspot Allele RHOA VUS TP53.R248W Hotspot Allele RICTOR VUS TP53.R273C Hotspot Allele RNF43 VUS TP53.R273H Hotspot Allele ROS1 VUS TP53.R282W Hotspot Allele RPS6KA4 VUS TP53.R342. Hotspot Allele RPS6KB2 VUS TP53.V157F Hotspot Allele RPTOR VUS TP53.X225_splice Hotspot Allele RUNX1 VUS TP53.Y220C Hotspot Allele RYBP VUS TP53.Y234C Hotspot Allele SDHA VUS TRAF7.N520S Hotspot Allele SETD2 VUS U2AF1.S34F Hotspot Allele SF3B1 VUS VHL.X114_splice Hotspot Allele SMAD2 VUS Del_10p Loss SMAD3 VUS Del_10q Loss SMAD4 VUS Del_11p Loss SMARCA4 VUS Del_11q Loss SMARCB1 VUS Del_12p Loss SMARCD1 VUS Del_12q Loss SMO VUS Del_13q Loss SOX_17 VUS Del_14q Loss SOX2 VUS Del_15q Loss SOX9 VUS Del_16p Loss SPEN VUS Del_16q Loss SPOP VUS Del_17p Loss STAG2 VUS Del_17q Loss STK11 VUS Del_18p Loss SUFU VUS Del_18q Loss SYK VUS Del_19p Loss TBX3 VUS Del_19q Loss TERT VUS Del_1p Loss TET1 VUS Del_1q Loss TET2 VUS Del_20p Loss TGFBR1 VUS Del_20q Loss TGFBR2 VUS Del_21q Loss TMPRSS2 VUS Del_22q Loss TNFAIP3 VUS Del_2p Loss TOP1 VUS Del_2q Loss TP53 VUS Del_3p Loss TP63 VUS Del_3q Loss TRAF7 VUS Del_4p Loss TSC1 VUS Del_4q Loss TSC2 VUS Del_5p Loss TSHR VUS Del_5q Loss U2AF1 VUS Del_6p Loss VHL VUS Del_6q Loss XPO1 VUS Del_7p Loss
[0091] Clinical information. The sex of the patient is included as a binary feature. While the age at screening has been linked to the incidence of some cancer types, it was excluded from the feature set due to the ambiguity that arises for patients with multiple independent cancer classifications or for those with earlier ages of classification associated with germline pathogenic alterations.
Classification
[0092] A multi-class classifier was built using the random forest algorithm. The random forest ensemble learning method may be suited for this complex classification problem due to its ability to better accommodate large numbers of potentially informative features, arbitrary combinations of features, and the imbalanced class representation of the cohort (i.e., wide range in the prevalence of individual cancer types) as compared to alternative approaches. Moreover, random forest classifiers quantify the relative importance of each variable, enabling the classifier to provide valuable context for clinical interpretations. The imbalanced representation was resolved by equal stratified sampling of tumor types during learning. Specifically, the portion of data used to build each tree included an equal number of samples drawn from each cancer type equal to 80% of the size of the smallest class. This sampling exacerbates the tendency of ensemble classification algorithms, including random forests, to return ambivalent confidence scores even in cases of high certainty. For the primary performance metric, Cohen's kappa, which takes into account the degree of agreement expected by chance between the output and the reference labels, may be used.
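The balanced per-tree sampling and the kappa metric described above can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in the disclosure: the function names `balanced_sample` and `cohen_kappa` are hypothetical, and the random forest itself is omitted. The sketch only shows how an equal number of samples per cancer type (80% of the smallest class) might be drawn for one tree, and how chance-corrected agreement could be computed.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(labels, fraction=0.8, rng=None):
    """Return sample indices with an equal number drawn from each class.

    The per-class count is `fraction` times the size of the smallest class
    (rounded down), mirroring the 80% rule described in the text.
    """
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    smallest = min(len(indices) for indices in by_class.values())
    n_per_class = max(1, int(fraction * smallest))
    chosen = []
    for indices in by_class.values():
        # Sample without replacement inside each class.
        chosen.extend(rng.sample(indices, n_per_class))
    return chosen


def cohen_kappa(reference, predicted):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(reference)
    observed = sum(r == p for r, p in zip(reference, predicted)) / n
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    expected = sum(ref_counts[c] * pred_counts.get(c, 0) for c in ref_counts) / (n * n)
    if expected == 1.0:
        # Degenerate case: both label sets contain a single identical class.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

In a full pipeline, `balanced_sample` would be called once per tree so that each tree sees a different balanced subset of the cohort.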
Calibration
[0093] The raw classifier scores may be adjusted to match the classification probability using Platt scaling, a multinomial regression. Classification scores from ensemble machine learning methods such as random forest trees often do not approach the extremes of 0 or 1, resulting in a sigmoid shaped distribution relative to the probability. This mismatch between classifier score and probability tends to be exacerbated by stratified sampling of classes. The results of the random forest classifier were calibrated to approximate the empirical accuracy of predictions, using multinomial logistic regression with an elastic-net penalty using the glmnet package in R. Naive calibration tends to lead to a large loss of sensitivity for less common and less distinctive tumor types, especially those that share features with a common tumor type. This effect may be mitigated with slight down-sampling of more common tumor types to maximize the mean balanced accuracy across cancer types. Twenty repeats of five-fold cross-validation were used to determine the robustness of classifier predictions. The agreement between calibrated probability and prediction accuracy is shown in
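To illustrate how an ambivalent raw score can be mapped to an empirical probability, the following sketch fits a simplified, binary (one-class-versus-rest) Platt scaling by Newton's method. It is a stand-in for illustration only: it is not the elastic-net multinomial regression fitted with glmnet that the text describes, and the function names and toy data are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fit_platt(scores, outcomes, iterations=25):
    """Fit p = sigmoid(a * score + b) by Newton's method on the log loss.

    `outcomes` holds 1 where the prediction was correct and 0 otherwise.
    """
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for s, y in zip(scores, outcomes):
            p = sigmoid(a * s + b)
            w = p * (1.0 - p)
            g_a += (p - y) * s      # gradient terms
            g_b += (p - y)
            h_aa += w * s * s       # Hessian terms
            h_ab += w * s
            h_bb += w
        det = h_aa * h_bb - h_ab * h_ab
        if det <= 1e-12:
            break  # Hessian is singular; stop rather than divide by ~0
        # Newton step: subtract H^-1 * g using the closed-form 2x2 inverse.
        a -= (h_bb * g_a - h_ab * g_b) / det
        b -= (h_aa * g_b - h_ab * g_a) / det
    return a, b


def calibrate(score, a, b):
    """Map a raw classifier score to a calibrated probability."""
    return sigmoid(a * score + b)


# Hypothetical toy data: raw scores cluster near 0.4 and 0.6, yet predictions
# scored 0.6 are correct 90% of the time and those scored 0.4 only 10%.
toy_scores = [0.4] * 10 + [0.6] * 10
toy_outcomes = [1] + [0] * 9 + [1] * 9 + [0]
a, b = fit_platt(toy_scores, toy_outcomes)
```

On this toy data the fitted curve maps a raw score of 0.6 to a probability near 0.9, which is the kind of stretching toward 0 and 1 that calibration is meant to provide.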
Circulating DNA
[0094] The classifier was applied to predict cancer type for two separate groups of patients with circulating tumor DNA (cfDNA) sequencing data. First, 19 patients with prostate, bladder, and testicular cancer were selected from a larger cohort with MSK-IMPACT sequencing of cfDNA based on the detection of mutations with a median variant allele fraction greater than 0.10. None of these patients were included in the classifier training set. Second, cancer types were predicted using ctDNA whole exome sequencing results.
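The selection criterion for the first group (median variant allele fraction greater than 0.10) might be expressed as the following sketch. The input layout, a mapping from a sample identifier to the list of variant allele fractions of its detected mutations, and the function name `select_high_vaf` are assumptions made for illustration.

```python
from statistics import median

VAF_THRESHOLD = 0.10  # threshold stated in the text


def select_high_vaf(samples, threshold=VAF_THRESHOLD):
    """Return ids of samples whose median variant allele fraction exceeds the threshold.

    `samples` maps a sample id to a list of VAFs for its detected mutations.
    Samples with no detected mutations are skipped.
    """
    selected = []
    for sample_id, vafs in samples.items():
        if vafs and median(vafs) > threshold:
            selected.append(sample_id)
    return selected
```

For example, a sample with fractions 0.05, 0.20, and 0.30 has a median of 0.20 and would be retained, while a sample with fractions 0.02 and 0.04 would not.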
[0095] An example data structure of a potential training dataset to train a classifier according to certain embodiments may include, for example, fields such as CANCER_TYPE, CANCER_TYPE_DETAILED, SAMPLE_TYPE, PRIMARY_SITE, METASTATIC_SITE, Cancer_Type, Classification_Category, Gender_F, LogSNV_Mb, and LogINDEL_Mb. Example values corresponding to the fields may comprise, for example: AKT1, AKT2, AKT3, ALK, ALOX12B, AMER1, APC, AR, ARAF, and ARID1A.
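One way such a training record could be laid out is sketched below. The field names come from the paragraph above; every value is a made-up placeholder. The split between descriptive metadata, label, and model features is also an assumption: here `Cancer_Type` is treated as the training label, and `Gender_F`, `LogSNV_Mb`, `LogINDEL_Mb`, and the per-gene columns are treated as features.

```python
# Assumed metadata columns (not fed to the model in this sketch).
METADATA_FIELDS = (
    "CANCER_TYPE",
    "CANCER_TYPE_DETAILED",
    "SAMPLE_TYPE",
    "PRIMARY_SITE",
    "METASTATIC_SITE",
    "Classification_Category",
)
LABEL_FIELD = "Cancer_Type"  # assumed to hold the training label

# Placeholder record; none of these values come from real data.
example_record = {
    "CANCER_TYPE": "<cancer type>",
    "CANCER_TYPE_DETAILED": "<detailed cancer type>",
    "SAMPLE_TYPE": "<primary or metastasis>",
    "PRIMARY_SITE": "<primary site>",
    "METASTATIC_SITE": "<metastatic site>",
    "Cancer_Type": "<tumor type label>",
    "Classification_Category": "<category>",
    "Gender_F": 1,        # binary sex feature
    "LogSNV_Mb": 0.0,     # log mutation rate (SNVs per Mb), placeholder
    "LogINDEL_Mb": 0.0,   # log indel rate per Mb, placeholder
    "AKT1": 0,            # per-gene alteration indicators, placeholders
    "ALK": 1,
    "APC": 0,
}


def split_record(record):
    """Separate the label from the numeric features, dropping metadata."""
    label = record[LABEL_FIELD]
    features = {
        name: float(value)
        for name, value in record.items()
        if name != LABEL_FIELD and name not in METADATA_FIELDS
    }
    return features, label
```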
[0096] An example data structure of a potential patient sample dataset that may be input to a model to obtain a prediction may, according to certain embodiments, be represented by the following (in JavaScript Object Notation (JSON) format):
B. Computing and Network Environment
[0097] Various operations described herein can be implemented on computer systems, which can be of generally conventional design.
[0098] Server system 1100 can have a modular design that incorporates a number of modules 1102 (e.g., blades in a blade server embodiment); while two modules 1102 are shown, any number can be provided. Each module 1102 can include processing unit(s) 1104 and local storage 1106.
[0099] Processing unit(s) 1104 can include a single processor, which can have one or more cores, or multiple processors. In some embodiments, processing unit(s) 1104 can include a general-purpose primary processor as well as one or more special-purpose co-processors such as graphics processors, digital signal processors, or the like. In some embodiments, some or all processing units 1104 can be implemented using customized circuits, such as application specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs). In some embodiments, such integrated circuits execute instructions that are stored on the circuit itself. In other embodiments, processing unit(s) 1104 can execute instructions stored in local storage 1106. Any type of processors in any combination can be included in processing unit(s) 1104.
[0100] Local storage 1106 can include volatile storage media (e.g., DRAM, SRAM, SDRAM, or the like) and/or non-volatile storage media (e.g., magnetic or optical disk, flash memory, or the like). Storage media incorporated in local storage 1106 can be fixed, removable or upgradeable as desired. Local storage 1106 can be physically or logically divided into various subunits such as a system memory, a read-only memory (ROM), and a permanent storage device. The system memory can be a read-and-write memory device or a volatile read-and-write memory, such as dynamic random-access memory. The system memory can store some or all of the instructions and data that processing unit(s) 1104 need at runtime. The ROM can store static data and instructions that are needed by processing unit(s) 1104. The permanent storage device can be a non-volatile read-and-write memory device that can store instructions and data even when module 1102 is powered down. The term “storage medium” as used herein includes any medium in which data can be stored indefinitely (subject to overwriting, electrical disturbance, power loss, or the like) and does not include carrier waves and transitory electronic signals propagating wirelessly or over wired connections.
[0101] In some embodiments, local storage 1106 can store one or more software programs to be executed by processing unit(s) 1104, such as an operating system and/or programs implementing various server functions such as functions of the system 100 (e.g., the classification system 102 and the sequencer 104) in
[0102] “Software” refers generally to sequences of instructions that, when executed by processing unit(s) 1104, cause server system 1100 (or portions thereof) to perform various operations, thus defining one or more specific machine embodiments that execute and perform the operations of the software programs. The instructions can be stored as firmware residing in read-only memory and/or program code stored in non-volatile storage media that can be read into volatile working memory for execution by processing unit(s) 1104. Software can be implemented as a single program or a collection of separate programs or program modules that interact as desired. From local storage 1106 (or non-local storage described below), processing unit(s) 1104 can retrieve program instructions to execute and data to process in order to execute various operations described above.
[0103] In some server systems 1100, multiple modules 1102 can be interconnected via a bus or other interconnect 1108, forming a local area network that supports communication between modules 1102 and other components of server system 1100. Interconnect 1108 can be implemented using various technologies including server racks, hubs, routers, etc.
[0104] A wide area network (WAN) interface 1110 can provide data communication capability between the local area network (interconnect 1108) and the network 1126, such as the Internet. Various technologies can be used, including wired (e.g., Ethernet, IEEE 802.3 standards) and/or wireless technologies (e.g., Wi-Fi, IEEE 802.11 standards).
[0105] In some embodiments, local storage 1106 is intended to provide working memory for processing unit(s) 1104, providing fast access to programs and/or data to be processed while reducing traffic on interconnect 1108. Storage for larger quantities of data can be provided on the local area network by one or more mass storage subsystems 1112 that can be connected to interconnect 1108. Mass storage subsystem 1112 can be based on magnetic, optical, semiconductor, or other data storage media. Direct attached storage, storage area networks, network-attached storage, and the like can be used. Any data stores or other collections of data described herein as being produced, consumed, or maintained by a service or server can be stored in mass storage subsystem 1112. In some embodiments, additional data storage resources may be accessible via WAN interface 1110 (potentially with increased latency).
[0106] Server system 1100 can operate in response to requests received via WAN interface 1110. For example, one of modules 1102 can implement a supervisory function and assign discrete tasks to other modules 1102 in response to received requests. Work allocation techniques can be used. As requests are processed, results can be returned to the requester via WAN interface 1110. Such operation can generally be automated. Further, in some embodiments, WAN interface 1110 can connect multiple server systems 1100 to each other, providing scalable systems capable of managing high volumes of activity. Techniques for managing server systems and server farms (collections of server systems that cooperate) can be used, including dynamic resource allocation and reallocation.
[0107] Server system 1100 can interact with various user-owned or user-operated devices via a wide-area network such as the Internet. An example of a user-operated device is shown in
[0108] For example, client computing system 1114 can communicate via WAN interface 1110. Client computing system 1114 can include computer components such as processing unit(s) 1116, storage device 1118, network interface 1120, user input device 1122, and user output device 1124. Client computing system 1114 can be a computing device implemented in a variety of form factors, such as a desktop computer, laptop computer, tablet computer, smartphone, other mobile computing device, wearable computing device, or the like.
[0109] Processing unit(s) 1116 and storage device 1118 can be similar to processing unit(s) 1104 and local storage 1106 described above. Suitable devices can be selected based on the demands to be placed on client computing system 1114; for example, client computing system 1114 can be implemented as a “thin” client with limited processing capability or as a high-powered computing device. Client computing system 1114 can be provisioned with program code executable by processing unit(s) 1116 to enable various interactions with server system 1100 of a message management service such as accessing messages, performing actions on messages, and other interactions described above. Some client computing systems 1114 can also interact with a messaging service independently of the message management service.
[0110] Network interface 1120 can provide a connection to the network 1126, such as a wide area network (e.g., the Internet) to which WAN interface 1110 of server system 1100 is also connected. In various embodiments, network interface 1120 can include a wired interface (e.g., Ethernet) and/or a wireless interface implementing various RF data communication standards such as Wi-Fi, Bluetooth, or cellular data network standards (e.g., 3G, 4G, LTE, etc.).
[0111] User input device 1122 can include any device (or devices) via which a user can provide signals to client computing system 1114; client computing system 1114 can interpret the signals as indicative of particular user requests or information. In various embodiments, user input device 1122 can include any or all of a keyboard, touch pad, touch screen, mouse or other pointing device, scroll wheel, click wheel, dial, button, switch, keypad, microphone, and so on.
[0112] User output device 1124 can include any device via which client computing system 1114 can provide information to a user. For example, user output device 1124 can include a display to display images generated by or delivered to client computing system 1114. The display can incorporate various image generation technologies, e.g., a liquid crystal display (LCD), light-emitting diode (LED) including organic light-emitting diodes (OLED), projection system, cathode ray tube (CRT), or the like, together with supporting electronics (e.g., digital-to-analog or analog-to-digital converters, signal processors, or the like). Some embodiments can include a device such as a touchscreen that functions as both an input and an output device. In some embodiments, other user output devices 1124 can be provided in addition to or instead of a display. Examples include indicator lights, speakers, tactile “display” devices, printers, and so on.
[0113] Some embodiments include electronic components, such as microprocessors, storage and memory that store computer program instructions in a computer readable storage medium. Many of the features described in this specification can be implemented as processes that are specified as a set of program instructions encoded on a computer readable storage medium. When these program instructions are executed by one or more processing units, they cause the processing unit(s) to perform various operations indicated in the program instructions. Examples of program instructions or computer code include machine code, such as is produced by a compiler, and files including higher-level code that are executed by a computer, an electronic component, or a microprocessor using an interpreter. Through suitable programming, processing unit(s) 1104 and 1116 can provide various functionality for server system 1100 and client computing system 1114, including any of the functionality described herein as being performed by a server or client, or other functionality associated with message management services.
[0114] It will be appreciated that server system 1100 and client computing system 1114 are illustrative and that variations and modifications are possible. Computer systems used in connection with embodiments of the present disclosure can have other capabilities not specifically described here. Further, while server system 1100 and client computing system 1114 are described with reference to particular blocks, it is to be understood that these blocks are defined for convenience of description and are not intended to imply a particular physical arrangement of component parts. For instance, different blocks can be but need not be located in the same facility, in the same server rack, or on the same motherboard. Further, the blocks need not correspond to physically distinct components. Blocks can be configured to perform various operations, e.g., by programming a processor or providing appropriate control circuitry, and various blocks might or might not be reconfigurable depending on how the initial configuration is obtained. Embodiments of the present disclosure can be realized in a variety of apparatus including electronic devices implemented using any combination of circuitry and software.
[0115] Various potential embodiments of the disclosure include:
[0116] Embodiment A: A method for classifying tumor origin sites, the method comprising: sequencing genetic material in a tissue sample from a subject to generate a subject sample dataset comprising one or more subject genes and one or more subject gene alteration categories; applying a predictive model to the subject sample dataset to generate one or more cancer origin site classifications, the predictive model having been trained using a training dataset generated from sequence reads corresponding to genetic material from a cohort of study subjects with known cancers, the training dataset comprising one or more genes, one or more gene alteration categories corresponding to the one or more genes, and one or more labels characterizing tumor origin sites for the known cancers of the study subjects in the cohort; and storing, in one or more data structures, an association between the subject and the one or more cancer origin site classifications.
[0117] Embodiment B: The method of Embodiment A, wherein the predictive model is a random forest classification model.
[0118] Embodiment C: The method of either Embodiment A or B, wherein a feature set for the predictive model comprises one or more categories selected from a group consisting of mutations, indels, focal amplifications and deletions, broad copy number gains and losses, structural rearrangements, mutation signatures, mutation rate, and sex.
[0119] Embodiment D: The method of any of Embodiments A-C, wherein classifier scores for the predictive model were calibrated using multinomial logistic regression to match empirically observed classification probabilities.
[0120] Embodiment E: The method of any of Embodiments A-D, further comprising training the predictive model.
[0121] Embodiment F: The method of any of Embodiments A-E, wherein the predictive model is trained using supervised learning.
[0122] Embodiment G: The method of any of Embodiments A-F, wherein the predictive model is trained using unsupervised learning.
[0123] Embodiment H: The method of any of Embodiments A-G, further comprising generating the training dataset.
[0124] Embodiment I: The method of any of Embodiments A-H, wherein generating the training dataset comprises acquiring, from a sequencing device, the sequence reads corresponding to the genetic material from the study subjects in the cohort, and using the sequence reads to generate the training dataset.
[0125] Embodiment J: The method of any of Embodiments A-I, wherein the cohort excludes study subjects with rare cancers not in the top 30 most common cancer types.
[0126] Embodiment K: The method of any of Embodiments A-J, wherein the training dataset comprises gene alteration categories comprising one or more selected from a group consisting of gene amplification (AMP), chromosome gain, homozygous deletion, hotspot, allele, chromosome loss, promoter, signature, structural variant (SV), truncation, and variant of unknown significance (VUS).
[0127] Embodiment L: The method of any of Embodiments A-K, wherein the one or more labels indicate whether a set of genes in the training dataset is from a cancer subject in the cohort of study subjects.
[0128] Embodiment M: The method of any of Embodiments A-L, wherein the predictive model is configured to accept data on genes and gene alterations as inputs and to provide one or more cancer origin site classifications as output.
[0129] Embodiment N: The method of any of Embodiments A-M, wherein the one or more cancer origin site classifications identify at least one of an internal organ of the subject or a cancer type.
[0130] Embodiment O: The method of any of Embodiments A-N, wherein the predictive model is further configured to generate a confidence score for each cancer origin site classification.
[0131] Embodiment P: The method of any of Embodiments A-O, wherein each confidence score corresponds with a likelihood of a cancer origin site for a tumor.
[0132] Embodiment Q: A system for classifying tumor origin sites, the system comprising a computing device having one or more processors configured to: acquire, from a sequencing device, sequence reads corresponding to genetic material in a tissue sample from a subject; generate, using the sequence reads, a subject sample dataset comprising one or more subject genes and one or more subject gene alteration categories; and apply a predictive model to the subject sample dataset to generate one or more cancer origin site classifications, the predictive model having been trained using a training dataset generated using sequence reads corresponding to genetic material from a cohort of study subjects with known cancers, the training dataset comprising one or more genes, one or more gene alteration categories corresponding to the one or more genes, and one or more labels characterizing tumor origin sites for the known cancers of the study subjects in the cohort.
[0133] Embodiment R: The system of Embodiment Q, wherein the one or more processors are further configured to store, in one or more data structures, an association between the subject and the one or more cancer origin site classifications.
[0134] Embodiment S: The system of either Embodiment Q or R, wherein the predictive model is a random forest classification model.
[0135] Embodiment T: The system of any of Embodiments Q-S, wherein the one or more processors are further configured to train the predictive model such that it is configured to accept data on genes and gene alterations as inputs and to provide one or more cancer origin site classifications as output.
[0136] Embodiment U: The system of any of Embodiments Q-T, wherein the one or more processors are configured to generate the training dataset using the sequence reads corresponding to the genetic material from the study subjects in the cohort.
[0137] Embodiment V: The system of any of Embodiments Q-U, wherein the predictive model is trained such that it is configured to accept data on genes and gene alterations as inputs and to provide one or more cancer origin site classifications as output.
[0138] Embodiment W: The system of any of Embodiments Q-V, wherein the predictive model is further configured to generate a confidence score for each cancer origin site classification.
[0139] Embodiment X: The system of any of Embodiments Q-W, wherein each confidence score corresponds with a likelihood of a cancer origin site for a tumor.
[0140] Embodiment Y: A system for determining sites of origin for cancer based on sequencing of genes, the system comprising one or more processors configured to: obtain a training dataset comprising a plurality of sample-derived genetic sequences corresponding to a plurality of cancer subjects, each sample defining a set of genes and a category, the category of each sample defining at least one alteration to the set of genes and/or at least one genomic alteration in the sample; train, using the plurality of sample genetic sequences, a classification model configured to generate likelihoods for corresponding cancer origin sites; acquire, via a sequencer, a genetic sequence corresponding to a subject, the genetic sequence including a set of genes and a category, the category of the genetic sequence defining a nature of alteration to the set of genes in the genetic sequence; and apply the classification model to the genetic sequence to determine a set of likelihoods for a corresponding set of origin sites of cancers, each likelihood indicating a probability measure that the genetic sequence correlates with a presence of cancer at a corresponding origin site.
[0141] Embodiment Z: The system of Embodiment Y, wherein the classification model is trained as a random forest classification model.
[0142] Embodiment AA: The system of either Embodiment Y or Z, wherein the one or more processors are configured to generate the training dataset using sequence reads from the sequencer.
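The applying and storing steps shared by several of the embodiments above (a model that returns a confidence score per origin site, and a data structure that associates the subject with those classifications) might be wired together as in the following sketch. Everything here is hypothetical: `classify_and_store` and `placeholder_model` are illustrative names, the model is a stub that returns fixed scores rather than a trained classifier, and an in-memory dictionary stands in for whatever data store an implementation would use.

```python
from datetime import datetime, timezone


def classify_and_store(subject_id, sample_features, model, store):
    """Apply a model to one subject's features and record the result.

    `model` is any callable that returns {origin_site: confidence score}.
    `store` is a dict-like object keyed by subject id.
    """
    scores = model(sample_features)
    # Rank origin sites from most to least likely.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    store[subject_id] = {
        "classifications": [
            {"origin_site": site, "confidence": score} for site, score in ranked
        ],
        "stored_at": datetime.now(timezone.utc).isoformat(),
    }
    return store[subject_id]


def placeholder_model(sample_features):
    """Stub standing in for a trained classifier; the scores are arbitrary."""
    return {"site_B": 0.2, "site_A": 0.7, "site_C": 0.1}
```

A call such as `classify_and_store("subject-001", {"GENE_X": "Truncation"}, placeholder_model, {})` would return a record whose first classification is the highest-scoring site.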
[0143] While the disclosure has been described with respect to specific embodiments, one skilled in the art will recognize that numerous modifications are possible. Embodiments of the disclosure can be realized using a variety of computer systems and communication technologies including but not limited to specific examples described herein.
[0144] Embodiments of the present disclosure can be realized using any combination of dedicated components and/or programmable processors and/or other programmable devices. The various processes described herein can be implemented on the same processor or different processors in any combination. Where components are described as being configured to perform certain operations, such configuration can be accomplished, e.g., by designing electronic circuits to perform the operation, by programming programmable electronic circuits (such as microprocessors) to perform the operation, or any combination thereof. Further, while the embodiments described above may make reference to specific hardware and software components, those skilled in the art will appreciate that different combinations of hardware and/or software components may also be used and that particular operations described as being implemented in hardware might also be implemented in software or vice versa.
[0145] Computer programs incorporating various features of the present disclosure may be encoded and stored on various computer readable storage media; suitable media include magnetic disk or tape, optical storage media such as compact disk (CD) or DVD (digital versatile disk), flash memory, and other non-transitory media. Computer readable media encoded with the program code may be packaged with a compatible electronic device, or the program code may be provided separately from electronic devices (e.g., via Internet download or as a separately packaged computer-readable storage medium).
[0146] Thus, although the disclosure has been described with respect to specific embodiments, it will be appreciated that the disclosure is intended to cover all modifications and equivalents within the scope of the following claims.