Diagnosis of Malignancy Using Developmental Relationships and Machine Learning
20220415438 · 2022-12-29
Inventors
Cpc classification
G16B25/10
PHYSICS
G16B25/00
PHYSICS
G16B45/00
PHYSICS
International classification
G16B25/00
PHYSICS
Abstract
A computer-implemented method and system uses a map which maps from gene expression data for a plurality of training tumors in a tumor atlas to gene expression data representing single cells derived from mammal samples in developmental stages in a single-cell atlas. The method and system: (A) use the map to extract, from the plurality of training tumors, a plurality of biological components, thereby generating, for each training tumor-biological component pair, a corresponding biological component score; and (B) construct, based on the two atlases and the map, a machine learning perceptron classifier that outputs a tumor type of an input tumor based on its gene expression data. The method and system may generate the map before using it. The method and system may apply the machine learning perceptron classifier to the input tumor's gene expression data to generate the tumor type of the input tumor.
Claims
1. A method, performed by at least one computer processor executing computer program instructions stored on at least one non-transitory computer-readable medium, for using a map which maps from gene expression data for a plurality of training tumors in a tumor atlas to gene expression data representing single cells derived from mammal samples in developmental stages in a single-cell atlas, the method comprising: (A) using the map to extract, from the plurality of training tumors, a plurality of biological components, thereby generating, for each of the plurality of training tumors and each of the plurality of biological components, a corresponding biological component score representing the training tumor as a compilation of non-gene terms describing a plurality of biological programs; and (B) constructing, based on the map, a machine learning perceptron classifier that outputs a tumor type of an input tumor based on gene expression data for the input tumor, wherein the input tumor is not among the plurality of training tumors.
2. The method of claim 1, further comprising: (C) before A, generating the map.
3. The method of claim 2, wherein generating the map comprises mapping each of the plurality of training tumors to at least one of the plurality of mammalian developmental trajectories in the single-cell atlas.
4. The method of claim 2, wherein (C) comprises: (C) (1) receiving the single-cell atlas; and (C) (2) receiving the tumor atlas; (C) (3) generating the map based on the single-cell atlas and the tumor atlas.
5. The method of claim 4, wherein the gene expression data for the plurality of training tumors comprises, for each of the plurality of training tumors: (1) gene sequencing data for the training tumor; and (2) a label indicating a type of cancer of the training tumor.
6. The method of claim 4, wherein the gene expression data representing single cells derived from mammal samples in developmental stages in the single-cell atlas comprises gene expression data representing organogenesis of a plurality of mammalian developmental trajectories.
7. The method of claim 1, wherein the tumor atlas comprises a version of The Cancer Genome Atlas (TCGA).
8. The method of claim 1, wherein the single-cell atlas comprises a developmental atlas.
9. The method of claim 8, wherein the single-cell atlas comprises a single-cell organogenesis atlas.
10. The method of claim 8, wherein the single-cell atlas comprises data representing normal development for each of a plurality of mammalian developmental trajectories.
11. The method of claim 10, wherein the single-cell atlas comprises a version of the Mouse Organogenesis Cell Atlas (MOCA).
12. The method of claim 1, wherein the plurality of known biological programs comprises a plurality of known developmental programs.
13. The method of claim 1, further comprising: (C) applying the machine learning perceptron classifier to the gene expression data for the input tumor to generate the tumor type of the input tumor.
14. The method of claim 13, wherein the input tumor comprises a sample of a cancer of unknown primary.
15. The method of claim 1, wherein using the map to extract, from the plurality of training tumors, the plurality of biological components comprises using the map to deconvolute the plurality of training tumors into the plurality of biological components.
16. The method of claim 1, wherein the developmental stages in the single-cell atlas comprise prenatal developmental stages in the single-cell atlas.
17. A system comprising at least one non-transitory computer-readable medium having computer program instructions stored thereon, the computer program instructions being executable by at least one computer processor to perform a method for using a map which maps from gene expression data for a plurality of training tumors in a tumor atlas to gene expression data representing single cells derived from mammal samples in developmental stages in a single-cell atlas, the method comprising: (A) using the map to extract, from the plurality of training tumors, a plurality of biological components, thereby generating, for each of the plurality of training tumors and each of the plurality of biological components, a corresponding biological component score representing the training tumor as a compilation of non-gene terms describing a plurality of biological programs; and (B) constructing, based on the map, a machine learning perceptron classifier that outputs a tumor type of an input tumor based on gene expression data for the input tumor, wherein the input tumor is not among the plurality of training tumors.
18. The system of claim 17, wherein the method further comprises: (C) before A, generating the map.
19. The system of claim 18, wherein generating the map comprises mapping each of the plurality of training tumors to at least one of the plurality of mammalian developmental trajectories in the single-cell atlas.
20. The system of claim 18, wherein (C) comprises: (C) (1) receiving the single-cell atlas; and (C) (2) receiving the tumor atlas; (C) (3) generating the map based on the single-cell atlas and the tumor atlas.
21. The system of claim 20, wherein the gene expression data for the plurality of training tumors comprises, for each of the plurality of training tumors: (1) gene sequencing data for the training tumor; and (2) a label indicating a type of cancer of the training tumor.
22. The system of claim 20, wherein the gene expression data representing single cells derived from mammal samples in developmental stages in the single-cell atlas comprises gene expression data representing organogenesis of a plurality of mammalian developmental trajectories.
23. The system of claim 17, wherein the tumor atlas comprises a version of The Cancer Genome Atlas (TCGA).
24. The system of claim 17, wherein the single-cell atlas comprises a developmental atlas.
25. The system of claim 24, wherein the single-cell atlas comprises a single-cell organogenesis atlas.
26. The system of claim 24, wherein the single-cell atlas comprises data representing normal development for each of a plurality of mammalian developmental trajectories.
27. The system of claim 26, wherein the single-cell atlas comprises a version of the Mouse Organogenesis Cell Atlas (MOCA).
28. The system of claim 17, wherein the plurality of known biological programs comprises a plurality of known developmental programs.
29. The system of claim 17, further comprising: (C) applying the machine learning perceptron classifier to the gene expression data for the input tumor to generate the tumor type of the input tumor.
30. The system of claim 29, wherein the input tumor comprises a sample of a cancer of unknown primary.
31. The system of claim 17, wherein using the map to extract, from the plurality of training tumors, the plurality of biological components comprises using the map to deconvolute the plurality of training tumors into the plurality of biological components.
32. The system of claim 17, wherein the developmental stages in the single-cell atlas comprise prenatal developmental stages in the single-cell atlas.
33-56. (canceled)
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
[0027] Referring to
[0028] The system 100 includes a tumor type identification module 102 which receives, as input, data 104 representing the input tumor (
[0029] Embodiments of the present invention may generate the tumor type data 106 in a variety of ways. For example, referring to
[0030] The system 110 includes a classifier construction module 112. As described in more detail below, the classifier construction module 112 constructs a classifier 114, which is capable of outputting data representing a tumor type of an input tumor. For example, the tumor type identification module 102 of
[0031] The classifier construction module 112 may construct the classifier 114 in any of a variety of ways. For example, the system 110 may include a tumor atlas 118 and a single-cell atlas 120, which may be used to produce a map 116 (as described in more detail below in connection with
[0032] The map 116 may map from (some or all of) the gene expression data for the plurality of training tumors in the tumor atlas 118 to (some or all of) the gene expression data representing single cells in the single-cell atlas 120.
[0033] The system 110 may include a biological component score generation module 122, which may receive the map 116 produced from the tumor atlas 118 and the single-cell atlas 120, and use the map 116 to extract, from the plurality of training tumors in the tumor atlas 118, a plurality of biological components, thereby generating, for each of the plurality of training tumors and each of the plurality of biological components, a corresponding biological component score representing the training tumor as a compilation of non-gene terms describing known biological programs (such as a plurality of known developmental programs) (
[0034] The extraction of the plurality of biological components in operation 212 may, for example, include using the map 116 to deconvolute the plurality of training tumors into the plurality of biological components.
[0035] The classifier construction module 112 may construct the machine learning perceptron classifier 114 based on the map 116, which was constructed from the single-cell atlas 120 and the tumor atlas 118 (
[0036] Note that the input tumor may not be among the plurality of training tumors; i.e., the classifier 114 can generate tumor types for tumors that were not in the tumor atlas 118 that was used to construct the classifier 114.
[0037] The tumor atlas 118 may take any of a variety of forms. For example, in some embodiments of the present invention, the tumor atlas 118 is, or includes, or is included within, some or all of a version of The Cancer Genome Atlas (TCGA).
[0038] The single-cell atlas 120 may take any of a variety of forms. For example, in some embodiments of the present invention, the single-cell atlas is, or includes, or is included within, a developmental atlas. Such a developmental atlas may, for example, be a single-cell organogenesis atlas. The single-cell atlas may, for example, include data representing normal development for each of a plurality of mammalian developmental trajectories. The single-cell atlas may, for example, be, or include, or be included within a version of the Mouse Organogenesis Cell Atlas (MOCA).
[0039] Embodiments of the present invention may generate the map 116 before performing the method 200 of
[0040] The system 130 includes a map generation module 132, which receives as inputs: (1) the single-cell atlas 120, which includes gene expression data representing organogenesis of a plurality of mammalian development trajectories (
[0041] Generating the map in operation 236 may include, for example, mapping each of the plurality of training tumors (in the tumor atlas 118) to at least one of the plurality of mammalian developmental trajectories in the single-cell atlas 120.
[0042] Having described various embodiments of the present invention at a high level, certain implementations of embodiments of the present invention will now be described in more detail.
Tumor Datasets
[0043] It was described above that a tumor atlas 118 may be used and that the tumor atlas 118 may include gene expression data for a plurality of training tumors, such as gene sequencing data and a label indicating a type of cancer of the training tumor. The tumor atlas 118 may include The Cancer Genome Atlas (TCGA), non-TCGA sample cohort data, and/or institution-derived sample cohort data. TCGA contains bulk RNA sequencing data for 33 tumor and normal tissue types accompanied by diagnostic annotations. There are 10,388 samples, with 62 different sample types. Non-TCGA data may include tumor samples obtained from various cancer studies. Institution-derived data may be used for validation of methods in an independent sample set. Below, additional details are provided about these datasets.
The Cancer Genome Atlas (TCGA) Data
[0044] The tumor atlas 118 may include, for example, The Cancer Genome Atlas (TCGA). For example, the coding gene expression profile [RNAseqV2_RSEM_genes_normalized_data_Level_3] and clinical information [Merge_Clinical.Level_1.2016012800.0.0] of TCGA samples (release 2016_02_28) may be systematically downloaded using firehose_get v 0.4.1 tool, from the Broad TCGA GDAC site (https://gdac.broadinstitute.org/). This contains the data and the analytic categories used by TCGA consortium, including for aggregating data for the following tumor types: COADREAD (COAD+READ), GBMLGG (GBM+LGG), KIDNEY (KICH+KIRC+KIRP), and STES (ESCA+STAD). A category LUNG may be created by aggregating LUAD and LUSC.
Non-TCGA Sample Cohort Data
[0045] The tumor atlas 118 may also include, for example, data aggregated from various cancer studies. For example, a non-TCGA sample cohort may be created by aggregating data from cancer studies with gene expression profiles and clinical information retrieved from the Genomic Data Commons (GDC) Data Portal (https://portal.gdc.cancer.gov/). A merging of TCGA categories may allow incorporation of the maximum number of non-TCGA samples, as non-TCGA studies use slightly different diagnostic categories.
Institution-Derived Data, e.g. MGH Sample Cohort Data
[0046] The tumor atlas 118 may also include, for example, tumor data derived from a particular institution such as Massachusetts General Hospital (MGH). For example, in an embodiment, an MGH Sample Cohort may be created using samples from formalin fixed paraffin embedded (FFPE) tissue from cases seen in the Center for Integrated Diagnostics in the Department of Pathology at Massachusetts General Hospital either with known diagnosis (33 cases) or as cancers of unknown primary (52 cases). Total nucleic acid may be isolated from six scraped blank slides using clinically validated protocols.
RNA Extraction from FFPE Clinical Samples and RNA Sequencing Analysis of FFPE Clinical Samples
[0047] Tumor data derived from FFPE clinical samples may be obtained through RNA extraction and RNA sequencing processes. RNA extraction from FFPE clinical samples may involve the following steps. Libraries may be prepared using a modified version of the Takara SMARTer Stranded Total RNA-Seq Kit - Pico Input Mammalian. In brief, 100 ng of RNA at 10 ng/µL may be sonicated using an RL230 Covaris sonicator (Covaris Inc), and the resulting material may be confirmed using a Fragment Analyzer (Agilent). Then 10 ng of each sonicated sample may be prepared using the pico input kit as for FFPE samples, using a 1:8 volume reduction on the STP MosquitoHV. Final libraries may be validated by Fragment Analyzer and qPCR prior to sequencing on a NovaSeq6000 S4 with 150 nt paired-end reads.
[0048] RNA-sequencing analysis of FFPE-derived clinical samples may be performed. Reads obtained from the sequencing step may be processed as follows. A STAR reference genome may be generated from the GENCODE v35 FASTA and GTF files using STAR ‘genomeGenerate’. Next, FASTQ files may be aligned to this reference genome with STAR using two-pass mapping (see mgh sequencing.sh for details on STAR parameters). This step may generate BAM files compatible with RSEM gene expression calculation. An RSEM reference may be prepared using the rsem-prepare-reference command. The reference files, along with the BAM files, may be used to calculate gene expression with rsem-calculate-expression using the parameters -p 16, --bam, --paired-end, --no-bam-output, --forward-prob 0.5, and --seed 12345.
[0049] Gene expression measured in transcripts per million (TPM) may be used to assess similarity with the MOCA dataset. The Homo sapiens primary assembly genome (v35, GRCh38.p13) and the corresponding annotation GTF file (v35) may be downloaded from the GENCODE website (https://www.gencodegenes.org/). STAR v2.7.1a, RSEM v1.3.1, R v3.6.0 (https://www.R-project.org/), and Perl v5.24.1 may be used for gene expression analysis.
Cancer Single-Cell Sequencing Data
[0050] Tumor-related data 104 may also include cancer single-cell sequencing data. For example, cancer single-cell sequencing data may include the expression profile of normal and malignant single cells from 13 tumor types across 17 different studies. All of the above studies may be used to generate pseudobulk cohorts to test how purity affects a classifier's prediction as seen in
In-Silico Generation of a Tumor-Normal Mixed Sample Cohort
[0056] In-silico generation of a tumor-normal mixed sample cohort may be part of data processing. To test the effect of tumor purity on classifier accuracy (
Example Organ Development Datasets
[0057] It was described above that the single-cell atlas 120 may, for example, include gene expression data representing single cells derived from mammal samples in developmental stages, or may include data representing normal development for each of a plurality of mammalian development trajectories. The single-cell atlas could be, for example, the Mouse Organogenesis Cell Atlas (MOCA), and/or the Human Fetal Organs (HFO) dataset.
[0058] In brief, the Mouse Organogenesis Cell Atlas (MOCA) contains single cell RNA sequencing data for organogenesis in mice. MOCA arranges single cells into developmental trajectories, and contains RNA sequencing data for 1,331,984 cells. MOCA defines developmental trajectories; the process of defining developmental trajectories is described in the paper introducing the MOCA dataset. The Human Fetal Organs (HFO) dataset contains single cell RNA sequencing data of 377,456 single cells from 15 human fetal organs. Below, more detail is provided about MOCA and HFO.
The Mouse Organogenesis Cell Atlas (MOCA)
[0059] The expression profiles and metadata of the MOCA single-cell RNA sequencing data may be manually downloaded from: https://oncoscape.v3.sttrcancer.org/atlas.gs.washington.edu.mouse.rna/downloads. The expression data of the 1,331,984 high quality cells defined in the MOCA study may be used. Filtering may be applied to MOCA data. MOCA filtering may include removing cells with fewer than 400 detected mRNA molecules, removing all detected doublet cells, and removing all cells from doublet-derived sub-clusters. In the MOCA study, the authors identified 10 main and 56 sub, non-continuous trajectories based on transcriptional similarities between the analyzed cells and literature-curated marker genes. These trajectories may be used as part of the calculation of biological component scores.
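By way of a non-limiting illustration, the filtering criteria described above may be sketched as follows. The record layout and the field names (`n_mrna`, `is_doublet`, `in_doublet_subcluster`) are hypothetical and are not taken from the MOCA release; only the threshold of 400 detected mRNA molecules comes from the description above.

```python
# Minimal sketch of the MOCA-style cell filtering described above.
# The per-cell records and their field names are hypothetical stand-ins.
MIN_MRNA = 400  # threshold stated in the description

cells = [
    {"id": "cell_1", "n_mrna": 950, "is_doublet": False, "in_doublet_subcluster": False},
    {"id": "cell_2", "n_mrna": 120, "is_doublet": False, "in_doublet_subcluster": False},
    {"id": "cell_3", "n_mrna": 2100, "is_doublet": True, "in_doublet_subcluster": False},
    {"id": "cell_4", "n_mrna": 780, "is_doublet": False, "in_doublet_subcluster": True},
    {"id": "cell_5", "n_mrna": 400, "is_doublet": False, "in_doublet_subcluster": False},
]


def keep_cell(cell):
    """Return True if the cell passes all three filtering criteria."""
    return (
        cell["n_mrna"] >= MIN_MRNA              # not fewer than 400 detected mRNA molecules
        and not cell["is_doublet"]              # not a detected doublet
        and not cell["in_doublet_subcluster"]   # not in a doublet-derived sub-cluster
    )


kept_ids = [cell["id"] for cell in cells if keep_cell(cell)]
```

In this toy input, only `cell_1` and `cell_5` pass all three criteria.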
Human Fetal Organs (HFO) Single Cell Dataset
[0060] In another example, the single-cell atlas 120 may include the Human Fetal Organs (HFO) single cell dataset. The HFO dataset includes expression data of 377,456 single cells from 15 human fetal organs, together with related metadata, which may be downloaded from https://descartes.brotmanbaty.org/bbi/human-gene-expression-during-development/.
[0061] Similarities between TCGA correlations with MOCA and with HFO may be analyzed. To test the specificity and reproducibility of the correlation between the mouse expression profiles and human expression profiles, one may adopt the following approach. TCGA expression profiles may be correlated with HFO cells as previously described. The HFO cell types may then be mapped to the MOCA sub-trajectories. The correlation coefficient between the two similarity matrices (each 56×62, see formulas) may then be calculated using the R function cor.test(). For per-TCGA-sample-type correlations, the same function may be applied across the appropriate column of each matrix.
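By way of a non-limiting illustration, the comparison of the two similarity matrices may be sketched in Python as follows. The sketch assumes a Pearson correlation (the default method of cor.test() in R; the description does not state which method was used) and uses small 3×2 toy matrices in place of the 56×62 matrices. All variable names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flatten(matrix):
    return [value for row in matrix for value in row]


def column(matrix, j):
    return [row[j] for row in matrix]


# Toy stand-ins: rows = sub-trajectories, columns = TCGA sample types.
sim_moca = [[0.9, 0.1], [0.4, 0.5], [0.2, 0.8]]
sim_hfo = [[0.8, 0.2], [0.5, 0.4], [0.1, 0.9]]

# One coefficient for the matrices as a whole.
overall_r = pearson(flatten(sim_moca), flatten(sim_hfo))

# One coefficient per sample type (per column).
per_type_r = [pearson(column(sim_moca, j), column(sim_hfo, j)) for j in range(2)]
```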
High-Level Overview of the Diagnosis of Malignancy by Developmental Deconvolution and Machine Learning
Overview of the Mapping Process
[0063] It was described above that a map generation module 132 may generate a map 116 that maps from (some or all of) the gene expression data for the plurality of training tumors in the tumor atlas 118 to (some or all of) the gene expression data representing single cells in the single-cell atlas 120. It was further described above that such generation of the map 116 may be performed, for example, using normalization and rank-based, non-parametric approaches. In brief, the mapping process used by the map generation module 132 may include mouse-human gene name conversion followed by similarity score calculation and processing. For example, similarity score calculation may first involve calculation of Spearman's rank-ordered correlation coefficient of MOCA single-cell RNA sequencing data versus TCGA bulk-sequenced cancer samples. Spearman's correlation coefficient is calculated on the RNA sequence data of shared genes. This yields a matrix of dimension (number of cells in MOCA study) by (number of samples in TCGA). Next, the similarity score matrix is further processed. Each column is mean centered and standard deviation scaled. For each MOCA single cell, the Spearman correlation coefficients from the same TCGA tumor sample type are averaged. Then, an average is taken across all cells of the same MOCA developmental sub-trajectory to arrive at a single similarity score relating each tumor type to each developmental sub-trajectory. For MOCA and TCGA, the matrix produced by this step consists of 56 MOCA developmental sub-trajectories by 62 TCGA tumor types. Finally this matrix is processed with column-wise mean centering and standard deviation scaling, and min-max normalization, to yield a final similarity scores matrix that relates tumor types to developmental trajectories. This final similarity scores matrix is an example of the map 116 and the aforementioned steps are an example of the mapping process that a map generation module 132 may perform. 
Note that the steps for mapping can be performed between any organ development sample (e.g. MOCA and/or HFO), and any tumor sample (e.g. TCGA, non-TCGA, and/or MGH samples).
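By way of a non-limiting illustration, the aggregation, scaling, and normalization steps summarized above may be sketched as follows, starting from an already computed correlation matrix (rows are cells, columns are tumor samples). Several details are assumptions made only for this sketch: the sample standard deviation is used for scaling, min-max normalization is applied over the whole matrix rather than per column, and the toy labels (`A`, `B`, `C`, `T1`, `T2`) are hypothetical.

```python
import statistics


def zscore_columns(matrix):
    """Mean-center and standard-deviation-scale each column (sample SD assumed)."""
    scaled_cols = []
    for col in zip(*matrix):
        mu, sd = statistics.mean(col), statistics.stdev(col)
        scaled_cols.append([(v - mu) / sd for v in col])
    return [list(row) for row in zip(*scaled_cols)]


def average_columns_by_label(matrix, labels):
    """Average together the columns that share a label (e.g. tumor sample type)."""
    groups = sorted(set(labels))
    return groups, [
        [statistics.mean([v for v, lab in zip(row, labels) if lab == g]) for g in groups]
        for row in matrix
    ]


def average_rows_by_label(matrix, labels):
    """Average together the rows that share a label (e.g. sub-trajectory)."""
    groups = sorted(set(labels))
    n_cols = len(matrix[0])
    return groups, [
        [statistics.mean([row[j] for row, lab in zip(matrix, labels) if lab == g])
         for j in range(n_cols)]
        for g in groups
    ]


def min_max(matrix):
    """Rescale all entries to [0, 1] using the global minimum and maximum (assumed)."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    return [[(v - lo) / (hi - lo) for v in row] for row in matrix]


# Toy correlation matrix: 6 cells (rows) x 4 tumor samples (columns).
corr = [
    [0.9, 0.8, 0.1, 0.2],
    [0.8, 0.9, 0.2, 0.1],
    [0.5, 0.4, 0.5, 0.6],
    [0.4, 0.5, 0.6, 0.5],
    [0.1, 0.2, 0.9, 0.8],
    [0.2, 0.1, 0.8, 0.9],
]
cell_subtrajectory = ["A", "A", "B", "B", "C", "C"]
sample_type = ["T1", "T1", "T2", "T2"]

step1 = zscore_columns(corr)                                   # scale each sample column
tumor_types, step2 = average_columns_by_label(step1, sample_type)
subtrajectories, step3 = average_rows_by_label(step2, cell_subtrajectory)
step4 = zscore_columns(step3)                                  # column-wise scaling again
similarity_map = min_max(step4)                                # sub-trajectories x tumor types
```

In this toy example, sub-trajectory `A` receives the highest score for tumor type `T1` and sub-trajectory `C` the highest score for `T2`, mirroring the structure placed in the input matrix.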
[0064] An example of the mapping process used by the map generation module 132 is shown in
Mouse-Human Gene Names Conversion
[0065] The first step of the mapping process may be mouse-human gene names conversion. For example, to compare murine gene expression from MOCA to human gene expression in TCGA, gene names may be standardized between mouse and human using the following approach. A conversion from Mouse Ensembl id (MOCA) to Human gene symbol/Entrez id (TCGA) may be achieved using BioMart (https://www.ensembl.org/biomart/martview/) Ensembl v95. Human gene symbol/Entrez id (TCGA) to Ensembl id (NON-TCGA) mapping may be achieved using org.Hs.egENSEMBL2EG from org.Hs.eg.db (v3.8.2) Bioconductor (v3.9) (https://bioconductor.org/) package. The intersection of these two sources may be performed using the Human gene symbol/Entrez id shared identifier. This process may generate a list of translated names. This list may then be used as a dictionary for gene names and mouse-human orthologue comparison. In an example, this may identify on the order of 15,929 unique human genes that may be used. In the case of multiple mouse gene names mapping to the same human gene name, the average expression levels may be calculated across occurrences.
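By way of a non-limiting illustration, the use of the translated-name list as a dictionary, including the averaging of multiple mouse genes that map to the same human gene, may be sketched as follows. The gene identifiers and expression values are hypothetical placeholders, not entries from BioMart or org.Hs.eg.db.

```python
from collections import defaultdict

# Hypothetical mouse-to-human dictionary; two mouse ids map to the same human gene.
mouse_to_human = {
    "MOUSE_ID_1": "HUMAN_GENE_A",
    "MOUSE_ID_2": "HUMAN_GENE_B",
    "MOUSE_ID_3": "HUMAN_GENE_B",
}

# Hypothetical expression values for one mouse cell, keyed by mouse gene id.
mouse_expression = {
    "MOUSE_ID_1": 4.0,
    "MOUSE_ID_2": 2.0,
    "MOUSE_ID_3": 6.0,
    "MOUSE_ID_4": 1.0,  # no human orthologue in the dictionary, so it is dropped
}


def convert_to_human(expression, mapping):
    """Re-key expression by human gene, averaging mouse genes that share a human name."""
    buckets = defaultdict(list)
    for mouse_id, value in expression.items():
        human_name = mapping.get(mouse_id)
        if human_name is not None:
            buckets[human_name].append(value)
    return {name: sum(values) / len(values) for name, values in buckets.items()}


human_expression = convert_to_human(mouse_expression, mouse_to_human)
```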
Similarity Score Calculation
[0066] The second step of the mapping process may be similarity score calculation. The similarity between gene expression profiles from either MOCA or human fetal organs (HFO) cells and bulk samples (TCGA, NON-TCGA, MGH) or single cells from cancer datasets may be calculated by means of Spearman's correlation coefficient, implemented using the cor function in R on the expression profiles of all shared genes, identified as described above, for each pair of bulk or single-cell sample and MOCA cell. Spearman's correlation coefficient may provide advantages because this non-parametric, rank-based approach is more robust to outliers caused by single-cell transcript dropout and is unaffected by the normalization method, which standardizes the use of different gene expression datasets.
[0067] Spearman correlation generates a matrix C of correlation coefficients of dimensions n×m, where:
[0068] n represents the number of individual cells from the organ development study, e.g. MOCA or HFO. For the MOCA study, n=1,331,984. For HFO, n=377,456.
[0069] m represents the number of samples from the tumor study being compared, e.g. TCGA or MGH. For TCGA, m=10,393 (9,274 primary tumors, 394 metastases, and 725 normal tissues). For MGH, m=85.
The matrix of correlation coefficients for MOCA/TCGA samples is termed C_MOCA/TCGA, and is a 1,331,984×10,393 matrix containing 1.38e+10 correlation coefficients. C_HFO/TCGA is a 377,456×10,393 matrix containing 3.92e+09 correlation coefficients, and C_MOCA/MGH is a 1,331,984×85 matrix. The correlation coefficient may then be used as a metric for the similarity between the samples in the cohorts under examination. Correlation coefficients may then be further processed as described in the next section, “Similarity Score Aggregation, Scaling and Normalization.”
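By way of a non-limiting illustration, a Spearman correlation between one single-cell profile and one bulk profile over shared genes may be sketched in Python as follows. The sketch computes the coefficient as the Pearson correlation of average ranks, and the toy expression values are hypothetical. It also shows that a strictly increasing transformation of one profile (here log1p, chosen only as an example) leaves the coefficient unchanged, which is the sense in which a rank-based measure is insensitive to such normalization choices.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values receiving the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy profiles over five shared genes (same gene order in both lists).
cell_counts = [0, 0, 3, 1, 7]                   # single cell, with dropout zeros
bulk_values = [5.0, 12.0, 800.0, 40.0, 2500.0]  # bulk sample

rho_raw = spearman(cell_counts, bulk_values)
rho_log = spearman(cell_counts, [math.log1p(v) for v in bulk_values])
```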
Similarity Score Aggregation, Scaling and Normalization
[0070] To generate TCGA aggregated similarity scores shown in
Overall, after these aggregation, scaling, and normalization steps, a final matrix of similarity scores is produced. This final matrix of similarity scores is an example of a map 116 that provides a mapping between the tumor atlas 118 and the single-cell atlas 120.
Verifying that Calculated Coefficients Represent Meaningful Associations
[0079] It is possible to verify that the calculated coefficients in the map 116, represented in an example by the similarity scores matrix, display meaningful association between the two datasets by comparing them to those generated from row-randomized data (
[0080] In the next part of this analysis, one may average correlation coefficients across all cells of the same developmental sub-trajectory (
[0081] In order to further validate the observed correlations, we employed two approaches. In the first approach, we developed an optimized protocol for transcriptome sequencing from formalin fixed paraffin embedded tissue (FFPE), sequenced the transcriptome for 40 tumors of known types, and compared similarity for developmental trajectories to TCGA. Comparison between the FFPE cohort and TCGA cohort showed strong agreement, validating the method in an independent sample set.
[0082] In the second approach, we utilized a single cell atlas of human fetal tissues cataloguing later embryonic stages of mid-gestation development. More specifically, during the course of our study a partial atlas of mid-gestation human fetal development became available, and we used a representative set of cells provided by the human atlas, correlated them with TCGA sample types, and compared the results to those from the murine atlas.
Pan-Cancer Comparisons of Tumor-Normal Tissues and Embryonic Period
[0083] Pan-cancer comparisons of tumor-normal tissues and embryonic period may be performed. For every TCGA sample from tissue types that contained the expression profile of at least one normal sample, the number of the most strongly correlated cells (top 1,331 cells, sorted by correlation coefficient), was binned by their embryonic period of origin (E9.5-E13.5). This created a matrix of either normal or malignant samples versus the 5 embryonic time periods (E9.5, E10.5, E11.5, E12.5 and E13.5), wherein each matrix entry indicates the number of MOCA cells in the given category. This matrix was then analyzed using the chi-squared test to produce
Deconvolution By Developmental Components (DCs)
[0084] It was described above that the system 110 may include a biological component score generation module 122 that generates a biological component score representing the training tumor as a compilation of non-gene terms describing known biological programs (such as a plurality of known developmental programs).
[0085] In an example, the creation of a correlation map between TCGA samples and developmental trajectories inspired us to attempt a systematic developmental deconvolution of human tumor gene expression. In deconvolution, a recorded signal (bulk gene expression) made of component parts (developmental programs) is deconstructed into individual signals from each component (trajectories at embryonic timepoints). We used developmental components (DC), a single quantitative measure of each developmental sub-trajectory at each timepoint, to represent the developmental information for every TCGA sample. DCs were scaled across all tumor samples and charted on radar plots, which represent information about developmental period, sub-trajectory, and DC score for each sample. A schematic for this plot is shown (
[0086] To perform the deconvolution into developmental components, the following steps may be performed. For each TCGA, NON-TCGA, MGH bulk, and single cell sample, the MOCA cells were sorted in increasing (lowest to highest) order based on their correlation coefficient. Next, the 1,331 most strongly correlated cells were selected, representing the top ~0.1% of all MOCA cells tested. A rank-based score was then assigned to this selection of cells. The most highly correlated cell was given a score of 1331, while the least correlated cell in the selection was given a score of 1. These scores were then summed across all cells belonging to the same combination of sub-trajectory and developmental timepoint, creating the raw developmental component (DC) score. The raw scores were then transformed by taking the natural logarithm (ln). The association between the DCs and sets of samples was tested using the Kruskal-Wallis rank sum test, via the R function kruskal.test(). Only DCs with a Benjamini-Hochberg adjusted p-value<0.05 were considered statistically different between groups.
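By way of a non-limiting illustration, the computation of developmental component scores for a single tumor sample may be sketched as follows. The toy input uses 6 cells and selects the top 3 in place of the top 1,331; the group labels are hypothetical. How components with no selected cells are handled is not stated in the description, so this sketch simply omits them (an assumption), since the natural logarithm of zero is undefined.

```python
import math
from collections import defaultdict


def developmental_component_scores(correlations, cell_groups, top_n):
    """Rank-sum the top_n most correlated cells per (sub-trajectory, timepoint), then take ln.

    correlations: one correlation coefficient per cell, for a single tumor sample.
    cell_groups:  one (sub_trajectory, timepoint) tuple per cell, in the same order.
    """
    # Sort cell indices from lowest to highest correlation.
    order = sorted(range(len(correlations)), key=lambda i: correlations[i])
    selected = order[-top_n:]  # the top_n most strongly correlated cells

    raw = defaultdict(int)
    # Score 1 goes to the least correlated selected cell, top_n to the most correlated.
    for score, cell_index in enumerate(selected, start=1):
        raw[cell_groups[cell_index]] += score

    # Natural-log transform of the raw rank sums.
    return {group: math.log(total) for group, total in raw.items()}


correlations = [0.10, 0.50, 0.30, 0.45, 0.05, 0.40]
cell_groups = [
    ("neural", "E9.5"),
    ("epithelial", "E10.5"),
    ("neural", "E9.5"),
    ("epithelial", "E10.5"),
    ("neural", "E9.5"),
    ("hepatic", "E11.5"),
]

dc_scores = developmental_component_scores(correlations, cell_groups, top_n=3)
```

Here the three selected cells have correlations 0.40, 0.45, and 0.50, receiving scores 1, 2, and 3. The hypothetical `("epithelial", "E10.5")` component therefore has a raw score of 5 and the `("hepatic", "E11.5")` component a raw score of 1.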
Developmental Multilayer Perceptron
[0090] It was described above that the machine learning classifier 114 may be capable of outputting a tumor type of an input tumor based on gene expression data for the input tumor. The machine learning classifier 114 may be implemented as a multilayer perceptron (MLP). An MLP is a type of supervised machine learning model. An MLP trained on natural log transformed and min-max normalized developmental component (DC) scores may be termed a developmental multilayer perceptron (D-MLP). As an example, the D-MLP may have 214 deconvolution scores/developmental component scores (DC scores) as input; these DC scores are calculated as the similarity of each tumor's gene expression to embryologic developmental trajectories. The D-MLP may output scores for the 27 aggregated TCGA tumor classes. Examples used for training and testing the D-MLP may come from the TCGA, non-TCGA, and-or institution-specific (MGH cohort) datasets. To ensure reproducibility, the minimum and maximum values of the aggregated TCGA, non-TCGA and MGH cohort were calculated and used as the minimum and maximum for all min-max normalization, including for the normalization of the CUP cohort. The model's hyper-parameters were identified by means of grid search over the following variables: 1) number of hidden layers, 2) the number of nodes per layer, 3) the type of optimizer function, 4) the type of loss indicator and 5) the number of epochs for which the model is trained. The architecture selection and the training/validation of the model were performed by a 10-fold cross validation of a training-validation set split of 60%-10% of the totality of the TCGA, NON-TCGA and MGH cohorts. The performance of the D-MLP was tested on the remaining test set (30% of the whole cohort). None of the samples present in the test set were used during the model training or during the architecture selection. Further, the performance of the D-MLP was tested on an independent cohort of single cell derived-pseudo bulk samples. 
The final model has the following architecture: 1 input layer with 214 nodes, 2 hidden layers of 800 and 200 nodes, respectively, and 1 output layer with 27 nodes. The model was trained with a stochastic gradient descent optimizer using a mean squared error loss function. Accuracy was calculated as a performance metric. The model was trained on the training set for 300 epochs with an early stopping function based on the accuracy score, with a patience of 3 epochs. The D-MLP was implemented in Python (v3.6.4) using sklearn (v0.19.1) and Keras (v2.2.0, with TensorFlow backend).
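For illustration only, the layer sizes stated above may be exercised with a plain NumPy forward pass. This sketch does not reproduce the Keras model: the activation functions (ReLU in the hidden layers, softmax at the output), the weight initialization, and all function names are assumptions, since the description specifies only the layer sizes, the optimizer, and the loss.

```python
import numpy as np

LAYER_SIZES = [214, 800, 200, 27]  # input, two hidden layers, output


def init_params(rng, sizes=LAYER_SIZES):
    """Create one (weights, bias) pair per pair of consecutive layers."""
    return [(rng.normal(0.0, 0.05, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def forward(params, x):
    """Forward pass; ReLU and softmax are assumed, not taken from the source."""
    h = x
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)
    w, b = params[-1]
    logits = h @ w + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)


def mse_loss(probs, one_hot):
    """Mean squared error between the output scores and one-hot labels."""
    return float(np.mean((probs - one_hot) ** 2))


def count_params(params):
    """Total number of trainable weights and biases."""
    return sum(w.size + b.size for w, b in params)


rng = np.random.default_rng(0)
params = init_params(rng)
x = rng.uniform(0.0, 1.0, size=(5, 214))   # five hypothetical normalized samples
labels = rng.integers(0, 27, size=5)       # hypothetical class indices
probs = forward(params, x)                 # one row of 27 scores per sample
loss = mse_loss(probs, np.eye(27)[labels])
n_params = count_params(params)
```

With these layer sizes the network has 337,627 trainable parameters (172,000 + 160,200 + 5,427).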
[0091]
D-MLP Classification Analysis
[0092] A classification analysis may be performed of the results of the D-MLP. In an example, the raw likelihood scores resulting from the classification of the test set, CUP cohort, tumor purity cohort, and benchmark cohorts were each analyzed as follows. The output of the D-MLP classifier is a matrix containing a number of rows equal to the number of samples analyzed and a number of columns equal to 27, for the 27 classification labels. Each sample's top classification (defined by the highest likelihood score) is assigned as top1, the next as top2, and so on (up to 4+ as shown in
[0093]
[0094]
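As an illustrative sketch of the top1/top2 ranking described in the D-MLP classification analysis above, the following code ranks each row of a likelihood matrix and reports the position of the correct label. The function names and the three-label toy matrix are hypothetical; the actual output matrix has 27 columns.

```python
import numpy as np


def rank_predictions(scores):
    """Label indices per sample, ordered from highest to lowest score."""
    return np.argsort(-scores, axis=1, kind="stable")


def correct_rank(scores, true_labels):
    """1-based position of the true label in each sample's ranking."""
    order = rank_predictions(scores)
    return np.argmax(order == true_labels[:, None], axis=1) + 1


def top_k_accuracy(scores, true_labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    return float(np.mean(correct_rank(scores, true_labels) <= k))


# Toy likelihood matrix: 3 samples (rows) by 3 labels (columns).
scores = np.array([
    [0.70, 0.20, 0.10],
    [0.30, 0.60, 0.10],
    [0.25, 0.35, 0.40],
])
true_labels = np.array([0, 0, 0])

ranks = correct_rank(scores, true_labels)        # true label is top1, top2, top3
top1 = top_k_accuracy(scores, true_labels, k=1)
top3 = top_k_accuracy(scores, true_labels, k=3)
```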
[0095] Grouping of discordant predictions may be used as part of a performance assessment of a D-MLP. As an example, to assess the distance between top3 classification results of discordantly classified samples and concordantly classified samples, the following ‘hot-encoder’ approach was taken. Each sample's classification response was encoded using a vector of length 81 (27 possible prediction labels with a spot for each of the top3 predictions). Each element of this vector was then populated with a score of 5, 3, 2 or 0 depending on whether the correct answer was in the top1, top2, top3, or top4 (all other predictions) category. A top1 correctly classified sample would have a score of 5 repeated 3 times, a top2 correctly classified sample would have a score of 3 in the range 28 to 54, and so on. For example, an ACC sample correctly classified in top1 would have a score of 5 in positions 1, 28 and 55 (and 0s everywhere else), while an ACC sample guessed right in top2 would have a score of 3 in position 28 and 0s everywhere else. These vectors were then compared using cosine similarity, and UMAP was used to plot the results of the cosine similarity analysis.
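The encoding described above may be sketched as follows. This is an illustrative reading of the text rather than the original code: labels are 0-based here (so ACC at index 0 corresponds to positions 1, 28 and 55 in the 1-based description), the placement of the score of 2 in the third block for a top3 hit is inferred from "and so on", and returning a similarity of 0 for an all-zero vector is an assumption because cosine similarity is undefined in that case. The UMAP step is omitted.

```python
import numpy as np

N_LABELS = 27
SCORES = {1: 5.0, 2: 3.0, 3: 2.0}  # score by rank of the correct answer


def encode_response(true_label, rank):
    """Encode one sample as a length-81 vector (three blocks of 27 labels).

    true_label: 0-based index of the correct label.
    rank: 1-based rank at which the correct label was predicted; any rank
          above 3 yields an all-zero vector.
    """
    vec = np.zeros(3 * N_LABELS)
    if rank == 1:
        for block in range(3):  # the score of 5 is repeated in all three blocks
            vec[block * N_LABELS + true_label] = SCORES[1]
    elif rank in (2, 3):
        vec[(rank - 1) * N_LABELS + true_label] = SCORES[rank]
    return vec


def cosine_similarity(a, b):
    """Cosine similarity; defined as 0.0 here when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


acc_top1 = encode_response(true_label=0, rank=1)
acc_top2 = encode_response(true_label=0, rank=2)
acc_miss = encode_response(true_label=0, rank=4)

sim_top1_top2 = cosine_similarity(acc_top1, acc_top2)
sim_top1_miss = cosine_similarity(acc_top1, acc_miss)
```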
[0096] CUPs clustering and DC analysis may be performed. Raw DC scores of the CUP cohort were processed as follows. Each DC score was mean centered and standard deviation scaled (z-scored) across the whole cohort of 52 CUPs. After scaling, Spearman distance was calculated between different samples. This distance was evaluated by means of the Dist( ) function in the amap R package. The total number of samples is x; the function outputs an x×x matrix with the pairwise distances between the x samples. A hierarchical clustering analysis using the ‘ward.D’ algorithm was then performed on this distance matrix using the hclust( ) R function, and the resulting tree was cut into 4 main clusters (chosen based on observed distance between branch points) using the cutree( ) R function. Statistical analysis of the differential developmental programs between the 4 clusters was performed by the Kruskal-Wallis test using kruskal.test( ) in R. Enrichment for specific classifications was assessed using the chi-squared test. 70 DCs with a Bonferroni corrected p-value<0.05 were considered correlated with at least one cluster and are shown in
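The scaling, distance, and multiple-testing steps of the CUP clustering analysis described above may be sketched in Python as follows; the original analysis used R. Several choices are assumptions: the Spearman distance is taken as 1 minus the Spearman correlation, which may differ from the definition used by Dist( ) in the amap package; the sample standard deviation (ddof=1) is used for z-scoring; ranks are ordinal, so ties are not averaged; and the random input is hypothetical. Hierarchical clustering and the Kruskal-Wallis test are not reproduced.

```python
import numpy as np


def z_score(dc_scores):
    """Mean-center and scale each DC (column) across the cohort."""
    mean = dc_scores.mean(axis=0)
    std = dc_scores.std(axis=0, ddof=1)
    return (dc_scores - mean) / std


def rank_rows(matrix):
    """Ordinal ranks within each row (ties are not averaged)."""
    return np.argsort(np.argsort(matrix, axis=1), axis=1).astype(float)


def spearman_distance(matrix):
    """Pairwise sample distances, assumed here to be 1 - Spearman correlation."""
    rho = np.corrcoef(rank_rows(matrix))
    return 1.0 - rho


def bonferroni(p_values):
    """Bonferroni correction: multiply by the number of tests, capped at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * p.size, 1.0)


rng = np.random.default_rng(0)
cup_dc = rng.normal(size=(52, 214))   # hypothetical 52 CUPs by 214 DC scores
scaled = z_score(cup_dc)
dist = spearman_distance(scaled)      # 52 x 52 matrix of pairwise distances
adjusted = bonferroni([0.0001, 0.01, 0.5])
```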
[0097] The D-MLP may be compared to alternative configurations of machine learning models. In an example, the performance of the D-MLP classifier was tested against models sharing the same architecture trained on pure gene expression profiles directly without developmental deconvolution. This analysis included two sets of genes: I) clinical oncopanel genes and II) highly variable genes. I) Clinical oncopanel genes represent a list of 251 genes tested in routine clinical cancer care at the Massachusetts General Hospital (assays: SNaPshot, Solid Fusion Assay, Heme SNaPshot). To match feature counts with the D-MLP classifier, we generated 10 random subsets of 214 genes out of 251. Each of these 214-gene subsets was used to train a benchmark classifier, and the highest performing, as assessed by top1 and top3 accuracy, was directly compared against the developmental-based classifier (D-MLP) as shown (
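The subset generation for the oncopanel benchmark described above may be sketched as follows. The function names, the seed, and the accuracy values are hypothetical, and selecting the best benchmark by top1 accuracy with top3 accuracy as a tie-breaker is an assumption, since the text does not state how the two metrics were combined.

```python
import numpy as np


def random_gene_subsets(n_genes=251, subset_size=214, n_subsets=10, seed=0):
    """Draw index subsets without replacement from the oncopanel gene list."""
    rng = np.random.default_rng(seed)
    return [np.sort(rng.choice(n_genes, size=subset_size, replace=False))
            for _ in range(n_subsets)]


def pick_best(results):
    """Index of the best (top1, top3) accuracy pair, compared in that order."""
    return max(range(len(results)), key=lambda i: results[i])


subsets = random_gene_subsets()
# Hypothetical accuracies for three benchmark classifiers: (top1, top3).
best = pick_best([(0.50, 0.80), (0.60, 0.70), (0.60, 0.90)])
```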
Diagnosis of Cancer of Unknown Primary by a Developmental Multilayer Perceptron Classifier
[0098] As described above, the classifier 114 may be applied to gene expression data for an input tumor (e.g., the input tumor data 104). This input tumor may be a Cancer of Unknown Primary (CUP).
Statistical Analysis
[0099] Statistical analysis may be performed using R (v3.6.3). Enrichment (
[0100] One embodiment of the present invention is directed to a method, performed by at least one computer processor executing computer program instructions stored on at least one non-transitory computer-readable medium, for using a map which maps from gene expression data for a plurality of training tumors in a tumor atlas to gene expression data representing single cells derived from mammal samples in developmental stages in a single-cell atlas. The method includes: (A) using the map to extract, from the plurality of training tumors, a plurality of biological components, thereby generating, for each of the plurality of training tumors and each of the plurality of biological components, a corresponding biological component score representing the training tumor as a compilation of non-gene terms describing a plurality of biological programs; and (B) constructing, based on the map, a machine learning perceptron classifier that outputs a tumor type of an input tumor based on gene expression data for the input tumor, wherein the input tumor is not among the plurality of training tumors.
[0101] The method may further include: (C) before (A), generating the map. Generating the map may include mapping each of the plurality of training tumors to at least one of the plurality of mammalian developmental trajectories in the single-cell atlas. Operation (C) may include: (C)(1) receiving the single-cell atlas; (C)(2) receiving the tumor atlas; and (C)(3) generating the map based on the single-cell atlas and the tumor atlas. The gene expression data for the plurality of training tumors may include, for each of the plurality of training tumors: (1) gene sequencing data for the training tumor; and (2) a label indicating a type of cancer of the training tumor. The gene expression data representing single cells derived from mammal samples in developmental stages in the single-cell atlas may include gene expression data representing organogenesis of a plurality of mammalian developmental trajectories.
[0102] The tumor atlas may include a version of The Cancer Genome Atlas (TCGA).
[0103] The single-cell atlas may include a developmental atlas. The single-cell atlas may include a single-cell organogenesis atlas. The single-cell atlas may include data representing normal development for each of a plurality of mammalian developmental trajectories. The single-cell atlas may include a version of the Mouse Organogenesis Cell Atlas (MOCA).
[0104] The plurality of known biological programs may include a plurality of known developmental programs.
[0105] The method may further include: (C) applying the machine learning perceptron classifier to the gene expression data for the input tumor to generate the tumor type of the input tumor. The input tumor may include a sample of a cancer of unknown primary.
[0106] Using the map to extract, from the plurality of training tumors, the plurality of biological components may include using the map to deconvolute the plurality of training tumors into the plurality of biological components.
[0107] The developmental stages in the single-cell atlas may include prenatal developmental stages in the single-cell atlas.
[0108] Another embodiment of the present invention is directed to a method performed by at least one computer processor executing computer program instructions stored on at least one non-transitory computer-readable medium. The method includes: (A) receiving a single-cell atlas comprising gene expression data; (B) receiving a tumor atlas comprising gene expression data for a plurality of training tumors; and (C) generating a map, which maps from the gene expression data in the tumor atlas to the gene expression data in the single-cell atlas.
[0109] The gene expression data for the plurality of training tumors may include, for each of the plurality of training tumors: (1) gene sequencing data for the training tumor; and (2) a label indicating a type of cancer of the training tumor. The operation (C) may include generating the map using normalization and rank-based, non-parametric approaches.
[0110] The gene expression data in the single-cell atlas may include data representing organogenesis of a plurality of mammalian development trajectories. Generating the map may include mapping each of the plurality of training tumors to at least one of a plurality of mammalian developmental trajectories in the single-cell atlas. The gene expression data for the plurality of training tumors may include, for each of the plurality of training tumors: (1) gene sequencing data for the training tumor; and (2) a label indicating a type of cancer of the training tumor.
[0111] Another embodiment of the present invention is directed to a machine learning perceptron classifier generated by a method. The machine learning perceptron classifier is stored in at least one non-transitory computer-readable medium. The method includes: (A) using a map to extract, from a plurality of training tumors in a tumor atlas, a plurality of biological components, thereby generating, for each of the plurality of training tumors and each of the plurality of biological components, a corresponding biological component score representing the training tumor as a compilation of non-gene terms describing a plurality of biological programs, wherein the map maps from gene expression data for the plurality of training tumors in the tumor atlas to gene expression data representing single cells derived from mammal samples in developmental stages in a single-cell atlas; and (B) constructing, based on the map, the machine learning perceptron classifier, wherein the machine learning perceptron classifier is adapted to output a tumor type of an input tumor based on gene expression data for the input tumor, wherein the input tumor is not among the plurality of training tumors.
[0112] The gene expression data for the plurality of training tumors may include, for each of the plurality of training tumors: (1) gene sequencing data for the training tumor; and (2) a label indicating a type of cancer of the training tumor. The gene expression data representing single cells derived from mammal samples in developmental stages in the single-cell atlas may include gene expression data representing organogenesis of a plurality of mammalian developmental trajectories. The tumor atlas may include a version of The Cancer Genome Atlas (TCGA).
[0113] The single-cell atlas may include a developmental atlas. The single-cell atlas may include a single-cell organogenesis atlas. The single-cell atlas may include data representing normal development for each of a plurality of mammalian developmental trajectories. The single-cell atlas may include a version of the Mouse Organogenesis Cell Atlas (MOCA).
[0114] The plurality of known biological programs may include a plurality of known developmental programs. The input tumor may include a sample of a cancer of unknown primary.
[0115] Using the map to extract, from the plurality of training tumors, the plurality of biological components may include using the map to deconvolute the plurality of training tumors into the plurality of biological components. The developmental stages in the single-cell atlas may include prenatal developmental stages in the single-cell atlas.
[0116] It is to be understood that although the invention has been described above in terms of particular embodiments, the foregoing embodiments are provided as illustrative only, and do not limit or define the scope of the invention. Various other embodiments, including but not limited to the following, are also within the scope of the claims. For example, elements and components described herein may be further divided into additional components or joined together to form fewer components for performing the same functions.
[0117] Any of the functions disclosed herein may be implemented using means for performing those functions. Such means include, but are not limited to, any of the components disclosed herein, such as the computer-related components described below.
[0118] The techniques described above may be implemented, for example, in hardware, one or more computer programs tangibly stored on one or more computer-readable media, firmware, or any combination thereof. The techniques described above may be implemented in one or more computer programs executing on (or executable by) a programmable computer including any combination of any number of the following: a processor, a storage medium readable and/or writable by the processor (including, for example, volatile and non-volatile memory and/or storage elements), an input device, and an output device. Program code may be applied to input entered using the input device to perform the functions described and to generate output using the output device.
[0119] Embodiments of the present invention include features which are only possible and/or feasible to implement with the use of one or more computers, computer processors, and/or other elements of a computer system. Such features are either impossible or impractical to implement mentally and/or manually. For example, embodiments of the present invention use machine learning to train a model to generate diagnoses of tumors. Such training, and subsequent use, of a machine learning model is inherently rooted in computer technology and could not otherwise be performed mentally or manually by a human.
[0120] Any claims herein which affirmatively require a computer, a processor, a memory, or similar computer-related elements, are intended to require such elements, and should not be interpreted as if such elements are not present in or required by such claims. Such claims are not intended, and should not be interpreted, to cover methods and/or systems which lack the recited computer-related elements. For example, any method claim herein which recites that the claimed method is performed by a computer, a processor, a memory, and/or similar computer-related element, is intended to, and should only be interpreted to, encompass methods which are performed by the recited computer-related element(s). Such a method claim should not be interpreted, for example, to encompass a method that is performed mentally or by hand (e.g., using pencil and paper). Similarly, any product claim herein which recites that the claimed product includes a computer, a processor, a memory, and/or similar computer-related element, is intended to, and should only be interpreted to, encompass products which include the recited computer-related element(s). Such a product claim should not be interpreted, for example, to encompass a product that does not include the recited computer-related element(s).
[0121] Each computer program within the scope of the claims below may be implemented in any programming language, such as assembly language, machine language, a high-level procedural programming language, or an object-oriented programming language. The programming language may, for example, be a compiled or interpreted programming language.
[0122] Each such computer program may be implemented in a computer program product tangibly embodied in a machine-readable storage device for execution by a computer processor. Method steps of the invention may be performed by one or more computer processors executing a program tangibly embodied on a computer-readable medium to perform functions of the invention by operating on input and generating output. Suitable processors include, by way of example, both general and special purpose microprocessors. Generally, the processor receives (reads) instructions and data from a memory (such as a read-only memory and/or a random access memory) and writes (stores) instructions and data to the memory. Storage devices suitable for tangibly embodying computer program instructions and data include, for example, all forms of non-volatile memory, such as semiconductor memory devices, including EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROMs. Any of the foregoing may be supplemented by, or incorporated in, specially-designed ASICs (application-specific integrated circuits) or FPGAs (Field-Programmable Gate Arrays). A computer can generally also receive (read) programs and data from, and write (store) programs and data to, a non-transitory computer-readable storage medium such as an internal disk (not shown) or a removable disk. These elements will also be found in a conventional desktop or workstation computer as well as other computers suitable for executing computer programs implementing the methods described herein, which may be used in conjunction with any digital print engine or marking engine, display monitor, or other raster output device capable of producing color or gray scale pixels on paper, film, display screen, or other output medium.
[0123] Any data disclosed herein may be implemented, for example, in one or more data structures tangibly stored on a non-transitory computer-readable medium. Embodiments of the invention may store such data in such data structure(s) and read such data from such data structure(s).
[0124] Any step or act disclosed herein as being performed, or capable of being performed, by a computer or other machine, may be performed automatically by a computer or other machine, whether or not explicitly disclosed as such herein. A step or act that is performed automatically is performed solely by a computer or other machine, without human intervention. A step or act that is performed automatically may, for example, operate solely on inputs received from a computer or other machine, and not from a human. A step or act that is performed automatically may, for example, be initiated by a signal received from a computer or other machine, and not from a human. A step or act that is performed automatically may, for example, provide output to a computer or other machine, and not to a human.
[0125] The terms “A or B,” “at least one of A or/and B,” “at least one of A and B,” “at least one of A or B,” or “one or more of A or/and B” used in the various embodiments of the present disclosure include any and all combinations of words enumerated with it. For example, “A or B,” “at least one of A and B” or “at least one of A or B” may mean: (1) including at least one A, (2) including at least one B, (3) including either A or B, or (4) including both at least one A and at least one B.