SYSTEM FOR PREDICTING TREATMENT OUTCOMES BASED UPON GENETIC AND PROTEOMIC IMPUTATION
20230011166 · 2023-01-12
Inventors
Cpc classification
G16B25/10
PHYSICS
G16H50/20
PHYSICS
G16H50/70
PHYSICS
G16H10/60
PHYSICS
International classification
G16B20/00
PHYSICS
Abstract
Methods, systems, and software provide machine learning and artificial intelligence including deep neural networks that enable the creation and operation of unique, AI-driven genomic and proteomic test results augmentation through variable genetic imputation.
Claims
1. A system for variably imputing genetic and/or proteomic information into a collected data set, comprising: storage access circuitry structured to read an input genetic and/or proteomic data set comprising genetic and/or proteomic expression data including genetic and/or proteomic test results, and to read at least one imputation data set from a database, the imputation data set comprising one or more imputation technique definitions for variably imputing collected data; and an imputation engine coupled to a storage access circuitry, the imputation engine applying the one or more imputation technique definitions from the at least one imputation data set to a collected genetic and/or proteomic data set to variably impute genetic and/or proteomic data and create a resulting processed dataset wherein the storage access circuitry is further configured to output the resulting dataset.
2-55. (canceled)
Description
4 BRIEF DESCRIPTION OF THE DRAWINGS
[0094] The features of the present technology will best be understood from a detailed description of the technology and example embodiments thereof selected for the purposes of illustration and shown in the accompanying drawings.
[0095]
[0096]
[0097]
[0098]
[0099]
[0100]
5 DESCRIPTION OF SOME EMBODIMENTS OF THE TECHNOLOGY
5.1 High Level Overview
[0101] The technologies support accurate recommendation of specific therapies, and the prediction of patient health outcomes in spite of the wide variation in patient immune system responses and known immunotherapy outcomes by using a collection of machine learning techniques for diagnosis (DX)/treatment (RX) and outcome prediction, utilizing a combination of selected genetic imputation and trained expert systems, and ongoing monitoring of patient progress to determine the correlation between predicted and actual patient outcomes.
[0102] The exemplary technologies presented herein use artificial intelligence (AI) and deep neural network (DNN)-based techniques as applied to complex datasets comprising the combination of medical data obtained from EHR mining, collection of detailed proteomic and genetic data further comprising one or more of DNA tumor sequences, a subset of the circulating free DNA (cfDNA) obtained from blood and/or plasma samples comprising DNA shed from tumor cells, tumor imaging, genetic data derived from cells of the immune system, genetic sequencing of microbiome, protein information obtained from protein and proteomic assays, and data from other similar assays, the set of which is enhanced using a novel imputation methodology that increases the robustness of the input data set for a particular patient using imputation strategies and methods that are less resource intensive than existing methods. The exemplary technologies presented herein improve the results obtained using a trained deep neural network for predicting disease progression and outcomes over existing imputation methods without incurring the increased computing resource costs required by existing imputation methods.
[0103] The ongoing monitoring techniques permit the system to determine how an individual is responding to a specific therapy and enables the prediction to be refined on the basis of observed response and any additional test results that may become available.
[0104] The exemplary technologies described herein extend EHR mining results to include complete outcome information for historic cases (length of survival, disease progression timelines), as well as enabling the ongoing collection and inclusion of genetic and/or proteomic data that may not have been available in early stages of treatment. The EHR mining component further supports the inclusion of patient-specific markers (e.g., genetic and proteomic markers, immune-specific markers, microbiome markers), which are used by the current technology as part of its data completion (e.g., imputation), predictive, and recommender features. When the system determines that conditions exist where insufficient information is provided in the mined EHR dataset that cannot be resolved by the imputation features of the methods described herein, the system recommends further activities (e.g., tests, procedures) capable of generating additional information in order to improve the predictive and/or recommender outcomes of the system (e.g., additional or deeper sequencing). These recommended tests and procedures are identified in a database (e.g., recommended tests database,
[0105] Because common genetic sequencing and proteomic techniques, especially those used in single cell analysis workflows, often do not identify complete enough information (e.g., they fail to sequence enough of the patient’s genetic information to enable identification of an effective therapy, or contain “dropouts” where expressed genes may be missed because reverse transcription is not robust), taking into account all available information, or the testing that has been performed to date does not identify a sufficiently robust list of proteins and/or a sequence or set of sequences that enable the identification of an effective therapy, the described technology uses a data completion model for genetic imputation to enhance the collected data in order to improve the technologies’ predictive outcomes. This reduces the number of potential treatments considered by more completely associating the state of the collected data, the treatments performed, and known outcomes in other cases and permits the system to more accurately identify likely effective treatments along with their predicted outcomes. The ability to narrow the list of potential treatments and outcomes to one or more likely effective therapies associated with specific outcomes reduces the administration of ineffective and costly immunotherapies or other treatments as part of the treatment regime.
[0106] Lastly, the described technology provides for ongoing monitoring of the patient’s progress on a specific immunotherapy, which enables the treating physician to more quickly determine that a patient’s tumor is not responding to that therapy, leading to earlier termination of ineffective treatment regimens and modification of the treatment plan. Existing prediction systems do not track a patient’s progress over time and are not capable of providing early indications that a treatment is not working and that an alternative treatment should be considered. By highlighting a large fraction of patients that are at risk of non-response, the techniques and technologies described herein may improve the training, prediction, and tracking capabilities of these types of systems to > 62% accuracy or > 60% accuracy or > 50% accuracy or > 40% accuracy or > 30% accuracy or > 25% accuracy.
[0107] Thus, the described system facilitates improved selection of effective immunotherapy treatment options for each individual patient, in which the selected treatment is more likely to be effective, and monitors patient outcomes in order to quickly determine if the treatment is having the expected effect or if an alternative treatment would be more effective.
[0108] The accuracy of the machine learning and prediction/recommendation process steps is optionally improved by pre-processing collected data to impute missing, incomplete, and/or low expressed genetic information in the collected dataset, creating a pre-processed collected dataset. This allows the machine learning models to more accurately predict outcomes and recommend treatments by eliminating conditions in which there is insufficient distinguishing data in the patient’s collected data. The imputation process imputes genetic and/or proteomic data using one or more imputation trained models configured and selected to determine missing, incomplete, and/or low expressed data. In this way, each trained model is an imputation definition for a specific condition. In some embodiments, a general imputation may be performed where all imputable genetic and proteomic information is imputed by the system. In some embodiments, a specific imputation may be performed, such as a specific imputation operation that imputes information (e.g., genetic and/or proteomic information) associated with one or more particular disease conditions. The specific imputation alternative embodiment allows the prediction model to save computing resources by not imputing information that is not needed for prediction and/or recommendation activities. Limiting the process to imputing only missing information associated with one or more specific diseases further increases the resolution and accuracy when imputing low abundance genetic and/or proteomic data that may be missing in the collected data. The prediction process further imputes missing data associated with one or more identified diseases, and generates a prediction value representing confidence in, or strength of, imputation, using one or more imputation possibility rules. These rules are generated during model training (using training datasets) as described below.
This allows the system to use a low-resource rules-matching imputation method rather than employing a known resource-intensive imputation method when processing collected data.
[0109] The described system includes the hardware and software components necessary for processing a variety of clinical data and predicting patient outcomes for immunotherapeutic treatment of disease, comprising at least one trained predictive model, which in an exemplary embodiment comprises a deep neural network of at least 6 layers, in which the predictive model is trained on a combination of patient EHR data and/or directly read genetic and/or proteomic data and predicts at least one of the following: a disease diagnosis, the effectiveness of a disease treatment, a disease outcome, a disease progression over time, or disease survival.
[0110] In more specific embodiments, the described system and methods predict outcomes of therapies on disease and/or immune states treatable with specific therapies including predicted outcomes of these specific therapies. In these more specific embodiments, the described system and methods comprise collecting incomplete patient data, imputing missing data on the basis of the collected genetic and proteomic data to produce processed collected genetic and proteomic data, and using the processed collected genetic and proteomic data with the trained predictive model in order to determine one or more of a disease diagnosis, the effectiveness of a disease treatment, a disease outcome, a disease progression over time, or disease survival.
[0111] These and other aspects and advantages will become apparent when the Description below is read in conjunction with the accompanying Drawings.
5.2 Definitions
[0112] The following definitions are used throughout, unless specifically indicated otherwise:
TABLE-US-00001
TERM: DEFINITION
Genetic data: Germline sequence data, mutation data, single cell RNA sequencing (scRNAseq), microbiome sequencing, metabolomic data, transcriptomic, epigenomic, DNA fragmentomic, and other genetic and genomic test results.
Proteomic data: Data related to identification and quantification of a specific identified protein or the complete protein complement of a cell or tissue.
GP data: Genetic data and/or Proteomic data
Collected dataset: Genetic and/or proteomic data extracted from one or more EHR record(s) and/or collected experimentally.
Produce: Create, modify, or remove a genetic or proteomic marker under the control of the imputation process. Producing a dataset includes creating a dataset including at least one produced genetic or proteomic marker.
EHR: Electronic Health Record. Comprises medical records including diagnosis (DX) and treatment (RX) information and standard medical laboratory test results, as well as proteomic, genetic and genomic test results that comprise genetic or proteomic data.
Marker: A genetic or proteomic data point, comprising an identifier (ID or tag) identifying the datum, an optional value, and an optional expression level. Operations on a marker may be upon the marker as a whole, or upon an individual element of the marker. A genetic marker is a marker that represents genetic data. A proteomic marker is a marker that represents protein / proteomic data. A Marker may comprise or include a data structure.
Disease profile: A set comprising one or more genetic and/or proteomic markers, typically associated with a disease and/or immunologic state.
SNP: Single Nucleotide Polymorphism. Occurs where there is a substitution of a single nucleotide at a specific position in the genome, and the substitution is present in 1% or more of the population. Some SNPs are predictors of susceptibility to some genetic-based diseases such as sickle-cell anemia and cystic fibrosis, and also predictors of how the body will respond to treatment.
VCF: Variant Call Format, a standardized text file format for representing SNP and other genetic information.
Microbiome: The combined genetic material of the micro-organisms present in a particular physiological environment such as, for example, the gut or the skin.
Disease areas: Auto-immunity, oncology (e.g., cancers (general), breast cancer, hematologic malignancies)
Diagnosis (DX): The identification of a disease or illness, sometimes associated with a well-known identifier such as an ICD code.
Treatment (RX): Medical care provided to a patient for an illness or injury, sometimes associated with a well-known identifier such as an ICD code. Treatments may include, for example, immunotherapy, chemotherapy, radiation, anti-inflammatory drugs.
Outcome: The effect of a treatment on a patient’s health and/or disease course.
Imputation engine: The program or programs that perform a variable imputation process
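The Marker and Disease profile definitions above map naturally onto small data structures. The following is a minimal sketch only; the class names, field names, the `imputed` flag, and the sample identifiers are hypothetical and do not appear in this description.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass(frozen=True)
class Marker:
    """A genetic or proteomic data point (hypothetical layout)."""
    marker_id: str                                # identifier (ID or tag) for the datum
    kind: str = "genetic"                         # "genetic" or "proteomic"
    value: Optional[Union[str, float]] = None     # optional value, e.g. a genotype
    expression_level: Optional[float] = None      # optional expression level
    imputed: bool = False                         # assumption: flags produced markers


@dataclass
class DiseaseProfile:
    """A set of marker IDs associated with a disease or immunologic state."""
    name: str
    marker_ids: frozenset = field(default_factory=frozenset)

    def missing_from(self, collected: dict) -> set:
        """Return the profile marker IDs that are absent from a collected dataset."""
        return set(self.marker_ids) - set(collected)


# Example with made-up identifiers.
collected = {"rs0001": Marker("rs0001", value="A/G")}
profile = DiseaseProfile("example disease", frozenset({"rs0001", "rs0002"}))
missing = profile.missing_from(collected)
```

In this sketch, `missing` holds the profile markers that a later imputation step would be asked to produce.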
5.3 Detailed Description of Exemplary Embodiments
5.3.1 Data Flow for Exemplary Training Mode Operation
[0113]
[0114] Machine learning techniques are typically ineffective, or at best partially effective, in analyzing patient EHR data for determining the best outcome for patients with specific cancer types, treatments, and outcomes because the training datasets often do not have sufficient genetic and/or proteomic information in the encoded patient EHR data to establish necessary correlation(s). Thus, the models produced by training a machine learning system on this type of data do not accurately predict treatment/outcome for a specific patient with a definitive diagnosis and supporting lab work. Improvements can be made to the data flows in the model training processes and the patient EHR processing in order to produce a system that can accurately predict the outcome of various treatment regimes.
[0115] The training process improvements include the combination of parsing EHR records to generate a training dataset storage database (2275) along with statistical significance data that indicates significance of the information that may be found in the parsed EHR data. Exemplary statistical significance data includes p-values that indicate the relevance of particular markers, and/or relevance of traits associated with the markers, to particular diseases or conditions. The statistical significance data are taken from a database of markers and their associated statistics (e.g., database 1120). For example, the genomic database may be generated by extracting summary statistics, including p-values, from Genome Wide Association Studies (GWASs), both public and private, and in particular, from GWASs that are related to specific disease pathologies, such as oncology and immunology. GWAS identifies inherited genetic variants associated with a risk of disease or a particular genetic trait. Other databases may be used as training sources for protein and proteomic data as understood by those skilled in the art.
[0116] The EHR data is mined (1200) for diagnosis (DX), treatment (RX), and outcome (e.g., partial response, complete response, progression-free survival, length of survival) information, and any additional data needed to create at least one collected dataset. The statistical significance data is used to select markers from the collected dataset for inclusion in training datasets based upon the usefulness of the markers for determining diagnosis, treatment, and predicted outcomes, as assessed in part by their p-values. Genetic and proteomic data is optionally imputed by the training dataset imputation process (1220) with the collected dataset(s) to generate one or more processed collected dataset(s). The processed collected dataset(s) comprise collected and imputed genetic and/or proteomic data that includes a more complete set of the selected markers than the collected dataset(s) contains. The imputation process may be performed across all of the collected dataset(s), or may be limited on the basis of one or more diagnoses and/or treatments identified in the collected dataset(s). The training dataset imputation process (1220) operates in a manner consistent with the steps described below for the imputation process (1650), and uses either (or both) trained models from the imputation model database (1300) and rules from the imputation possibilities database (2600). If imputation is not performed, the collected dataset is copied to the processed collected dataset. The processed collected dataset is stored in the training dataset storage database (
[0117] The resulting collected dataset and/or one or more processed collected datasets are used to fully train (1250) one or more multi-task trained prediction models. The resulting trained models (e.g., models 1410a, 1410b, ...) are stored in a trained model database (1400), indexed by one or more of diagnosis, treatment, and marker. The trained models are used by the system to complete datasets by imputation (e.g., an imputation trained model), recommend treatment courses, and predict treatment outcomes.
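The marker selection step of the training data flow, in which statistical significance data decides which collected markers enter a training dataset, can be sketched as a simple filter. This is an illustration under stated assumptions: the function name, the dictionary layout standing in for the database of markers and statistics, the sample data, and the 5e-8 cutoff (a conventional genome-wide significance level, not a value taken from this description) are all hypothetical.

```python
def select_markers(collected, marker_stats, diagnosis, p_cutoff=5e-8):
    """Keep collected markers whose p-value for `diagnosis` passes the cutoff.

    collected:    dict of marker ID -> measured value
    marker_stats: dict of (marker ID, diagnosis) -> p-value, standing in for
                  the database of markers and their associated statistics
    """
    selected = {}
    for marker_id, value in collected.items():
        p_value = marker_stats.get((marker_id, diagnosis))
        # Markers with no recorded statistic are left out of the training set.
        if p_value is not None and p_value <= p_cutoff:
            selected[marker_id] = value
    return selected


# Made-up statistics for illustration.
stats = {("rs0001", "breast cancer"): 1e-12, ("rs0002", "breast cancer"): 0.3}
collected = {"rs0001": "A/G", "rs0002": "C/C", "rs0003": "T/T"}
training_markers = select_markers(collected, stats, "breast cancer")
```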
[0118] Additional training process improvements involve producing one or more imputation trained models that identify genetic and/or proteomic information to be imputed in a patient’s collected dataset when incomplete input information is discovered. In some embodiments, the training system uses one or more disease profiles to determine a set of associated genetic and/or proteomic information. The profiles may be implemented as associations between specific genetic and/or proteomic information identified in a database. The training system uses a model/rule generation program (
5.3.2 Data Flow for Exemplary Predictive Mode Operation
[0119] After the trained model databases are constructed, the system operates in predictive mode by accepting patient-specific genetic and EHR data and replicating the previously described data mining operations to identify DX, RX, outcome (to date), and genetic and/or proteomic data (e.g., tumor, SNP, VCF, and microbiome) associated with the patient and writes the data to the patient database (1500). This step is performed by the patient dataset extractors program (
[0120] The patient’s collected dataset(s) is then ordered on the basis of genetic data corresponding to one or more diagnoses or treatments. In one exemplary embodiment, the collected information is ordered by the most likely variant using the VCF SNP information (1600) (which is obtained from summary statistics, GWAS-like work, or other predictive models of immune and oncology-related traits) and the system proceeds to the imputation step (1650). The imputation step is provided by a genetic and proteomic imputation program selecting and executing on a computer (e.g., imputation program
[0121] The imputation engine operates on the genetic and/or proteomic information from one or more datasets (either patient’s collected data or training collected data) in order to enhance the dataset(s) by imputing missing data using one or more imputation trained models (taken from the above-mentioned imputation database 1300). Alternatively or additionally, in some exemplary embodiments, rules stored in imputation possibilities database (2600) are used during imputation.
[0122] The process of imputing data completes missing (or low expressed) portions of the patient’s genetic and/or proteomic information (in certain single cell sequencing embodiments, and depending on the assay, certain types of results may simply indicate whether or not the sequence was obtained). In some embodiments, the system limits imputation to specific markers on the basis of information in the dataset, e.g., patient data such as RX, DX, demographic data, or specific markers that are included in the database of markers (which may be additionally identified as associated with a specific type of patient data), markers having a significance greater than a threshold value, or markers that are associated with an identified disease, treatment, tumor, or immunological state. In an exemplary embodiment, the system limits imputation to missing markers that have an established association with a disease identified in the patient’s dataset. Alternatively, the limitation may be for imputations where the p-value associated with a particular marker is greater than a pre-determined and stored threshold value, for example, a threshold value greater than 75%, greater than 80%, or greater than 90%. The threshold value is associated with each set of selection parameters (such as RX, DX, specific demographic data) and is stored in a system configuration or database (not shown). The variable nature of the imputation process allows prediction models to be more accurate in their predictions by eliminating conditions in which insufficient distinguishing data is in the patient’s records, by being able to utilize a wider set of features that have known association with the patient’s particular condition, as well as in aggregating data sources from different datasets. Because the imputation process varies depending upon the set of selection parameters, this selective imputation is referred to as variable imputation.
The processed collected data (1675) is optionally logically associated with the patient’s EHR data, and is passed, along with the patient’s EHR data, to the prediction step.
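The way the set of imputed markers varies with the selection parameters in [0122] can be illustrated with a short sketch. It is not the described implementation: the function name, the dictionary layouts, the default threshold, and the sample data are hypothetical, and the sketch combines the disease-association limit and the significance threshold in one pass, whereas the text presents them as alternatives.

```python
def markers_to_impute(collected, marker_db, selection, thresholds, default=0.9):
    """Choose which missing markers to impute for one patient dataset.

    collected:  set of marker IDs already measured
    marker_db:  dict of marker ID -> {"diseases": set, "significance": float}
    selection:  tuple of selection parameters, here (DX, RX)
    thresholds: dict of selection tuple -> stored significance threshold
    """
    dx = selection[0]
    threshold = thresholds.get(selection, default)
    targets = []
    for marker_id, info in marker_db.items():
        if marker_id in collected:
            continue                 # already measured, nothing to impute
        if dx not in info["diseases"]:
            continue                 # no established association with the DX
        if info["significance"] > threshold:
            targets.append(marker_id)
    return sorted(targets)


# Made-up marker database and thresholds.
marker_db = {
    "rs1": {"diseases": {"melanoma"}, "significance": 0.95},
    "rs2": {"diseases": {"melanoma"}, "significance": 0.50},
    "rs3": {"diseases": {"lupus"}, "significance": 0.99},
    "rs4": {"diseases": {"melanoma"}, "significance": 0.85},
}
thresholds = {("melanoma", "drug_x"): 0.4, ("melanoma", "drug_y"): 0.8}
targets_x = markers_to_impute({"rs4"}, marker_db, ("melanoma", "drug_x"), thresholds)
targets_y = markers_to_impute({"rs4"}, marker_db, ("melanoma", "drug_y"), thresholds)
```

The same collected data yields different imputation targets for the two selection tuples, which is the sense in which the imputation is variable.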
[0123] The prediction step (1700) uses the trained prediction models from the prediction model database (1400) against the patient’s processed collected dataset (1675) in order to establish or confirm a diagnosis, identify and recommend one or more treatments (RX) in ranked order of predicted effectiveness given the patient’s processed collected data, and predict the outcomes of each treatment (survival, length, and end state).
[0124] The prediction step comprises selecting and executing a set of specific trained models (1410a, 1410b, 1410c) each of which is trained to predict the effects of a particular marker in the input set of measured and imputed markers. For example, with genomic data, each specific trained model predicts the outcome influenced by the particular gene in light of the DX and RX (e.g., the diagnosis and drug response). In some alternative embodiments, the prediction step uses specific trained models configured for proteomic markers, or may use specific trained models configured for genomic and proteomic markers. Collectively, the genomic-specific and proteomic-specific trained models are called specific trained models.
[0125] The predicted outcomes generated by one or more specific trained models are optionally combined to create a combined trained model (1415).
[0126] The combined trained model is created by training the model to predict the results of the combination of outcomes predicted for each of the markers represented by the individual specific trained models. Based on a combination of outcomes predicted by multiple specific trained models, the combined trained model generates a single trained model output that predicts one or more of the following: a disease diagnosis (DX), a treatment recommendation (RX), patient outcomes, and probabilities for each outcome. An exemplary combined trained model output includes one or more predicted outcomes based upon the combination of the determined genetic, proteomic, DX, and RX information. Specifically, in some exemplary embodiments, the resulting combined trained model output includes prediction of a drug (RX) response based upon one or more markers identified as present in the collected and processed collected datasets. Exemplary combined trained model output includes one or more predicted outcomes, each with an associated outcome probability, and each based on a different treatment (RX) recommendation.
[0127] The two-step prediction method of the technology described herein has a number of advantages over known prediction systems that may use a monolithic trained model to predict outcomes. For example, by first selecting and executing specific trained models specific to certain markers, the system saves computing resources by only reasoning over a disease-specific subset of all known markers using models that each include many fewer nodes or other model parameters than would be required for a monolithic model trained on a larger set of markers. Further, the arrangement of multiple specific trained models can be executed using parallel processing techniques, as are known to those having skill in the art, which may save time required to generate an outcome. In addition, training and prediction efficiency of the second model is improved by providing second model inputs that comprise information that has been reasoned over by the first, specific, trained models rather than raw or unprocessed input data. In this manner, an exemplary embodiment includes multiple specific trained models that each pre-process collected marker information in parallel to generate input features that are processed more efficiently by the second model as compared to known prediction models which distinguish undifferentiated data and thus must be trained with substantially larger training datasets.
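The two-step arrangement described above, with per-marker specific models run in parallel and their outputs passed to a second combining stage, can be sketched as follows. Everything here is a stand-in: the specific models are simple placeholder functions rather than trained networks, the combining stage is a plain average rather than a trained combined model, and all names and numbers are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def make_specific_model(weights_by_rx):
    """Return a stand-in for a specific trained model for one marker."""
    def predict(marker_value, dx, rx):
        # Placeholder scoring; a real specific model would be a trained network.
        score = weights_by_rx.get(rx, 0.0) * marker_value
        return max(0.0, min(1.0, score))
    return predict


def combined_model(scores):
    """Stand-in for the combined trained model: average the per-marker scores."""
    return sum(scores) / len(scores) if scores else 0.0


def predict_outcomes(markers, specific_models, dx, treatments):
    """Run the specific models in parallel, then combine, for each candidate RX."""
    ranked = []
    with ThreadPoolExecutor() as pool:
        for rx in treatments:
            futures = [
                pool.submit(specific_models[m], value, dx, rx)
                for m, value in markers.items()
                if m in specific_models        # only markers with a specific model
            ]
            scores = [f.result() for f in futures]
            ranked.append((rx, combined_model(scores)))
    # Rank treatments by predicted effectiveness, best first.
    return sorted(ranked, key=lambda item: item[1], reverse=True)


models = {
    "geneA": make_specific_model({"rx1": 0.5, "rx2": 0.9}),
    "geneB": make_specific_model({"rx1": 0.2, "rx2": 0.4}),
}
markers = {"geneA": 1.0, "geneB": 0.5, "geneC": 2.0}
ranking = predict_outcomes(markers, models, "example dx", ["rx1", "rx2"])
```

Markers without a specific model (here `geneC`) are skipped, which mirrors the idea of reasoning only over a disease-specific subset of markers.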
5.3.3 Genomic and Proteomic Imputation
[0128] Genomic imputation, for example DNA Polymorphism (e.g., single nucleotide polymorphism (SNP)) imputation, is typically done for the whole genome at once without prioritizing specific regions in existing imputation methods. In contrast, imputation methods of the technologies are targeted to specific genetic and proteomic markers, i.e., genetic and proteomic information that has been discovered to be statistically significant for predicting treatment outcomes, generating probabilistic outcome measures, and for recommending treatments for specific diseases or immunological conditions, for example for specific cancer types, autoimmune diseases, or other pathologies.
[0129] Some exemplary configurations of the technologies described herein use an imputation method that is based on an LSTM autoencoder tuned to perform imputation on regions of interest that are prioritized by relevance to selected conditions or diseases of therapeutic interest. Other imputation methods may also be used as described below. Data mined from one or more data sources, such as the whole genome sequencing data of the thousand genomes project, UK Biobank, and more may be used for this purpose. Further, by leveraging summary statistics from GWASs (both public and private) that are oncology and immunology related, the imputation methods described herein prioritize SNPs that have significant p-values in those GWASs, thereby increasing specificity and accuracy for imputing the prioritized SNPs, including low abundance SNPs, and reducing computing resources by limiting the scope of imputation.
[0130] One exemplary aspect of the system is the imputation of missing SNPs through the construction of a shallow long short-term memory (LSTM) autoencoder with four inner layers (e.g., 6 layers overall). The input layer comprises the SNPs identified by particular assay arrays (e.g., from Illumina’s MEGA array, or other arrays) and the output layer provides a complete genome of a target set of genes, for example the complete genome of the set of HLA-A and HLA-B genes. The four inner layers comprise two RNN layers and two convolution layers.
[0131] The various layers have weights (loss functions) selected as follows.
[0132] A set of SNPs (e.g., HLA SNPs) that are not part of the selected assay arrays, and weights associated with each SNP.
[0133] Parameters from polygenic risk scores associated with auto-immune diseases.
[0134] Parameters from polygenic risk scores associated with cancer pre-disposition.
[0135] Parameters identified in the literature (e.g., Stanford GWAS, literature searches).
[0136] Similar layer weights and constructions may be used in differing auto-encoder configurations, including parameters directed to proteomic data, or to a combination of proteomic and genomic data.
[0137] In a first exemplary embodiment, trained models (1310a, 1310b, 1310c) each include a trained LSTM autoencoder. In a particular exemplary embodiment, each LSTM autoencoder is trained to impute genetic and proteomic information corresponding to a particular disease state, immune state, or condition. For example, a first trained model (1310a) is trained using one or more GWASs corresponding to pancreatic cancer and a second trained model (1310b) is trained using GWASs corresponding to breast cancer. When trained in this way, the trained models may impute markers differently due to changes in the underlying disease or changes in the disease state, immune state, or condition, which improves the overall accuracy of the results. Similar training may be performed using the corresponding databases for proteomic and mixed proteomic/genetic information. When performing imputation, the system selects one or more of the trained models (1310a, 1310b, and 1310c) from the imputation model database (1300), for use in imputing genetic and proteomic information that corresponds to a particular disease or condition of interest. In this manner, the system saves computing resources by only imputing genetic and proteomic information that is associated with the disease or conditions of interest and increases accuracy of the imputations performed as compared to existing imputation methods.
[0138] In a second exemplary embodiment, the system includes one or more imputation possibilities databases (2600), each with multiple imputation rule entries. The imputation possibilities database(s) may be stored in a single database, or alternatively, may be stored in multiple databases to segregate imputation rules created for use by a specific source and/or for a specific purpose, e.g., for use with a specific diagnosis. In an exemplary embodiment, at least some of the multiple entries of the database are generated from or by one or more trained machine learning models or are derived from external study types and datasets listed above. In an alternative embodiment, one or more of the entries in the database(s) are generated and entered into the database externally to the system.
[0139] In a third exemplary embodiment, the system uses the imputation possibilities database as described above, along with one or more trained models in a two-pass imputation process. The collected data is pre-processed by the system using the imputation database, and then processed a second time using one or more trained models.
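The two-pass process of the third embodiment can be sketched as a rule-lookup pass followed by a model pass that uses only the trained models indexed by the conditions of interest. This is a minimal sketch under stated assumptions: the rule tuple layout, the function names, and the toy models are hypothetical, and the choice never to overwrite measured values is a design assumption rather than something stated in this description.

```python
def apply_rules(dataset, rules):
    """First pass: fill markers using low-cost rule lookups."""
    result = dict(dataset)
    for measured_id, measured_value, imputed in rules:
        if result.get(measured_id) == measured_value:
            for marker_id, value in imputed.items():
                result.setdefault(marker_id, value)   # keep measured data as-is
    return result


def apply_models(dataset, models, conditions):
    """Second pass: run only the models indexed by the conditions of interest."""
    result = dict(dataset)
    for condition in conditions:
        model = models.get(condition)
        if model is not None:
            for marker_id, value in model(result).items():
                result.setdefault(marker_id, value)
    return result


def two_pass_impute(dataset, rules, models, conditions):
    """Pre-process with the rules database, then process with trained models."""
    return apply_models(apply_rules(dataset, rules), models, conditions)


# Toy stand-ins for the imputation possibilities database and model database.
rules = [("rs1", "A/A", {"rs2": "G/G"})]


def toy_breast_model(data):
    # Placeholder for a trained imputation model; uses the first-pass output.
    return {"rs3": "C/T"} if "rs2" in data else {}


models = {
    "breast cancer": toy_breast_model,
    "pancreatic cancer": lambda data: {"rs9": "T/T"},
}
processed = two_pass_impute({"rs1": "A/A"}, rules, models, ["breast cancer"])
```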
[0140] Each entry in an imputation possibilities database includes at least one measured genetic or proteomic trait, for example a measured SNP, one or more imputation possibilities corresponding to the measured trait, and, in some embodiments, one or more rules for selecting an imputation possibility for inclusion with imputed markers, for example, imputed genetic information.
[0141] The system creates, updates, adds, and deletes rule entries in the imputation possibilities database based on newly processed patient EHR information during a training process. In this manner, the system continuously improves the accuracy of produced imputations as the machine learning models are trained. The continued improvement of entries in the imputation possibilities database allows the database to recommend better imputations as the system continues to process patient data. These improvements permit further improvement to the imputation results of subsequently processed data and, when the database is used as a collected data pre-processing step, improve the results from the trained models.
[0142] An exemplary imputation rule is shown in three-part form:
[0143] an evaluation expression specification used to determine the presence or absence of conditions necessary for imputation,
[0144] an imputation action, and
[0145] a set of imputations.
[0146] The evaluation expression specification is a Boolean expression that is evaluated by the imputation rule execution function (part of the imputation program) against a collected dataset and/or processed collected dataset in order to determine a true/false result on whether the imputation action is to be selected and applied to modify the resulting processed collected dataset. The expression is evaluated; if true, imputation is selected and applied, if false, imputation is not performed. Any expression language may be used for the specification of the expression, as long as arbitrarily large sets may be specified.
[0147] Expression example syntax
[0148] An example expression definition may be encoded using a grammar, such as the example below in Backus normal form (BNF):
[0149] Expr ::= { NOT } <gene ID> { <relop> <value> } { AND|OR <expr> } +
[0150] Relop ::= ∼ | < | = | > | >= | <=
[0151] Value ::= <number, magnitude of expression> | in <gene class/subclass ID>
[0152] A non-exhaustive exemplary list of Gene ID’s usable by the embodiments herein is included in Table 1.
[0153] Gene class/subclass ID’s are understood in the art. The IN operator treats the class/subclass as a set, and tests for a particular gene being a member of that set.
TABLE 1: Exemplary gene IDs
CXCR5 KLRF1 TBX21 MS4A1 VPREB3 GATA3
ASB2 GNLY IL12RB2 CD79A PAX5 IL4
CD200 LILRB1 TRBV25-1 CD79B LILRA4 IL5
BCL6 CCL4 TRAV10 HLA-DOB IL3RA IL13
PDCD1 NKG7 ZBTB16 BANK1 CLEC4C IL17RB
CD4 FCGR3A CD33 JCHAIN LAMPS CCR4
IL17A KLRD1 CD14 IGHM PTCRA IL9R
IL17F CD244 TGFB1 IGKC TNFRSF21 TNFSF10
IL22 SLAMF7 CD3D IGHA1 FUT7 IL11RA
KIT PRF1 CD3G IGHG2 ITGA2B TNFRSF9
IL17RE F2R CD3E IGHD ITGB3 TIGIT
RORC KLRK1 TRAC CR2 CXCL5 ICOS
CTSH CTSW TRBC2 TNFRSF17 S100A8 IL2RA
LGALS3 CCL5 TRBC1 TNFRSF13B LYZ TOX
CCR6 CST7 IL7R TNFRSF13C S100A12 CCR10
TNFSF13B TGFBR3 CD2 BLK FCN1 CCL27
TNFRSF18 CD300A TCF7 FCRLA MNDA CCR8
IL1R1 IL5RA LEF1 CD22 CTSS CTLA4
CX3CR1 GZMA CD27 FCER2 MS4A6A DUSP4
ZEB2 IL18RAP CD8A PDLIM1 CST3 FOXP3
ITGAM KLRG1 CD8B POU2AF1 CSTA IKZF2
EOMES GZMK TRGV2 TCL1A CYBB LRRC32
FCRL6 CXCR3 IL32 CD40 NCF2 IL2
GZMH CCR5 CD6 AFF3 AIF1 IL20RA
GZMB IFI27 CD19 BLNK CFD IL21
ITGB1 SMAD3 FOXJ1 ILDR1 IL37 HLA-DQB2
CCR7 IL2RB IGHG4 TRAV1-2 NR1I2 LAYN
SELL KLRB1 IGHG1 CCR1 CSF2 TNFSF15
FCER1G MAP3K1 IGHA2 LTK LIF CCL7
FCGBP KIR2DL3 XBP1 IL23R XCL1 CST6
ENTPD1 KLRC3 IGHJ2 FLT4 TNFSF11 CCL13
HAVCR2 KLRC2 CD38 IL4I1 TXK CCL8
CXCL13 KIR3DL1 IGLG1 CXCR6 CD7 CXCL11
LAG3 KIR2DL4 IGHJ1 NCR3 CRTAM CLEC4D
PRDM1 KIR2DL1 TCL1B CCL20 FCER1A CCL2
IL6ST CD160 IL21R CCR2 ENHO CD93
TNFRSF10D EGR1 IL4R TRGC1 CD1C CLEC4E
IRF4 KIR3DL2 TGFA CCL4L2 CD1E FCGR1A
FAS CD63 TRDV2 PDGFRB IDO1 FCGR1B
CD58 ITGA1 TRDV3 FCGP2A CD200R1 CCL19
SLAMF1 CCR9 TRDC XCL2 FLT3 CCL24
NCR1 CXCR1 KLRC1 TRGV4 XCR1 CCL26
BACH2 NCAM1 ITGAX FCGR2B HLA-DQA1 CXCL12
CD9 CCL3 TRGV9 LYN HLA-DQB1 C1QA
CD70 CXCR2 C1QC TRGC2 CD1B C1QB
EBI3 IL7 MSR1 CLEC4F CD109 MS4A7
CD80 FCGR3B ICAM4 FCRL1 CXCL14 C3
CD44 FGFBP2 LTB ITGAL HLA-DRB1 GPR183
ADGRG1
[0154] Examples of evaluation expression specifications include the following. The expression is evaluated; if true, imputation is selected and applied, if false, imputation is not performed.
[0155] * Presence of a genetic or proteomic marker or set of genetic or proteomic markers, e.g., {M} or {M, M’ }.
[0156] * Presence of a set of genetic or proteomic markers { M, M’ } and the absence of a different genetic or proteomic marker { M” }.
[0157] * Presence of a genetic or proteomic marker with an expression level above a specified threshold (e.g., { [ M, > threshold value ] }, where [M, threshold value] represents M the genetic or proteomic marker, and threshold value is a measured value of the expression of M).
[0158] * Presence of a genetic or proteomic marker with an expression level below a specified threshold (e.g., { [ M, < threshold value ] }, where [M, threshold value] represents M the genetic or proteomic marker, and threshold value is a measured value of the expression of M).
[0159] * Presence of a genetic or proteomic marker with an expression level equal to a specified threshold (e.g., { [ M, = threshold value ] }, where [M, threshold value] represents M the genetic or proteomic marker, and threshold value is a measured value of the expression of M).
[0160] * Presence of a genetic or proteomic marker with an expression level between two specified thresholds (e.g., { [ M, >= threshold 1, <= threshold 2 ] }).
[0161] Other expression operators may be added without departing from the scope of the present technology.
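The evaluation expressions listed above (presence, absence, thresholds, ranges, and class membership) can be sketched as a small expression tree and evaluator. This is an illustrative sketch only: the class names `Term`, `And`, and `Or`, the `evaluate` function, and the sample data are assumptions introduced here, no parser for the example grammar is shown, and the `∼` operator is omitted because its meaning is not specified in the text.

```python
import operator
from dataclasses import dataclass
from typing import Dict, Optional, Set, Union

# Relational operators taken from the example grammar.
RELOPS = {"<": operator.lt, "=": operator.eq, ">": operator.gt,
          ">=": operator.ge, "<=": operator.le}


@dataclass
class Term:
    gene: str
    relop: Optional[str] = None            # None: bare presence test
    value: Union[float, str, None] = None  # number, or class ID for "in"
    negate: bool = False                   # leading NOT


@dataclass
class And:
    left: "Expr"
    right: "Expr"


@dataclass
class Or:
    left: "Expr"
    right: "Expr"


Expr = Union[Term, And, Or]


def evaluate(expr: Expr, data: Dict[str, float],
             classes: Dict[str, Set[str]]) -> bool:
    """Evaluate an expression tree against measured expression levels."""
    if isinstance(expr, And):
        return (evaluate(expr.left, data, classes)
                and evaluate(expr.right, data, classes))
    if isinstance(expr, Or):
        return (evaluate(expr.left, data, classes)
                or evaluate(expr.right, data, classes))
    if expr.relop is None:
        result = expr.gene in data
    elif expr.relop == "in":
        # IN treats the class/subclass as a set and tests membership.
        result = expr.gene in classes.get(str(expr.value), set())
    else:
        result = (expr.gene in data
                  and RELOPS[expr.relop](data[expr.gene], expr.value))
    return not result if expr.negate else result


measured = {"PDCD1": 5.0, "GZMB": 1.0}
present_and_absent = And(Term("PDCD1"), Term("LAG3", negate=True))
in_range = And(Term("PDCD1", ">=", 2.0), Term("PDCD1", "<=", 6.0))
```

With the sample data, both `present_and_absent` and `in_range` evaluate to true, so a rule guarded by either expression would be selected and applied.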
[0162] The imputation action is a code that indicates how the imputation process should process the information, selected from a list including:
[0163] Add - Add the imputations/outcomes data specified to the resulting processed collected dataset.
[0164] Remove - Remove the imputations/outcomes data specified from the resulting processed collected dataset (if present).
[0165] Modify - Modify existing data in the processed collected dataset in accordance with the imputations/outcomes data.
[0166] Modify-relative - Modify existing data in the processed collected dataset in accordance with the imputations/outcomes data by adjusting the reported expression level values (but not changing the reported genetic or proteomic markers) as a function of one or more marker expression levels. For example, increment/decrement an expression level, or set an expression level to a percentage of another gene’s expression level.
[0167] Other actions may be added to the system within the scope of the technology.
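The four actions above can be illustrated with a small dispatcher. This is a sketch under stated assumptions rather than the system's implementation: the function name `apply_action`, the lowercase action codes, and the `adjust` callback are hypothetical, and the choice that Add does not overwrite an existing marker is an assumption the text does not settle.

```python
from typing import Callable, Dict, Optional

Dataset = Dict[str, float]  # marker ID -> expression level


def apply_action(action: str,
                 processed: Dataset,
                 imputations: Dataset,
                 adjust: Optional[Callable[[float, Dataset], float]] = None
                 ) -> Dataset:
    """Return a new dataset with one imputation action applied."""
    result = dict(processed)
    if action == "add":
        for marker, level in imputations.items():
            result.setdefault(marker, level)  # assumption: no overwrite
    elif action == "remove":
        for marker in imputations:
            result.pop(marker, None)          # ignored if not present
    elif action == "modify":
        for marker, level in imputations.items():
            if marker in result:
                result[marker] = level
    elif action == "modify-relative":
        if adjust is None:
            raise ValueError("modify-relative requires an adjust function")
        for marker in imputations:
            if marker in result:
                # New level is a function of the current level and of the
                # other expression levels; the marker set is unchanged.
                result[marker] = adjust(result[marker], processed)
    else:
        raise ValueError(f"unknown action: {action}")
    return result


base = {"PDCD1": 4.0, "GZMB": 2.0}
added = apply_action("add", base, {"LAG3": 1.0})
halved = apply_action("modify-relative", base, {"GZMB": 0.0},
                      adjust=lambda level, data: 0.5 * data["PDCD1"])
```

Here `halved` sets GZMB to 50% of the PDCD1 level, mirroring the "percentage of another gene's expression level" example; the values passed in `imputations` are ignored for Modify-relative in this sketch.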
[0168] Exemplary imputation rules include:
[0169] - Given an evaluated Boolean result value of “true” for expression X, where X represents an expression as described above, impute and add a set of genes {Y’, Y”, Y’’’, ... } with an imputed strength of expression of each gene in the imputed set.
[0170] - Given an evaluated Boolean result value of “true” for expression X, impute and add genetic data represented by Y’, with a corresponding imputed strength of expression that is a function of the measured strength of expression of one or more elements used as parameters of X.
[0171] - Given an evaluated Boolean result value of “true” for expression X, add a set of markers {e.g., Y’, Y”} and corresponding imputed strengths of expression.
[0172] - Given an evaluated Boolean result value of “true” for expression X, where X includes the testing measured genes X’ and X”, add a set of markers { Y’, Y”, Y’’’ } with specified levels of expression.
[0173] - Given an evaluated Boolean result value of “true” for expression X, where X includes the testing of measured gene X’ with an expression level of less than a specified value, modify the specified expression level of X’ to a specified value.
[0174] - Given an evaluated Boolean result value of “true” for expression X, where X includes the testing of measured gene X’ with a measured expression level of less than a specified value, modify-relative the specified expression level of X’ by altering the specified expression level of X’ by a function related to the measured expression level (e.g., changing the measured expression level value by a calculated amount or percentage).
[0175] - Given an evaluated Boolean result value of “true” for expression X, modify-relative the strength of expression of a gene X’ by setting the strength of expression of X’ as a function of the strength or certainty of an imputation link between an element of X and X’ (e.g., based upon the strength of linkage of SNPs to a particular disease or disorder).
[0176] - Given an evaluated Boolean result value of “true” for expression X, impute and add a set of markers {Y’, Y”, Y”’} and based upon the imputed marker Y’, further impute marker Z’, which is related to Y’, to produce the set of imputed markers {Y’, Y’’, Y’’’, Z’}.
[0177] - Given an evaluated Boolean result value of “true” for expression X, impute by deleting the set of markers {Y’, Y’’, Y’’’}.
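The chained rule above, in which an imputed marker Y’ triggers the further imputation of Z’, can be sketched by re-running the rules until no rule adds anything new. The sketch below is illustrative only: `AddRule`, `run_rules`, and `max_passes` are hypothetical names, the evaluation expression is simplified to "all listed markers are present", and only the Add action is modeled.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass
class AddRule:
    requires: FrozenSet[str]       # simplified expression X: all present
    imputations: Dict[str, float]  # markers to add with imputed strengths


def run_rules(collected: Dict[str, float],
              rules: List[AddRule],
              max_passes: int = 10) -> Dict[str, float]:
    """Apply Add rules repeatedly so imputed markers can trigger others."""
    processed = dict(collected)
    for _ in range(max_passes):
        changed = False
        for rule in rules:
            if not rule.requires.issubset(processed):
                continue
            for marker, strength in rule.imputations.items():
                if marker not in processed:
                    processed[marker] = strength
                    changed = True
        if not changed:  # fixed point reached
            break
    return processed


# The Z1 rule is listed first on purpose: it only fires on the second
# pass, after Y1 has been imputed from X1.
rules = [
    AddRule(frozenset({"Y1"}), {"Z1": 0.5}),
    AddRule(frozenset({"X1"}), {"Y1": 1.0, "Y2": 1.0, "Y3": 1.0}),
]
imputed = run_rules({"X1": 3.0}, rules)
```

The `max_passes` bound is a design choice for the sketch; it guarantees termination even if a larger rule set were to contain cycles.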
[0178] Further exemplary rules can specify imputation strength scores as a function of one or more additional or alternative factors that may influence imputation accuracy including, for example, SNP density and sequencing depth or coverage.
[0179] In a first exemplary embodiment, the system compares the marker information extracted from the collected dataset to marker(s) identified in one or more instances of imputation possibilities database(s) (2600) and uses the associated rules to determine imputed marker data based on the measured marker(s) and RX/DX information in the collected dataset.
[0180] In a second exemplary embodiment, one or more of the trained models (1310a, 1310b, and 1310c) compares marker information extracted from the collected dataset to markers identified in one or more instances of imputation possibilities database(s) (2600) and uses the associated rules to determine imputed markers based on the measured markers in the collected data. In an exemplary embodiment, the trained model(s) can function as expert systems modules by identifying and implementing only those rules contained in the imputation possibilities database necessary to produce imputed markers based on the observed markers (e.g., gene IDs such as those examples enumerated in Table 1) and a newly identified RX or DX value generated by the trained model. In this manner, the system can generate imputation results using fewer computing resources as compared to traditional imputation methods and programs.
[0181] In some exemplary embodiments, the system uses one or more additional or alternative methods to impute markers based on measured markers including, for example, Bayesian approaches and graphical causal models.
5.3.4 Low Expressed Genetic and Proteomic Information
[0182] Some exemplary genetic and proteomic markers include low expressed markers, for example low abundance markers, e.g., rare SNPs or other rare variants. “Low expressed genetic information” is defined herein as genetic information which may be difficult to characterize by a default or usual sequencing depth or coverage used by the system. “Low expressed proteomic data” is defined herein as proteomic information which may be difficult to characterize by default or usual proteomic analysis techniques. Low expressed genetic and proteomic information may also be difficult to characterize by imputation, for example due to low levels of association or linkage with other markers that are more readily observed. In an exemplary embodiment, the system parses a patient EHR and determines genetic and/or proteomic markers contained therein, for example, markers determined by a low coverage or low depth, e.g., 4X or 8X, sequencing performed on one or more biological samples (blood cells/PBMCs, tumor cells, etc.) from the patient. The system determines whether markers are missing from the initial sequencing results, i.e., from the observed markers, and, if so, uses one or more of the imputation methods of the exemplary technology described herein to attempt to impute the missing markers, which can include low abundance markers. Because the imputation methods are designed to impute specific markers, as previously described, low abundance markers are more readily and accurately identified by the imputation methods.
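The missing-marker check described above can be sketched as a set difference followed by an imputation attempt. This is a minimal sketch with assumed names: `impute_missing` and `toy_impute` are hypothetical, the marker pairing is arbitrary, and `toy_impute` stands in for any of the imputation methods described herein. Markers left in `unresolved` are those that were neither observed nor successfully imputed.

```python
from typing import Callable, Dict, List, Set, Tuple

Dataset = Dict[str, float]  # marker ID -> expression level


def impute_missing(observed: Dataset,
                   markers_of_interest: Set[str],
                   impute: Callable[[Dataset, Set[str]], Dataset]
                   ) -> Tuple[Dataset, List[str]]:
    """Impute markers absent from the observed data; report what remains."""
    missing = markers_of_interest - set(observed)
    attempted = impute(observed, missing)
    # Keep only imputations for markers that were actually missing.
    imputed = {m: v for m, v in attempted.items() if m in missing}
    unresolved = sorted(missing - set(imputed))
    return imputed, unresolved


def toy_impute(observed: Dataset, missing: Set[str]) -> Dataset:
    """Stand-in imputer: can only fill LAG3, copying the PDCD1 level."""
    if "LAG3" in missing and "PDCD1" in observed:
        return {"LAG3": observed["PDCD1"]}
    return {}


imputed, unresolved = impute_missing(
    {"PDCD1": 2.0, "GZMB": 1.0},
    {"PDCD1", "LAG3", "TIGIT"},
    toy_impute,
)
```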
[0183] If, however, the system determines that one or more low abundance markers have not been observed and additional marker information has not been successfully imputed, the system can recommend (using a recommender program like the low abundance method recommender program
5.3.5 Exemplary System Architecture
[0184]
[0185] EHR extractor/encoder program (2250) receives, from collected data database (1100), historical patient EHRs, e.g., EHRs from a cohort of patients comprising a particular patient population of interest, and mines the patient EHRs datasets in order to generate datasets including DX, RX, outcomes, and genetic and proteomic data associated with the EHRs data, which is stored in training dataset storage database (2275).
[0186] The imputation program (2650) receives, from training dataset storage database (2275), collected data including genetic and proteomic data and uses one or more imputation methods to produce processed collected data that includes imputed genetic and proteomic marker information. In an exemplary embodiment the imputation program produces markers associated with those genetic and proteomic markers identified for imputation, as defined herein. The imputation program stores processed collected data, including imputed genetic and proteomic marker information, in the training data storage database.
[0187] Model/rule generator program (2260) receives, from training dataset storage database (2275), collected and/or processed collected data including genetic and proteomic information and other patient-related data (e.g., DX, RX, outcomes). In a first exemplary embodiment, the model/rule generator program receives, from model storage database (2270) one or more untrained models. The model/rule generator program uses the collected and/or processed collected data to train the untrained models, thereby generating trained imputation models (e.g., 1310a, 1310b, 1310c) which it stores in the external trained model database (2300). In a second exemplary embodiment, the model/rule generator program uses the collected and/or processed collected data to generate one or more imputation rules, which it stores in imputation possibilities database (2600).
[0188] The computer system may optionally further comprise one or more database(s)/cache(s) (e.g., databases 2270, 2275) in which the computer stores information about one or more of collected data, processed collected data, training datasets, and trained models with which it is configured to inter-operate. The database/cache may be stored in an internal memory of the computer system (e.g., model storage database 2270, training dataset storage database 2275) or may be stored in an external database such as the collected dataset database (1100), the trained model database (2300), or the imputation possibilities database (2600) to which the system is operably connected. In some embodiments, the computer system may further comprise a secure key storage mechanism (not shown) such as a secure processor, a TPM chip, or other mechanism in which it stores cryptographic materials that protect patient confidential information.
[0189] When operating, the exemplary computer processor executes one or more of the exemplary systems’ programs from persistent or transient memory in order to convert the general purpose computer into a processor-controlled role-specific device that implements the one or more defined processes of the executing programs.
[0190]
[0191]
[0192] Patient data extractor program (3280) receives, from patient data database (1500), collected data corresponding to a selected patient. In some embodiments (not shown), the patient data extractor receives data directly from an EHR, and/or directly from a genetic reader. When new data is obtained, the patient data extractor program writes the collected data to the patient data database (1500). The patient data extractor program then mines the patient’s collected dataset to identify aspects of the collected data including DX, RX, outcomes, and collected genetic and/or proteomic data. In some embodiments, the patient data extractor program receives additional information identifying selected markers (and/or disease profiles) from a markers and statistics database (1120) and limits the collected data to include only those identified markers. In additional exemplary embodiments, the patient data extractor program encodes the collected data for use by the system, for example, by standardizing aspects of the collected data by converting numerical laboratory test results to standardized ranges (e.g., low, normal, high). The patient data extractor program stores the collected data in system database (3270) or back to the patient data database (1500). In some embodiments, the patient data extractor program provides the collected genetic and proteomic data directly to the genetic and proteomic imputation program (3650).
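The marker filtering and range standardization performed by the patient data extractor program can be sketched as follows. This is an illustrative sketch only: the function names, the marker names, and the reference ranges are hypothetical placeholders and are not clinical values.

```python
from typing import Dict, Iterable, Tuple


def standardize(value: float, low: float, high: float) -> str:
    """Map a numeric result to a standardized range label."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"


def encode(collected: Dict[str, float],
           ranges: Dict[str, Tuple[float, float]],
           selected: Iterable[str]) -> Dict[str, str]:
    """Keep only selected markers and standardize their values."""
    keep = set(selected)
    return {
        marker: standardize(value, *ranges[marker])
        for marker, value in collected.items()
        if marker in keep and marker in ranges
    }


# Hypothetical reference ranges: marker -> (lower bound, upper bound).
ranges = {"marker_a": (1.0, 3.0), "marker_b": (10.0, 20.0)}
encoded = encode(
    {"marker_a": 0.4, "marker_b": 15.0, "marker_c": 7.0},
    ranges,
    ["marker_a", "marker_b"],
)
```

In this sketch `marker_c` is dropped because it is not among the selected markers, corresponding to limiting the collected data to the markers identified from the markers and statistics database (1120).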
[0193] Genetic and proteomic imputation program (3650) receives collected information from either the patient data extractor program (3280), system database (3270), or both and produces imputed marker(s) for one or more genetic and proteomic data in the dataset as processed collected data, for example, creating specific (genetic and/or proteomic) markers (some or all of ID, values, and/or expression levels) that are missing from the collected data. In a first exemplary embodiment, the imputation program retrieves, from external trained model database (2300), a trained imputation model and uses the trained imputation model to produce the imputed markers. In a second exemplary embodiment, the imputation program retrieves, from imputation possibilities database (2600) one or more imputation rules and uses the one or more imputation rules to produce the imputed genetic and proteomic information. The imputation program stores the processed collected data, including imputed information, in system database (3270) or in the patient data database (1500). In a third exemplary embodiment, the imputation program selects and applies imputation rules to pre-process the collected data, and then selects and uses the trained imputation model to further process the collected data.
[0194] In some exemplary embodiments, the low abundance recommender program (3750) retrieves, from system database (3270) or the patient data database (1500), collected and processed collected data and determines genetic and proteomic information, including expression levels of one or more retrieved markers, that requires imputation to create, modify, or remove information from either the collected or processed collected data. In some implementations, the system automatically selects one or more imputation steps to be performed. In other cases, the low abundance recommender program (3750) generates one or more recommendations for additional tests that may be run to create the missing markers and stores these recommendations in recommended tests database (3775). In some implementations, the recommendations may be produced in a form of a prescription that is subsequently made available to a testing provider, or is presented on an output device.
[0195] Prediction program (3290) receives, from system database (3270), one or more of collected and processed collected data and receives, from external trained model database (2300), one or more trained prediction models. The prediction program (3290) uses the one or more trained prediction models to generate, based on the collected and processed collected data, one or more predicted outcomes including, for example, DX, RX, outcomes, and projected survival duration. The prediction program (3290) stores the predictions in prediction database (3600).
[0196] In some embodiments, the prediction program compares successively generated stored predictions to subsequent predictions and outcomes to identify those cases where earlier predictions were not accurate. Those prediction cases where the prediction is less accurate than a specified threshold (stored in a configuration) and the underlying patient dataset may be copied to a collected data database (1100) for use in producing new iterations of trained models.
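The selection of inaccurate prediction cases for retraining can be sketched as a threshold filter. This sketch makes several assumptions that the text does not specify: predictions are represented as survival durations in months, and accuracy is measured as one minus the relative error. The names `Case`, `accuracy`, and `select_for_retraining` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Case:
    patient_id: str
    predicted_months: float
    actual_months: float


def accuracy(case: Case) -> float:
    """Assumed accuracy measure: 1 minus relative error, in [0, 1]."""
    denom = max(case.predicted_months, case.actual_months)
    if denom == 0:
        return 1.0
    return 1.0 - abs(case.predicted_months - case.actual_months) / denom


def select_for_retraining(cases: List[Case], threshold: float) -> List[Case]:
    """Return cases whose prediction accuracy falls below the threshold."""
    return [case for case in cases if accuracy(case) < threshold]


cases = [
    Case("patient-1", predicted_months=10.0, actual_months=12.0),
    Case("patient-2", predicted_months=5.0, actual_months=20.0),
]
# The threshold would be read from configuration; 0.8 is illustrative.
flagged = select_for_retraining(cases, threshold=0.8)
```

In this sketch only `patient-2` is flagged; its record and underlying dataset would be the ones copied to the collected data database (1100) for use in later training iterations.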
[0197] The computer system may optionally further comprise a local instance of one or more system database/cache (3270) in which the computer stores information about one or more patient data and trained models with which it is configured to inter-operate. The database/cache (3270) may be stored in an internal memory of the computer system (not shown) or may be stored in an external database such as the patient database (1500), the markers and statistics database (1120), the trained model database (2300), the prediction database (3600), the recommended tests database (3775), and the imputation possibilities database (2600) to which the system is operably connected. In some embodiments, the computer system may further comprise a secure key storage mechanism (not shown) such as a secure processor, a TPM chip, or other mechanism in which it stores cryptographic materials that protect patient confidential information.
[0198] When operating, the exemplary computer processor executes one or more of the exemplary systems’ programs from persistent or transient memory in order to convert the general purpose computer into a processor-controlled role-specific device that implements the one or more defined processes of the executing programs.
[0199]
[0200] On the path for training the multi-task model, in step 4100, the system selects one or more training datasets from either the collected data database (1100) or a training data storage database (2275) to encode/process. The selection may be on the basis of one or more identified DX, RX, outcomes, or on the basis of other factors determined by the user. In step 4110, the training dataset(s) are then processed, extracting and encoding the features needed to train a predictive model according to the technologies described herein. The extraction and encoding may include DX, RX, outcome, and marker extraction and encoding.
[0201] In step 4120, the resulting training dataset is then filtered based upon user input to exclude anomalous data, and the resulting filtered training dataset is combined with the processed statistical significance information from step 4210 to produce a combined training dataset.
[0202] In step 4300, the combined training dataset is then down-selected on the basis of the identified information present. The down-selection process identifies the information in the combined training dataset that will be useful in training a multi-task model for a selected DX/RX/genetic pair. If no data is selected, the current training session ends.
[0203] If there are training dataset(s) associated with the specific training activity, the system selects those training dataset(s) and proceeds to step 4350 where it performs imputation steps to produce imputed markers from among the genetic information identified in step 4300 as useful for training the multi-task model. The system performs the imputation using previously generated imputation trained models. Alternatively, the system may select an imputation possibilities dataset and use imputation rules from the selected dataset in the imputation process.
[0204] The system then proceeds to step 4400, where it creates a trained model for the specific DX/RX/marker using the selected elements of the combined training dataset. In step 4402, the system determines whether the multitask model has been optimized, for example by generating predictions using the model and labeled testing data, calculating a cost function based on comparison of the generated predictions and the testing data labels, and determining whether the cost function has been minimized. If the system determines that the model has not been optimized, in some embodiments it proceeds to step 4405 to update imputation models and/or imputation rules. The system then repeats imputation step 4350 to produce new imputation results and performs additional training on the multitask model using training data that includes the new imputation results.
[0205] If the system determines, at step 4402, that the multitask model has been optimized, it proceeds to step 4410 and the resulting trained model is then stored in a database such as the trained model database (2300), following which the training process completes. The training process may be repeated as often as necessary as new training dataset(s) or useful statistical significance data is made available, or when a new trained model is desired for a different DX/RX/genetic pair.
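The loop through steps 4350, 4400, 4402, 4405, and 4410 can be sketched as follows. This is a schematic under stated assumptions, not the described training procedure: all function and parameter names are hypothetical, "the cost function has been minimized" is approximated by "the cost improved by less than a tolerance since the previous round", and the toy model is a single number pulled toward a target.

```python
from typing import Callable, List, Optional, Tuple


def train_until_optimized(
    impute: Callable[[], List[float]],
    train: Callable[[Optional[float], List[float]], float],
    cost: Callable[[float], float],
    update_imputation: Callable[[float], None],
    tolerance: float = 1e-3,
    max_rounds: int = 20,
) -> Tuple[float, int]:
    """Alternate imputation and training until the cost stops improving."""
    previous = float("inf")
    model: Optional[float] = None
    for round_no in range(1, max_rounds + 1):
        data = impute()                # step 4350: produce imputed data
        model = train(model, data)     # step 4400: (re)train the model
        current = cost(model)          # step 4402: evaluate on test labels
        if previous - current < tolerance:
            return model, round_no     # step 4410: caller stores the model
        previous = current
        update_imputation(model)       # step 4405: refine models/rules
    return model, max_rounds


TARGET = 4.0
updates: List[float] = []


def toy_train(model: Optional[float], data: List[float]) -> float:
    start = 0.0 if model is None else model
    return start + 0.5 * (data[0] - start)  # move halfway to the data


trained, rounds = train_until_optimized(
    impute=lambda: [TARGET],
    train=toy_train,
    cost=lambda m: (m - TARGET) ** 2,
    update_imputation=updates.append,  # records each update request
)
```

The `max_rounds` bound is a safeguard added for the sketch so that the loop terminates even if the cost never converges.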
[0206]
[0207] The process starts with step 5010, where the user selects genetic sequences to include in the prediction.
[0208] In step 5020, the system processes the patient’s collected dataset from the patient database (1500), selecting and extracting, on the basis of user input, the patient’s DX, RX, treatment stage/outcome(s) to date, one or more collected datasets (e.g., VCFs), and, in some exemplary embodiments, microbiome features (e.g., alpha or beta diversity). User inputs may include the following:
[0209] User selection of results as absolute or relative survival curves.
[0210] User selection prioritizing or omitting different genetic features.
[0211] User selection prioritizing or omitting different proteomic features.
[0212] User option to manually enter clinical information (i.e., single gene mutations).
[0213] At step 5025, the system performs any imputation necessary (1650 of
[0214] This permits the user to have control over the types of information used by the trained models. The selected information is then used to select one or more trained models (e.g., 1410a, 1410b of
[0215] In step 5040, the selected trained model(s) are used to predict the outcomes based upon specific treatments and based upon at least some of collected data and/or processed collected data, including collected and, in some cases, imputed markers (1700 of
[0216] In a particular exemplary embodiment, the system uses one or more selected trained models to generate multiple predicted outcomes, each corresponding to a specific treatment, thereby generating a set of treatment and predicted outcome pairs. The system further uses the trained models to generate a likelihood or probability metric associated with each of the predicted outcome/treatment pair and generates a ranked list comprising treatment/predicted outcome pairs, ordered according to the predicted outcomes and their probability metrics. The system then generates one or more treatment recommendations based on the ranked list. In an exemplary use case, the system uses the selected trained models to generate, for a particular cancer, predicted outcomes of up to four different treatments and a probability metric associated with each predicted outcome.
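The ranked list of treatment/predicted outcome pairs can be sketched as a sort over prediction records. This is an illustration under stated assumptions: the text does not define how outcomes and probability metrics are combined, so the sketch orders by an assumed numeric outcome score (higher is better) and breaks ties by probability. The names `Prediction`, `rank`, and `recommend`, and all values, are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prediction:
    treatment: str
    outcome_score: float  # assumed scale: higher means better outcome
    probability: float    # likelihood metric for the predicted outcome


def rank(predictions: List[Prediction]) -> List[Prediction]:
    """Order pairs by outcome score, then by probability, best first."""
    return sorted(predictions,
                  key=lambda p: (p.outcome_score, p.probability),
                  reverse=True)


def recommend(predictions: List[Prediction], top_k: int = 2) -> List[str]:
    """Return the treatments of the top-ranked pairs."""
    return [p.treatment for p in rank(predictions)[:top_k]]


predictions = [
    Prediction("treatment A", outcome_score=2.0, probability=0.60),
    Prediction("treatment B", outcome_score=3.0, probability=0.55),
    Prediction("treatment C", outcome_score=3.0, probability=0.70),
    Prediction("treatment D", outcome_score=1.0, probability=0.90),
]
recommended = recommend(predictions, top_k=2)
```

Other orderings, such as sorting by the product of outcome score and probability, would fit the same structure by changing only the sort key.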
[0217] The selected trained models also operate as a recommender engine and produce one or more recommended treatments, in addition to predicting outcome, survival, and disease progression milestones and disease progression timelines. In an exemplary embodiment, the system uses the predicted outcomes to generate one or more treatment recommendations. The system may filter the predicted outcomes to eliminate any predicted poor outcomes, for example patient death or non-response. In one exemplary embodiment, the one or more treatment recommendations may include each treatment that was not associated with a poor outcome. In another exemplary embodiment, the one or more treatment recommendations include recommendations selected based on one or more additional criteria, for example treatments that are associated with a probability metric having a value greater than a threshold value or that are the top treatments on a list of treatments ranked by probability metric values, e.g., the top two treatments or the top three treatments. These outcome predictions and treatment recommendations (combined model 1415 of
[0218] In an exemplary embodiment, when newly collected patient outcome data is added to a patient’s EHR, the system retrieves from the prediction database any predicted outcomes previously generated by the system and compares the newly collected outcome data to the previously predicted outcomes. If the newly collected outcome data does not agree with the previously predicted outcomes, the system may retrain the model(s) used to generate the outcome(s) or use one or more known reinforcement learning methods to update the trained models.
[0219] The system may also generate recommendations for additional testing to be performed. For example, the system may recommend additional testing to generate missing markers of low abundance data that it is unable to generate using imputation methods of the technology, as described herein.
[0220] If the system produced one or more recommended treatments and/or recommended additional testing, a user of the system may take actions based upon the recommendation to convert one or more recommendations into a prescription/testing order for the identified additional tests and/or treatments. Referring back to
[0221] Similarly, by using RX and DX data compared against the outcome predictions, the system may provide an assessment of the current progression of the patient’s disease in relation to the predicted disease progression predictions. Updated predictions may also be made on the basis of this assessment.
5.4 Example 1. Imputation of Missing Genomic Data
[0222] The system collects multi-omic single cell data (germline sequence data, mutation data, single cell RNA sequence (scRNAseq) data, microbiome sequence data, metabolomic data, epigenomic data, and genetic and genomic test results). Single cell data is often quite sparse, in some cases because of frequent “dropout events”, meaning that the sequence of a gene that is expressed even at a relatively high level may not be detected because of technical limitations of the assay, such as, for example, the relative inefficiency of reverse transcription. This type of error can lead to significant problems with cell-type identification in machine learning and other downstream analyses. Completion and improvement of the collected data set provides improved machine learning (and downstream analysis) results.
[0223] Table EX1 illustrates the input and output results of imputing the expression of a particular gene based upon the expression of other genes obtained through single cell RNA sequencing (scRNAseq) per cell.
TABLE EX1

INPUT (pre imputation)
scRNAseq data    Cells:
                 CD8 EM      CD9 memory  CD8 CM      CD8 temra
PDCD1            High        High        Low         Medium
GZMB             Medium      Low         Low         High
ZEB2             High        Low         High        Low
LAG3             (no data)   (no data)   (no data)   (no data)

OUTPUT (post imputation)
scRNAseq data    Cells:
                 CD8 EM      CD9 memory  CD8 CM      CD8 temra
PDCD1            High        High        Low         Medium
GZMB             Medium      Low         Low         High
ZEB2             High        Low         High        Low
LAG3             HIGH        LOW         LOW         MEDIUM
5.5 Conclusions
[0224] It will also be recognized by those skilled in the art that, while the technology has been described above in terms of exemplary embodiments, it is not limited thereto. Various features and aspects of the above described technology may be used individually, jointly or in combination. Further, although the technology has been described in the context of its implementation in a particular environment(s), and for particular application(s), those skilled in the art will recognize that its usefulness is not limited thereto and that the present technology can be beneficially utilized in any number of environments and implementations where it is desirable to provide a highly accurate treatment recommender and condition monitoring system or other implementations. Accordingly, the claims set forth below should be construed in view of the full breadth and spirit of the technology as disclosed herein.