CROSS PLATFORM TRANSFORMATION OF GENE EXPRESSION DATA
20170249422 · 2017-08-31
Inventors
- YEE HIM CHEUNG (NEW YORK, NY, US)
- Wilhelmus Franciscus Johannes Verhaegh (Eindhoven, NL)
- Nevenka Dimitrova (Pelham Manor, NY, US)
Cpc classification
G16B25/10
PHYSICS
G16B50/00
PHYSICS
G16B40/10
PHYSICS
C12N15/1089
CHEMISTRY; METALLURGY
International classification
G06N99/00
PHYSICS
C12N15/10
CHEMISTRY; METALLURGY
Abstract
Data-driven generalized regression-based frameworks that support the transformation of measurements, applicable but not limited to gene expressions, from one platform to another over a wide dynamic range, with selected summary statistics/feature values as predictors for the model parameters. The framework consists of primary model training and transformation, and additional levels of categorical regression and transformation processes.
Claims
1. A method for transforming gene expression data, the method comprising: constructing a primary model utilizing sample expression data for transforming gene expression data from a first profiling platform to a second profiling platform.
2. The method of claim 1, wherein constructing the primary model comprises: identifying at least one common expression between a first set of nucleic acid expression data derived using a first profiling platform and a second set of nucleic acid expression data derived using a second profiling platform, each common expression associated with a sample present in both the first set and second set; performing regression analysis on the at least one common expression, resulting in one set of regression parameters for each sample; selecting at least one candidate feature from the first profiling platform that predicts the at least one set of regression parameters; and identifying a primary model for sample-wise data transformation associated with each of the at least one selected candidate features.
3. The method of claim 2, further comprising generating at least one set of expression data using a profiling platform, the at least one set of expression data being at least one of the first and second sets of expression data.
4. The method of claim 1, further comprising: transforming the sample expression data using the constructed primary model; and constructing a categorical model by regression analysis from at least one of: (a) at least some of the transformed sample expression data and (b) at least some of the common expressions.
5. The method of claim 4, wherein at least one of the: (a) selection of at least some of the transformed sample expression data and (b) selection of at least some of the common expressions, is based on phenotypic data or any factor known to introduce cross-platform bias.
6. The method of claim 4, further comprising iterating claim 4 using the categorical model constructed from the transformed sample expression data to transform the transformed sample expression data and constructing another categorical model therefrom.
7. The method of claim 6, further comprising transforming a set of expression data from the first profiling platform to the second profiling platform by applying the constructed categorical models in the order of their construction.
8. The method of claim 1, wherein the first profiling platform or the second profiling platform is selected from the group consisting of Agilent Gene Expression Microarrays, Affymetrix Gene Profiling Array cGMP U133 P2/Human Genome U133 Plus 2.0/U133A 2.0, Illumina Genome Analyzer/MiSeq/NextSeq/HiSeq, NanoString nCounter SPRINT/MAX/FLEX, and Oxford Nanopore MinION/PromethION/GridION.
9. The method of claim 2, wherein the at least one common expression is identified by at least one of matching genomic positions, matching exons, matching isoforms, and matching transcripts.
10. The method of claim 2, wherein the at least one candidate feature is selected from the group consisting of mean transcript expression, mean normalized probe intensity, number of detected genes, number of reads per sample, average number of reads per exon/gene/isoform, read coverage, and a sample statistic.
11. The method of claim 6, wherein each of the models is selected from the group consisting of a linear model, a logarithmic model, a piecewise linear model, and a regression model.
12. An apparatus for transforming gene expression data, the apparatus comprising: a processor; an interface; and computer executable instructions operative on said processor for: constructing a primary model utilizing sample expression data for transforming gene expression data from a first profiling platform to a second profiling platform such that the overall distribution of the transformed data resembles that of the second platform.
13. The apparatus of claim 12, wherein the computer executable instructions for constructing the primary model comprise computing executable instructions for: identifying at least one common expression between a first set of nucleic acid expression data derived using a first profiling platform and a second set of nucleic acid expression data derived using a second profiling platform, each common expression associated with a sample present in both the first set and second set; performing regression analysis on the at least one common expression, resulting in one set of regression parameters for each sample; selecting at least one candidate feature from the first profiling platform that predicts the at least one set of regression parameters; and identifying a primary model associated with each of the at least one selected candidate features.
14. The apparatus of claim 13, wherein the interface is configured to receive at least one set of expression data from a profiling platform, the at least one set of expression data being at least one of the first and second sets of expression data.
15. The apparatus of claim 12, further comprising computer executable instructions operative on said processor for: transforming the sample expression data using the constructed primary model; and constructing a categorical model by regression analysis from at least one of: (a) at least some of the transformed sample expression data and (b) at least some of the common expressions.
16. The apparatus of claim 15, wherein at least one of the: (a) selection of at least some of the transformed sample expression data and (b) selection of at least some of the common expressions, is based on phenotypic data or any factor known to introduce cross-platform bias.
17. The apparatus of claim 15, further comprising computer executable instructions operative on said processor for iterating claim 15 using the categorical model constructed from the transformed sample expression data to transform the transformed sample expression data and construct another categorical model therefrom.
18. The apparatus of claim 17, further comprising computer executable instructions operative on said processor for transforming a set of expression data from the first profiling platform to the second profiling platform by applying the constructed categorical models in the order of their construction.
19. The apparatus of claim 12, wherein the first profiling platform or the second profiling platform is selected from the group consisting of Agilent Gene Expression Microarrays, Affymetrix Gene Profiling Array cGMP U133 P2/Human Genome U133 Plus 2.0/U133A 2.0, Illumina Genome Analyzer/MiSeq/NextSeq/HiSeq, NanoString nCounter SPRINT/MAX/FLEX, and Oxford Nanopore MinION/PromethION/GridION.
20. The apparatus of claim 13, wherein the computer executable instructions for identifying at least one common expression comprise computer executable instructions for identifying at least one common expression by at least one of matching genomic positions, matching exons, matching isoforms, and matching transcripts.
21. The apparatus of claim 13, wherein the at least one candidate feature is selected from the group consisting of mean transcript expression, mean normalized probe intensity, number of detected genes, number of reads per sample, average number of reads per exon/gene/isoform, read coverage, and a sample statistic.
22. The apparatus of claim 17, wherein each of the models is selected from the group consisting of a logarithmic model, a linear model, a piecewise linear model, and a regression model.
Description
BRIEF DESCRIPTION OF DRAWINGS
[0020] The accompanying drawings are not intended to be drawn to scale. In the drawings, each identical or nearly identical component that is illustrated in various figures may be represented by a like numeral. For purposes of clarity, not every component may be labeled in every drawing. Various embodiments of the invention will now be described, by way of example, with reference to the accompanying drawings, in which:
[0021]
[0022]
[0023]
[0024]
[0025]
[0026]
[0027]
[0028]
[0029]
[0030]
[0031]
DETAILED DESCRIPTION
[0032] Cross-platform compatibility of gene expression data is a crucial and active topic of research. It can be inefficient to manage and analyze sample data originating from a mixture of platforms. For instance, The Cancer Genome Atlas (TCGA) currently has five different platforms for RNA expression: Agilent G4502A, Affymetrix HT-HG_U133A, HG-U133_Plus_2, Illumina GA and Illumina HiSeq 2000, which makes it difficult to leverage the full potential of the data through a combined analysis. Furthermore, the dynamic ranges of gene expressions can vary considerably depending on the choice of profiling platform.
[0033] With the huge amount of legacy data generated throughout the years based on former technologies, the diversity of existing platforms, and the emergence of new ones, it can be advantageous to provide compatibility of data across various platforms. Breaking platform barriers means saving the cost of re-profiling of samples in order to perform combined analysis. It can also solve the backward compatibility problem and facilitate the adoption of new profiling technologies by allowing legacy data to be readily transformed for use with data from newer platforms. Specifically, tremendous resources have been spent on microarray studies, and it is desirable to transfer the knowledge and insights from these studies onto new platforms such as next-generation sequencing (NGS) technologies.
[0034] Embodiments of the present invention facilitate cross-platform compatibility of gene expression data using models that transform expression data from one platform to another. These embodiments can also, in the clinical research setting, be applied across different cohorts available to a clinical researcher in order to evaluate many signatures on a new cohort, either by transforming the input data to the primary platform or by adapting the parameters of a signature to the alternative platforms.
[0035] With reference to
[0036] However, in some embodiments, the model construction process may include additional levels of iteration. In these embodiments, an additional model, e.g., a categorical regression model, may be constructed from transformed expression data (Step 108). This additional model may be used, in turn, to transform additional expression data (Step 104). This process may be iterated through additional rounds of categorical model construction (Step 108) and the application of those categorical models to transform expression data (Step 104), which may then be used to construct additional categorical models (Step 108) and so on. When multiple categorical models are constructed and subsequently used to transform expression data, the models are applied in the order of their construction—i.e., the first model constructed is the first model used to transform data, the second model constructed is used to transform the data transformed by the first model, and so on.
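By way of a non-limiting illustration, the ordered application of successively constructed models may be sketched as follows. The function name `apply_in_order` and the two example models are hypothetical and are not taken from the embodiments; they only show that each model receives the output of the model constructed before it.

```python
from typing import Callable, List, Sequence

# A model is any function mapping one expression value to a transformed value.
Model = Callable[[float], float]


def apply_in_order(models: Sequence[Model], values: Sequence[float]) -> List[float]:
    """Apply each model to the output of the previous one, in construction order."""
    out = list(values)
    for model in models:
        out = [model(v) for v in out]
    return out


# Hypothetical models: a primary linear model, then one categorical correction.
def primary(x: float) -> float:
    return 2.0 * x + 1.0


def categorical(x: float) -> float:
    return x - 0.5


transformed = apply_in_order([primary, categorical], [1.0, 2.0])
```

Reversing the list of models would generally give a different result, which is why the order of construction is preserved.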
Construction of Primary Model
[0037] As discussed above, embodiments of the present invention typically construct a primary model for transforming data sets between platforms (Step 100). With reference to
[0038] If there is no direct mapping between the two sets of targets, the targets could be mapped from one set to another by their genomic positions. For example, if the source data are RNA-Seq exon expressions and the destination data are microarray gene expressions, then exons that overlap with the microarray probe-sets can be identified and summarized into gene expressions before applying regression.
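A minimal sketch of such position-based mapping is given below. The data layout (tuples of chromosome, start, end) and the choice of the mean as the summary are assumptions made for illustration only; the description above does not prescribe how overlapping exons are summarized.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """True if closed intervals [a_start, a_end] and [b_start, b_end] intersect."""
    return a_start <= b_end and b_start <= a_end


def summarize_exons_to_genes(exons, probe_sets):
    """Summarize exon expressions into gene expressions by genomic overlap.

    exons:      list of (chrom, start, end, expression)
    probe_sets: list of (gene, chrom, start, end)
    Returns {gene: mean expression of exons overlapping a probe set of that gene}.
    The mean is an assumed summary; a sum or median could be used instead.
    """
    hits = defaultdict(list)
    for chrom, start, end, expr in exons:
        # A set ensures each exon is counted once per gene.
        genes = {
            gene
            for gene, p_chrom, p_start, p_end in probe_sets
            if p_chrom == chrom and overlaps(start, end, p_start, p_end)
        }
        for gene in genes:
            hits[gene].append(expr)
    return {gene: sum(vals) / len(vals) for gene, vals in hits.items()}


# Hypothetical coordinates and values.
exons = [("chr1", 100, 200, 4.0), ("chr1", 300, 400, 6.0), ("chr1", 900, 950, 9.0)]
probe_sets = [("GENE_A", "chr1", 150, 350)]
gene_expr = summarize_exons_to_genes(exons, probe_sets)
```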
[0039] With {S.sub.i}.sub.i=1 . . . N referring to the N training samples available to construct the model for transforming gene expression data from Platform X to Platform Y, for each sample S.sub.i, regression is performed between x.sub.i and y.sub.i using expressions that are detected on both platforms (Step 204).
[0040] The target model for the regression process is assumed a priori to be defined by M parameters. Depending on the observed relationship between the training data from the source and destination platforms, any regression model can be chosen that results in the least error, such as non-linear, logarithmic, LOESS (local regression) or errors-in-variables (EiV) models. In addition, an optimization function can be applied to choose a model with the least error. This choice may be the decision of a human operator, or it may be the outcome of an automated or a semi-automated process. With an appropriate model selected, the output of the regression process is N sets of parameters r.sub.i={r.sub.k}.sub.i, k=1 . . . M.
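The per-sample regression may be illustrated by the following sketch, which assumes a simple linear model (M = 2, slope and intercept) fitted by ordinary least squares. The dictionary layout and the function names are hypothetical; any of the other model families named above could be substituted.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx  # assumes the x values are not all identical
    return slope, mean_y - slope * mean_x


def regress_per_sample(source, target):
    """Fit one parameter set per sample, using expressions detected on both platforms.

    source, target: {sample_id: {gene: expression}} for Platform X and Platform Y.
    Returns {sample_id: (slope, intercept)}.
    """
    params = {}
    for sample_id in source.keys() & target.keys():
        common = sorted(source[sample_id].keys() & target[sample_id].keys())
        x = [source[sample_id][g] for g in common]
        y = [target[sample_id][g] for g in common]
        params[sample_id] = fit_line(x, y)
    return params


# Hypothetical sample: g4 is detected only on Platform X and is ignored.
source = {"S1": {"g1": 1.0, "g2": 2.0, "g3": 3.0, "g4": 7.0}}
target = {"S1": {"g1": 3.0, "g2": 5.0, "g3": 7.0}}
params = regress_per_sample(source, target)
```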
[0041] Given the regression parameters r.sub.i for each sample S.sub.i, candidate features f are selected from the data generated by Platform X that can be good predictors for the regression parameters (Step 208). For example, if Platform X is a microarray platform, the candidates f may include mean expression, mean normalized probe intensity, etc. If Platform X is an RNA-Seq platform, the candidates f may include mean expression, number of detected genes, total number of reads, read coverage, etc. As with the choice of regression model, the identification of candidate features f may be performed by a human operator or by an automated or semi-automated process.
[0042] It is not necessary for the predictive features to be extracted only from the source data. Sometimes features from the target data may have good performance in predicting the regression parameters. In some embodiments, such target platform features may also be included in the model and, e.g., assigned with the mean value of the feature in the training data for the transformation process.
[0043] Having identified possible candidate features f from Platform X (Step 208), those features f.sub.k that actually predict the regression parameters r.sub.i need to be identified from the set of possible candidate features f (Step 212). In one embodiment, the predictive features may be identified by means of, e.g., stepwise regression or other automated, manual, or semi-automated methods. If the goal is to select a single predictive feature for each parameter instead of a subset, the feature with the highest correlation with the parameter can be selected.
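The single-feature case may be sketched as follows. The sketch assumes Pearson correlation and ranks features by its absolute value, so that a strongly negative correlation also counts as predictive; both choices, as well as the feature names and values, are illustrative assumptions.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)  # assumes neither sequence is constant


def best_feature(features, parameter):
    """Return the name of the feature most correlated (in magnitude) with the parameter.

    features:  {feature_name: [value per sample]}
    parameter: [regression parameter value per sample]
    """
    return max(features, key=lambda name: abs(pearson(features[name], parameter)))


# Hypothetical per-sample feature values and fitted slopes.
features = {
    "mean_expression": [5.1, 5.8, 6.2, 6.6],
    "detected_genes": [10.0, 9.0, 11.0, 10.0],
}
slopes = [4.6, 3.9, 3.5, 3.1]
selected = best_feature(features, slopes)
```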
[0044] The output of the model construction process consists of the identified predictive features f.sub.k and their corresponding models y.sub.k for the prediction of the regression model parameters r.sub.i for each sample S.sub.i (Step 216).
[0045] In some embodiments, the expression data for a particular platform (e.g., x.sub.i for Platform X, y.sub.i for Platform Y, etc.) with appropriate normalization is generated for the training samples {S.sub.i} (not shown) prior to the identification of the common expressions (Step 200).
Primary Model Transformation
[0046] Once the primary model has been created, it may be used to transform subsequent samples from Platform X to Platform Y. Assume for the following discussion that there exists data generated on Platform X for a new sample P.sub.n. That data includes the expression profile z.sub.n and sets of predictive feature values {v.sub.k}.sub.n that correspond to the predictive features {f.sub.k}.sub.k=1, . . . , M discussed above in connection with
[0047] With reference to
[0048] The predicted regression model parameters r.sub.n can be applied to a pre-defined regression model (Step 304), enabling the estimation of the expression profile as {circumflex over (z)}.sub.n.sup.(0)={{circumflex over (z)}.sub.g.sub.
Categorical Model Construction and Transformation
[0049] In some embodiments, the primary model may suffice to transform expression data between profiling platforms. As discussed above, in other embodiments additional levels of categorical modeling and transformation may be employed to convert data between platforms.
[0050] In particular, if there are one or more factors that introduce additional cross-platform discrepancies, then additional levels of regression related to the factors can be performed on the data, with the transformed data from the previous level of regression serving as the input to the next level of regression.
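One additional level of regression may be sketched as follows, assuming a linear model fitted separately for each category (for example, each gene) between the values transformed at the previous level and the values measured on the destination platform. The function names and the example data are hypothetical.

```python
from collections import defaultdict


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx  # assumes the x values within a category are not all identical
    return slope, mean_y - slope * mean_x


def fit_categorical(transformed, target, categories):
    """Fit one line per category from (previous-level value, target value) pairs."""
    groups = defaultdict(lambda: ([], []))
    for z, y, c in zip(transformed, target, categories):
        groups[c][0].append(z)
        groups[c][1].append(y)
    return {c: fit_line(zs, ys) for c, (zs, ys) in groups.items()}


def apply_categorical(models, values, categories):
    """Transform each value with the model of its own category."""
    return [models[c][0] * v + models[c][1] for v, c in zip(values, categories)]


# Hypothetical data: output of the previous level, destination values, categories.
previous = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
measured = [2.0, 4.0, 6.0, 0.0, 1.0, 2.0]
cats = ["a", "a", "a", "b", "b", "b"]
models = fit_categorical(previous, measured, cats)
next_level = apply_categorical(models, previous, cats)
```

The output `next_level` would then serve as the input to any further level of categorical regression.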
[0051] For example, assume the existence of one factor of well-defined categories c.sub.l={c.sub.m}.sub.l, m=1, . . . , O.sub.
[0052] With reference to
[0053] With reference to
[0054] This process of categorical modeling and transformation of the training data shown in
Exemplary Embodiments
[0055] According to one embodiment, there is provided a system and method for the transformation of gene expression data (in log.sub.2 scale) from Affymetrix GeneChip HT Human Genome U133 Array Plate Set (RMA) to Illumina HiSeq 1000 RNA-Seq (RSEM) using 545 TCGA samples that have data generated on each of the respective platforms. Some sample-wise statistics are summarized for the two platforms in Table 1. The mean correlation per sample is 0.713 and higher expressions show stronger correlation in general.
TABLE 1. Summary statistics of TCGA samples with expression data generated on both Affymetrix microarray and Illumina RNA-Seq platforms

Across      Affymetrix Microarray (log2 RMA)    Illumina RNA-Seq (log2 RSEM)
Samples     Mean        Variance                Mean        Variance            Correlation
Min.        5.143       2.939                   7.307       7.211               0.425
Max.        6.687       4.862                   8.232       9.699               0.782
Mean        5.804       4.116                   7.795       8.363               0.713
Variance    0.201       0.139                   0.017       0.178               0.001
[0056] By generating scatterplots of RNA-Seq vs. microarray expressions for each sample, it can be seen that their relationship can be suitably approximated by a piecewise linear model. In an exemplary implementation using the R programming language, the ‘lm’ function for linear regression and the ‘segmented’ function of the ‘segmented’ package are applied for breakpoint (x.sub.b) estimation. This results in four regression parameters {m.sub.1, c.sub.1, m.sub.2, c.sub.2} for the linear models before and after the estimated breakpoint: y.sub.1=m.sub.1x.sub.1+c.sub.1 for x≤x.sub.b and y.sub.2=m.sub.2x.sub.2+c.sub.2 for x>x.sub.b. Summary statistics of the regressed piecewise linear models are given in Table 2 below.
TABLE 2. Summary statistics of the regressed piecewise linear models.

            Before Breakpoint (x ≤ x.sub.b)     After Breakpoint (x > x.sub.b)
Across      Slope 1       Intercept 1           Slope 2       Intercept 2          Breakpoint
Samples     (m.sub.1)     (c.sub.1)             (m.sub.2)     (c.sub.2)            (x.sub.b)
Min.        1.704         −14.570               0.338         3.378                4.221
Max.        4.892         −0.866                0.876         7.044                5.747
Mean        3.849         −9.810                0.641         5.421                4.778
Variance    0.154         2.264                 0.004         0.344                0.134
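For readers without access to R, a two-segment fit of the kind described above may be sketched in Python as follows. This sketch is not the algorithm of the ‘segmented’ package: it substitutes a simple exhaustive search over split positions, fits an independent least-squares line on each side, and reports the breakpoint as the midpoint between the two points adjacent to the best split. The function names, the `min_points` parameter, and the example data are assumptions.

```python
def fit_line(x, y):
    """Least-squares line; returns (slope, intercept, sum of squared errors)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx  # assumes distinct x values within each segment
    intercept = mean_y - slope * mean_x
    sse = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    return slope, intercept, sse


def fit_piecewise(x, y, min_points=3):
    """Two-segment linear fit; returns (m1, c1, m2, c2, xb)."""
    pts = sorted(zip(x, y))
    best = None
    # Try every split that leaves at least min_points on each side.
    for k in range(min_points, len(pts) - min_points + 1):
        left, right = pts[:k], pts[k:]
        m1, c1, e1 = fit_line(*zip(*left))
        m2, c2, e2 = fit_line(*zip(*right))
        if best is None or e1 + e2 < best[0]:
            xb = (left[-1][0] + right[0][0]) / 2.0
            best = (e1 + e2, m1, c1, m2, c2, xb)
    return best[1:]


# Hypothetical data: y = 4x - 10 for x <= 4 and y = 0.5x + 5.75 for x >= 5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [-6.0, -2.0, 2.0, 6.0, 8.25, 8.75, 9.25, 9.75]
m1, c1, m2, c2, xb = fit_piecewise(xs, ys)
```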
[0057] Next, a small set of candidate features for predicting the four regression model parameters is generated, and it can be determined that mean expression level is a plausible single linear predictor, with moderately strong correlations of R=−0.55 with m.sub.1 and R=0.74 with c.sub.2, despite lesser correlations of R=−0.27 with c.sub.1 and R=0.19 with m.sub.2, which has a small variance of 0.04. The linear models between mean expression level and the four regression parameters are shown in
[0058] Using mean expression level as a predictor, the piecewise-linear model for each sample can be predicted.
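A minimal sketch of this prediction step is given below. It assumes that each of the four parameters is predicted by its own linear model of the sample's mean expression, and, as a further assumption not stated above, that the breakpoint is taken as the intersection of the two predicted lines. All coefficients are hypothetical placeholders, not values from the exemplary data.

```python
def predict_parameters(mean_expr, predictors):
    """Predict each regression parameter as a * mean_expr + b.

    predictors: {parameter_name: (a, b)}, one linear model per parameter.
    """
    return {name: a * mean_expr + b for name, (a, b) in predictors.items()}


def transform_sample(profile, predictors):
    """Transform one source-platform profile with its predicted piecewise-linear model."""
    mean_expr = sum(profile) / len(profile)
    p = predict_parameters(mean_expr, predictors)
    # Assumption: the breakpoint is where the two predicted lines intersect.
    xb = (p["c2"] - p["c1"]) / (p["m1"] - p["m2"])
    return [
        p["m1"] * x + p["c1"] if x <= xb else p["m2"] * x + p["c2"]
        for x in profile
    ]


# Hypothetical predictor coefficients (a, b) for each parameter.
predictors = {
    "m1": (0.0, 4.0),
    "c1": (0.0, -10.0),
    "m2": (0.0, 0.5),
    "c2": (1.0, 0.0),
}
predicted = transform_sample([4.0, 6.0, 8.0], predictors)
```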
[0059] As illustrated, for moderate-to-high microarray expressions, the predicted RNA-Seq expressions have a root-mean-square error e.sub.rms=1.4, which is very close to that of 1.39 based on the estimated values obtained by direct regression. To further improve the accuracy, an additional level of regression and transformation can be applied on the primarily transformed values stratified by genes across all samples using the categorical approach as described above.
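For reference, the root-mean-square error between predicted and measured expressions may be computed as in the following sketch; the function name and the example values are illustrative only.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical values: only the last prediction is off, by 2.0.
example_error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```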
[0060] Referring to
[0061] Processor 1008 is configured as discussed above to build primary and categorical models for transforming gene expression data from a first profiling platform to a second profiling platform such that the overall distribution of the transformed data resembles that of the second platform.
Applications
[0062] Embodiments of the present invention may be extended to compute unified expressions from data measured by multiple platforms. For instance, all data may be transformed to one specific platform using, e.g., an EiV regression model, and then, for each target, the transformed values may be combined by weighted averaging, using weights that are inversely proportional to the estimated noise variances of the respective source platforms.
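The weighted averaging described above may be sketched as follows for a single target. The function name and the example variances are hypothetical, and the noise variances are assumed to have been estimated beforehand.

```python
def combine(values, noise_variances):
    """Inverse-variance weighted average of transformed values for one target.

    values:          one transformed value per source platform
    noise_variances: estimated noise variance of each source platform (all > 0)
    """
    weights = [1.0 / v for v in noise_variances]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


# Hypothetical example: the second platform is four times noisier,
# so it contributes one quarter of the weight of the first.
unified = combine([8.0, 10.0], [1.0, 4.0])
```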
[0063] While the above embodiments of the present invention are described with respect to measurements performed on genomics platforms, the same process and procedures can be applied to physiology modelling, imaging, personal continuous health data and others.
[0064] Although the above embodiments of the present invention are described with respect to gene expression data, the process and procedures described herein are applicable to solving the compatibility problem across different platforms or analytical pipelines for any numerical readings, for example methylation levels, protein expressions or even sensor measurements, that exhibit structural discrepancies due to the inherent differences of the underlying systems.
Equivalents, Definitions, etc.
[0065] While several embodiments of the present invention have been described and illustrated herein, those of ordinary skill in the art will readily envision a variety of other means and/or structures for performing the functions and/or obtaining the results and/or one or more of the advantages described herein, and each of such variations and/or modifications is deemed to be within the scope of the present invention. More generally, those skilled in the art will readily appreciate that all parameters, dimensions, materials, and configurations described herein are meant to be exemplary and that the actual parameters, dimensions, materials, and/or configurations will depend upon the specific application or applications for which the teachings of the present invention is/are used. Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described herein. It is, therefore, to be understood that the foregoing embodiments are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, the invention may be practiced otherwise than as specifically described and claimed. The present invention is directed to each individual feature, system, article, material, and/or method described herein. In addition, any combination of two or more such features, systems, articles, materials, and/or methods, if such features, systems, articles, materials, and/or methods are not mutually inconsistent, is included within the scope of the present invention.
[0066] The indefinite articles “a” and “an,” as used herein in the specification and in the claims, unless clearly indicated to the contrary, should be understood to mean “at least one.”
[0067] The phrase “and/or,” as used herein in the specification and in the claims, should be understood to mean “either or both” of the elements so conjoined, e.g., elements that are conjunctively present in some cases and disjunctively present in other cases. Other elements may optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified unless clearly indicated to the contrary. Thus, as a non-limiting example, a reference to “A and/or B,” when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A without B (optionally including elements other than B); in another embodiment, to B without A (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
[0068] As used herein in the specification and in the claims, “or” should be understood to have the same meaning as “and/or” as defined above. For example, when separating items in a list, “or” or “and/or” shall be interpreted as being inclusive, e.g., the inclusion of at least one, but also including more than one, of a number or list of elements, and, optionally, additional unlisted items. Only terms clearly indicated to the contrary, such as “only one of” or “exactly one of,” or, when used in the claims, “consisting of,” will refer to the inclusion of exactly one element of a number or list of elements. In general, the term “or” as used herein shall only be interpreted as indicating exclusive alternatives (e.g. “one or the other but not both”) when preceded by terms of exclusivity, such as “either,” “one of,” “only one of,” or “exactly one of.” “Consisting essentially of,” when used in the claims, shall have its ordinary meaning as used in the field of patent law.
[0069] As used herein in the specification and in the claims, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, “at least one of A and B” (or, equivalently, “at least one of A or B,” or, equivalently “at least one of A and/or B”) can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.
[0070] In the claims, as well as in the specification above, all transitional phrases such as “comprising,” “including,” “carrying,” “having,” “containing,” “involving,” “holding,” and the like are to be understood to be open-ended, e.g., to mean including but not limited to.
[0071] Only the transitional phrases “consisting of” and “consisting essentially of” shall be closed or semi-closed transitional phrases, respectively, as set forth in the United States Patent Office Manual of Patent Examining Procedures, Section 2111.03.
[0072] Use of ordinal terms such as “first,” “second,” “third,” etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed, but are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term) to distinguish the claim elements.
[0073] It should also be understood that, unless clearly indicated to the contrary, in any methods claimed herein that include more than one step or act, the order of the steps or acts of the method is not necessarily limited to the order in which the steps or acts of the method are recited.