Method of predicting chromatographic elution order of compounds

11651213 · 2023-05-16

Abstract

Disclosed is a method for predicting an elution order of compounds in a mixture. The method includes (a) building a quantitative structure-retention relationship (QSRR) model and (b) predicting a chromatographic elution order of the compounds in the mixture on the basis of the QSRR model using mathematical programming. The mathematical programming is a non-linear programming technique in which a predicted elution order of the compounds is used as a constraint or a multi-objective optimization (MOO) in which a retention time prediction error and an elution order prediction error are used as objective functions. With the use of the method of the present disclosure, it is possible to optimize separation of complex mixtures in reversed-phase chromatography by enabling identification of accurate positions of individual compounds that provides higher certainty in identifying a given compound, e.g., during an “omics” analysis (proteomics, metabolomics, etc.).

Claims

1. A computer-implemented method of predicting a chromatographic elution order of compounds in a mixture, the method comprising: generating, by a computer processor, a quantitative structure-retention relationship (QSRR) model; receiving, by the computer processor, mixture data representing one or more mixtures of a plurality of compounds; computer-modeling, by the computer processor, the QSRR model on the mixture data using artificial neural networks (ANN) to generate a non-linear programming (NLP) model comprising one or more inequality constraints; predicting, by the computer processor, a chromatographic elution order of the compounds in the mixture data using the NLP model, the predicting comprising obtaining a low retention time prediction error and an elution order prediction error, the one or more inequality constraints associated with the elution order prediction error; detecting, by the computer processor, positions of the compounds in the mixture data based on the predicted chromatographic elution order; and identifying, by the computer processor, one or more of the compounds based on the detected positions of the compounds, wherein the one or more inequality constraints comprise a positive relaxation parameter and a molecular descriptor associated with the elution order prediction error.

2. The method according to claim 1, wherein, in the predicting, the chromatographic elution order of the compounds in the mixture is predicted by the computer processor executing instructions comprising the non-linear programming (I) under the inequality constraints (II):

$$\min_{\bar{a}} \left\{ \sum_{j=1}^{m} \left( t_{R,j} - a_1 x_{j,1} - a_2 x_{j,2} - a_3 x_{j,3} \right)^2 + \sum_{j=1}^{m} \alpha_j \right\} \tag{I}$$

$$a_1 (x_{j,1} - x_{j+1,1}) + a_2 (x_{j,2} - x_{j+1,2}) + a_3 (x_{j,3} - x_{j+1,3}) - \alpha_j \le 0 \tag{II}$$

wherein the inequality constraints further comprise an a vector, and wherein: x represents the molecular descriptor, t.sub.R represents a retention time, m represents a number of the mixtures, α.sub.j represents the positive relaxation parameter, and ā represents the a vector comprising a.sub.1, a.sub.2, a.sub.3, and α.sub.j (j=1, 2, . . . , m−1).

3. The method according to claim 2, wherein the molecular descriptor comprises dipole moment (μ), excess charge of an atom that is most negatively charged (δ.sub.min), solvent-accessible surface area (SASA), sum of retention times of respective 20 naturally occurring amino acids (Sum.sub.AA), Van der Waals volume (vDW.sub.vol.), computerized octanol-water coefficient (c log P), or any combination thereof.

4. A computer-implemented method of predicting a chromatographic elution order of compounds in a mixture, the method comprising: generating, by a computer processor, a quantitative structure-retention relationship (QSRR) model, wherein the QSRR model comprises a molecular descriptor; receiving, by the computer processor, mixture data representing one or more mixtures of a plurality of compounds; computer-modeling, by the computer processor, the QSRR model on the mixture data to generate a linear model; predicting, by the computer processor, a chromatographic elution order of the compounds in the mixture data using the linear model by performing multi-objective optimization (MOO) on the basis of an objective function representing a retention time prediction error represented by Formula (III) and an elution order prediction error represented by Formula (IV):

$$\%\mathrm{RMSE}(t_R) = \sqrt{\frac{\sum \left( \frac{t_R - \hat{t}_R}{t_R} \right)^2}{m}} \times 100 \tag{III}$$

and

$$\%\mathrm{RMSE}(\mathrm{order}) = \sqrt{\frac{\sum \left( \frac{\mathrm{order}_{pred.} - \mathrm{order}_{obs.}}{\mathrm{order}_{obs.}} \right)^2}{m}} \times 100 \tag{IV}$$

detecting, by the computer processor, positions of the compounds in the mixture data based on the predicted chromatographic elution order; and identifying, by the computer processor, one or more of the compounds based on the detected positions of the compounds, wherein: t.sub.R and t̂.sub.R respectively represent a retention time measured in an analytical experiment and a retention time predicted from the model, m represents a number of the mixtures measured for each column, and order.sub.obs. and order.sub.pred. respectively represent an elution order determined from the analytical experiment and an elution order determined from predicted retention times.

5. The method according to claim 4, wherein in the predicting, the MOO selects a Pareto optimal solution, selecting the Pareto optimal solution comprising: selecting a knee point which is an optimal compromise between the retention time prediction error and the elution order prediction error from a Pareto front including the Pareto solutions; moving to the next Pareto solution to reduce the elution order prediction error; verifying the solution using an applicability domain; and repeating the knee point selection and the moving until an increase in the retention time prediction error reaches a first predetermined threshold or an outlier in the applicability domain exceeds a second predetermined threshold.

6. The method of claim 4, wherein the linear model is represented by the following formula:
t.sub.R,j=a.sub.1x.sub.j,1+a.sub.2x.sub.j,2+ . . . +a.sub.nx.sub.j,n wherein: t.sub.R,j are retention times of respective compounds j sorted in ascending order, x.sub.j,i (i=1, . . . , n) are the molecular descriptors of respective compounds j, and a.sub.i (i=1, . . . , n) are regression coefficients.

7. The method according to claim 6, wherein the molecular descriptor comprises dipole moment (μ), excess charge of an atom that is most negatively charged (δ.sub.min), solvent-accessible surface area (SASA), sum of retention times of respective 20 naturally occurring amino acids (Sum.sub.AA), Van der Waals volume (vDW.sub.vol.), computerized octanol-water coefficient (c log P), or any combination thereof.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a conceptual explanatory diagram illustrating a method of predicting chromatographic elution order of compounds by using a multi-objective optimization (MOO) technique;

(2) FIGS. 2A, 2B and 2C illustrate the results of a multi-objective optimization technique in which a retention time prediction error and an elution order prediction error are used as objective functions and, more particularly, illustrate the results of comparison between %RMSE values of MLR and MLR-MOO for each column in two examples, wherein FIG. 2A is retention time, FIG. 2B is elution order, and FIG. 2C is %RMSE value difference between an MLR model and an MLR-MOO model;

(3) FIGS. 3A, 3B, 3C, 3D, 3E and 3F illustrate the results of a multi-objective optimization technique in which a retention time prediction error and an elution order prediction error are used as objective functions and, more particularly, illustrate the performance of the present disclosure for a linear QSRR (MLR), wherein FIG. 3A is retention time, FIG. 3B is elution order, and FIG. 3C is an applicability domain in Example 1 (CS1); and FIG. 3D is retention time, FIG. 3E is elution order, and FIG. 3F is applicability domain in Example 2 (CS2);

(4) FIGS. 4A, 4B and 4C illustrate the results of non-linear programming using predicted elution order as a constraint and, more particularly, illustrate the results of comparison between %RMSE values of MLR and NLP (non-linear programming) for each column in two examples, in which FIG. 4A is retention time, FIG. 4B is elution order, and FIG. 4C is %RMSE value difference between an MLR model and an NLP model;

(5) FIGS. 5A, 5B and 5C illustrate the results of non-linear programming using the predicted elution order as a constraint and, more particularly, show the performance of the present disclosure (Example 2 (CS2)) for a linear QSRR (MLR), in which FIG. 5A is retention time prediction, FIG. 5B is elution order prediction, and FIG. 5C is applicability domain; and

(6) FIGS. 6A and 6B are graphs showing an MLR-MOO Pareto front wherein FIG. 6A is Example 1 (CS1) and FIG. 6B is Example 2 (CS2).

DETAILED DESCRIPTION

(7) In describing embodiments of the present disclosure, well-known functions or constructions will not be described in detail when they may obscure the gist of the present invention.

(8) Embodiments in accordance with the concept of the present invention can undergo various changes to have various forms, and only some specific embodiments are illustrated in the drawings and described in detail in the present disclosure. While specific embodiments of the present disclosure are described herein below, they are only for illustrative purposes and should not be construed to limit the scope of the present disclosure. The present disclosure should be construed to cover not only the specific embodiments but also cover all modifications, equivalents, and substitutions that fall within the concept and technical spirit of the present disclosure.

(9) The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of the present disclosure. As used herein, the singular forms “a”, “an”, and “the” are intended to include the plural forms as well unless the context clearly indicates otherwise. It will be further understood that the terms “comprises”, “includes”, or “has” when used in the present disclosure specify the presence of stated features, regions, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, regions, integers, steps, operations, elements, components and/or combinations thereof.

(10) The present disclosure describes a method for solving problems with prediction of elution order through QSRR-based mathematical programming in which a retention time error and an elution order error are defined as parts of an objective function. For a linear model, the gist of the present disclosure is a general QSRR model defined by Formula 1.
t.sub.R,j=a.sub.1x.sub.j,1+a.sub.2x.sub.j,2+ . . . +a.sub.nx.sub.j,n  (1)

(11) In Formula 1, t.sub.R,j are retention times of compounds j arranged in ascending order, x.sub.j,i (i=1, . . . , n) are molecular descriptors of compounds j, and a.sub.i (i=1, . . . , n) are regression coefficients, in which t.sub.R,j and x.sub.j,i are mean-centered, i.e., shifted so that each has a mean of 0. The a.sub.i can be obtained through multiple linear regression (MLR). When the QSRR is non-linear, a non-linear modeling technique such as artificial neural networks (ANN) may be used.
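The mean-centering and MLR step of Formula 1 can be sketched as follows. This is a minimal illustration only, assuming plain least squares solved through the normal equations; the function names `center_columns`, `solve_linear`, and `fit_mlr` and the toy data are hypothetical and do not come from the disclosure.

```python
def center_columns(rows):
    """Shift every descriptor column so that its mean is 0."""
    m, n = len(rows), len(rows[0])
    means = [sum(r[i] for r in rows) / m for i in range(n)]
    return [[r[i] - means[i] for i in range(n)] for r in rows]


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_mlr(rows, t):
    """Least-squares coefficients a_i of t_R,j = sum_i a_i x_j,i on centered data."""
    Xc = center_columns(rows)
    t_mean = sum(t) / len(t)
    tc = [v - t_mean for v in t]
    n = len(Xc[0])
    XtX = [[sum(r[i] * r[j] for r in Xc) for j in range(n)] for i in range(n)]
    Xtt = [sum(r[i] * tc[k] for k, r in enumerate(Xc)) for i in range(n)]
    return solve_linear(XtX, Xtt)


# Toy data generated from t = 2*x1 - 1*x2 + 5 (illustrative values only).
rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
t = [5.0, 8.0, 6.0, 10.0]
coeffs = fit_mlr(rows, t)
```

Because both sides are centered, the constant offset of the toy data drops out and only the slopes a.sub.i are estimated.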

(12) (1) Constrained Non-Linear Programming (NLP) Using Predicted Elution Order as Constraint

(13) QSRR modeling can be expressed as the mathematical programming problem of Formula 2.

(14) $$\min_{a} \sum_{j} \left( t_{R,j} - \hat{t}_{R,j} \right)^2 \tag{2}$$

(15) In Formula 2, t̂.sub.R,j=f(x) is the modeled QSRR relation, which is a function with a.sub.i as parameters. For example, as in the embodiment described below, when n=3 and MLR is used as the relation model, Formula 2 can be expressed as Formula 3.

(16) $$\min_{a} \sum_{j} \left( t_{R,j} - \hat{t}_{R,j} \right)^2 = \min_{a} \sum_{j} \left( t_{R,j} - a_1 x_{j,1} - a_2 x_{j,2} - a_3 x_{j,3} \right)^2 \tag{3}$$

(17) That is, a typical QSRR problem reduces to a non-linear programming problem. By arranging the predicted retention times in ascending order, it is easy to predict the elution order therefrom. However, as described above, although it is possible to predict retention times of a number of peaks with adequate accuracy (i.e., within a tolerable prediction error range) with the QSRR model obtained from Formula 2, prediction of elution order with the QSRR may often result in low accuracy. This follows from the regression equation: the least-squares objective penalizes only the size of each residual and contains no term for the relative order of the predictions.

(18) In terms of mathematical programming, the problems mentioned above seem to be solved by introducing the following inequality constraints:

(19) $$\min_{a} \sum_{j} \left( t_{R,j} - a_1 x_{j,1} - a_2 x_{j,2} - a_3 x_{j,3} \right)^2 \tag{4}$$

subject to:

$$\hat{t}_{R,j} \le \hat{t}_{R,j+1} \quad \text{or} \quad a_1 x_{j,1} + a_2 x_{j,2} + a_3 x_{j,3} \le a_1 x_{j+1,1} + a_2 x_{j+1,2} + a_3 x_{j+1,3}$$

(20) For m compounds, the inequality constraints can be written in vector-matrix notation as X·a≤0, where X is an ((m−1)×3) matrix whose j-th row (j=1, 2, . . . , m−1) is [(x.sub.j,1−x.sub.j+1,1) (x.sub.j,2−x.sub.j+1,2) (x.sub.j,3−x.sub.j+1,3)], and a=[a.sub.1 a.sub.2 a.sub.3].sup.T.
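The construction of the constraint matrix X and the check X·a≤0 can be sketched as follows. This is an illustrative sketch only; the helper names `difference_rows` and `violated_constraints` and the toy descriptor values are hypothetical. The compounds are assumed to be already sorted by observed retention time.

```python
def difference_rows(X_sorted):
    """Row j is x_j - x_{j+1} for compounds sorted by observed retention time."""
    return [
        [u - v for u, v in zip(X_sorted[j], X_sorted[j + 1])]
        for j in range(len(X_sorted) - 1)
    ]


def violated_constraints(D, a):
    """Indices j (0-based) where the row product D_j . a is > 0,
    i.e. where compound j is predicted to elute after compound j+1."""
    return [
        j for j, row in enumerate(D)
        if sum(d * c for d, c in zip(row, a)) > 0
    ]


# Three compounds, three descriptors (toy values, not measured data).
X_sorted = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [1.5, 0.0, 0.0]]
a = [1.0, 0.0, 0.0]
D = difference_rows(X_sorted)
bad = violated_constraints(D, a)
```

In this toy case the predicted retention times are 1.0, 2.0 and 1.5, so only the second adjacent pair breaks the required order.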

(21) However, some numerical experiments have shown that the above constraints are so restrictive that they give poor results even for simple mixtures. Therefore, the relaxed inequality constraints shown below are used so that a low retention time prediction error and a low elution order prediction error are obtained simultaneously.

(22) $$\min_{\bar{a}} \left\{ \sum_{j=1}^{m} \left( t_{R,j} - a_1 x_{j,1} - a_2 x_{j,2} - a_3 x_{j,3} \right)^2 + \sum_{j=1}^{m} \alpha_j \right\} \tag{5}$$

subject to:

$$a_1 (x_{j,1} - x_{j+1,1}) + a_2 (x_{j,2} - x_{j+1,2}) + a_3 (x_{j,3} - x_{j+1,3}) - \alpha_j \le 0$$

(23) Here, α.sub.j is a positive relaxation parameter, and ā is a decision vector composed of a.sub.1, a.sub.2, a.sub.3 and α.sub.j (j=1, 2, . . . , m−1). The inequality constraints for m compounds can be expressed in the vector-matrix notation shown below.

(24) $$\begin{bmatrix} x_{1,1}-x_{2,1} & x_{1,2}-x_{2,2} & x_{1,3}-x_{2,3} & -1 & 0 & \cdots & 0 \\ x_{2,1}-x_{3,1} & x_{2,2}-x_{3,2} & x_{2,3}-x_{3,3} & 0 & -1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{m-1,1}-x_{m,1} & x_{m-1,2}-x_{m,2} & x_{m-1,3}-x_{m,3} & 0 & 0 & \cdots & -1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_{m-1} \end{bmatrix} \le 0 \tag{6}$$
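To illustrate how the relaxation parameters act, the sketch below evaluates the objective of Formula 5 for a fixed coefficient vector a. It relies on one observation rather than on a solver: for fixed a, the smallest non-negative α.sub.j that satisfies constraint j is max(0, (x.sub.j−x.sub.j+1)·a), so each α.sub.j behaves as a penalty on an order violation. The function name `relaxed_objective` and the toy values are hypothetical, and the slack sum runs over the m−1 adjacent pairs for which α.sub.j is defined.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def relaxed_objective(X_sorted, t_sorted, a):
    """Value of the relaxed NLP objective for fixed coefficients a.

    Returns (objective, alphas), where alphas[j] is the smallest slack
    that makes constraint j feasible: max(0, (x_j - x_{j+1}) . a).
    """
    residual = sum((t - dot(x, a)) ** 2 for x, t in zip(X_sorted, t_sorted))
    alphas = []
    for j in range(len(X_sorted) - 1):
        diff = [u - v for u, v in zip(X_sorted[j], X_sorted[j + 1])]
        alphas.append(max(0.0, dot(diff, a)))
    return residual + sum(alphas), alphas


# One descriptor, three compounds sorted by observed retention time (toy values).
X_sorted = [[1.0], [2.0], [1.5]]
t_sorted = [1.0, 1.6, 2.0]
objective, alphas = relaxed_objective(X_sorted, t_sorted, [1.0])
```

A full solution would minimize this quantity over a with a constrained optimizer; the sketch only shows how a mis-ordered pair contributes a positive slack to the objective.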

(25) (2) Multi-Objective Optimization Using Retention Time Prediction Error and Elution Order Prediction Error as Objective Function (MOO)

(26) A multi-objective optimization (MOO) problem is an optimization problem with multiple objective functions. Its general form is given by Formula 7.
min(g.sub.1(α.sub.1),g.sub.2(α.sub.2), . . . ,g.sub.k(α.sub.k)) subject to: α.sub.i∈A  (7)

(27) In Formula 7, the integer k (≥2) represents the number of objective functions g, and the set A is the feasible set of decision vectors α. In a multi-objective optimization, there is normally no single solution that minimizes all objective functions simultaneously. Therefore, attention is paid to the Pareto optimal solution, which is a solution for which no objective function can be improved without degrading at least one of the other objective functions.
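The notion of Pareto optimality can be sketched for the two-objective case as follows. This is a minimal sketch with hypothetical names and toy values; each point is a pair (retention time error, elution order error) and both objectives are minimized.

```python
def dominates(q, p):
    """q dominates p if q is no worse in every objective and better in at least one."""
    return all(a <= b for a, b in zip(q, p)) and any(a < b for a, b in zip(q, p))


def pareto_front(points):
    """Return the non-dominated points, sorted by the first objective."""
    front = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(front)


# Candidate solutions as (error 1, error 2); illustrative values only.
candidates = [(8.0, 50.0), (9.0, 45.0), (9.5, 47.0), (12.0, 40.0), (12.0, 41.0)]
front = pareto_front(candidates)
```

Here (9.5, 47.0) is dropped because (9.0, 45.0) is better in both objectives, and (12.0, 41.0) is dropped because (12.0, 40.0) matches it in the first objective and improves the second.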

(28) In the present disclosure, two objective functions are used, one representing the error of the retention time prediction and the other representing the error of the elution order prediction. The Pareto optimal solution is then selected according to the following procedure: (1) selecting the knee point which is the best compromise between the retention time prediction error and the elution order prediction error from the Pareto front consisting of the Pareto solutions; (2) moving to the next Pareto solution to reduce the elution order prediction error; (3) validating the solution using the applicability domain; and (4) repeating (2) and (3) until an increase in the retention time prediction error reaches a first predetermined threshold or until an outlier in the applicability domain exceeds a second predetermined threshold. This is conceptually illustrated in FIG. 1.
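The selection procedure in steps (1) to (4) can be sketched as follows. Several points are assumptions made only for illustration: the disclosure does not define how the knee point is computed, so the sketch takes the Pareto solution closest to the ideal point after min-max normalization; the increase in retention time error is measured relative to the knee point; and the number of applicability-domain outliers for each solution is supplied as precomputed input. All names and values are hypothetical.

```python
def knee_index(front):
    """Index of the solution closest to the ideal point in normalized objective space.
    (Assumed knee definition; the source does not specify one.)"""
    rt = [p[0] for p in front]
    od = [p[1] for p in front]

    def norm(v, lo, hi):
        return 0.0 if hi == lo else (v - lo) / (hi - lo)

    dist = [
        norm(p[0], min(rt), max(rt)) ** 2 + norm(p[1], min(od), max(od)) ** 2
        for p in front
    ]
    return min(range(len(front)), key=dist.__getitem__)


def select_solution(front, max_rt_increase_pct, max_outliers):
    """Walk from the knee toward lower order error until a threshold is hit.

    Each solution is (rt_error, order_error, n_outliers).
    Returns (knee_solution, selected_solution).
    """
    front = sorted(front)  # ascending rt error, hence descending order error
    k = knee_index(front)
    chosen = k
    for i in range(k + 1, len(front)):
        rt_increase = (front[i][0] - front[k][0]) / front[k][0] * 100.0
        if rt_increase > max_rt_increase_pct or front[i][2] > max_outliers:
            break  # stop before accepting a solution that violates a threshold
        chosen = i
    return front[k], front[chosen]


# Toy Pareto front (illustrative values only, not results from the Examples).
toy_front = [(1.0, 10.0, 0), (2.0, 4.0, 0), (2.2, 3.8, 1), (2.3, 3.7, 3), (6.0, 3.0, 5)]
knee, best = select_solution(toy_front, max_rt_increase_pct=20.0, max_outliers=2)
```

In this toy run the walk accepts one step beyond the knee and stops at the next solution because its outlier count exceeds the second threshold.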

(29) Hereinafter, the present disclosure will be described in more detail with reference to Examples.

(30) The examples presented herein are merely illustrative of the present disclosure and are not intended to limit the scope of the present disclosure.

Example

(31) The following two examples demonstrate the applicability of the present disclosure: (i) CS1 which is a mixture of 62 organic compounds and (ii) CS2 which is a mixture of 98 synthetic peptides. Analysis for the first example CS1 was performed using a Supelcosil LC column with a gradient time of 10 minutes at 35° C. Analysis for the second example CS2 was performed using seven chromatographic columns (i.e., Xterra, Licrospher, PRP, Discovery RP-Amide C-16, Licrospher CN, Discovery HS F5-3 and Chromolith) at different gradient settings and temperatures. Chromatographic analysis data were obtained from the references.

(32) The molecular descriptors used in each example for QSRR relation modeling are listed in Table 1 below.

(33) TABLE 1: Molecular descriptors used in Examples (CS1 and CS2)

Example   Molecular descriptor   Explanation
CS1       μ                      dipole moment
CS1       δ.sub.min              excess charge of the most negatively charged atom
CS1       SASA                   solvent-accessible surface area
CS2       Sum.sub.AA             sum of retention times of respective 20 naturally occurring amino acids
CS2       vDW.sub.vol.           Van der Waals volume
CS2       clogP                  computerized octanol-water coefficient

(34) In both examples, a linear model was considered as a specific form of the QSRR relation model, and control MLR model coefficients were calculated using a least-square method for comparison. The solution of a non-linear programming problem with relaxed constraints in Formula 5 was obtained using the interior-point method. The solution of a multi-objective optimization problem of Formula 7 was obtained using a genetic algorithm. In both methods, the coefficients of a control MLR model obtained for comparison were used as initial values in the optimization.

(35) For the multi-objective optimization, the percentage root mean square error (%RMSE) of the retention time was used as an objective function representing a retention time prediction error.

(36) $$\%\mathrm{RMSE}(t_R) = \sqrt{\frac{\sum \left( \frac{t_R - \hat{t}_R}{t_R} \right)^2}{m}} \times 100 \tag{8}$$

(37) Here, t.sub.R and t̂.sub.R respectively represent the retention time measured through the analytical experiment and the retention time predicted by the model, and m represents the number of mixtures measured for each column. The elution order prediction can be performed after sorting the retention times predicted by the QSRR model in ascending order, and %RMSE was used as the objective function representing the accuracy of the elution order.

(38) $$\%\mathrm{RMSE}(\mathrm{order}) = \sqrt{\frac{\sum \left( \frac{\mathrm{order}_{pred.} - \mathrm{order}_{obs.}}{\mathrm{order}_{obs.}} \right)^2}{m}} \times 100 \tag{9}$$

(39) Where order.sub.obs. and order.sub.pred. respectively represent the elution order determined through the analysis and the elution order determined from the predicted retention time.
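The two error measures of Formulas 8 and 9 can be sketched as follows, assuming the root is taken over the mean of the squared relative errors. The function names and the toy retention times are hypothetical; the elution order is taken as the 1-based rank after sorting retention times in ascending order.

```python
import math


def elution_order(times):
    """1-based elution rank of each compound after sorting times ascending."""
    idx = sorted(range(len(times)), key=lambda i: times[i])
    ranks = [0] * len(times)
    for rank, i in enumerate(idx, start=1):
        ranks[i] = rank
    return ranks


def pct_rmse(observed, predicted):
    """Percentage root mean square of the relative errors."""
    m = len(observed)
    total = sum(((o - p) / o) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(total / m) * 100.0


# Toy retention times in minutes (illustrative values only).
t_obs = [1.0, 2.0, 4.0]
t_pred = [1.1, 2.5, 2.4]

order_obs = elution_order(t_obs)
order_pred = elution_order(t_pred)
rmse_t = pct_rmse(t_obs, t_pred)
rmse_order = pct_rmse(order_obs, order_pred)
```

In the toy data the second and third compounds swap places in the predicted order, which the order error captures even though each individual retention time error is moderate.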

(40) When both of the methods NLP and MOO used a linear QSRR model (MLR), the accuracy of the elution order prediction was significantly increased (see FIGS. 2B and 4B) at the expense of the accuracy of the retention time prediction in both examples (see FIGS. 2A and 4A). As illustrated in FIGS. 2C and 4C, the maximum increases in %RMSE (tR) were about 15% and 20%, respectively while the maximum decreases in %RMSE (order) were about 80% and 260%, respectively.

(41) Of the RP-LC columns used in the examples, FIGS. 3 and 5 illustrate the prediction results for one column in each example (CS1: Supelcosil LC, tG=10 min, T=35° C.; CS2: Xterra, tG=20 min, T=40° C.). That is, the prediction performance for each of the retention time and the elution order and the corresponding applicability domain are shown. The two examples CS1 and CS2 (CS1 (FIGS. 3A and 3B); CS2 (FIGS. 3D, 3E, 5A, and 5B)) show reasonable retention time prediction performance and elution order prediction performance. Nearly all analyte compounds included in both examples were well predicted, and structurally important analytes included in a training set were within each applicability domain (FIGS. 3C, 3F, and 5C). The developed model is therefore considered to be stable and robust for the structurally distant analytes.

(42) The optimal solutions obtained according to the procedure of finding the Pareto optimal solution while starting from the knee point are shown in FIG. 6A (CS1) and FIG. 6B (CS2). Predetermined thresholds were 10% (for %RMSE (tR)) at the knee point and 2 outliers in the applicability domain, respectively. In addition, as shown in Table 2 below, even in the case of the multi-objective optimization method, it was possible to attain a large decrease in elution order prediction error with a small increase in retention time prediction error in both of the examples.

(43) TABLE 2: %RMSE at knee point and optimal point

Example   Point           %RMSE(t.sub.R)   %RMSE(order)
CS1       Knee point      8.67             43.7
CS1       Optimal point   9.33             42.0
CS2       Knee point      11.6             19.8
CS2       Optimal point   12.1             18.1

(44) While exemplary embodiments of the present disclosure have been described with reference to the accompanying drawings, those skilled in the art will appreciate that the present disclosure can be implemented in other different forms without departing from the technical spirit or essential characteristics of the exemplary embodiments. Therefore, it is noted that the exemplary embodiments described above are only for illustrative purposes and are not restrictive in all aspects.