METHOD FOR DETERMINING PROCESS VARIABLES IN CELL CULTIVATION PROCESSES

20220306979 · 2022-09-29

Abstract

High-throughput cultivation systems are used in pharmaceutical research and development. In this context, samples are taken and analyzed for important parameters by external analysis. The analysis results serve to assess the cultivation process and provide important information about the process. Especially with cultivations carried out in parallel, the manual effort of sample preparation is considerable and can lead to errors. In order to avoid the need for sampling and thus to minimize errors, the present patent application describes a method which makes desired target parameters accessible in the form of soft sensors by means of previously recorded process variables. Described herein is a method for determining process-relevant parameters, in particular glucose, lactate and the viable cell density or the viable cell volume, in CHO (Chinese hamster ovary) processes in high-throughput cultivations.

Claims

1. A method for adjusting the glucose concentration to a target value during mammalian cell cultivation, comprising the following steps: a) determining the current values at least for the process variables ‘Time’, ‘CHT.PV’, ‘ACOT.PV’, ‘FED2T.PV’, ‘GEW.PV’, ‘CO2T.PV’, ‘ACO.PV’, ‘AO.PV’, ‘N2.PV’, ‘CO2.PV’, ‘FED3T.PV’, ‘OUR’ and ‘PH.PV’ in the cultivation, b) determining the current glucose concentration in the cultivation medium using the measured values from a) by means of a data-driven model for the mammalian cell cultivation, which was generated using a feature matrix comprising the process variables ‘Time’, ‘CHT.PV’, ‘ACOT.PV’, ‘FED2T.PV’, ‘GEW.PV’, ‘CO2T.PV’, ‘ACO.PV’, ‘AO.PV’, ‘N2.PV’, ‘CO2.PV’, ‘FED3T.PV’, ‘OUR’ and ‘PH.PV’, and c) adding glucose until the target value is reached if the current glucose concentration from b) is lower than the target value, and thus adjusting the glucose concentration to the target value.

2. The method according to claim 1, characterized in that the process variable(s) is/are selected from the group comprising the process variables viable cell density, viable cell volume, glucose concentration in the cultivation medium, and lactate concentration in the cultivation medium.

3. The method according to one of claims 1 through 2, characterized in that the method is carried out without sampling and exclusively using on-line measured values from this cultivation.

4. The method according to one of claims 1 through 3, characterized in that the data-driven model is generated by means of machine learning.

5. The method according to one of claims 1 through 4, characterized in that the data-driven model is generated with the random forest method.

6. The method according to one of claims 1 through 5, characterized in that the data-driven model is generated with a training dataset, which comprises at least 10 cultivation runs.

7. The method according to any one of claims 1 through 6, characterized in that a) the datasets available for modeling are randomly divided into training and test datasets in a ratio between 70:30 and 80:20, b) the model is formed, c) the mean value and the standard deviation for the determination of the process variable are determined for the datasets from the training dataset, and the mean value and the standard deviation for the determination of the process variable are determined for the datasets from the test dataset, d) steps a) to c) are repeated until comparable mean values and standard deviations with regard to the division between test and training dataset are achieved, wherein the division obtained under a) is different with each new run.

8. The method according to one of claims 1 through 7, characterized in that the datasets, which are used to generate the data-driven model, each contain the same number of data points.

9. The method according to one of claims 1 through 8, characterized in that the data points in the datasets, which are used to generate the data-driven model, are each for the same times of the cultivation.

10. The method according to one of claims 1 through 9, characterized in that missing data points in the datasets are supplemented by interpolation.

11. The method according to claim 10, characterized in that missing data points for the glucose concentration and/or viable cell volume are obtained by a third degree polynomial fit, that missing data points for the lactate concentration are obtained by univariate spline fit, and/or that missing data points for the viable cell density can be obtained through a Peleg fit.

12. The method according to one of claims 1 through 11, characterized in that the datasets contain a data point at least every 144 minutes.

13. The method according to any one of claims 1 through 12, characterized in that the mammalian cell is a CHO-K1 cell.

14. The method according to any one of claims 1 through 13, characterized in that the mammalian cell expresses and secretes an antibody.

15. The method according to any one of claims 1 through 14, characterized in that the data-driven model is generated with a training dataset that contains complex IgG cultivation runs and standard IgG cultivation runs.

16. The method according to any one of claims 1 through 15, characterized in that the cultivation volume is 300 mL or less.

Description

DESCRIPTION OF THE FIGURES

[0252] FIG. 1: Linear interpolated measured value using the example of ACO.PV. Interpolation range from day 0.5 to day 13.5.

[0253] FIG. 2: Interpolated measurement curve of the live cell density of an exemplary cultivation. Interpolation and coefficient of determination: Peleg fit (R2=0.957), univariate spline (R2=0.998) and third degree polyfit (R2=0.864).

[0254] FIG. 3: Exemplary correlation analysis of the dataset of an ambr250 run from Project 2. Comparison of the correlation coefficients for the different interpolation strategies. The diagram shows the scatter plots of the individual on-line parameters against the VCD.

[0255] FIG. 4: Information content calculated according to the mutual information for the target variable VCD on the entire dataset.

[0256] FIG. 5: Estimation of the VCD by the random forest for two separate runs. In the upper portion of the figure, an estimate with an R2 of 0.20317 was achieved. In the lower portion of the figure, an estimate with an R2 of 0.54896 was achieved.

[0257] FIG. 6: Histogram of the prediction on the newly created test dataset of the models MLPRegressor (a), random forest (b) and XGBoost (c) for the target variable ‘VCD’. The X-axis shows the error of the predicted values relative to the fitted VCD values. The Y-axis indicates the relative frequency of the errors.

[0258] FIG. 7: Estimation of the VCD of the random forest for two exemplary runs of the test dataset. In the upper portion of the figure, an estimate of R2 of 0.98944 was achieved. An estimate of R2 of 0.99837 could be achieved in the lower portion of the figure.

[0259] FIG. 8: Information content calculated for the entire dataset according to mutual information for the target variable glucose.

[0260] FIG. 9: Estimation of the glucose from the random forest for two exemplary runs of the test dataset. In the upper portion of the figure, an estimate of R2 of 0.99 could be achieved. In the lower portion of the figure, an estimate of R2 of 0.97 could be achieved.

[0261] FIG. 10: Information content calculated for the entire dataset according to mutual information for the target variable lactate.

[0262] FIG. 11: Histograms of the prediction for the test dataset of the MLPRegressor (a), random forest (b) and XGBoost (c) for the target variable lactate. The X-axis shows the error of the predicted values relative to the fitted lactate values. The Y-axis indicates the relative frequency of the errors.

[0263] FIG. 12: Estimation of lactate by the XGBoost for two exemplary runs of the test dataset. In the upper portion of the figure, an estimate of R2 of 0.99 could be achieved. In the lower portion of the figure, an estimate of R2 of 0.98 could be achieved.

[0264] FIG. 13: Calculated RMSE for MLPRegressor, random forest and XGBoost with a different number of training datasets.

[0265] FIG. 14: Estimation of the VCD by the random forest for a single cultivation. The Peleg fit of the VCD is shown in blue, the estimated values for the VCD in orange.

[0266] FIG. 15: Representations of the average diameter for each sampling over the entire cultivation period. Projects 1 and 3 have a complex molecular format (shown here in blue, left) as a product. Projects 2 and 4 have a Y-shaped Ig-G format (shown here in green, right) as the target product. Box plots contain the mean; the units were shown standardized.

[0267] FIG. 16: Left portion of the figure: Estimation of the random forest on the VCD. In red, the estimated values for the test dataset against the true values. In blue, the estimated values for the training dataset against the true values. An ideal estimate for the test and training datasets is shown in black. Right portion of the figure: Estimation of the random forest on the VCV. In red, the estimated values for the test dataset against the true values. In blue, the estimated values for the training dataset against the true values. An ideal estimate for the test and training datasets is shown in black.

[0268] FIG. 17: Representation of the average diameter for each sample over the entire cultivation period for each project. Project 1=purple, Project 2=red, Project 3=green and Project 4=blue. Box plots contain the mean.

[0269] FIG. 18: Comparison of the VCD and VCV estimates using the random forest model (best model).

[0270] FIG. 19: Behavior of the RMSE considering all models (MLPRegressor, random forest, XGBoost) with the training dataset depending on the target parameter VCV.

[0271] FIG. 20: Bar chart of the difference of the RMSE for the test and training datasets, the best models for the target variable VCV.

REFERENCES

[0272] [1] J. Glassey, et al., Biotechnol. J. 6 (2011) 369-377.
[0273] [2] F. Garcia-Ochoa, et al., Biochem. Eng. J. 49 (2010) 289-307.
[0274] [3] E. Trummer, et al., Biotechnol. Bioeng. 94 (2006) 1033-1044.
[0275] [4] Y.-M. Huang, et al., Biotechnol. Prog. 26 (2010) 1400-1410.
[0276] [5] B. C. Mulukutla, et al., Metab. Eng. 14 (2012) 138-149.
[0277] [6] R. Luttmann, et al., Biotechnol. J. 7 (2012) 1040-1048.
[0278] [7] T. Becker and D. Krause, Chem. Ing. Tech. 82 (2010) 429-440.
[0279] [8] L. Z. Chen, et al., Bioprocess Biosyst. Eng. 26 (2004) 191-195.
[0280] [9] P. Kroll, et al., Biotechnol. Lett. 39 (2017) 1667-1673.
[0281] [10] S. Raschka and V. Mirjalili, Machine Learning with Python, scikit-learn and TensorFlow: The Comprehensive Practice Manual for Data Science, Deep Learning and Predictive Analytics. Frechen: mitp, 2nd updated and expanded edition, 2018.
[0282] [11] C.-W. Hsu, et al., "A practical guide to support vector classification," Taipei, pp. 1-16, 2003.
[0283] [12] R. Kohavi, et al., IJCAI 14 (1995) 1137-1145.
[0284] [13] W. S. McCulloch and W. Pitts, Bull. Math. Biophys. 5 (1943) 115-133.
[0285] [14] F. Rosenblatt, Psychol. Rev. 65 (1958) 386-408.
[0286] [15] D. Kriesel, ed., A Brief Overview of Neural Networks.
[0287] [16] W. Lu, "Neural network models for distortional buckling behavior of cold-formed steel compression members," 2000.
[0288] [17] R. Rojas, Neural Networks: A Systematic Introduction. Berlin and Heidelberg: Springer, 1996.
[0289] [18] L. Breiman, "Random forests," Mach. Learn. 45 (2001) 5-32.
[0290] [19] R. O. Duda, et al., Pattern Classification. Wiley Interscience, 2nd ed., 2012.
[0291] [20] L. Breiman, Mach. Learn. 24 (1996) 123-140.
[0292] [21] Y. Freund and R. E. Schapire, J. Comput. Syst. Sci. 55 (1997) 119-139.
[0293] [22] T. Chen and C. Guestrin, "XGBoost," in Proceedings of the 22nd ACM SIGKDD International Conference (B. Krishnapuram, M. Shah, A. Smola, C. Aggarwal, D. Shen, and R. Rastogi, eds.), pp. 785-794.
[0294] [23] L. Fahrmeir, et al., Statistics: The Path to Data Analysis. Berlin and Heidelberg: Springer, 4th improved edition, 2003.
[0295] [24] L. F. Kozachenko, et al., Probl. Peredachi Inf. 23 (1987) 9-16.
[0296] [25] A. Kraskov, et al., Phys. Rev. E 69 (2004) 066138.
[0297] [26] B. C. Ross, PLoS ONE 9 (2014) e87357.
[0298] [27] M. Peleg, J. Sci. Food Agric. 71 (1996) 225-230.

LIST OF ABBREVIATIONS

[0299] ambr      automated microscale bioreactor
[0300] ATP       Adenosine triphosphate
[0301] Bagging   Bootstrap aggregating
[0302] CHO       Chinese Hamster Ovary
[0303] CIP       Cleaning in Place
[0304] FDA       Food and Drug Administration
[0305] GMP       Good Manufacturing Practice
[0306] ANN       Artificial Neural Network
[0307] MLP       Multilayer Perceptron
[0308] NADPH     Nicotinamide adenine dinucleotide phosphate
[0309] OUR       Oxygen Uptake Rate
[0310] OTR       Oxygen Transfer Rate
[0311] PAT       Process Analytical Technology
[0312] RF        Random Forest
[0313] SIP       Sterilization in Place
[0314] VCD       Viable Cell Density
[0315] VCV       Viable Cell Volume
[0316] XGBoost   eXtreme Gradient Boosting

TABLE-US-00013 List of Symbols
Symbol                 Description                                  Unit
ACO.PV                 Exhaust gas CO2                              in %
ACOT.PV                Exhaust gas CO2 (total)                      in %
AO.PV                  Exhaust gas O2                               in %
CHT.PV                 Chiller temperature                          —
CO2.PV                 CO2                                          in mL/min
CO2T.PV                CO2 (total)                                  in mL/min
D                      Average diameter                             in μm
DRZ.PV                 Rotational speed                             in rpm
e                      Residuals                                    —
FED2T.PV               FEED2 (total)                                in mL
FED3T.PV               FEED3 (total)                                in mL
fk                     Estimate of a decision tree                  —
GEW.PV                 Fermenter volume                             in g
Glucose                Glucose                                      in mg/L
Glucose (standardized) Standardized glucose (standardized values)   in AU
K                      Number of decision trees                     —
L                      Loss function                                —
Lactate                Lactate                                      in mg/L
Lactate (standardized) Standardized lactate (standardized values)   in AU
LGE.PV                 Hydroxide solution                           in mL
N2.PV                  N2                                           in mL/min
O                      Output of the neural network                 —
O2.PV                  O2                                           in mL/min
OUR                    OUR                                          in mol/(L * h)
PH.PV                  pH                                           —
PO2.PV                 pO2                                          in %
Process time           Process time                                 in days (d)
T                      Number of leaves                             —
TEF.PV                 Temperature                                  in ° C.
VCD                    Viable cell density                          in 10{circumflex over ( )}5 cells/mL
VCDnorm                Standardized VCD (standardized values)       in AU
VCV                    Viable cell volume                           in 10{circumflex over ( )}−7 mL (cells)/mL (cell suspension)
w1, . . . , wn         Weightings                                   —
X                      Feature matrix of influencing parameters     —
x1, . . . , xn         Influencing parameters                       —
y                      Actual (real) target value                   —
ŷ                      Estimated target value                       —
ȳ                      Mean of the real target value                —
o                      Propagation function                         —
ϕ                      Activation function                          —
Ω                      Regularization term                          —
λ                      Regularization parameter                     —
γ                      Error minimization                           —

Materials

Software:

[0317] The programming language Python was used in the Spyder development environment for the entire work. The implementation was carried out in object-oriented programming. Several classes were written, which implemented individual tasks within the project.

TABLE-US-00014 Software used
Name          Function                              Version   Company
Spyder        Programming interface                 3.2.8     Anaconda
scikit-learn  Python library for machine learning   tbd       sklearn
PI system     Database system                       ???       OSIsoft
Spotfire      Visualization tool                    7.7       TIBCO
JMP®          Statistics program                    12.1.0    SAS
Excel®        Spreadsheet                           2016      Microsoft
Notepad++     Text editor                           7.5.4     Notepad++

Methods

Data Processing

[0318] The entire dataset included 155 cultivation runs. These were broken down into on-line and off-line data. Data processing was implemented with Spyder in the Python programming language. The data were available as CSV files and were read in with the "csv" program library, which allows data to be read in quickly and converted into new data structures within the development environment. A "PIFileParser" class for the on-line data and an "Off-lineDataParser" class for the off-line data were implemented.
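The parser classes themselves are not reproduced in this application; the following is a minimal sketch of such a parser, assuming a hypothetical CSV layout of "timestamp,tag,value" records (the column names and the class interface are illustrative, not the project's actual code):

```python
import csv
import io

class PIFileParser:
    """Illustrative parser for on-line data; assumes one header row,
    then 'timestamp,tag,value' records (hypothetical layout)."""

    def parse(self, text):
        reader = csv.DictReader(io.StringIO(text))
        records = {}
        for row in reader:
            # Group measured values per PI tag, e.g. 'ACO.PV'
            records.setdefault(row["tag"], []).append(
                (float(row["timestamp"]), float(row["value"]))
            )
        return records

raw = "timestamp,tag,value\n0.5,ACO.PV,4.1\n1.0,ACO.PV,4.3\n0.5,PH.PV,7.0\n"
data = PIFileParser().parse(raw)
print(sorted(data))    # ['ACO.PV', 'PH.PV']
print(data["ACO.PV"])  # [(0.5, 4.1), (1.0, 4.3)]
```

An off-line parser would follow the same pattern with the sampling-time columns of the off-line files.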

Interpolation

[0319] Since the data were available at different data densities, they had to be interpolated accordingly. A linear interpolation and an interpolation using the moving-average method were used for this purpose. Both functions were implemented with the "scipy" library: linear interpolation with "scipy.interpolate.interp1d" and the moving average by convolution with a normalized window ("convolve"). This ensured that the interpolated values were always between two raw measured values. The interpolation therefore always lies within the natural fluctuation range of the measurement signal of the process variables. Since each process variable had different time stamps within a file, it was necessary to create another CSV file. The "TimelineMapping" contained all start and end times of the respective cultivations and was created by another database query. Three different intervals were selected for the resolution of the data: [0320] Time stamps of the associated sampling times of the off-line data [0321] 1/10 days [0322] Five minutes
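The on-line interpolation step can be sketched as follows. This is an illustration on synthetic ACO.PV-like data using numpy's equivalents of the named functions (`np.interp` for linear interpolation, `np.convolve` for the moving average); the time stamps and values are invented:

```python
import numpy as np

# Synthetic on-line signal with irregular time stamps (days), e.g. ACO.PV
t_raw = np.array([0.0, 0.4, 1.1, 1.9, 3.0])
y_raw = np.array([4.0, 4.2, 4.1, 4.5, 4.4])

# Linear interpolation onto a regular 1/10-day grid; every interpolated
# value lies between its two neighboring raw measurements
t_grid = np.arange(0.0, 3.01, 0.1)
y_grid = np.interp(t_grid, t_raw, y_raw)

# Moving average by convolution with a normalized window of width 5
window = np.ones(5) / 5.0
y_smooth = np.convolve(y_grid, window, mode="same")

# Interpolated values stay within the raw signal's fluctuation range
assert y_raw.min() <= y_grid.min() and y_grid.max() <= y_raw.max()
```

The same call with a coarser grid yields the sampling-time and five-minute resolutions listed above.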

[0323] Due to the considerably lower data density and the non-linear data course, no linear interpolation was applied to the off-line data. Here three different interpolation strategies were used for fitting: [0324] Peleg fit [0325] Polynomial fit [0326] Spline

[0327] The interpolation according to M. Peleg is able to map biological growth through additional functional terms and thus to describe the course of growth well [27]. Therefore, the raw data of the live cell density were fitted with all three interpolations. For glucose and lactate, an interpolation was carried out using the polynomial and spline method, since no biological behavior was to be assumed here. The on-line and off-line datasets were merged for the different intervals and saved as a CSV file for each cultivation. The correlation analysis was then carried out based on these datasets.
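The Peleg fit of the off-line growth data can be sketched with a least-squares fit of the Peleg model [27], here in the form y(t) = y0 + t/(k1 + k2*t). The data below are synthetic and the parameter values are illustrative, not measured VCD values:

```python
import numpy as np
from scipy.optimize import curve_fit

def peleg(t, y0, k1, k2):
    # Peleg model [27]: saturating course, y(t) = y0 + t / (k1 + k2 * t)
    return y0 + t / (k1 + k2 * t)

# Synthetic daily samples over a 14-day cultivation (arbitrary units)
t = np.arange(0.0, 14.0, 1.0)
y = peleg(t, 1.0, 0.5, 0.08) + 0.01 * np.sin(t)  # small deterministic ripple

# Fit the three Peleg parameters to the sampled data
params, _ = curve_fit(peleg, t, y, p0=(1.0, 1.0, 0.1))

# Dense interpolated curve for merging with the on-line grid
t_dense = np.linspace(0.0, 13.5, 136)
y_dense = peleg(t_dense, *params)
```

Polynomial and spline fits for glucose and lactate would use `numpy.polyfit` and `scipy.interpolate.UnivariateSpline` in the same way.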

Correlation Analysis

[0328] The correlation analysis was carried out with JMP®. With JMP® it is possible to apply statistical analysis to datasets. Multivariate statistics of the on-line data (feature) related to the respective target variable (lactate, glucose, VCD, VCV) was applied. The data are analyzed both for statistical significance in describing the target variables and for linear relationships. The correlation analysis shows linear relationships between an independent and a dependent variable, in the form of the correlation coefficient according to Bravais-Pearson.
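Although the analysis was performed in JMP®, the Bravais-Pearson coefficient between an on-line feature and a target variable can equally be computed in Python; the toy values below are invented:

```python
import pandas as pd

# Toy merged dataset: one on-line feature and one off-line target variable
df = pd.DataFrame({
    "OUR": [0.1, 0.2, 0.3, 0.4, 0.5],   # on-line feature
    "VCD": [1.0, 2.1, 2.9, 4.2, 5.0],   # target variable
})

# Bravais-Pearson correlation coefficient between feature and target
r = df["OUR"].corr(df["VCD"])

# For a full feature matrix, df.corr() yields the pairwise matrix
corr_matrix = df.corr()
```

A coefficient near 1 or −1 indicates a strong linear relationship; values near 0 indicate none.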

Mutual Information

[0329] Another method of identifying suitable features was used in the form of mutual information. In determination by means of mutual information, the information content contained in an independent variable X for describing the target variable Y is determined. The dependencies were calculated with "sklearn" by means of "mutual_info_regression". Owing to the size of the datasets at a resolution of five minutes, the information content was calculated separately for each cultivation, and the mean of the values obtained was then formed across all cultivations.
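A minimal sketch of this step with `sklearn.feature_selection.mutual_info_regression` on synthetic data (the feature matrix and the nonlinear relationship are invented for illustration):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
# Hypothetical feature matrix: column 0 informative, column 1 pure noise
X = rng.normal(size=(500, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

# Information content of each feature with respect to the target
mi = mutual_info_regression(X, y, random_state=0)

# In the project, this would be run per cultivation and then averaged
assert mi[0] > mi[1]  # the informative feature carries more information
```

Unlike the Pearson coefficient, mutual information also captures the nonlinear sine relationship here.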

Creation of a Feature Matrix/Results Vector

[0330] The feature matrix was created from the results of the correlation analysis and the statistical evaluation based on the information content. It contains one feature per column and, per row, one point in time with the respective value of each feature. The feature matrix was saved as a pandas DataFrame. Thus, a suitable file format was available for training and testing the models.

Modeling and Evaluation

[0331] With the help of the results of the correlation analysis, a separate dataset was created for each target variable. To train the models, the feature matrix had to be divided into a training and a test dataset. Later use for on-line prediction required that a complete validation project be withheld. The training dataset contained 80% of the entire dataset, i.e., 123 cultivation runs.

[0332] Since all target variables to be predicted are continuous parameters, only regressors were used as models. A number of hyperparameters, which differed from model to model, were available for the models. The training of the models thus served to adapt the hyperparameters so as to map the target variable as precisely as possible.

[0333] For the training itself, the entire feature matrix was standardized with the standard scaler of the Scikit-Learn library.
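The split and standardization can be sketched as follows; the data are synthetic, and, as in the text, the standard scaler is applied to the entire feature matrix before splitting:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=(150, 4))  # toy feature matrix
y = X @ np.array([1.0, -0.5, 2.0, 0.3])            # toy target variable

# Standardize the entire feature matrix (zero mean, unit variance per column)
X_std = StandardScaler().fit_transform(X)

# 80:20 division into training and test datasets
X_train, X_test, y_train, y_test = train_test_split(
    X_std, y, test_size=0.2, random_state=42
)
```

Note that a validation project withheld for on-line prediction would be removed from `X` before this split.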

Optimization of Hyperparameters

[0334] The hyperparameters were optimized with a randomized search (RandomizedSearchCV) and a grid-based search (GridSearchCV) from the Scikit-Learn library. All models were trained using the randomized search (RandomizedSearchCV) of the Scikit-Learn library in combination with a tenfold cross-validation of the training dataset. The various ranges of the hyperparameters were examined for the smallest RMSE. The randomized search was carried out 30 times; accordingly, a different, randomly selected set of hyperparameters was used in each iteration. The hyperparameters of the ten models with the smallest RMSE were output. Then, based on the hyperparameters from the randomized search, the hyperparameter grid for the grid search was graded more finely. The grid search was again carried out with a tenfold cross-validation of the dataset. The model with the smallest error (smallest RMSE) was saved and then used to estimate the target variables from the test dataset.
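The two-stage search can be sketched with a random forest on synthetic data; the search space, iteration count and data are scaled down for illustration (the text uses 30 randomized iterations):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=120)

# Step 1: randomized search over the hyperparameter space,
# tenfold cross-validation, RMSE as the error criterion
space = {"n_estimators": [10, 30, 60], "max_depth": [3, 5, None]}
rand = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    space, n_iter=4, cv=10,
    scoring="neg_root_mean_squared_error", random_state=0,
)
rand.fit(X, y)

# Step 2: finer grid search around the best randomized-search result
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": [rand.best_params_["n_estimators"]],
     "max_depth": [rand.best_params_["max_depth"]]},
    cv=10, scoring="neg_root_mean_squared_error",
)
grid.fit(X, y)
best_model = grid.best_estimator_  # smallest RMSE; applied to the test dataset
```

In practice the grid in step 2 would span several finely graded values around each best parameter rather than a single value.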

Multilayer Perceptron

[0335] The Scikit-Learn library was used to implement the multilayer perceptron (MLP). The following list contains the hyperparameters that were used to train the models: [0336] Number of neurons in the input layer [0337] Number of neurons in the hidden layer [0338] Solver algorithms (adam, lbfgs, sgd) for setting the weights [0339] Activation functions (identity, logistic, tanh, relu) [0340] Learning rate [0341] Maximum number of iterations
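The listed hyperparameters map onto `MLPRegressor` arguments roughly as sketched below; the value ranges are illustrative, and note that in Scikit-Learn the input-layer size is fixed implicitly by the number of feature-matrix columns:

```python
from sklearn.neural_network import MLPRegressor

# Illustrative hyperparameter space (assumed ranges, not the project's)
mlp_space = {
    "hidden_layer_sizes": [(10,), (50,), (50, 20)],  # neurons in hidden layer(s)
    "solver": ["adam", "lbfgs", "sgd"],              # weight-setting algorithms
    "activation": ["identity", "logistic", "tanh", "relu"],
    "learning_rate_init": [1e-4, 1e-3, 1e-2],        # learning rate
    "max_iter": [200, 500, 1000],                    # maximum iterations
}

# One concrete configuration drawn from the space, fitted on toy data
mlp = MLPRegressor(hidden_layer_sizes=(10,), solver="lbfgs",
                   activation="relu", max_iter=500, random_state=0)
mlp.fit([[0.0], [1.0], [2.0], [3.0]], [0.0, 1.0, 2.0, 3.0])
```

The dictionary can be passed to `RandomizedSearchCV` as in the optimization step above.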

Random Forest

[0342] The random forest was also implemented by the Scikit-Learn library. The following candidates were available as hyperparameters within this optimization: [0343] Number of decision trees [0344] Number of features per decision tree [0345] Maximum depth of the decision tree [0346] Minimum number of datasets to create a new node [0347] Methodology for selecting the datasets (bootstrap=true/false)
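The candidates map onto `RandomForestRegressor` arguments roughly as follows (illustrative value ranges; note that `max_features` limits the features considered per split within each tree):

```python
from sklearn.ensemble import RandomForestRegressor

# Illustrative hyperparameter space (assumed ranges, not the project's)
rf_space = {
    "n_estimators": [100, 300, 500],        # number of decision trees
    "max_features": ["sqrt", "log2", 1.0],  # features per tree/split
    "max_depth": [5, 10, None],             # maximum depth of a tree
    "min_samples_split": [2, 5, 10],        # min. datasets to create a node
    "bootstrap": [True, False],             # dataset selection methodology
}

# One concrete configuration, fitted on toy data
rf = RandomForestRegressor(n_estimators=100, max_features="sqrt",
                           max_depth=5, min_samples_split=2,
                           bootstrap=True, random_state=0)
rf.fit([[0, 1], [1, 2], [2, 3], [3, 4], [4, 5], [5, 6]],
       [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
```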

XGBoost

[0348] The XGBoost algorithm was integrated into the project structure through the XGBoost library. The hyperparameter space comprised: [0349] Number of regression trees in the ensemble [0350] Maximum depth of the regression trees [0351] Learning rate η [0352] Number of datasets per decision tree [0353] Minimum weight of a child node in the decision tree [0354] γ error evaluation
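These candidates correspond to the following arguments of the XGBoost library's scikit-learn-style `XGBRegressor`; the value ranges are illustrative assumptions:

```python
# Illustrative hyperparameter space for xgboost.XGBRegressor
# (assumed ranges, not the project's actual grid)
xgb_space = {
    "n_estimators": [100, 300, 500],    # regression trees in the ensemble
    "max_depth": [3, 6, 10],            # maximum depth of the regression trees
    "learning_rate": [0.01, 0.1, 0.3],  # learning rate η
    "subsample": [0.5, 0.8, 1.0],       # share of datasets per decision tree
    "min_child_weight": [1, 5, 10],     # minimum weight of a child node
    "gamma": [0.0, 0.1, 1.0],           # γ error evaluation (split penalty)
}

# As with the other models, this dictionary can be passed to
# RandomizedSearchCV together with an xgboost.XGBRegressor() instance.
```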

Model Evaluation

[0355] The model evaluation was primarily implemented by displaying an error histogram. This shows the errors (residuals) of the model's predictions on the test dataset relative to the actual values of the target parameter.

[0356] The RMSE was calculated for the accuracy of the estimation of the target parameter and compared with the mean value of the target parameter.

[0357] In order to examine the models for overfitting, the RMSE was calculated for the entire training and test dataset. The difference between the two errors was used as an indication of overfitting of the models:


Overfit=RMSE.sub.test−RMSE.sub.train

[0358] To further describe the model quality, the coefficient of determination for the entire test dataset and for each cultivation considered individually was used.
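The evaluation quantities (RMSE, the overfitting indicator Overfit = RMSE_test − RMSE_train, residuals for the histogram, and R2) can be sketched as follows; the prediction values are invented toy numbers:

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy predictions for training and test datasets (assumed values)
y_train_true = np.array([1.0, 2.0, 3.0, 4.0])
y_train_pred = np.array([1.1, 1.9, 3.0, 4.2])
y_test_true = np.array([2.5, 3.5])
y_test_pred = np.array([2.2, 3.9])

def rmse(y_true, y_pred):
    # Root mean squared error
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

rmse_train = rmse(y_train_true, y_train_pred)
rmse_test = rmse(y_test_true, y_test_pred)

# Overfitting indicator: Overfit = RMSE_test - RMSE_train
overfit = rmse_test - rmse_train

# Coefficient of determination on the test dataset
r2 = r2_score(y_test_true, y_test_pred)

# Residuals for the error histogram
residuals = y_test_pred - y_test_true
```

A large positive `overfit` value indicates that the model performs markedly worse on unseen data than on the training data.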

Example 1

Ambr250-Cultivation

[0359] 155 datasets based on cultivations in the ambr250 system were collected. The eukaryotic cells used were CHO cells that expressed a target molecule extracellularly. The cultivations were carried out using the fed-batch process. The ambr system used enables twelve cultivations to be carried out simultaneously. The cultivation time of the main culture was 13 to 14 days. The single-use bioreactors (250 mL) provided the reaction space for this. The pre-culturing was carried out in shake flasks and lasted three weeks. The starting conditions in terms of volume and number of cells at the time of inoculation were comparable in each reactor. The media used were exclusively chemically defined media. Only one medium batch was used per cultivation.

[0360] In order to provide optimal cultivation conditions within this system, a number of process variables were available. The parameters to be controlled were pH, temperature and the dissolved oxygen concentration in the medium. The following table contains a complete list of all process variables used for this work.

TABLE-US-00015 TABLE 12 On-line measured parameters.
Process variable               PI TAG             Unit
Exhaust gas CO.sub.2           ACO.PV             in %
Exhaust gas CO.sub.2 (total)   ACOT.PV            in %
Exhaust gas O.sub.2            AO.PV              in %
Chiller temperature            CHT.PV             —
CO.sub.2                       CO2.PV             in mL/min
CO.sub.2 (total)               CO2T.PV            in mL/min
Rotational speed               DRZ.PV             in rpm
FEED2 (total)                  FED2T.PV           in mL
FEED3 (total)                  FED3T.PV           in mL
Fermenter volume               GEW.PV             in g
Hydroxide solution             LGE.PV             in mL
N.sub.2                        N2.PV              in mL/min
O.sub.2                        O2.PV              in mL/min
OUR                            OUR                in mol/(L * h)
pO.sub.2                       PO2.PV             in %
Process time                   Process time (d)   in days
Temperature                    TEF.PV             in ° C.

[0361] All measured variables were recorded over the entire cultivation period by what is termed a PI system. The PI system only contains on-line measured variables.

[0362] The parameters listed here were available for monitoring optimal cultivation conditions. An exhaust gas analyzer from BlueSens was also available for each reactor. It detects the O.sub.2 and CO.sub.2 content in the exhaust gas flow from the bioreactors and thus provided another important component of the process control. These two measured variables in the exhaust gas flow can be used to determine the OUR and OTR.

[0363] Samples were taken daily during cultivation. These were then analyzed for various metabolite concentrations and product titers using the Cedex Bio HT® (Roche Diagnostics GmbH, Mannheim, Germany).

[0364] Furthermore, the cell count measurement was carried out. The measurement provides information about viable cell density, total cell density, viability, aggregation rate and cell diameter. These parameters can be used to infer the growth behavior of the culture. The off-line quantities were measured by the Cedex HiRes® (Roche Diagnostics GmbH, Mannheim, Germany) cell counter. The error of these cell counting and cell analysis systems is in a range of 10%. All off-line measured quantities used are shown in the following table.

TABLE-US-00016 TABLE 13 Off-line measured variables.
Process variable                    Description     Unit
Live cell density (Cedex HiRes®)    VCD             in cells/mL
Cell diameter (Cedex HiRes®)        Avg. diameter   in μm
Lactate (Cedex Bio HT®)             Lactate         in mg/L
Glucose (Cedex Bio HT®)             Glucose         in mg/L