METHOD AND SYSTEM FOR PREDICTION OF A PERFORMANCE OF A STRAIN IN A PLANT

Abstract

A method and system for predicting performance of strains in processes, the strains being capable of fermentation of biomass for production of at least bio-ethanol, the method including the steps of: receiving a first process data set related to a performance of a first strain in a first process for producing bio-ethanol at a first site, receiving a second process data set related to a performance of a second strain in the first process for producing bio-ethanol at the first site, receiving a third process data set related to a performance of the first strain in a second process for producing bio-ethanol at a second site, the second site being different from the first site, and wherein the first, second and third process data sets each include one or more process profiles and/or process responses, determining a first correlation between the first process data set and the second process data set, and determining a second correlation between the first process data and the third process data, and reconstructing a fourth process data set related to a performance of the second strain in the second process for producing bio-ethanol at the second site by missing data imputation, wherein the fourth process data set is estimated based on the first correlation and the second correlation.

Claims

1. A computer-implemented method for predicting performance of one or more strains in one or more processes, the strains being capable of fermentation of biomass for production of at least bio-ethanol, the method comprising: receiving a first process data set related to a performance of a first strain in a first process for producing bio-ethanol at a first site, receiving a second process data set related to a performance of a second strain in the first process for producing bio-ethanol at the first site, receiving a third process data set related to a performance of the first strain in a second process for producing bio-ethanol at a second site, the second site being different from the first site, and wherein the first, second and third process data sets each include one or more process profiles and/or process responses, determining a first correlation between the first process data set and the second process data set, and determining a second correlation between the first process data and the third process data, reconstructing a fourth process data set related to a performance of the second strain in the second process for producing bio-ethanol at the second site by missing data imputation, wherein the fourth process data set is estimated based on the first correlation and the second correlation, and using the reconstructed fourth process data set as a prediction of the performance of the second strain in the second process at the second site.

2. The method according to claim 1, wherein the reconstructed fourth process data set is used for fitting a predictive model configured to predict the performance of the second strain in the second process at the second site.

3. The method according to claim 2, wherein the predictive model is employed for adjusting operational parameters in order to improve the performance of the second strain in the second process at the second site.

4. The method according to claim 1, wherein the first process at the first site is carried out in a laboratory, and wherein the second process at the second site is carried out in a plant, the plant optionally being an industrial-scale bio-ethanol production plant.

5. The method according to claim 4, wherein one or more small-scale laboratory experiments are carried out in the laboratory for determining at least one of the first process data set or the second process data set.

6. The method according to claim 1, wherein the first process at the first site is modelled by means of a computational model, wherein the computational model is used for determining at least one of the first process data set or the second process data set.

7. The method according to claim 1, wherein missing data related to the performance of the second strain in the second process at the second site is predicted at least in part using a regression model.

8. The method according to claim 7, wherein the regression model includes at least one of: multivariate regression, principal component regression, partial least squares regression, or trimmed scores regression for missing data imputation.

9. The method according to claim 1, wherein prior to determining the second correlation, data arrays in the data set relating to different batches in the first process data and the third process data are shuffled with respect to each other.

10. The method according to claim 9, wherein the data arrays are shuffled randomly or pseudo-randomly.

11. The method according to claim 1, wherein missing data related to the performance of the second strain in the second process at the second site is predicted at least in part using a trained artificial neural network model.

12. The method according to claim 1, wherein the first process at the first site and the second process at the second site are carried out in industrial-scale bio-ethanol production plants different from each other, optionally also at remote locations with respect to each other.

13. The method according to claim 1, wherein the process data sets include for a plurality of time points a value indicative of at least one of a sugar consumption, ethanol production, pH value, reaction temperature, composition of biomass, enzyme composition, yeast cell count, or glycerol production, wherein optionally the process data sets further include data relating to a plurality of batch processes.

14. A system for predicting performance of one or more strains in one or more processes, the system including computational means for carrying out the method according to claim 1.

15. A computer program product configured to be run on a computer for predicting performance of one or more strains in one or more processes, the strains being capable of fermentation of biomass for production of at least bio-ethanol, the computer program product being configured to perform the method according to claim 1.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0058] The invention will further be elucidated on the basis of exemplary embodiments which are represented in a drawing. The exemplary embodiments are given by way of non-limitative illustration. It is noted that the figures are only schematic representations of embodiments of the invention that are given by way of non-limiting example.

[0059] In the drawings:

[0060] FIG. 1 shows a schematic diagram of an embodiment of a method for predicting performance of strains in processes;

[0061] FIG. 2 shows a schematic diagram of an embodiment of input parameters and process data sets;

[0062] FIG. 3 shows a schematic diagram of an embodiment of a method wherein the first process at the first site is a laboratory environment, and wherein the second process at the second site is a plant;

[0063] FIG. 4 shows a schematic diagram of an embodiment of a method wherein the reconstructed fourth process data set is used for fitting a predictive model configured to predict the performance of the second strain in the second process at the second site; and

[0064] FIG. 5 shows a schematic diagram of a method.

DETAILED DESCRIPTION

[0065] FIG. 1 shows a schematic diagram of an embodiment of a method for predicting performance of a strain in a process. Values denoted by X represent process data sets related to a performance of a first strain; values denoted by Y represent process data sets related to a performance of a second strain. An index 1 indicates that the process data set is related to a performance in a first process; an index 2 indicates that the process data set is related to a performance in a second process.

[0066] X1 thus represents a first process data set related to a performance of a first strain in a first process for producing bio-ethanol at a first site, Y1 represents a second process data set related to a performance of a second strain in the first process for producing bio-ethanol at the first site, X2 represents a third process data set related to a performance of the first strain in a second process for producing bio-ethanol at a second site, the second site being different from the first site. The vertical oval, comprising X1 and Y1, represents a first correlation between the first process data set and the second process data set, the horizontal oval, comprising X1 and X2, represents a second correlation between the first process data and the third process data. Y2 represents a fourth, reconstructed, process data set related to a performance of the second strain in the second process for producing bio-ethanol at the second site by missing data imputation, wherein the fourth process data set is estimated based on the first correlation and the second correlation.

[0067] As used herein, “strains” are microbial strains, i.e. strains of a microorganism. In a preferred embodiment, the strains are bacterial or fungal strains, more preferably fungal strains and most preferably yeast strains. The “first strain” and the “second strain” are preferably different strains. In a preferred embodiment, the first strain and the second strain are from the same microorganism, preferably both fungi, more preferably both yeasts. Preferably, the first strain and the second strain are from the same genus, more preferably from the same species.

[0068] Examples of microorganisms used in bio-ethanol production include Saccharomyces cerevisiae, Kluyveromyces marxianus, Pichia stipitis, Issatchenkia orientalis and Zymomonas mobilis, among others.

[0069] The performance of strains in processes can be predicted by means of a method including: receiving a first process data set X1 related to a performance of a first strain in a first process for producing bio-ethanol at a first site; receiving a second process data set Y1 related to a performance of a second strain in the first process for producing bio-ethanol at the first site;

[0070] receiving a third process data set X2 related to a performance of the first strain in a second process for producing bio-ethanol at a second site, the second site being different from the first site, and wherein the first, second and third process data sets X1, Y1, X2 each include one or more process profiles and/or process responses; determining a first correlation between the first process data set X1 and the second process data set Y1, and determining a second correlation between the first process data set X1 and the third process data set X2; and reconstructing a fourth process data set Y2 related to a performance of the second strain in the second process for producing bio-ethanol at the second site by missing data imputation, wherein the fourth process data set Y2 is estimated based on the first correlation and the second correlation.

[0071] The data sets may include a plurality of profiles in time, e.g. a consumption of sugar or a production of ethanol. Profiles of various other quantities can also be monitored and used. The profiles can be observed as a function of time for each batch in each plant. The correlations between the measurements can be determined by comparing data of one plant with data of another plant. This provides a first relationship correlating different plants for a particular strain, e.g. plant 1 and plant 2 for strain Y. Furthermore, another relationship can be determined by comparing data of one plant with a first strain and data of the same plant with another strain.

[0072] FIG. 2 shows a schematic diagram of an embodiment of input parameters and process data sets. Each process data set comprises multiple process parameters (process variables) and output parameters (responses), represented by rectangles behind each process data set. Multiple parameters can be visualized in a multidimensional grid, here depicted as a three-dimensional grid, wherein each axis represents a different parameter. Examples of parameters include pH, temperature, enzyme concentration, feedstock composition, time, batch number, performance (measured in volume of bio-ethanol per volume of feedstock), and CO2 concentration. Missing data is handled by missing data imputation. Many missing data imputation methods exist, both single imputation and multiple imputation methods.

[0073] In an example, a first correlation between strain X and strain Y in the same plant is determined. Furthermore, a second correlation between strain Y in one plant and strain Y in a different plant is determined. These correlations can be used for predicting the performance of strain X in the second plant by means of missing data imputation. The technique can assume that a part of the data is missing and, based on the above relationships/correlations, the missing part can be inferred. For instance, a regression model can be employed. However, as indicated above, a wide variety of models and techniques can be employed.

[0074] The data sets may be represented by matrices containing data (cf. profiles) collected at the first and second sites. X1 may correspond to data related to strain X in plant 1; X2 may correspond to data related to strain X in plant 2; Y1 may correspond to data related to strain Y in plant 1; and Y2 may correspond to data related to strain Y in plant 2. A correlation between X1 and X2, and a correlation between X1 and Y1 can be determined for predicting Y2 by means of missing data imputation.

[0075] A plurality of responses and profiles can be collected as a function of time (t). The measurements may for instance be performed over a total duration of 48 hours with one measurement per hour (t=1:1:48). This can be done for a plurality of batches. The performance of the batches can be determined based on the profiles. The collected data can be represented in a 3D array. The collected 3D array can be unfolded in different ways, among them into a plurality of 2D arrays per time point or per process variable.
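The unfolding of such a 3D array can be illustrated with a minimal sketch. The axis order (batch, process variable, time point), the array sizes and the use of NumPy are assumptions made here for illustration only; the values are placeholders, not measured data.

```python
import numpy as np

# Assumed layout (not specified in the text): batches x variables x time points.
n_batches, n_vars, n_times = 10, 4, 48

# Placeholder values standing in for measured profiles.
data = np.arange(n_batches * n_vars * n_times, dtype=float).reshape(
    n_batches, n_vars, n_times
)

# Batch-wise unfolding: one row per batch, all variable/time values side by side.
batch_wise = data.reshape(n_batches, n_vars * n_times)

# Variable-wise unfolding: one row per (batch, time point), one column per variable.
variable_wise = data.transpose(0, 2, 1).reshape(n_batches * n_times, n_vars)

# One 2D array (batches x variables) per time point.
per_time_point = [data[:, :, t] for t in range(n_times)]
```

Which unfolding is appropriate depends on whether the subsequent model treats a whole batch or a single time point as one observation.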

[0076] Although in this example a 3D data set is retrieved, it is also possible to observe fewer parameters, obtaining a 2D data set. A combination of 2D and 3D data sets is also envisaged.

[0077] Data relating to strain X in plant 1 (cf. X1) can be unfolded, resulting in three data matrices concatenated batch-wise. This unfolding is analogous for strain X in plant 2 (cf. X2), and strain Y in plant 1 (cf. Y1). Matrices for strain Y in plant 1 (Y1), strain X in plant 1 (X1), and strain X in plant 2 (X2) can be used for determining a matrix for strain Y in plant 2 (Y2) via a missing data imputation algorithm (data of matrix block Y2 is missing). Based on correlations between columns of X1 and X2, and the relationships between X1 and Y1, a regression model can be used to predict the values in the missing block Y2. Similarly, X2 can be predicted if Y2 is already known.
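One simple way to picture the reconstruction of the missing block Y2 is sketched below. It is not the specific algorithm of the invention: it assumes that the rows of X1 and X2 are paired, that the plant-to-plant relationship learned on strain X also holds for strain Y, and that an ordinary least squares fit is adequate. The function name `impute_missing_block` and all data are hypothetical and synthetic.

```python
import numpy as np

def impute_missing_block(X1, X2, Y1):
    """Estimate Y2 from the block layout [[X1, X2], [Y1, ?]].

    Fits X2 ~ [1, X1] @ B on the strain X rows (plant 1 -> plant 2),
    then applies the same map to the strain Y rows.
    """
    A = np.hstack([np.ones((X1.shape[0], 1)), X1])   # add intercept column
    B, *_ = np.linalg.lstsq(A, X2, rcond=None)       # least squares fit
    A_y = np.hstack([np.ones((Y1.shape[0], 1)), Y1])
    return A_y @ B

# Synthetic, noise-free illustration (values are arbitrary).
rng = np.random.default_rng(0)
T = np.array([[1.0, 0.2, 0.0],
              [0.0, 0.9, 0.1],
              [0.3, 0.0, 1.1]])          # hypothetical plant-to-plant map
offset = np.array([0.5, -1.0, 2.0])      # hypothetical plant-specific offset

X1 = rng.normal(size=(20, 3))            # strain X, plant 1
X2 = X1 @ T + offset                     # strain X, plant 2
Y1 = rng.normal(loc=1.0, size=(5, 3))    # strain Y, plant 1

Y2_hat = impute_missing_block(X1, X2, Y1)  # estimate for strain Y, plant 2
```

With real, noisy data the least squares step would typically be replaced by one of the latent variable regressions mentioned elsewhere in this description.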

[0078] Advantageously, the invention enables predicting how strain Y is going to perform and/or function in plant 2. It can be highly important to accurately predict or estimate how strain Y will perform in plant 2. For instance, plant 2 may be a new site, making it necessary to estimate or determine how a new strain is going to perform there. Based on this, better strains can be selected for new sites, or even process optimization can be performed.

[0079] Although many examples in the invention relate to bio-ethanol production processes, the invention is also applicable to other processes involving the use of strains for producing a product.

[0080] The data relating to batches for X1 and X2 may come from different sites (e.g. plants), making the only commonality the strain. Hence, a batch of one site may not correspond to a batch of another site. Optionally, data can be shuffled in X1 and/or X2. The shuffle can generate sufficient variation to better capture the differences. In this way, the accuracy of the prediction may be improved. The rows can be permuted in different ways, e.g. randomly.
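The optional shuffling of batches can be sketched as a row permutation. This is a minimal illustration assuming each row of an unfolded matrix holds one batch; the function name `shuffle_batches` and the toy values are hypothetical.

```python
import numpy as np

def shuffle_batches(X, seed=None):
    """Return a copy of X with its rows (batches) in a pseudo-random order."""
    rng = np.random.default_rng(seed)
    return X[rng.permutation(X.shape[0])]

# Toy matrix: 6 batches, 2 unfolded features (placeholder values).
X1 = np.arange(12, dtype=float).reshape(6, 2)
X1_shuffled = shuffle_batches(X1, seed=42)
```

Repeating the shuffle with different seeds yields different pairings of X1 and X2 rows, which could then be fitted separately and compared or averaged.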

[0081] FIG. 3 shows a schematic diagram of an embodiment of a method wherein the first process at the first site is carried out in a laboratory environment P1 (e.g. small-scale), and wherein the second process at the second site is carried out in a plant P2, for instance a large-scale industrial plant. A laboratory environment P1 comprises sites wherein only small volumes of bio-ethanol are produced, which are generally not intended for commercial production and/or sale. The laboratory environment P1 is preferably a biochemical laboratory environment, wherein yeast and/or other microbes can survive, e.g. due to the presence of an incubator. A plant P2 is preferably an industrial production plant, wherein large volumes of feedstock can be inserted, either in batch or in a continuous feed. Generally, the produced bio-ethanol is intended to be sold and is required to be of the same quality in each batch, or continuously of the same quality.

[0082] In some examples, existing strains may be adapted and/or improved or new strains may be made. The invention enables predicting how an (improved) strain may work at another site. For example, strain X can be a well-known strain that is currently used by some bio-ethanol producers in a P2 environment. From measured data of X and an improved strain Y in a P1 environment, it can be predicted how strain Y will perform at another site which can use a different process (process setting). It will be appreciated that different types of biomass may be used for bio-ethanol production. The biomass may for instance be corn. Different enzymes and/or pretreatments can be used in the process of producing the bio-ethanol.

[0083] FIG. 4 shows a schematic diagram of an embodiment of a method wherein the reconstructed fourth process data set is used for fitting a predictive model configured to predict the performance of the second strain in the second process at the second site. In the upper diagram, it is schematically shown that process parameters are inputted into the site, which is preferably at least one of a laboratory environment or an industrial production plant.

[0084] The lower diagram shows a schematic representation of the decisions that are taken to optimize the production process of the second strain in the second process. A model is chosen for each of the process data sets. Many computational models exist that could be adequate. The missing data can be modelled by a regression model, which can optionally include at least one of multivariate regression, principal component regression, partial least squares regression, or trimmed scores regression for missing data imputation.
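As an illustration of one of the listed options, principal component regression can be sketched with a singular value decomposition. The function names `pcr_fit` and `pcr_predict`, the number of components and the synthetic data are assumptions for illustration; no claim is made that this matches the implementation used in practice.

```python
import numpy as np

def pcr_fit(X, Y, n_components):
    """Fit a principal component regression of Y on X."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    # Principal directions of the centered predictors.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T                 # loadings, shape (p, k)
    scores = Xc @ V                         # scores, shape (n, k)
    coef, *_ = np.linalg.lstsq(scores, Yc, rcond=None)
    return x_mean, y_mean, V, coef

def pcr_predict(model, X_new):
    """Predict responses for new observations with a fitted PCR model."""
    x_mean, y_mean, V, coef = model
    return (X_new - x_mean) @ V @ coef + y_mean

# Synthetic, noise-free illustration (values are arbitrary).
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 2))                 # hypothetical true coefficients
b = np.array([0.3, -0.7])                   # hypothetical intercept
X = rng.normal(size=(30, 4))
Y = X @ W + b
X_new = rng.normal(size=(5, 4))

full_model = pcr_fit(X, Y, n_components=4)      # all components: equals OLS
reduced_model = pcr_fit(X, Y, n_components=2)   # truncated model
pred_full = pcr_predict(full_model, X_new)
pred_reduced = pcr_predict(reduced_model, X_new)
```

Keeping fewer components than predictors trades some fit for robustness when the unfolded profiles are strongly collinear, which is common for time-resolved process data.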

[0085] After modeling, process parameters are modified in the model and output parameters are simulated, resulting in a prediction of the performance. If the predicted performance has improved compared to before modification of the process parameters, the process parameters of the process itself can be modified in the same way in order to improve the process. If the predicted performance has not improved, the process parameters of the model can be changed again and the decision process starts anew. The process parameters can be changed randomly, but are more preferably changed according to a design of experiments. Various optimization algorithms can be employed.
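The decision loop described above can be sketched as a simple random search on a model. The function `improve_parameters`, the surrogate `toy_model`, the parameter names, the bounds and the optimum used in the toy model are all hypothetical choices for illustration; a fitted predictive model would take the place of the toy function, and a design of experiments or another optimizer could replace the random proposals.

```python
import random

def improve_parameters(model, params, bounds, n_iter=200, step=0.1, seed=0):
    """Randomly perturb parameters in the model and keep only improvements."""
    rng = random.Random(seed)
    best, best_score = dict(params), model(params)
    for _ in range(n_iter):
        candidate = {}
        for name, value in best.items():
            lo, hi = bounds[name]
            proposal = value + rng.uniform(-step, step) * (hi - lo)
            candidate[name] = min(hi, max(lo, proposal))  # stay within bounds
        score = model(candidate)
        if score > best_score:        # accept only if predicted performance improves
            best, best_score = candidate, score
    return best, best_score

def toy_model(p):
    # Hypothetical surrogate with an arbitrary optimum; not real process data.
    return -((p["pH"] - 5.0) ** 2) - ((p["temperature"] - 32.0) / 10.0) ** 2

start = {"pH": 4.0, "temperature": 28.0}
bounds = {"pH": (3.5, 6.0), "temperature": (25.0, 38.0)}
best, best_score = improve_parameters(toy_model, start, bounds)
```

Only parameter settings that improve the simulated performance would be considered for transfer to the actual process.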

[0086] FIG. 5 shows a schematic diagram of a method 100 for predicting performance of strains in processes. In some examples, the method is a computer implemented method configured to be run on a machine. In a first step 101, a first process data set is received related to a performance of a first strain in a first process for producing bio-ethanol at a first site. In a second step 102, a second process data set is received related to a performance of a second strain in the first process for producing bio-ethanol at the first site. In a third step 103, a third process data set is received related to a performance of the first strain in a second process for producing bio-ethanol at a second site, the second site being different from the first site, and wherein the first, second and third process data sets each include one or more process profiles and/or process responses. In a fourth step 104, a first correlation between the first process data set and the second process data set, and a second correlation between the first process data and the third process data, are determined. In a fifth step 105, a fourth process data set is reconstructed related to a performance of the second strain in the second process for producing bio-ethanol at the second site by missing data imputation, wherein the fourth process data set is estimated based on the first correlation and the second correlation.

[0087] It will be appreciated that the method may include computer implemented steps. All above mentioned steps can be computer implemented steps. Embodiments may comprise computer apparatus, wherein processes are performed in the computer apparatus. The invention also extends to computer programs, particularly computer programs on or in a carrier, adapted for putting the invention into practice. The program may be in the form of source or object code or in any other form suitable for use in the implementation of the processes according to the invention. The carrier may be any entity or device capable of carrying the program. For example, the carrier may comprise a storage medium, such as a ROM, for example a semiconductor ROM or hard disk. Further, the carrier may be a transmissible carrier such as an electrical or optical signal which may be conveyed via electrical or optical cable or by radio or other means, e.g. via the internet or cloud.

[0088] Some embodiments may be implemented, for example, using a machine or tangible computer-readable medium or article which may store an instruction or a set of instructions that, if executed by a machine, may cause the machine to perform a method and/or operations in accordance with the embodiments.

[0089] Various embodiments may be implemented using hardware elements, software elements, or a combination of both. Examples of hardware elements may include processors, microprocessors, circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate array (FPGA), logic gates, registers, semiconductor device, microchips, chip sets, et cetera. Examples of software may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, mobile apps, middleware, firmware, software modules, routines, subroutines, functions, computer implemented methods, procedures, software interfaces, application program interfaces (API), methods, instruction sets, computing code, computer code, et cetera.

[0090] Herein, the invention is described with reference to specific examples of embodiments of the invention. It will, however, be evident that various modifications, variations, alternatives and changes may be made therein, without departing from the essence of the invention. For the purpose of clarity and a concise description, features are described herein as part of the same or separate embodiments; however, alternative embodiments having combinations of all or some of the features described in these separate embodiments are also envisaged and understood to fall within the framework of the invention as outlined by the claims. The specifications, figures and examples are, accordingly, to be regarded in an illustrative sense rather than in a restrictive sense. The invention is intended to embrace all alternatives, modifications and variations which fall within the spirit and scope of the appended claims. Further, many of the elements that are described are functional entities that may be implemented as discrete or distributed components or in conjunction with other components, in any suitable combination and location.

[0091] In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word ‘comprising’ does not exclude the presence of other features or steps than those listed in a claim. Furthermore, the words ‘a’ and ‘an’ shall not be construed as limited to ‘only one’, but instead are used to mean ‘at least one’, and do not exclude a plurality. The mere fact that certain measures are recited in mutually different claims does not indicate that a combination of these measures cannot be used to an advantage.

CPC classification

Y02E50/10: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS

C12P7/06: CHEMISTRY; METALLURGY

G16B40/20: PHYSICS

G16C20/10: PHYSICS

G16B5/00: PHYSICS

International classification

G16B5/00: PHYSICS

G16B40/20: PHYSICS
