Computer implemented method for predicting true agronomical value of a plant
11430542 · 2022-08-30
Assignee
Inventors
- Nicolas Heslot (Riom, FR)
- Stéphanie Chauvet (Aubière, FR)
- Chloé Boyard (Saulzet-le-Chaud, FR)
- Pascal Flament (Aubière, FR)
Cpc classification
G16B40/00
PHYSICS
G06N7/01
PHYSICS
G16B20/20
PHYSICS
International classification
G16B20/20
PHYSICS
G16B20/00
PHYSICS
G06N7/00
PHYSICS
G16B40/00
PHYSICS
Abstract
A computer implemented method for predicting an agronomical value and a breeding value of a plant belonging to a population, the method including the steps of: obtaining at least some genotypic data from a subset of lines from the population; obtaining at least some phenotypic data from a subset of lines from the population; providing a statistical model receiving as input the genotypic data and the phenotypic data; and using the statistical model to output at least an agronomical value estimated for the plant. More particularly, the statistical model is a mixed model combining fixed effects and random effects.
Claims
1. A computer implemented method for predicting an agronomical value and a breeding value of a plant belonging to a population, the method comprising: obtaining at least some genotypic data from a first subset of lines from the population; obtaining at least some phenotypic data from a second subset of lines from the population; providing a statistical model; inputting the genotypic data and the phenotypic data in the statistical model; using the statistical model to output an agronomical value estimated for the plant, and a predicted breeding value estimated for the plant as a parent, the statistical model being a mixed model combining fixed effects and random effects, the random effects including at least non-additive genetic effects; and predicting a crop performance based on the agronomical value.
2. The method according to claim 1, wherein the genotypic data comprise genetic markers data.
3. The method according to claim 1, wherein the mixed model further receives, as input, phenotypic data of the plant.
4. The method of claim 3, wherein the phenotypic data comprise field trial information data.
5. The method according to claim 1, wherein the mixed model comprises a Genomic Best Linear Unbiased Prediction (GBLUP).
6. The method according to claim 1, wherein marker-based kinships including at least a Gaussian Kernel are used to estimate a covariance between plant individuals and capture non-additive genetic effects.
7. The method according to claim 1, further comprising estimating at least one of a genetic dominance effect and a non-additive effect based on genotypic data to capture heterosis, so as to predict hybrid crop performance.
8. The method according to claim 1, wherein the statistical model further receives, as input, phenotypic data of the plant, and wherein the statistical model analyzes simultaneously a plurality of phenotype traits with at least one covariance estimated among: a covariance estimated between traits, and a covariance estimated for a same trait, between a plurality of different environments.
9. The method according to claim 1, wherein the mixed model uses at least one additional fixed effect covariate to model an effect of genotypic data on phenotypic data of the plant.
10. The method according to claim 1, wherein the mixed model has heterogeneous residuals.
11. The method according to claim 1, wherein the step of providing comprises selecting and then providing the statistical model, the model being selected among a plurality of candidates of statistical models, based on tests to fulfill a criterion of predictiveness.
12. The method according to claim 1, further comprising, before inputting the genotypic data and the phenotypic data in the statistical model, filtering at least the genotypic data so as to remove at least samples having incomplete data among the genotypic data and the phenotypic data.
13. The method according to claim 1, wherein, before inputting the genotypic data and the phenotypic data in the statistical model, outlier phenotypic observations are detected in the phenotypic data and removed from the phenotypic data before inputting the phenotypic data in the statistical model.
14. A device for predicting an agronomical value of a plant, comprising a circuit for performing the method of claim 1, the circuit including at least: an input configured to receive at least some genotypic data and some phenotypic data; a storage medium storing program instructions; a processor configured to cooperate with the storage medium to execute the method; and an output configured to deliver at least the agronomical value of the plant and a predicted breeding value estimated for the plant.
15. A non-transitory computer storage medium storing computer program instructions which, when executed by a computer, are configured to cause the computer to carry out a method for predicting an agronomical value and a breeding value of a plant belonging to a population, the method comprising: obtaining at least some genotypic data from a first subset of lines from the population; obtaining at least some phenotypic data from a second subset of lines from the population; providing a statistical model; inputting the genotypic data and the phenotypic data in the statistical model; using the statistical model to output an agronomical value estimated for the plant, and a predicted breeding value estimated for the plant as a parent, the statistical model being a mixed model combining fixed effects and random effects, the random effects including at least non-additive genetic effects; and predicting a crop performance based on the agronomical value.
Description
(1) The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar elements and in which:
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12) Until now, genomic predictions were focused on the breeding value (value in a cross) for unobserved plants. However, according to one aspect of the invention, the total genetic value (i.e. agronomical or commercial value, both being synonymous in the present context) is of interest in a product development context.
(13) Typically, characters of a plant, such as crop yield, flowering date, biotic and abiotic tolerance, height, etc., are related to its agronomical value. A plant might have superior agronomical value in the field of farmers but may have a poor breeding value and hence generate low quality progenies and conversely.
(14) Including markers smooths the year and environment effects, as if the candidate line, in an early yield trial, had been tested over more plots (more replicates and more environments). In the presence of epistasis, though, including marker data in the model can reduce the accuracy of the agronomical value estimation if the wrong statistical model is used (e.g. if non-additive effects are present and not adequately modelled). Therefore, several statistical models are tested in turn on the particular data set, including one that does not use marker information at all, to make sure that the retained agronomical value estimates (the retained fitted model) are at least as accurate as those provided by the traditional basic model that uses phenotypic data only.
(15)
(16) 1) white disks are related to the expected precision (simulation) of the phenotype “P” depending on the numbers of plots observed for a plant,
(17) 2) black disks are related to the expected precision when markers added to the phenotype (“P+G”) are used to analyze the trial on the same number of plots for the same plant (based on simulation results with the following hypotheses: repeatability of 0.3 and epistasis variance equal to 20% of the total genetic variance), so as to predict a commercial value of that plant.
(18) To efficiently implement the systematic use of markers, methods are needed to make sure that the analysis model is always at least as good as a model without markers to predict the total genetic value of the plant.
(19) A separate prediction of the breeding value is needed to select the best parents, in a model always including markers to separate additive from non-additive effects.
(20) A number of other issues to solve are linked to data quality control, optimal use of the available data and automatic data cleaning.
(21) Beyond the breeding value/total genetic value prediction task, there are many alternatives to improve the model. An efficient model selection procedure is needed however for automation.
(22) The present invention makes it possible to analyze field trials systematically with molecular markers, using not only the data generated in the last year but all relevant data generated in the past by breeding activities, and to use them for selection decisions.
(23) A description of a mixed model that can be used to that end is given now, in light of embodiments that have been presented in the prior art mainly in the field of animal breeding, an overview of which is presented hereafter.
(24) The use of molecular markers is disclosed for example in: S
(25) Use of molecular markers as an aid to selection has been an active area of research for several decades now (S
(26) As another way of identifying plants with desired characteristics, genomic selection (“GS” hereafter) fostered great hopes and opened new ways to use molecular markers in breeding for complex traits: M
(27) Initially, most of the research was conducted in the animal breeding community, where the high cost of phenotyping (e.g., here progeny testing in dairy cattle breeding), as well as the impossibility to replicate individuals, made it attractive. In addition, partly because of the impossibility to replicate individuals, animal breeders implemented mixed model methodology early on to analyze their data using the available pedigree information: H
(28) In plant breeding, the use of mixed models is more recent: S
(29) Marker-assisted recurrent selection (hereafter “MARS”) refers to several breeding schemes using markers to select unphenotyped individuals and then crossing them to generate the next generation from selected candidates. Initial work with MARS used biparental or multi parental population and QTL detection (QTL for “Quantitative Trait Loci”) and then tried to pyramid QTL: S
(30) Only markers significantly associated with the trait were used in the recurrent selection process. As a consequence, some genomic selection (GS) reports make a distinction between MARS and GS but it is rather more logical to consider GS as a tool to carry out MARS and other uses.
(31) The genomic selection (GS) or prediction is defined here as the simultaneous use of genome-wide markers to predict an individual's total genetic or breeding value for both observed and unobserved individuals. Multi-trait GS models can make use of information on correlated traits to improve prediction accuracy, as presented in: J
(32) Early work on GS in plants was mainly focused on unobserved individuals, in the MARS context. GS can be beneficial for observed individuals as well if entry-mean heritability is low: E
(33) GS can be performed with a variety of statistical methods. Those methods are concerned with the same so-called “large P small n” problem: there are many more predictor P (marker) effects to be estimated than there are observations n. Most approaches to this problem involve some type of penalized regression. Currently, the most widely used model is the genomic best linear unbiased prediction model (GBLUP): H
(34) With GBLUP, markers are used to estimate the covariance between individuals. That information is further used in a mixed model analysis to predict performance of observed and unobserved individuals. The GBLUP model has the advantage of relative simplicity, limited computing time and well-known optimality properties of mixed models for selection: F
(35) A classic GBLUP mixed model for genetic evaluation can be written:
$y = X\beta + Zu + \varepsilon$ (1)
where $y$ is a vector of phenotypes, $\beta$ is a vector of non-genetic effects such as environments, with design matrix $X$, $u$ is a vector of genetic effects with design matrix $Z$, and $\varepsilon$ is a vector of residuals.
(36) If all m individuals are replicated in all t locations and β is a vector of location (or, more generally, environment) effects (weather, soil, etc.), then:
$X = I_t \otimes \mathbf{1}_m$ and
$Z = \mathbf{1}_t \otimes I_m$, where $I_t$ is an identity matrix with $t$ rows, $\mathbf{1}_m$ is a vector of ones with length $m$, and $\otimes$ is the Kronecker product.
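By way of a non-limiting illustration, the Kronecker structure described above can be sketched as follows, reading it as the design matrices of the environment and genetic effects in Model (1). The dimensions `t` and `m` and the ordering of the observations (environment by environment, individuals nested within each environment) are assumptions made for the example only.

```python
import numpy as np

t, m = 3, 4  # hypothetical sizes: 3 environments, 4 individuals

# Design matrix of the environment effects: one column per environment.
X = np.kron(np.eye(t), np.ones((m, 1)))   # shape (t*m, t)

# Design matrix of the genetic effects: one column per individual.
Z = np.kron(np.ones((t, 1)), np.eye(m))   # shape (t*m, m)
```

Each row of `X` and of `Z` contains a single 1, which assigns the corresponding observation to one environment and one individual.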
(37) The error variance is usually:
$\mathrm{var}(\varepsilon) = I\sigma_\varepsilon^2$
(38) The simplest form of this model does not use pedigree or markers and assumes that individuals are unrelated such that:
$\mathrm{var}(u) = I\sigma_g^2$, where $\sigma_g^2$ is the genetic variance.
(39) The estimate of u would then be a simple adjusted phenotypic mean. This estimate can be further refined by assuming that individuals are related such that their performances are not independent from each other.
(40) Then, the variance is such that:
$\mathrm{var}(u) = K\sigma_a^2$, where
$\sigma_a^2$ is an additive genetic variance and $K$ is a kinship matrix, which can be based on pedigree (then often called the relationship matrix) or on markers (then often called the “realized relationship matrix”). If $W$ is the centered marker score matrix (thus representing a “genetic similarity” matrix), with $m$ rows and as many columns as markers ($W'$ being the transpose of $W$), then one way of calculating $K$ is
(41)
$p_k$ is the frequency of the minor allele: V
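By way of a non-limiting illustration, a marker-based kinship of this kind can be sketched as follows. The sketch assumes the commonly used VanRaden form $K = WW' / (2\sum_k p_k(1-p_k))$, which is consistent with the symbols defined above but is not reproduced from the omitted expression. The function name and the 0/1/2 genotype coding are likewise assumptions.

```python
import numpy as np

def vanraden_kinship(M):
    """Marker-based kinship from a genotype matrix.

    M: (individuals x markers) array of allele counts coded 0, 1 or 2
       (assumed coding), with no missing values.
    """
    M = np.asarray(M, dtype=float)
    p = M.mean(axis=0) / 2.0             # allele frequency of each marker
    W = M - 2.0 * p                      # centered marker scores
    denom = 2.0 * np.sum(p * (1.0 - p))  # scaling term, sum over markers
    return W @ W.T / denom

# Hypothetical genotypes: 4 individuals, 3 markers.
M = [[0, 1, 2],
     [1, 1, 0],
     [2, 0, 1],
     [1, 2, 1]]
K = vanraden_kinship(M)
```

Because the columns of `W` are centered, each row of `K` sums to zero, so `K` is singular. This is one reason a generalized inverse may be needed when such a matrix is inverted.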
(42) There are some other possibilities for kinship K to capture some non-additive effects: L
(43) If some individuals are not phenotyped, they still can be predicted by the model. In that case, some columns of Z contain only zero elements. Predictions are obtained by solving the mixed model equations once the variances are estimated:
(44)
where the exponent “−” designates the generalized inverse matrix.
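By way of a non-limiting illustration, the prediction step can be sketched with the standard Henderson form of the mixed model equations for Model (1), which is assumed here to correspond to the omitted expression. The variance ratio `lam` (residual variance divided by additive genetic variance) is assumed to be already estimated, and the function name is hypothetical.

```python
import numpy as np

def solve_mme(y, X, Z, K, lam):
    """Solve Henderson's mixed model equations for y = X b + Z u + e.

    lam is the ratio sigma_e^2 / sigma_a^2, assumed known.
    Generalized (Moore-Penrose) inverses are used for both K and the
    coefficient matrix, so that singular matrices are tolerated.
    """
    K_inv = np.linalg.pinv(K)
    lhs = np.block([[X.T @ X, X.T @ Z],
                    [Z.T @ X, Z.T @ Z + lam * K_inv]])
    rhs = np.concatenate([X.T @ y, Z.T @ y])
    sol = np.linalg.pinv(lhs) @ rhs
    n_fixed = X.shape[1]
    return sol[:n_fixed], sol[n_fixed:]   # (fixed effects, genetic effects)

# Hypothetical toy data: 3 unrelated individuals, one observation each.
y = np.array([1.0, 2.0, 3.0])
X = np.ones((3, 1))      # intercept only
Z = np.eye(3)
beta_hat, u_hat = solve_mme(y, X, Z, np.eye(3), lam=1.0)
```

With unrelated individuals, the predicted genetic effects are the deviations from the mean shrunk by a factor 1/(1 + `lam`). An unphenotyped individual would have an all-zero column in `Z` and would be predicted only through its covariance with the others in `K`.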
(45) This model can be further extended to predict hybrid performance by adding an additional random effect such that:
$y = X\beta + Z_1u_1 + Z_2u_2 + \varepsilon$ (2)
$y$, $\beta$ and $\varepsilon$ are as before, while $u_1$ and $u_2$ are, respectively, the male and female genetic effects (which can be observed in plants such as maize, sunflower or rapeseed for example), such that
$\mathrm{var}(u_1) = K_1\sigma_{a1}^2$ and $\mathrm{var}(u_2) = K_2\sigma_{a2}^2$, with $\sigma_{a1}^2$ and $\sigma_{a2}^2$ the additive genetic variances in the male and female groups, respectively, and $K_1$ and $K_2$ the relationship matrices based on pedigree or markers for each group: A
(46) Model (1), when calibrated on non-inbred material, and Model (2) can accommodate an additional random effect $u_3$ to capture dominance effects, such that:
$\mathrm{var}(u_3) = K_3\sigma_d^2$, with $\sigma_d^2$ the dominance variance, and
(47)
V is a centered marker design matrix for the dominance effect such that each column of that matrix V corresponds to a marker with minor allele a of frequency p coded:
$\{aa, Aa, AA\} = \{-2p^2,\ 2p(1-p),\ -2(1-p)^2\}$
as explained in: V
(48) While $K_1$ and $K_2$ correspond to kinship between males and females for example, $K_3$ has one row per hybrid.
(49) Model (2) (with or without markers) can be described as a GCA model (general combining ability)
(50) From the overview given above, it appears that genomic selection has been mainly described as a way to shorten selection by selecting on the basis of markers only. It was recently realized (E
(51) The method of the invention brings a conceptual change by introducing separate analyses for the total agronomical value (usually named “genetic value” in the prior art) and the breeding value (value as a parent). This distinction was not possible without markers and is believed to be novel, at least in the field of plant breeding. Indeed, in the agronomical field, a seed of a fixed genotype has an agronomical value in itself, since the plant growing directly from that seed is identical to the mother plant. In the field of animal breeding, the breeding value (for crossing) is usually the sole important value to take into account so as to assign a value (e.g. a commercial value) to a genitor (bull, dam, etc.).
(52) It is however a major improvement over current practices, in the field of agronomy. Plant breeders can then select individuals based on agronomical value prediction to advance toward release, and furthermore pick potentially different individuals as parents of new individuals based on breeding value prediction.
(53) There are some potential issues with the introduction of molecular markers. However, the invention provides a very flexible and robust pipeline able to automatically deal with potentially low quality data and return reliable prediction, always at least as good as the baseline analysis that does not use markers, as shown below with reference to
(54) Statistical developments in embodiments of the invention include systematic detection of outliers, automatic choice of the most adapted statistical model, better exploitation of low quality trial data and systematic use of historical data generated by breeding programs.
(55) Taken separately, each of those developments is not efficient. A combination of them in a pipeline is needed to maximize breeding efficiency as well as their use in a context of systematic analysis of field trials with molecular markers.
(56) The aforesaid pipeline, using both phenotypic and genetic data, is presented below when referring to
(57) Before presenting such figures in detail, an example of a context where the invention can be advantageously used is given hereafter with reference to
(58) In this context given by way of an example, a cross between plant parents is performed at step S10 and generates new candidates (represented as step S11 on
(59) First phenotypic data can be obtained at step S12, such as, for example for maize: ear insertion height, plant height, flowering time, disease resistance, etc.
(60) A rapid cycling based on genomic selection can be performed after step S11 (in practice based essentially on molecular markers only, and not on phenotype parameters), as represented by arrow A12. At an early stage, after obtaining new candidates (at step S11) and before obtaining a first yield trial (at step S13), early stage tests can also be performed at step S12 using marker assisted selection for major QTLs (Quantitative Trait Loci), such as, for example, major QTLs for disease resistance.
(61) Usually, a next step S13, corresponding to a first yield trial, is then performed, and data actually observed on crop plots are gathered (including agronomical data such as, for example, plant earliness, resistance to disease, actual yield, weight per kernel, etc.). The traits that start to be collected at this stage are the most important agronomical ones. They are usually the lower heritability traits (lower than those of step S12), with few replicates and few locations (usually up to a dozen plots). However, these data can be used to implement a model (a mixed model) using molecular markers, which can estimate the true agronomical value and the true breeding value of the individuals better than the phenotype alone (arrows A13 and A10 in dotted lines, representing a major contribution of the invention), on the basis of genetic data obtained from step S11, typically including molecular markers, combined with phenotypic data (from step S12 and completed at step S13), so that the mixed model can output estimates of the true agronomical value and of the breeding value.
(62) The model output can be used to select potentially different individuals for enhanced agronomical value (arrow A10) and optionally enhanced value as a parent (breeding value) (arrow A13).
(63) Furthermore, the model results (arrows A10 and A13) can be used to: give predictions including non-additive effects (such as epistasis, for example) to improve the prediction of the true agronomical value; and predict additive and non-additive genetic values, so as to enhance predictions for individuals resulting from a breeding of the parent plants crossed at step S10, and for further individuals resulting from further crossings at the next cycle (step S15).
(64) The advanced tests performed at step S14, can use phenotypic data collected at step S14 and genetic data (that can be generated at any stage and potentially collected on ascendants or descendants of the plant tested in step S14) for plants advanced directly from step S11 (arrow A11). For plants advanced from step S13 (arrow A10) the tests can use phenotypic data collected in steps S12 and S13.
(65) Here, estimates of the agronomical value for low heritability traits determined in step S13 can be improved, with numerous replicates observed in numerous locations.
(66) A further benefit of the invention is that selection decisions using phenotype alone are usually based on only one year of data, whereas the invention provides means, through the use of molecular markers, to use multiple years of data and to smooth the impact of the year in which the candidates were trialed.
(67) With reference now to
(68) Of course, the content of
(69) An exemplary embodiment of a computer implemented process (the aforesaid “pipeline”) is now described with reference to
(70) A marker pre-processing can also be performed so as to take into account, for example, a call rate (which reflects the percentage of missing data), pedigree errors (inconsistencies between information related to parent plants and the markers), heterozygosity, etc., in order to improve the quality of the marker data used in the remainder of the analysis and to avoid spurious results.
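By way of a non-limiting illustration, a marker filter based on call rate and heterozygosity could be sketched as follows. The function name, the thresholds and the coding of genotypes (0/1/2, with `nan` for a missing call) are assumptions made for the example and are not specified by the present description.

```python
import numpy as np

def filter_markers(M, min_call_rate=0.9, max_het=None):
    """Keep markers with enough genotype calls and, optionally, limited heterozygosity.

    M: (individuals x markers) array, genotypes coded 0/1/2, np.nan if missing.
    Returns the filtered matrix and the boolean mask of kept markers.
    """
    M = np.asarray(M, dtype=float)
    called = ~np.isnan(M)
    call_rate = called.mean(axis=0)          # fraction of non-missing calls
    keep = call_rate >= min_call_rate
    if max_het is not None:
        # Share of heterozygous calls among the non-missing calls of each marker.
        het = (M == 1).sum(axis=0) / np.maximum(called.sum(axis=0), 1)
        keep &= het <= max_het
    return M[:, keep], keep

nan = np.nan
M = np.array([[0, nan, 1],
              [1, nan, 1],
              [2, 0,   1],
              [0, 1,   nan]])
_, keep_call = filter_markers(M, min_call_rate=0.7)
_, keep_both = filter_markers(M, min_call_rate=0.7, max_het=0.9)
```

In this toy example the second marker is dropped for its low call rate, and the third is additionally dropped when the heterozygosity threshold is applied, since all of its calls are heterozygous.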
(71) A second group of steps S44 to S46 are related to procedures to determine a convenient choice of analysis options. At step S44, analysis options are chosen involving selections among for example (and not limited to these items): an environment type (such as “year×location”, or “year×location×trial” in case of field heterogeneities), additional fixed effects per trait, e.g. major genes or QTLs (for example, a gene can provide a better and total resistance to a virus, influencing thus phenotype features data), use of commercial control checks in the analysis, crops types (specific options such as for maize, or rapeseed, etc.) so as to add relevant information (model with testers for maize for example, hybrid model for rapeseed, etc.).
(72) Those selections of information are intended to give relevant information so as to build the model parameters of the vector β (environment effect—year effect—commercial control checks, tester effect), of non-genetic effects in equation (2) given again hereafter and of genetic effects u1 and u2 (u2 is optional):
$y = X\beta + Z_1u_1 + Z_2u_2 + \varepsilon$
(73) Plant gender information (N1 males and N2 females for example) can be used to define u1 and u2 in a case where hybrid data is analyzed.
(74) As a further option (dashed lines in step S45), a non-additive model can be included in the model selection procedure, for example with additional random effects (more complex model to fit).
(75) The next group of steps of
(76) In step S47, kernels (kinship matrices K1 and K2, Gaussian kernel) and their inverses are computed. In step S48, a rough Estimation of the Genetic Value (EGV) based on phenotype alone is calculated, so as to perform a basic phenotype analysis. Then, in step S49, a “one-step” GBLUP (Genomic Best Linear Unbiased Prediction) model calculation is performed (as a first rough estimation of the genetic and breeding values with markers, the same model being used for both at this stage). In step S50, a one-step GBLUP model is calculated, now with heterogeneous residuals, so as to test an error variance e, for example per environment (in case of a particular crop having suffered from storm or hail, or a crop having been eaten by wild boars, etc.) or per any other parameter having a non-genetic effect. Less weight is then given in the model to such bad quality results. In step S51, the model is tested according to a chosen criterion such as Akaike's Information Criterion (AIC). Then, if heterogeneous residuals improve the model according to a selection criterion such as the AIC, the corresponding option is kept for future steps. Once this part of the model is selected, outliers can be removed according to a selected threshold (step S46). If outliers are removed, the reduced dataset is input to step S48 to refit the different models.
(77) Preferably, variance components can be used from a previous model to generate better starting values for the next model to fit (so as to get a faster convergence of the new model). If a model does not converge, the algorithm stops and does not fit more complex models.
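By way of a non-limiting illustration, the sequential fitting logic described above (models fitted in order of increasing complexity, each one optionally seeded by the previous fit, with a stop at the first model that does not converge and a selection based on AIC) could be sketched as follows. The `Fit` record and the fitting callables are hypothetical placeholders; the actual model fitting (e.g. by REML) is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

@dataclass
class Fit:
    name: str
    loglik: float     # maximized log-likelihood of the fitted model
    n_params: int     # number of estimated (variance) parameters
    converged: bool

    @property
    def aic(self) -> float:
        # Akaike's Information Criterion: lower is better.
        return 2 * self.n_params - 2 * self.loglik

def select_by_aic(fitters: Sequence[Callable[[Optional[Fit]], Fit]]) -> Fit:
    """Fit candidate models in order and return the converged one with lowest AIC."""
    best, previous = None, None
    for fit_model in fitters:
        fit = fit_model(previous)   # the previous fit may provide starting values
        if not fit.converged:
            break                   # stop: more complex models are not fitted
        if best is None or fit.aic < best.aic:
            best = fit
        previous = fit
    if best is None:
        raise RuntimeError("no candidate model converged")
    return best

# Hypothetical fits standing in for real model fitting.
candidates = [
    lambda prev: Fit("egv", -100.0, 1, True),
    lambda prev: Fit("gblup", -95.0, 1, True),
    lambda prev: Fit("classic", -94.5, 2, True),
    lambda prev: Fit("gaussian", 0.0, 1, False),   # does not converge
    lambda prev: Fit("never_fitted", -10.0, 1, True),
]
best = select_by_aic(candidates)
```

Because the phenotype-only model is always among the candidates, the retained model is never worse, in terms of the chosen criterion, than the baseline without markers.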
(78) In step S52, once the model is validated and outliers potentially removed, a more complex model is fitted, with one (or two in the case of hybrids, like in equation (2) above) additive effect being based on marker and another effect capturing independent epistatic deviation with covariance proportional to identity.
(79) If this model converges, another model is fitted with one additive effect being based on markers and one non-additive effect based on the Hadamard product of the kinship based on markers to capture pairwise epistatic effects.
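By way of a non-limiting illustration, the two covariance structures used for the non-additive term in these models (an identity matrix for independent epistatic deviations, and the Hadamard product of the marker-based kinship with itself for pairwise epistatic effects) could be built as follows. The function name and the toy kinship matrix are assumptions made for the example.

```python
import numpy as np

def epistasis_covariances(K):
    """Return candidate covariance structures for the non-additive random effect."""
    K = np.asarray(K, dtype=float)
    return {
        # Independent deviations: covariance proportional to identity.
        "independent": np.eye(K.shape[0]),
        # Pairwise epistasis: element-wise (Hadamard) product of K with itself.
        "pairwise": K * K,
    }

K = np.array([[1.0, 0.5],
              [0.5, 1.0]])   # hypothetical kinship between two individuals
cov = epistasis_covariances(K)
```

Each matrix is then multiplied by its own variance component when the corresponding model is fitted.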
(80) Optionally, more complex non-additive models, using for example the Gaussian kernel, or models capturing dominance effects can be fitted in step S53 (decision taken in step S45). Here again, variance components can be used from previous models to generate better starting values for the new model (so as to get a faster convergence). In step S54, the best models are selected according to the AIC criterion and convergence. Separate selection procedures are used to identify the best model to pick the best parents (using the GEBV: Genomic Estimated Breeding Value) and to identify the best model to predict the agronomical value (also called the GEGV: Genomic Estimated Genetic Value). This “Genetic Value” is the total genetic value. It entails all the genetic components: additive, epistasis and dominance whenever relevant. The workflow can be stopped if a given model does not converge. In step S54, the best model identified for the agronomical value can simply be the EGV (the model for rough Estimation of the Genetic Value, without markers).
(81) Step S55 relates to a prediction rescaling to correspond to phenotype scale.
(82) Preferably, multi-threading can be used to speed up matrix operations, using for example supernodal sparse factorization and inversion in all model fit steps.
(83) Finally, the process outputs in step S56 the GEBV and the agronomical value (GEGV) (predictions with respect to parents and with respect to commercial expectations), as well as: Quality results of the prediction estimates (including details of the analysis, graphs, diagnostics), Standard errors, Graphs of residuals.
(84) Further details of the models which can be used are given below.
(85) For the EGV calculation, a BLUP without markers is used (Best Linear Unbiased Prediction).
(86) For the GEBV calculation, a basic GBLUP (Genomic Best Linear Unbiased Prediction) using Van Raden kinship is performed.
(87) For the Genomic Estimation of Agronomical Value calculation, different approaches are tested. Only the results from the best one are output, among the following:
- “simple EGV”;
- same as GEBV ($K$) (additive effects only);
- “Classic” ($K + K^2$);
- “Independent” ($K$ + diagonal matrix, as explained below);
- “Gaussian” (optional; 10 different covariance matrices are tried successively to retain the best fitting one, or a quadratic approximation is used. Calculation times are equivalent to 10 fits of basic GBLUP);
- k-kernel (optional; 90 different covariance matrices are tried successively to retain the best fitting one. Calculation times are equivalent to 90 fits of basic GBLUP).
(88) Therefore, several models can be used to capture non-additive effects and use different covariance structures. Model selection is preferably based on AIC. For more details about the models, the following expressions can be used:
EGV: $y = X\beta + Z_1u_I + \varepsilon$
GBLUP: $y = X\beta + Z_1u_K + \varepsilon$
GBLUP-GBLUP2 (classic): $y = X\beta + Z_1u_K + Z_2u_{K^2} + \varepsilon$
GBLUP-IDE (independent): $y = X\beta + Z_1u_K + Z_2u_I + \varepsilon$
GBLUP-gaussian: $1 \le i \le 10$, $y = X\beta + Z_1u_{Kg(i)} + \varepsilon$
GBLUP-kkernel: $1 \le j \le 90$, $y = X\beta + Z_1u_{Kk(j)} + \varepsilon$
where:
$y$ is a known vector of observations
$\beta$ is an unknown vector of fixed effects
$u_M$ is an unknown vector of random effects with $u_M \sim N(0, M\sigma_M^2)$
$\varepsilon$ is an unknown vector of random errors
$X$ and $Z_1$ are known design matrices relating the observations to $\beta$ and $u_M$, respectively.
(89) For fully epistatic effects ($u_M$ with $M = I$), an $n \times n$ identity matrix $I$ can be provided.
(90) For additive effects only, a VanRaden kinship matrix can be used (as described above):
(91)
(92) For the GBLUP-Gaussian and GBLUP-kkernel models, one factor accounts for the sum of additive and non-additive effects. The variance-covariance structure for that factor involves an $n \times n$ positive definite (non-diagonal) matrix chosen amongst 10 or 90 candidates, respectively, as said above, indexed by kernel parameters to be estimated:
(94) For the Gaussian kernel, the covariance structure is such that the $(i, j)$ element of that covariance is given by $\exp(-(D_{ij}/\theta)^2)$, where the matrix $D$ of Euclidean distances between individuals is calculated with markers and normalized to the interval [0,1], and $\theta$ is a so-called bandwidth parameter estimated by REML using a grid search or a quadratic approximation.
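By way of a non-limiting illustration, a Gaussian kernel of this form could be computed as follows. The function name, the toy marker matrix and the grid of ten bandwidth values are assumptions made for the example. The selection of the bandwidth by REML is not shown.

```python
import numpy as np

def gaussian_kernel(M, theta):
    """Gaussian kernel exp(-(D/theta)^2) from a marker matrix.

    M: (individuals x markers) numeric marker scores.
    D is the Euclidean distance matrix, normalized to [0, 1].
    """
    M = np.asarray(M, dtype=float)
    sq = np.sum(M ** 2, axis=1)
    # Squared distances; clipped at 0 to absorb rounding errors.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (M @ M.T), 0.0)
    D = np.sqrt(d2)
    d_max = D.max()
    if d_max > 0:
        D = D / d_max
    return np.exp(-(D / theta) ** 2)

M = np.array([[0.0, 0.0],
              [2.0, 2.0],
              [0.0, 2.0]])              # hypothetical marker scores
thetas = np.linspace(0.1, 1.0, 10)      # hypothetical grid of 10 bandwidths
kernels = [gaussian_kernel(M, th) for th in thetas]
G = gaussian_kernel(M, theta=0.5)
```

A small bandwidth makes the covariance decay quickly with genetic distance, whereas a large bandwidth makes individuals look more alike.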
(95) The GBLUP-k-kernel model is indexed by two parameters $k$ and $h$, such that its covariance is equal to $hS_k + (1-h)K$. $h$ is a mixture parameter that varies between 0 (GBLUP model) and 1 (block-diagonal covariance reduced to the target $S_k$). $S_k$ is a block diagonal matrix with individuals ordered into $k$ clusters: if individuals $i$ and $j$ belong to the same cluster, the corresponding element of $S_k$ is equal to the element of $K$, and to 0 otherwise. For a given $k$, the population is split into $k$ clusters that are determined by transforming $K$ into a distance matrix and using a classical clustering algorithm. The values of $k$ and $h$ are estimated by REML using a grid search. The model can be seen as a multiple-kernel model or a simplified multi-trait model.
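By way of a non-limiting illustration, the k-kernel covariance could be assembled as follows. The sketch assumes that the second matrix of the mixture is the marker-based kinship $K$ (consistent with $h = 0$ giving the GBLUP model), and that the cluster labels have already been obtained by a clustering algorithm, which is not shown. The function name and the toy data are hypothetical.

```python
import numpy as np

def k_kernel(K, labels, h):
    """Mixture h * S_k + (1 - h) * K for given cluster labels.

    S_k keeps the elements of K for pairs of individuals in the same
    cluster and is zero elsewhere.
    """
    K = np.asarray(K, dtype=float)
    labels = np.asarray(labels)
    same_cluster = labels[:, None] == labels[None, :]
    S_k = np.where(same_cluster, K, 0.0)
    return h * S_k + (1.0 - h) * K

K = np.array([[1.0, 0.6, 0.2],
              [0.6, 1.0, 0.4],
              [0.2, 0.4, 1.0]])   # hypothetical kinship
labels = [0, 0, 1]                 # hypothetical cluster assignment (k = 2)
C_half = k_kernel(K, labels, h=0.5)
```

Between-cluster covariances are thus shrunk by a factor (1 − `h`), while within-cluster covariances are left unchanged.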
(96) For the GBLUP-GBLUP2 (classic) model, the u.sub.K.sub.
(97) These matrices are multiplied by their respective variance components to provide the factor's variance-covariance structure.
(98) Environment can be modeled as a fixed or random effect. If environment is random, an additional random term $Z_{env}u_{env}$ can be part of the model. As an example of workflow, environment can be treated as fixed if it has fewer than six levels. The term $\beta$ always contains the intercept and can contain additional fixed effects provided by the user. Tester and commercial checks can be automatically fitted as fixed effects if needed.
(99) The models to choose from are reported in the following table:
(100)
Epistatic model   Method           Number of genetic effects
egv               EGV              1
No                GBLUP            1
Classic           GBLUP-GBLUP2     2
independent       GBLUP-IDE        2
gaussian          GBLUP-gaussian   1
kkernel           GBLUP-kkernel    1
(101) If the phenotype is collected on hybrid data, the analysis can be done using the parent information. The number of genetic effects is then multiplied by two compared to the above description, plus other optional effects capturing specific combining ability.
(102) For the EGV, GEBV and GEAV (Genomic Estimate of the Agronomical Value, corresponding to the GEGV), the same models are calculated and compared for each parent independently.
(103) Molecular profiles of parents are used. Several models with different covariance structures are fitted. Model selection is based on AIC.
(104) For the EGV model, a BLUP without markers (parent1 and parent2 being independent) can be used.
(105) For a GEBV model, the additive part is to be taken into account. Different models are tested, and only the best one for each component is retrieved. Results can be a mix of several approaches, among:
basic GBLUP (one kinship for each group of parents: parent1 and parent2)
Independent for parent2 (K(parent1)+diagonal matrix (parent2))
Independent for parent1 (diagonal matrix (parent1)+K(parent2))
(106) The following equations can be used:
For EGV: y=Xβ+G.sub.1u.sub.I1+G.sub.2u.sub.I2+G.sub.3u.sub.I3+ε
For GBLUP: y=Xβ+G.sub.1u.sub.K1+G.sub.2u.sub.K2+ε
For GBLUP-IDE: y=Xβ+G.sub.1u.sub.K1+G.sub.2u.sub.I2+ε or
y=Xβ+G.sub.1u.sub.I1+G.sub.2u.sub.K2+ε
For GBLUP-SCA: y=Xβ+G.sub.1u.sub.K1+G.sub.2u.sub.K2+G.sub.3u.sub.K3+ε
where:
y is a known vector of observations
β is an unknown vector of fixed effects
u.sub.Mi is an unknown vector of random effects with u.sub.Mi˜N(0, M σ.sup.2.sub.Mi)
ε is an unknown vector of random errors
X and G.sub.i are known design matrices relating the observations to β and u.sub.Mi, respectively.
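To make the role of the covariance matrices M concrete, the covariance of the observations implied by such an equation can be sketched as below. The sketch rests on assumptions not stated in the text: the random effects are mutually independent, the errors are independent with a common variance, and each G.sub.i is an incidence matrix with a single 1 per row, so that it can be replaced by a list of parent indices. All names and numerical values are hypothetical.

```python
def hybrid_covariance(parent1_idx, parent2_idx, M1, M2, var1, var2, var_e):
    """Covariance of y for y = X*beta + G1*u1 + G2*u2 + e.

    Element (a, b) is var1*M1[p1(a)][p1(b)] + var2*M2[p2(a)][p2(b)],
    plus var_e on the diagonal (assumed independent errors).
    """
    n = len(parent1_idx)
    V = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            V[a][b] = (
                var1 * M1[parent1_idx[a]][parent1_idx[b]]
                + var2 * M2[parent2_idx[a]][parent2_idx[b]]
            )
            if a == b:
                V[a][b] += var_e
    return V


# Hypothetical GBLUP-IDE case: kinship for parent1, identity for parent2.
K1 = [[1.0, 0.5], [0.5, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
# Three hybrids given as (parent1, parent2) index pairs: (0,0), (0,1), (1,1).
V_example = hybrid_covariance([0, 0, 1], [0, 1, 1], K1, I2, 2.0, 1.0, 0.5)
```

Replacing I2 by a second kinship matrix gives the GBLUP case, and replacing K1 by an identity matrix as well gives a covariance of the kind used for EGV without markers.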
(107) A specific combining ability (SCA) effect can be added either based on markers, capturing epistasis between parents or dominance (random effect u.sub.K3), or without markers, with an identity covariance (random effect u.sub.I3). The covariance of u.sub.K3 is proportional to the covariance of the dominance effect described above for the hybrids, or to the Hadamard product of the VanRaden kinship of the hybrids. Model selection for the inclusion of the SCA component and the choice of its covariance structure is based on convergence of the model and AIC.
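A selection rule based on convergence and AIC can be sketched as follows. This is a minimal illustration, assuming the usual definition AIC = 2p − 2 log L, where p is the number of parameters counted in the penalty (how p is counted under REML is an assumption left to the implementer). The dictionary keys and the fitted values are hypothetical and do not come from real data.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: lower values indicate a preferred model."""
    return 2.0 * n_params - 2.0 * log_likelihood


def select_model(fits):
    """Keep converged fits only, then return the one with the lowest AIC."""
    converged = [fit for fit in fits if fit["converged"]]
    if not converged:
        raise ValueError("no candidate model converged")
    return min(converged, key=lambda fit: aic(fit["log_likelihood"], fit["n_params"]))


# Hypothetical fit summaries for three candidate covariance structures.
fits_example = [
    {"name": "GBLUP", "log_likelihood": -120.0, "n_params": 3, "converged": True},
    {"name": "GBLUP-SCA", "log_likelihood": -117.5, "n_params": 4, "converged": True},
    {"name": "GBLUP-IDE", "log_likelihood": -110.0, "n_params": 3, "converged": False},
]
best = select_model(fits_example)
```

In this example the non-converged fit is discarded despite its higher likelihood, which mirrors the rule that convergence is checked before AIC is compared.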
(108)
(109)
(110) Like
(111)
(112)
(113)
(114) Of course, the invention can be performed by a computer device or a computer system including one or several data entries (from one or several databases for example), and having a core architecture as shown in the very schematic and simplified
(115) The present invention can also be embedded in a computer program product, which comprises all the features enabling the implementation of the method described herein, and which, when loaded in an information processing device or system, causes the information processing device or system to perform the method of the invention. Computer program means or computer program in the present context means any expression, in any language, code or notation, of a set of instructions intended to cause a computer system or device having an information processing capability to perform a particular function either directly or after the conversion to another language. Such a computer program can be stored on a computer or machine readable medium allowing data, instructions, messages or message packets, and other machine readable information to be read from the medium. The computer or machine readable medium may include non-volatile memory, such as ROM, Flash memory, USB key, Disk drive memory, CD-ROM, and other permanent storage. Additionally, a computer or machine readable medium may include, for example, volatile storage such as RAM, buffers, cache memory, and network circuits. Furthermore, the computer or machine readable medium may comprise computer or machine readable information in a transitory state medium such as a network link and/or a network interface, including a wired network or a wireless network, that allow a device to read such computer or machine readable information.
(116) The present invention is not limited to the description provided above as an example of embodiment; it encompasses possible variants.
(117) An automatic selection of the best statistical model for measuring the genetic value (and breeding value, as well) has been described above with reference to
(118) The true agronomical value can furthermore be estimated for non-phenotyped plants. For example, the model can also be used to predict individuals that have no observations of their own.
(119) Furthermore, the Gaussian kernel in the method above can be used to estimate the covariance between individuals and capture non-additive effects, or another covariance estimator based on a direct product of the marker-based kinship can be used to capture pairwise epistatic interactions. However, other covariance estimators capturing non-additive effects can be used.
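The two covariance estimators mentioned above can be sketched as follows. The sketch assumes one common parametrization of the Gaussian kernel, exp(−d²/bandwidth) with d the Euclidean distance between marker profiles, and an element-wise (Hadamard) product of the kinship with itself for the pairwise epistatic covariance. The bandwidth value, the marker coding and the function names are hypothetical.

```python
import math


def gaussian_kernel(markers, bandwidth):
    """Covariance between individuals as exp(-d^2 / bandwidth) (assumed form)."""
    n = len(markers)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Squared Euclidean distance between the two marker profiles.
            d2 = sum((a - b) ** 2 for a, b in zip(markers[i], markers[j]))
            K[i][j] = math.exp(-d2 / bandwidth)
    return K


def hadamard_product(A, B):
    """Element-wise product of two matrices of identical shape."""
    return [[a * b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]


# Hypothetical marker scores coded 0/1/2 for three individuals.
markers_example = [[0, 1, 2], [0, 1, 2], [2, 1, 0]]
K_gauss = gaussian_kernel(markers_example, bandwidth=8.0)

# Hypothetical kinship; its Hadamard square models pairwise epistasis.
kinship_example = [[1.0, 0.5], [0.5, 1.0]]
K_epistatic = hadamard_product(kinship_example, kinship_example)
```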
(120) Advantageously, when the method of the invention is applied to hybrid crop performance prediction (with a male and a female effect), one or more of the following can be fitted: a dominance effect, an epistatic component based on markers fitted to capture heterosis, and an effect capturing heterosis based on phenotype alone.
(121) The model described above may include an additional fixed-effect covariate to model the effect of major QTLs (Quantitative Trait Loci) or genes on the trait. The model may provide a simultaneous analysis of several traits, with a covariance estimated between traits, and/or possibly with the same trait in different locations, years or groups of locations being treated as different traits. The covariance estimated between traits can then be used to analyze the trial network.
(122) The model described above may include further a tester effect for testcross data (like shown on
(123) As described above, the method includes in its first steps an automatic quality control of the molecular marker data (based on criteria such as the percentage of missing data per individual and per marker, congruence with pedigree data, and the expected heterozygosity rate). It may further include an automatic quality control of the phenotype data to remove data points whose value is outside an expected range, and possibly also an automatic quality control of the phenotype data to remove data points that might create numerical problems with the algorithm (e.g. locations with very few individuals). In the same or an alternative embodiment, other data quality tests can be performed, such as an automatic fitting of a factor as fixed or random depending on the number of levels of that factor, and elimination of data from the model if a quality score collected by experimenters is below a threshold. A cut-off step can also be provided to prevent false results if the dataset is too small to be analyzed correctly. Moreover, the automatic detection and removal of outliers can be applied to outlier observations and/or to outlier individuals based on standardized residuals.
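Two of the phenotype checks described above, namely removal of out-of-range values and removal of locations with very few individuals, can be sketched as follows. The record layout, the thresholds and the order of the two checks are assumptions made for illustration, not features of the described embodiment.

```python
from collections import Counter


def filter_phenotypes(records, lower, upper, min_per_location):
    """Drop out-of-range values, then drop locations left with too few records."""
    in_range = [r for r in records if lower <= r["value"] <= upper]
    counts = Counter(r["location"] for r in in_range)
    return [r for r in in_range if counts[r["location"]] >= min_per_location]


# Hypothetical phenotype records: one aberrant value and one sparse location.
records_example = [
    {"location": "A", "value": 8.1},
    {"location": "A", "value": 7.4},
    {"location": "A", "value": 55.0},
    {"location": "B", "value": 6.9},
]
kept = filter_phenotypes(records_example, lower=0.0, upper=20.0, min_per_location=2)
```

Applying the range check first means that a location is judged on its usable records only, so a location whose records are mostly aberrant is also discarded.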
(124) Finally, backsolving in the method according to the embodiment shown on