ASSOCIATING PEDIGREE SCORES AND SIMILARITY SCORES FOR PLANT FEATURE PREDICTION
20210392836 · 2021-12-23
Assignee
Inventors
Cpc classification
A01H1/04
HUMAN NECESSITIES
G16B10/00
PHYSICS
International classification
A01H1/02
HUMAN NECESSITIES
A01H1/04
HUMAN NECESSITIES
Abstract
The invention relates to a computer-implemented method comprising: receiving (102) a set of pedigree scores (300, 512) of pairs of plant breeding units over two or more generations; receiving (104) an incomplete set of similarity scores (200, 510) of the plant breeding unit pairs; aligning (106) the pedigree scores and the similarity scores of identical plant breeding unit pairs; automatically analyzing (108) the aligned pedigree scores and similarity scores for computing a predictive model (508) based on associations of the similarity scores and of the pedigree scores; using the predictive model for creating (112) a complete set of similarity scores (400, 518); and using (114) the complete set of similarity scores for computationally predicting a feature (522) of a plant breeding unit or of an offspring thereof.
Claims
1. A computer-implemented method for predicting a feature of one or more plants, the method comprising: receiving a set of pedigree scores, the pedigree scores being indicative of known genealogical relationships of pairs of plant breeding units over two or more generations, the plant breeding unit pairs comprising pairs of plant breeding units within the same generation and comprising pairs of plant breeding units of different ones of the two or more generations, wherein a plant breeding unit is an individual plant or a group of plants; receiving an incomplete set of similarity scores, each similarity score being indicative of observed similarities between the two members of a respective one of the pairs of the plant breeding units, wherein the incomplete set of similarity scores is devoid of similarity scores of at least a sub-set of the plant breeding unit pairs; aligning the pedigree scores and the similarity scores of identical plant breeding unit pairs; automatically analyzing the aligned pedigree scores and similarity scores for determining associations of the similarity scores and of the pedigree scores, thereby computing a predictive model, the predictive model being adapted to estimate a similarity score as a function of a pedigree score; applying the predictive model on pedigree scores of the sub-set of the plant breeding unit pairs for computing missing similarity scores for each of the plant breeding unit pairs of the sub-set; creating a complete set of similarity scores from the incomplete set of similarity scores and the computed missing similarity scores; and using the complete set of similarity scores for computationally predicting a feature of at least one of the plant breeding units or of an offspring of at least one of the plant breeding units.
2. The computer-implemented method of claim 1, wherein the pedigree scores are indicative of known genealogical relationships of all the pairs of plant breeding units over three or more generations.
3. The computer-implemented method of claim 1, wherein the predictive model is selected from: a linear or non-linear function that has been fitted on the pedigree scores and the similarity scores such that it returns an estimated similarity score of a plant breeding unit pair in dependence on a pedigree score of the plant breeding unit pair, the function being preferably a polynomial function having a polynomial order preferably of 3; and/or a trained machine-learning model, the trained machine-learning model having learned during a training phase to estimate a similarity score of a plant breeding unit pair in dependence on a pedigree score of the plant breeding unit pair.
4. The computer-implemented method of claim 1, further comprising: creating a pedigree score matrix, and using the pedigree score matrix as the set of pedigree scores; and/or creating a similarity score matrix, and using the similarity score matrix as the incomplete set of similarity scores.
5. The computer-implemented method of claim 1, further comprising: computing the set of pedigree scores from a genealogical pedigree tree and from predefined scores for different genealogical relationships.
6. The computer-implemented method of claim 1, the pedigree scores being selected from: coefficients of coancestry, each coefficient of coancestry indicates the probability that one feature, derived from the same common ancestor, is identical by descent in two individuals; and scores computed as a function of the coefficients of coancestry, in particular inbreeding coefficients, each inbreeding coefficient being a measure of inbreeding derived from a known genealogical relationship of the parents expressed in the form of coefficients of coancestry.
7. The computer-implemented method of claim 1, further comprising: computing each of the similarity scores in the incomplete set of similarity scores as a function of genetic, metabolic, transcription-related, protein-related and/or phenotypic markers of the two plant breeding units comprised in the plant breeding unit pair for which the similarity score is computed, the similarity scores being indicative of a degree of similarity of the markers of the two plant breeding units.
8. The computer-implemented method of claim 1, the similarity score being selected from: a marker-based similarity score, in particular a genomic relationship score computed from DNA marker information; and/or a marker co-occurrence score; wherein the marker is selected from: a genetic, metabolic, transcription-related, protein-related, phenotype-related marker and/or breeding value of a plant used as one of the plant breeding units; or an aggregate value derived from genetic, metabolic, transcription-related, protein-related, phenotypic markers and/or breeding value of a group of plants used as one of the plant breeding units.
9. The computer-implemented method of claim 1, the plant breeding units being groups of plants, each one of the groups of plants being selected from: a group of plants having the same or a highly similar genotype that is different from the genotype of some or all other ones of the plant groups; and/or a group of plants belonging to the same cultivar, the cultivar being different from the cultivar to which the plants of some or all of the other plant groups belong.
10. The computer-implemented method of claim 1, further comprising: performing a cluster analysis on a base population of plant breeding units, thereby identifying a number n of clusters, each cluster comprising a sub-set of plant breeding units whose genetic, metabolic, transcription-related, protein-related phenotype-related and/or breeding-related markers are more similar to one another than to respective markers of plant breeding units of other ones of the clusters; for each of the number n of identified clusters: identifying pairs of plant breeding units comprised in this cluster; receiving pedigree scores for each of the identified pairs; receiving similarity scores of at least some of the identified pairs; aligning the pedigree scores and the similarity scores of identical plant breeding unit pairs selectively for the pairs in the cluster; performing an automated analysis of the aligned pedigree scores and similarity scores for determining associations of the similarity scores and of the pedigree scores in the cluster, thereby computing a cluster-specific predictive model, the cluster-specific predictive model being adapted to estimate a similarity score as a function of a pedigree score.
11. The computer-implemented method of claim 10, wherein the complete set of similarity scores is a set of preliminary similarity scores computed using the predictive model as a preliminary global predictive model, the method further comprising: applying the cluster-specific predictive models on pedigree scores of the sub-set of the plant breeding unit pairs of the one of the clusters from which the cluster-specific predictive model was derived for computing missing similarity scores for intra-cluster plant breeding unit pairs of the cluster; supplementing the received incomplete set of similarity scores with the similarity scores computed for the intra-cluster plant breeding unit pairs of the one or more clusters, thereby providing an intermediate incomplete set of similarity scores, the intermediate incomplete set of similarity scores being devoid of similarity scores of at least some of the inter-cluster plant breeding unit pairs; supplementing the intermediate incomplete set of similarity scores by using the preliminary similarity scores as the missing similarity scores of the inter-cluster plant breeding unit pairs, thereby providing a refined complete set of similarity scores; and using the refined complete set of similarity scores for performing the computational prediction of the feature.
12. The computer-implemented method of claim 10, further comprising: applying the cluster-specific predictive models on pedigree scores of the sub-set of the plant breeding unit pairs of the one of the clusters from which the cluster-specific predictive model was derived for computing missing similarity scores for intra-cluster plant breeding unit pairs of the cluster; supplementing the received incomplete set of similarity scores with the similarity scores computed for the intra-cluster plant breeding unit pairs of the one or more clusters, thereby providing an intermediate incomplete set of similarity scores, the intermediate incomplete set of similarity scores being devoid of similarity scores of at least some inter-cluster plant breeding unit pairs; performing the method according to claim 1, thereby using the intermediate incomplete set of similarity scores as the received incomplete set of similarity scores, whereby the predictive model is computed by analyzing the aligned pedigree scores and the similarity scores of the intermediate incomplete set of similarity scores, whereby the computed predictive model is applied on the pedigree scores of inter-cluster plant breeding unit pairs for creating the complete set of similarity scores that is used for computationally predicting the feature.
13. The computer-implemented method of claim 1, wherein a base population of plant breeding units is used as the founding population of a pedigree tree from which the pedigree scores are derived, wherein the base population comprises at least two genetically distinct groups of plant breeding units.
14. The computer-implemented method of claim 1, wherein the predicted feature is selected from: a breeding value of one or more of the plant breeding units; an identifier of one or more of the plant breeding units having the highest likelihood of comprising a favorable genomic, metabolic, or phenotypic marker; an identifier of one or more of the plant breeding units having the highest likelihood of comprising an undesired genomic, metabolic, or phenotypic marker; an identifier of at least one plant breeding unit pair comprising a favorable combination of genomic, metabolic, or phenotypic markers; an identifier of at least one plant breeding unit pair comprising an undesired combination of genomic, metabolic, or phenotypic markers; and/or the likelihood of occurrence of a favorable or of an undesired genomic, metabolic, or phenotypic marker in an offspring of two of the plant breeding units.
15. A method for conducting a plant breeding project, the method comprising: providing a group of candidate plant breeding units, wherein a candidate plant breeding unit is an individual plant or a group of plants potentially to be used in the plant breeding project, wherein a known genealogical relationship of pairs of the candidate plant breeding units over two or more generations is available; performing the method according to claim 1 for computationally predicting a feature of at least one of the candidate plant breeding units or of an offspring of at least one of the plant breeding units, wherein the candidate plant breeding units are used as the plant breeding units whose pedigree scores and incomplete set of similarity scores are received, wherein the feature is indicative of whether the at least one candidate breeding unit comprises a favorable genomic, metabolic, or phenotypic marker and/or a favorable breeding value; selecting one or more of the candidate breeding units in dependence on the at least one predicted feature; and selectively using the selected one or more candidate breeding units for generating offspring in the plant breeding project.
16. A computer-system configured for predicting a feature of one or more plants, the computer system comprising: one or more processors; a volatile or non-volatile storage medium comprising: a set of pedigree scores, the pedigree scores being indicative of known genealogical relationships of pairs of plant breeding units over two or more generations, the plant breeding unit pairs comprising pairs of plant breeding units within the same generation and comprising pairs of plant breeding units of different ones of the two or more generations, wherein a plant breeding unit is an individual plant or a group of plants; an incomplete set of similarity scores, each similarity score being indicative of observed similarities between the two members of a respective one of the pairs of the plant breeding units, wherein the incomplete set of similarity scores is devoid of similarity scores of at least a sub-set of the plant breeding unit pairs; a software comprising computer-interpretable instructions which, when executed by the one or more processors, cause the processors to perform a method comprising: aligning the pedigree scores and the similarity scores of identical plant breeding unit pairs; analyzing the aligned pedigree scores and similarity scores for determining associations of the similarity scores and of the pedigree scores, thereby computing a predictive model, the predictive model being adapted to estimate a similarity score as a function of a pedigree score; applying the predictive model on pedigree scores of the sub-set of the plant breeding unit pairs for computing missing similarity scores for each of the plant breeding unit pairs of the sub-set; creating a complete set of similarity scores from the incomplete set of similarity scores and the computed missing similarity scores; and using the complete set of similarity scores for computationally predicting a feature of at least one of the plant breeding units or of an offspring of at least one of the plant breeding units.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0156] In the following, only exemplary forms of the invention are explained in more detail, whereby reference is made to the drawings in which they are contained. They show:
[0157]
[0158]
[0159]
[0160]
[0161]
[0162]
[0163]
DETAILED DESCRIPTION
[0164]
[0165] In the following, the method according to
[0166] The examples described herein may allow imputing missing data points in a non-pedigree-derived similarity score dataset, such as a marker-based similarity score matrix (which is sometimes used to estimate kinship and hence is sometimes also referred to as “kinship matrix” although no pedigree information was used for constructing this matrix). The imputed data is computed from a pedigree score dataset. Missing similarity data in this context can be detrimental, as the lack of this information leads to situations in which predictions cannot be made using routine methods for genomic prediction.
[0167] In the example described here, genotypic information of a set of individuals in recombination cycle D was used together with a portion of the genotypic information from the parental recombination cycles (A, B, and C; many of the parents were ungenotyped). The incomplete similarity score matrix was computed from this molecular marker data. Additionally, pedigree score data was obtained for those genotypes from recombination cycle D. The data also contains parental information for several generations (potentially about five generations or fewer).
[0168] In a first step 102, a computer system 500 or the analysis software 502 receives the above-mentioned set of pedigree scores 512. For example, the pedigree scores can be read in the form of a pedigree score matrix 300, 512 as depicted, for example in
[0169] The aim of the plant breeding project used for this example was to identify a sub-set of the plant breeding units of the youngest generation and a further sub-set of plant breeding units of the “parent” or “founder” generation (or genetically equivalent plants) in order to generate offspring having one or more desirable traits. For example, a big leaf size and high heat stress tolerance may be considered desirable traits. The leaf size and heat stress tolerance in the youngest generation may already be close to optimum, but in order to get rid of an undesired trait that is common in the youngest generation, it may be desirable to cross selected ones of the plants of the youngest generation with selected ones of the plants of the parent or founder generations (or genetically equivalent plants) in order to obtain plants having the desired properties and at the same time being free of the undesired property. The problem may be that marker-based similarity information, e.g. information on SNPs known to correlate with the desired traits, may only be available for the youngest generation “Geno1-3”, not for the parent or founder generation.
[0170] The matrix depicted in
[0171] Next in step 104, the system 500 receives an incomplete set of similarity scores 200, 510. An example of an incomplete similarity score matrix 200 is presented in
[0172] Next in step 106, the system, having received the pedigree scores and the incomplete similarity scores, aligns the pedigree scores and the similarity scores of identical plant breeding unit pairs. For example, the pedigree score matrix and the incomplete similarity score matrix may be received as, or transformed into, symmetrical matrices that can be aligned to each other easily. Alternatively, the pedigree scores and the similarity scores are aligned in the form of score vectors. In this case, the matrix alignment merely illustrates which scores are aligned to each other; the alignment itself is performed on pedigree score vectors and similarity score vectors.
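The alignment of step 106 can be illustrated with a minimal Python sketch. It assumes the scores are held in dictionaries keyed by unit pairs; the function names and the toy values are hypothetical and are not taken from the described system.

```python
def pair_key(a, b):
    """Order-independent key, so (a, b) and (b, a) denote the same pair."""
    return tuple(sorted((a, b)))


def align_scores(pedigree, similarity):
    """Align pedigree and similarity scores of identical unit pairs.

    Returns (aligned, missing):
      aligned: (pair, pedigree_score, similarity_score) for pairs having both scores
      missing: (pair, pedigree_score) for pairs lacking a similarity score
    """
    sim = {pair_key(*pair): score for pair, score in similarity.items()}
    aligned, missing = [], []
    for pair, ped_score in sorted((pair_key(*p), s) for p, s in pedigree.items()):
        if pair in sim:
            aligned.append((pair, ped_score, sim[pair]))
        else:
            missing.append((pair, ped_score))
    return aligned, missing


# Hypothetical toy data: the parent "Par1" has no marker-based similarity score.
pedigree = {("Par1", "Geno1"): 0.25, ("Geno1", "Geno2"): 0.125, ("Geno1", "Geno1"): 0.5}
similarity = {("Geno2", "Geno1"): 0.4, ("Geno1", "Geno1"): 1.5}
aligned, missing = align_scores(pedigree, similarity)
```

The pairs in `aligned` would feed the model fit of step 108, while the pairs in `missing` are the ones whose similarity scores are later estimated.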
[0173] Next in step 108, the system automatically analyzes the aligned pedigree scores and similarity scores for determining associations of the similarity scores and of the pedigree scores for computing a predictive model 508 which is able to computationally estimate a similarity score as a function of a pedigree score.
[0174] In order to create the predictive model, the scores of those plant breeding unit pairs for which all pairwise information for both K (similarity scores in the form of marker-based similarity coefficients) and A (pedigree scores in the form of coefficients of coancestry) exists, are analyzed. In the specific example depicted in
[0175] Next in step 110, the system 500 applies the predictive model (the polynomial function of order 3 created in the previous step) on the pedigree scores of at least the sub-set of the plant breeding unit pairs currently lacking a similarity score for computing the missing similarity scores.
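As a rough illustration of steps 108 and 110, the following Python sketch fits a third-order polynomial to aligned scores and evaluates it for pairs lacking a similarity score. It assumes an ordinary least-squares fit via NumPy; the function name and the synthetic values are hypothetical, and the described system may use a different fitting routine.

```python
import numpy as np


def fit_similarity_model(pedigree_scores, similarity_scores, order=3):
    """Least-squares polynomial mapping a pedigree score to a similarity score."""
    coeffs = np.polyfit(pedigree_scores, similarity_scores, deg=order)
    return np.poly1d(coeffs)


# Synthetic aligned scores (hypothetical): pedigree score A and similarity score K.
a_known = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
k_known = 0.1 + 1.2 * a_known + 0.5 * a_known**2 - 0.3 * a_known**3

model = fit_similarity_model(a_known, k_known)

# Pedigree scores of pairs that currently lack a similarity score.
a_missing = np.array([0.05, 0.25, 0.45])
k_imputed = model(a_missing)
```

A cubic has four coefficients, so at least four pairs with both scores are needed for the fit to be determined.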
[0176] Then in step 112, a complete similarity score matrix as depicted in
[0177] Next in step 114, the system inputs the complete set of similarity scores into a prediction software 520 that is adapted to computationally predict a feature 522 of a plant breeding unit or of an offspring thereof based on the complete set of similarity scores. For example, the prediction software 520 can use the GBLUP algorithm for computing a predictive value based on the complete similarity score matrix 400.
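To indicate how a completed similarity score matrix can enter a GBLUP-style prediction, a strongly simplified Python sketch follows. It assumes a model with a single mean and a random genetic effect whose covariance is proportional to the similarity matrix, a known variance ratio `lam`, and the sample mean as the estimate of the overall mean. The function name, the phenotype values and these simplifications are assumptions for illustration only and do not describe the prediction software 520.

```python
import numpy as np


def gblup_predict(K, y_obs, obs_idx, lam=1.0):
    """Simplified GBLUP-style prediction for all units.

    K       : complete similarity (relationship) matrix, shape (n, n)
    y_obs   : observed phenotypes of the units listed in obs_idx
    obs_idx : indices of the phenotyped units
    lam     : assumed ratio of residual to genetic variance
    """
    K = np.asarray(K, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    mu = y_obs.mean()  # simplification: sample mean instead of a GLS estimate
    K_oo = K[np.ix_(obs_idx, obs_idx)]
    alpha = np.linalg.solve(K_oo + lam * np.eye(len(obs_idx)), y_obs - mu)
    return mu + K[:, obs_idx] @ alpha


# Similarity matrix of four genotypes; the phenotypes are hypothetical.
K = [[1.5, 0.4, 0.9, 0.7],
     [0.4, 1.5, 0.5, 0.4],
     [0.9, 0.5, 1.5, 0.4],
     [0.7, 0.4, 0.4, 1.5]]
y = [10.0, 12.0, 11.0]  # phenotypes of the first three genotypes
predicted = gblup_predict(K, y, obs_idx=[0, 1, 2], lam=1.0)
```

The last entry of `predicted` is the value for the unphenotyped fourth genotype, which can only be computed because its row of the similarity matrix is complete.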
[0178] According to some examples, the pedigree score matrix is transformed into a pedigree score vector x and the incomplete similarity score matrix is transformed into an incomplete similarity score vector y. The alignment of scores and the score analysis for creating the predictive model is performed on vectors rather than a matrix structure. Transforming the matrices into vectors may further increase the performance as some programs for statistical (regression) analysis expect to receive two or more data vectors as input.
[0179] For example, the transformation of the matrix into a vector can be performed as follows:
[0180] 1. Start with a matrix, e.g. a similarity score matrix:
TABLE-US-00001
         Geno1   Geno2   Geno3   Geno4
Geno1    1.5     0.4     0.9     0.7
Geno2    0.4     1.5     0.5     0.4
Geno3    0.9     0.5     1.5     0.4
Geno4    0.7     0.4     0.4     1.5
[0181] 2. Remove either the upper or the lower triangle of the matrix; which one is removed does not matter, since the matrix is symmetric about the diagonal:
TABLE-US-00002
         Geno1   Geno2   Geno3   Geno4
Geno1    1.5
Geno2    0.4     1.5
Geno3    0.9     0.5     1.5
Geno4    0.7     0.4     0.4     1.5
[0182] 3. Take the remaining columns and stack them. (One could also stack the rows; this does not matter as long as the same choice is applied consistently to both matrices, i.e. the pedigree score matrix derived from a pedigree dataset and the similarity score matrix derived from marker data.)
[0183] This leads to a table of three columns: Genotype 1, Genotype 2, and their marker-based similarity score:
TABLE-US-00003
Genotype 1   Genotype 2   Similarity score
Geno1        Geno1        1.5
Geno2        Geno1        0.4
Geno3        Geno1        0.9
Geno4        Geno1        0.7
Geno2        Geno2        1.5
Geno3        Geno2        0.5
Geno4        Geno2        0.4
Geno3        Geno3        1.5
Geno4        Geno3        0.4
Geno4        Geno4        1.5
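The three transformation steps above can be sketched in a few lines of Python. The function name is hypothetical; the matrix values are those of the example table.

```python
def stack_lower_triangle(names, matrix):
    """Stack the columns of the lower triangle (diagonal included).

    Returns a list of (row_name, column_name, score) tuples, column by column.
    """
    stacked = []
    for col in range(len(names)):
        for row in range(col, len(names)):
            stacked.append((names[row], names[col], matrix[row][col]))
    return stacked


names = ["Geno1", "Geno2", "Geno3", "Geno4"]
K = [[1.5, 0.4, 0.9, 0.7],
     [0.4, 1.5, 0.5, 0.4],
     [0.9, 0.5, 1.5, 0.4],
     [0.7, 0.4, 0.4, 1.5]]

stacked = stack_lower_triangle(names, K)
for row_name, col_name, score in stacked:
    print(row_name, col_name, score)
```

Applying the same function to the pedigree score matrix yields a vector in the same pair order, so the two vectors are aligned position by position.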
[0184]
[0185]
[0186]
[0187]
[0188] In a first step, the analysis software 502 reads an incomplete similarity score matrix 510 from a local or remote data store. The matrix 510 is a similarity score dataset in which data is missing regarding the marker-based similarity between members of pairs of plant breeding units, e.g. individual plants, genotypes or cultivars. In addition, pedigree scores 512 available for the above-mentioned pairs of plant breeding units are received by the software 502. In the ideal case, the pedigree score data set is deep, meaning it covers several previous generations (here: “Anc” and “Par”). The pedigree score data is received as or transformed into a pedigree score matrix (referred to as A) as depicted, for example, in
[0189] The matrix comprising the incomplete similarity score data set is referred to as y and the matrix comprising the pedigree scores is referred to as x. The software 502 comprises an alignment module 504 configured to align (or map) pedigree scores to similarity scores (if any) assigned to the same pair of plant breeding units. For example, the alignment of matrices can be implemented as an alignment of vectors.
[0190] An association module 506 is adapted to analyze the association of the aligned pedigree scores and similarity scores for automatically creating a predictive model that is adapted to predict a similarity score from a given pedigree score. For example, the association module may perform a regression analysis for fitting a polynomial model to the aligned scores (after they have been placed in the format expected by the module). For example, a polynomial function of order 3 may be fitted by regressing the incomplete similarity score matrix y on the pedigree score matrix x.
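Written out, a third-order polynomial model of this kind could take the following form. The coefficient symbols, the pair indices i and j, and the set Ω of pairs for which both scores exist are notation introduced here for illustration, and the ordinary least-squares criterion is an assumption, since the text only speaks of a regression analysis.

```latex
\hat{y}_{ij} = \beta_0 + \beta_1 x_{ij} + \beta_2 x_{ij}^{2} + \beta_3 x_{ij}^{3},
\qquad
(\hat{\beta}_0, \ldots, \hat{\beta}_3)
  = \arg\min_{\beta_0, \ldots, \beta_3}
    \sum_{(i,j) \in \Omega}
    \bigl( y_{ij} - \beta_0 - \beta_1 x_{ij} - \beta_2 x_{ij}^{2} - \beta_3 x_{ij}^{3} \bigr)^{2}
```

Here x_{ij} is the pedigree score and y_{ij} the observed similarity score of the pair (i, j); pairs outside Ω do not enter the fit and receive the estimate \hat{y}_{ij} afterwards.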
[0191] Preferably, the regressing of the incomplete similarity score matrix y on the pedigree score matrix is implemented based on a score vector alignment and regression. For example, the regression process may comprise a) representing the pedigree score matrix x as a vector vx (see description of
[0192] Then, the association module 506 outputs the created predictive model 508.
[0193] A further module of the analysis software 502, the completion module 516, applies the predictive model 508 on the pedigree scores 512 for computing the missing similarity scores. The empty cells of matrix 510 are filled with the newly computed similarity scores and a complete similarity score matrix 518 is provided. A concrete example of this matrix 518 is depicted in
[0194] The prediction software 520 computes a prediction of one or more features 522 of one or more plant breeding units or the offspring thereof based on the completed similarity score matrix 518. The feature 522 is output to a user, e.g. via a GUI or a printer. The feature can be, for example, a breeding value, a predicted likelihood of the presence of one or more desired or undesired genotypic, metabolic and/or phenotypic traits, or the like.
[0195]
[0196]
[0197] To obtain the curve of
[0198] According to some examples, the clustering of plant breeding units is performed based on biological markers of these plant breeding units (e.g. the biological markers used for computing the incomplete similarity scores). According to another example, the clustering is performed on the pedigree scores.
[0199] According to some examples, the clustering is performed on the totality of originally received and imputed similarity scores after the steps 102-112 have been performed. For example, the k-means clustering algorithm or a similar clustering algorithm can be used for identifying the clusters. According to some further examples, the clustering can also be performed semi-automatically based on prior information related to population structure, geographic structure, or any other information that is characteristic of the plant breeding units used in a plant breeding project.
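For orientation, a minimal k-means sketch (Lloyd's algorithm) in Python follows. It assumes that each plant breeding unit is represented by a numeric profile, for instance its row of the completed similarity score matrix; the function name, the random initialization and the toy profiles are hypothetical.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct, randomly chosen points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance of every point to every center.
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Hypothetical profiles of six plant breeding units forming two distinct groups.
profiles = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans(profiles, k=2)
```

The resulting `labels` assign each unit to a cluster; the number of clusters k has to be chosen by the user or by a separate criterion.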
[0200] In the given example, the cluster analysis was performed on the totality of originally received and imputed similarity scores. The cluster analysis identified nine different clusters of plant breeding units. Pairs of plant breeding units belonging to the same cluster formed clusters of pairs of plant breeding units.
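How clusters of units give rise to clusters of pairs can be sketched as follows in Python. The sketch assumes a mapping from unit name to cluster label and lists only pairs of two distinct units; the function name and the labels are hypothetical.

```python
from itertools import combinations


def intra_cluster_pairs(cluster_of):
    """Group units by cluster label and list all pairs within each cluster."""
    units_by_cluster = {}
    for unit, cluster in cluster_of.items():
        units_by_cluster.setdefault(cluster, []).append(unit)
    return {
        cluster: list(combinations(sorted(units), 2))
        for cluster, units in units_by_cluster.items()
    }


# Hypothetical cluster assignment of five plant breeding units.
cluster_of = {"Geno1": 0, "Geno2": 0, "Geno3": 1, "Par1": 1, "Par2": 1}
pairs = intra_cluster_pairs(cluster_of)
```

Pairs whose members belong to different clusters (inter-cluster pairs) are not listed by this function and would have to be handled separately.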
[0201] Then, an alignment of the similarity scores and pedigree scores and the creation of a predictive model as described e.g. with reference to steps 106-108 was performed on a per-cluster basis on the plant breeding unit pairs belonging to a particular one of the nine clusters.
[0202] This may have the advantage that cluster-specific predictive models may be able to describe potentially different score relationships of plant breeding unit pairs among and between different clusters. One example of these intra-cluster model fits is shown in
[0203] For example, the determination of n different clusters during cluster analysis may be used to split the score data into n (complete or incomplete) vectors comprising the similarity score values of the respective clusters and n further vectors comprising the pedigree scores of the plant breeding units of the clusters. Curve fitting and regression analysis are performed for each of the n clusters and respective vector pairs separately for creating n different predictive models. This may provide a greater level of detail that allows a better fit of data that deviate from pedigree relatedness due to selection. The combined similarity score matrix is created by using the n different predictive models for computing the missing similarity scores (if a cluster does not comprise any missing similarity scores, executing the respective predictive model may not be necessary).
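The per-cluster procedure can be sketched in Python as follows. The sketch assumes that each pair is given as a record of cluster label, pedigree score and similarity score (`None` if missing) and that a least-squares cubic is fitted per cluster; all names and the synthetic values are hypothetical.

```python
import numpy as np


def fit_cluster_models(records, order=3):
    """Fit one polynomial per cluster from the pairs having both scores.

    records: list of (cluster_id, pedigree_score, similarity_score_or_None)
    """
    models = {}
    for cid in {rec[0] for rec in records}:
        known = [(a, k) for c, a, k in records if c == cid and k is not None]
        if len(known) > order:  # at least order + 1 points are needed
            a_vals, k_vals = zip(*known)
            models[cid] = np.poly1d(np.polyfit(a_vals, k_vals, order))
    return models


def impute_missing(records, models):
    """Fill missing similarity scores using the model of the record's cluster."""
    completed = []
    for cid, a, k in records:
        if k is None and cid in models:
            k = float(models[cid](a))
        completed.append((cid, a, k))
    return completed


# Synthetic example: two clusters with different pedigree/similarity relations.
grid = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
records = [(0, a, a + a**3) for a in grid]
records += [(1, a, 0.2 + 2 * a - a**2) for a in grid]
records += [(0, 0.25, None), (1, 0.35, None)]

models = fit_cluster_models(records)
completed = impute_missing(records, models)
```

Clusters with too few known pairs get no model in this sketch, so their missing scores stay `None` and would have to be filled by a global model.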
[0204] In a final, optional step, a global predictive model may be created by regressing the set of similarity scores, which has meanwhile been completed, on the pedigree scores of the whole data set. Hence, a single predictive model is obtained that integrates the cluster-specific knowledge on the relationship of pedigree scores and similarity scores.
[0205] According to one example, missing similarity score values are computed and placed in the designated table or matrix at the end of the procedure in the following order: First, the received and already existing similarity scores are added to the matrix. Then, if the plant breeding units and respective score values were clustered, the pairwise similarity scores predicted by the cluster-specific models are placed into the matrix, thereby providing a completed matrix of similarity scores. Then, a global predictive model is obtained by analyzing the completed similarity score matrix and the pedigree score matrix aligned to the completed similarity score matrix. For example, the analysis may be based on fitting a polynomial curve, applying a machine learning algorithm, or the like. Finally, the global predictive model is applied on the pedigree data of all plant breeding unit pairs originally missing a similarity value to obtain final similarity scores. The combination of the originally received similarity scores and the similarity scores computed by the global predictive model is used as the final, completed similarity score matrix. For example, this final, completed similarity score matrix can be input to a genomic prediction software for predicting a feature of a plant breeding unit.
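The order of operations described in this paragraph can be summarized in a short Python sketch. It assumes that all scores are held as aligned one-dimensional arrays with `NaN` marking missing entries and that the global model is a least-squares cubic; the function and variable names are hypothetical.

```python
import numpy as np


def finalize_scores(ped, sim_received, sim_cluster, order=3):
    """Combine received, cluster-predicted and globally predicted scores.

    ped          : pedigree score per pair
    sim_received : received similarity scores, NaN where missing
    sim_cluster  : cluster-model predictions, NaN where none is available
    """
    ped = np.asarray(ped, dtype=float)
    sim_received = np.asarray(sim_received, dtype=float)
    sim_cluster = np.asarray(sim_cluster, dtype=float)
    missing = np.isnan(sim_received)

    # First the received scores, then the cluster-predicted scores.
    completed = np.where(missing, sim_cluster, sim_received)
    usable = ~np.isnan(completed)

    # Global model fitted on the completed scores and the aligned pedigree scores.
    global_model = np.poly1d(np.polyfit(ped[usable], completed[usable], order))

    # Final matrix: received scores plus global predictions for the missing pairs.
    final = sim_received.copy()
    final[missing] = global_model(ped[missing])
    return final


# Synthetic example with two pairs lacking a received similarity score.
ped = np.linspace(0.0, 0.6, 7)
true_sim = 0.1 + 1.2 * ped + 0.5 * ped**2 - 0.3 * ped**3
sim_received = true_sim.copy()
sim_received[[2, 5]] = np.nan
sim_cluster = np.full(7, np.nan)
sim_cluster[[2, 5]] = true_sim[[2, 5]]

final = finalize_scores(ped, sim_received, sim_cluster)
```

In this sketch the cluster-predicted scores only influence the result through the global model, which matches the statement that the final matrix combines the received scores with the scores computed by the global predictive model.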
[0206] A comparison of the plot depicted in
LIST OF REFERENCE NUMERALS
[0207] 102-114 steps
[0208] 200 incomplete similarity score matrix
[0209] 300 pedigree score matrix
[0210] 400 complete similarity score matrix
[0211] 500 computer system
[0212] 502 analysis software
[0213] 504 score alignment module
[0214] 506 score association module
[0215] 508 predictive model
[0216] 510 incomplete similarity score matrix
[0217] 512 pedigree score matrix
[0218] 514 aligned matrices 200, 300
[0219] 516 similarity score completion module
[0220] 518 completed similarity score matrix
[0221] 520 prediction software