Establishment of Identification and Screening Method of Cows with A2 Beta-Casein Genotype of Producing A2 Milk and Applications Thereof

20230089466 · 2023-03-23

    Inventors

    Cpc classification

    International classification

    Abstract

    A fast, batch and non-invasive identification and screening method of cows with A2 β-casein genotype of producing A2 milk belongs to the cow genotyping field and includes steps of: 1) collecting milk samples of cows; 2) detecting and acquiring mid-infrared spectral data; 3) data preprocessing to remove outliers; 4) dividing the dataset after the preprocessing into a training set and a testing set according to random sampling principle; 5) screening modeling spectral wavebands; and 6) combining different spectrum preprocessing methods with a modeling algorithm to establish classification models, using accuracy, sensitivity, specificity and AUC to evaluate the models, and determining the best classification model. The method can realize rapid and batch identification of cows of producing A2 milk and non A2 milk, and may have advantages of fast, high precision, low cost, simple operation, batch determination, no trauma (no blood collection, hair plucking, and tissue extraction), and strong practicability.

    Claims

    1. An identification and screening method of cows with A2 β-casein genotype of producing A2 milk, comprising following steps: (1) collecting milk samples, comprising: collecting raw milk from a dairy farm of breeding cows with A2 β-casein genotype of producing A2 milk and cows with non-A2 genotype of producing non-A2 milk, to obtain A2 β-casein milk (A2 milk) samples and non-A2 β-casein milk (non-A2 milk) samples; (2) acquiring mid-infrared spectra, comprising: scanning collected milk samples by using a milk composition detector within a wave number range of 4000-400 cm.sup.−1, and outputting contents of ordinary milk ingredients and a transmittance of each of the collected milk samples by a computer connected to the milk composition detector; (3) data preprocessing, comprising: converting original spectral data of the mid-infrared spectra from transmittance to absorbance, and removing outliers; (4) dataset dividing, comprising: randomly dividing a modeling dataset of the A2 milk samples and the non-A2 milk samples obtained after the step (3) into a training set and a testing set respectively accounted for 80% and 20% of the modeling dataset; (5) determining modeling spectral wavebands, comprising: removing an absorption region of water, and screening difference wavebands between the non-A2 milk samples and the A2 milk samples; (6) model establishment, comprising: taking the mid-infrared spectra of milk samples in the training set as input and classes of non-A2 milk and A2 milk as output, and establishing a classification model by using combinations of a random forest algorithm and different spectrum preprocessing methods; and (7) screening cows with A2 β-casein genotype of producing A2 milk by using the established classification model, comprising: substituting detected mid-infrared spectral data of milk into the established classification model for identification of A2 milk and non-A2 milk, identifying and screening the cow with a classification result of A2 milk as a A2A2 type cow with A2 β-casein genotype of producing A2 milk, otherwise identifying and screening the cow as non-A2A2 type cow.

    2. The identification and screening method according to claim 1, wherein the step (3) comprises: converting the transmittance T to the absorbance A as per a formula A=log 10(1/T), and reserving data of milk samples with normal contents of ordinary milk ingredients; wherein the data of milk samples with normal contents of ordinary milk ingredients refer to data meeting conditions of a milk fat percentage in a range of 2-9 g/100 g, a milk protein percentage in a range of 1-7 g/100 g, a somatic cell count equal to or less than 1 million cells/mL, and a Mahalanobis distance equal to or less than 3; and a calculation method of the Mahalanobis distance is expressed as MD=sqrt[(x−μ).sup.TΣ.sup.−1(x−μ)], where x represents a spectral value, μ represents a sample mean value, Σ represents a covariance matrix, and T represents transposition.

    3. The identification and screening method according to claim 1, wherein in the step (5), the modeling spectral wavebands comprise: 925.92-1076.382 cm.sup.−1, 1134.252-1257.708 cm.sup.−1, 1473.756-1639.65 cm.sup.−1, 1736.1-2465.262 cm.sup.−1, and 2847.204-2970.66 cm.sup.−1; wherein the step (6) comprises: using first-order differentiation with a parameter of 1, standard normal variable transformation (SNV), multivariate scatter correction (MSC), and Savitzky-Golay (SG) convolution smoothing with a window size being a combination of 17 in length and 3 in width individually to preprocess mid-infrared spectral data and compare with the mid-infrared spectral data without the preprocess, taking accuracy, sensitivity, specificity, and area under curve (AUC) as evaluation indexes, and determining to select the first-order differentiation and the random forest algorithm to establish a model A for identification of cows with A2 β-casein genotype.

    4. The identification and screening method according to claim 1, wherein in the step (5), the modeling spectral wavebands comprise: 925.92-1076.382 cm.sup.−1, 1138.11-1269.282 cm.sup.−1, 1323.294-1377.306 cm.sup.−1, 1439.034-1469.898 cm.sup.−1, 1504.62-1539.342 cm.sup.−1, 1766.964-2835.63 cm.sup.−1, and 2854.92-2966.802 cm.sup.−1; wherein the step (6) comprises: using first-order differentiation with a parameter of 1, SNV, MSC, and SG convolution smoothing with a window size being a combination of 17 in length and 3 in width individually to preprocess mid-infrared spectral data and compare with the mid-infrared spectral data without the preprocess, taking accuracy, sensitivity, specificity, and AUC as evaluation indexes, and determining to select none preprocessing of mid-infrared spectral data and the random forest algorithm to establish a model B for identification of cows with A2 β-casein genotype.

    5. The identification and screening method according to claim 1, wherein in the step (7), in the classification model established by using the random forest algorithm, a classification result with more than half of decision trees is used as a final result; when samples to be tested are divided into two classes, variable matrices in an output matrix are replaced by ‘0’ and ‘1’ to mark different classes; and when a number of decision trees with the decision of ‘0’ is more than a number of decision trees with the decision of ‘1’, the sample to be tested is determined as ‘0’ and thus β-casein genotype of a cow corresponding to the sample to be tested is determined as non-A2A2 type, otherwise the sample to be tested is determined as ‘1’ and thus the β-casein genotype of the cow is determined as A2A2 type.

    6. A use of the model established by the identification and screening method according to claim 1 in identifying cows with A2 β-casein genotype of producing A2 milk.

    7. A use of two models respectively being the model A established by the identification and screening method according to claim 4 and the model B established by the identification and screening method according to claim 5, comprising: judging a genotype of each of cows as one of A2A2 type and non-A2A2 type by the two models, individually; screening ones of the cows each with classification results of the two models both being a cow of A2A2 genotype; and determining the ones of the cows as A2A2 type cows with A2 β-casein genotype of producing A2 milk.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0039] FIG. 1 illustrates a schematic spectrum of a model A according to embodiment 1.

    [0040] FIG. 2 illustrates a schematic ROC curve of a testing set of the model A according to the embodiment 1.

    [0041] FIG. 3 illustrates a schematic spectrum of a model B according to embodiment 2.

    [0042] FIG. 4 illustrates a schematic ROC curve of a testing set of the model B according to the embodiment 2.

    DETAILED DESCRIPTION OF EMBODIMENTS

    [0043] The following embodiments are intended to further illustrate the invention, but should not be construed as limiting the invention. Any modification or substitution made to the invention without departing from the spirit and essence of the invention belongs to the scope of the invention.

    [0044] Unless otherwise specified, technical means used in the following embodiments are conventional means well known to those skilled in the art, and parameter adjustments of techniques such as first-order differential (Diff), standard normal variate transformation (SNV), multivariate scatter correction (MSC) and Savitzky-Golay (SG) convolution smoothing are conventional adjustments made by those skilled in the art according to research objects. Reagents or materials as used, unless otherwise specified, come from commercial sources.

    [0045] In the following embodiments of the invention, mid-infrared spectral data preprocessing, and model construction and verification were all realized through Python 3.8.3. Detection of ordinary milk ingredients and a light transmittance of each sample was performed through a milk composition detector MilkoScantm7RM (operated as per its product instructions manual) produced by the FOSS company.

    Embodiment 1

    [0046] Establishment of identification model A of cows with A2 β-casein genotype of producing A2 milk, may include specific steps as follows.

    [0047] (1) Collecting Milk Samples

    [0048] Raw milk uniformly mixed in a whole milking process were collected from a dairy farm of breeding non-A2 cows and A2 cows. In particular, 492 pieces of A2 milk were collected from 492 numbers of cows with β-casein genotype being A2A2 (shorted as A2), and 866 pieces of non-A2 milk were collected from 866 numbers of cows with non-A2A2 genotype (shorted as non-A2).

    [0049] (2) Acquiring Mid-Infrared Spectra

    [0050] Each of the collected milk samples was poured into a cylindrical sampling tube with a diameter of 3.5 cm and a height of 9 cm, a liquid level was ensured to be greater than 6 cm, then was bathed in a 42° C. water bath for 15-20 min, afterwards a solid optical fiber probe was extended into the liquid for sample aspirating detection, and a milk composition detector was used to scan the milk sample within a wave number range of 4000-400 cm.sup.−1, and contents of ordinary milk ingredients and a light transmittance corresponding to each the milk sample were output through a connected computer. The ordinary milk ingredients include milk fat, milk protein, lactose, milk solids non-fat, total solids, urea, and somatic cell count.

    [0051] (3) Data Preprocessing

    [0052] For original spectral data, the light transmittances were converted to absorbances, and abnormal values (outliers) were removed. Specifically, the light transmittances (T) were converted to the absorbances (A) according to A=log.sub.10(1/T); the abnormal values were removed by using the milk fat percentage, the somatic cell count and the Mahalanobis distance; and the data with the milk fat percentage in the range of 2-9 g/100 g, the milk protein percentage in the range of 1-7 g/100 g, the somatic cell count equal to or less than 1 million cells/mL (custom-character1 million cells/mL), and the Mahalanobis distance of spectrum equal to or less than 3 (custom-character3) were reserved. A calculation method of the Mahalanobis distance is MD=sqrt[(x−μ).sup.TΣ.sup.−1(x−μ)], where x represents a spectral value, μ represents a sample mean value, Σ represents a covariance matrix, and T represents transposition. After removing the abnormal samples, there were 814 pieces of non-A2 milk data and 454 pieces of A2 milk data.

    [0053] (4) Dataset Dividing

    [0054] 362 pieces of non-A2 milk data and 362 pieces of A2 milk data obtained in the step (3) were randomly selected as a modeling dataset which was randomly divided into a training set and a testing set, which accounted for 80% and 20% of the dataset respectively. The remaining 452 pieces of non-A2 milk data and 92 pieces of A2 milk data were used as a validation set.

    [0055] (5) Determining Modeling Spectral Wavebands

    [0056] Spectral data of the modeling dataset were performed with Pearson correlation test, correlations were analyzed for significance, and an absorption region of water was removed. FIG. 1 shows an average spectrum of the modeling wavebands. In FIG. 1, the abscissa is the spectral wave number, and the ordinate is the absorbance. In the legend, “0” represents non-A2 milk, “1” represents A2 milk; The parameter involved is significant level a=0.05, and wavebands of P<0.05 were selected for modeling. Finally, 925.92-1076.382 cm.sup.−1, 1134.252-1257.708 cm.sup.−1, 1473.756-1639.65 cm.sup.−1, 1736.1-2465.262 cm.sup.−1 and 2847.204-2970.66 cm.sup.−1 were selected for modeling.

    [0057] (6) Model Establishment

    [0058] The wavebands selected in the step (5) were used, spectral data are preprocessed by first-order differentiation (Diff, its parameter is 1), standard normal variable transformation (SNV), multivariate scatter correction (MSC) and SG convolution smoothing (a window size of the SG convolution smoothing is set to one of combinations of lengths 17, 19, 21, 23, 25, 27, 29, 31, 33 and widths 2, 3, 4, the combination with the best effect as per model evaluation indexes is selected as a modeling result of the SG convolution smoothing, and the window size selected in the illustrated embodiment of the invention is the combination of 17 in length and 3 in width) individually, and then compared with the spectral data of none preprocessing.

    [0059] A random forest (RF) algorithm was used to establish classification models using data of the training set and to predict samples in the testing set.

    [0060] Modeling results of the RF algorithm under different preprocessings are shown in Table 1 below.

    TABLE-US-00001 TABLE 1 Modeling results of RF algorithm under different preprocessings Evaluation indexes Algo- Pre- Accu- Sensi- Speci- rithm processing Dataset racy tivity ficity AUC RF None Training set 1 1 1 1 Testing set 0.83 0.86 0.79 0.91 Diff Training set 1 1 1 1 Testing set 0.95 0.96 0.95 0.97 SNV Training set 1 1 1 1 Testing set 0.86 0.90 0.82 0.90 MSC Training set 1 1 1 1 Testing set 0.80 0.83 0.77 0.85 SG (17,3) Training set 1 1 1 1 Testing set 0.95 0.96 0.95 0.97

    [0061] (7) Screening the Best Classification Model

    [0062] In the above dichotomous models, the accuracy, the sensitivity, the specificity and the AUC (area under curve) are used to evaluate the performances of the models. The accuracy denotes a probability that correct judgments accounts for all judgments, and the better the closer the value thereof is to 1. The sensitivity denotes a proportion of a class correctly classified in the dichotomous model, and the better the closer the value thereof is to 1. The specificity denotes a probability of another class correctly classified in the dichotomous model, and the better the closer the valve thereof is to 1. AUC denotes an area under ROC (abbreviation of receiver operating characteristic) curve, which intuitively reflects a classification ability expressed by the ROC curve, AUC=1 means this classifier is a perfect classifier, 0.5<AUC<1 means this classifier is better than a random classifier, and 0<AUC<0.5 means this classifier is worse than the random classifier. As seen from Table 1, the models established by Diff and SG preprocessings have achieved better results in the classification training of A2 milk and non-A2 milk, and the two models can accurately identify two types of targets in the training set and the testing set. Since the Diff has a faster rate in the operation than the SG convolution smoothing, that is, the Diff preprocessed model is more efficient in this classification, and thus the Diff preprocessed RF model is selected as the best classification model (also referred to as optimal classification model).

    [0063] The selected optimal classification model was used to predict 145 numbers of samples of the testing set, and results are shown in Table 2.

    TABLE-US-00002 TABLE 2 Testing set results of the model Number of Number of correctly Total Class true samples identified samples Accuracy accuracy Non-A2 73 69 94.52% 95.17% A2 72 69 95.83%

    [0064] A ROC curve is used to measure the performance of the model in the testing set. The abscissa FPR is the false positive rate, that is, the number of real non-A2 milk judged as A2 milk divided by the number of all real non-A2 milk in the testing set, and the ordinate TPR is the true positive rate, that is, the number of real A2 milk judged as A2 milk divided by the number of all real A2 milk in the testing set, and the ROC curve shown in FIG. 2 is obtained. AUC is an area enclosed by the coordinate axes under the ROC curve, and its value is between 0.5 and 1. The closer the AUC is to 1.0, the higher the authenticity of the embodiment of the invention. It can be seen from FIG. 2 that the AUC of the testing set in this embodiment is 1, indicating that the classification effect of the model on the testing set is very good.

    [0065] In the random forest (RF) classification model, a classification result with more than half of decision trees was used as a final result; when samples were divided into two classes, variable matrices in the Y matrix were replaced by ‘0’ and ‘1’ to mark the different classes, and when the number of decision trees with the decision of ‘0’ was more than the number of decision trees with the decision of ‘1’, the sample is determined as ‘0’, namely, the β-casein genotype of the cow corresponding to the sample to be tested is non-A2A2 type, and otherwise the sample is determined as ‘1’, i.e., the β-casein genotype of the cow is A2A2 type.

    [0066] (8) Validation of Optimal Model

    [0067] The validation set was used to externally validate the model, and predicted results were compared with actual results.

    [0068] Method: spectral data of samples in the validation set were substituted into the trained model, prediction results can be output, and external validation results are shown in Table 3.

    TABLE-US-00003 TABLE 3 validation results of model Number of Number of correctly Total Class true samples identified samples Accuracy accuracy Non-A2 452 377 83.41% 85.11% A2 92 86 93.48%

    [0069] It can be seen from the above that, the optimal model A with the highest prediction accuracy for identification of cows with A2 β-casein genotype was finally established as that: using normal milk samples (data with the milk fat percentage in the range of 2-9 g/100 g, the milk protein percentage in the range of 1-7 g/100 g, the somatic cell count equal to or less than 1 million cells/mL (custom-character1 million cells/mL), and the Mahalanobis distance of spectrum equal to or less than 3 (custom-character3) were reserved), five optimal modeling wavebands (925.92-1076.382 cm.sup.−1, 1134.252-1257.708 cm.sup.−1, 1473.756-1639.65 cm.sup.−1, 1736.1-2465.262 cm.sup.−1 and 2847.204-2970.66 cm.sup.−1), the first-order differential (Diff, 1) mid-infrared spectral data preprocessing method, and the random forest (RF) algorithm to establish an optimal model for identification of cows with A2 β-casein genotype. The model may achieve a prediction accuracy for A2A2 type cows being 95%, which may have advantages of accurate, batch, automatic, fast and non-invasive. The accuracy of the validation set of the model was 85.11% (463/544), and prediction accuracies of cows of producing non-A2 milk and A2 milk were 83.41% and 93.48% respectively.

    Embodiment 2

    [0070] Establishment of identification model B of cows with A2 β-casein genotype of producing A2 milk, may include specific steps as follows.

    [0071] (1) Collecting Milk Samples

    [0072] 318 pieces of A2 milk from 318 numbers of A2 cows and 539 pieces of non-A2 milk from 539 numbers of non-A2 cows were further collected from the sampled dairy farm of the embodiment 1.

    [0073] (2) Acquiring Mid-Infrared Spectra

    [0074] Each of the collected milk samples was poured into a cylindrical sampling tube with a diameter of 3.5 cm and a height of 9 cm, a liquid level was ensured to be greater than 6 cm, then was bathed in a 42° C. water bath for 15-20 min, afterwards a solid optical fiber probe was extended into the liquid for sample aspirating detection, and a milk composition detector was used to scan the milk sample within a wave number range of 4000-400 cm.sup.−1, and contents of ordinary milk ingredients and a light transmittance corresponding to each the sample were output through a connected computer.

    [0075] (3) Data Preprocessing

    [0076] For original spectral data, the light transmittances were converted to absorbances, abnormal values (outliers) were removed. Specifically, the light transmittances (T) were converted to the absorbances (A) according to A=log.sub.10(1/T); the abnormal values were removed by using a milk fat percentage, a milk protein percentage, a somatic cell count and a Mahalanobis distance; and normal milk samples were used, data with the milk fat percentage in the range of 2-9 g/100 g, the milk protein percentage in the range of 1-7 g/100 g, the somatic cell count equal to or less than 1 million cells/mL (custom-character1 million cells/mL), and the Mahalanobis distance equal to or less than 3 (custom-character3) were reserved. A calculation method of the Mahalanobis distance is MD=sqrt[(x−μ).sup.TΣ.sup.−1(x−μ)], where x represents a spectral value, μ represents a sample mean value, Σ represents a covariance matrix, and T represents transposition. After removing the abnormal samples, there were 510 pieces of non-A2 milk data and 299 pieces of A2 milk data.

    [0077] (4) Dataset Dividing

    [0078] 260 pieces of non-A2 milk data and 260 pieces of A2 milk data were randomly selected as a modeling dataset which was randomly divided into a training set and a testing set, which accounted for 80% and 20% of the dataset respectively. The remaining 279 pieces of non-A2 milk data and 39 pieces of A2 milk data were used as a validation set.

    [0079] (5) Determining Modeling Spectral Wavebands

    [0080] Spectral data were performed with Pearson correlation test, correlations were analyzed for significance, and an absorption region of water was removed. The parameter involved is significant level a=0.05, and wavebands of P<0.05 were selected for modeling. Finally, 925.92-1076.382 cm.sup.−1, 1138.11-1269.282 cm.sup.−1, 1323.294-1377.306 cm.sup.−1, 1439.034-1469.898 cm.sup.−1, 1504.62-1539.342 cm.sup.−1, 1766.964-2835.63 cm.sup.−1 and 2854.92-2966.802 cm.sup.−1 were selected for modeling. FIG. 3 shows an average spectrum of the modeling wavebands.

    [0081] (6) Model Establishment

    [0082] Spectral data are preprocessed by first-order differentiation (Diff, its parameter is 1), standard normal variable transformation (SNV), multivariate scatter correction (MSC) and SG convolution smoothing (a window size of the SG convolution smoothing is set to one of combinations of lengths 17, 19, 21, 23, 25, 27, 29, 31, 33 and widths 2, 3, 4, the combination with the best effect as per model evaluation indexes is selected as a modeling result of the SG convolution smoothing, and the window size selected in the illustrated embodiment of the invention is 17 in length and 3 in width) individually, and then compared with the spectral data of none preprocessing.

    [0083] A random forest (RF) algorithm was used to establish classification models using data of the training set and to predict samples in the testing set.

    [0084] Modeling results of the RF algorithm under different preprocessings are shown in Table 4 below.

    TABLE-US-00004 TABLE 4 Modeling results of RF algorithm under different preprocessings Evaluation indexes Algo- Pre- Accu- Sensi- Speci- rithm processing Dataset racy tivity ficity AUC RF None Training set 1 1 1 1 Testing set 0.97 1 0.94 0.97 Diff Training set 1 1 1 1 Testing set 0.94 0.98 0.89 0.96 SNV Training set 1 1 1 1 Testing set 0.80 0.82 0.77 0.84 MSC Training set 1 1 1 1 Testing set 0.71 0.77 0.66 0.81 SG (17,3) Training set 1 1 1 1 Testing set 0.95 0.98 0.91 0.96

    [0085] (7) Screening the Best Classification Model

    [0086] As seen from Table 4, the RF model established based on the none preprocessing achieved a better result in the classification training of A2 milk and non-A2 milk, the model can accurately identify two types of targets in the training set and the testing set, and thus the RF model without preprocessing was selected as the best classification model (also referred to as optimal classification model).

    [0087] In the random forest (RF) established classification model, a classification result with more than half of decision trees was used as a final result; when the samples were divided into two classes, variable matrices in the Y matrix were replaced by ‘0’ and ‘1’ to mark the different classes, and when the number of decision trees with the decision of ‘0’ was more than the number of decision trees with the decision of ‘1’, the sample is determined as ‘0’, i.e., the β-casein genotype of the cow corresponding to the sample to be tested is non-A2A2 type, and otherwise the sample is determined as ‘1’, i.e., the β-casein genotype of the cow is A2A2 type.

    [0088] The selected optimal classification model was used to predict 94 numbers of samples of the testing set, and results are shown in Table 5.

    TABLE-US-00005 TABLE 5 Testing set results for the model Number of Number of correctly Total Class true samples identified samples Accuracy accuracy Non-A2 47 44 93.62% 96.81% A2 47 47   100%

    [0089] A ROC curve is used to measure the performance of the model in the testing set. the false positive rate (FPR) was used as the abscissa, the true positive rate (TPR) was used as the ordinate TPR, and the ROC curve shown in FIG. 4 was obtained. It can be seen from FIG. 4 that the AUC of the testing set in this embodiment is 1, indicating that the classification effect of the model on the testing set is very good. Therefore, none preprocessing was selected as the data preprocessing method for modeling.

    [0090] (8) Validation of Optimal Model

    [0091] The validation set was used to externally validate the model, and predicted results were compared with real results.

    [0092] Method: spectral data of samples in the validation set were substituted into the trained model, prediction results can be output, and external validation results are shown in Table 6.

    TABLE-US-00006 TABLE 6 validation results of model Number of Number of correctly Total Class true samples identified samples Accuracy accuracy Non-A2 279 252 90.32% 91.19% A2 39 38 97.44%

    [0093] It can be seen from the above that, the optimal model B with the highest prediction accuracy for identification of cows with A2 β-casein genotype was finally established as that: using normal milk samples (data with the milk fat percentage in the range of 2-9 g/100 g, the milk protein percentage in the range of 1-7 g/100 g, the somatic cell count equal to or less than 1 million cells/mL (custom-character1 million cells/mL), and the Mahalanobis distance equal to or less than 3 (custom-character3) were reserved), seven optimal modeling wavebands (925.92-1076.382 cm.sup.−1, 1138.11-1269.282 cm.sup.−1, 1323.294-1377.306 cm.sup.−1, 1439.034-1469.898 cm.sup.−1, 1504.62-1539.342 cm.sup.−1, 1766.964-2835.63 cm.sup.−1 and 2854.92-2966.802 cm.sup.−1), the mid-infrared spectral data being not preprocessed (NONE), and the random forest (RF) algorithm to establish an optimal model for identification of cows with A2 β-casein genotype. The model may achieve a prediction accuracy for A2 type cows being 97%. The accuracy of the validation set of the model was 91.19% (290/318), and prediction accuracies of cows producing non-A2 milk and A2 milk were 90.32% and 97.44% respectively, and the model may have advantages of accurate, batch, automatic, fast and non-invasive.

    Embodiment 3

    [0094] The identification model A and the identification model B were used simultaneously to screen cows with A2 β-casein genotype of producing A2 milk.

    [0095] The model A and the model B were used to perform prediction on 481 numbers of cows (including 163 numbers of cows with true A2A2 genotype and 318 numbers of cows with non-A2A2 genotype), and specific methods and steps may be as follows.

    [0096] (1) mid-infrared spectral data of milk samples of the 481 numbers of cows were acquired.

    [0097] (2) the acquired spectral data were input into the model A and the model B individually, and corresponding result files each with “0” or “1” as the code were output.

    [0098] (3) a sample with results of the two models both being “0” denotes the β-casein genotype of the cow is non-A2A2 type, and a sample with results of the two models both being “1” denotes the β-casein genotype of the cow is A2A2 type.

    [0099] (4) cows determined as being with A2A2 type β-casein genes by both the two models were selected, and the accuracy of the selected cows was 100%, and the results are shown in Table 7.

    TABLE-US-00007 TABLE 7 prediction results of model A and model B Inconsistent True Non-A2 A2 Accuracy of Frequency of Screened total number A2 Screened Screened misjudged misjudged screened A2 occurrence of Model cows of two models cow A2 cow true A2 as A2 as non-A2 cows A2 cows A 481 163 165 160 5 3 98.16%(160/163) 33.26% (160/481) B 481 163 181 157 24 6 96.32%(157/163) 33.26% (157/481) A + B 481 36 154 155 154 1 0 .sup. 100%(154/154) 32.02% (154/481)

    [0100] The model A was used to predict and preliminarily screen 165 numbers of cows with A2A2 genotype. In which, 160 (true) numbers of cows with A2A2 genotype were predicted from the 163 numbers of cows with true A2A2 genotype, 5 numbers of cows with non-A2A2 genotype were misjudged as cows with A2A2 genotype, and 3 numbers of cows with A2A2 genotype were misjudged as cows with non-A2A2 genotype, so that the accuracy of predicting the cows with A2A2 genotype was 98.16% (160/163), and a screening rate of predicting the cows with A2A2 genotype was 33.26% (160/481).

    [0101] The model B was used to predict and preliminarily screen 181 numbers of cows with A2A2 genotype. In which, 157 (true) numbers of cows with A2A2 genotype were predicted from the 163 numbers of cows with true A2A2 genotype, 24 numbers of cows with non-A2A2 genotype were misjudged as cows with A2A2 genotype, and 6 numbers of cows with A2A2 genotype were misjudged as cows with non-A2A2 genotype, so that the accuracy of predicting the cows with A2A2 genotype was 96.32% (157/163), and a screening rate of predicting the cows with A2A2 genotype was 32.64% (157/481).

    [0102] The model A and the model B were simultaneously used for screening, except for 36 numbers of cows with inconsistent results of the two models, 155 numbers of cows with A2A2 genotype were screened from 291 numbers of non-A2A2 cows and 154 numbers of A2A2 cows. In which, 154 (true) numbers of cows with A2A2 genotype were predicted from the 154 numbers of cows with true A2A2 genotype, one cow with non-A2A2 genotype was misjudged as the cow with A2A2 genotype, and zero cow with A2A2 genotype was misjudged as the cow with non-A2A2 genotype, so that the accuracy of predicting the cows with A2A2 genotype was 100% (154/154), and a screening rate of predicting the cows with A2A2 genotype was 32.02% (154/481). Through the combination of the two models, although the frequency of screening A2 cows was reduced by 0.24% (only 3-6 numbers of cows were reduced), the accuracy and reliability of screening A2 were increased by 2-3%, which could reach the accuracy of 100%, and the result is shown in above Table 7.

    [0103] Therefore, during determining results and selecting A2A2 type cows, one with identical results of the model A and the model B is used as an accurate result. In particular, when both the models predict a sample to be A2 milk, the sample is determined as A2 milk and the cow producing the milk sample is determined as a cow with A2 β-casein genotype of producing A2 milk. For milk samples with inconsistent results predicted by the two models, the samples should be removed or not selected. By applying the prediction model(s), especially the simultaneous use of the two models and the selection of cows whose predicted results of the two models both are A2 milk producers, which not only can greatly save the cost of detection (compared with genetic detection), with only 0.5 min per sample to obtain the detection result, but also can have advantages of fast, high precision, simple operation, batch determination, non-invasive (no need of blood collection, hair plucking, tissue extraction) and high-level of animal welfare, thereby presenting strong practicality and applicability.