SYSTEM AND METHOD FOR CLASSIFICATION OF CROPS USING MULTI-CLASS MACHINE LEARNING TECHNIQUES
20230260278 · 2023-08-17
CPC classification
G06V10/774
PHYSICS
Y02A90/40
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
Abstract
The invention relates to an agricultural analytics platform that enables farmers, agriculturists and decision makers to classify crops and invasive species using multiclass machine learning techniques. The agricultural analytics platform uses data assimilation techniques to understand the changing landscape of agriculture. The invention discloses an improved set of layered solutions that help in estimating crop yields and provide insights for generating maximum output. Advanced Artificial Intelligence (AI) algorithms and statistical analyses are used to provide solutions for agricultural problems such as crop rotation, crop selection, and crop yield.
Claims
1. A computer implemented analytical platform for classification and prediction of different vegetation in a geographical area, the computer implemented analytical platform comprising: a data collection module configured to aggregate data from a data source; an image processing module configured to convert image data, wherein each pixel of the image data has a reflectance value, the reflectance values being stored as a matrix of numbers, wherein the matrix of numbers is utilised by a machine learning artificial intelligence algorithm; a feature engineering module configured to map geospatial data for the geographical area; an agricultural analytical engine implementing the machine learning algorithms, which are trained using a test dataset, wherein the test dataset includes a set of features selected by the feature engineering module to optimise the set goals; a recommendation module for prediction and classification based on the set goals, in the form of a classified matrix of numbers; and a resynthesis module to convert the classified matrix of numbers into an image and assign a geospatial projection to the image as per the set goals.
2. The computer implemented analytical platform of claim 1, wherein the reflectance value is a float value.
3. The computer implemented analytical platform of claim 1, wherein the reflectance value corresponds to a physical property of the analyzed surface.
4. The computer implemented analytical platform of claim 1, wherein the prediction is related to one of: crop classification, classification of invasive species, and a combination of crop classification with classification of invasive species.
5. The computer implemented analytical platform of claim 1, wherein the geospatial data of the geographical area is used for prediction.
6. The computer implemented analytical platform of claim 1, wherein the aggregated data from data collection module is tested by using a supervised classification.
7. The computer implemented analytical platform of claim 1, wherein the prediction and classification from recommendation modules are validated using a statistical technique.
8. The computer implemented analytical platform of claim 1, wherein the data collection module uses a set of remotely sensed data that includes a reflectance value, a vegetation index and a crop physiological characteristic.
9. The computer implemented analytical platform of claim 1 further comprising a multiclass relevance vector machine.
10. The computer implemented analytical platform of claim 1, wherein a set of ancillary information is used by the recommendation engine to improve the prediction and classification.
11. The computer implemented analytical platform of claim 1 further comprising a machine learning model of probabilistic nature to analyse a classification error in the classification.
12. The computer implemented analytical platform of claim 6, wherein the supervised classification is based on a statistical learning theory.
13. The computer implemented analytical platform of claim 9 wherein the multiclass relevance vector machine is trained with a set of assimilated inputs that relate to the aggregated data being classified.
14. The computer implemented analytical platform of claim 9, wherein a set of ancillary data is used along with spectral reflectance data to improve the prediction of the recommendation module, and for automatic classification of the spectral data using the multiclass relevance vector machine.
Description
BRIEF DESCRIPTION OF FIGURES
DETAILED DESCRIPTION
[0041] In some embodiments, the computer implemented agricultural analytical platform 110 may reside in the server 114 or be implemented on a cloud computing environment 118.
[0042] In various embodiments, the computer database 112 associated with the computer implemented agricultural analytical platform 110 may be a distributed database, a standalone database, a flat file database, a relational database or some other type of database.
[0043] The computer implemented agricultural analytical platform 110 may employ Bayesian statistics for evolutionary computation as a modeling tool and combine it with additional ancillary data related to LAI, vegetation indices (VIs), and reflectance as inputs for accurate multi-class classification of crops.
[0045] The memory 104 may include an operating system 108, one or more applications 110, and an agricultural analytics module 112, in addition to other modules. The operating system 108 may be a Windows OS, Macintosh OS, Linux OS or some other type of operating system. The one or more applications 110 may be related to agricultural data collection, crop data collection and analysis, agricultural analytics and other applications related to agricultural analysis and management.
[0046] The agricultural analytics module 112 may include machine learning algorithms, database, and other forecasting algorithms for crop analysis, crop optimization, and crop management.
[0048] The data collection module 302 may collect data from different geographical areas and regions such as the geographical area 102. In addition, the data collection module 302 may also receive data from external sources such as, but not limited to, external database 390, which may include historical data for one or more geographical areas and geographical regions.
[0049] The image module 304 may analyse images from different agricultural regions in different formats and convert them into ASCII format for analysis. In addition, the images received from a remote sensing satellite may provide additional geospatial information, such as data related to weather conditions, soil stratum, and atmospheric conditions.
[0050] In some embodiments, the agricultural analytics module 112 may include a feature engineering module 306. The feature engineering module 306 may extract features related to plants, plant species, vegetation, weather conditions, ground water, soil and other aspects to be used for training the MCRVM classification model to perform multiclass classification. In some embodiments, the agricultural analytics module 112 may also perform prediction calculations related to the production (yield) of crops per square unit.
[0051] The data integration module 308 may assimilate data extracted by the feature engineering module 306 and add it to the ASCII data to perform meaningful analysis of the combined data set and produce various analytical results related to farming, crops, soil, and weather. The additional data may also include crop physiological data to be used as an input within a defined level of granularity. In some embodiments, the combined use of assimilated data and location intelligence may be used to train the machine learning algorithms for accurate crop classification.
[0052] The classification module 310 may act upon the received data to produce results that allow a user to draw inferences based on the set goals. The classification module 310 is integrated with the agricultural analytics engine 320. The agricultural analytics engine 320 includes a rule-based engine 322, a recommendation module 324, an artificial intelligence module 330 and an analytics database 328. The rule-based engine 322 may implement different rules related to performing agricultural analytics to provide useful insights to the user. The analytical database 328 may include data related to farming for different geographical areas such as 102 and may also implement use of artificial intelligence algorithms. It may further include test data, training data and other data. The artificial intelligence module 330 may train and test analytical models and perform the analytics in real time.
[0053] In some embodiments, the classification module 310 and the agricultural analytics engine 320 may work in tandem to produce agricultural analytics.
[0054] The image synthesis module 312 may receive results related to machine learning, classification and agricultural analytics in a raw format such as ASCII format after the analysis of the collected data. The resultant data may be analyzed to recreate an image by converting the ASCII format back to digital numbers, which may provide insights related to farming analytics. In some embodiments, the image synthesis module 312 may also use additional information from external sources such as, but not limited to, location intelligence and data received from a remote sensing satellite, and may produce georeferenced, projected and classified images.
[0055] In some embodiments, the agricultural analytics engine 320 may be associated with the user interface, which may provide visual and text information related to agricultural analytics to the user.
[0058] In some embodiments, the process 400B may involve setting up a set of goals for optimization of the agricultural data. The set goals may be related to a specific objective such as, but not limited to, identifying the maximum crop yield in a set of crops or identifying the best crop under specific weather conditions. At step 432, the process 400B initiates the training process of the machine learning algorithm, where the machine learns an input-output relationship. The process 400B may in some implementations receive the training data comprising image data. Each pixel of the image data may correspond to a reflectance value, which is a decimal value. In a software implementation, each pixel value may be represented by a float data type. The pixel values of the image data are transformed into a matrix of numbers. In some implementations, the matrix of numbers may represent the reflectance values. In embodiments, step 434 of the process 400B may use the training data to train one or more algorithms associated with the artificial intelligence algorithms for prediction and classification. The outcome may then be reconverted into image(s) to produce results as per the set goals. The output of the algorithm is the transformed matrix of numbers that represents the outcome in the form of a georeferenced, projected and classified image.
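The pixel-to-matrix transformation described above can be sketched as follows. This is an illustrative NumPy outline, not the platform's actual implementation; the array shapes and function names are assumptions for the sketch.

```python
import numpy as np

def image_to_matrix(img):
    """Flatten an H x W x bands reflectance raster (float pixel values)
    into a (pixels, bands) matrix of numbers for the learning algorithm."""
    h, w, b = img.shape
    return img.reshape(h * w, b).astype(np.float32)

def matrix_to_image(class_labels, shape):
    """Reshape the classified vector of per-pixel class labels back
    into an H x W raster, ready for georeferencing and projection."""
    return np.asarray(class_labels).reshape(shape)

# round trip on a toy 4 x 4 image with 7 reflectance bands
img = np.random.rand(4, 4, 7).astype(np.float32)
X = image_to_matrix(img)              # shape (16, 7)
labels = np.zeros(len(X), dtype=int)  # stand-in classifier output
raster = matrix_to_image(labels, (4, 4))
```

In this round trip, the classifier would sit between the two calls, mapping each row of `X` to a class label.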
[0059] At step 438, the process 400B initiates the test phase, where the posterior probabilities of class membership are generated.
[0060] At step 440, the process 400B creates a final class based on the maximum Bayesian posterior probability rule. At step 442, the process 400B converts the classified matrix into an image, and geospatial projection assignment is performed. At step 444 of the process 400B, an error matrix is generated by comparing the actual classes with the predicted classes. The relevance vectors generated during the training phase at step 434 of the process 400B may be utilised for retraining of the agricultural analytics engine 320. The error matrix generated at step 444 may be utilised for determining the accuracy of the classification model. Finally, the process 400B terminates at step 446.
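Steps 440 and 444 above can be sketched directly. This is a minimal illustrative sketch; the function names are assumptions, and the class labels are taken to be 0-indexed for convenience.

```python
import numpy as np

def final_classes(posteriors):
    """Step 440: apply the maximum Bayesian posterior probability rule
    to an (instances, classes) matrix of class-membership probabilities."""
    return np.argmax(posteriors, axis=1)

def error_matrix(actual, predicted, n_classes):
    """Step 444: error (confusion) matrix comparing actual vs predicted;
    entry [a, p] counts instances of true class a predicted as class p."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for a, p in zip(actual, predicted):
        cm[a, p] += 1
    return cm

post = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.3, 0.3, 0.4]])
pred = final_classes(post)          # -> array([0, 1, 2])
cm = error_matrix([0, 1, 1], pred, 3)
```

The diagonal of `cm` counts correct classifications; off-diagonal entries feed the accuracy assessment described later.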
[0061] In embodiments, the process 400B may map unseen instances to their appropriate classes. Furthermore, in other embodiments, the agricultural analytics engine 320 may perform feature engineering.
[0063] For purpose of validation in an exemplary embodiment, the vegetation data used was downloaded from the National Snow and Ice Data Center (NSIDC) website. Several Little Washita watershed sites, which represented the dominant types of vegetation, were sampled. Sampling was performed on sites approximately 800 m×800 m in size and was concentrated in the Little Washita watershed. Reflectance and Leaf Area Index (LAI) measurements were collected at nine different sites which included measurements over a lake and a quarry for calibration purposes. The vegetation types were corn, alfalfa, soybeans, winter wheat stubble, pasture, and bare soil. Out of these, data acquired over corn, alfalfa, soybeans, bare soil, quarry and lake were used for analysis.
[0066] Vegetation data—the following sections provide details of the vegetation data used in the analysis in this embodiment of the present invention.
[0067] Multi-Spectral Radiometer Reflectance Measurements
[0068] Reflectance was measured with a CropScan multispectral radiometer. The wavelengths measured were the 485, 560, 650, 660, 830, 850, 1240, 1640, and 1650 nm bands. These bands provide data for selected channels of the Landsat Thematic Mapper and Moderate Resolution Imaging Spectroradiometer (MODIS) instruments. The channels were chosen to provide a variety of vegetation water content indices. The average percent reflectance measurements in the 485, 560, 660, and 1650 nm wavebands were used directly as inputs.
[0069] Leaf Area Index (LAI) Measurements
[0070] LAI is defined as the ratio of the total upper leaf surface of vegetation divided by the surface area of the land on which the vegetation grows. The exemplary data was measured using LI-COR LAI-2000 plant canopy analyzers, which use an indirect, non-contact method based on light transmittance through the canopy. The LAI is dimensionless (m.sup.2/m.sup.2).
[0071] Calculation of VIs
[0072] The soil adjusted vegetation index (SAVI) and normalized difference water index (NDWI) were used as inputs. The MSR-16R multi-spectral radiometer reflectance data recorded in the bands 650, 830, 850, and 1240 nm were used to calculate the VIs. The following equations were used.
SAVI=(R.sub.NIR−R.sub.RED)(1+L)/(R.sub.NIR+R.sub.RED+L) (1)
NDWI=(R.sub.NIR−R.sub.SWIR)/(R.sub.NIR+R.sub.SWIR) (2)
[0073] where, R.sub.NIR, R.sub.RED, R.sub.SWIR are the apparent reflectance values in the near-infrared (˜0.8 μm), red (˜0.6 μm), and short-wave infrared (˜1.2-2.5 μm) wavebands, respectively. L is a calibration factor (Huete 1988). SAVI and NDWI are dimensionless.
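Equations (1) and (2) can be implemented directly. The sketch below is illustrative; the default L=0.5 is a commonly used value for intermediate vegetation cover, not one stated in this document.

```python
def savi(nir, red, L=0.5):
    """Soil adjusted vegetation index, eq (1).
    L is the calibration factor (Huete 1988); 0.5 is a common choice."""
    return (nir - red) * (1 + L) / (nir + red + L)

def ndwi(nir, swir):
    """Normalized difference water index, eq (2)."""
    return (nir - swir) / (nir + swir)

# reflectance inputs are dimensionless fractions, e.g. from the
# 830 nm (NIR), 650 nm (red), and 1240 nm (SWIR) bands
s = savi(0.50, 0.10)   # ≈ 0.545
n = ndwi(0.50, 0.20)   # ≈ 0.429
```

Both indices are dimensionless, matching the statement above.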
[0074] Iris Dataset
[0075] The second dataset was the Iris flower data. This is perhaps the best-known dataset found in pattern recognition. The dataset consists of three classes with 50 instances each, where each class refers to a type of Iris plant—Setosa, Versicolour, or Virginica. The dataset has four attributes: sepal length, sepal width, petal length, and petal width in cm. The classes are very similar and can only be separated by a robust classification technique.
[0076] The Agricultural Analytical Model Building
[0077] The Relevance Vector Machine (RVM) was used as the machine learning and classification process in the preferred embodiment of the invention. This is an extension of the sparse Bayesian model developed to handle multiclass outputs. For preparation of the model, Thayananthan's open-source MCRVM algorithm, which extends Tipping's binary relevance vector machine classification scheme to a multi-class RVM and was originally applied to hand movement pattern recognition, was used as the base code. This model has been used as a base to build a completely new multi-class RVM model for crop classification, which uses data assimilation and produces a classified crop area with a projection system.
[0078] Sparse Bayesian learning describes the application of Bayesian automatic relevance determination (ARD) concepts to models that are linear in their parameters. The approach is to infer a regression or classification model that is both accurate and sparse, because it makes its predictions using only a small number of relevant basis functions that are automatically selected from a potentially large initial set. A special case of this concept is the RVM, which is applied to linear kernel models.
[0079] The data set is in the form of input-output pairs, {x.sub.n,y.sub.n}.sub.n=1.sup.N. The major goal is to learn a model of dependency of the targets on the inputs with the objective of making accurate predictions for previously unseen values of x. This model is defined as some function y(x) whose parameters are found as:
y(x;w)=Σ.sub.m=1.sup.Mw.sub.mφ.sub.m(x) (3)
[0080] where the output y(x; w) is a linearly weighted sum of M generally nonlinear and fixed basis functions, φ(x)=(φ1(x), φ2(x), . . . φM(x))T, and weights w=(w1, w2, . . . , wM)T, which are adjustable parameters. Equation (3) can result in a number of different models, of which RVMs are a special case.
[0081] This procedure is highly perceptive with a Bayesian probabilistic framework that helps in extracting predictors that are very sparse, with few non-zero w parameters. Only those basis functions that are necessary for making accurate predictions are retained.
[0082] Bayes rule states that the posterior probability of w is obtained by combining the likelihood and prior as:
p(w|t,α,σ2)=p(t|w,σ.sup.2)p(w|α)/p(t|α,σ.sup.2) (4)
[0083] where σ.sup.2 is the error variance, p(t|w,σ.sup.2) is the likelihood of target t, p(w|α) is the prior, and p(t|α,σ.sup.2) is the evidence. Applying the logistic sigmoid link function σ(y)=1/(1+e.sup.−y) to y(x) and adopting the Bernoulli distribution for p(t|w,σ.sup.2), the likelihood can be written as:
p(t|w)=Π.sub.n=1.sup.Nσ{y(x.sub.n;w)}.sup.t.sup.n[1−σ{y(x.sub.n;w)}].sup.1−t.sup.n (5)
[0084] where t.sub.n is the target class, which for this example lies in the set {1, 2, 3, 4, 5, 6}. In Zhang and Malik (2005) a true multiclass likelihood was specified. It was obtained by generalizing equation (5) to the multinomial form given by,
p(t|w)=Π.sub.n=1.sup.NΠ.sub.k=1.sup.Kσ.sub.k{y.sub.k(x.sub.n)}.sup.t.sup.nk (6)
[0085] where the predictor y.sub.k of each class was coupled with the multinomial logit function given by,
σ.sub.k(y)=e.sup.y.sup.k/Σ.sub.j=1.sup.Ke.sup.y.sup.j (7)
[0086] For obtaining probabilistic outputs, a sigmoid link function is applied to the output y(x), f(y)=1/(1+e.sup.−y). A zero mean Gaussian prior distribution is applied over w and is given by,
p(w|α)=Π.sub.i=0.sup.NN(w.sub.i|0,α.sub.i.sup.−1) (8)
[0087] Here the N independent hyperparameters, α=(α.sub.0, α.sub.1, . . . , α.sub.N)T, individually control the strength of the prior distribution over the corresponding weights and are eventually responsible for the sparsity of the model.
[0088] The closed-form expressions for the weight posterior p(w|t,α,σ.sup.2) and the evidence of the hyperparameters p(t|α,σ.sup.2) cannot be obtained, since the weights cannot be integrated out of equation (5). Hence a Laplace approximation is used. Since p(w|t,α)∝p(t|w)p(w|α), with a fixed given α, the maximum a posteriori (MAP) estimate of the weights can be obtained by maximizing log(p(w|t,α,σ.sup.2)) or by minimizing the following cost function:
E(w)=−Σ.sub.n=1.sup.N[t.sub.n log y.sub.n+(1−t.sub.n)log(1−y.sub.n)]+(1/2)w.sup.TAw (9)
[0089] The Hessian of log(p(w|t,α,σ.sup.2)) is given by,
H=∇.sup.2(log(p(w|t,α)))=Φ.sup.TBΦ+A (10)
[0090] where matrix Φ is the N×(N+1) ‘design’ matrix with φ.sub.nm=k(x.sub.n,x.sub.m-1). k(x.sub.n,x.sub.m-1) is the Gaussian kernel and has the form: k(x.sub.n,x.sub.m-1)=exp(−r.sup.−2∥x.sub.n−x.sub.m-1∥.sup.2), where r is the kernel width. A=diag{α.sub.1, . . . , α.sub.n} and B=diag(β.sub.1, β.sub.2, . . . , β.sub.N) are diagonal matrices with β.sub.n=σ{y(x.sub.n)}[1−σ{y(x.sub.n)}]. The hyperparameters α are iteratively updated using the covariance Σ and mean μ.sub.MP of the Gaussian approximation.
[0091] The covariance Σ is given by the inverse of the Hessian (equation 10),
Σ=(H).sup.−1(Φ.sup.TBΦ+A).sup.−1 (11)
[0092] and the mean is given by,
μ.sub.MP=ΣΦ.sup.TB{circumflex over (t)} (12)
{circumflex over (t)}=Φμ.sub.MP+B.sup.−1(t−y) (13)
[0093] The following equation is used for updating the hyperparameters:
α.sub.i.sup.new=(1−α.sub.iΣ.sub.ii)/μ.sub.i.sup.2 (14)
[0094] where μ.sub.i denotes the i.sup.th posterior mean weight from (equation 12), Σ.sub.ii is the i.sup.th diagonal element of the posterior weight covariance (equation 11), and the quantity 1−α.sub.iΣ.sub.ii is a measure of the degree to which the associated parameter w.sub.i is determined by the data (Khalil and Almasri, 2005). During the re-estimation process, the α.sub.i tend to infinity, making p(w.sub.i|t,α,σ.sup.2) highly peaked at zero. This makes the associated weights zero, and hence the associated basis functions are discarded, thus making the machine sparse.
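The update cycle in equations (10)-(13), together with the hyperparameter re-estimation and pruning just described, can be sketched for the binary case as follows. This is an illustrative NumPy sketch under stated assumptions (a single Newton step toward the MAP weights per outer iteration, a fixed pruning threshold standing in for α→∞, and small jitter constants for numerical stability); it is not the Thayananthan/Tipping code the embodiment actually builds on.

```python
import numpy as np

def gaussian_kernel(X, r):
    # k(x_n, x_m) = exp(-r^-2 ||x_n - x_m||^2)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / r**2)

def rvm_binary_sketch(X, t, r=1.0, iters=50, prune_at=1e6):
    N = len(t)
    Phi = np.hstack([np.ones((N, 1)), gaussian_kernel(X, r)])  # N x (N+1) design matrix
    alpha = np.ones(N + 1)           # ARD hyperparameters
    keep = np.arange(N + 1)          # surviving basis functions
    w = np.zeros(N + 1)
    for _ in range(iters):
        P = Phi[:, keep]
        y = 1.0 / (1.0 + np.exp(-(P @ w)))            # sigmoid link
        B = y * (1.0 - y) + 1e-10                     # beta_n, eq (10)
        H = (P.T * B) @ P + np.diag(alpha[keep])      # Hessian, eq (10)
        Sigma = np.linalg.inv(H)                      # covariance, eq (11)
        w = w + Sigma @ (P.T @ (t - y) - alpha[keep] * w)   # Newton step to MAP mean
        gamma = 1.0 - alpha[keep] * np.diag(Sigma)    # "well-determinedness"
        alpha[keep] = np.maximum(gamma, 1e-12) / (w**2 + 1e-12)  # re-estimation
        mask = alpha[keep] < prune_at                 # alpha -> infinity prunes
        keep, w = keep[mask], w[mask]
    return Phi, keep, w

# toy two-cluster data: the machine should separate them and stay sparse
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.3, (20, 1)), rng.normal(2, 0.3, (20, 1))])
t = np.array([0] * 20 + [1] * 20, dtype=float)
Phi, keep, w = rvm_binary_sketch(X, t)
pred = (1.0 / (1.0 + np.exp(-(Phi[:, keep] @ w))) > 0.5).astype(int)
acc = (pred == t).mean()
```

The surviving indices in `keep` play the role of the relevance vectors: typically only a small fraction of the N+1 candidate basis functions remain, which mirrors the sparsity reported for the SMEX and Iris experiments below.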
[0095] Data Assimilation, Training and Testing of the Agricultural Analytics Module
[0096] Two different datasets are used for training and testing the model.
[0097] The first dataset is the vegetation data from SMEX 2003 which had seven inputs (LAI, SAVI, NDWI and reflectance at 485, 560, 660 and 1650 nm) and six output classes (corn, alfalfa, soybeans, quarry, lake, and bare soil).
[0098] The second was the Iris flower dataset with four attributes (sepal length, sepal width, petal length and petal width) and three classes (Setosa, Versicolour and Virginica).
[0099] The first step in developing the classification scheme was data cleaning where missing and inconsistent data were removed. The aim was to extract the structural features from the data which would be used by the classifier to assemble a robust predictor and a generalized multiclass learning machine. The purpose is to build a model for vegetation/crop discrimination. Hence, several runs were performed with different combinations of reflectance values with VIs and LAI. It was observed that reflectance at 485, 560, 660 and 1650 nm along with SAVI, NDWI and LAI produced the best results and enhanced class separability. The VIs were calculated using reflectance in bands 650, 830, 850, and 1240 nm. The bands that were already used for the calculation of VIs were not used in the input training matrix.
[0100] After the data were assimilated, a small representative set of points was selected from the vegetation dataset through stratified random sampling for training the agricultural analytics model. The vegetation data training set comprised 70 instances, and an independent set consisting of 125 instances was used for testing. The trained machine was then used to classify the test data.
[0101] After the test results were obtained, which were the posterior probabilities of each class, the ultimate class was selected based on the maximum Bayesian posterior probability rule applied to these posterior probabilities.
[0102] Sensitivity analysis was performed wherein LAI was removed and the model was run for the remaining six inputs. Another analysis was done with just the reflectance data to observe the effect of data assimilation. A rigorous accuracy assessment was done where the Receiver Operating Characteristic (ROC) curves, confusion matrix, and Cohen's Kappa coefficient were calculated for each dataset. The classification accuracy was expressed as the percentage of the testing cases correctly classified.
[0103] The Iris dataset was used for testing the classifier generalization capability and accuracy. The data consists of 150 instances. It was divided equally into training and testing sets of 75 instances each by stratified sampling. The multiclass agricultural analytics model with the RVM machine was trained and tested with each of these sets.
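The stratified sampling used above (for both the vegetation and Iris splits) can be sketched as follows. This is an illustrative NumPy implementation; the function name and the fixed seed are assumptions for reproducibility, not details from the document.

```python
import numpy as np

def stratified_split(y, train_frac=0.5, seed=0):
    """Split indices so each class keeps the same proportion in the
    training and testing sets (stratified random sampling)."""
    rng = np.random.default_rng(seed)
    train = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        rng.shuffle(idx)
        train.extend(idx[: int(round(train_frac * len(idx)))])
    train = np.sort(np.array(train))
    test = np.setdiff1d(np.arange(len(y)), train)
    return train, test

# Iris-style labels: three classes with 50 instances each,
# divided equally into 75 training and 75 testing instances
y = np.repeat([0, 1, 2], 50)
train_idx, test_idx = stratified_split(y, 0.5)
```

Each class contributes 25 instances to each half, matching the equal division described above.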
[0105] Receiver Operating Characteristic (ROC) Curves
[0106] ROC curves analyze the hit rates and false-alarm rates of diagnostic decision-making. Normally, in a two-class problem, the area under the ROC curve (AUC) is a single scalar value, but in a multiclass problem there is the challenge of combining the multiple pairwise discriminabilities. In embodiments, the multiclass AUC is calculated by producing an ROC curve for each class, measuring the area under the curve, and then adding up the AUCs weighted by each reference class's prevalence in the data. It is defined by,
AUC.sub.total=Σ.sub.c.sub.i.sub.∈.sub.CAUC(c.sub.i)·p(c.sub.i) (15)
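The prevalence-weighted per-class AUC described above can be sketched with a rank-based (Mann-Whitney) estimate of each class-reference AUC. This is an illustrative implementation; it assumes column k of the probability matrix scores class k, with classes labelled 0..K-1.

```python
import numpy as np

def binary_auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney statistic: the
    probability that a positive instance is ranked above a negative one."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def multiclass_auc_total(probs, y):
    """AUC_total: one-vs-rest AUC per reference class, weighted by
    that class's prevalence p(c_i) in the data."""
    total = 0.0
    for k in np.unique(y):
        labels = (y == k).astype(int)
        total += labels.mean() * binary_auc(probs[:, k], labels)
    return total

y = np.array([0, 0, 1, 1, 2, 2])
probs = np.eye(3)[y]          # perfectly confident, perfectly ranked
auc_total = multiclass_auc_total(probs, y)   # -> 1.0
```

As noted below, this class-reference formulation is easy to compute and visualize but is sensitive to class distributions.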
[0108] In embodiments, another technique for measuring accuracy is the confusion matrix. The confusion matrix is a tool used in supervised learning to judge the accuracy of a classifier. This method has the advantage of producing single accuracy indices, which can be used for further evaluation and comparison.
[0109] In embodiments, another technique for measuring accuracy is the Kappa coefficient. The confusion matrix obtained through the multiclass RVM model may be analyzed using the Kappa coefficient, K:
K=(NΣ.sub.i=1.sup.nx.sub.ii−Σ.sub.i=1.sup.nx.sub.i+x.sub.+i)/(N.sup.2−Σ.sub.i=1.sup.nx.sub.i+x.sub.+i) (16)
[0110] where n is the number of classes, x.sub.ii is the number of observations on the diagonal of the confusion matrix corresponding to row i and column i, x.sub.i+ and x.sub.+i are the marginal totals of row i and column i, respectively, and N is the total number of instances.
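Using the symbols just defined, the Kappa coefficient reduces to observed agreement corrected for chance agreement, and can be computed from any confusion matrix. This is a small illustrative helper, not code from the disclosed platform.

```python
import numpy as np

def kappa(cm):
    """Cohen's Kappa from an n x n confusion matrix:
    (observed agreement - chance agreement) / (1 - chance agreement),
    where chance agreement comes from the row/column marginal totals."""
    cm = np.asarray(cm, dtype=float)
    N = cm.sum()
    p_o = np.trace(cm) / N                                  # observed agreement
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / N**2    # chance agreement
    return (p_o - p_e) / (1.0 - p_e)

k_perfect = kappa([[10, 0], [0, 10]])   # -> 1.0
k_partial = kappa([[8, 2], [2, 8]])     # ≈ 0.6
```

A Kappa near 1 indicates that the observed agreement is far above what chance alone would produce, which is the interpretation applied to the 0.867-0.974 confidence interval reported below.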
[0111] The final classes predicted by the agricultural analytical model were compared with the original classes and of the 125 cases in the testing set of vegetation data, only 6 were misclassified. For the Iris data, out of 70 cases in the testing set, only 1 was misclassified. The overall classification accuracy obtained for the vegetation data was 95.2% as shown in
[0112] The kappa confidence interval was 0.867 to 0.974 which reflected the strength of the inter-rater agreement and showed that the observed agreement was not accidental. The average user's and producer's accuracy for the vegetation data was 96.23% and 97%, respectively. Of six misclassifications for the vegetation data, four were confident misallocations. In the other two, the posterior probabilities of class membership were very close. Use of LAI helped the algorithm to classify other data types such as water and quarry as these had a 0 LAI value.
[0113] The agricultural analytics model was applied to the Iris data set, which is considered a standard benchmark in the pattern recognition literature. The accuracy achieved was 98.7%, which is on par with the maximum accuracy achieved with the Iris data.
[0114] In embodiments, the average user's and producer's accuracy was 98.7% and 98.7%, respectively. The Kappa coefficient was 0.98 as shown in
[0115] The inferred classifiers were sparse and used only an average of 11 RVs out of 70 training points for the SMEX vegetation dataset, and 17 RVs out of 75 training points for the Iris data. The probable reason for the larger number of RVs for the Iris data might be that one class (Setosa) is linearly separable from the other two, but the latter are not linearly separable from each other.
[0116] The multiclass AUCs were calculated by the method used by Provost and Domingos. The advantage of this AUC formulation is that AUC.sub.total is calculated directly from class reference ROC curves which can be generated and visualized easily. The disadvantage is that class reference ROC is sensitive to class distributions and error costs. The multiclass AUC.sub.total for the SMEX vegetation data was 0.995, and for the Iris data it was 0.994.
[0118] Sensitivity analysis was done to test the performance of the machine without the LAI input, and then without including LAI and the VIs. Results show that the addition of LAI to the dataset increased the accuracy by almost 1% as illustrated in
[0119] In some embodiments, the use of a Gaussian kernel resulted in the maximum accuracy of the multiclass RVM classifier, with a kernel width of 45.
[0121] UX/UI Interface
[0122] The analytics platform 110 has a user interface with features related to data ingestion and exploration, feature engineering, insights, analysis, results and a presentation dashboard. The analytics platform 110 may allow users to complete a task or achieve a specific goal, such as crop classification, crop yield calculation, or invasive species detection. Furthermore, the analytics platform 110 may in some embodiments include a Natural Language Processing (NLP) feature, where the NLP module can understand questions posed by the user in natural language.
[0123] Although specific embodiments are illustrated and described herein, it will be appreciated by those of ordinary skill in the art that any arrangement, which is calculated to achieve the same purpose, may be substituted for the specific embodiments shown. This application is intended to cover any adaptations or variations. For example, although described as applicable to certain crops, one of ordinary skill in the art will appreciate that the invention is applicable to other environments, where there may exist a need to perform similar analysis on large data sets but achieve higher predictability and better efficiency by reducing the necessary parameters for the analysis.
[0124] In particular, one of skill in the art will readily appreciate that the names of the methods and apparatus are not intended to limit embodiments. Furthermore, additional methods and apparatus can be added to the platform, functions can be rearranged among the components of the disclosed platform, and new components to correspond to future enhancements and devices used in embodiments can be introduced without departing from the scope of embodiments.
[0125] It is noted that several of the embodiments of the methods disclosed and discussed herein may be capable of performance at one or more of the components of the disclosed platform. Therefore, one having skill in the art will understand how to practice the teachings herein at different component levels of the platform without departing from the scope of this disclosure.