SYSTEM FOR TRAINING AN ENSEMBLE NEURAL NETWORK DEVICE TO ASSESS PREDICTIVE UNCERTAINTY
20220392583 · 2022-12-08
Inventors
- Guillaume GODIN (Satigny, CH)
- Ruud VAN DEURSEN (Satigny, CH)
- Florian RAVASI (Morges, CH)
- Julien HERZEN (Morges, CH)
Cpc classification
G06N3/082
PHYSICS
Abstract
The system (200) for training an ensemble neural network device is configured to execute the steps of: providing (205) a set of exemplar data, comprising at least one set of inputs (220) and at least one set of outputs (225) associated with the set of inputs, to a neural network device comprising an ensemble (230) of neural network devices configured to provide independent predictions based upon the exemplar data, operating (210) the neural network device based upon the set of exemplar data, and obtaining (215) the trained neural network device configured to provide an output. The neural network device further comprises at least two independent activation functions, whereof at least two are representative of the statistical distribution of the plurality of independent predictions. The neural network device is configured to provide at least one output (235, 236) for at least two said independent activation functions, the step of operating further comprises a step of operating each neural network device of the ensemble to provide an ensemble of outputs, and the neural network device is trained to minimize the value representative of at least two said independent activation functions.
Claims
1. System for training an ensemble neural network device, comprising one or more computer processors and one or more computer-readable media operatively coupled to the one or more computer processors, wherein the one or more computer-readable media store instructions that, when executed by the one or more computer processors, cause the one or more computer processors to execute steps of: providing a set of exemplar data, comprising at least one set of inputs and at least one set of outputs associated with the set of inputs, to a neural network device comprising an ensemble of neural network devices, configured to provide independent predictions based upon the exemplar data, operating the neural network device based upon the set of exemplar data and obtaining the trained neural network device configured to provide an output, wherein: the neural network device further comprises at least two independent activation functions, whereof at least two of the independent activation functions are representative of the statistical distribution of the plurality of independent predictions, the neural network device being configured to provide at least one output for at least two said independent activation functions and the step of operating further comprising a step of operating each neural network device of the ensemble to provide an ensemble of outputs, the neural network device being trained to minimize the value representative of at least two said independent activation functions.
2. System according to claim 1, in which the neural network device obtained during the step of obtaining is configured to provide, additionally, a value representative of the dispersion of the output.
3. System according to claim 1, in which at least two of the activation functions are representative of: a mean of the statistical distribution of the plurality of independent predictions and the variance of the statistical distribution of the plurality of independent predictions.
4. System according to claim 1, in which the neural network device further comprises a layer configured to add simulacrums of outputs generated by using the learned distribution of the plurality of independent outputs as a function of the trained at least two of the at least two independent activation functions.
5. Computer-implemented method to train a neural network device, comprising the steps of: providing a set of exemplar data, comprising at least one set of inputs and at least one set of outputs associated with the set of inputs, to a neural network device comprising an ensemble of neural network devices, configured to provide independent predictions based upon the exemplar data, operating the neural network device based upon the set of exemplar data and obtaining the trained neural network device configured to provide an output, wherein: the step of operating the neural network device, which further comprises at least two independent activation functions, whereof at least two of the independent activation functions are representative of a statistical distribution of the plurality of independent predictions, is configured to provide at least one output for at least two said independent activation functions and the step of operating further comprising a step of operating each neural network device of the ensemble to provide an ensemble of outputs, the neural network device being trained to minimize the value representative of at least two said independent activation functions.
6. Computer-implemented neural network device, wherein the neural network device is obtained by the computer-implemented method according to claim 5.
7. Computer program product, which comprises instructions to execute the steps of a method according to claim 5 when executed upon a computer.
8. Computer-readable medium, which stores instructions to execute the steps of a method according to claim 5 when executed upon a computer.
9. Computer-implemented method to predict a physical, chemical, medicinal, sensorial, or pharmaceutical property of a flavor, fragrance or drug ingredient, which comprises: a step of training, by a computing device, a neural network device according to the method object of claim 5, in which the exemplar set of data is representative of: as input, compositions of flavor, fragrance, or drug ingredients and as output, at least one physical, chemical, medicinal, sensorial, or pharmaceutical property, one of said physical, chemical, medicinal, sensorial, or pharmaceutical properties being the molecular weight of the composition, a step of inputting, upon a computer interface, at least one flavor, fragrance or drug ingredient digital identifier, the resulting input corresponding to a composition of flavor, fragrance, or drug ingredients, a step of operating, by a computing device, the trained neural network device and a step of providing, upon a computer interface, for the composition, at least one physical, chemical, medicinal, sensorial, or pharmaceutical property output by the trained neural network device.
10. Computer-implemented method to predict a category of representation in an image, which comprises: a step of training, by a computing device, a neural network device according to the method object of claim 5, in which the exemplar set of data is representative of: as input, images and as output, at least one category of representation in input images, a step of inputting, upon a computer interface, at least one image, a step of operating, by a computing device, the trained neural network device and a step of providing, upon a computer interface, for the image, at least one category of representation output by the trained neural network device.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0067] Other advantages, purposes and particular characteristics of the invention shall be apparent from the following non-exhaustive description of at least one particular embodiment of the present invention, in relation to the drawings annexed hereto.
DETAILED DESCRIPTION OF THE INVENTION
[0077] This description is not exhaustive, as each feature of one embodiment may be combined with any other feature of any other embodiment in an advantageous manner.
[0078] Various inventive concepts may be embodied as one or more methods, of which an example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
[0079] The indefinite articles “a” and “an,” as used herein in the specification and in the claims, unless clearly indicated to the contrary, should be understood to mean “at least one.”
[0080] The phrase “and/or,” as used herein in the specification and in the claims, should be understood to mean “either or both” of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with “and/or” should be construed in the same fashion, i.e., “one or more” of the elements so conjoined. Other elements may optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, a reference to “A and/or B”, when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
[0081] As used herein in the specification and in the claims, “or” should be understood to have the same meaning as “and/or” as defined above. For example, when separating items in a list, “or” or “and/or” shall be interpreted as being inclusive, i.e., the inclusion of at least one, but also including more than one of a number or lists of elements, and, optionally, additional unlisted items. Only terms clearly indicated to the contrary, such as “only one of” or “exactly one of,” or, when used in the claims, “consisting of,” will refer to the inclusion of exactly one element of a number or list of elements. In general, the term “or” as used herein shall only be interpreted as indicating exclusive alternatives (i.e., “one or the other but not both”) when preceded by terms of exclusivity, such as “either,” “one of,” “only one of,” or “exactly one of.” “Consisting essentially of,” when used in the claims, shall have its ordinary meaning as used in the field of patent law.
[0082] As used herein in the specification and in the claims, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, “at least one of A and B” (or, equivalently, “at least one of A or B,” or, equivalently “at least one of A and/or B”) can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.
[0083] In the claims, as well as in the specification above, all transitional phrases such as “comprising,” “including,” “carrying,” “having,” “containing,” “involving,” “holding,” “composed of,” and the like are to be understood to be open-ended, i.e., to mean including but not limited to. Only the transitional phrases “consisting of” and “consisting essentially of” shall be closed or semi-closed transitional phrases, respectively.
[0084] It should be noted at this point that the figures are not to scale.
[0085] As used herein, the term “volatile ingredient” designates any ingredient, preferably presenting a flavoring or fragrance capacity. The terms “compound” or “ingredient” designate the same items as “volatile ingredient.” An ingredient may be formed of one or more chemical molecules.
[0086] The term “composition” designates a liquid, solid or gaseous assembly of at least one fragrance or flavor ingredient.
[0087] As used herein, a “flavor” refers to the olfactory perception resulting from the sum of odorant receptor(s) activation, enhancement, and inhibition (when present) by at least one volatile ingredient via orthonasal and retronasal olfaction as well as activation of the taste buds which contain taste receptor cells. Accordingly, by way of illustration and by no means intending to limit the scope of the present disclosure, a “flavor” results from the olfactory and taste bud perception arising from the sum of a first volatile ingredient that activates an odorant receptor or taste bud associated with a coconut tonality, a second volatile ingredient that activates an odorant receptor or taste bud associated with a celery tonality, and a third volatile ingredient that inhibits an odorant receptor or taste bud associated with a hay tonality.
[0088] As used herein, a “fragrance” refers to the olfactory perception resulting from the sum of odorant receptor(s) activation, enhancement, and inhibition (when present) by at least one volatile ingredient. Accordingly, by way of illustration and by no means intending to limit the scope of the present disclosure, a “fragrance” results from the olfactory perception arising from the sum of a first volatile ingredient that activates an odorant receptor associated with a coconut tonality, a second volatile ingredient that activates an odorant receptor associated with a celery tonality, and a third volatile ingredient that inhibits an odorant receptor associated with a hay tonality.
[0089] As used herein, the term “means of inputting” designates, for example, a keyboard, mouse and/or touchscreen adapted to interact with a computing system in such a way as to collect user input. In variants, the means of inputting are logical in nature, such as a network port of a computing system configured to receive an input command transmitted electronically. Such an input means may be associated with a GUI (Graphical User Interface) shown to a user or an API (Application Programming Interface). In other variants, the means of inputting may be a sensor configured to measure a specified physical parameter relevant for the intended use case.
[0090] As used herein, the terms “computing system” or “computer system” designate any electronic calculation device, whether unitary or distributed, capable of receiving numerical inputs and providing numerical outputs from and to any sort of interface, digital and/or analog. Typically, a computing system designates either a computer executing software having access to data storage or a client-server architecture wherein the data storage and/or calculation is performed at the server side while the client side acts as an interface.
[0091] As used herein, the term “digital identifier” refers to any computerized identifier, such as one used in a computer database, representing a physical object, such as a flavoring ingredient. A digital identifier may refer to a label representative of the name, chemical structure, or internal reference of the flavoring ingredient.
[0092] As used herein, the term “human reaction” refers to any physical behavior induced by exposing a human to a composition. This behavior may be defined broadly, such as appreciation or dislike for the composition, or in more detail, such as a description of facial expression or body movement when confronted with the composition.
[0093] In the present description, the term “materialized” is intended as existing outside of the digital environment of the present invention. “Materialized” may mean, for example, readily found in nature or synthesized in a laboratory or chemical plant. In any event, a materialized composition presents a tangible reality. The terms “to be compounded” or “compounding” refer to the act of materialization of a composition, whether via extraction and assembly of ingredients or via synthesis and assembly of ingredients.
[0094] As used herein, the term “activation function” defines, in a neural network, how the weighted sum of the input is transformed into an output from a node or nodes in a layer of the network. These activation functions may be defined by layers in the network or by arithmetic solutions in the loss functions.
[0095] The embodiments disclosed below are presented in a general manner.
[0102] The system 200 as such may be formed of any combination of means for executing the characteristic steps performed by the computer processors.
[0103] The step 205 of providing may be performed via a computer interface, such as an API or any other digital input means. This step 205 of providing may be initiated manually or automatically. The set of exemplar data may be assembled manually, upon a computer interface, or automatically, by a computing system, from a larger set of exemplar data.
[0104] The exemplar data may comprise, for example: [0105] at least one fragrant, flavor or drug ingredient digital identifier, said at least one digital identifier forming a composition, said composition being optionally associated with a composition identifier and [0106] a molecular weight for the composition.
[0107] Such a set of exemplar data may be obtained by assembling compositions and summing the theoretical weight of each atom to obtain the molecular weight of each composition of the exemplar set.
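By way of illustration only, the following sketch shows how such an exemplar set could be assembled, assuming that each ingredient is identified by a SMILES string and that the RDKit library is used to sum the atomic weights; the example ingredients are hypothetical choices, not part of the claimed subject matter.

```python
# Minimal sketch, assuming RDKit and SMILES-based digital identifiers.
from rdkit import Chem
from rdkit.Chem import Descriptors

def composition_molecular_weight(smiles_list):
    """Sum the theoretical molecular weight of every ingredient in a composition."""
    total = 0.0
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            raise ValueError(f"Unparseable ingredient identifier: {smiles}")
        total += Descriptors.MolWt(mol)  # sum of the atomic weights of the molecule
    return total

# Example composition: vanillin and limonene (illustrative identifiers).
inputs = ["O=Cc1ccc(O)c(OC)c1", "CC1=CCC(CC1)C(=C)C"]
target = composition_molecular_weight(inputs)  # molecular weight output of the pair
```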
[0108] In other variants, the exemplar data may comprise, for example: [0109] at least one image and [0110] for at least one said image, a category among a list of possible categories of what the image represents (‘airplane’, ‘car’, ‘bird’, ‘cat’, ‘deer’, ‘dog’, ‘frog’, ‘horse’, ‘ship’, and ‘truck’, for example). Such a category is sometimes called a tag, a label, or a class of representation.
[0111] The step 210 of operating may be performed, for example, by a computer program executed upon a computing system. During this step 210 of operating, the ensemble neural network device is configured to train based upon the input data. During this step 210 of operating, each neural network of the ensemble neural network device configures the coefficients of its layers of artificial neurons to provide an output, these outputs forming a distribution of outputs. Values of statistical parameters representative of the distribution may be obtained and used in activation functions to be minimized.
[0112] In particular embodiments, at least two of the activation functions are representative of: [0113] a mean of the statistical distribution of the plurality of independent predictions and [0114] the variance of the statistical distribution of the plurality of independent predictions, [0115] optionally extended with additional activation functions representative of: [0116] the skew of the statistical distribution of the plurality of independent predictions and/or [0117] the kurtosis of the statistical distribution of the plurality of independent predictions.
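As a minimal sketch of the quantities these activation functions are representative of, the moments of the distribution formed by the independent predictions for one input could be computed as follows (NumPy is assumed; the function name is illustrative):

```python
import numpy as np

def distribution_moments(predictions):
    """Moments of the distribution of the ensemble's independent
    predictions for one input (predictions shape: [n_members])."""
    mean = predictions.mean()
    var = predictions.var()
    centred = (predictions - mean) / (predictions.std() + 1e-12)
    skew = (centred ** 3).mean()          # third standardized moment
    kurtosis = (centred ** 4).mean() - 3  # excess kurtosis
    return mean, var, skew, kurtosis
```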
[0118] The step 215 of obtaining may be performed via a computer interface, such as an API or any other digital output system. The obtained trained ensemble neural network device may be stored in a data storage, such as a hard drive or a database, for example.
[0119] In particular embodiments, the neural network device obtained during the step 215 of obtaining is configured to provide, additionally, at least one value representative of the statistical dispersion of the output.
[0121] In this embodiment, the neural network device further comprises a layer 240 configured to add simulacrums 245 of outputs generated by using the learned distribution of the plurality of independent outputs as a function of the trained at least two of the at least two independent activation functions.
[0122] Such embodiments may correspond, for example, to a Gaussian augmentation of the output, based upon the mean and variance of the output that the neural network device provides.
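A minimal sketch of such an augmentation layer, assuming PyTorch and a reparameterization-style draw from Normal(mean, variance); the number of simulacrums and the tensor shapes are illustrative assumptions:

```python
import torch

def gaussian_simulacrums(mean, var, n_samples=8):
    """Add simulated outputs ('simulacrums') drawn from the learned
    Normal(mean, var) distribution of the ensemble's outputs."""
    std = var.clamp_min(1e-8).sqrt()  # guard against degenerate variance
    eps = torch.randn(n_samples, *mean.shape, device=mean.device)
    return mean + std * eps           # shape: [n_samples, *mean.shape]
```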
[0129] The steps of providing 305, operating 310 and 320, and obtaining 315 correspond to the respective steps of the system 200 object of the present invention disclosed above.
[0130] The present invention aims at a computer-implemented neural network device, characterized in that the neural network device is obtained by the computer-implemented method 300 according to claim 5.
[0131] The present invention aims at a computer program product, comprising instructions to execute the steps of the method 300 described above.
[0132] The present invention aims at a computer-readable medium, storing instructions to execute the steps of the method 300 described above.
[0140] The method 400 object of the present invention is one of the embodiments of the system 200 object of the present invention disclosed above.
[0141] It is thus possible to predict the molecular weight (MW) of a molecule. A molecular weight is computed by summing the atomic weights over all atoms in the molecule. By using such a simple target, one can evaluate whether a given architecture can extract the meaningful molecular information from any data size, considering that the molecular weight can be computed for any given proposed molecule. A major advantage of this approach is that one has an exact metric showing no variance on the measurement. Contrary to molecular weight, an experimentally measured target is by essence not just a sum of atomic weights but rather a complex function of the conditions used. If we can accurately predict molecular weight, we can at least validate that a model has correctly extracted the chemical knowledge from the data. Considering the uncertainty drawbacks listed above, such an evaluation of the chemical knowledge extraction is not trivial if a model is exclusively trained on an experimental target.
[0142] Below, further considerations and embodiments are disclosed:
[0143] Relative to chemical properties, the results on the prediction of molecular weight by a neural network of the present invention are presented herein. Although the task of predicting molecular weight seems obvious, this prediction has two principal advantages. Firstly, the target is an exact value with minimal to no variance. In the experiments one can thus assess the results excluding the data variance as an explanation for the results. Secondly, the prediction clearly communicates whether a neural network has been able to make the correct chemical abstraction.
[0144] This comparison is performed for the prediction of molecular weight by a single deterministic neural network (SDNN) and an ensemble neural network trained using mean and variance (MSENN). The models have been trained using a recurrent neural network, typically used in natural-language processing. The input is defined by a tokenized vector describing the atoms and bonds in the molecule, closely resembling the tokenized input typically used for natural-language processing. An example of such a format is the SMILES string. The results have been computed for an internal dataset of 9979 molecules, typically found in natural plants used for their olfactive, taste and medicinal properties, with molecular weight <450.
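For illustration, a rough tokenization of SMILES strings of the kind described above might look as follows; the token pattern and the vocabulary handling are assumptions, not the exact preprocessing used for the reported experiments:

```python
import re

# Commonly used SMILES token pattern (an assumption for this sketch).
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize(smiles):
    """Split a SMILES string into atom and bond tokens."""
    return SMILES_TOKEN.findall(smiles)

def encode(smiles, vocab):
    """Map a SMILES string to the integer token ids fed to the RNN."""
    return [vocab[token] for token in tokenize(smiles)]

tokens = tokenize("O=Cc1ccc(O)c(OC)c1")
# ['O', '=', 'C', 'c', '1', 'c', 'c', 'c', '(', 'O', ')', 'c', '(', 'O', 'C', ')', 'c', '1']
```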
[0145] As a first result, the results show that the reproducibility of SDNNs is limited. Firstly, one can see that the performance of SDNNs fluctuates strongly. Indeed, even though the variance on the data is limited, large differences can be seen in the RMSE for both the train set 905 and test set 910 performances. Please note that all networks have been trained starting from identical initial weights. Secondly, one observes that the performance is strongly dependent on the data split used. As a matter of fact, one can observe that for some data splits the test performance is too optimistic, for some well equilibrated, and, for most splits, too pessimistic. Based on these fluctuating results, one can already state that the expected performance may vary significantly. In other words, the performance obtained on one test set is not indicative of the performance on another test set. Even though one expects that the performance on future unseen data may display the same variations, the exact performance is strongly dependent on the evaluated selection. Indeed, sample size and sample bias in future selected datasets will strongly influence the performance.
[0146] As a second result, the performance of our ensemble neural networks is compared against the performance of SDNNs, applying the same train-test splits to the two networks. Firstly, one can see that, with one exception, the performance 1005 on the training sets is stable, with RMSE between 0.5 and 0.6. The distribution of observed RMSE values on the train and test sets shows clear downshifts 1010 for the ensemble models in comparison with single models. Secondly, one can see that the performance on the test set has also significantly improved compared to the performance of a single network. As mentioned previously, the performance varies depending on the split used. Indeed, even though the training performance is very robust for the present ensemble neural networks, the range of test performances 1015 is very large. These results once again suggest that the performance on one test set is not indicative of the performance on another test set. As mentioned previously, one should also expect similar variations when validating the performance on future unseen data. Thirdly, one can also see that using ensemble neural networks has decreased the RMSE by 15-30%. The reduction is observed for both the train and test sets in the data, 1020 and 1025.
[0147] In summary, an ensemble neural network which is actively trained on the mean and variance shows better performance than single deterministic neural networks.
[0148] While a normal distribution, training the mean and variance, can be used, this method can also be used with other statistical distributions.
[0161] The method 500 object of the present invention is one of the embodiments of the system 200 object of the present invention disclosed above.
[0162] Below, further considerations and embodiments are disclosed:
[0163] Neural networks are an emerging trend being introduced for a wide range of applications. Thanks to large volumes of images, neural networks have been widely adopted for image problems. Frequently, neural network image learning can easily be addressed using single deterministic neural networks. The principal drawback of single deterministic neural networks is that they do not communicate the predictive model uncertainty. A second drawback is that most single networks are not robust, making them highly vulnerable to data perturbations, more colloquially known as adversarial examples. In summary, assessment of model uncertainty has been recognized as one of the key areas that is not yet resolved.
[0164] Recently, evidential deep learning has been introduced to estimate the model uncertainty for image classification. In this method, a single additional evidential layer is introduced to provide the parameters of a pre-selected distribution. The variance can then be computed mathematically by applying the variance equation for the selected distribution. In the work by Sensoy et al. (https://arxiv.org/abs/1806.01768), a Dirichlet distribution has been used as the supplier of the variance. This has led to the detection of out-of-domain queries and increased robustness against adversarial perturbations. A first major drawback of the method is introduced by the selection of the distribution: the resulting variance is a result bound to the trained system. In a reductio ad absurdum, one may even say that the solution for the variance is equally subject to the same concerns. Indeed, the new parameters also originate from a new single deterministic neural network.
[0165] To remedy these drawbacks, the present invention uses an ensemble neural network (ENN). ENNs have been introduced to improve the robustness of neural networks, but also to provide a notion of model uncertainty. Examples of ENNs are test-time mean ensembles, bootstrapping ensembles, snapshot ensembles, dropout ensembles, mean ensembles, mean-variance ensembles and ensembles trained using negative correlation learning.
[0166] In this group, snapshot ensembles and dropout ensembles stand out, because they are typically computed using a single deterministic neural network. In a snapshot ensemble, multiple weight configurations taken at varying time points are combined. The resulting variance is thus a metric of time stability for the predicted point. In a dropout ensemble, an uncertainty is produced by applying the dropout layer at inference time as well. The produced variance is thus a measure of stability under parameter subsampling. A drawback of both networks is that the predicted uncertainty is frequently underestimated. In snapshot ensembles this results from the time dependence. In dropout ensembles, variables may be present in multiple selections. This can be remedied by applying neural networks with very high dropout rates; this, however, may have a significant influence on the size of the network.
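As a sketch of the dropout ensemble idea described above (for context only; this is not the mechanism of the present invention), dropout can be kept active at inference time and the stochastic forward passes treated as ensemble members:

```python
import torch

def mc_dropout_predict(model, x, n_passes=30):
    """Dropout ensemble: stochastic forward passes with dropout enabled.
    Note: model.train() also switches e.g. BatchNorm to training mode;
    a careful implementation would enable only the dropout layers."""
    model.train()  # keep torch.nn.Dropout active
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_passes)])
    # Predictive mean and the (often underestimated) variance.
    return preds.mean(dim=0), preds.var(dim=0)
```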
[0167] A particular solution is provided by bootstrapping ensembles. In this type of network an ensemble is created by training the same network with different data selections. The notion of model uncertainty produced is thus a metric of robustness against data subsampling. In this type of ensemble, high-density points are well supported and are not affected by subsampling. The same concern raised for dropout networks can be raised for bootstrapping ensembles: it is expected that the uncertainty might be underestimated because small bootstrapping omission rates may lead to repeated use of data points. The latter particularly benefits points originating from high-density regions in the data.
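A bootstrapping ensemble of the kind discussed here can be sketched by drawing, for each member, a resample of the training indices with replacement (illustrative only; high-density points recur in most resamples, which is the effect discussed above):

```python
import numpy as np

def bootstrap_indices(n_points, n_members, seed=0):
    """One index resample per ensemble member, drawn with replacement."""
    rng = np.random.default_rng(seed)
    return [rng.integers(0, n_points, size=n_points) for _ in range(n_members)]
```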
[0168] The group of mean ensembles and test-time mean ensembles defines a group of ensemble networks that are combined to produce a mean and variance for the predictions. Whereas the mean ensemble is actively trained on the mean, a test-time ensemble is an ensemble of independently trained networks. In the case of mean networks, the model is trained to predict the right mean value. A major drawback of training only the mean is that the submodels' variance may vary significantly between predictions on different points. In a test-time ensemble, all models are individually trained. Consequently, these networks may even display the issue that the mean value across the networks is not optimized at all, as it is for their mean ensemble counterparts.
[0169] Mean-variance ensemble and lower-upper-bound ensemble neural networks are trained using the data variance. The lower and upper bounds are a variant of the mean and variance, i.e., the lower and upper bounds are computed as mean-variance and mean+variance, respectively. In these networks, the network is trained using the data variance. The major drawback of this approach is that the variance does not provide any conclusion on the model uncertainty. Additionally, the data variance itself is not a property of the predicted point, but a property of the observed fluctuations on its measurements. As a matter of fact, the reported variance strongly depends on the number of measurements performed for each data point, and the number of reported measurements may vary significantly from point to point.
[0170] The drawbacks mentioned previously have been resolved by applying a strategy called negative correlation learning (NCL). In this approach, one typically modifies the loss to account for the diversity in the signal. Examples of proposed training mechanisms are the use of a coupling term or the use of the KL-divergence. These methods have been extensively evaluated and it has been observed that the performance of the base learners varies strongly. Whereas the method is usually beneficial for small-capacity base learners, it is reported to be harmful for large-capacity base learners. In summary, the use of NCL in ensembles requires hard fine-tuning optimization to hopefully reach good results. One common formulation is sketched below.
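This sketch assumes a squared-error base loss and couples the members through a diversity penalty weighted by a coefficient lambda; as noted above, this coefficient requires careful tuning:

```python
import torch

def ncl_loss(member_preds, target, lam=0.5):
    """Negative correlation learning for an ensemble of regressors.
    member_preds: [n_members, batch]; target: [batch].
    The penalty rewards deviation from the ensemble mean (diversity)."""
    ensemble_mean = member_preds.mean(dim=0)
    base = ((member_preds - target) ** 2).mean()
    diversity = ((member_preds - ensemble_mean) ** 2).mean()
    return base - lam * diversity  # lam is the hard-to-tune coupling term
```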
[0171] When training a single deterministic neural network, one commonly cannot tell if the initialization of such a network may lead to the best result. In addition, one cannot tell exactly whether the model has developed a bias for a particular subset of the used data. In extreme cases, one may observe that the model fails to provide answers to some of the questions asked, i.e., it may fail to correctly predict some points.
[0172] In this particular embodiment, an ensemble neural network device is trained on both the mean and variance to establish a communication between the submodels of an ensemble neural network. As a simplification for the training mechanism, a sampling mechanism is applied on the mean and variance produced in the ensemble, much like the sampling mechanism used in variational auto-encoders (VAE).
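A minimal sketch of this mechanism, under architectural assumptions (small fully connected members, a single regression output) that are illustrative rather than the patent's exact design:

```python
import torch
import torch.nn as nn

class MeanVarianceEnsemble(nn.Module):
    """Ensemble whose members communicate through the mean and variance
    of their independent predictions (sketch; architecture is assumed)."""
    def __init__(self, in_dim, n_members=8, hidden=64):
        super().__init__()
        self.members = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(n_members)
        )

    def forward(self, x):
        preds = torch.stack([m(x).squeeze(-1) for m in self.members])  # [M, batch]
        return preds.mean(dim=0), preds.var(dim=0)

def sampled_loss(mean, var, target):
    """VAE-style reparameterized sample from Normal(mean, var); gradients
    flow to every member through both the mean and the variance."""
    sample = mean + var.clamp_min(1e-8).sqrt() * torch.randn_like(mean)
    return ((sample - target) ** 2).mean()
```

At inference, the same mean and variance directly provide the prediction and its uncertainty estimate.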
[0173] Note that, contrary to the present system, a VAE is a single deterministic neural network using an independent layer of random variance to become a generative neural network by applying the sampling mechanism.
[0174] In this work, the methodology has been applied to image classification using the CIFAR-10 dataset. In CIFAR-10, one asks the models to predict one class from 10 possible classes for a set of images. The results are computed for 5 different splits with a training size of 50,000 images and a test set of 10,000 images. The results have been summarized by measuring the classification accuracy, i.e., the percentage of correct predictions.
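For reference, the reported metric is plain classification accuracy, summarized over the splits as a mean and a dispersion (a minimal sketch; names are illustrative):

```python
import numpy as np

def accuracy(predicted_classes, true_classes):
    """Fraction of correct predictions."""
    return (np.asarray(predicted_classes) == np.asarray(true_classes)).mean()

def summarize_splits(split_accuracies):
    """Mean +/- standard deviation over the evaluated splits."""
    a = np.asarray(split_accuracies)
    return a.mean(), a.std()
```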
[0175] A comparison of performances is obtained for sampling using the ensemble's mean and variance, sampling using a full covariance over the ensemble, and sampling from an independent layer of variance produced in the network. Note that the latter method is identical to the strategy used in VAEs. In the table below, the three methods are referred to as Diagonal, Full Covariance, and Diagonal MLP, respectively. The present methods have been compared to five existing solutions: 1) mean ensemble, 2) negative-correlation learning, 3) single deterministic neural network, 4) dropout ensemble, and 5) bootstrapping ensemble. In this table, these solutions are identified as Mean ensemble, NCL, Single deterministic NN, Dropout ensemble and Bagging ensemble, respectively. The reported results are prediction accuracies. A sketch of the Full Covariance sampling follows.
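The Full Covariance variant can be sketched as fitting a multivariate normal across the ensemble members' class logits and drawing a differentiable sample from it. The shapes, the jitter term and the use of torch.distributions are assumptions for illustration; with few members the sample covariance is rank-deficient, which the jitter compensates:

```python
import torch

def sample_full_covariance(member_logits, jitter=1e-4):
    """member_logits: [n_members, batch, n_classes] (assumed shapes).
    Returns one differentiable sample of class logits per batch element."""
    mean = member_logits.mean(dim=0)                        # [batch, C]
    centred = member_logits - mean                          # [M, batch, C]
    cov = torch.einsum("mbi,mbj->bij", centred, centred) / (member_logits.shape[0] - 1)
    cov = cov + jitter * torch.eye(cov.shape[-1], device=cov.device)
    dist = torch.distributions.MultivariateNormal(mean, covariance_matrix=cov)
    return dist.rsample()  # reparameterized, so gradients reach all members
```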
[0176] Performance results on the tested methodologies, sorted by decreasing accuracy.
TABLE-US-00001
Methodology                              Validation accuracy
Full covariance (present invention)      83.1 +/− 0.3%
Dropout ensemble                         82.8 +/− 1.1%
Diagonal (present invention)             82.4 +/− 0.2%
Negative correlation learning (NCL)      81.7 +/− 0.4%
Bagging ensemble                         81.6 +/− 0.2%
Mean ensemble                            79.2 +/− 0.3%
Single deterministic NN                  77.0 +/− 0.5%
Diagonal MLP (present invention)         76.0 +/− 0.5%
[0177] The results in the above table show several clear trends. Firstly, all ensemble neural networks outperform the single deterministic neural networks. Indeed, the networks Single deterministic NN and Diagonal MLP display significantly lower performances than the six ensemble methodologies at the top of the table. Secondly, for Diagonal MLP one can see that the use of an independent layer of random variance is not beneficial to improving results. Moreover, the results show that the performance drop in Diagonal MLP is statistically significant when compared to Single deterministic NN. Thirdly, one can see that the classical mean ensemble (Mean ensemble), the bootstrapping ensemble (Bagging ensemble) and negative correlation learning (NCL) can all improve the prediction accuracies. Fourthly, one can observe that the present ensemble techniques Full Covariance and Diagonal perform significantly better. Fifthly, of the reported ensemble methods, only the dropout ensembles reach similar accuracy performances. It should be noted, however, that the dropout ensembles show strong fluctuations in the reported performances. Whereas our ensemble methodologies Full Covariance and Diagonal show variances of 0.3% and 0.2%, respectively, the dropout ensembles show a significantly larger variance of 1.1%, demonstrating that the present methods are more robust than the existing dropout method.
[0178] In summary, ensemble neural networks with communicating submodels reach consensus agreements that outperform previously reported ensemble neural networks in prediction accuracy and reproducibility.