High-throughput virtual drug screening system based on molecular fingerprints and deep learning

Abstract

A high-throughput virtual drug screening system based on molecular fingerprints and deep learning includes a deep-learning model online-modeling subsystem and an online virtual-screening subsystem. The system combines molecular fingerprints with a deep neural network method, includes built-in structurally diverse screening libraries, and realizes the online automatic construction of deep learning models and virtual screening. The system helps researchers in drug discovery fields such as medicinal chemistry to rapidly screen compounds against their desired targets, obtain potentially active compounds, and accelerate drug discovery.

Claims

1. A high-throughput virtual drug screening system based on molecular fingerprints and deep learning, wherein the high-throughput virtual drug screening system comprises a deep-learning model online-modeling subsystem and an online virtual-screening subsystem; the deep-learning model online-modeling subsystem comprises an online-modeling module and a model-result module; the online-modeling module is configured to construct a corresponding model based on a type of a model to be constructed, a drug target, the molecular fingerprints, and parameters, wherein the type of the model to be constructed, the drug target, the molecular fingerprints and the parameters are selected by a user, wherein the type of the model to be constructed comprises a qualitative classification model and a quantitative regression model; the model-result module is configured to indicate a model list and detailed information of an individual model, wherein the model list is configured to indicate information of all models submitted by a current user, comprising serial numbers of all the models, data sources, serial numbers of drug targets, types of all the models, creation times and completion times of all the models, and status of all the models; the detailed information of the individual model comprises the parameters of the model and performance information of the model, wherein the parameters of the model and the performance information of the model are configured to indicate changes during a model performance index training process; the online virtual-screening subsystem comprises an online-screening module and a screening-result module; the online-screening module is configured to select a screening model and a screening library and then conduct a screening, wherein the screening model is selected by entering a serial number of the model or clicking within the model list, and the screening library is selected from existing compound libraries or from compound libraries uploaded by the user; and the screening-result module is configured to store a screening list and screening detailed information, wherein the screening list is configured to indicate a serial number of the screening model, a name of the screening library, features of the screening model, and starting time and ending time of the screening, and the screening detailed information comprises scores and serial numbers of selected compounds; wherein the qualitative classification model is applicable for 1,251 drug targets, and the quantitative regression model is applicable for 1,814 drug targets; wherein the high-throughput virtual drug screening system involves twelve types of the molecular fingerprints; wherein for the qualitative classification model, the detailed information of the individual model comprises changes in loss, Accuracy, Recall, Precision, F1 score, Matthews correlation coefficient, and AUC value; and wherein for the quantitative regression model, the detailed information of the individual model comprises loss, coefficient of determination, mean squared error, root mean squared error, and mean absolute error.

2. The high-throughput virtual drug screening system based on the molecular fingerprints and deep learning according to claim 1, wherein the compound libraries comprise natural product libraries, drug libraries, covalent-binding compound libraries, protein-protein interaction small-molecule libraries, ion-channel compound libraries, and synthetic compound libraries.

3. The high-throughput virtual drug screening system based on the molecular fingerprints and deep learning according to claim 2, wherein, for the qualitative classification model, the scores of the selected compounds range from 0 to 1 and represent a probability of having an activity; and, for the quantitative regression model, the scores of the selected compounds comprise pIC50, pKi, and pKd values, wherein a compound with higher values is predicted to have a higher activity.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a schematic diagram showing a structure of the high-throughput virtual drug screening system based on molecular fingerprints and deep learning of the present invention.

(2) FIG. 2 is a schematic diagram showing a principle of the high-throughput virtual drug screening system based on molecular fingerprints and deep learning of the present invention.

(3) FIG. 3 is a schematic diagram showing a screening process of the high-throughput virtual drug screening system based on molecular fingerprints and deep learning of the present invention.

(4) FIG. 4 is a diagram showing a virtual screening result of the high-throughput virtual drug screening system based on molecular fingerprints and deep learning of the present invention.

DETAILED DESCRIPTION OF THE EMBODIMENTS

(5) The present invention proposes a high-throughput virtual drug screening system based on molecular fingerprints and deep learning, which comprises a deep-learning model online-modeling subsystem and an online virtual-screening subsystem, and automates the entire process from the construction and training of deep-learning models to the virtual screening based on the models.

(6) The deep-learning model online-modeling subsystem mainly comprises an online-modeling module and a model-result module.

(7) Online-modeling module:

(8) (1) Selecting the type of model, including a qualitative classification model and a quantitative regression model.

(9) (2) Via a data preparation module, selecting a drug target for which the model is to be constructed based on the above type of model. Currently, the classification model is applicable to 1,251 drug targets, and the regression model is applicable to 1,814 drug targets. Users may also upload their own data in SDF format.

(10) (3) Selecting molecular fingerprints. The system supports twelve types of molecular fingerprint methods: CDKFP, ExtFP, EStateFP, GraphFP, MACCSFP, PubchemFP, SubFP, SubFPC, KRFP, KRFPC, AP2D, and APC2D.

(11) (4) Via a parameter selecting module, the user can set the parameters listed in the following table as needed.

(12) TABLE-US-00001

Parameter             Value
Learning rate         0.1, 0.01, 0.001
Epochs                30, 50, 100, 200
Batch size            16, 32, 64, 128, 256
Hidden layers         1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Number of neurons     10, 50, 100, 200, 500, 1000
Activation function   ReLU, Sigmoid, Tanh
Dropout               0, 10%, 20%, 50%
Loss function         MSELoss, cross_entropy
Output function       self or sigmoid
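The four selections above (model type, drug target, fingerprint, and parameters) can be pictured as a single request that is checked against the offered values before a model is built. The following is a minimal sketch only: the class name ModelingRequest, the field names, the default values, and the target identifier "T0001" are hypothetical and do not come from the described system; only the allowed values are taken from the fingerprint list and the parameter table.

```python
from dataclasses import dataclass
from typing import List

# Fingerprint names as listed in the description.
FINGERPRINTS = {
    "CDKFP", "ExtFP", "EStateFP", "GraphFP", "MACCSFP", "PubchemFP",
    "SubFP", "SubFPC", "KRFP", "KRFPC", "AP2D", "APC2D",
}

# Allowed values copied from the parameter table (dropout given as fractions).
ALLOWED = {
    "learning_rate": {0.1, 0.01, 0.001},
    "epochs": {30, 50, 100, 200},
    "batch_size": {16, 32, 64, 128, 256},
    "hidden_layers": set(range(1, 11)),
    "neurons": {10, 50, 100, 200, 500, 1000},
    "activation": {"ReLU", "Sigmoid", "Tanh"},
    "dropout": {0.0, 0.1, 0.2, 0.5},
    "loss": {"MSELoss", "cross_entropy"},
    "output": {"self", "sigmoid"},
}


@dataclass(frozen=True)
class ModelingRequest:
    """Hypothetical container for one modeling job; defaults are assumptions."""
    model_type: str            # "classification" or "regression"
    target_id: str             # hypothetical drug-target identifier
    fingerprint: str
    learning_rate: float = 0.001
    epochs: int = 100
    batch_size: int = 64
    hidden_layers: int = 3
    neurons: int = 200
    activation: str = "ReLU"
    dropout: float = 0.2
    loss: str = "cross_entropy"
    output: str = "sigmoid"

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the request is valid."""
        errors = []
        if self.model_type not in ("classification", "regression"):
            errors.append(f"unknown model type {self.model_type!r}")
        if self.fingerprint not in FINGERPRINTS:
            errors.append(f"unknown fingerprint {self.fingerprint!r}")
        for name, allowed in ALLOWED.items():
            value = getattr(self, name)
            if value not in allowed:
                errors.append(f"{name}={value!r} is not an offered value")
        return errors


request = ModelingRequest("regression", "T0001", "MACCSFP",
                          loss="MSELoss", output="self")
print(request.validate())  # an empty list when every selection is allowed
```

Collecting all problems in a list, rather than raising on the first one, would let an online form report every invalid selection at once.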

(13) Model-Result Module:

(14) The model-result module includes a model list and an individual-model detailed-information module. The model list is configured to indicate information of all models submitted by the current user, including serial numbers of the models, data sources, serial numbers of drug targets, types of models, creation times and completion times of the models, and status of the models. The individual-model detailed-information module is configured to indicate the parameters of the model and the performance information of the model, and to indicate how the model performance indices change during training. For a classification model, it indicates changes in loss, Accuracy, Recall, Precision, F1 score, Matthews correlation coefficient (MCC), and AUC value. For a regression model, it indicates changes in loss, coefficient of determination (R²), mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE).
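For reference, the performance indices named above follow standard definitions. The sketch below computes them in plain Python under stated assumptions: the function names are hypothetical, class labels are assumed to be 0 or 1, and zero denominators are mapped to 0.0 as a convention of this sketch. It is not the system's own implementation, and loss and AUC are omitted here.

```python
import math
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int],
                           y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, Recall, Precision, F1 and MCC from binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num: float, den: float) -> float:
        # Sketch convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "recall": recall,
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


def regression_metrics(y_true: Sequence[float],
                       y_pred: Sequence[float]) -> Dict[str, float]:
    """R2, MSE, RMSE and MAE for paired observed and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        "r2": 1.0 - ss_res / ss_tot if ss_tot else 0.0,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


# Made-up labels and values, for illustration only.
print(classification_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
print(regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]))
```

Evaluating these functions after each training epoch and storing the results would give the per-epoch curves that the detailed-information module is described as showing.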

(15) The online virtual-screening subsystem mainly comprises an online-screening module and a screening-result module.

(16) Online-Screening Module:

(17) (1) Selecting a screening model, which may be done by directly entering the ID of the model or clicking within the model list.

(18) (2) Selecting a screening library, which may be selected from existing compound libraries, or from compound libraries uploaded by the user. The system has 12 built-in screening libraries, containing over 300,000 compounds from natural product libraries, drug libraries, covalent-binding compound libraries, protein-protein interaction small-molecule libraries, ion-channel compound libraries, and synthetic compound libraries.
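The description states SDF as the format for user-uploaded modeling data; assuming, for illustration only, that user-uploaded screening libraries use the same format, the sketch below shows how such a file could be split into compound records. The function names, the sample field names (ID, pIC50), and the sample values are hypothetical, and the molecule blocks in the sample are empty placeholders. Only the record title and the data items are read; the atom and bond blocks are not interpreted.

```python
import re
from typing import Dict, List

# Data-item header line in an SD file, for example "> <pIC50>".
FIELD_HEADER = re.compile(r"^>.*<([^<>]+)>")


def parse_record(lines: List[str]) -> Dict[str, str]:
    """Extract the title line and the data items of one SDF record."""
    record = {"title": lines[0].strip()}
    field = None
    for line in lines[1:]:
        match = FIELD_HEADER.match(line)
        if match:
            field = match.group(1)
            record[field] = ""
        elif field is not None:
            if line.strip() == "":
                field = None  # a blank line closes the current data item
            else:
                record[field] = (record[field] + "\n" + line.strip()).strip()
    return record


def parse_sdf(text: str) -> List[Dict[str, str]]:
    """Split SDF text on the '$$$$' record terminator and parse each record."""
    records = []
    current: List[str] = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if current:
                records.append(parse_record(current))
            current = []
        else:
            current.append(line)
    return records  # text after the last terminator is ignored


# Made-up two-compound sample with empty placeholder molecule blocks.
SAMPLE = "\n".join([
    "mol-1", "  placeholder", "",
    "  0  0  0  0  0  0  0  0  0  0999 V2000", "M  END",
    "> <ID>", "CPD-001", "", "> <pIC50>", "6.2", "", "$$$$",
    "mol-2", "  placeholder", "",
    "  0  0  0  0  0  0  0  0  0  0999 V2000", "M  END",
    "> <ID>", "CPD-002", "", "> <pIC50>", "7.9", "", "$$$$", "",
])

for rec in parse_sdf(SAMPLE):
    print(rec["title"], rec["ID"], rec["pIC50"])
```

A production system would more likely rely on a cheminformatics toolkit to read the structures themselves; this sketch only illustrates the record layout.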

(19) Screening-Result Module:

(20) The screening-result module includes a screening list and screening detailed information. The screening list is configured to indicate the ID of the screening model, a name of the screening library, features of the model, and starting time and ending time of the screening. The screening detailed information includes scores and serial numbers of selected compounds. For a classification model, the scores range from 0 to 1 and represent the probability of having activity. For a regression model, pIC50, pKi, and pKd values are presented, wherein a compound with higher values is predicted to have higher activity.
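Both score types are ordered the same way: a higher score indicates a more promising compound. pIC50, pKi, and pKd are the negative base-10 logarithms of the corresponding concentration in molar units, so a concentration of 100 nM corresponds to a value of 7. The sketch below illustrates the conversion and the ranking of screening results; the function names, compound identifiers, and scores are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def p_value_from_nm(concentration_nm: float) -> float:
    """Negative log10 of a molar concentration given in nM (IC50, Ki or Kd)."""
    if concentration_nm <= 0:
        raise ValueError("concentration must be positive")
    # 1 nM = 1e-9 M, so -log10(c_nM * 1e-9) = 9 - log10(c_nM).
    return 9.0 - math.log10(concentration_nm)


def rank_hits(scores: Dict[str, float], top_n: int) -> List[Tuple[str, float]]:
    """Return the top_n compounds ordered from highest to lowest score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Made-up classification scores (probability of activity), illustration only.
probabilities = {"CPD-001": 0.12, "CPD-002": 0.91, "CPD-003": 0.55}
print(rank_hits(probabilities, top_n=2))
print(p_value_from_nm(100.0))
```

Because both model types use "higher is better", the same ranking function could serve classification probabilities and predicted pIC50, pKi, or pKd values.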

(21) Virtual screening performance was tested on 966 drug targets using the AUC value and five molecular fingerprint methods. The results are shown in FIG. 4; the mean AUC was 0.86, indicating excellent virtual screening performance.
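The AUC value can be read as the probability that a randomly chosen active compound receives a higher score than a randomly chosen inactive compound. The sketch below computes it by direct pairwise comparison and averages it over targets. The function names, labels, scores, and target names are hypothetical and made up for illustration; they are not the data behind FIG. 4.

```python
from typing import Dict, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (active, inactive) pairs ranked correctly."""
    actives = [s for l, s in zip(labels, scores) if l == 1]
    inactives = [s for l, s in zip(labels, scores) if l == 0]
    if not actives or not inactives:
        raise ValueError("AUC needs at least one active and one inactive")
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(actives) * len(inactives))


def mean_auc(per_target: Dict[str, float]) -> float:
    """Average the per-target AUC values."""
    return sum(per_target.values()) / len(per_target)


# Made-up example with two hypothetical targets.
per_target = {
    "target_A": roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]),
    "target_B": roc_auc([1, 0, 0], [0.8, 0.3, 0.2]),
}
print(per_target, mean_auc(per_target))
```

The pairwise loop is quadratic in the number of compounds; it is chosen here for clarity, and a rank-based computation would be preferable for large libraries.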

Advantages of the Present Invention

(22) (1) A deep neural network is adopted as the virtual screening engine in the present invention, and the deep learning models achieve high virtual screening accuracy (a mean AUC of 0.86 across 966 drug targets, as shown in FIG. 4).

(23) (2) The compound activity database ChEMBL is adopted in the present invention. After data cleaning, the system is applicable to 1,814 drug targets. Users may also upload their own data.

(24) (3) The present invention rapidly realizes a high-throughput virtual drug screening process: the entire workflow, from modeling to virtual screening, can be finished in a few minutes.

(25) (4) The present invention enables both qualitative prediction and quantitative prediction. The system is capable of constructing classification models and regression models to qualitatively and quantitatively predict the activity of compounds.
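The difference between the two model types can be illustrated by the output function applied to a fingerprint bit vector: a sigmoid output yields a probability between 0 and 1 for classification, while the "self" (identity) output yields an unbounded value for regression. The following is a minimal sketch with randomly initialized, untrained weights; the function names, the weight initialization, the ReLU hidden activation, and the 8-bit input are assumptions made for illustration, and dropout is omitted because it applies only during training.

```python
import math
import random
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def make_mlp(n_inputs: int, hidden_layers: int, neurons: int,
             seed: int = 0) -> List[Layer]:
    """Build an untrained fully connected network with one output unit."""
    rng = random.Random(seed)
    sizes = [n_inputs] + [neurons] * hidden_layers + [1]
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers: List[Layer], bits: Sequence[int],
            output: str = "sigmoid") -> float:
    """Forward pass: ReLU on hidden layers, then the chosen output function."""
    activations = [float(b) for b in bits]
    for index, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activations)) + b
             for row, b in zip(weights, biases)]
        is_last = index == len(layers) - 1
        activations = z if is_last else [max(0.0, v) for v in z]
    raw = activations[0]
    if output == "sigmoid":   # classification: probability of activity
        return 1.0 / (1.0 + math.exp(-raw))
    if output == "self":      # regression: identity output
        return raw
    raise ValueError(f"unknown output function {output!r}")


fingerprint = [1, 0, 1, 1, 0, 0, 1, 0]  # toy 8-bit vector, not a real fingerprint
network = make_mlp(n_inputs=len(fingerprint), hidden_layers=2, neurons=10)
print(forward(network, fingerprint, "sigmoid"))
print(forward(network, fingerprint, "self"))
```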

(26) The above description only illustrates the embodiments of the present invention, which are specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that, for a person of ordinary skill in the art, without departing from the concept of the present invention, several modifications and improvements can also be made, which all fall within the scope of the present invention. Therefore, the scope of the invention shall be subject to the appended claims.